Picodata

How are transactions synced to disk?

Details of write-ahead log buffering

Question

What is the size of the Tarantool write-ahead log buffer, and how often is it synced? Is there a specific size, and can it be tuned depending on the disk type (SSD or HDD)?

Answer

Tarantool doesn't fsync transactions to disk by default, since the default wal_mode setting is “write”, which means “pass the data to the file system”. Memtx snapshot files are additionally marked with POSIX_FADV_DONTNEED: this advises the operating system that the data will not be read again soon, so it does not need to stay in the page cache. Since xlog files are relayed to replicas, the flag is not set for them.
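To make the snapshot hint concrete, here is a minimal Python sketch of the underlying system call. It is not Tarantool's code: the function name write_snapshot_like is hypothetical, and the explicit fsync before the hint is our assumption (the kernel can only drop pages that are already clean). os.posix_fadvise is available on Linux and most other Unix systems, so the call is guarded.

```python
import os
import tempfile


def write_snapshot_like(path: str, payload: bytes) -> int:
    """Write payload to path, then hint that its cached pages are not needed.

    Hypothetical helper for illustration only; returns the bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(payload):
            # os.write may write fewer bytes than requested, so loop.
            written += os.write(fd, payload[written:])
        # Assumption: flush first, since dirty pages cannot be evicted.
        os.fsync(fd)
        if hasattr(os, "posix_fadvise"):
            # offset=0 and length=0 together mean "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return written
    finally:
        os.close(fd)
```

The hint is advisory: the kernel is free to ignore it, and the file contents are the same either way.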

If wal_mode is “fsync”, the log file is opened with the O_SYNC flag. In a nutshell, this instructs the operating system to complete each write only after the data has reached the disk. In our benchmarks, this mode is ~2-3 times slower than “write” in a typical workload.
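The difference between the two modes comes down to one flag passed to open(). The sketch below illustrates this on a POSIX system; open_log and its sync parameter are hypothetical names chosen for the example, not Tarantool's API.

```python
import os
import tempfile


def open_log(path: str, sync: bool) -> int:
    """Open an append-only log file; returns a raw file descriptor.

    sync=False: write() returns once the data is in the OS page cache.
    sync=True:  the file is opened with O_SYNC, so each write() returns
                only after the data has been handed to the storage device.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    if sync:
        flags |= os.O_SYNC
    return os.open(path, flags, 0o644)
```

Usage is the same in both cases: call os.write(fd, data) and later os.close(fd). Only the latency of each write changes, which is where the slowdown in synchronous mode comes from.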

Tarantool transparently compresses all data that is written to disk. It also uses a group commit feature to combine all transactions committed in the current event loop iteration into a single batch. The batch is compressed, checksummed and passed to the operating system in a single write.
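The compress, checksum and single-write sequence can be sketched as follows. This is only an illustration of the idea, not the xlog file format: the header layout is invented for the example, and zlib compression and CRC32 from the Python standard library stand in for whatever codec and checksum Tarantool actually uses.

```python
import struct
import zlib

# Assumed header for this sketch: payload length and CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_batch(transactions: list[bytes]) -> bytes:
    """Pack all transactions into one blob suitable for a single write()."""
    # Length-prefix each transaction so the batch can be split again.
    body = b"".join(struct.pack("<I", len(tx)) + tx for tx in transactions)
    payload = zlib.compress(body)
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_batch(blob: bytes) -> list[bytes]:
    """Verify the checksum and unpack the transactions of one batch."""
    length, checksum = HEADER.unpack_from(blob, 0)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != checksum:
        raise ValueError("corrupt or truncated batch")
    body = zlib.decompress(payload)
    transactions, pos = [], 0
    while pos < len(body):
        (size,) = struct.unpack_from("<I", body, pos)
        pos += 4
        transactions.append(body[pos:pos + size])
        pos += size
    return transactions
```

Because the whole batch goes to the operating system in one call, the per-write overhead is shared by every transaction in it, and the checksum lets a reader detect a batch that was only partially written.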

The buffer size is selected automatically depending on the intensity of write workload, to ensure as many transactions as possible are written to disk in a single batch. There are two built-in constants that configure this algorithm:

  • XLOG_TX_AUTOCOMMIT_THRESHOLD, set to 128K - if the current size of transactions in the buffer reaches this limit, the buffer is flushed to disk. A single transaction can be larger than 128K.
  • XLOG_TX_COMPRESS_THRESHOLD, set to 2K - the buffer is compressed before flushing if it is at least that big. For smaller buffers, compression costs CPU time but doesn't yield sizable gains.
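The interplay of the two constants can be sketched as a small buffer class. The constant names and values come from the list above; everything else (the WriteBuffer class, its sink callback, and the use of zlib) is a hypothetical simplification, not Tarantool's implementation.

```python
import zlib

XLOG_TX_AUTOCOMMIT_THRESHOLD = 128 * 1024  # flush once this much is buffered
XLOG_TX_COMPRESS_THRESHOLD = 2 * 1024      # compress only buffers this big


class WriteBuffer:
    """Accumulates transactions and flushes them in batches."""

    def __init__(self, sink):
        # sink is a callable taking (data: bytes, compressed: bool);
        # in a real system it would perform the single write to the log.
        self.sink = sink
        self.pending = bytearray()

    def add(self, tx: bytes) -> None:
        # A transaction is never split, so one large transaction may push
        # the buffer well past the autocommit threshold.
        self.pending += tx
        if len(self.pending) >= XLOG_TX_AUTOCOMMIT_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        """Hand the buffered data to the sink, compressing if worthwhile."""
        if not self.pending:
            return
        data = bytes(self.pending)
        self.pending.clear()
        if len(data) >= XLOG_TX_COMPRESS_THRESHOLD:
            self.sink(zlib.compress(data), True)
        else:
            self.sink(data, False)
```

In this sketch the caller would invoke flush() at the end of each event loop iteration, so under light load batches are small and uncompressed, while under heavy load they grow until the 128K limit forces a flush.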
