Building an Append-only Log From Scratch


Append-only log, sometimes called write ahead log, is a fundamental building block in all modem databases. It’s used to persist mutation commands for recovery purposes. A command to change the database’s state will first be recorded in such a log before applying to the database. Should the database crash and lose updates, its state can be restored by applying the commands since the most recent snapshot backup. The log’s append-only nature is partly due to functional design — because we only ever need to add new commands to the end — and partly due to performance need — sequential file access is orders of magnitude faster than random file access.

While there are too many blog posts about this already, I hope to look at it from an implementation point of view. Let’s see how we could build an append-only log from scratch and some of the practical considerations in real production systems. I have the full implementation on Github [1]. It’s implemented in Golang. Feel free to check out the repo and play around with the code.

Record Format

First and foremost, we need to define our record structure in the log. Without loss of generality, our record is just a byte array. We need to store the length as well to aid reading. To detect data corruption, which happens in real life, we should introduce a checksum field.

type Record struct {
len uint32
data []byte
crc uint32

The next thing is to marshal the record into bytes so that we can write out to the file. The len occupies 4 bytes, the data occupies len bytes, and the crc occupies the final 4 bytes. It’s also straightforward to transform the marshaled record back to the structure. We read in the first 4 bytes to find out the len, and read in the next len bytes for the data. In the end, we verify that the len and data together yield the same checksum as the stored crc.

Upon start, the program needs to open a dedicated log file. It can either create a new one or open an existing one in append mode. The append-only log has an AppendRecord method, by which callers can submit log records.

Durability vs Throughput

Once a log record is passed in through the AppendRecord method, we could immediately write it to the log file. However, real world systems typically have an application level buffer to batch the log records, and only flush them to file when enough have accumulated. This raises a question of durability. If the server crashes before the log records are flushed to disk, updates will be lost. But if we flush too often, the throughput will be limited. It’s a trade-off that often lends itself in system configurations.

The possibility of losing updates may sound scary. But in fact, even if we immediately write every record to file, there are still chances of losing updates because the operating system often caches data in memory before actually writing to disk. The caching time in a modem OS is around 30 seconds. Unless we explicitly invoke fsync on that file descriptor, upon which the OS has to flush the data to disk. Calling fsync for every record is extremely expensive. The throughput could be thousands of times worse. I did a quick benchmark using my laptop. The diff is indeed astounding.

| Test           | Iterations | Cost           |
| BenchmarkSync | 1000000000 | 0.363 ns/op |
| BenchmarkAsync | 1000000000 | 0.000796 ns/op |

Even if we’re willing to pay that performance penalty, our updates are still not 100% safe. The disk drivers often have their own layer of cache and they don’t usually offer a control interface to applications. Thus, there is no such thing as absolutely safe. Even if we use a custom OS that surfaces a control interface to force sync in the disk driver, the disk could run at fault. This is not to say we just don’t care since there is no guarantee, but instead we should evaluate our use cases and analyze what’s at stake and how reliable we need the database to be against the remote chance of failure. Real world systems often end up doing fsync every N times to maintain a balance between durability guarantee and throughput.

Log File Structure

One other thing to note is that so far we just keep writing records consecutively in the log. In reality, we’d want to write in a way that respects page boundary. File content is loaded into memory in chunks for read/write. And paging is the canonical approach in many operating systems. Aligning with the page boundary enables the OS to better operate the file. Page size should be a configurable parameter in our append-only log for portability. If there is not enough space at the end of a page to even store the len, just don’t bother. Pad the remaining bytes with 0 and start a new page. This avoids the awkward situation that we need to keep two pages in memory just to read an integer. Now, you may wonder what if a record is longer than a page size. That’s fine. We can go across pages. A smart trick LevelDB uses is to have a record type that designates whether the record in the page is a FULL record, a FIRST record, a MIDDLE record, or a LAST record. I have another post that talks about many other features in LevelDB [2].

type Record struct {
type uint8 // FULL, FIRST, MIDDLE, LAST
len uint32
data []byte
crc uint32

When a record fits in a page, its type is always FULL. When we need to go across pages, we break the record into segments, each represented by a standalone record. We can then use the record type to denote the start (FIRST), the end (LAST), and the parts in between (MIDDLE).

Log Cleanup

One last thing to mention is that we don’t want the log to go on indefinitely. That’s just waste of disk space. We need to truncate the log when we know the earlier part is no longer needed. In practice, we don’t go back and modify the existing log. We just start a new log file after a new snapshot backup. Then we can discard the old log files.




Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store