[Notes]: LSM Trees and Storage Engines

Notes

Log-structured merge-trees (LSM trees) are a data structure used for write-heavy workloads.
LSM trees are the basis of many NoSQL/NewSQL databases e.g. Cassandra, LevelDB, RocksDB, etc.
Because of their optimization towards write-heavy workloads, LSM trees also lend themselves to time series databases like InfluxDB.
Whereas B+ trees (used in such storage engines as InnoDB) must maintain their structure via random writes, LSM trees gain their write speed by being structured in such a way as to take advantage of sequential writes.
To lend itself to sequential writes, the LSM tree is an append-only structure.
At a high level, an LSM tree is composed of different levels/tiers where each level is a different individual data structure that each have different backing storage formats that are increasingly slower.
- For example: tier 0 of an LSM tree can be backed by RAM, tier 1 by SSD, and tier 2 by spinning drives.
- This tiered storage architecture also works nicely with cloud native trend, where each tier of storage can be concretely backed by a different cloud offering depending on application needs.
- Data is first written to the first tier and data is asynchronously pushed down to lower layers as the first tier fills. Reads are performed similarly.

Although referred to as a data structure, an LSM tree can more accurately be described as a technique for composing multiple data structures to achieve desired performance characteristics.
Despite the LSM paper saying that “[a]n LSM-tree is composed of two or more tree-like component data structures”, the components that make up an LSM tree do not actually have to be tree-like at all.
- They can be any sorted data structure, which allows you to tune performance characteristics of the LSM tree tiers to your application’s requirements.
- As an example, LevelDB uses a skiplist in its first tier. The original paper suggests a (2,3) tree or an AVL tree in the first tier and a B tree in the second.
- Fun aside, Redis utilizes skiplists as the backing data structure for their sorted set data type
The canonical example of an LSM tree is a two-tiered structure composed of an in-memory tier and a on-disk tier. The in-memory tier is usually referred to as a memtable. Modern on-disk component examples utilize a structure inspired by Google’s BigTable called an SSTable (Sorted Strings Table).
An interesting point from John Pradeep Vincent on runtimes with GC - “For java implementations, the memtable is usually stored off-heap (direct memory) to avoid GC load.”

As mentioned above, this is the primary in-memory component of an LSM tree.
Since this first point of contact for writes is in-memory, a storage engine would still utilize a write ahead log (WAL) to ensure durability.
Once this tier is filled to a certain size threshold, its contents are flushed to the next tier of the LSM tree.

SSTables are just one implementation for the on-disk tier of an LSM tree but are pretty popular in existing literature, so here’s some SSTable specific notes.
SSTables, as their name implies, stores data sorted by their key. It is a file (or multiple files) that generally have two parts: an index block and a data block.
- The index block stores mappings from keys to offsets within the data block
A new SSTable (file or set of files) is created when the contents of the memtable are flushed or when a compaction operation occurs.
Because these SStables are effectively dumps of the memtable, there can be overlapping key ranges in different SSTable files. In order to improve read performance, a compaction operation is occasionally conducted.
As compaction operations occur and new SSTables are continually created from memtable flushes, a layered structure begins to manifest from the SSTables where each layer has more recent data and shadows the lower layers. Keeping the layers shallow is important to maintaining the performance of the LSM tree and hence compaction is a key component of the LSM tree.

With the above structure of a memtable and SSTables, the worst-case read path would be: memtable -> a linear scan–in reverse chronological order–of each level of SSTables
To augement the speed of reads, there are a couple of data structures that are commonly introduced into the read path. Two of them being an extra index on top of the SSTables (different from the index file specific to a single SSTable) and a Bloom filter.
- John Pradeep Vincent puts it best for a Bloom filter: due to the false positives being possible “the presence of a bloom filter improves the read performance for keys that are missing in the system but for the keys that are present in the system, the read is still expensive as we have to look into the memtable, index and the SSTable.”
In terms of general database tricks to make read and write paths faster, memory mapping the disk files or using a buffer pool are also techiniques that can be utilized

Writes always go to the in-memory component first and can trigger a flush of the memory file contents to disk.
With the auxiliary data structures mentioned above, writes would also need to upadate the WAL, the Bloom filter, and the index for SSTables.
Again the LSM tree is an append only structure, so updates will write an updated entry that will shadow the previous entry and deletions will write a tombstone indicating item deletion.

Compaction for an LSM tree is often compared to the garbage collector (GC) featured in many programming languages.
Just as with garbage collection, there are myriad strategies that can be implemented for compaction. For example: concurrent/asynchronous vs synchronous, manually triggered vs threshold triggered.
Recalling that an LSM tree is an append-only structure, compaction is then critical for cleaning up stale records and deletion tombstones.