Finally upgraded the blog! It will be located at https://www.nerdondon.com now. Gonna keep this around cuz I’m sentimental like that.

Prelude

These are just some of my notes and references on Bloom filters while doing research for building an LSM tree based database.

Notes

  • A Bloom filter is a probabilistic data structure that allows checks for set membership. These checks can return false positives but will never produce false negatives. More clearly: “if a negative match is returned, the element is guaranteed not to be a member of the set” (Alex Petrov - Database Internals). A small sketch of these properties follows this list.

  • Used to lower I/O costs incurred by the tiered structure of LSM-trees
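
As a rough illustration of those properties, here is a minimal (and deliberately unoptimized) Bloom filter sketch in Rust. The sizing, hashing scheme, and names are my own assumptions rather than anything from a particular database implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A minimal Bloom filter over string keys: `insert` sets k bit positions and
/// `may_contain` reports true only if all k positions are set, so a `false`
/// answer is always correct while a `true` answer may be a false positive.
struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        BloomFilter {
            bits: vec![false; num_bits],
            num_hashes,
        }
    }

    // Derive the i-th bit position by hashing the key together with the index.
    fn bit_index(&self, key: &str, i: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        (key, i).hash(&mut hasher);
        (hasher.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, key: &str) {
        for i in 0..self.num_hashes {
            let idx = self.bit_index(key, i);
            self.bits[idx] = true;
        }
    }

    fn may_contain(&self, key: &str) -> bool {
        (0..self.num_hashes).all(|i| self.bits[self.bit_index(key, i)])
    }
}
```

A key that was inserted will always return true from may_contain; a key that was never inserted may still occasionally return true (a false positive), but a false answer is definitive.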

Useful references

Prelude

These are just some of my notes and references from implementing a skip list in Rust. Creating a skip list implementation with concurrency features is my personal intro project to Rust, so there will also be some notes on lock-free and wait-free techniques. The overarching goal is to utilize this skip list as the in-memory component of an LSM-tree based storage engine that I will maybe, hopefully, aspirationally, try to make. LevelDB in Rust, if you will.

Notes

  • Used as the in-memory component of LevelDB’s log structured merge tree (LSM tree) implementation

  • Skip list vs self-balancing search trees (e.g. Red-black trees)

    • A skip list is more amenable to concurrency because fewer nodes need to be updated during an insert

    • On update, a self-balancing tree may require multiple ancestors to be changed when balancing operations occur (e.g. rotations in a Red-black tree)

  • LevelDB doesn’t appear to support concurrent writes? The source appears to require an external locking mechanism for writes.

  • When used as the in-memory component of an LSM tree, the skip list should support ordering entries in reverse chronological order because the LSM tree is an append-only structure. This can be done by having a synthetic key with a timestamp/revision number or via reverse iteration with back pointers. A sketch of the synthetic key approach follows this list.
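
To make the reverse-chronological ordering concrete, here is a small Rust sketch of the synthetic key approach. The InternalKey name and layout are illustrative assumptions (loosely inspired by how LevelDB tags entries with a sequence number), not a prescribed format.

```rust
use std::cmp::Ordering;

/// A hypothetical internal key for the memtable: entries sort ascending by
/// user key and then newest-first by sequence number, so a forward scan finds
/// the most recent version of a key before any older (shadowed) versions.
#[derive(Debug, PartialEq, Eq)]
struct InternalKey {
    user_key: Vec<u8>,
    sequence: u64, // monotonically increasing write counter
}

impl Ord for InternalKey {
    fn cmp(&self, other: &Self) -> Ordering {
        // Ascending by user key, then descending by sequence number.
        self.user_key
            .cmp(&other.user_key)
            .then_with(|| other.sequence.cmp(&self.sequence))
    }
}

impl PartialOrd for InternalKey {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```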

Some concurrency stuff

Useful references

Notes

  • Log-structured merge-trees (LSM trees) are a data structure used for write-heavy workloads.

  • LSM trees are the basis of many NoSQL/NewSQL databases e.g. Cassandra, LevelDB, RocksDB, etc.

  • Because of their optimization towards write-heavy workloads, LSM trees also lend themselves to time series databases like InfluxDB.

  • Whereas B+ trees (used in such storage engines as InnoDB) must maintain their structure via random writes, LSM trees gain their write speed by being structured in such a way as to take advantage of sequential writes.

  • To lend itself to sequential writes, the LSM tree is an append-only structure.

  • At a high level, an LSM tree is composed of multiple levels/tiers, where each level is its own data structure backed by a storage medium that gets progressively slower at each lower level.

    • For example: tier 0 of an LSM tree can be backed by RAM, tier 1 by SSD, and tier 2 by spinning drives.
    • This tiered storage architecture also works nicely with the cloud-native trend, where each tier of storage can be concretely backed by a different cloud offering depending on application needs.
    • Data is first written to the first tier and is asynchronously pushed down to lower tiers as the first tier fills. Reads check the tiers in the same top-down order.

Structure

  • Although referred to as a data structure, an LSM tree can more accurately be described as a technique for composing multiple data structures to achieve desired performance characteristics.

  • Despite the LSM paper saying that “[a]n LSM-tree is composed of two or more tree-like component data structures”, the components that make up an LSM tree do not actually have to be tree-like at all.

    • They can be any sorted data structure, which allows you to tune performance characteristics of the LSM tree tiers to your application’s requirements.
    • As an example, LevelDB uses a skiplist in its first tier. The original paper suggests a (2,3) tree or an AVL tree in the first tier and a B tree in the second.
    • Fun aside: Redis utilizes skiplists as the backing data structure for its sorted set data type
  • The canonical example of an LSM tree is a two-tiered structure composed of an in-memory tier and an on-disk tier. The in-memory tier is usually referred to as a memtable. Modern on-disk component examples utilize a structure inspired by Google’s BigTable called an SSTable (Sorted Strings Table).

  • An interesting point from John Pradeep Vincent on runtimes with GC - “For java implementations, the memtable is usually stored off-heap (direct memory) to avoid GC load.”

Memtable

  • As mentioned above, this is the primary in-memory component of an LSM tree.

  • Since this first point of contact for writes is in-memory, a storage engine would still utilize a write ahead log (WAL) to ensure durability.

  • Once this tier is filled to a certain size threshold, its contents are flushed to the next tier of the LSM tree (a rough sketch of this follows).
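
A rough sketch of that flush trigger, with made-up names and a BTreeMap standing in for the skip list described earlier:

```rust
use std::collections::BTreeMap;

/// A toy memtable: a sorted map plus a running estimate of its size in bytes.
struct Memtable {
    entries: BTreeMap<Vec<u8>, Vec<u8>>,
    approximate_size: usize,
    flush_threshold: usize,
}

impl Memtable {
    fn new(flush_threshold: usize) -> Self {
        Memtable {
            entries: BTreeMap::new(),
            approximate_size: 0,
            flush_threshold,
        }
    }

    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.approximate_size += key.len() + value.len();
        self.entries.insert(key, value);
    }

    /// Once the size estimate crosses the threshold, the caller should freeze
    /// this memtable and flush its (already sorted) contents to the next tier.
    fn should_flush(&self) -> bool {
        self.approximate_size >= self.flush_threshold
    }
}
```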

SSTable (Sorted Strings Table)

  • SSTables are just one implementation for the on-disk tier of an LSM tree but are pretty popular in existing literature, so here’s some SSTable specific notes.

  • SSTables, as their name implies, store data sorted by key. An SSTable is a file (or a set of files) that generally has two parts: an index block and a data block. A sketch of this layout follows this list.

    • The index block stores mappings from keys to offsets within the data block
  • A new SSTable (file or set of files) is created when the contents of the memtable are flushed or when a compaction operation occurs.

  • Because these SSTables are effectively dumps of the memtable, there can be overlapping key ranges in different SSTable files. In order to improve read performance, a compaction operation is occasionally conducted.

  • As compaction operations occur and new SSTables are continually created from memtable flushes, a layered structure begins to manifest from the SSTables, where each layer holds more recent data than, and shadows, the layers below it. Keeping this hierarchy shallow is important to maintaining the performance of the LSM tree, and hence compaction is a key component of the LSM tree.
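
Here is a small sketch of the index block/data block split mentioned above. The in-memory representation and names are assumptions made for illustration; real formats add block framing, compression, and checksums.

```rust
use std::collections::BTreeMap;

/// Sketch of an SSTable's two parts: an index block that maps keys to byte
/// offsets, and a data block holding the sorted records.
struct SsTable {
    /// key -> offset of that key's record within `data`
    index: BTreeMap<Vec<u8>, u64>,
    /// concatenated, key-sorted records
    data: Vec<u8>,
}

impl SsTable {
    /// Look up the byte offset at which a key's record starts. A real reader
    /// would seek to this offset in the file and decode the record found there.
    fn offset_for(&self, key: &[u8]) -> Option<u64> {
        self.index.get(key).copied()
    }
}
```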

Operations

Read

  • With the above structure of a memtable and SSTables, the worst-case read path would be: memtable -> a linear scan (in reverse chronological order) of each level of SSTables. A sketch of this read path follows this list.

  • To augment the speed of reads, there are a couple of data structures that are commonly introduced into the read path. Two of them are an extra index on top of the SSTables (different from the index specific to a single SSTable) and a Bloom filter.

    • John Pradeep Vincent puts it best for a Bloom filter: because false positives are possible, “the presence of a bloom filter improves the read performance for keys that are missing in the system but for the keys that are present in the system, the read is still expensive as we have to look into the memtable, index and the SSTable.”
  • In terms of general database tricks to make read and write paths faster, memory mapping the disk files or using a buffer pool are also techniques that can be utilized
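
To tie the read path together, here is a sketch of the lookup order under some heavy assumptions: BTreeMaps stand in for both the memtable and the on-disk SSTables, and the Bloom filter check is only noted in a comment.

```rust
use std::collections::BTreeMap;

// Stand-ins for the real components; an actual engine would have a skip list
// memtable, on-disk SSTable files, and a real Bloom filter per table.
type Memtable = BTreeMap<Vec<u8>, Vec<u8>>;

struct SsTable {
    entries: BTreeMap<Vec<u8>, Vec<u8>>,
}

struct LsmTree {
    memtable: Memtable,
    /// Tables ordered newest to oldest; newer entries shadow older ones.
    tables: Vec<SsTable>,
}

impl LsmTree {
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        // 1. Check the in-memory component first.
        if let Some(value) = self.memtable.get(key) {
            return Some(value);
        }
        // 2. Fall back to the on-disk tables, newest first, so the most recent
        //    version of a key wins. A Bloom filter check per table would go
        //    here to skip tables that definitely do not contain the key.
        for table in &self.tables {
            if let Some(value) = table.entries.get(key) {
                return Some(value);
            }
        }
        None
    }
}
```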

Write

  • Writes always go to the in-memory component first and can trigger a flush of the memtable contents to disk.

  • With the auxiliary data structures mentioned above, writes would also need to update the WAL, the Bloom filter, and the index for SSTables.

  • Again, the LSM tree is an append-only structure, so updates will write a new entry that shadows the previous entry and deletions will write a tombstone indicating item deletion. A sketch of this write path follows this list.
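
A sketch of that write path, with a Vec standing in for the WAL and a BTreeMap for the memtable (the names here are illustrative assumptions):

```rust
use std::collections::BTreeMap;

/// Values are either real data or a tombstone marking a deletion; because the
/// tree is append-only, a delete is just another write that shadows older
/// entries until compaction removes both.
#[derive(Clone)]
enum Value {
    Data(Vec<u8>),
    Tombstone,
}

struct Engine {
    wal: Vec<(Vec<u8>, Value)>,         // stand-in for an on-disk write-ahead log
    memtable: BTreeMap<Vec<u8>, Value>, // stand-in for the skip list memtable
}

impl Engine {
    fn apply(&mut self, key: Vec<u8>, value: Value) {
        // 1. Record the write in the WAL for durability.
        self.wal.push((key.clone(), value.clone()));
        // 2. Apply it to the memtable; newer entries shadow older ones.
        self.memtable.insert(key, value);
        // 3. (Not shown) update the Bloom filter and trigger a flush if the
        //    memtable has grown past its size threshold.
    }

    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.apply(key, Value::Data(value));
    }

    fn delete(&mut self, key: Vec<u8>) {
        self.apply(key, Value::Tombstone);
    }
}
```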

Compaction

  • Compaction for an LSM tree is often compared to the garbage collector (GC) featured in many programming languages.

  • Just as with garbage collection, there are myriad strategies that can be implemented for compaction. For example: concurrent/asynchronous vs synchronous, manually triggered vs threshold triggered.

  • Recalling that an LSM tree is an append-only structure, compaction is then critical for cleaning up stale records and deletion tombstones. A merge sketch follows this list.
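
As a toy illustration of that clean-up role, here is a sketch that merges a newer table over an older one. Real compaction strategies operate across many tables and levels, and tombstones can only be safely dropped once no older level could still contain the key.

```rust
use std::collections::BTreeMap;

#[derive(Clone, PartialEq)]
enum Value {
    Data(Vec<u8>),
    Tombstone,
}

/// Merge a newer table over an older one. Newer entries shadow stale versions,
/// and tombstones are dropped from the output along with the data they shadow.
fn compact(
    newer: BTreeMap<Vec<u8>, Value>,
    older: BTreeMap<Vec<u8>, Value>,
) -> BTreeMap<Vec<u8>, Value> {
    let mut merged = older;
    for (key, value) in newer {
        // The newer table's entry for a key wins over the older table's.
        merged.insert(key, value);
    }
    // Deletion markers have done their job once the shadowed data is gone.
    merged.retain(|_, value| *value != Value::Tombstone);
    merged
}
```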

Other topics for expansion

Potential topics that I would like to expand on if I revisit this post or ideas for subsequent posts.

  • Discussion on read and write amplification, comparing and contrasting those values for LSM trees vs B trees

  • Expand on the Niv Dayan video about the knobs to turn to change the scaling properties of an LSM tree implementation

  • Cassandra and BigTable SSTable implementation details

  • Compare against update-in-place data structures (e.g. B+ trees) and copy-on-write (CoW) trees

Useful references

Prelude

Some notes from the Envoy Internals Deep Dive talk by Matt Klein. The talk was given at KubeCon EU 2018. Watch on YouTube here.

Actual Notes :D

Architecture

  • Envoy is designed with an out of process architecture specifically to address the polyglot nature of microservices

  • At its core, Envoy is a byte proxy. This gives it the extensibility to support a multitude of protocols.

  • Created with versatility in mind from the start so that Envoy can act as a service, middle, and edge proxy

  • Hot restart is a key feature

  • One part of Envoy is built as a pipeline with filters as building blocks. Filters are extension points where users can add functionality.

  • The other part of the architecture is where the central management functionality lives. Things like the cluster/service managers or stats engine.

  • The prevailing school of thought for proxies before the 2000s was to serve a single connection per thread. Envoy takes a different approach where it serves multiple connections per thread via an event loop.

  • The “main” thread of Envoy handles the management functionality and worker threads handle the connections and filter pipeline. The key to keeping Envoy simple is the latter part of the previous statement. It means that worker threads do not have to communicate with each other and only have to acquire locks in a small set of cases.

  • Read-Copy-Update (RCU) is used to govern parallelization within Envoy. It is “[d]esigned for read heavy, write infrequent workloads” and helps reduce the need for locking by making the read path lock-free.

  • Envoy keeps a vector of pointers in thread local storage (TLS) that is mirrored across the main and worker threads. RCU is used to update slots in the vector, and messages are posted to worker threads to update information, e.g. route tables. A rough sketch of this pattern follows this list.
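
Envoy itself is written in C++, so purely as an illustration of the pattern described above (each worker caches a thread-local pointer to immutable state, the main thread posts updates, and old versions are freed once no reader holds them), here is a small Rust sketch with made-up names:

```rust
use std::cell::RefCell;
use std::sync::{mpsc, Arc};
use std::thread;

// Illustrative stand-in for something like a route table.
struct RouteTable {
    version: u64,
}

thread_local! {
    // Each worker caches its own pointer to the current config, so reads on
    // the hot path are just a thread-local access with no locking.
    static CURRENT: RefCell<Option<Arc<RouteTable>>> = RefCell::new(None);
}

fn main() {
    let (tx, rx) = mpsc::channel::<Arc<RouteTable>>();

    // Worker thread: between handling connections, it drains update messages
    // posted by the main thread and refreshes its thread-local slot.
    let worker = thread::spawn(move || {
        for update in rx {
            CURRENT.with(|slot| *slot.borrow_mut() = Some(update));
            CURRENT.with(|slot| {
                if let Some(table) = slot.borrow().as_ref() {
                    println!("worker now using route table v{}", table.version);
                }
            });
        }
    });

    // "Main" thread: builds a new immutable config and posts it to workers.
    // Old versions are freed once every worker drops its Arc, which is the
    // RCU-like part of the scheme.
    for version in 1u64..=3 {
        tx.send(Arc::new(RouteTable { version })).unwrap();
    }
    drop(tx);
    worker.join().unwrap();
}
```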

Hot Restarts

  • “Full binary reload without dropping any connections”

  • Allows two processes of Envoy to run at the same time and utilize a shared memory region. Stats and locks are stored there.

  • The two processes communicate through an RPC protocol over Unix domain sockets (UDS) and perform socket passing. After the second Envoy process spins up, it will communicate that it is good to go, and the first process will begin to drain connections.

Stats

  • There’s a two-level caching mechanism for stats: a TLS cache and a central cache on the main thread, with the actual entries stored either in shared memory or in process memory.

Useful references