I’ve heard a lot about LevelDB, now I had a chance to skim through the code on a whim, and it really deserve its reputation. If you’re interested in Storage Engine, if you want to use C++ gracefully, if you want to learn how to organize codes, I recommend it strongly. Let alone the authors are Sanjay Ghemawat and Jeff Dean.
If I don’t take some notes after reading it, I’m sure I’ll quickly forget about it, given my poor memory. So I wanted to write something about the beauty of LevelDB, but I didn’t want to follow the usual path of starting with an architectural overview and ending with a module analysis. During the days of reading the code, I was trying to figure out how to start. When I was about to finish reading the code, what struck me most was the subtlety of LevelDB’s data structures: fitting the context, building from scratch, tailoring well, and coding excellent. Then, let’s start the LevelDB series with these little corner pieces.
In this series, I would like to share three classic data structures that are widely used, namely Skip List for reading and writing fast in memory , Bloom Filter for accelerating sstable lookup and LRUCache for caching sstable blocks. This is the first article, Skip List.
LevelDB is an embed KV storage engine, which in essence provides only three interfaces: Put(key, val), Get(key)->val, Delete(key) operations on data items (key, val). In terms of implementation, LevelDB doesn’t delete the data immediately when it receives a delete request, but instead writes a special tag for the key indicating the key doesn’t exist when it is accessed, thus converting Delete to Put and thus reducing the three interfaces to two. After this cutting, all remains is the performance trade-off between Put and Get, and LevelDB’s choice is: Tradeoff some Get
performance for strong Put
performance, and then go back to optimize the Get
.
We know that in the Memory hierarchy, memory access is much faster than that on disk, so LevelDB was designed as:
- Writes (Put): Let all writes fall into memory and then batch them to disk once they reach a threshold.
- Reads (Get): As data accumulates by written over time, only a small size of data in memory and most of the them reside on disk. When a read request comes in, and not hit in memory, you need have to search it on disk.
In order to ensure write performance while optimizing read performance, the in-memory storage structure needs to support both efficient inserts(for write) and lookups(for read).
When I know about LevelDB for the first, I naturally thought that the in-memory structure (memtable) was implemented as a BST(balanced search tree), such as a red-black tree, AVL tree, etc., in which case, it could guarantee that the time complexity of both insertion and lookup is log (n), but then I read the source code and realized that it is an Skip List. The advantage of skip list over balanced tree is that they simplify the implementation greatly while maintaining a good lookup and insert performance.
Besides, the data structure needs to support efficient range scan in order to dump data to disk periodically. In summary, the in-memory data structure (memtable) of LevelDB requires:
- Efficient lookup
- Efficient insertion
- Efficient range scan
Theory
Paper Skip Lists: A Probabilistic Alternative to Balanced Trees was proposed by William Pugh in 1990. As can be seen from the title, the author aimed to design a data structure that could replace balanced trees, as he mentioned in the beginning:
Skip lists are a data structure that can be used in place of balanced trees.
Skip lists use probabilistic balancing rather than strictly enforced balancing and as a result the algorithms for insertion and deletion in skip lists are much simpler and significantly faster than equivalent algorithms for balanced trees.
There is a key word in this passage: Probabilistic Balancing, which will be left untouched for now. In the following part, we will measure SkipList from two perspectives: linked list vs skip list, and skip list vs balanced trees.
From LinkedList to SkipList
Essentially, a SkipList is a LinkedList with some extra pointers. To understand what it means, let’s consider about the way of optimizing the lookup performance of an ordered linked list.
Suppose we have an ordered linked list, which is known to have an O (n) complexity of lookup and insertion. We cannot perform binary lookups on linked list but on arrays, because we cannot use access data in linked list for O(1) complexity , thus cannot cut off half of candidates size quickly every seek. So how can we improve a linked list to make it supporting binary search?
Construct a map from indexes to nodes? This would allow binary search, but each insertion would cause a whole adjustment of the map, which has an O(n) complexity.
Add pointers so that you can access any other node in each node directly? That would require constructing a fully connected map, which would take up too much space. Besides, each node insertion will still cause O(n) pointers added.
SkipList gives an answer for this question: skip sampling to build indexes, and reduce the density bottom-up. We will explain it in more detail below using a figure from the paper.
As shown above, we initially have an ordered linked list in fig a with a dummy head node, which has a O (N) complexity for lookup. Then, we perform a skip-step sampling and connected the sampled nodes and obtain fig b, which has a O ( N/2 + 1) complexity for lookup. Later, we sample again based the nodes sampled last time, and connect the nodes with pointers to obtain fig c. The lookup complexity then becomes O (N/4 + 2). Thereafter, we repeat this sampling and connecting again and again, until there is only one node left, as shown in fig e. The lookup complexity at this point can be seen to be O (logN) .
We got the cost of a bunch of pointers added. One question raised, how many pointers are there? Let’s consider this problem step by step. As can be seen from the figure, will add the same pointers with sampled nodes each time. And our sampling strategy is based on previous set each time, with N nodes initially, and one node remains finally. Sum them, we will get (N/2 + N/4 + … + 1) = N. That is to say, we add same pointers with the nodes in the linked list to get a skip list.
We hide a question here: what is the insertion time complexity of this structure. For now, we touch the nature of the problem gradually: how should we perform the insertion?
From BalancedTrees to SkipList
In practice, we implement the dictionary or ordered array as BST(binary search trees). During insertion, if the data comes with good randomness in the key space, then the BST can gain good query performance. On the other hand, if the data comes sequentially, the search tree will become severely unbalanced, in which case, both lookup and insertion performance will degrade significantly.
To maintain the performance, we need to adjust the imbalance when we find it, hence we have the AVL tree and the red-black tree. As you know, the rotation strategy of AVL and the color labeling strategy of red-black trees are both too complex to remember. Hence, it is more impossible to write them from scratch in an interview.
The SkipList, on the other hand, uses a very clever transformation that greatly simplifies the implementation of insertion, while guaranteeing the same lookup performance. We cannot make sure that all insertion have good randomness, or balanced, in key space; but we can control the balance of the some other dimensions. For example, the probabilistic balance of the distribution of the number of pointers to each node in the SkipList.
Probabilistic Balance
To make it clearer, let’s go through the structure of a SkipList and the concepts involved. Each node in a SkipList has [1, MaxLevel] pointers, and a node with k pointers is called a level k node; the maximum levels of all nodes is the maximum levels(MaxLevel) in the SkipList. Besides, A SkipList has a dummy header node with MaxLevel pointers.
It becomes a problem to insert a node, if we use the idea introduced in understanding SkipList. For one thing, how many pointers does the inserted node need to have? How can we ensure that the lookup performance does not degrade after the insertion (i.e. maintain a balanced sampling)?
To solve this problem, Pugh has made a clever transformation: decompose the global, static index building into a separate, dynamic procedure. The key point here is to maintain the query performance by keeping the probability distribution of the global node levels in the SkipList. By analyzing the sampling process above, we can find that 50% of the nodes in SkipList are level 1 nodes, 25% of them are level 2 nodes, 12.5% of them are level 3 nodes, and so on. If we can maintain the same probability distribution for the nodes in our constructed SkipList, we can guarantee that it has the same query performance. This seems intuitively fine, and we will not dive into the mathematics behind it, you could can read the paper if you are interested in it.
With such a transformation, the two problems raised above are solved:
- The level of new nodes to insert is independently determined each time by calculating a value, such that the node levels in global satisfy the geometric distribution.
- No additional node adjustment is required for the insertion. We only need to find the position where it should to be placed and then modify its previous and next node pointers.
The time complexity of inserting a node is thus the sum of the time complexity of finding, O(logN), and the complexity of modifying a pointer, O(1) , i.e. also O(logN). The deletion process is similar to insertion, and can be transformed into binary search and modifying pointers , so the complexity is also O(logN).
LevelDB Source Code
After such a long way, we come to the section for the implementation detail of SkipList in LevelDB, which is optimized for being accessed concurrently, and provide the following guarantees:
- Write: Requires a lock on the user side when modifying the SkipList.
- Read: When reading the SkipList (point lookup or range scan), there is no need to add locks, only make sure it should not be destroyed by other threads.
In other words, the user only needs to deal with write-write conflicts and the LevelDB SkipList will handle read-write conflicts.
This is because LevelDB maintains the following invariants:
- SkipList nodes could only be added but not deleted unless the SkipList is destroyed entirely, as the SkipList does not expose a delete interface at all.
- Fields of inserted nodes are all immutable except for the
next
pointer, and only the insertion will change the SkipList.