How to implement a database?

4 min readDec 17, 2022

There is a question on Zhihu: How to implement a database? With the most commonly used thinking in computers to analyze and divide problems, we can consider how to implement a database from two dimensions: logic and physics .

Logical View

From the logical view, we can split the solution to two parts: data model and data organization.

Data Model

We have to define what kind of data model first when considering implementing a database. This is not a problem few years ago, in which period database means relational database, and even Oracle/SQLServer/MySQL/PostgreSQL. But with the increasing of data size and enrichment of use cases, there is no database can fit all.

Nowadays, there are many kinds of data models and their representative products:

the Relational model ( Oracle)
Document model ( MongoDB）
Search Engine（Elasticsearch)
KV model ( Redis)
Wide Column model ( Cassandra)
Graph model (Neo4j)
Time Series model ( InfluxDB)

For your interest, more models and their products can be found in the DB-Engines ranking.

Data Organization

Essentially, database is used for data storage and retrieval. And in a programmer’s view, it means how to organize data in the computer memory hierarchy.

If we have learned courses about operating system or computer architecture, we must know that:

If the memory unit closer to CPU, it will become faster, smaller and more expensive, such as register, cache and memory.
If the memory unit farther to CPU, it will become slower, larger yet cheaper, such as flash, hard drives and tape.

In recent years, some new products have been emerging, such as persistent memory and cloud storage.

Persistent Memory. The representative product is intel optane persistent memory. It lays between main memory and flash storage approximately in the memory hierarchy. But it has not been widely used in the industry yet, because of unclear position. It’s not larger enough than main memory, and not cheaper than the flash storage.
Cloud Storage. The representative product is AWS S3. It’s a kind of object storage with simple interface. AWS has been making it scalable, cheap and enough bandwidth, which leads to a great success in the market. All cloud database have to consider wether to replace the underly storage with S3, since it has been part of the cloud infrastructure.

Physical View

The database could be split to retrieval engine and storage engine, physically. Intuitively, their responsibility are:

Storage Engine: organize data in the secondary storage and load data from it to main memory
Retrieval Engine: parse the use inputs and transform them to reads and writes of storage engine.

Query Engine

Every data model has its most suitable or popular way to query. For relational model, we all know it is SQL. Then in the following context, we use SQL as an example to analyze what components a typical query engine will have.

SQL, just as general-purpose programming language, need have a front-end:

Parser: parse tokens from query sentences and organize them to an query tree
Validator: validate the query tree with schema

But, as a declarative language, SQL has much more flexibilities to optimize its query execution:

Planner: convert meaningful name to meaningless id according to the catalog.
Optimizer: with the help of relational algebra, we can optimize the query execution with rule-based ways or cost-based ways.
Executor: update or retrieve data in the storage engine according to the optimized plan tree.

Tree is a commonly used data structure in the data system. Most data retrieve could be represented as an set of operators applied on a dataset, thus creating an tree structure:

Leaf Nodes: data sets which have different granularities, such as column, row or table.
Intermediate Nodes: all kinds of operators, such as selection, project, join, top, limit and so on.

In a broad sense, big data systems like Hadoop, Spark and Flink also have similar mechanisms. But they either have less operators or more focus on streaming.

Storage Engine

Storage engine is used to organize data in memory and secondary storage.

First, for the secondary storage, we should consider many tradeoff according to the user Scenarios:

TP or AP: using row-oriented encoding or column-based encoding
read/write ratio: using B+ tree(update in-place) or LSM-Tree(update by appending)
security: using encryption or not

Secondly, we should consider how to move data between main memory and secondary storage:

For the memory is much smaller than the secondary storage in size, we can only load part of data of database to the memory. Then it is algorithm for we to decide which data to load and which to evict. That is — Buffer Pool and LRU.
Since the IO is much slower than the CPU and locality of reference for memory access, we should only fetch a chunk of data instead of only the bytes needed every time. That is — block or page.

At last, how the data laid in memory relates both query engine and storage engine:

row-based or column-based. For the latter, we could use SIMD to accelerate the execution.
dense or sparse. It depends on how many null values in our dataset.
isomorphic or heterogeneous. If database should support dynamic type query and nested types.

For now, we only consider how the data is organized in single machine. If our target dataset size or throughput requirements exceed the stand-alone machine, we should turn to the distributed system — connect many machines with network and provide service as a whole.

Using distributed architecture, it will introduce unreliable network and clocks. It is another long storage for how to handle them correctly. For the sake of simplicity, let’s stop here.