Miniaturization in a Snail Shell: Implementing a 256KB Miniature File System
This article is primarily “compiled” from Chapter 40 of the book “Operating Systems: Three Easy Pieces,” which is highly recommended for those who are puzzled by operating systems. This filesystem is based on a very small hard disk space, with a focus on data structures and read-write processes, deriving each fundamental aspect from scratch to help you quickly build an intuitive understanding of file systems.
File systems are fundamentally built upon block storage. It is worth noting, however, that some modern distributed file systems, such as JuiceFS, are built on object storage instead. Either way, the essence is the same: data is addressed and exchanged in units of blocks.
We will first explore the data structures of a complete file system on the hard disk, that is, the layout; and then, by going through the processes of opening, closing, reading, and writing, we will combine each submodule to cover the key points of a file system.
Overall Layout
Let’s assume our block size is 4KB, and we have a very small hard disk with only 64 blocks (resulting in a total size of 64 * 4KB = 256KB), which is dedicated solely to the file system. Since hard disks are addressed by blocks, the addressable space ranges from 0 to 63.
Based on this mini hard disk, let’s gradually deduce this minimalist file system together. The primary purpose of a file system is undoubtedly to store user data. To this end, we reserve a section of the disk as a Data Region. Let’s assume we use the last 56 blocks for the data region.
Next, we need to save some metadata for each file in the system, such as:
- File name
- File size
- File owner
- Access permissions
- Creation and modification times

...and so on.
The data blocks that store this metadata are commonly referred to as inodes (index nodes). Assuming each inode occupies 256 bytes, we allocate 5 blocks for the inode table as shown below, which is enough for 80 inodes.
Now that we have a data area and a file metadata area, in a normally functioning file system, it is also necessary to track which data blocks are in use and which are not. This type of data structure is what we call allocation structures. A common industry method is the free list, which strings together all free blocks in the form of a linked list. However, for simplicity, we use a simpler data structure here: a bitmap. One is used for the data area, known as the data bitmap; another is used for the inode table, known as the inode bitmap.
The concept of a bitmap is straightforward: it uses one bit per inode or data block to mark whether it is free, where 0 means free and 1 means in use. A 4KB bitmap can track up to 32K objects. For convenience, we allocate an entire block to each of the two bitmaps (even though only a fraction of the bits is needed), resulting in the following diagram.
It can be seen that our basic approach is to lay out the regions from the back of the disk forward, leaving one block at the very front. This block is deliberately reserved as the file system’s superblock. The superblock, as the entry point of a file system, typically stores file-system-level metadata, such as the number of inodes and data blocks (80 and 56, respectively), the starting block offset of the inode table (3), and so on.
When the file system is mounted, the operating system first reads the superblock (which is why it is placed at the beginning) and then initializes a series of parameters based on it, mounting the data volume into the file system tree. With these basic pieces of information, when a file in this volume is accessed, its location can be progressively identified, which is the read-write process we will discuss later.
Inode (Index Node)
An inode is short for index node, which serves as the index node for files and directories. For simplicity, we use an array to organize the index nodes, with each inode associated with a number (inumber), which is its index (offset) in the array.
As mentioned earlier, each inode occupies 256 bytes. Given an inumber, we can calculate its byte offset on the disk (`12KB + inumber * 256`). However, since data is exchanged between memory and disk in whole blocks, we can further calculate the disk block in which it resides.
The inode primarily stores a file’s metadata (permission controls, various timestamps, some flag bits) and its data block indices; the file name itself lives in the directory entries discussed later. The data block indices are also considered metadata, but they are singled out because they are very important.
We use a relatively simple indexing method: an indirect pointer. This means that what is stored in the inode is not a direct pointer to the data block, but a pointer to a block of pointers (also allocated in the data area, but it stores secondary pointers). If the file is large enough, it may also lead to tertiary pointers (as for whether our small system needs this, everyone can estimate).
However, statistics show that in most file systems, small files are the majority. How small? Small enough to fit in a single data block.
Therefore, in practice, we use a mixed method of direct pointers and indirect pointers in the inode for representation. In our file system, this means using 12 direct pointers and 1 indirect pointer. So, as long as the file size does not exceed 12 data blocks, direct pointers can be used directly. Indirect pointers are only used when the file is too large, and a new data block is allocated in the data area to store the indirect pointers.
Our data area is quite small, with only 56 blocks. Assuming 4-byte block indices, a single indirect block holds 4KB / 4B = 1024 pointers, so direct plus single-indirect pointers can support files of up to (12 + 1024) × 4KB = 4144KB.
Another method commonly used in practice is extents: each contiguous run of data is represented by a starting pointer and a length, and all the extents of a file are linked together. With many extents, however, performance suffers when accessing the last extent or doing random access (since the pointer to the next extent is stored within the previous one). To speed this up, it is common to keep the linking information in memory. Windows’s early FAT file system worked in this linked fashion (though it linked individual blocks rather than extents, via the file allocation table kept in memory).
Directory Organization
In our file system, the directory organization is quite straightforward: like a file, each directory also occupies an inode. However, the data block pointed to by that inode stores not file content but the entries for all files and subdirectories contained in the directory, typically represented as a list of `<entry name, inode number>` pairs. Of course, an actual encoding must store additional information such as the length of each file name (since file names are variable-length).
Let’s consider a simple example. Suppose we have a directory named “dir” (with an inode number of 5), which contains three files (“foo”, “bar”, and “foobar”), corresponding to inode numbers 12, 13, and 24, respectively. The information stored in the data block of this directory would be as follows:
Here, `reclen` (record length) is the size of the space reserved for the entry’s name, and `strlen` is the name’s actual length. The two dot entries are pointers, one to the current directory (`.`) and one to the parent directory (`..`). Recording `reclen` may seem redundant, but it is needed to handle file deletion (a special inum, such as 0, can mark an entry as deleted). When a file or subdirectory is deleted, it leaves a hole in the directory’s storage; `reclen` allows that hole to be reused by subsequently added entries.
It should be noted that organizing the files in a directory linearly is the simplest method. In practice, there are other methods as well. For example, in XFS, if there are many files or subdirectories in a directory, a B+ tree is used for organization. This allows for quick determination of whether there is a file with the same name during insertion.
Free Space Management
When we need to create a new file or directory entry, we must obtain a block of available space from the file system. Therefore, how to efficiently manage free space is a significant issue. We manage this using two bitmaps, which have the advantage of simplicity but also the drawback of requiring a linear scan to find all free bit positions each time, and it can only be done at the block granularity, leaving any remaining space within a block unmanaged.
Read-Write Path
After understanding the data structures on the disk, let’s then connect the different data structures through the read-write process. We assume that the file system has already been mounted, meaning that the superblock is already in memory.
Reading a File
Our operation is quite simple: we open a file (such as `/foo/bar`), read it, and then close it. For simplicity, let’s assume the file occupies a single 4KB block.
When the system call `open("/foo/bar", O_RDONLY)` is issued, the file system needs to first locate the inode corresponding to the file `bar` to obtain its metadata and data location. However, at this point we only have the file path, so what do we do?
The answer is: traverse from the root directory downwards. The inode number of the root directory is either stored in the superblock or is fixed (for example, 2, as most Unix file systems start with inode 2). That is, it must be known in advance (well known).
Thus, the file system loads the root directory’s inode from disk into memory, follows the inode’s pointers to its data blocks, and searches the entries there to find the `foo` directory and its corresponding inode. By repeating this process recursively, the final step of the open system call loads the inode of `bar` into memory, performs a permission check (comparing the process’s user permissions with the inode’s access controls), allocates a file descriptor in the process’s open file table, and returns it to the user.
Once the file is opened, a `read()` system call can be issued to actually read the data. When reading, the first block is located from the file’s inode (unless the offset was previously moved with `lseek()`), and then the data is read. Some of the inode’s metadata may also change, such as the access time. The offset in the process’s file descriptor is then advanced, and reading continues until, at some point, `close()` is invoked to close the file descriptor.
When a process closes a file, the required work is much less; it only needs to release the file descriptor, and there is no actual disk I/O operation.
Finally, let’s review the process of reading a file. Starting from the inode number of the root directory, we alternately read inodes and their corresponding data blocks until we finally find the target file. Then we read its data, and also update metadata in its inode, such as the access time, which must be written back. The following table briefly summarizes this process; note that the entire read path never touches the allocation structures, i.e., the data bitmap and the inode bitmap.
In terms of depth, if the hierarchy of our target search path is very extensive, this process will grow linearly; in terms of breadth, if the directories involved in the middle of the search contain a large number of directory entries, that is, the file tree is “wide,” then it may be necessary to read more than one data block each time a search is conducted in the directory.
Writing to the Hard Disk
The process of writing a file is similar to reading a file, which also involves opening the file (finding the corresponding file from the root directory), then starting to write, and finally closing it. However, unlike reading a file, writing requires the allocation of new data blocks, which involves our previous bitmap. Typically, a single write operation requires at least five I/O operations:
- Read the data bitmap (to find a free block and mark it as used in memory)
- Write back the data bitmap (to make it visible to other processes)
- Read the inode (to add a new data location pointer)
- Write back the inode
- Write data to the found free block
This is only for writing to an already existing file. If a file that does not yet exist is to be created and written to, the process is even more complex: an inode must also be created, which introduces a series of new I/O operations:
- A read of the inode bitmap (to find a free inode)
- A write-back of the inode bitmap (to mark an inode as used)
- A write to the inode itself (for initialization)
- A read and write to the data block corresponding to the parent directory’s directory entry (to add the newly created file and inode pair)
- A read and write to the parent directory’s inode (to update the modification date)
If the data blocks of the parent directory are insufficient, new space must be allocated, which would again require reading the data bitmap and data block.
Caching and Buffering
The analysis of the read-write process above shows that even such simple operations involve a large number of I/O operations, which is unacceptable in practice. To address this issue, most industrial file systems make full use of memory by caching (storing) important (i.e., frequently accessed) data blocks in memory; at the same time, to avoid frequent disk writes, modifications are first applied to a memory buffer, and then written to the disk in batches.
Early file systems introduced a fixed-size cache, which, when full, would use replacement algorithms like LRU (Least Recently Used) to evict pages. The downside was that it wasted space when the cache was not full and could lead to frequent page swapping when it was full. This approach is known as static partitioning. Most modern file systems use dynamic partitioning technology. For example, by pooling virtual memory pages and file system pages together, a unified page cache is created, allowing for more flexible allocation between the two.
The write process also benefits from the cache in its first half, which reads along the path to locate the data blocks. For the writing part itself, we can use a buffer (write buffering) to delay the disk write. Delaying has many advantages: multiple modifications to the bitmap may need only a single write; accumulating a batch of changes improves I/O bandwidth utilization; and if a file is created and then quickly deleted, the disk write may be avoided altogether.
However, these performance improvements come at a cost: an unexpected system crash can lose buffered data. Therefore, although most modern file systems enable read and write buffering, they also let users bypass the cache and write directly to the disk through direct I/O, and applications that are sensitive to data loss can use the `fsync()` system call to flush data to disk immediately.
Summary
So far, we have implemented a very simple file system. Although it is small, it has all the necessary components. From this, we can discern some basic concepts of file system design:
- Use inodes to store metadata at the file granularity; use data blocks to store actual file data.
- Directories are a special type of file, but instead of storing file content, they store directory entries.
- In addition to inodes and data blocks, other data structures are also needed, such as bitmaps to track free blocks.
Starting from this basic file system, we can actually see many points that can be chosen and optimized. If interested, one can build upon the foundation of this article to explore the design of some industrial file systems.