本文是第三章SSTable and LSM-Trees部分的读书笔记。
这部分包括的内容为
- 介绍如何用Hash Index与log组成key-value数据库,为了引入以SSTable作为log的实现
- SSTable的特点
- 如何处理读写请求,如何保证crash-safe
- compact和merge的策略
B-Trees的优点在于读更快, LSM-Trees的优点在于写更快
Hash Index + log
内存中存索引,磁盘上存log
- 分成多个segment,只往最新的segment写入
- 为了节省磁盘空间。定期删除重复key(
compaction
) - 因为compaction会让segment变小,可以在compaction的同时进行
merging
- compaction和merging不会修改原文件,而是在新的文件上执行。那么在这期间,依然可以处理读操作
- 每个segment对应一个hash table,查找的时候先查新的segment的hash table,再查旧的segment的hash table。
需要面对的问题
- log的格式
- 如何删除记录。方式:插入一条“删除”的记录,merging的时候,该记录会通知merging线程删掉对应的数据。If you want to delete a key and its associated value, you have to append a special deletion record to the data file(sometimes called a tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
- crach recovery
- Partially written records。写到一半就故障了怎么办,一种方案是引入checksum
- 并发控制。通常是只有一个writer,有多个reader,不需要锁
SSTable的特点
一个SSTable等价于一个log segment
SSTable特点:
- segment内有序。
- segment内key唯一。
- 合并segment非常的高效(因为segment内部是有序的)
- 如果使用hash索引,不用在内存中为每个key分配一个index,而是一个index对应很多key-value对,这样就
减少了内存
中需要维护的数据结构的大小 - 可以
压缩
一系列的key-value对后再写入segment,从而节省磁盘空间
hash索引的例子:in-memory index
+ Sorted segment file(SSTable) on disk
处理读写请求与保证crash-safe
处理写请求
- 内存中维护一个平衡树结构memtable来处理写请求。When a write comes in, add it to an in-memory balanced tree data structure. This in-memory tree is sometimes called a memtable
- 内存中的memtable大小超过一定阀值后,写入到磁盘存储为SSTable文件。When the memtable gets bigger than some threshold—typically a few megabytes —write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
处理读请求
- 处理读请求的时候,首先从memtable中找,然后从最新的磁盘上的SSTable文件找,没找到就依次找旧的SSTable文件。In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
- 因为读数据的时候需要一个一个的SSTable找,所以如果
数据不存在
,就需要找遍所有的segment,这会导致速度极其慢。一个通常的做法是额外维护一个bloom filter
merging和compaction
- 有个后台线程一直运行,负责合并(merging)和压缩(compaction),并且负责抛弃重复或被删除的值。From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
为了保证服务重启后memtable中的内容不会丢失,额外使用一个WAL日志来记录memtable中的内容,每次memtable被写入到磁盘后,该WAL日志就删除掉
compact和merge的策略
size-tiered
. 新的segment一直和旧的segment合并。newer and smaller SSTables are successively merged into older and larger SSTables.leveled
. the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space.
参考:
Designing Data-Intensive Applications https://book.douban.com/subject/26197294/