1. LevelDB Basic Concepts
In the implementation of LevelDB, the level is without doubt the most important core concept, and compaction, defined and implemented on top of the leveled structure, is the algorithm's basic operation. Before studying LevelDB, one easily holds a preconceived misconception: that any leveled structure must come with a B+-tree-like lookup structure, where upper-level nodes hold pointers to child nodes, so that at level N one can decide from which node of this level to continue the search into the children. In LevelDB none of this holds: between levels, level N keeps no pointers that guide the lookup into the level below. Essentially, the tables of each level simply form one ordered list from left to right, while the different levels encode the age of the data: the higher the level, the newer the data, much like the way fossils accumulate in geological strata. This means that to look up a key, the search must proceed level by level from the top in order to find the newest value.
With this concept in mind, the key compaction process becomes easy to understand: a compaction is essentially a process in which aged data naturally sinks. For an SSTable at level N, its sinking must guarantee that after it reaches level N+1, all SSTables at level N+1 are still ordered from left to right and their key ranges remain pairwise disjoint. This means the SSTable has to be split, its content scattered piecemeal into the level below, and the operation also causes some SSTables at level N+1 to be broken up and recombined.
Going back to the beginning: where does the content of a fresh SSTable come from? In fact its keys have no relationship to one another at all; they end up in the same SSTable purely because their modifications were consecutive in time. Once the data produced by these modifications reaches a certain size, it is packed together into one SSTable file. This also means that an SSTable is internally ordered, but the key range it covers is arbitrary.
2. What LevelDB's Own Documentation Says About the Implementation
The documentation states explicitly that, apart from the files of the young (first) level, which may contain overlapping keys, the key ranges of the files at all other levels never overlap. When the combined size of the files at level L exceeds its configured limit, one file at level L is merged with all files at level L+1 whose key ranges overlap it, producing a new set of level-(L+1) files. These merges use bulk reads and writes, so the disk operations are efficient.
Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping key ranges. Consider level number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...), one file in level-L, and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
On compaction: the compaction picks one file from level L, together with all files at level L+1 whose key ranges overlap that file, as its input. The documentation specifically notes: if the level-L file overlaps only part of the keys of some level-(L+1) file, that entire level-(L+1) file is still used as input and is discarded once the compaction finishes. In other words, the basic granularity LevelDB operates on is the SSTable; it never extracts some keys from a file while keeping the rest.
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
3. LevelDB's Lookup: Get
The implementation has several notable points:
1. The search always walks all levels from level 0 downward, in order. This is because the levels themselves encode the age of the data; only the topmost data is the result of the latest modification.
2. At every level except level 0, binary search can be used, which shows that the key ranges of the files at these levels do not overlap.
3. As soon as a match is found, it is returned immediately. The reason is the same as in point 1: the first match found is the latest modification.
leveldb-master/db/version_set.cc
Status Version::Get(const ReadOptions& options,
                    const LookupKey& k,
                    std::string* value,
                    GetStats* stats) {
  ……
  for (int level = 0; level < config::kNumLevels; level++) {
    size_t num_files = files_[level].size();
    if (num_files == 0) continue;

    // Get the list of files to search in this level
    FileMetaData* const* files = &files_[level][0];
    if (level == 0) {
      // Level-0 files may overlap each other. Find all files that
      // overlap user_key and process them in order from newest to oldest.
      tmp.reserve(num_files);
      for (uint32_t i = 0; i < num_files; i++) {
        FileMetaData* f = files[i];
        if (ucmp->Compare(user_key, f->smallest.user_key()) >= 0 &&
            ucmp->Compare(user_key, f->largest.user_key()) <= 0) {
          tmp.push_back(f);
        }
      }
      if (tmp.empty()) continue;

      std::sort(tmp.begin(), tmp.end(), NewestFirst);
      files = &tmp[0];
      num_files = tmp.size();
    } else {
      // Binary search to find earliest index whose largest key >= ikey.
      uint32_t index = FindFile(vset_->icmp_, files_[level], ikey);
      if (index >= num_files) {
        files = NULL;
        num_files = 0;
      } else {
        tmp2 = files[index];
        if (ucmp->Compare(user_key, tmp2->smallest.user_key()) < 0) {
          // All of "tmp2" is past any data for user_key
          files = NULL;
          num_files = 0;
        } else {
          files = &tmp2;
          num_files = 1;
        }
      }
    }

    for (uint32_t i = 0; i < num_files; ++i) {
      if (last_file_read != NULL && stats->seek_file == NULL) {
        // We have had more than one seek for this read. Charge the 1st file.
        stats->seek_file = last_file_read;
        stats->seek_file_level = last_file_read_level;
      }

      FileMetaData* f = files[i];
      last_file_read = f;
      last_file_read_level = level;

      Saver saver;
      saver.state = kNotFound;
      saver.ucmp = ucmp;
      saver.user_key = user_key;
      saver.value = value;
      s = vset_->table_cache_->Get(options, f->number, f->file_size,
                                   ikey, &saver, SaveValue);
      if (!s.ok()) {
        return s;
      }
      switch (saver.state) {
        case kNotFound:
          break;  // Keep searching in other files
        case kFound:
          return s;
        case kDeleted:
          s = Status::NotFound(Slice());  // Use empty error message for speed
          return s;
        case kCorrupt:
          s = Status::Corruption("corrupted key for ", user_key);
          return s;
      }
    }
  }
  ……
}
4. LevelDB's Compact
A compaction is essentially a multi-way merge sort between the SSTable being merged and a set of sorted files. The merge itself is very simple and takes just one loop; the code lives in:
leveldb-master/table/merger.cc
virtual void Next() {
  assert(Valid());

  // Ensure that all children are positioned after key().
  // If we are moving in the forward direction, it is already
  // true for all of the non-current_ children since current_ is
  // the smallest child and key() == current_->key(). Otherwise,
  // we explicitly position the non-current_ children.
  if (direction_ != kForward) {
    for (int i = 0; i < n_; i++) {
      IteratorWrapper* child = &children_[i];
      if (child != current_) {
        child->Seek(key());
        if (child->Valid() &&
            comparator_->Compare(key(), child->key()) == 0) {
          child->Next();
        }
      }
    }
    direction_ = kForward;
  }

  current_->Next();
  FindSmallest();
}
As can be seen here, on each step the merge loops over all the ways (the children list), compares the element currently at the head of each, and picks the smallest one:
void MergingIterator::FindSmallest() {
  IteratorWrapper* smallest = NULL;
  for (int i = 0; i < n_; i++) {
    IteratorWrapper* child = &children_[i];
    if (child->Valid()) {
      if (smallest == NULL) {
        smallest = child;
      } else if (comparator_->Compare(child->key(), smallest->key()) < 0) {
        smallest = child;
      }
    }
  }
  current_ = smallest;
}
5. Mutual Exclusion Between Compact and Get
When a lookup is executed, what it needs is the current version of the database structures, i.e. the current version pointer returned by versions_->current(). Updating this amounts to nothing more than updating one pointer, and the database instance's mutex has already been acquired before the pointer is read, so a concurrent change (the pointer update) causes no harm.
Status DBImpl::Get(const ReadOptions& options,
                   const Slice& key,
                   std::string* value) {
  Status s;
  MutexLock l(&mutex_);
  SequenceNumber snapshot;
  if (options.snapshot != NULL) {
    snapshot = reinterpret_cast<const SnapshotImpl*>(options.snapshot)->number_;
  } else {
    snapshot = versions_->LastSequence();
  }

  MemTable* mem = mem_;
  MemTable* imm = imm_;
  Version* current = versions_->current();
  ……