• LevelDB Source Code Reading (4): The Compaction Operation


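    LevelDB stores data following the LSM (log-structured merge) idea: random writes are turned into sequential writes. Each write is first recorded in an operation log, and as soon as the log record has been appended to the file on disk the write returns successfully; a background thread later applies the logged writes to the existing on-disk files to produce new ones. In memory LevelDB keeps a memtable, implemented as a skiplist, that holds the most recent writes. Once the memtable reaches a certain size it becomes a read-only immutable memtable; new writes go to a fresh memtable while a background thread writes the immutable one out to a data file. The data files are organized in levels. The default configuration has 7 levels (config::kNumLevels), that is level0, level1, ..., level6. An immutable memtable is normally written to level0 (PickLevelForMemTableOutput(), shown later, may place it slightly deeper), and within every level other than level0 the key ranges of the files do not overlap. When certain conditions are met, files in level i are merged with files in level i+1 to produce new level i+1 files. Both this merging of on-disk files and the dump of the immutable memtable are called Compaction, and LevelDB performs them on a single dedicated background thread.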

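    A Compaction is triggered under any of the following conditions:

    1. A new immutable memtable has been produced and needs to be written to a data file.

    2. The amount of data in some level is too large.

    3. Some file has been the subject of too many unproductive seeks (a lookup that reads file i without finding the key there counts as an unproductive seek against file i).

    4. A manual compaction is requested.

    When any of these holds, the Compaction process starts. The rest of this article walks through it in detail.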
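
    The four triggers can be summarized in a small predicate. This is only a sketch: the struct CompactionTriggers and the function NeedsBackgroundWork are hypothetical names introduced for illustration, not LevelDB types, and the real check is spread over DBImpl and VersionSet.

```cpp
#include <cassert>

// Hypothetical, simplified view of the state that decides whether
// background work is needed. Not a LevelDB type.
struct CompactionTriggers {
  bool has_immutable_memtable = false;  // trigger 1: an immutable memtable is waiting
  double compaction_score = 0.0;        // trigger 2: a score >= 1 means some level is too large
  bool has_file_to_compact = false;     // trigger 3: a file used up its allowed seeks
  bool has_manual_request = false;      // trigger 4: a manual compaction was requested
};

// Returns true if at least one of the four triggers is active.
bool NeedsBackgroundWork(const CompactionTriggers& t) {
  return t.has_immutable_memtable || t.has_manual_request ||
         t.compaction_score >= 1.0 || t.has_file_to_compact;
}
```

    Any single trigger is enough; which kind of Compaction actually runs is decided later, once the background task starts.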

    The entry point for Compaction is DBImpl::MaybeScheduleCompaction in db/db_impl.cc. It may be called on any read or write that LevelDB performs. Its source is:

    void DBImpl::MaybeScheduleCompaction() {
      mutex_.AssertHeld();
      if (bg_compaction_scheduled_) {
        // Already scheduled
      } else if (shutting_down_.Acquire_Load()) {
        // DB is being deleted; no more background compactions
      } else if (!bg_error_.ok()) {
        // Already got an error; no more changes
      } else if (imm_ == NULL &&
                 manual_compaction_ == NULL &&
                 !versions_->NeedsCompaction()) {
        // No work to be done
      } else {
        bg_compaction_scheduled_ = true;
        env_->Schedule(&DBImpl::BGWork, this);  // create a background task and schedule it
      }
    }

    Besides checking imm_ and manual_compaction_, it calls NeedsCompaction(), declared in db/version_set.h, to decide whether a compaction task has to be started:

    // Returns true iff some level needs a compaction.
      bool NeedsCompaction() const {
        Version* v = current_;
        return (v->compaction_score_ >= 1) || (v->file_to_compact_ != NULL);
      }

    compaction_score_ is computed in version_set.cc as follows:

    void VersionSet::Finalize(Version* v) {
      // Precomputed best level for next compaction
      int best_level = -1;
      double best_score = -1;
    
      for (int level = 0; level < config::kNumLevels-1; level++) {
        double score;
        if (level == 0) {
          // We treat level-0 specially by bounding the number of files
          // instead of number of bytes for two reasons:
          //
          // (1) With larger write-buffer sizes, it is nice not to do too
          // many level-0 compactions.
          //
          // (2) The files in level-0 are merged on every read and
          // therefore we wish to avoid too many files when the individual
          // file size is small (perhaps because of a small write-buffer
          // setting, or very high compression ratios, or lots of
          // overwrites/deletions).
          score = v->files_[level].size() /
              static_cast<double>(config::kL0_CompactionTrigger);
        } else {
          // Compute the ratio of current size to size limit.
          const uint64_t level_bytes = TotalFileSize(v->files_[level]);
          score = static_cast<double>(level_bytes) / MaxBytesForLevel(level);
        }
    
        if (score > best_score) {
          best_level = level;
          best_score = score;
        }
      }
    
      v->compaction_level_ = best_level;
      v->compaction_score_ = best_score;
    }

    Note that Finalize() also precomputes the best level for the next compaction (compaction_level_).

    Once a compaction is known to be needed, PosixEnv::Schedule in util/env_posix.cc is called to hand the work to the background thread.

    void PosixEnv::Schedule(void (*function)(void*), void* arg) {
      PthreadCall("lock", pthread_mutex_lock(&mu_));
    
      // Start background thread if necessary
      if (!started_bgthread_) {
        started_bgthread_ = true;
        PthreadCall(
            "create thread",
            pthread_create(&bgthread_, NULL,  &PosixEnv::BGThreadWrapper, this));
      }
    
      // If the queue is currently empty, the background thread may currently be
      // waiting.
      if (queue_.empty()) {
        PthreadCall("signal", pthread_cond_signal(&bgsignal_));
      }
    
      // Add to priority queue
      queue_.push_back(BGItem());
      queue_.back().function = function;
      queue_.back().arg = arg;
    
      PthreadCall("unlock", pthread_mutex_unlock(&mu_));
    }

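    Schedule starts the background thread on first use, signals the condition variable if the queue is currently empty (the thread may be waiting on it), and then appends a new BGItem task to the background queue. The entry point of the background thread is PosixEnv::BGThreadWrapper: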

    static void* BGThreadWrapper(void* arg) {
        reinterpret_cast<PosixEnv*>(arg)->BGThread();
        return NULL;
      }

    BGThreadWrapper calls PosixEnv::BGThread, which loops forever: it takes the next task off the background queue and runs it.

    void PosixEnv::BGThread() {
      while (true) {
        // Wait until there is an item that is ready to run
        PthreadCall("lock", pthread_mutex_lock(&mu_));
        while (queue_.empty()) {
          PthreadCall("wait", pthread_cond_wait(&bgsignal_, &mu_));
        }
    
        void (*function)(void*) = queue_.front().function;
        void* arg = queue_.front().arg;
        queue_.pop_front();
    
        PthreadCall("unlock", pthread_mutex_unlock(&mu_));
        (*function)(arg);
      }
    }
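
    The same producer/consumer pattern can be sketched with the C++11 standard library instead of raw pthreads. BackgroundQueue is a hypothetical class written for illustration; unlike PosixEnv, it creates its thread eagerly, can be shut down, and offers a WaitIdle() helper so that the example can be checked.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-threaded work queue in the spirit of
// PosixEnv::Schedule / PosixEnv::BGThread.
class BackgroundQueue {
 public:
  BackgroundQueue() : worker_(&BackgroundQueue::Run, this) {}

  ~BackgroundQueue() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_all();
    worker_.join();  // remaining tasks are drained before the thread exits
  }

  // Appends a task and wakes the worker, which may be waiting on an empty queue.
  void Schedule(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

  // Blocks until every scheduled task has finished.
  void WaitIdle() {
    std::unique_lock<std::mutex> lock(mu_);
    idle_cv_.wait(lock, [this] { return queue_.empty() && !running_; });
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mu_);
    while (true) {
      cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
      if (queue_.empty()) return;  // stopping and nothing left to run
      std::function<void()> task = std::move(queue_.front());
      queue_.pop_front();
      running_ = true;
      lock.unlock();  // run the task without holding the queue lock
      task();
      lock.lock();
      running_ = false;
      idle_cv_.notify_all();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::condition_variable idle_cv_;
  std::deque<std::function<void()>> queue_;
  bool stopping_ = false;
  bool running_ = false;
  std::thread worker_;  // declared last so the other members exist before it starts
};
```

    As in BGThread, the task runs with the queue lock released, so new work can be scheduled while a long compaction is in progress.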

    Back to DBImpl::MaybeScheduleCompaction; its source is repeated here for convenience:

    void DBImpl::MaybeScheduleCompaction() {
      mutex_.AssertHeld();
      if (bg_compaction_scheduled_) {
        // Already scheduled
      } else if (shutting_down_.Acquire_Load()) {
        // DB is being deleted; no more background compactions
      } else if (!bg_error_.ok()) {
        // Already got an error; no more changes
      } else if (imm_ == NULL &&
                 manual_compaction_ == NULL &&
                 !versions_->NeedsCompaction()) {
        // No work to be done
      } else {
        bg_compaction_scheduled_ = true;
        env_->Schedule(&DBImpl::BGWork, this);  // create a background task and schedule it
      }
    }

    The scheduling done by env_->Schedule was covered above. Next comes DBImpl::BGWork, the function that actually runs as the background task. It is also in db/db_impl.cc.

    void DBImpl::BGWork(void* db) {
      reinterpret_cast<DBImpl*>(db)->BackgroundCall();
    }

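    DBImpl::BGWork calls DBImpl::BackgroundCall(). Because a finished compaction may leave some level with too many files, BackgroundCall() calls MaybeScheduleCompaction() again at the end to check whether another compaction is needed.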

    void DBImpl::BackgroundCall() {
      MutexLock l(&mutex_);
      assert(bg_compaction_scheduled_);
      if (shutting_down_.Acquire_Load()) {
        // No more background work when shutting down.
      } else if (!bg_error_.ok()) {
        // No more background work after a background error.
      } else {
        BackgroundCompaction();
      }
    
      bg_compaction_scheduled_ = false;
    
      // Previous compaction may have produced too many files in a level,
      // so reschedule another compaction if needed.
      MaybeScheduleCompaction();
      bg_cv_.SignalAll();
    }

    DBImpl::BackgroundCall() calls BackgroundCompaction(), which handles three kinds of Compaction: compacting the memtable, a trivial Compaction (the file is simply moved to the next level), and an ordinary merge carried out by DoCompactionWork().

    void DBImpl::BackgroundCompaction() {
      mutex_.AssertHeld();
    
      if (imm_ != NULL) {
        CompactMemTable();  // 1. compact the immutable memtable
        return;
      }
    
      Compaction* c;
      bool is_manual = (manual_compaction_ != NULL);  // manual_compaction_ is NULL by default, so is_manual is normally false
      InternalKey manual_end;
      if (is_manual) {  // manual compaction: build the Compaction from the requested range
        ManualCompaction* m = manual_compaction_;
        c = versions_->CompactRange(m->level, m->begin, m->end);
        m->done = (c == NULL);
        if (c != NULL) {
          manual_end = c->input(0, c->num_input_files(0) - 1)->largest;
        }
        Log(options_.info_log,
            "Manual compaction at level-%d from %s .. %s; will stop at %s\n",
            m->level,
            (m->begin ? m->begin->DebugString().c_str() : "(begin)"),
            (m->end ? m->end->DebugString().c_str() : "(end)"),
            (m->done ? "(end)" : manual_end.DebugString().c_str()));
      } else {  // automatic compaction: let the VersionSet pick one
        c = versions_->PickCompaction();
      }
    
      Status status;
      if (c == NULL) {
        // Nothing to do
      } else if (!is_manual && c->IsTrivialMove()) {  // 2. trivial Compaction: just move the file to level + 1
        // Move file to next level
        assert(c->num_input_files(0) == 1);
        FileMetaData* f = c->input(0, 0);
        c->edit()->DeleteFile(c->level(), f->number);
        c->edit()->AddFile(c->level() + 1, f->number, f->file_size,
                           f->smallest, f->largest);
        status = versions_->LogAndApply(c->edit(), &mutex_);
        if (!status.ok()) {
          RecordBackgroundError(status);
        }
        VersionSet::LevelSummaryStorage tmp;
        Log(options_.info_log, "Moved #%lld to level-%d %lld bytes %s: %s\n",
            static_cast<unsigned long long>(f->number),
            c->level() + 1,
            static_cast<unsigned long long>(f->file_size),
            status.ToString().c_str(),
            versions_->LevelSummary(&tmp));
      } else {  // 3. ordinary merge
        CompactionState* compact = new CompactionState(c);
        status = DoCompactionWork(compact);  // perform the compaction
        if (!status.ok()) {
          RecordBackgroundError(status);
        }
        CleanupCompaction(compact);
        c->ReleaseInputs();      // drop the reference held on the input version
        DeleteObsoleteFiles();   // delete files that are no longer needed
      }
      delete c;
    
      if (status.ok()) {
        // Done
      } else if (shutting_down_.Acquire_Load()) {
        // Ignore compaction errors found during shutting down
      } else {
        Log(options_.info_log,
            "Compaction error: %s", status.ToString().c_str());
      }
    
      if (is_manual) {
        ManualCompaction* m = manual_compaction_;   // mark the manual compaction as handled
        if (!status.ok()) {
          m->done = true;
        }
        if (!m->done) {
          // We only compacted part of the requested range.  Update *m
          // to the range that is left to be compacted.
          m->tmp_storage = manual_end;
          m->begin = &m->tmp_storage;
        }
        manual_compaction_ = NULL;
      }
    }

    The first line is mutex_.AssertHeld(). Mutex::AssertHeld has an empty default implementation and is called in many functions. Its purpose is explained as follows:

    As you have observed it does nothing in the default implementation. The function seems to be a placeholder for checking whether a particular thread holds a mutex and optionally abort if it doesn’t. This would be equivalent to the normal asserts we use for variables but applied on mutexes.
    I think the reason it is not implemented yet is we don’t have an equivalent light weight function to assert whether a thread holds a lock in pthread_mutex_t used in the default implementation. Some platforms which has that capability could fill this implementation as part of porting process. Searching online I did find some implementation for this function in the windows port of leveldb. I can see one way to implement it using a wrapper class over pthread_mutex_t and setting some sort of a thread id variable to indicate which thread(s) currently holds the mutex, but it will have to be carefully implemented given the race conditions that can arise.
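
    The wrapper idea mentioned at the end of the quote can be sketched as follows. CheckedMutex is a hypothetical class, not LevelDB's port::Mutex; it assumes C++11 and records the owning thread id in an atomic so that the check itself does not race.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical mutex wrapper that remembers which thread holds it.
class CheckedMutex {
 public:
  void Lock() {
    mu_.lock();
    // Recorded only after the lock is acquired.
    owner_.store(std::this_thread::get_id(), std::memory_order_relaxed);
  }

  void Unlock() {
    // Cleared before the lock is released.
    owner_.store(std::thread::id(), std::memory_order_relaxed);
    mu_.unlock();
  }

  bool HeldByCurrentThread() const {
    return owner_.load(std::memory_order_relaxed) == std::this_thread::get_id();
  }

  // Aborts (in builds with assertions enabled) if the caller does not hold the mutex.
  void AssertHeld() const { assert(HeldByCurrentThread()); }

 private:
  std::mutex mu_;
  std::atomic<std::thread::id> owner_{std::thread::id()};
};
```

    A thread only ever compares the stored id with its own id, and it always observes its own stores in order, so the check cannot report a lock as held by the caller when it is not. The price is one extra atomic store on every lock and unlock.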

    Compacting the Memtable

    Compaction first checks imm_ so that a full memtable is written to an on-disk sstable file as soon as possible. This is done by DBImpl::CompactMemTable():

    void DBImpl::CompactMemTable() {
      mutex_.AssertHeld();
      assert(imm_ != NULL);  // imm_ must not be NULL
      VersionEdit edit;
      Version* base = versions_->current();
      base->Ref();
      Status s = WriteLevel0Table(imm_, &edit, base);  // write the memtable out as an .sst file and record it in edit
      base->Unref();
      if (s.ok()) {
        edit.SetPrevLogNumber(0);
        edit.SetLogNumber(logfile_number_);  // Earlier logs no longer needed
        s = versions_->LogAndApply(&edit, &mutex_);  // apply the changes in edit to produce a new version
      }
    
      if (s.ok()) {
          // Commit to the new state
        imm_->Unref();
        imm_ = NULL;
        has_imm_.Release_Store(NULL);
        DeleteObsoleteFiles();  
      } else {
        RecordBackgroundError(s);
      }
    }

    CompactMemTable() mainly calls two functions: WriteLevel0Table() and versions_->LogAndApply().

    It calls WriteLevel0Table() first:

    Status DBImpl::WriteLevel0Table(MemTable* mem, VersionEdit* edit,
                                    Version* base) {
      mutex_.AssertHeld();
      FileMetaData meta;
      meta.number = versions_->NewFileNumber();  // number of the new .sst file
      pending_outputs_.insert(meta.number);
      Iterator* iter = mem->NewIterator();  // iterates over the memtable contents
    
      Status s;
      {
        mutex_.Unlock();
        s = BuildTable(dbname_, env_, options_, table_cache_, iter, &meta);  // build the .sst file and fill in meta
        mutex_.Lock();
      }
    
      delete iter;  // the iterator must be deleted once it is no longer needed
      pending_outputs_.erase(meta.number);
    
      int level = 0;
      if (s.ok() && meta.file_size > 0) {
        const Slice min_user_key = meta.smallest.user_key();
        const Slice max_user_key = meta.largest.user_key();
        if (base != NULL) {
          level = base->PickLevelForMemTableOutput(min_user_key, max_user_key);  // choose the level for the output file
        }
        edit->AddFile(level, meta.number, meta.file_size, meta.smallest, meta.largest);  // add the new .sst file to that level
      }
      return s;
    }

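    WriteLevel0Table() first calls BuildTable() to write all data in the immutable memtable into one .sst file, and records the file's information (file number, key range, file size) in meta. Because the memtable is backed by a skiplist and is therefore sorted, keys are written to the .sst file in ascending order. Note that every record in the memtable is written to the SSTable, deletion markers included; deletions are only carried out later, during Compaction into deeper levels. As a result, level 0 may contain several records with the same key.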

    Status BuildTable(const std::string& dbname,
                      Env* env,
                      const Options& options,
                      TableCache* table_cache,
                      Iterator* iter,
                      FileMetaData* meta) {
      Status s;
      meta->file_size = 0;
      iter->SeekToFirst();  
      std::string fname = TableFileName(dbname, meta->number);  // file name of the new table
      if (iter->Valid()) {
        WritableFile* file;
        s = env->NewWritableFile(fname, &file);  // create the table file that the data will be written to
        if (!s.ok()) {
          return s;
        }
        TableBuilder* builder = new TableBuilder(options, file);  // create the TableBuilder
        meta->smallest.DecodeFrom(iter->key());
        for (; iter->Valid(); iter->Next()) {  // add every key/value pair to the builder
          Slice key = iter->key();
          meta->largest.DecodeFrom(key);
          builder->Add(key, iter->value());
        }
    
        // Finish and check for builder errors
        s = builder->Finish();  // write the last data block, the filter/metaindex/index blocks and the footer
        if (s.ok()) {
          meta->file_size = builder->FileSize();
          assert(meta->file_size > 0);
        }
        delete builder;
    
        // Finish and check for file errors
        if (s.ok()) {
          s = file->Sync();  // force the written data to stable storage
        }
        if (s.ok()) {
          s = file->Close();
        }
        delete file;
        file = NULL;
    
        if (s.ok()) {
          // Verify that the table is usable
          Iterator* it = table_cache->NewIterator(ReadOptions(),
                                                  meta->number,
                                                  meta->file_size);  // opens the table through the table cache
          s = it->status();
          delete it;
        }
      }
    
      // Check for input iterator errors
      if (!iter->status().ok()) {
        s = iter->status();
      }
    
      if (s.ok() && meta->file_size > 0) {
        // Keep it
      } else {
        env->DeleteFile(fname);
      }
      return s;
    }

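    BuildTable() feeds key/value pairs from iter into the TableBuilder, writes and syncs the file, and then opens the new table through the table cache, which verifies that it is usable and leaves it cached for later use.

    TableBuilder is implemented in table/table_builder.cc. TableBuilder::Add works as follows: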

    void TableBuilder::Add(const Slice& key, const Slice& value) {
      Rep* r = rep_;
      assert(!r->closed);
      if (!ok()) return;
      if (r->num_entries > 0) {
        assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
      }
    
      if (r->pending_index_entry) {  // a data block was just flushed and a new one is starting
        assert(r->data_block.empty());
        r->options.comparator->FindShortestSeparator(&r->last_key, key);
        std::string handle_encoding;
        r->pending_handle.EncodeTo(&handle_encoding);
        r->index_block.Add(r->last_key, Slice(handle_encoding));
        r->pending_index_entry = false;
      }
      // add the key to the filter block, if a filter policy is configured
      if (r->filter_block != NULL) {
        r->filter_block->AddKey(key);
      }
      // append the entry to the current data block
      r->last_key.assign(key.data(), key.size());
      r->num_entries++;
      r->data_block.Add(key, value);
      // once the block reaches the configured size (4 KB by default), finish it and start a new block
      const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
      if (estimated_block_size >= r->options.block_size) {
        Flush();
      }
    }
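
    The index entry for a finished block uses a key that separates the block's last key from the first key of the next block, which keeps the index small. The sketch below shows how such a separator can be computed for plain bytewise ordering. ShortestSeparator is a hypothetical free function; it assumes bytewise comparison and ignores the internal-key suffix that LevelDB's keys carry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Shortens *start to a string s with *start <= s < limit in bytewise order,
// when a shorter such string is easy to find. Otherwise *start is left unchanged.
// Precondition (as in the index-building case): *start < limit.
void ShortestSeparator(std::string* start, const std::string& limit) {
  const std::size_t min_length = std::min(start->size(), limit.size());
  std::size_t diff_index = 0;
  while (diff_index < min_length && (*start)[diff_index] == limit[diff_index]) {
    diff_index++;
  }
  if (diff_index >= min_length) {
    return;  // one string is a prefix of the other: nothing to shorten
  }
  const uint8_t diff_byte = static_cast<uint8_t>((*start)[diff_index]);
  if (diff_byte < 0xff &&
      diff_byte + 1 < static_cast<uint8_t>(limit[diff_index])) {
    // Bump the first differing byte and cut everything after it.
    (*start)[diff_index]++;
    start->resize(diff_index + 1);
    assert(*start < limit);
  }
}
```

    For a block ending at "abcdefg" followed by a block starting at "abzzz", the index key becomes "abd": it is at least as large as every key in the first block and smaller than every key in the second.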

    TableBuilder::WriteBlock, which writes a block to the file, works as follows:

    void TableBuilder::WriteBlock(BlockBuilder* block, BlockHandle* handle) {
      // File format contains a sequence of blocks where each block has:
      //    block_data: uint8[n]
      //    type: uint8
      //    crc: uint32
      assert(ok());
      Rep* r = rep_;
      Slice raw = block->Finish();  // serialized contents of the block
    
      Slice block_contents;
      // compression type taken from the options
      CompressionType type = r->options.compression;
      // TODO(postrelease): Support more compression options: zlib?
      switch (type) {
        case kNoCompression:
          block_contents = raw;
          break;
    
        case kSnappyCompression: {
          std::string* compressed = &r->compressed_output;
          if (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&
              compressed->size() < raw.size() - (raw.size() / 8u)) {
            block_contents = *compressed;
          } else {
            // Snappy not supported, or compressed less than 12.5%, so just
            // store uncompressed form
            block_contents = raw;
            type = kNoCompression;
          }
          break;
        }
      }
      // write block_contents (compressed or not), followed by the type byte and the crc32
      WriteRawBlock(block_contents, type, handle);
      r->compressed_output.clear();
      block->Reset();
    }
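
    The compression decision in WriteBlock is plain integer arithmetic and can be checked in isolation. The helper names below are hypothetical; the 5-byte trailer follows from the block layout in the comment above (a 1-byte type plus a 4-byte crc).

```cpp
#include <cassert>
#include <cstddef>

// Per-block trailer: 1 byte for the compression type + 4 bytes for the crc32.
const std::size_t kBlockTrailerSize = 5;

// Mirrors the test in WriteBlock: the compressed form is kept only if it
// saves more than 1/8 (12.5%) of the raw size.
bool WorthStoringCompressed(std::size_t raw_size, std::size_t compressed_size) {
  return compressed_size < raw_size - (raw_size / 8u);
}

// Bytes the block occupies in the file, trailer included.
std::size_t OnDiskBlockSize(std::size_t raw_size, std::size_t compressed_size) {
  const std::size_t payload =
      WorthStoringCompressed(raw_size, compressed_size) ? compressed_size : raw_size;
  return payload + kBlockTrailerSize;
}
```

    For a 4096-byte block the threshold is 4096 - 512 = 3584 bytes: a compressed size of 3583 is kept, while 3584 or more falls back to the uncompressed form, which saves the reader a decompression step for very little extra space.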

    TableBuilder::Finish is defined as follows:

    Status TableBuilder::Finish() {
      Rep* r = rep_;
      Flush();  // write out the current data block, which may be only partly filled
      assert(!r->closed);
      r->closed = true;
    
      BlockHandle filter_block_handle, metaindex_block_handle, index_block_handle;
    
      // Write filter block
      if (ok() && r->filter_block != NULL) {
        WriteRawBlock(r->filter_block->Finish(), kNoCompression,
                      &filter_block_handle);
      }
    
      // Write metaindex block
      if (ok()) {
        BlockBuilder meta_index_block(&r->options);
        if (r->filter_block != NULL) {
          // Add mapping from "filter.Name" to location of filter data
          std::string key = "filter.";
          key.append(r->options.filter_policy->Name());
          std::string handle_encoding;
          filter_block_handle.EncodeTo(&handle_encoding);
          meta_index_block.Add(key, handle_encoding);
        }
    
        // TODO(postrelease): Add stats and other meta blocks
        WriteBlock(&meta_index_block, &metaindex_block_handle);
      }
    
      // Write index block
      if (ok()) {
        if (r->pending_index_entry) {
          r->options.comparator->FindShortSuccessor(&r->last_key);
          std::string handle_encoding;
          r->pending_handle.EncodeTo(&handle_encoding);
          r->index_block.Add(r->last_key, Slice(handle_encoding));
          r->pending_index_entry = false;
        }
        WriteBlock(&r->index_block, &index_block_handle);
      }
    
      // Write footer
      if (ok()) {
        Footer footer;
        footer.set_metaindex_handle(metaindex_block_handle);
        footer.set_index_handle(index_block_handle);
        std::string footer_encoding;
        footer.EncodeTo(&footer_encoding);
        r->status = r->file->Append(footer_encoding);
        if (r->status.ok()) {
          r->offset += footer_encoding.size();
        }
      }
      return r->status;
    }

    Flush(), called in the code above, is:

    void TableBuilder::Flush() {
      Rep* r = rep_;
      assert(!r->closed);
      if (!ok()) return;
      if (r->data_block.empty()) return;
      assert(!r->pending_index_entry);
      WriteBlock(&r->data_block, &r->pending_handle);
      if (ok()) {
        r->pending_index_entry = true;
        r->status = r->file->Flush();
      }
      if (r->filter_block != NULL) {
        r->filter_block->StartBlock(r->offset);
      }
    }

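    WriteLevel0Table() then calls PickLevelForMemTableOutput() to choose a suitable level for the file produced from the memtable, and calls edit->AddFile() to add the new .sst file to that level.

    After WriteLevel0Table() returns, CompactMemTable() calls versions_->LogAndApply() (in db/version_set.cc) to derive a new version from the current version and the changes in edit. LogAndApply() is analyzed in its own section below.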

    Trivial Compaction

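    As seen above, is_manual is false unless a manual compaction was requested. In the automatic case PickCompaction() selects the level to compact and its input files. If c->IsTrivialMove() holds, the file is moved directly to the next level.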

      c = versions_->PickCompaction();
    
      Status status;
      if (c == NULL) {
        // Nothing to do
      } else if (!is_manual && c->IsTrivialMove()) {
        // Move file to next level
        assert(c->num_input_files(0) == 1);
        FileMetaData* f = c->input(0, 0);
        c->edit()->DeleteFile(c->level(), f->number);  // remove the file from this level
        c->edit()->AddFile(c->level() + 1, f->number, f->file_size,  // add the same file to the next level
                           f->smallest, f->largest);
        status = versions_->LogAndApply(c->edit(), &mutex_);  // apply the edit and create a new Version
      } 

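    VersionSet::PickCompaction() in db/version_set.cc prepares the inputs for the upcoming Compaction. As the earlier analysis of the Compaction data structures showed, there are two automatic triggers: a level has grown too large (too many files for level 0, too many bytes for the other levels), or a file has exceeded its allowed number of seeks. When both apply, the size-triggered compaction takes priority.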

    Compaction* VersionSet::PickCompaction() {
      Compaction* c;
      int level;
    
      const bool size_compaction = (current_->compaction_score_ >= 1);  // some level is too large
      const bool seek_compaction = (current_->file_to_compact_ != NULL);  // some file has been seeked too often
      if (size_compaction) {  // size-triggered compaction takes priority
        level = current_->compaction_level_;  // the level to compact
        c = new Compaction(level);
        // each level has a compact_pointer_ recording where its last compaction ended, so compactions rotate through the key space
        for (size_t i = 0; i < current_->files_[level].size(); i++) {  // pick the first suitable file in this level
          FileMetaData* f = current_->files_[level][i];  // the i-th file in this level
          if (compact_pointer_[level].empty() ||  // no pointer yet: any file qualifies
              icmp_.Compare(f->largest.Encode(), compact_pointer_[level]) > 0) {  // or f's largest key lies past the pointer
            c->inputs_[0].push_back(f);  // use this file as the level input
            break;
          }
        }
        if (c->inputs_[0].empty()) {  // nothing past the pointer: wrap around to the first file
          c->inputs_[0].push_back(current_->files_[level][0]);
        }
      } else if (seek_compaction) {  // otherwise handle the seek-triggered case
        level = current_->file_to_compact_level_;
        c = new Compaction(level);
        c->inputs_[0].push_back(current_->file_to_compact_);  // the file to compact is the level input
      } else {
        return NULL;
      }
    
      c->input_version_ = current_;
      c->input_version_->Ref();
    
      // level-0 files may overlap each other, so collect every overlapping file and compact them together
      if (level == 0) {
        InternalKey smallest, largest;
        GetRange(c->inputs_[0], &smallest, &largest);  // key range of the chosen level-0 input
        current_->GetOverlappingInputs(0, &smallest, &largest, &c->inputs_[0]);
        assert(!c->inputs_[0].empty());
      }
      SetupOtherInputs(c);  // collect the overlapping inputs from level + 1
      return c;
    }
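
    The round-robin effect of compact_pointer_ is easier to see on a toy level. In the sketch below keys are plain strings compared bytewise, and FileInfo and PickFileIndex are hypothetical names; LevelDB compares internal keys through icmp_ instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for FileMetaData with string keys.
struct FileInfo {
  int number;
  std::string smallest;
  std::string largest;
};

// Returns the index of the file to compact next. `files` must be sorted by
// key and non-empty; an empty compact_pointer means "no compaction yet".
std::size_t PickFileIndex(const std::vector<FileInfo>& files,
                          const std::string& compact_pointer) {
  assert(!files.empty());
  for (std::size_t i = 0; i < files.size(); i++) {
    if (compact_pointer.empty() || files[i].largest > compact_pointer) {
      return i;  // first file that extends past the pointer
    }
  }
  return 0;  // every file lies at or before the pointer: wrap around
}
```

    With files covering [a,c], [d,f] and [g,i], an empty pointer selects the first file, a pointer of "c" selects the second, and a pointer of "i" wraps back to the first. Over time every part of the key range gets compacted instead of the same file being picked repeatedly.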

    Next comes the check for a trivial Compaction. In that case the file in level L only has to be moved to level L+1:

    bool Compaction::IsTrivialMove() const {
      return (num_input_files(0) == 1 &&   // exactly one input file in this level
              num_input_files(1) == 0 &&   // no overlapping file in level + 1
              TotalFileSize(grandparents_) <= kMaxGrandParentOverlapBytes);  // limited overlap with level + 2; otherwise the moved file would be expensive to merge later
    }

    The Compaction is then completed by:

    c->edit()->DeleteFile(c->level(), f->number);
    c->edit()->AddFile(c->level() + 1, f->number, f->file_size,f->smallest, f->largest);
    status = versions_->LogAndApply(c->edit(), &mutex_);  

    Ordinary Merge

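    An ordinary merge is performed by DBImpl::DoCompactionWork(). Its compact argument wraps the Compaction returned by VersionSet::PickCompaction(), the same as for a trivial Compaction. Records with the same key can exist in different levels, but with different sequence numbers (seq). As established earlier, newer data lives in shallower levels, so its seq is always larger than that of records for the same key in level+1. Level 0 may also hold several records with the same key and different seq values. The merged input is ordered by ascending user_key and, within one user_key, by descending seq, so all records for a user_key are adjacent with the newest first. As long as no snapshot needs an older version, only the first record for each user_key has to be kept and the later ones can be dropped, so the resulting level+1 files then contain at most one record per key. Deletions are also resolved here: a deletion marker is dropped, rather than written to the output, once no snapshot needs it and no deeper level contains the key.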

    Status DBImpl::DoCompactionWork(CompactionState* compact) {
      const uint64_t start_micros = env_->NowMicros();
      int64_t imm_micros = 0;  // Micros spent doing imm_ compactions
    
      if (snapshots_.empty()) {
        compact->smallest_snapshot = versions_->LastSequence();
      } else {
        compact->smallest_snapshot = snapshots_.oldest()->number_;
      }
      mutex_.Unlock();
      // build an iterator over all the data to be compacted
      Iterator* input = versions_->MakeInputIterator(compact->compaction);  // merges every input file
      input->SeekToFirst();
      Status status;
      ParsedInternalKey ikey;
      std::string current_user_key;
      bool has_current_user_key = false;
      SequenceNumber last_sequence_for_key = kMaxSequenceNumber;
      for (; input->Valid() && !shutting_down_.Acquire_Load(); ) {
        if (has_imm_.NoBarrier_Load() != NULL) {  // the immutable memtable has the highest priority
          const uint64_t imm_start = env_->NowMicros();
          mutex_.Lock();
          if (imm_ != NULL) {   // imm_ is set: compact the memtable first
            CompactMemTable();
            bg_cv_.SignalAll();  // Wakeup MakeRoomForWrite() if necessary
          }
          mutex_.Unlock();
          imm_micros += (env_->NowMicros() - imm_start);
        }
    
        Slice key = input->key();
        if (compact->compaction->ShouldStopBefore(key) &&   // close the current output early to limit its overlap with level N+2 files
            compact->builder != NULL) {
          status = FinishCompactionOutputFile(compact, input);
        }
    
        bool drop = false;
        if (!ParseInternalKey(key, &ikey)) {
          current_user_key.clear();
          has_current_user_key = false;
          last_sequence_for_key = kMaxSequenceNumber;
        } else {
          if (!has_current_user_key ||    // track the current user_key
              user_comparator()->Compare(ikey.user_key,
              Slice(current_user_key)) != 0) {  // the user_key differs from the previous entry
            // first occurrence of this user key
            current_user_key.assign(ikey.user_key.data(), ikey.user_key.size());
            has_current_user_key = true;
            last_sequence_for_key = kMaxSequenceNumber;  // the maximum value marks "no earlier entry for this key"
          }
    
          if (last_sequence_for_key <= compact->smallest_snapshot) {  // a newer entry for this key is already visible to every snapshot
            drop = true;    // (A)   // hidden by that newer entry: drop
          } else if (ikey.type == kTypeDeletion &&   // a deletion marker
                  ikey.sequence <= compact->smallest_snapshot &&  // that every snapshot can already see
                  compact->compaction->IsBaseLevelForKey(ikey.user_key)) {  // and no deeper level holds this key
            drop = true;   // the marker itself can be dropped
          }
          last_sequence_for_key = ikey.sequence;  // remembered for the next entry with the same user key
        }
    
        if (!drop) {   // the entry is kept
          if (compact->builder == NULL) {
            status = OpenCompactionOutputFile(compact);  // open a new .sst output file if none is open
            if (!status.ok()) {
              break;  // without an output file, builder stays NULL
            }
          }
          if (compact->builder->NumEntries() == 0) {
            compact->current_output()->smallest.DecodeFrom(key);
          }
          compact->current_output()->largest.DecodeFrom(key);
          compact->builder->Add(key, input->value());  // append the entry to the output .sst file
    
          if (compact->builder->FileSize() >=
              compact->compaction->MaxOutputFileSize()) {   // the output file is large enough
            status = FinishCompactionOutputFile(compact, input);  // finish this output file
          }
        }
        input->Next();  // advance to the next entry
      }
    
      if (status.ok() && compact->builder != NULL) {
        status = FinishCompactionOutputFile(compact, input);
      }
      if (status.ok()) {
        status = input->status();
      }
      delete input;
      input = NULL;
      
      // update the compaction statistics
      CompactionStats stats;
      stats.micros = env_->NowMicros() - start_micros - imm_micros;
      for (int which = 0; which < 2; which++) {
        for (int i = 0; i < compact->compaction->num_input_files(which); i++) {
          stats.bytes_read += compact->compaction->input(which, i)->file_size;
        }
      }
      for (size_t i = 0; i < compact->outputs.size(); i++) {
        stats.bytes_written += compact->outputs[i].file_size;
      }
    
      mutex_.Lock();
      stats_[compact->compaction->level() + 1].Add(stats);
    
      if (status.ok()) {
        status = InstallCompactionResults(compact);  // install the results of the merge
      }
      if (!status.ok()) {
        RecordBackgroundError(status);
      }
      VersionSet::LevelSummaryStorage tmp;
      Log(options_.info_log,
          "compacted to: %s", versions_->LevelSummary(&tmp));
      return status;
    
    }
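
    The drop rule in the loop above can be exercised on a small in-memory list. This is a sketch under simplifying assumptions: Entry and FilterEntries are hypothetical names, entries are already sorted by ascending user_key and descending sequence, and a single is_base_level flag replaces the per-key IsBaseLevelForKey() check.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum ValueType { kTypeDeletion = 0, kTypeValue = 1 };

// Hypothetical flattened form of a parsed internal key.
struct Entry {
  std::string user_key;
  uint64_t sequence;
  ValueType type;
};

const uint64_t kMaxSequenceNumber = ((0x1ull << 56) - 1);

// Returns the entries that survive the merge.
std::vector<Entry> FilterEntries(const std::vector<Entry>& input,
                                 uint64_t smallest_snapshot,
                                 bool is_base_level) {
  std::vector<Entry> kept;
  std::string current_user_key;
  bool has_current_user_key = false;
  uint64_t last_sequence_for_key = kMaxSequenceNumber;
  for (const Entry& e : input) {
    if (!has_current_user_key || e.user_key != current_user_key) {
      // First occurrence of this user key.
      current_user_key = e.user_key;
      has_current_user_key = true;
      last_sequence_for_key = kMaxSequenceNumber;
    }
    bool drop = false;
    if (last_sequence_for_key <= smallest_snapshot) {
      drop = true;  // hidden by a newer entry that every snapshot can see
    } else if (e.type == kTypeDeletion && e.sequence <= smallest_snapshot &&
               is_base_level) {
      drop = true;  // deletion marker with nothing left to hide
    }
    last_sequence_for_key = e.sequence;
    if (!drop) kept.push_back(e);
  }
  return kept;
}
```

    Take the input a@9, a@5, a@3, b@7 (deletion), b@2. With smallest_snapshot = 6, a@5 must be kept because a snapshot at sequence 6 cannot see a@9, while a@3 is dropped; the deletion b@7 is newer than the snapshot, so both b entries survive. With smallest_snapshot = 10 and nothing for these keys in deeper levels, only a@9 remains.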

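    DoCompactionWork() first writes the surviving records into .sst files and stores the related information in compact. It then calls InstallCompactionResults(), which adds the changes to the VersionEdit and calls LogAndApply() to obtain the new version.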

    Status DBImpl::InstallCompactionResults(CompactionState* compact) {
      mutex_.AssertHeld();
      Log(options_.info_log,  "Compacted %d@%d + %d@%d files => %lld bytes",
          compact->compaction->num_input_files(0),
          compact->compaction->level(),
          compact->compaction->num_input_files(1),
          compact->compaction->level() + 1,
          static_cast<long long>(compact->total_bytes));
    
      // Add compaction outputs
      compact->compaction->AddInputDeletions(compact->compaction->edit());
      const int level = compact->compaction->level();
      for (size_t i = 0; i < compact->outputs.size(); i++) {
        const CompactionState::Output& out = compact->outputs[i];
        compact->compaction->edit()->AddFile(
            level + 1,
            out.number, out.file_size, out.smallest, out.largest);
      }
      return versions_->LogAndApply(compact->compaction->edit(), &mutex_);
    }

    LogAndApply()

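    In all three kinds of Compaction above, once the changes to the current version have been collected in a VersionEdit, VersionSet::LogAndApply() is called to apply them and create a new version. The edit holds the files to delete from and add to level and level+1.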

    Status VersionSet::LogAndApply(VersionEdit* edit, port::Mutex* mu) {
    
      Version* v = new Version(this);  // create a new Version
      {
        Builder builder(this, current_);  // builder based on the current Version
        builder.Apply(edit);  // record the files that edit adds and deletes
        builder.SaveTo(v);  // write the resulting file lists into the new Version
      }
      Finalize(v);  // compute the compaction score and best level for the new Version
    
      std::string new_manifest_file;
      Status s;
      if (descriptor_log_ == NULL) {   // only entered on the first call
        assert(descriptor_file_ == NULL);
        new_manifest_file = DescriptorFileName(dbname_, manifest_file_number_);  // name of the new Manifest file
        edit->SetNextFile(next_file_number_);
        s = env_->NewWritableFile(new_manifest_file, &descriptor_file_);
        if (s.ok()) {
          descriptor_log_ = new log::Writer(descriptor_file_);
          s = WriteSnapshot(descriptor_log_);  // snapshot: a full record of the current state
        }
      }
      {
        mu->Unlock();
        if (s.ok()) {
          std::string record;
          edit->EncodeTo(&record);
          s = descriptor_log_->AddRecord(record);  // append the change to the Manifest file
          if (s.ok()) {
            s = descriptor_file_->Sync();
          }
        }
        if (s.ok() && !new_manifest_file.empty()) {
          s = SetCurrentFile(env_, dbname_, manifest_file_number_);
        }
        mu->Lock();
      }
    
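      if (s.ok()) {
        AppendVersion(v);  // install the new Version at the tail of the doubly linked list of Versions
        log_number_ = edit->log_number_;
        prev_log_number_ = edit->prev_log_number_;
      } else {
        delete v;  // the edit was not persisted, so discard the new Version
      }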
      return s;
    }

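    To restore the database state after a restart, the history of changes has to be recorded; it is stored in the Manifest file. To save space and time, LevelDB writes a complete description of the database once, when the Manifest is created (WriteSnapshot()), and afterwards records only the changes, that is, the contents of each VersionEdit (descriptor_log_->AddRecord()). On recovery, replaying the Manifest step by step restores the last state.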
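
    The "snapshot plus deltas" idea can be sketched with a reduced edit record. Edit, FileSetState and Replay are hypothetical names; a real VersionEdit also carries key ranges, file sizes, log numbers and compact pointers, all of which are left out here.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, reduced edit: only file deletions and additions,
// each identified by (level, file number).
struct Edit {
  std::vector<std::pair<int, uint64_t>> deleted;
  std::vector<std::pair<int, uint64_t>> added;
};

typedef std::set<std::pair<int, uint64_t>> FileSetState;

// Replays the records of a manifest in order. The first record plays the
// role of the full snapshot; the following ones are incremental changes.
FileSetState Replay(const std::vector<Edit>& manifest) {
  FileSetState live;
  for (const Edit& e : manifest) {
    for (const auto& d : e.deleted) live.erase(d);
    for (const auto& a : e.added) live.insert(a);
  }
  return live;
}
```

    For example, a snapshot listing files 5 (level 0) and 3 (level 1), followed by a memtable flush that adds file 7 to level 0, followed by a compaction that deletes all three and adds file 8 to level 1, replays to a state whose only live file is 8 in level 1.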

    VersionSet::LogAndApply first creates a new Version and then calls builder.Apply(edit) to record the numbers of all files that the edit deletes or adds:

      // Apply all of the edits in *edit to the current state.
      void Apply(VersionEdit* edit) {
        // update each level's compact pointer (the key where its next compaction starts)
        for (size_t i = 0; i < edit->compact_pointers_.size(); i++) {
          const int level = edit->compact_pointers_[i].first;
          vset_->compact_pointer_[level] =
              edit->compact_pointers_[i].second.Encode().ToString();
        }
        // record every deleted file in levels_[level].deleted_files
        const VersionEdit::DeletedFileSet& del = edit->deleted_files_;
        for (VersionEdit::DeletedFileSet::const_iterator iter = del.begin();
             iter != del.end();++iter) {
          const int level = iter->first;
          const uint64_t number = iter->second;
          levels_[level].deleted_files.insert(number);
        }
        // record every added file in levels_[level].added_files
        for (size_t i = 0; i < edit->new_files_.size(); i++) {
          const int level = edit->new_files_[i].first;
          FileMetaData* f = new FileMetaData(edit->new_files_[i].second);
          f->refs = 1;
          f->allowed_seeks = (f->file_size / 16384);
          if (f->allowed_seeks < 100) f->allowed_seeks = 100;
          levels_[level].deleted_files.erase(f->number);
          levels_[level].added_files->insert(f);
        }
      }

    VersionSet::LogAndApply then calls builder.SaveTo(v) to save the changes into the new Version:

      void SaveTo(Version* v) {
        BySmallestKey cmp;
        cmp.internal_comparator = &vset_->icmp_;
        for (int level = 0; level < config::kNumLevels; level++) {
          const std::vector<FileMetaData*>& base_files = base_->files_[level];  // files already in this level of the base Version
          std::vector<FileMetaData*>::const_iterator base_iter = base_files.begin();
          std::vector<FileMetaData*>::const_iterator base_end = base_files.end();
          const FileSet* added = levels_[level].added_files;  // files added to this level
          v->files_[level].reserve(base_files.size() + added->size());
          for (FileSet::const_iterator added_iter = added->begin();
               added_iter != added->end();++added_iter) {
            // first add the base files that sort before *added_iter (by smallest key)
            for (std::vector<FileMetaData*>::const_iterator bpos
                     = std::upper_bound(base_iter, base_end, *added_iter, cmp);
                 base_iter != bpos;++base_iter) {
              MaybeAddFile(v, level, *base_iter);
            }
            MaybeAddFile(v, level, *added_iter);  // then add the new file itself
          }
          for (; base_iter != base_end; ++base_iter) {
            MaybeAddFile(v, level, *base_iter);  // finally add the remaining base files
          }
        }
      }

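    bpos = std::upper_bound(base_iter, base_end, *added_iter, cmp) returns the first position in [base_iter, base_end) whose file orders after *added_iter under cmp. BySmallestKey orders files by their smallest key and uses the file number only to break ties. As an illustration, label each file by its rank in that order: suppose the existing files rank 1, 3, 4, 6, 8 and the added files rank 2, 5, 7. In the first iteration bpos points at 3, so base_iter covers a single element and file 1 is added to the new Version. In general, for each added file the base files that order before it are added first, then the added file itself, and then the loop moves on to the next added file. The final order is 1, 2, 3, 4, 5, 6, 7, 8, so the .sst files of each level end up sorted by smallest key. This yields the complete file list for every level of the new Version.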
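
    The merge loop can be reproduced on plain integers, where each integer stands for a file and also gives its position in the sort order. MergeFiles is a hypothetical function; the deleted set plays the role of the filtering that MaybeAddFile() performs against deleted_files.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Merges the sorted base files with the sorted added files, skipping
// anything listed in `deleted`. Same loop structure as Builder::SaveTo().
std::vector<int> MergeFiles(const std::vector<int>& base,
                            const std::set<int>& added,
                            const std::set<int>& deleted) {
  std::vector<int> result;
  result.reserve(base.size() + added.size());
  std::vector<int>::const_iterator base_iter = base.begin();
  const std::vector<int>::const_iterator base_end = base.end();
  auto maybe_add = [&](int f) {
    if (deleted.count(f) == 0) result.push_back(f);
  };
  for (int a : added) {
    // Emit every remaining base file that sorts before the added file.
    const std::vector<int>::const_iterator bpos =
        std::upper_bound(base_iter, base_end, a);
    for (; base_iter != bpos; ++base_iter) maybe_add(*base_iter);
    maybe_add(a);
  }
  // Emit whatever base files are left.
  for (; base_iter != base_end; ++base_iter) maybe_add(*base_iter);
  return result;
}
```

    Merging base {1, 3, 4, 6, 8} with added {2, 5, 7} gives 1 through 8 in order; if file 4 is also marked deleted, it is simply skipped. Because base_iter only moves forward, the whole merge is a single pass over both inputs.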


  • Original article: https://www.cnblogs.com/xueqiuqiu/p/8302545.html