• Xapian's In-Memory Index


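      Keywords: xapian, in-memory index

      Besides the on-disk indexes meant for production use, xapian also provides an in-memory index (InMemoryDatabase). Looking at how the in-memory index is designed is a good way to understand xapian's overall design.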

    1 Purpose

      From the official documentation:

      “inmemory, This type is a database held entirely in memory. It was originally written for testing purposes only, but may prove useful for building up temporary small databases.”

      From the source notes of an earlier version:

      “This backend stores a database entirely in memory.  When the database is closed these indexed contents are lost. This is useful for searching through relatively small amounts of data (such as a single large file) which hasn't previously been indexed.”

      From a source comment in an earlier version:

      “This is a prototype database, mainly used for debugging and testing”

      

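      In short, this is a prototype DB that was originally meant only for testing and debugging. It has no persistence, so the data is lost as soon as the DB is closed. It can serve searches over small amounts of data, and that data can be indexed in memory on the fly.

    2 Using the in-memory index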

    /***************************************************************************
    *
    * @file ram_xapian.cpp
    * @author cswuyg
    * @date 2019/02/21
    * @brief
    *
    **************************************************************************/
    // The in-memory index is opened through a deprecated API; disable the
    // resulting MSVC deprecation warning (C4996).
    #pragma warning(disable : 4996)
    #include <iostream>
    #include <string>
    #include "xapian.h"
    
    #pragma comment(lib, "libxapian.a")
    #pragma comment(lib, "zdll.lib")
    
    const char* const K_DB_PATH = "index_data";
    const char* const K_DOC_UNIQUE_ID = "007";
    
    Xapian::WritableDatabase createIndex() {
        std::cout << "--index start--" << std::endl;
        Xapian::WritableDatabase db = Xapian::InMemory::open();
    
        Xapian::Document doc;
        doc.add_posting("T世界", 1);
        doc.add_posting("T体育", 1);
        doc.add_posting("T比赛", 1);
        doc.set_data("世界体育比赛");
        doc.add_boolean_term(K_DOC_UNIQUE_ID);
    
        Xapian::docid innerId = db.replace_document(K_DOC_UNIQUE_ID, doc);
    
        std::cout << "add doc innerId=" << innerId << std::endl;
    
        db.commit();
    
        std::cout << "--index finish--" << std::endl;
        return db;
    }
    
    void queryIndex(Xapian::WritableDatabase db) {
        std::cout << "--search start--" << std::endl;
        Xapian::Query termOne = Xapian::Query("T世界");
        Xapian::Query termTwo = Xapian::Query("T比赛");
        Xapian::Query termThree = Xapian::Query("T体育");
        auto query = Xapian::Query(Xapian::Query::OP_OR, Xapian::Query(Xapian::Query::OP_OR, termOne, termTwo), termThree);
        std::cout << "query=" << query.get_description() << std::endl;
    
        Xapian::Enquire enquire(db);
        enquire.set_query(query);
        Xapian::MSet result = enquire.get_mset(0, 10);
        std::cout << "find results count=" << result.get_matches_estimated() << std::endl;
    
        for (auto it = result.begin(); it != result.end(); ++it) {
            Xapian::Document doc = it.get_document();
            std::string data = doc.get_data();
            Xapian::weight docScoreWeight = it.get_weight();
            Xapian::percent docScorePercent = it.get_percent();
    
            std::cout << "doc=" << data << ",weight=" << docScoreWeight << ",percent=" << docScorePercent << std::endl;
        }
    
        std::cout << "--search finish--" << std::endl;
    }
    
    int main() {
        auto db = createIndex();
        queryIndex(db);
        return 0;
    }

    github: https://github.com/cswuyg/xapian_exercise/tree/master/ram_xapian
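
Conceptually, the nested OP_OR query above asks for every document that appears in the inverted list of at least one of the three terms. A minimal sketch of that union over ascending docid lists follows. `or_merge` is a hypothetical helper written for illustration only, and this is not how Xapian's matcher is implemented (the real matcher also computes weights).

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper, not part of Xapian: union of two ascending docid
// lists, which is what an OR node over two inverted lists has to produce.
std::vector<unsigned> or_merge(const std::vector<unsigned>& a,
                               const std::vector<unsigned>& b) {
    // Inverted lists are assumed to be sorted by docid.
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<unsigned> out;
    // set_union keeps the result sorted and emits a shared docid only once.
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}
```

Merging {1, 3} with {2, 3} and then with {5} yields the docids 1, 2, 3, 5, which mirrors the OR(OR(a, b), c) shape of the query in the example.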

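    3 Data structures

    The in-memory index is built from a handful of data structures, and they give a good view of how xapian designs its indexes.

    The data structures of the in-memory index are shown in the figure below:

    [Figure: data structures of the in-memory index]

    The main wrapper classes:

    InMemoryPostList: the in-memory postlist for a single term; it operates on that term's inverted list;

    InMemoryAllDocsPostList: the in-memory postlist for the whole DB; what it actually operates on is the termlist table (the doc table);

    InMemoryTermList: the term list of a given doc;

    InMemoryDatabase: the in-memory DB;

    InMemoryAllTermsList: the in-memory termlist for the whole DB; what it actually walks is the DB's postlists;

    InMemoryDocument: wraps the operations on a single doc;

    InMemoryPositionList: wraps the operations on an in-memory position list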
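
To make the division of labour between these classes concrete, here is a toy model that uses only the standard library. It assumes the index boils down to two containers: a term-keyed map of inverted lists and a docid-indexed table of term lists. `ToyIndex`, `ToyPosting` and all member names are hypothetical, and deletion, weights and values are left out. It is a sketch of the idea, not Xapian's actual code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: names are hypothetical, not the real Xapian classes.
using DocId = unsigned;    // 1-based docid
using TermPos = unsigned;  // position of a term inside a doc

struct ToyPosting {                  // one entry of an inverted list
    DocId did;
    std::vector<TermPos> positions;  // role of InMemoryPositionList
};

struct ToyIndex {
    std::map<std::string, std::vector<ToyPosting>> postlists;  // term -> inverted list
    std::vector<std::vector<std::string>> termlists;           // docid-1 -> terms (doc table)

    // Index one document given (term, position) pairs; returns its docid.
    DocId add_document(const std::vector<std::pair<std::string, TermPos>>& postings) {
        termlists.emplace_back();
        const DocId did = static_cast<DocId>(termlists.size());
        for (const auto& p : postings) {
            std::vector<ToyPosting>& plist = postlists[p.first];
            if (plist.empty() || plist.back().did != did) {  // first time this doc has the term
                plist.push_back(ToyPosting{did, {}});
                termlists.back().push_back(p.first);
            }
            plist.back().positions.push_back(p.second);
        }
        return did;
    }

    // Role of InMemoryPostList: docids in the inverted list of one term.
    std::vector<DocId> docs_for_term(const std::string& term) const {
        std::vector<DocId> out;
        auto it = postlists.find(term);
        if (it != postlists.end())
            for (const ToyPosting& p : it->second) out.push_back(p.did);
        return out;
    }

    // Role of InMemoryAllDocsPostList: every doc, read off the doc table.
    // There is no deletion in this toy, so every slot is a live doc.
    std::vector<DocId> all_docs() const {
        std::vector<DocId> out;
        for (DocId did = 1; did <= termlists.size(); ++did) out.push_back(did);
        return out;
    }

    // Role of InMemoryTermList: the terms of one doc.
    const std::vector<std::string>& terms_of(DocId did) const {
        assert(did >= 1 && did <= termlists.size());
        return termlists[did - 1];
    }

    // Role of InMemoryAllTermsList: every term, read off the postlists keys.
    std::vector<std::string> all_terms() const {
        std::vector<std::string> out;
        for (const auto& kv : postlists) out.push_back(kv.first);
        return out;
    }

    // Positions of `term` inside doc `did` (empty if the doc lacks the term).
    std::vector<TermPos> positions_of(const std::string& term, DocId did) const {
        auto it = postlists.find(term);
        if (it == postlists.end()) return {};
        for (const ToyPosting& p : it->second)
            if (p.did == did) return p.positions;
        return {};
    }
};
```

Note how the two "all" iterators lean on the opposite structure: listing all docs walks the doc table, while listing all terms walks the keys of the postlists map, which matches the class descriptions above.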

  • Original article: https://www.cnblogs.com/cswuyg/p/10417727.html