• lucene底层数据结构——FST,针对field使用列存储,delta encode压缩doc ids数组,LZ4压缩算法


    参考:

    http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal

    http://www.slideshare.net/jpountz/how-does-lucene-store-your-data

    http://www.infoq.com/cn/articles/database-timestamp-02?utm_source=infoq&utm_medium=related_content_link&utm_campaign=relatedContent_articles_clk

    摘录一些重要的:

    看一下Lucene的倒排索引是怎么构成的。

    我们来看一个实际的例子,假设有如下的数据:

    docid

    年龄

    性别

    1

    18

    2

    20

    3

    18

     

    这里每一行是一个document。每个document都有一个docid。那么给这些document建立的倒排索引就是:

    年龄

    18

    [1,3]

    20

    [2]

    性别

     

    [1,2]

    [3]

    可以看到,倒排索引是per field的,一个字段有一个自己的倒排索引。18,20这些叫做 term,而[1,3]就是posting list。Posting list就是一个int的数组,存储了所有符合某个term的文档id。那么什么是term dictionary 和 term index?

    那么什么是term dictionary 和 term index?

    假设我们有很多个term,比如:

    Carla,Sara,Elin,Ada,Patty,Kate,Selena

    如果按照这样的顺序排列,找出某个特定的term一定很慢,因为term没有排序,需要全部过滤一遍才能找出特定的term。排序之后就变成了:

    Ada,Carla,Elin,Kate,Patty,Sara,Selena

    这样我们可以用二分查找的方式,比全遍历更快地找出目标的term。这个就是 term dictionary。有了term dictionary之后,可以用 logN 次磁盘查找得到目标。但是磁盘的随机读操作仍然是非常昂贵的(一次random access大概需要10ms的时间)。所以尽量少的读磁盘,有必要把一些数据缓存到内存里。但是整个term dictionary本身又太大了,无法完整地放到内存里。于是就有了term index。term index有点像一本字典的大的章节表。比如:

    A开头的term ……………. Xxx页

    C开头的term ……………. Xxx页

    E开头的term ……………. Xxx页

    如果所有的term都是英文字符的话,可能这个term index就真的是26个英文字符表构成的了。但是实际的情况是,term未必都是英文字符,term可以是任意的byte数组。而且26个英文字符也未必是每一个字符都有均等的term,比如x字符开头的term可能一个都没有,而s开头的term又特别多。实际的term index是一棵trie 树:

    例子是一个包含 "A", "to", "tea", "ted", "ten", "i", "in", 和 "inn" 的 trie 树。这棵树不会包含所有的term,它包含的是term的一些前缀。通过term index可以快速地定位到term dictionary的某个offset,然后从这个位置再往后顺序查找。再加上一些压缩技术(搜索 Lucene Finite State Transducers) term index 的尺寸可以只有所有term的尺寸的几十分之一,使得用内存缓存整个term index变成可能。整体上来说就是这样的效果。

    现在我们可以回答“为什么Elasticsearch/Lucene检索可以比mysql快了。Mysql只有term dictionary这一层,是以b-tree排序的方式存储在磁盘上的。检索一个term需要若干次的random access的磁盘操作。而Lucene在term dictionary的基础上添加了term index来加速检索,term index以树的形式缓存在内存中。从term index查到对应的term dictionary的block位置之后,再去磁盘上找term,大大减少了磁盘的random access次数。

    额外值得一提的两点是:term index在内存中是以FST(finite state transducers)的形式保存的,其特点是非常节省内存。Term dictionary在磁盘上是以分block的方式保存的,一个block内部利用公共前缀压缩,比如都是Ab开头的单词就可以把Ab省去。这样term dictionary可以比b-tree更节约磁盘空间。

     --------------------------------------------------------

    lucene并非使用Tree structure
    – sorted for range queries
    – O(log(n)) search

    而是如下核心的数据结构,FST,delta encode压缩数组,列存储,LZ4压缩算法
    ●Terms index: map a term prefix to a block in the dict ○ FST: automaton with weighted arcs, compact thanks to shared prefixes/suffixes 核心数据结构,本质是前后缀共享的状态机,类似trie来搜索用户输入的某个单词是否能搜到,搜到的话就跳转到Terms dictionary里去,搜到的结果是单词在terms dict里的offset(本质是数组的偏移量)
    Lookup the term in the terms index
    – In-memory FST storing terms prefixes
    – Gives the offset to look at in the terms dictionary
    – Can fast-fail if no terms have this prefix
    ●Terms dictionary: statistics + pointer in postings lists, Store terms and documents in arrays – binary search
    • Jump to the given offset in the terms dictionary
    – compressed based on shared prefixes, similarly to a burst trie
    – called the “BlockTree terms dict”
    • read sequentially until the term is found
    ●Postings lists: encodes matching docs in sorted order ○ + positions + offsets 倒排的文档ID都在此
    • Jump to the given offset in the postings lists
    • Encoded using modified FOR (Frame of Reference) delta
    – 1. delta-encode
    – 2. split into block of N=128 values
    – 3. bit packing per block
    – 4. if remaining docs, encode with vInt
    ●Stored fields
    • In-memory index for a subset of the doc ids
    – memory-efficient thanks to monotonic compression
    – searched using binary search
    • Stored fields
    – stored sequentially
    – compressed (LZ4) in 16+KB blocks

    Query execution:
    • 2 disk seeks per field for search
    • 1 disk seek per doc for stored fields
    • It is common that the terms dict / postings lists fits into the file-system cache
    • “Pulse” optimization
    – For unique terms (freq=1), postings are inlined in the terms dict
    – Only 1 disk seek
    – Will always be used for your primary keys

    插入新数据:
    Insertion = write a new segment 一直写信segment可以防止使用锁
    • Merge segments when there are too many of them
    – concatenate docs, merge terms dicts and postings lists (merge sort!)
    删除:
    Deletion = turn a bit off
    • Ignore deleted documents when searching and merging (reclaims space)
    • Merge policies favor segments with many deletions

    优缺点:
    Updates require writing a new segment
    – single-doc updates are costly, bulk updates preferred
    – writes are sequential
    • Segments are never modified in place
    – filesystem-cache-friendly
    lock-free!
    • Terms are deduplicated
    – saves space for high-freq terms
    • Docs are uniquely identified by an ord
    – useful for cross-API communication
    – Lucene can use several indexes in a single query
    • Terms are uniquely identified by an ord
    – important for sorting: compare longs, not strings
    – important for faceting (more on this later)

    针对field使用列存储:
    Per doc and per field single numeric values, stored in a column-stride fashion
    • Useful for sorting and custom scoring
    • Norms are numeric doc values

    一些设计原则:
    • Save file handles
    – don’t use one file per field or per doc
    • Avoid disk seeks whenever possible
    – disk seek on spinning disk is ~10 ms
    • BUT don’t ignore the filesystem cache
    – random access in small files is fine
    • Light compression helps
    – less I/O
    – smaller indexes
    – filesystem-cache-friendly

    针对Compression techniques的数据结构:FSTs LZ4

  • 相关阅读:
    ###STL学习--标准模板库
    打包C#程序
    ###学习《Effective C++》
    [ZJOI2007]棋盘制作
    [NOI2012]美食节
    [SCOI2012]奇怪的游戏
    5120: [2017国家集训队测试]无限之环
    序列取数
    1028: [JSOI2007]麻将
    1011: [HNOI2008]遥远的行星
  • 原文地址:https://www.cnblogs.com/bonelee/p/6226386.html
Copyright © 2020-2023  润新知