folly/GroupVarint.h
folly/GroupVarint.h
is an implementation of variable-length encoding for 32- and 64-bit integers using the Group Varint encoding scheme as described in Jeff Dean's WSDM 2009 talk and in Information Retrieval: Implementing and Evaluating Search Engines.
Briefly, a group of four 32-bit integers is encoded as a sequence of variable length, between 5 and 17 bytes; the first byte encodes the length (in bytes) of each integer in the group. A group of five 64-bit integers is encoded as a sequence of variable length, between 7 and 42 bytes; the first two bytes encode the length (in bytes) of each integer in the group.
GroupVarint.h
defines a few classes:
-
GroupVarint<T>
, whereT
isuint32_t
oruint64_t
:Basic encoding / decoding interface, mainly aimed at encoding / decoding one group at a time.
-
GroupVarintEncoder<T, Output>
, whereT
isuint32_t
oruint64_t
, andOutput
is a functor that acceptsStringPiece
objects as arguments:Streaming encoder: add values one at a time, and they will be flushed to the output one group at a time. Handles the case where the last group is incomplete (the number of integers to encode isn't a multiple of the group size)
-
GroupVarintDecoder<T>
, whereT
isuint32_t
oruint64_t
:Streaming decoder: extract values one at a time. Handles the case where the last group is incomplete.
The 32-bit implementation is significantly faster than the 64-bit implementation; on platforms supporting the SSSE3 instruction set, we use the PSHUFB instruction to speed up lookup, as described in SIMD-Based Decoding of Posting Lists(CIKM 2011).
For more details, see the header file folly/GroupVarint.h
and the associated test file folly/test/GroupVarintTest.cpp
.