# SSTable
The `tantivy-sstable` crate is yet another sstable crate.
It has been designed to be used in `quickwit`:
- as an alternative to the default tantivy fst dictionary.
- as a way to store the column index for dynamic fast fields.
The benefit compared to the fst crate is locality.
Searching a key in the fst crate requires downloading the entire dictionary.
Once the sstable index is downloaded, running a `get` in the sstable
crate only requires a single fetch.
Right now, the block index and the default block size have been thought
for quickwit, and the performance of a get is very bad.
# Sorted strings?
SSTable stands for Sorted String Table.
Strings have to be insert in sorted order.
That sorted order is used in different ways:
- it makes gets and streaming ranges of keys
possible.
- it allows incremental encoding of the keys
- the front compression is leveraged to optimize
the intersection with an automaton
# On disk format
Overview of the SSTable format. Unless noted otherwise, numbers are little-endian.
### SSTable
```
+-------+-------+-----+--------+
|----( # of blocks)---|
```
- Block(`SSTBlock`): list of independent block, terminated by a single empty block.
- Footer(`SSTFooter`)
### SSTBlock
```
+----------+----------+--------+-------+-------+-----+
| |----( # of deltas)---|
|------(maybe compressed)------|
```
- BlockLen(u32): length of the block, including the compress byte.
- Compress(u8): indicate whether block is compressed. 0 if not compressed, 1 if compressed.
- Values: an application defined format storing a sequence of value, capable of determining it own length
- Delta
### Delta
```
+---------+--------+
```
- KeepAdd
- Suffix: KeepAdd.add bytes of key suffix
### KeepAdd
KeepAdd can be represented in two different representation, a very compact 1byte one which is enough for most usage, and a longer variable-len one when required
When keep < 16 and add < 16
```
+-----+------+
```
- Add(u4): number of bytes to push
- Keep(u4): number of bytes to pop
Otherwise:
```
+------+------+-----+
```
- Add(VInt): number of bytes to push
- Keep(VInt): number of bytes to pop
Note: as the SSTable does not support redundant keys, there is no ambiguity between both representation. Add is always guaranteed to be non-zero, except for the very first key of an SSTable, where Keep is guaranteed to be zero.
### SSTFooter
```
+-----+----------------+-------------+-------------+---------+---------+
```
- Fst(Fst): finite state transducer mapping keys to a block number
- BlockAddrStore(BlockAddrStore): store mapping a block number to its BlockAddr
- StoreOffset(u64): Offset to start of the BlockAddrStore. If zero, see the SingleBlockSStable section
- IndexOffset(u64): Offset to the start of the SSTFooter
- NumTerm(u64): number of terms in the sstable
- Version(u32): Currently equal to 3
### Fst
Fst is in the format of tantivy\_fst
### BlockAddrStore
+---------+-----------+-----------+-----+-----------+-----------+-----+
|---------(N blocks)----------|---------(N blocks)----------|
- MetaLen(u64): length of the BlockMeta section
- BlockMeta(BlockAddrBlockMetadata): metadata to seek through BlockData
- BlockData(CompactedBlockAddr): bitpacked per block metadata
### BlockAddrBlockMetadata
+--------+------------+--------------+------------+--------------+-------------------+-----------------+----------+
- Offset(u64): offset of the corresponding BlockData in the datastream
- RangeStart(u64): the start position of the first block
- FirstOrdinal(u64): the first ordinal of the first block
- RangeSlope(u32): slope predicted for start range evolution (see computation in BlockData)
- OrdinalSlope(u64): slope predicted for first ordinal evolution (see computation in BlockData)
- FirstOrdinalNBits(u8): number of bits per ordinal in datastream (see computation in BlockData)
- RangeStartNBits(u8): number of bits per range start in datastream (see computation in BlockData)
### BlockData
+-----------------+-------------------+---------------+
|------(BlockLen repetitions)---------|
- RangeStartDelta(var): RangeStartNBits *bits* of little endian number. See below for decoding
- FirstOrdinalDelta(var): FirstOrdinalNBits *bits* of little endian number. See below for decoding
- FinalRangeEnd(var): RangeStartNBits *bits* of integer. See below for decoding
converting a BlockData of index Index and a BlockAddrBlockMetadata to an actual block address is done as follow:
range\_prediction := RangeStart + Index * RangeSlop;
range\_derivation := RangeStartDelta - (1 << (RangeStartNBits-1));
range\_start := range\_prediction + range\_derivation
The same computation can be done for ordinal.
Note that `range_derivation` can take negative value. `RangeStartDelta` is just its translation to a positive range.
## SingleBlockSStable
The format used for the index is meant to be compact, however it has a constant cost of around 70
bytes, which isn't negligible for a table containing very few keys.
To limit the impact of that constant cost, single block sstable omit the Fst and BlockAddrStore from
their index. Instead a block with first ordinal of 0, range start of 0 and range end of IndexOffset
is implicitly used for every operations.