Module repdef

Source
Expand description

Utilities for rep-def levels

Repetition and definition levels are a way to encode multipile validity / offsets arrays into a single buffer. They are a form of “zipping” buffers together that takes advantage of the fact that, if the outermost array is invalid, then the validity of the inner items is irrelevant.

Note: the concept of repetition & definition levels comes from the Dremel paper and has been implemented in Apache Parquet. However, the implementation here is not necessarily compatible with Parquet. For example, we use 0 to represent the “inner-most” item and Parquet uses 0 to represent the “outer-most” item.

§Repetition Levels

With repetition levels we convert a sparse array of offsets into a dense array of levels. These levels are marked non-zero whenever a new list begins. In other words, given the list array with 3 rows [{<0,1>, <>, <2>}, {<3>}, {}], [], [{<4>}] we would have three offsets arrays:

Outer-most ([]): [0, 3, 3, 4] Middle ({}): [0, 3, 4, 4, 5] Inner (<>): [0, 2, 2, 3, 4, 5] Values : [0, 1, 2, 3, 4]

We can convert these into repetition levels as follows:

ValuesRepetition
03
10
-1
21
32
-2
-3
40

Note: We actually have MORE repetition levels than values. This is because the repetition levels need to be able to represent empty lists.

§Definition Levels

Definition levels are simpler. We can think of them as zipping together various validity (from different levels of nesting) into a single buffer. For example, we could zip the arrays [1, 1, 0, 0] and [1, 0, 1, 0] into [11, 10, 01, 00]. However, 00 and 01 are redundant. If the outer level is null then the validity of the inner levels is irrelevant. To save space we instead encode a “level” which is the “depth” of the null. Let’s look at a more complete example:

Array: [{“middle”: {“inner”: 1]}}, NULL, {“middle”: NULL}, {“middle”: {“inner”: NULL}}]

In Arrow we would have the following validity arrays: Outer validity : 1, 0, 1, 1 Middle validity: 1, ?, 0, 1 Inner validity : 1, ?, ?, 0 Values : 1, ?, ?, ?

The ? values are undefined in the Arrow format. We can convert these into definition levels as follows:

ValuesDefinition
10
-3
-2
-1

§Compression

Note that we only need 2 bits of definition levels to represent 3 levels of nesting. Definition levels are always more compact than the input validity arrays.

Repetition levels are more complex. If there are very large lists then a sparse array of offsets (which has one element per list) might be more compact than a dense array of repetition levels (which has one element per list value, possibly even more if there are empty lists).

However, both repetition levels and definition levels are typically very compressible with RLE.

However, in Lance we don’t always take advantage of that compression because we want to be able to zip rep-def levels together with our values. This gives us fewer IOPS when accessing row values.

Structs§

BinaryControlWordIterator
A ControlWordIterator when there are both repetition and definition levels
CompositeRepDefUnraveler
As we decode we may extract rep/def information from multiple pages (or multiple chunks within a page).
ControlWordDesc
Describes the properties of a control word
NilaryControlWordIterator
A ControlWordIterator when there are no repetition or definition levels
RepDefBuilder
A structure used to collect validity buffers and offsets from arrow arrays and eventually create repetition and definition levels
RepDefSlicer
Slices a level buffer into pieces
RepDefUnraveler
Starts with serialized repetition and definition levels and unravels them into validity buffers and offsets buffers
SerializedRepDefs
Represents repetition and definition levels that have been serialized into a pair of (optional) level buffers
SpecialRecord
UnaryControlWordIterator
A ControlWordIterator when there are only definition levels or only repetition levels

Enums§

ControlWordIterator
An iterator that generates control words from repetition and definition levels
ControlWordParser
A parser to unwrap control words into repetition and definition levels
DefinitionInterpretation
This tells us how an array handles definition. Given a stack of these and a nested array and a set of definition levels we can calculate how we should interpret the definition levels.

Functions§

build_control_word_iterator
Builds a ControlWordIterator from repetition and definition levels by first calculating the width needed and then creating the iterator with the appropriate width

Type Aliases§

LevelBuffer