lance_encoding

Module repdef

Source
Expand description

Utilities for rep-def levels

Repetition and definition levels are a way to encode multipile validity / offsets arrays into a single buffer. They are a form of “zipping” buffers together that takes advantage of the fact that, if the outermost array is invalid, then the validity of the inner items is irrelevant.

Note: the concept of repetition & definition levels comes from the Dremel paper and has been implemented in Apache Parquet. However, the implementation here is not necessarily compatible with Parquet. For example, we use 0 to represent the “inner-most” item and Parquet uses 0 to represent the “outer-most” item.

§Repetition Levels

With repetition levels we convert a sparse array of offsets into a dense array of levels. These levels are marked non-zero whenever a new list begins. In other words, given the list array with 3 rows [{<0,1>, <>, <2>}, {<3>}, {}], [], [{<4>}] we would have three offsets arrays:

Outer-most ([]): [0, 3, 3, 4] Middle ({}): [0, 3, 4, 4, 5] Inner (<>): [0, 2, 2, 3, 4, 5] Values : [0, 1, 2, 3, 4]

We can convert these into repetition levels as follows:

ValuesRepetition
03
10
-1
21
32
-2
-3
40

Note: We actually have MORE repetition levels than values. This is because the repetition levels need to be able to represent empty lists.

§Definition Levels

Definition levels are simpler. We can think of them as zipping together various validity (from different levels of nesting) into a single buffer. For example, we could zip the arrays [1, 1, 0, 0] and [1, 0, 1, 0] into [11, 10, 01, 00]. However, 00 and 01 are redundant. If the outer level is null then the validity of the inner levels is irrelevant. To save space we instead encode a “level” which is the “depth” of the null. Let’s look at a more complete example:

Array: [{“middle”: {“inner”: 1]}}, NULL, {“middle”: NULL}, {“middle”: {“inner”: NULL}}]

In Arrow we would have the following validity arrays: Outer validity : 1, 0, 1, 1 Middle validity: 1, ?, 0, 1 Inner validity : 1, ?, ?, 0 Values : 1, ?, ?, ?

The ? values are undefined in the Arrow format. We can convert these into definition levels as follows:

ValuesDefinition
10
-3
-2
-1

§Compression

Note that we only need 2 bits of definition levels to represent 3 levels of nesting. Definition levels are always more compact than the input validity arrays.

Repetition levels are more complex. If there are very large lists then a sparse array of offsets (which has one element per list) might be more compact than a dense array of repetition levels (which has one element per list value, possibly even more if there are empty lists).

However, both repetition levels and definition levels are typically very compressible with RLE.

However, in Lance we don’t always take advantage of that compression because we want to be able to zip rep-def levels together with our values. This gives us fewer IOPS when accessing row values.

Structs§

Enums§

  • An iterator that generates control words from repetition and definition levels
  • A parser to unwrap control words into repetition and definition levels

Functions§

Type Aliases§