Expand description
Utilities for rep-def levels
Repetition and definition levels are a way to encode multipile validity / offsets arrays into a single buffer. They are a form of “zipping” buffers together that takes advantage of the fact that, if the outermost array is invalid, then the validity of the inner items is irrelevant.
Note: the concept of repetition & definition levels comes from the Dremel paper and has been implemented in Apache Parquet. However, the implementation here is not necessarily compatible with Parquet. For example, we use 0 to represent the “inner-most” item and Parquet uses 0 to represent the “outer-most” item.
§Repetition Levels
With repetition levels we convert a sparse array of offsets into a dense array of levels. These levels are marked non-zero whenever a new list begins. In other words, given the list array with 3 rows [{<0,1>, <>, <2>}, {<3>}, {}], [], [{<4>}] we would have three offsets arrays:
Outer-most ([]): [0, 3, 3, 4] Middle ({}): [0, 3, 4, 4, 5] Inner (<>): [0, 2, 2, 3, 4, 5] Values : [0, 1, 2, 3, 4]
We can convert these into repetition levels as follows:
Values | Repetition |
---|---|
0 | 3 |
1 | 0 |
- | 1 |
2 | 1 |
3 | 2 |
- | 2 |
- | 3 |
4 | 0 |
Note: We actually have MORE repetition levels than values. This is because the repetition levels need to be able to represent empty lists.
§Definition Levels
Definition levels are simpler. We can think of them as zipping together various validity (from different levels of nesting) into a single buffer. For example, we could zip the arrays [1, 1, 0, 0] and [1, 0, 1, 0] into [11, 10, 01, 00]. However, 00 and 01 are redundant. If the outer level is null then the validity of the inner levels is irrelevant. To save space we instead encode a “level” which is the “depth” of the null. Let’s look at a more complete example:
Array: [{“middle”: {“inner”: 1]}}, NULL, {“middle”: NULL}, {“middle”: {“inner”: NULL}}]
In Arrow we would have the following validity arrays: Outer validity : 1, 0, 1, 1 Middle validity: 1, ?, 0, 1 Inner validity : 1, ?, ?, 0 Values : 1, ?, ?, ?
The ? values are undefined in the Arrow format. We can convert these into definition levels as follows:
Values | Definition |
---|---|
1 | 0 |
- | 3 |
- | 2 |
- | 1 |
§Compression
Note that we only need 2 bits of definition levels to represent 3 levels of nesting. Definition levels are always more compact than the input validity arrays.
Repetition levels are more complex. If there are very large lists then a sparse array of offsets (which has one element per list) might be more compact than a dense array of repetition levels (which has one element per list value, possibly even more if there are empty lists).
However, both repetition levels and definition levels are typically very compressible with RLE.
However, in Lance we don’t always take advantage of that compression because we want to be able to zip rep-def levels together with our values. This gives us fewer IOPS when accessing row values.
Structs§
- A
ControlWordIterator
when there are both repetition and definition levels - A
ControlWordIterator
when there are no repetition or definition levels - A structure used to collect validity buffers and offsets from arrow arrays and eventually create repetition and definition levels
- Starts with serialized repetition and definition levels and unravels them into validity buffers and offsets buffers
- Represents repetition and definition levels that have been serialized into a pair of (optional) level buffers
- A
ControlWordIterator
when there are only definition levels or only repetition levels
Enums§
- An iterator that generates control words from repetition and definition levels
- A parser to unwrap control words into repetition and definition levels
Functions§
- Builds a
ControlWordIterator
from repetition and definition levels by first calculating the width needed and then creating the iterator with the appropriate width