pub struct ArrowBytesMap<O, V>{ /* private fields */ }
Expand description
Optimized map for storing Arrow “bytes” types (String
, LargeString
,
Binary
, and LargeBinary
) values that can produce the set of keys on
output as GenericBinaryArray
without copies.
Equivalent to HashSet<String, V>
but with better performance for arrow
data.
§Generic Arguments
O
: OffsetSize (String/LargeString)V
: payload type
§Description
This is a specialized HashMap with the following properties:
-
Optimized for storing and emitting Arrow byte types (e.g.
StringArray
/BinaryArray
) very efficiently by minimizing copying of the string values themselves, both when inserting and when emitting the final array. -
Retains the insertion order of entries in the final array. The values are in the same order as they were inserted.
Note this structure can be used as a HashSet
by specifying the value type
as ()
, as is done by ArrowBytesSet
.
This map is used by the special COUNT DISTINCT
aggregate function to
store the distinct values, and by the GROUP BY
operator to store
group values when they are a single string array.
§Example
The following diagram shows how the map would store the four strings “Foo”, NULL, “Bar”, “TheQuickBrownFox”:
-
hashtable
stores entries for each distinct string that has been inserted. The entries contain the payload as well as information about the value (either an offset or the actual bytes, seeEntry
docs for more details) -
offsets
stores offsets intobuffer
for each distinct string value, following the same convention as the offsets in aStringArray
orLargeStringArray
. -
buffer
stores the actual byte data -
null
: stores the index and payload of the null value, in this case the second value (index 1)
┌───────────────────────────────────┐ ┌─────┐ ┌────┐
│ ... │ │ 0 │ │FooB│
│ ┌──────────────────────────────┐ │ │ 0 │ │arTh│
│ │ <Entry for "Bar"> │ │ │ 3 │ │eQui│
│ │ len: 3 │ │ │ 3 │ │ckBr│
│ │ offset_or_inline: "Bar" │ │ │ 6 │ │ownF│
│ │ payload:... │ │ │ │ │ox │
│ └──────────────────────────────┘ │ │ │ │ │
│ ... │ └─────┘ └────┘
│ ┌──────────────────────────────┐ │
│ │<Entry for "TheQuickBrownFox">│ │ offsets buffer
│ │ len: 16 │ │
│ │ offset_or_inline: 6 │ │ ┌───────────────┐
│ │ payload: ... │ │ │ Some(1) │
│ └──────────────────────────────┘ │ │ payload: ... │
│ ... │ └───────────────┘
└───────────────────────────────────┘
null
HashTable
§Entry Format
Entries stored in a ArrowBytesMap
represents a value that is either
stored inline or in the buffer
This helps the case where there are many short (less than 8 bytes) strings that are the same (e.g. “MA”, “CA”, “NY”, “TX”, etc)
┌──────────────────┐
─ ─ ─ ─ ─ ─ ─▶│... │
│ │TheQuickBrownFox │
│... │
│ │ │
└──────────────────┘
│ buffer of u8
│
┌────────────────┬───────────────┬───────────────┐
Storing │ │ starting byte │ length, in │
"TheQuickBrownFox" │ hash value │ offset in │ bytes (not │
(long string) │ │ buffer │ characters) │
└────────────────┴───────────────┴───────────────┘
8 bytes 8 bytes 4 or 8
┌───────────────┬─┬─┬─┬─┬─┬─┬─┬─┬───────────────┐
Storing "foobar" │ │ │ │ │ │ │ │ │ │ length, in │
(short string) │ hash value │?│?│f│o│o│b│a│r│ bytes (not │
│ │ │ │ │ │ │ │ │ │ characters) │
└───────────────┴─┴─┴─┴─┴─┴─┴─┴─┴───────────────┘
8 bytes 8 bytes 4 or 8
Implementations§
source§impl<O: OffsetSizeTrait, V> ArrowBytesMap<O, V>
impl<O: OffsetSizeTrait, V> ArrowBytesMap<O, V>
pub fn new(output_type: OutputType) -> Self
sourcepub fn take(&mut self) -> Self
pub fn take(&mut self) -> Self
Return the contents of this map and replace it with a new empty map with the same output type
sourcepub fn insert_if_new<MP, OP>(
&mut self,
values: &ArrayRef,
make_payload_fn: MP,
observe_payload_fn: OP,
)
pub fn insert_if_new<MP, OP>( &mut self, values: &ArrayRef, make_payload_fn: MP, observe_payload_fn: OP, )
Inserts each value from values
into the map, invoking payload_fn
for
each value if not already present, deferring the allocation of the
payload until it is needed.
Note that this is different than a normal map that would replace the existing entry
§Arguments:
values
: array whose values are inserted
make_payload_fn
: invoked for each value that is not already present
to create the payload, in order of the values in values
observe_payload_fn
: invoked once, for each value in values
, that was
already present in the map, with corresponding payload value.
§Returns
The payload value for the entry, either the existing value or the newly inserted value
§Safety:
Note that make_payload_fn
and observe_payload_fn
are only invoked
with valid values from values
, not for the NULL
value.
sourcepub fn into_state(self) -> ArrayRef
pub fn into_state(self) -> ArrayRef
Converts this set into a StringArray
, LargeStringArray
,
BinaryArray
, or LargeBinaryArray
containing each distinct value
that was inserted. This is done without copying the values.
The values are guaranteed to be returned in the same order in which they were first seen.
sourcepub fn non_null_len(&self) -> usize
pub fn non_null_len(&self) -> usize
Number of non null entries