Module shuffler


Disk-based shuffle of a stream of RecordBatch into IVF partitions.

  1. write the entire stream to a file
  2. count the number of rows in each partition
  3. read the data back into memory and shuffle into grouped vectors
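
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the `Row` type, the fixed-width on-disk layout, and the function names `spill`, `count_rows`, and `group` are all assumptions made for the example, and the real code operates on Arrow RecordBatch streams.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::Path;

/// Hypothetical row type: a partition id plus a fixed-dimension vector.
pub struct Row {
    pub partition: u32,
    pub vector: Vec<f32>,
}

/// Step 1: write the entire stream to a file as fixed-width little-endian records.
pub fn spill(rows: impl Iterator<Item = Row>, path: &Path, dim: usize) -> io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for row in rows {
        assert_eq!(row.vector.len(), dim, "all vectors must share one dimension");
        out.write_all(&row.partition.to_le_bytes())?;
        for x in &row.vector {
            out.write_all(&x.to_le_bytes())?;
        }
    }
    out.flush()
}

/// Reads one record; returns Ok(None) once the end of the file is reached.
fn read_row(input: &mut impl Read, dim: usize) -> io::Result<Option<Row>> {
    let mut id = [0u8; 4];
    match input.read_exact(&mut id) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut vector = Vec::with_capacity(dim);
    let mut buf = [0u8; 4];
    for _ in 0..dim {
        input.read_exact(&mut buf)?;
        vector.push(f32::from_le_bytes(buf));
    }
    Ok(Some(Row { partition: u32::from_le_bytes(id), vector }))
}

/// Step 2: scan the file once and count the rows in each partition.
pub fn count_rows(path: &Path, dim: usize) -> io::Result<HashMap<u32, usize>> {
    let mut input = BufReader::new(File::open(path)?);
    let mut counts = HashMap::new();
    while let Some(row) = read_row(&mut input, dim)? {
        *counts.entry(row.partition).or_insert(0) += 1;
    }
    Ok(counts)
}

/// Step 3: read the data back and append each vector to its partition's
/// flat buffer, pre-sized from the counts gathered in step 2.
pub fn group(
    path: &Path,
    dim: usize,
    counts: &HashMap<u32, usize>,
) -> io::Result<HashMap<u32, Vec<f32>>> {
    let mut groups: HashMap<u32, Vec<f32>> = counts
        .iter()
        .map(|(&p, &n)| (p, Vec::with_capacity(n * dim)))
        .collect();
    let mut input = BufReader::new(File::open(path)?);
    while let Some(row) = read_row(&mut input, dim)? {
        groups.entry(row.partition).or_default().extend_from_slice(&row.vector);
    }
    Ok(groups)
}
```

Counting before grouping lets each partition's buffer be allocated once at its final size, at the cost of a second pass over the file. Rows within a partition keep the order in which they arrived in the stream.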

Problems for the future:

  1. while the group-by column will stay the same, we may want to include extra data columns in the future
  2. shuffling in memory is fast, but we should add a disk buffer to support bigger datasets

Structs

IvfShuffler

Functions

shuffle_dataset
Disk-based shuffle of a stream of RecordBatch into IVF partitions. The sub-quantizer is applied if one is provided.
shuffle_vectors