Disk-based shuffle of a stream of RecordBatch into IVF partitions.
- write the entire stream to a file
- count the number of rows in each partition
- read the data back into memory and shuffle into grouped vectors
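The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the `Row` type, the `shuffle_via_disk` function, and the fixed 8-byte on-disk layout are hypothetical stand-ins for RecordBatch rows keyed by a partition id.

```rust
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::Path;

/// Hypothetical row: (partition id, payload value).
/// Stands in for one RecordBatch row with its group-by column.
pub type Row = (u32, f32);

/// Shuffle `rows` into one vector per partition, using `scratch` as a temporary file.
pub fn shuffle_via_disk<I>(
    rows: I,
    num_partitions: usize,
    scratch: &Path,
) -> io::Result<Vec<Vec<f32>>>
where
    I: IntoIterator<Item = Row>,
{
    // Steps 1 and 2: write the entire stream to a file,
    // counting the rows of each partition along the way.
    let mut counts = vec![0usize; num_partitions];
    {
        let mut writer = BufWriter::new(File::create(scratch)?);
        for (part, value) in rows {
            let idx = part as usize;
            if idx >= num_partitions {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidInput,
                    "partition id out of range",
                ));
            }
            counts[idx] += 1;
            writer.write_all(&part.to_le_bytes())?;
            writer.write_all(&value.to_le_bytes())?;
        }
        writer.flush()?;
    }

    // The counts let each group be allocated once, at its final size.
    let mut groups: Vec<Vec<f32>> = counts
        .iter()
        .map(|&n| Vec::with_capacity(n))
        .collect();

    // Step 3: read the data back and scatter each row into its group.
    let total: usize = counts.iter().sum();
    let mut reader = BufReader::new(File::open(scratch)?);
    let mut buf = [0u8; 8];
    for _ in 0..total {
        reader.read_exact(&mut buf)?;
        let part = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
        let value = f32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
        groups[part].push(value);
    }
    Ok(groups)
}
```

Counting during the write pass means the grouped vectors never need to grow while rows are read back, and rows keep their original stream order within each partition.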
Problems for the future:
- while the group-by column will stay the same, we may want to include extra data columns
- shuffling in memory is fast, but we should add a disk buffer to support bigger datasets
Structs§
Functions§
- Disk-based shuffle of a stream of RecordBatch into IVF partitions. A sub-quantizer is applied if provided.