pub async fn shuffle_dataset(
data: impl RecordBatchStream + Unpin + 'static,
column: &str,
ivf: Arc<IvfTransformer>,
precomputed_partitions: Option<HashMap<u64, u32>>,
num_partitions: u32,
shuffle_partition_batches: usize,
shuffle_partition_concurrency: usize,
precomputed_shuffle_buffers: Option<(Path, Vec<String>)>,
) -> Result<Vec<impl Stream<Item = Result<RecordBatch>>>>
Expand description
Disk-based shuffle for a stream of RecordBatch into each IVF partition. Sub-quantizer will be applied if provided.
§Parameters
data: input data stream. column: column name of the vector column. ivf: IVF model. num_partitions: number of IVF partitions. num_sub_vectors: number of PQ sub-vectors.
§Returns
Result<Vec<impl Stream<Item = Result