Function shuffle_dataset

Source
pub async fn shuffle_dataset(
    data: impl RecordBatchStream + Unpin + 'static,
    column: &str,
    ivf: Arc<IvfTransformer>,
    precomputed_partitions: Option<HashMap<u64, u32>>,
    num_partitions: u32,
    shuffle_partition_batches: usize,
    shuffle_partition_concurrency: usize,
    precomputed_shuffle_buffers: Option<(Path, Vec<String>)>,
) -> Result<Vec<impl Stream<Item = Result<RecordBatch>>>>
Expand description

Disk-based shuffle for a stream of RecordBatch into each IVF partition. Sub-quantizer will be applied if provided.

§Parameters

data: input data stream. column: column name of the vector column. ivf: IVF model. num_partitions: number of IVF partitions. num_sub_vectors: number of PQ sub-vectors.

§Returns

Result<Vec<impl Stream<Item = Result>>>: a vector of streams of shuffled partitioned data. Each stream corresponds to a partition and is sorted within the stream. Consumer of these streams is expected to merge the streams into a single stream by k-list merge algo.