lance_encoding::encoder

Trait FieldEncoder

pub trait FieldEncoder: Send {
    // Required methods
    fn maybe_encode(
        &mut self,
        array: ArrayRef,
        external_buffers: &mut OutOfLineBuffers,
        repdef: RepDefBuilder,
        row_number: u64,
        num_rows: u64,
    ) -> Result<Vec<EncodeTask>>;
    fn flush(
        &mut self,
        external_buffers: &mut OutOfLineBuffers,
    ) -> Result<Vec<EncodeTask>>;
    fn finish(
        &mut self,
        external_buffers: &mut OutOfLineBuffers,
    ) -> BoxFuture<'_, Result<Vec<EncodedColumn>>>;
    fn num_columns(&self) -> u32;
}

Top-level encoding trait that encodes any Arrow array type into one or more pages.

The field encoder implements buffering and encoding of a single input column but it may map to multiple output columns. For example, a list array or struct array will be encoded into multiple columns.

Also, fields may be encoded at different speeds. For example, given a struct column with three fields (a boolean field, an int32 field, and a 4096-dimension tensor field) the tensor field is likely to emit encoded pages much more frequently than the boolean field.
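The difference in page frequency follows from simple arithmetic. The sketch below is illustrative only: the 8 MiB page target, the bit-packed boolean, and the 32-bit tensor elements are assumptions made for the example, not values taken from the Lance source, and `rows_per_page` is a hypothetical helper.

```rust
/// Assumed page size target (hypothetical; chosen only for illustration).
const PAGE_BYTES: u64 = 8 * 1024 * 1024;

/// Number of rows needed to fill one page when each row occupies
/// `bits_per_row` bits of uncompressed data.
fn rows_per_page(bits_per_row: u64) -> u64 {
    PAGE_BYTES * 8 / bits_per_row
}

/// Rows per page for the three fields of the example struct:
/// (boolean, int32, 4096-dimension tensor of 32-bit values).
fn example_rows_per_page() -> (u64, u64, u64) {
    let boolean = rows_per_page(1); // assumes 1 bit per row
    let int32 = rows_per_page(32);
    let tensor = rows_per_page(4096 * 32); // assumes 32-bit elements
    (boolean, int32, tensor)
}
```

Under these assumptions the tensor field fills a page every 512 rows, while the boolean field needs roughly 67 million rows to fill one.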

Required Methods

fn maybe_encode( &mut self, array: ArrayRef, external_buffers: &mut OutOfLineBuffers, repdef: RepDefBuilder, row_number: u64, num_rows: u64, ) -> Result<Vec<EncodeTask>>

Buffer the data and, if there is enough data in the buffer to form a page, return an encoding task to encode the data.

This may return more than one task because a single column may be mapped to multiple output columns. For example, if encoding a struct column with three children then up to three tasks may be returned from each call to maybe_encode.

It may also return multiple tasks for a single column if the input array is larger than a single disk page.

It could also return an empty Vec if there is not enough data yet to encode any pages.
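The buffering behavior described above can be sketched with a toy encoder. This is not the Lance implementation: `ToyEncoder`, `ToyTask`, and the fixed `page_len` are hypothetical stand-ins, and the real method takes an `ArrayRef`, `OutOfLineBuffers`, and a `RepDefBuilder` rather than a slice.

```rust
/// Hypothetical stand-in for an encode task: the values that form one page,
/// tagged with the top-level row number of the first value.
#[derive(Debug, PartialEq)]
struct ToyTask {
    row_number: u64,
    values: Vec<i32>,
}

/// Hypothetical encoder that emits one task per `page_len` buffered values.
struct ToyEncoder {
    page_len: usize,
    buffered: Vec<i32>,
    /// Row number of the first value currently in `buffered`.
    first_row: u64,
}

impl ToyEncoder {
    fn new(page_len: usize) -> Self {
        ToyEncoder { page_len, buffered: Vec::new(), first_row: 0 }
    }

    /// Buffer `values`; return zero, one, or several tasks depending on
    /// how many full pages the buffer now holds.
    fn maybe_encode(&mut self, values: &[i32], row_number: u64) -> Vec<ToyTask> {
        if self.buffered.is_empty() {
            self.first_row = row_number;
        }
        self.buffered.extend_from_slice(values);

        let mut tasks = Vec::new();
        while self.buffered.len() >= self.page_len {
            let page: Vec<i32> = self.buffered.drain(..self.page_len).collect();
            tasks.push(ToyTask { row_number: self.first_row, values: page });
            self.first_row += self.page_len as u64;
        }
        tasks
    }
}
```

With `page_len` set to 4, a call with two values returns an empty `Vec`, and a following call with eight more values returns two tasks and leaves two values buffered.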

The row_number is the top-level row number currently being encoded. It is stored in any pages produced by this call so that the priority of each page can be determined.

The num_rows is the number of top-level rows. It is initially the same as array.len(), but it is passed separately because array will become flattened over time (if there is repetition) and the original number of rows is needed for various purposes.
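To see why the two counts diverge, consider a list column. The sketch below uses a plain `Vec<Vec<i32>>` as a hypothetical stand-in for an Arrow list array; `flatten` is an illustrative helper, not a Lance or Arrow function.

```rust
/// Flatten a list column (modelled here as nested vectors) into its items,
/// which is roughly what happens to `array` as repetition is unwrapped.
fn flatten(lists: &[Vec<i32>]) -> Vec<i32> {
    lists.iter().flatten().copied().collect()
}

/// Returns (number of top-level rows, length of the flattened items).
fn row_counts(lists: &[Vec<i32>]) -> (usize, usize) {
    (lists.len(), flatten(lists).len())
}
```

For the three rows `[1, 2]`, `[]`, and `[3, 4, 5]`, the top-level row count is 3 while the flattened items have length 5, so the length of the flattened array alone no longer says how many rows were written.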

fn flush( &mut self, external_buffers: &mut OutOfLineBuffers, ) -> Result<Vec<EncodeTask>>

Flush any remaining data from the buffers into encoding tasks.

Each encode task produces a single page. The order of these pages will be maintained in the file. Order between columns is not guaranteed, but all pages in the same column maintain their order.

This may be called intermittently throughout encoding, but it will always be called once at the end of encoding, just before calling finish.

fn finish( &mut self, external_buffers: &mut OutOfLineBuffers, ) -> BoxFuture<'_, Result<Vec<EncodedColumn>>>

Finish encoding and return column metadata.

This is called only once, after all encode tasks have completed.

This returns a Vec because a single field may have created multiple columns.
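The call order described for these methods can be sketched as a small driver. Everything here is a simplified assumption: `ToyFieldEncoder` is a hypothetical synchronous trait that uses strings in place of `EncodeTask` and `EncodedColumn`, and the real `finish` returns a `BoxFuture` and is called only after the encode tasks have actually run.

```rust
/// Hypothetical, synchronous simplification of the lifecycle methods.
trait ToyFieldEncoder {
    fn maybe_encode(&mut self, rows: &[i32]) -> Vec<String>;
    fn flush(&mut self) -> Vec<String>;
    fn finish(&mut self) -> Vec<String>;
}

/// Toy encoder that only buffers in maybe_encode and emits a page on flush.
struct CountingEncoder {
    buffered: usize,
    pages: usize,
}

impl ToyFieldEncoder for CountingEncoder {
    fn maybe_encode(&mut self, rows: &[i32]) -> Vec<String> {
        self.buffered += rows.len();
        Vec::new() // never enough data for a page in this toy
    }

    fn flush(&mut self) -> Vec<String> {
        if self.buffered == 0 {
            return Vec::new();
        }
        let task = format!("page {} ({} rows)", self.pages, self.buffered);
        self.pages += 1;
        self.buffered = 0;
        vec![task]
    }

    fn finish(&mut self) -> Vec<String> {
        vec![format!("column with {} page(s)", self.pages)]
    }
}

/// Drive an encoder: maybe_encode per batch, one final flush, then finish.
fn write_all(
    encoder: &mut dyn ToyFieldEncoder,
    batches: &[Vec<i32>],
) -> (Vec<String>, Vec<String>) {
    let mut tasks = Vec::new();
    for batch in batches {
        tasks.extend(encoder.maybe_encode(batch));
    }
    tasks.extend(encoder.flush());
    // A real writer would run every task to completion before this point.
    let columns = encoder.finish();
    (tasks, columns)
}
```

Calling `write_all` with two batches of two rows and one row yields a single task covering three rows, followed by one column description.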

fn num_columns(&self) -> u32

The number of output columns this encoding will create.

Implementors