pub trait FieldEncoder: Send {
    // Required methods
    fn maybe_encode(
        &mut self,
        array: ArrayRef,
        external_buffers: &mut OutOfLineBuffers,
        repdef: RepDefBuilder,
        row_number: u64,
        num_rows: u64,
    ) -> Result<Vec<EncodeTask>>;
    fn flush(
        &mut self,
        external_buffers: &mut OutOfLineBuffers,
    ) -> Result<Vec<EncodeTask>>;
    fn finish(
        &mut self,
        external_buffers: &mut OutOfLineBuffers,
    ) -> BoxFuture<'_, Result<Vec<EncodedColumn>>>;
    fn num_columns(&self) -> u32;
}
Top-level encoding trait to encode any Arrow array type into one or more pages.
The field encoder implements buffering and encoding of a single input column but it may map to multiple output columns. For example, a list array or struct array will be encoded into multiple columns.
Also, fields may be encoded at different speeds. For example, given a struct column with three fields (a boolean field, an int32 field, and a 4096-dimension tensor field) the tensor field is likely to emit encoded pages much more frequently than the boolean field.
Required Methods
fn maybe_encode(
    &mut self,
    array: ArrayRef,
    external_buffers: &mut OutOfLineBuffers,
    repdef: RepDefBuilder,
    row_number: u64,
    num_rows: u64,
) -> Result<Vec<EncodeTask>>
Buffer the data and, if there is enough data in the buffer to form a page, return an encoding task to encode the data.
This may return more than one task because a single column may be mapped to multiple output columns. For example, if encoding a struct column with three children then up to three tasks may be returned from each call to maybe_encode.
It may also return multiple tasks for a single column if the input array is larger than a single disk page.
It could also return an empty Vec if there is not enough data yet to encode any pages.
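The buffering behaviour described above can be sketched with a toy model. This is not the crate's implementation: `ToyTask` and `ToyBufferingEncoder` are hypothetical names, the input is a plain `&[i32]` rather than an `ArrayRef`, and the page size is assumed to be a fixed number of values rather than a byte budget.

```rust
/// Hypothetical stand-in for an encode task: the values that will form one page.
#[derive(Debug, PartialEq)]
struct ToyTask {
    values: Vec<i32>,
}

/// Hypothetical encoder that buffers i32 values and emits one task per full page.
struct ToyBufferingEncoder {
    /// Assumed page size, measured in values for simplicity.
    page_len: usize,
    buffer: Vec<i32>,
}

impl ToyBufferingEncoder {
    fn new(page_len: usize) -> Self {
        assert!(page_len > 0, "page_len must be non-zero");
        Self { page_len, buffer: Vec::new() }
    }

    /// Buffer `batch`, then drain as many full pages as the buffer holds.
    /// Returns an empty Vec when there is not yet enough data for a page,
    /// and several tasks when the buffered data spans more than one page.
    fn maybe_encode(&mut self, batch: &[i32]) -> Vec<ToyTask> {
        self.buffer.extend_from_slice(batch);
        let mut tasks = Vec::new();
        while self.buffer.len() >= self.page_len {
            let values: Vec<i32> = self.buffer.drain(..self.page_len).collect();
            tasks.push(ToyTask { values });
        }
        tasks
    }
}
```

With a page length of 4, a first call with two values returns no tasks, while a later call that brings the buffer to ten values returns two tasks and leaves two values buffered.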
The row_number must be passed; it is the top-level row number currently being encoded. It is stored in any pages produced by this call so that we can know the priority of the page.
The num_rows is the number of top-level rows. It is initially the same as array.len(), however it is passed separately because array will become flattened over time (if there is repetition) and we need to know the original number of rows for various purposes.
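To illustrate why num_rows is passed separately from the array, here is a small sketch using a hypothetical `ToyListBatch` type (not part of the crate). It assumes the caller advances row_number by num_rows after each batch, which the text above suggests but does not state outright.

```rust
/// Hypothetical list batch. `offsets` has one more entry than there are
/// top-level rows; `items` holds the flattened child values.
struct ToyListBatch {
    offsets: Vec<usize>,
    items: Vec<i32>,
}

impl ToyListBatch {
    /// Number of top-level rows (lists), independent of how many items they hold.
    fn num_rows(&self) -> u64 {
        self.offsets.len().saturating_sub(1) as u64
    }

    /// Length of the flattened child data, which is what a nested encoder sees.
    fn flattened_len(&self) -> usize {
        self.items.len()
    }
}

/// Compute the (row_number, num_rows) pair a caller might pass for each batch.
/// Assumption: row_number is the running total of top-level rows seen so far.
fn row_args(batches: &[ToyListBatch]) -> Vec<(u64, u64)> {
    let mut row_number = 0u64;
    let mut args = Vec::with_capacity(batches.len());
    for batch in batches {
        let num_rows = batch.num_rows();
        args.push((row_number, num_rows));
        row_number += num_rows;
    }
    args
}
```

A batch with offsets `[0, 2, 2, 5]` has three top-level rows but five flattened items, so once the list has been flattened the item count alone no longer tells the encoder how many rows it started with.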
fn flush(
&mut self,
external_buffers: &mut OutOfLineBuffers,
) -> Result<Vec<EncodeTask>>
Flush any remaining data from the buffers into encoding tasks.
Each encode task produces a single page. The order of these pages will be maintained in the file (we do not worry about order between columns, but all pages in the same column should maintain order).
This may be called intermittently throughout encoding but will always be called once at the end of encoding, just before calling finish.
fn finish(
&mut self,
external_buffers: &mut OutOfLineBuffers,
) -> BoxFuture<'_, Result<Vec<EncodedColumn>>>
Finish encoding and return column metadata.
This is called only once, after all encode tasks have completed.
This returns a Vec because a single field may have created multiple columns.
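The call order described for maybe_encode, flush, and finish can be sketched as a small driver. Everything here is a simplified, hypothetical model: `ToyFieldEncoder`, `ToyEncoder`, and `write_batches` are illustrative names, pages are plain `Vec<i32>`, tasks complete immediately, and `finish` is synchronous and returns a value count instead of a future yielding column metadata.

```rust
/// Hypothetical, synchronous simplification of the trait's lifecycle methods.
trait ToyFieldEncoder {
    fn maybe_encode(&mut self, batch: &[i32]) -> Vec<Vec<i32>>;
    fn flush(&mut self) -> Vec<Vec<i32>>;
    /// Stand-in for column metadata: the total number of values encoded.
    fn finish(&mut self) -> usize;
}

struct ToyEncoder {
    page_len: usize, // assumed page size in values
    buffer: Vec<i32>,
    total: usize,
}

impl ToyFieldEncoder for ToyEncoder {
    fn maybe_encode(&mut self, batch: &[i32]) -> Vec<Vec<i32>> {
        self.total += batch.len();
        self.buffer.extend_from_slice(batch);
        let mut pages = Vec::new();
        while self.buffer.len() >= self.page_len {
            pages.push(self.buffer.drain(..self.page_len).collect());
        }
        pages
    }

    fn flush(&mut self) -> Vec<Vec<i32>> {
        if self.buffer.is_empty() {
            Vec::new()
        } else {
            // Emit the remainder even though it is smaller than a full page.
            vec![std::mem::take(&mut self.buffer)]
        }
    }

    fn finish(&mut self) -> usize {
        self.total
    }
}

/// Drive one encoder: maybe_encode per batch, flush once at the end, then
/// finish exactly once. Pages are collected in the order they were produced.
fn write_batches<E: ToyFieldEncoder>(
    encoder: &mut E,
    batches: &[Vec<i32>],
) -> (Vec<Vec<i32>>, usize) {
    let mut pages = Vec::new();
    for batch in batches {
        pages.extend(encoder.maybe_encode(batch));
    }
    pages.extend(encoder.flush());
    let metadata = encoder.finish();
    (pages, metadata)
}
```

In a real writer the tasks returned by maybe_encode and flush would presumably run asynchronously, so the driver would also have to wait for all of them to complete before calling finish.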
fn num_columns(&self) -> u32
The number of output columns this encoding will create.