Struct datafusion::datasource::physical_plan::parquet::ParquetExec
pub struct ParquetExec { /* private fields */ }
Execution plan for reading one or more Parquet files.
            ▲
            │
            │  Produce a stream of
            │  RecordBatches
            │
┌───────────────────────┐
│                       │
│      ParquetExec      │
│                       │
└───────────────────────┘
            ▲
            │  Asynchronously read from one
            │  or more parquet files via
            │  ObjectStore interface
            │
            │
  .───────────────────.
 (                     )
  │`───────────────────'│
  │     ObjectStore     │
  │.───────────────────.│
 (                     )
  `───────────────────'
§Example: Create a ParquetExec
// (Assumed imports for this sketch; `object_store_url`, `file_schema`, and
// `predicate` are constructed elsewhere, as in the hidden rustdoc setup.)
use datafusion::datasource::listing::PartitionedFile;
use datafusion::datasource::physical_plan::parquet::ParquetExec;
use datafusion::datasource::physical_plan::FileScanConfig;

// Create a ParquetExec for reading `file1.parquet` with a file size of 100MB
let file_scan_config = FileScanConfig::new(object_store_url, file_schema)
    .with_file(PartitionedFile::new("file1.parquet", 100 * 1024 * 1024));
let exec = ParquetExec::builder(file_scan_config)
    // Provide a predicate for filtering row groups/pages
    .with_predicate(predicate)
    .build();
§Features
Supports the following optimizations:
- Concurrent reads: can read from one or more files in parallel as multiple partitions, including concurrently reading multiple row groups from a single file.
- Predicate push down: skips row groups and pages based on min/max/null_count statistics in the row group metadata, the page index, and bloom filters.
- Projection pushdown: reads and decodes only the columns required.
- Limit pushdown: stops execution early after some number of rows have been read.
- Custom readers: customize reading parquet files, e.g. to cache metadata, coalesce I/O operations, etc. See ParquetFileReaderFactory for more details.
- Schema adapters: read parquet files with different schemas into a unified table schema. This can be used to implement "schema evolution". See SchemaAdapterFactory for more details.
- metadata_size_hint: controls the number of bytes read from the end of the file in the initial I/O when the default ParquetFileReaderFactory is used. If a custom reader is used, it supplies the metadata directly and this parameter is ignored. See ParquetExecBuilder::with_metadata_size_hint for more details, and the sketch after this list.
- User provided ParquetAccessPlans to skip row groups and/or pages based on external information. See "Implementing External Indexes" below.
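A hedged sketch of the metadata_size_hint knob, reusing file_scan_config and predicate from the example above (the 512 KiB value is illustrative, not a recommendation):

// Read the final 512 KiB of each file in the initial I/O, so the footer and
// metadata are usually fetched in a single request
let exec = ParquetExec::builder(file_scan_config)
    .with_predicate(predicate)
    .with_metadata_size_hint(512 * 1024)
    .build();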
§Implementing External Indexes
It is possible to restrict the row groups and selections within those row groups that the ParquetExec will consider by providing an initial ParquetAccessPlan as extensions on PartitionedFile. This can be used to implement external indexes on top of parquet files and select only portions of the files.
The ParquetExec will try to further reduce any provided ParquetAccessPlan based on the contents of ParquetMetadata and other settings.
§Example of providing a ParquetAccessPlan
// (Imports as in the example above, plus ParquetAccessPlan and ObjectStoreUrl;
// `schema()` is constructed elsewhere.)
// Create an access plan to scan row groups 0, 1 and 3 and skip row groups 2 and 4
let mut access_plan = ParquetAccessPlan::new_all(5);
access_plan.skip(2);
access_plan.skip(4);
// Provide the plan as an extension to the FileScanConfig
let partitioned_file = PartitionedFile::new("my_file.parquet", 1234)
    .with_extensions(Arc::new(access_plan));
// Create a ParquetExec to scan this file
let file_scan_config = FileScanConfig::new(ObjectStoreUrl::local_filesystem(), schema())
    .with_file(partitioned_file);
// This ParquetExec will not even try to read row groups 2 and 4. Additional
// pruning based on predicates may also happen
let exec = ParquetExec::builder(file_scan_config).build();
For a complete example, see the advanced_parquet_index example.
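An access plan can also select individual rows within a row group. A hedged sketch, assuming RowGroupAccess::Selection and ParquetAccessPlan::set as in recent DataFusion releases (verify the names against your version):

use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

// Scan all 5 row groups, but in row group 1 read only rows 100..200
let mut access_plan = ParquetAccessPlan::new_all(5);
let selection = RowSelection::from(vec![
    RowSelector::skip(100),   // skip the first 100 rows
    RowSelector::select(100), // then read the next 100; later rows are not read
]);
access_plan.set(1, RowGroupAccess::Selection(selection));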
§Execution Overview
- Step 1: ParquetExec::execute is called, returning a FileStream configured to open parquet files with a ParquetOpener (see the sketch after this list).
- Step 2: When the stream is polled, the ParquetOpener is called to open the file.
- Step 3: The ParquetOpener gets the ParquetMetaData (file metadata) via ParquetFileReaderFactory, creating a ParquetAccessPlan by applying predicates to metadata. The plan and projections are used to determine what pages must be read.
- Step 4: The stream begins reading data, fetching the required pages and incrementally decoding them.
- Step 5: As each RecordBatch is read, it may be adapted by a SchemaAdapter to match the table schema. By default missing columns are filled with nulls, but this can be customized via SchemaAdapterFactory.
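A hedged sketch of Steps 1, 4 and 5 from the caller's side, assuming the exec built in the earlier example, an async context, and a default TaskContext for brevity:

use std::sync::Arc;
use datafusion::execution::TaskContext;
use futures::StreamExt;

// Step 1: obtain a stream for partition 0
let task_ctx = Arc::new(TaskContext::default());
let mut stream = exec.execute(0, task_ctx)?;
// Steps 4-5: poll the stream; each item is a (possibly schema-adapted) RecordBatch
while let Some(batch) = stream.next().await {
    println!("read {} rows", batch?.num_rows());
}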
Implementations§
impl ParquetExec
pub fn new(
    base_config: FileScanConfig,
    predicate: Option<Arc<dyn PhysicalExpr>>,
    metadata_size_hint: Option<usize>,
    table_parquet_options: TableParquetOptions,
) -> Self
👎 Deprecated since 39.0.0: use ParquetExec::builder or ParquetExecBuilder instead
Create a new Parquet reader execution plan, provided a file list and schema.
pub fn builder(file_scan_config: FileScanConfig) -> ParquetExecBuilder
Return a ParquetExecBuilder. See the example on ParquetExec and ParquetExecBuilder for specifying parquet table options.
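A hedged sketch of supplying parquet table options through the builder (with_table_parquet_options is assumed to be available on ParquetExecBuilder in this version):

// Enable filter pushdown for this scan via TableParquetOptions
let mut parquet_options = TableParquetOptions::default();
parquet_options.global.pushdown_filters = true;
let exec = ParquetExec::builder(file_scan_config)
    .with_table_parquet_options(parquet_options)
    .build();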
pub fn base_config(&self) -> &FileScanConfig
The FileScanConfig that controls this scan (such as which files to read).
pub fn table_parquet_options(&self) -> &TableParquetOptions
Options passed to the parquet reader for this scan.
pub fn predicate(&self) -> Option<&Arc<dyn PhysicalExpr>>
Optional predicate.
pub fn pruning_predicate(&self) -> Option<&Arc<PruningPredicate>>
Optional reference to this parquet scan's pruning predicate.
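A small sketch of inspecting these accessors on a built exec (PhysicalExpr implements Display, so the predicate can be printed directly):

if let Some(predicate) = exec.predicate() {
    println!("scan predicate: {predicate}");
}
println!("has pruning predicate: {}", exec.pruning_predicate().is_some());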
pub fn with_parquet_file_reader_factory(
    self,
    parquet_file_reader_factory: Arc<dyn ParquetFileReaderFactory>,
) -> Self
Optional user defined parquet file reader factory. See the documentation on ParquetExecBuilder::with_parquet_file_reader_factory for more details.
pub fn with_schema_adapter_factory(
    self,
    schema_adapter_factory: Arc<dyn SchemaAdapterFactory>,
) -> Self
Optional schema adapter factory. See the documentation on ParquetExecBuilder::with_schema_adapter_factory for more details.
pub fn with_pushdown_filters(self, pushdown_filters: bool) -> Self
If true, the predicate is converted to a RowFilter and applied by the parquet decoder as rows are read, skipping rows that do not match.
pub fn with_reorder_filters(self, reorder_filters: bool) -> Self
If true, the RowFilter made by pushdown_filters may try to minimize the cost of filter evaluation by reordering the predicate Exprs. If false, the predicates are applied in the same order as specified in the query. Defaults to false.
pub fn with_enable_page_index(self, enable_page_index: bool) -> Self
If enabled, the reader will read the page index. This is used to optimise filter pushdown via RowSelector and RowFilter by eliminating unnecessary IO and decoding.
pub fn with_bloom_filter_on_read(self, bloom_filter_on_read: bool) -> Self
If enabled, the reader will use bloom filters to skip row groups that cannot match the predicate.
pub fn with_bloom_filter_on_write(
    self,
    enable_bloom_filter_on_write: bool,
) -> Self
If enabled, the writer will write bloom filters for the data it writes.
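Each of these setters consumes and returns Self, so they chain. A sketch (flag values are illustrative, not the defaults):

let exec = ParquetExec::builder(file_scan_config)
    .build()
    .with_pushdown_filters(true)
    .with_reorder_filters(true)
    .with_enable_page_index(true)
    .with_bloom_filter_on_read(true);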
Trait Implementations§
impl Clone for ParquetExec
fn clone(&self) -> ParquetExec
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for ParquetExec
impl DisplayAs for ParquetExec
impl ExecutionPlan for ParquetExec
fn repartitioned(
    &self,
    target_partitions: usize,
    config: &ConfigOptions,
) -> Result<Option<Arc<dyn ExecutionPlan>>>
Redistribute files across partitions according to their size. See comments on FileGroupPartitioner for more detail.
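A hedged sketch of repartitioning a scan (Ok(None) means the plan was left unchanged):

use datafusion::config::ConfigOptions;

let config = ConfigOptions::default();
if let Some(new_plan) = exec.repartitioned(8, &config)? {
    // new_plan scans the same files, redistributed across 8 partitions
    let _ = new_plan;
}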
fn name(&self) -> &'static str
Short name for the ExecutionPlan, such as "ParquetExec".
fn properties(&self) -> &PlanProperties
Returns the properties of this ExecutionPlan, such as output ordering(s) and partitioning information.
fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>>
Get a list of children ExecutionPlans that act as inputs to this plan. The returned list will be empty for leaf nodes such as scans, will contain a single value for unary nodes, or two values for binary nodes (such as joins).
fn with_new_children(
    self: Arc<Self>,
    _: Vec<Arc<dyn ExecutionPlan>>,
) -> Result<Arc<dyn ExecutionPlan>>
Returns a new ExecutionPlan where all existing children were replaced by the children, in order.
fn execute(
    &self,
    partition_index: usize,
    ctx: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream>
Begin execution of partition_index, returning a stream of RecordBatches.
fn metrics(&self) -> Option<MetricsSet>
Return a snapshot of the set of Metrics for this ExecutionPlan. If no Metrics are available, return None.
fn statistics(&self) -> Result<Statistics>
Returns statistics for this ExecutionPlan node. If statistics are not available, should return Statistics::new_unknown (the default), not an error.
fn with_fetch(&self, limit: Option<usize>) -> Option<Arc<dyn ExecutionPlan>>
Returns a fetching variant of this ExecutionPlan node, if it supports fetch limits. Returns None otherwise.
fn static_name() -> &'static str
where
    Self: Sized,
Short name for the ExecutionPlan, like name, but can be called without an instance.
fn required_input_distribution(&self) -> Vec<Distribution>
Specifies the data distribution requirements for all the children of this ExecutionPlan. By default it is Distribution::UnspecifiedDistribution for each child.
fn required_input_ordering(&self) -> Vec<Option<Vec<PhysicalSortRequirement>>>
Specifies the ordering required for all of the children of this ExecutionPlan.
fn maintains_input_order(&self) -> Vec<bool>
Returns false if this ExecutionPlan's implementation may reorder rows within or between partitions.
fn benefits_from_input_partitioning(&self) -> Vec<bool>
Specifies whether the ExecutionPlan benefits from increased parallelization at its input for each child.
fn supports_limit_pushdown(&self) -> bool
Returns true if a limit can be safely pushed down through this ExecutionPlan node.
Auto Trait Implementations§
impl Freeze for ParquetExec
impl !RefUnwindSafe for ParquetExec
impl Send for ParquetExec
impl Sync for ParquetExec
impl Unpin for ParquetExec
impl !UnwindSafe for ParquetExec
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> CloneToUninit for T where T: Clone
default unsafe fn clone_to_uninit(&self, dst: *mut T)
🔬 This is a nightly-only experimental API. (clone_to_uninit)
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.