Trait datafusion_catalog::TableProvider
source · pub trait TableProvider: Sync + Send {
// Required methods
fn as_any(&self) -> &dyn Any;
fn schema(&self) -> SchemaRef;
fn table_type(&self) -> TableType;
fn scan<'life0, 'life1, 'life2, 'life3, 'async_trait>(
&'life0 self,
state: &'life1 dyn Session,
projection: Option<&'life2 Vec<usize>>,
filters: &'life3 [Expr],
limit: Option<usize>,
) -> Pin<Box<dyn Future<Output = Result<Arc<dyn ExecutionPlan>>> + Send + 'async_trait>>
where Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
'life2: 'async_trait,
'life3: 'async_trait;
// Provided methods
fn constraints(&self) -> Option<&Constraints> { ... }
fn get_table_definition(&self) -> Option<&str> { ... }
fn get_logical_plan(&self) -> Option<&LogicalPlan> { ... }
fn get_column_default(&self, _column: &str) -> Option<&Expr> { ... }
fn supports_filters_pushdown(
&self,
filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> { ... }
fn statistics(&self) -> Option<Statistics> { ... }
fn insert_into<'life0, 'life1, 'async_trait>(
&'life0 self,
_state: &'life1 dyn Session,
_input: Arc<dyn ExecutionPlan>,
_overwrite: bool,
) -> Pin<Box<dyn Future<Output = Result<Arc<dyn ExecutionPlan>>> + Send + 'async_trait>>
where Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait { ... }
}
Expand description
Source table
Required Methods§
sourcefn as_any(&self) -> &dyn Any
fn as_any(&self) -> &dyn Any
Returns the table provider as Any
so that it can be
downcast to a specific implementation.
sourcefn table_type(&self) -> TableType
fn table_type(&self) -> TableType
Get the type of this table for metadata/catalog purposes.
sourcefn scan<'life0, 'life1, 'life2, 'life3, 'async_trait>(
&'life0 self,
state: &'life1 dyn Session,
projection: Option<&'life2 Vec<usize>>,
filters: &'life3 [Expr],
limit: Option<usize>,
) -> Pin<Box<dyn Future<Output = Result<Arc<dyn ExecutionPlan>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
'life2: 'async_trait,
'life3: 'async_trait,
fn scan<'life0, 'life1, 'life2, 'life3, 'async_trait>(
&'life0 self,
state: &'life1 dyn Session,
projection: Option<&'life2 Vec<usize>>,
filters: &'life3 [Expr],
limit: Option<usize>,
) -> Pin<Box<dyn Future<Output = Result<Arc<dyn ExecutionPlan>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
'life2: 'async_trait,
'life3: 'async_trait,
Create an ExecutionPlan
for scanning the table with optionally
specified projection
, filter
and limit
, described below.
The ExecutionPlan
is responsible scanning the datasource’s
partitions in a streaming, parallelized fashion.
§Projection
If specified, only a subset of columns should be returned, in the order
specified. The projection is a set of indexes of the fields in
Self::schema
.
DataFusion provides the projection to scan only the columns actually used in the query to improve performance, an optimization called “Projection Pushdown”. Some datasources, such as Parquet, can use this information to go significantly faster when only a subset of columns is required.
§Filters
A list of boolean filter Expr
s to evaluate during the scan, in the
manner specified by Self::supports_filters_pushdown
. Only rows for
which all of the Expr
s evaluate to true
must be returned (aka the
expressions are AND
ed together).
To enable filter pushdown you must override
Self::supports_filters_pushdown
as the default implementation does
not and filters
will be empty.
DataFusion pushes filtering into the scans whenever possible (“Filter Pushdown”), and depending on the format and the implementation of the format, evaluating the predicate during the scan can increase performance significantly.
§Note: Some columns may appear only in Filters
In certain cases, a query may only use a certain column in a Filter that has been completely pushed down to the scan. In this case, the projection will not contain all the columns found in the filter expressions.
For example, given the query SELECT t.a FROM t WHERE t.b > 5
,
┌────────────────────┐
│ Projection(t.a) │
└────────────────────┘
▲
│
│
┌────────────────────┐ Filter ┌────────────────────┐ Projection ┌────────────────────┐
│ Filter(t.b > 5) │────Pushdown──▶ │ Projection(t.a) │ ───Pushdown───▶ │ Projection(t.a) │
└────────────────────┘ └────────────────────┘ └────────────────────┘
▲ ▲ ▲
│ │ │
│ │ ┌────────────────────┐
┌────────────────────┐ ┌────────────────────┐ │ Scan │
│ Scan │ │ Scan │ │ filter=(t.b > 5) │
└────────────────────┘ │ filter=(t.b > 5) │ │ projection=(t.a) │
└────────────────────┘ └────────────────────┘
Initial Plan If `TableProviderFilterPushDown` Projection pushdown notes that
returns true, filter pushdown the scan only needs t.a
pushes the filter into the scan
BUT internally evaluating the
predicate still requires t.b
§Limit
If limit
is specified, must only produce at least this many rows,
(though it may return more). Like Projection Pushdown and Filter
Pushdown, DataFusion pushes LIMIT
s as far down in the plan as
possible, called “Limit Pushdown” as some sources can use this
information to improve their performance. Note that if there are any
Inexact filters pushed down, the LIMIT cannot be pushed down. This is
because inexact filters do not guarantee that every filtered row is
removed, so applying the limit could lead to too few rows being available
to return as a final result.
Provided Methods§
sourcefn constraints(&self) -> Option<&Constraints>
fn constraints(&self) -> Option<&Constraints>
Get a reference to the constraints of the table. Returns:
None
for tables that do not support constraints.Some(&Constraints)
for tables supporting constraints. Therefore, aSome(&Constraints::empty())
return value indicates that this table supports constraints, but there are no constraints.
sourcefn get_table_definition(&self) -> Option<&str>
fn get_table_definition(&self) -> Option<&str>
Get the create statement used to create this table, if available.
sourcefn get_logical_plan(&self) -> Option<&LogicalPlan>
fn get_logical_plan(&self) -> Option<&LogicalPlan>
Get the LogicalPlan
of this table, if available
sourcefn get_column_default(&self, _column: &str) -> Option<&Expr>
fn get_column_default(&self, _column: &str) -> Option<&Expr>
Get the default value for a column, if available.
sourcefn supports_filters_pushdown(
&self,
filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>>
fn supports_filters_pushdown( &self, filters: &[&Expr], ) -> Result<Vec<TableProviderFilterPushDown>>
Specify if DataFusion should provide filter expressions to the TableProvider to apply during the scan.
Some TableProviders can evaluate filters more efficiently than the
Filter
operator in DataFusion, for example by using an index.
§Parameters and Return Value
The return Vec
must have one element for each element of the filters
argument. The value of each element indicates if the TableProvider can
apply the corresponding filter during the scan. The position in the return
value corresponds to the expression in the filters
parameter.
If the length of the resulting Vec
does not match the filters
input
an error will be thrown.
Each element in the resulting Vec
is one of the following:
Exact
orInexact
: The TableProvider can apply the filter during scanUnsupported
: The TableProvider cannot apply the filter during scan
By default, this function returns Unsupported
for all filters,
meaning no filters will be provided to Self::scan
.
§Example
// Define a struct that implements the TableProvider trait
struct TestDataSource {}
#[async_trait]
impl TableProvider for TestDataSource {
todo!()
// Override the supports_filters_pushdown to evaluate which expressions
// to accept as pushdown predicates.
fn supports_filters_pushdown(&self, filters: &[&Expr]) -> Result<Vec<TableProviderFilterPushDown>> {
// Process each filter
let support: Vec<_> = filters.iter().map(|expr| {
match expr {
// This example only supports a between expr with a single column named "c1".
Expr::Between(between_expr) => {
between_expr.expr
.try_into_col()
.map(|column| {
if column.name == "c1" {
TableProviderFilterPushDown::Exact
} else {
TableProviderFilterPushDown::Unsupported
}
})
// If there is no column in the expr set the filter to unsupported.
.unwrap_or(TableProviderFilterPushDown::Unsupported)
}
_ => {
// For all other cases return Unsupported.
TableProviderFilterPushDown::Unsupported
}
}
}).collect();
Ok(support)
}
}
sourcefn statistics(&self) -> Option<Statistics>
fn statistics(&self) -> Option<Statistics>
Get statistics for this table, if available
sourcefn insert_into<'life0, 'life1, 'async_trait>(
&'life0 self,
_state: &'life1 dyn Session,
_input: Arc<dyn ExecutionPlan>,
_overwrite: bool,
) -> Pin<Box<dyn Future<Output = Result<Arc<dyn ExecutionPlan>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
fn insert_into<'life0, 'life1, 'async_trait>(
&'life0 self,
_state: &'life1 dyn Session,
_input: Arc<dyn ExecutionPlan>,
_overwrite: bool,
) -> Pin<Box<dyn Future<Output = Result<Arc<dyn ExecutionPlan>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
Return an ExecutionPlan
to insert data into this table, if
supported.
The returned plan should return a single row in a UInt64 column called “count” such as the following
+-------+,
| count |,
+-------+,
| 6 |,
+-------+,
§See Also
See DataSinkExec
for the common pattern of inserting a
streams of RecordBatch
es as files to an ObjectStore.