Struct datafusion::datasource::listing::ListingTable

pub struct ListingTable { /* private fields */ }

Reads data from one or more files as a single table.

Implements TableProvider, a DataFusion data source. The files are read using an ObjectStore instance, for example from local files or objects from AWS S3.
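
For example, an ObjectStore for an S3 bucket can be registered on the SessionContext before the table is created. The following is a minimal sketch, assuming the object_store and url crates as dependencies; the bucket name and region are placeholders:

use std::sync::Arc;
use datafusion::prelude::SessionContext;
use object_store::aws::AmazonS3Builder;
use url::Url;

// Build an S3 object store (credentials are read from the environment)
let s3 = AmazonS3Builder::from_env()
    .with_bucket_name("my-bucket")
    .with_region("us-east-1")
    .build()?;

let ctx = SessionContext::new();
// Paths starting with this URL (e.g. s3://my-bucket/table1/) are resolved
// through the registered store when a ListingTable is scanned.
ctx.register_object_store(&Url::parse("s3://my-bucket").unwrap(), Arc::new(s3));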

For example, given the table1 directory (or object store prefix)

table1
 ├── file1.parquet
 └── file2.parquet

A ListingTable would read the files file1.parquet and file2.parquet as a single table, merging the schemas if the files have compatible but not identical schemas.

Given the table2 directory (or object store prefix)

table2
 ├── date=2024-06-01
 │    ├── file3.parquet
 │    └── file4.parquet
 └── date=2024-06-02
      └── file5.parquet

A ListingTable would read the files file3.parquet, file4.parquet, and file5.parquet as a single table, again merging schemas if necessary.

Given the Hive-style partitioning structure (e.g., directories named date=2024-06-01 and date=2024-06-02), ListingTable also adds a date column when reading the table:

  • The files in table2/date=2024-06-01 will have the value 2024-06-01.
  • The files in table2/date=2024-06-02 will have the value 2024-06-02.

If the query has a predicate like WHERE date = '2024-06-01' only the corresponding directory will be read.
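
With a Hive-partitioned layout such as table2, the partition column must be declared on ListingOptions so it appears in the schema and can be used for pruning. The following is a minimal sketch with hypothetical names, reusing the setup pattern from the example below and assuming the table is registered as "table2":

use datafusion::arrow::datatypes::DataType;

let listing_options = ListingOptions::new(Arc::new(ParquetFormat::new()))
  .with_file_extension(".parquet")
  // partition values are parsed from directory names like `date=2024-06-01`
  .with_table_partition_cols(vec![("date".to_string(), DataType::Utf8)]);

// Only files under table2/date=2024-06-01 are listed and read:
let df = ctx.sql("SELECT * FROM table2 WHERE date = '2024-06-01'").await?;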

ListingTable also supports filter and projection pushdown for formats that support it, such as Parquet.

§Example

Here is an example of reading a directory of parquet files using a ListingTable:

use std::sync::Arc;
use datafusion::prelude::SessionContext;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};

let ctx = SessionContext::new();
let session_state = ctx.state();
let table_path = "/path/to/parquet";

// Parse the path
let table_path = ListingTableUrl::parse(table_path)?;

// Create default parquet options
let file_format = ParquetFormat::new();
let listing_options = ListingOptions::new(Arc::new(file_format))
  .with_file_extension(".parquet");

// Resolve the schema
let resolved_schema = listing_options
   .infer_schema(&session_state, &table_path)
   .await?;

let config = ListingTableConfig::new(table_path)
  .with_listing_options(listing_options)
  .with_schema(resolved_schema);

// Create a new TableProvider
let provider = Arc::new(ListingTable::try_new(config)?);

// This provider can now be read as a dataframe:
let df = ctx.read_table(provider.clone())?;

// or registered as a named table:
ctx.register_table("my_table", provider)?;
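
Once registered, the table can also be queried with SQL. A short follow-up to the example above:

// Run a SQL query against the registered listing table and collect the results
let batches = ctx
  .sql("SELECT count(*) FROM my_table")
  .await?
  .collect()
  .await?;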

Implementations§

impl ListingTable

pub fn try_new(config: ListingTableConfig) -> Result<Self>

Create a new ListingTable that lists the file system (or object store) to find the files to scan. See ListingTable for an example.

Takes a ListingTableConfig as input, which requires an ObjectStore and a table_path. ListingOptions and SchemaRef are optional. If they are not provided, the file type is inferred from the file extension. If a schema is provided, it must be resolved before creating the table and should contain the fields of the files without the table partitioning columns.
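
If neither ListingOptions nor a schema is available up front, ListingTableConfig can infer both. This is a minimal sketch, assuming the infer helper on ListingTableConfig and the session_state from the example above:

// Infer the file format from the file extension and the schema from the files
let table_path = ListingTableUrl::parse("/path/to/parquet")?;
let config = ListingTableConfig::new(table_path)
  .infer(&session_state)
  .await?;
let provider = Arc::new(ListingTable::try_new(config)?);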

pub fn with_constraints(self, constraints: Constraints) -> Self

Assign constraints

pub fn with_column_defaults( self, column_defaults: HashMap<String, Expr>, ) -> Self

Assign column defaults
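
A minimal sketch combining the two setters above, using hypothetical column names and assuming Constraints and Constraint are available from datafusion::common and lit from the prelude:

use std::collections::HashMap;
use datafusion::common::{Constraint, Constraints};
use datafusion::prelude::lit;

let table = ListingTable::try_new(config)?
  // declare column 0 as an (unverified) primary key
  .with_constraints(Constraints::new_unverified(vec![Constraint::PrimaryKey(vec![0])]))
  // default for the hypothetical `status` column when an INSERT omits it
  .with_column_defaults(HashMap::from([("status".to_string(), lit("new"))]));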

pub fn with_cache(self, cache: Option<FileStatisticsCache>) -> Self

Set the FileStatisticsCache used to cache parquet file statistics.

Setting a statistics cache on the SessionContext can avoid refetching statistics multiple times in the same session.

If None, creates a new DefaultFileStatisticsCache scoped to this query.
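
As a sketch, a session-level cache can be fetched from the runtime environment and reused here; the cache_manager accessor below is an assumption about the execution runtime API:

// Reuse the session-wide statistics cache (if configured) so repeated queries
// do not re-read Parquet footers to recompute statistics.
let cache = session_state
  .runtime_env()
  .cache_manager
  .get_file_statistic_cache();
let table = ListingTable::try_new(config)?.with_cache(cache);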

pub fn with_definition(self, definition: Option<String>) -> Self

Specify the SQL definition for this table, if any

pub fn table_paths(&self) -> &Vec<ListingTableUrl>

Get a reference to the table paths

pub fn options(&self) -> &ListingOptions

Get a reference to the ListingOptions

Trait Implementations§

impl Debug for ListingTable

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

impl TableProvider for ListingTable

fn as_any(&self) -> &dyn Any

Returns the table provider as Any so that it can be downcast to a specific implementation.

fn schema(&self) -> SchemaRef

Get a reference to the schema for this table

fn constraints(&self) -> Option<&Constraints>

Get a reference to the constraints of the table, if any. Read more

fn table_type(&self) -> TableType

Get the type of this table for metadata/catalog purposes.

async fn scan( &self, state: &dyn Session, projection: Option<&Vec<usize>>, filters: &[Expr], limit: Option<usize>, ) -> Result<Arc<dyn ExecutionPlan>>

Create an ExecutionPlan for scanning the table with optionally specified projection, filter and limit, described below. Read more
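
For illustration, a sketch of calling scan directly on the provider built in the example above, projecting the first two columns and limiting the scan to 100 rows:

use datafusion::datasource::TableProvider;

// Build an ExecutionPlan that reads only columns 0 and 1 and at most 100 rows
let plan = provider
  .scan(&session_state, Some(&vec![0, 1]), &[], Some(100))
  .await?;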

fn supports_filters_pushdown( &self, filters: &[&Expr], ) -> Result<Vec<TableProviderFilterPushDown>>

Specify if DataFusion should provide filter expressions to the TableProvider to apply during the scan. Read more

fn get_table_definition(&self) -> Option<&str>

Get the create statement used to create this table, if available.

async fn insert_into( &self, state: &dyn Session, input: Arc<dyn ExecutionPlan>, insert_op: InsertOp, ) -> Result<Arc<dyn ExecutionPlan>>

Return an ExecutionPlan to insert data into this table, if supported. Read more

fn get_column_default(&self, column: &str) -> Option<&Expr>

Get the default value for a column, if available.

fn get_logical_plan(&self) -> Option<Cow<'_, LogicalPlan>>

Get the LogicalPlan of this table, if available.

fn statistics(&self) -> Option<Statistics>

Get statistics for this table, if available

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more

impl<T> Same for T

type Output = T

Should always be Self

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> ErasedDestructor for T
where T: 'static,

impl<T> MaybeSendSync for T