Struct PruningPredicate

Source

pub struct PruningPredicate { /* private fields */ }

Expand description

Used to prove that arbitrary predicates (boolean expression) can not possibly evaluate to true given information about a column provided by PruningStatistics.

§Introduction

PruningPredicate analyzes filter expressions using statistics such as min/max values and null counts, attempting to prove a “container” (e.g. Parquet Row Group) can be skipped without reading the actual data, potentially leading to significant performance improvements.

For example, PruningPredicates are used to prune Parquet Row Groups based on the min/max values found in the Parquet metadata. If the PruningPredicate can prove that the filter can never evaluate to true for any row in the Row Group, the entire Row Group is skipped during query execution.

The PruningPredicate API is general, and can be used for pruning other types of containers (e.g. files) based on statistics that may be known from external catalogs (e.g. Delta Lake) or other sources. How this works is a subtle topic. See the Background and Implementation section for details.

PruningPredicate supports:

Arbitrary expressions (including user defined functions)
Vectorized evaluation (provide more than one set of statistics at a time) so it is suitable for pruning 1000s of containers.
Any source of information that implements the PruningStatistics trait (not just Parquet metadata).

§Example

See the pruning.rs example in the datafusion-examples for a complete example of how to use PruningPredicate to prune files based on min/max values.

Given an expression like x = 5 and statistics for 3 containers (Row Groups, files, etc) A, B, and C:

  A: {x_min = 0, x_max = 4}
  B: {x_min = 2, x_max = 10}
  C: {x_min = 5, x_max = 8}

PruningPredicate will conclude that the rows in container A can never be true (as the maximum value is only 4), so it can be pruned:

A: false (no rows could possibly match x = 5)
B: true  (rows might match x = 5)
C: true  (rows might match x = 5)

See PruningPredicate::try_new and PruningPredicate::prune for more information.

§Background

§Boolean Tri-state logic

To understand the details of the rest of this documentation, it is important to understand how the tri-state boolean logic in SQL works. As this is somewhat esoteric, we review it here.

SQL has a notion of NULL that represents the value is “unknown” and this uncertainty propagates through expressions. SQL NULL behaves very differently than the NULL in most other languages where it is a special, sentinel value (e.g. 0 in C/C++). While representing uncertainty with NULL is powerful and elegant, SQL NULLs are often deeply confusing when first encountered as they behave differently than most programmers may expect.

In most other programming languages,

a == NULL evaluates to true if a also had the value NULL
a == NULL evaluates to false if a has any other value

However, in SQL a = NULL always evaluates to NULL (never true or false):

Expression	Result
`1 = NULL`	`NULL`
`NULL = NULL`	`NULL`

Also important is how AND and OR works with tri-state boolean logic as (perhaps counterintuitively) the result is not always NULL. While consistent with the notion of NULL representing “unknown”, this is again, often deeply confusing 🤯 when first encountered.

Expression	Result	Intuition
`NULL AND true`	`NULL`	The `NULL` stands for “unknown” and if it were `true` or `false` the overall expression value could change
`NULL AND false`	`false`	If the `NULL` was either `true` or `false` the overall expression is still `false`
`NULL AND NULL`	`NULL`

Expression	Result	Intuition
`NULL OR true`	`true`	If the `NULL` was either `true` or `false` the overall expression is still `true`
`NULL OR false`	`NULL`	The `NULL` stands for “unknown” and if it were `true` or `false` the overall expression value could change
`NULL OR NULL`	`NULL`

§SQL Filter Semantics

The SQL WHERE clause has a boolean expression, often called a filter or predicate. The semantics of this predicate are that the query evaluates the predicate for each row in the input tables and:

Rows that evaluate to true are returned in the query results
Rows that evaluate to false are not returned (“filtered out” or “pruned” or “skipped”).
Rows that evaluate to NULL are NOT returned (also “filtered out”). Note: this treatment of NULL is DIFFERENT than how NULL is treated in the rewritten predicate described below.

§`PruningPredicate` Implementation

Armed with the information in the Background section, we can now understand how the PruningPredicate logic works.

§Interface

Inputs

An input schema describing what columns exist
A predicate (expression that evaluates to a boolean)
PruningStatistics that provides information about columns in that schema, for multiple “containers”. For each column in each container, it provides optional information on contained values, min_values, max_values, null_counts counts, and row_counts counts.

Outputs: A (non null) boolean value for each container:

true: There MAY be rows that match the predicate
false: There are no rows that could possibly match the predicate (the predicate can never possibly be true). The container can be pruned (skipped) entirely.

While PruningPredicate will never return a NULL value, the rewritten predicate (as returned by build_predicate_expression and used internally by PruningPredicate) may evaluate to NULL when some of the min/max values or null / row counts are not known.

In order to be correct, PruningPredicate must return false only if it can determine that for all rows in the container, the predicate could never evaluate to true (always evaluates to either NULL or false).

§Contains Analysis and Min/Max Rewrite

PruningPredicate works by first analyzing the predicate to see what LiteralGuarantee must hold for the predicate to be true.

Then, the PruningPredicate rewrites the original predicate into an expression that references the min/max values of each column in the original predicate.

When the min/max values are actually substituted in to this expression and evaluated, the result means

true: there MAY be rows that pass the predicate, KEEPS the container
NULL: there MAY be rows that pass the predicate, KEEPS the container Note that rewritten predicate can evaluate to NULL when some of the min/max values are not known. Note that this is different than the SQL filter semantics where NULL means the row is filtered out.
false: there are no rows that could possibly match the predicate, PRUNES the container

For example, given a column x, the x_min, x_max, x_null_count, and x_row_count represent the minimum and maximum values, the null count of column x, and the row count of column x, provided by the PruningStatistics. x_null_count and x_row_count are used to handle the case where the column x is known to be all NULLs. Note this is different from knowing nothing about the column x, which confusingly is encoded by returning NULL for the min/max values from PruningStatistics::max_values and PruningStatistics::min_values.

Here are some examples of the rewritten predicates:

Original Predicate	Rewritten Predicate
`x = 5`	`x_null_count != x_row_count AND (x_min <= 5 AND 5 <= x_max)`
`x < 5`	`x_null_count != x_row_count THEN false (x_max < 5)`
`x = 5 AND y = 10`	`x_null_count != x_row_count AND (x_min <= 5 AND 5 <= x_max) AND y_null_count != y_row_count (y_min <= 10 AND 10 <= y_max)`
`x IS NULL`	`x_null_count > 0`
`x IS NOT NULL`	`x_null_count != row_count`
`CAST(x as int) = 5`	`x_null_count != x_row_count (CAST(x_min as int) <= 5 AND 5 <= CAST(x_max as int))`

§Predicate Evaluation

The PruningPredicate works in two passes

First pass: For each LiteralGuarantee calls PruningStatistics::contained and rules out containers where the LiteralGuarantees are not satisfied

Second Pass: Evaluates the rewritten expression using the min/max/null_counts/row_counts values for each column for each container. For any container that this expression evaluates to false, it rules out those containers.

§Example 1

Given the predicate, x = 5 AND y = 10, the rewritten predicate would look like:

x_null_count != x_row_count AND (x_min <= 5 AND 5 <= x_max)
AND
y_null_count != y_row_count AND (y_min <= 10 AND 10 <= y_max)

If we know that for a given container, x is between 1 and 100 and we know that y is between 4 and 7, we know nothing about the null count and row count of x and y, the input statistics might look like:

Column	Value
`x_min`	`1`
`x_max`	`100`
`x_null_count`	`null`
`x_row_count`	`null`
`y_min`	`4`
`y_max`	`7`
`y_null_count`	`null`
`y_row_count`	`null`

When these statistics values are substituted in to the rewritten predicate and simplified, the result is false:

null != null AND (1 <= 5 AND 5 <= 100) AND null != null AND (4 <= 10 AND 10 <= 7)
null = null is null which is not true, so the AND moves on to the next clause
null and (1 <= 5 AND 5 <= 100) AND null AND (4 <= 10 AND 10 <= 7)
evaluating the clauses further we get:
null and true and null and false
null and false
false

Returning false means the container can be pruned, which matches the intuition that x = 5 AND y = 10 can’t be true for any row if all values of y are 7 or less.

Note that if we had ended up with null AND true AND null AND true the result would have been null. null is treated the same astrue, because we can’t prove that the predicate is false.

If, for some other container, we knew y was between the values 4 and 15, then the rewritten predicate evaluates to true (verifying this is left as an exercise to the reader – are you still here?), and the container could not be pruned. The intuition is that there may be rows where the predicate might evaluate to true, and the only way to find out is to do more analysis, for example by actually reading the data and evaluating the predicate row by row.

§Example 2

Given the same predicate, x = 5 AND y = 10, the rewritten predicate would look like the same as example 1:

x_null_count != x_row_count AND (x_min <= 5 AND 5 <= x_max)
AND
y_null_count != y_row_count AND (y_min <= 10 AND 10 <= y_max)

If we know that for another given container, x_min is NULL and x_max is NULL (the min/max values are unknown), x_null_count is 100 and x_row_count is 100; we know that y is between 4 and 7, but we know nothing about the null count and row count of y. The input statistics might look like:

Column	Value
`x_min`	`null`
`x_max`	`null`
`x_null_count`	`100`
`x_row_count`	`100`
`y_min`	`4`
`y_max`	`7`
`y_null_count`	`null`
`y_row_count`	`null`

When these statistics values are substituted in to the rewritten predicate and simplified, the result is false:

100 != 100 AND (null <= 5 AND 5 <= null) AND null = null AND (4 <= 10 AND 10 <= 7)
false AND null AND null AND false
false AND false
false

Returning false means the container can be pruned, which matches the intuition that x = 5 AND y = 10 can’t be true because all values in x are known to be NULL.

PruningPredicate implements the type of min/max pruning described in Section 3.3.3 of the Snowflake SIGMOD Paper. The technique is described by various research such as small materialized aggregates, zone maps, and data skipping.

Struct PruningPredicateCopy item path

§Introduction

§Example

§Background

§Boolean Tri-state logic

§SQL Filter Semantics

§PruningPredicate Implementation

§Interface

§Contains Analysis and Min/Max Rewrite

§Predicate Evaluation

§Example 1

§Example 2

§Related Work

Implementations§

impl PruningPredicate

pub fn try_new(expr: Arc<dyn PhysicalExpr>, schema: SchemaRef) -> Result<Self>

pub fn prune<S: PruningStatistics>(&self, statistics: &S) -> Result<Vec<bool>>

pub fn schema(&self) -> &SchemaRef

pub fn orig_expr(&self) -> &Arc<dyn PhysicalExpr>

pub fn predicate_expr(&self) -> &Arc<dyn PhysicalExpr>

pub fn literal_guarantees(&self) -> &[LiteralGuarantee]

pub fn always_true(&self) -> bool

pub fn required_columns(&self) -> &RequiredColumns

pub fn literal_columns(&self) -> Vec<String>

Trait Implementations§

impl Clone for PruningPredicate

fn clone(&self) -> PruningPredicate

fn clone_from(&mut self, source: &Self)

impl Debug for PruningPredicate

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Auto Trait Implementations§

impl Freeze for PruningPredicate

impl !RefUnwindSafe for PruningPredicate

impl Send for PruningPredicate

impl Sync for PruningPredicate

impl Unpin for PruningPredicate

impl !UnwindSafe for PruningPredicate

Blanket Implementations§

impl<T> Any for Twhere T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for Twhere T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for Twhere T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for Twhere T: Clone,

unsafe fn clone_to_uninit(&self, dst: *mut u8)

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for Twhere U: From<T>,

fn into(self) -> U

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>where F: FnOnce(&Self) -> bool,

impl<T> ToOwned for Twhere T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

impl<V, T> VZip<V> for Twhere V: MultiLane<T>,

fn vzip(self) -> V

impl<T> ErasedDestructor for Twhere T: 'static,

impl<T> MaybeSendSync for T

Struct PruningPredicate

§`PruningPredicate` Implementation

impl<T> Any for T
where T: 'static + ?Sized,

impl<T> Borrow<T> for T
where T: ?Sized,

impl<T> BorrowMut<T> for T
where T: ?Sized,

impl<T> CloneToUninit for T
where T: Clone,

impl<T, U> Into<U> for T
where U: From<T>,

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

impl<T> ToOwned for T
where T: Clone,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

impl<T> ErasedDestructor for T
where T: 'static,