pub trait PruningStatistics {
    // Required methods
    fn min_values(&self, column: &Column) -> Option<ArrayRef>;
    fn max_values(&self, column: &Column) -> Option<ArrayRef>;
    fn num_containers(&self) -> usize;
    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
    fn row_counts(&self, column: &Column) -> Option<ArrayRef>;
    fn contained(
        &self,
        column: &Column,
        values: &HashSet<ScalarValue>,
    ) -> Option<BooleanArray>;
}
A source of runtime statistical information to PruningPredicates.
§Supported Information
- Minimum and maximum values for columns
- Null counts and row counts for columns
- Whether the values in a column are contained in a set of literals
§Vectorized Interface
Information for containers / files is returned as an Arrow ArrayRef, so
the evaluation happens once on a single RecordBatch, which amortizes the
overhead of evaluating the predicate. This is important when pruning thousands
of containers, which often happens in analytic systems that have thousands of
potential files to consider.
For example, for the following three files with a single column a:
file1: column a: min=5, max=10
file2: column a: No stats
file3: column a: min=20, max=30
PruningStatistics would return:
min_values("a") -> Some([5, Null, 20])
max_values("a") -> Some([10, Null, 30])
min_values("X") -> None
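The three-file example can be sketched with plain Rust stand-ins. This is only an illustration under stated assumptions: `Vec<Option<i64>>` stands in for an Arrow ArrayRef (with `None` playing the role of a null row), and `FileStats`, `MinMax`, and `three_file_example` are hypothetical names that are not part of DataFusion.

```rust
use std::collections::HashMap;

/// Hypothetical per-file min/max for one column (not a DataFusion type).
struct MinMax {
    min: i64,
    max: i64,
}

/// Hypothetical statistics store: one map of column name -> MinMax per file.
struct FileStats {
    files: Vec<HashMap<String, MinMax>>,
}

impl FileStats {
    /// Mirrors `num_containers`: one row per file.
    fn num_containers(&self) -> usize {
        self.files.len()
    }

    /// Mirrors `min_values`: `None` if no file has stats for the column,
    /// otherwise one entry per file, with `None` standing in for a null row.
    fn min_values(&self, column: &str) -> Option<Vec<Option<i64>>> {
        let rows: Vec<Option<i64>> = self
            .files
            .iter()
            .map(|file| file.get(column).map(|s| s.min))
            .collect();
        if rows.iter().all(|r| r.is_none()) { None } else { Some(rows) }
    }

    /// Mirrors `max_values`, with the same `None` / null conventions.
    fn max_values(&self, column: &str) -> Option<Vec<Option<i64>>> {
        let rows: Vec<Option<i64>> = self
            .files
            .iter()
            .map(|file| file.get(column).map(|s| s.max))
            .collect();
        if rows.iter().all(|r| r.is_none()) { None } else { Some(rows) }
    }
}

/// Builds the three files from the example: the second file has no statistics.
fn three_file_example() -> FileStats {
    let mut file1 = HashMap::new();
    file1.insert("a".to_string(), MinMax { min: 5, max: 10 });
    let file2: HashMap<String, MinMax> = HashMap::new();
    let mut file3 = HashMap::new();
    file3.insert("a".to_string(), MinMax { min: 20, max: 30 });
    FileStats { files: vec![file1, file2, file3] }
}
```

A real implementation would build Arrow arrays instead of vectors, but the shape is the same: one row per container, a null where a single container lacks statistics, and `None` when nothing is known about the column at all.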
Required Methods§
fn min_values(&self, column: &Column) -> Option<ArrayRef>
Return the minimum values for the named column, if known.
If the minimum value for a particular container is not known, the
returned array should have null in that row. If the minimum value is
not known for any row, return None.
Note: the returned array must contain Self::num_containers rows.
fn max_values(&self, column: &Column) -> Option<ArrayRef>
Return the maximum values for the named column, if known.
See Self::min_values for when to return None and null values.
Note: the returned array must contain Self::num_containers rows.
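As a rough illustration of how a consumer might use such an array, the sketch below decides which containers could satisfy a predicate like `a > 15` from the maximum values alone. It is a hand-written stand-in, not how PruningPredicate is implemented: `may_contain_greater_than` is a hypothetical helper, and `&[Option<i64>]` stands in for the Arrow array, with `None` as a null row.

```rust
/// Returns one flag per container: `true` means the container might hold a
/// row with a value greater than `threshold` and must be read; `false` means
/// it can safely be skipped.
fn may_contain_greater_than(max_values: &[Option<i64>], threshold: i64) -> Vec<bool> {
    max_values
        .iter()
        .map(|max| match max {
            // Known maximum: rows can exceed the threshold only if the max does.
            Some(max) => *max > threshold,
            // Unknown maximum (null row): nothing can be ruled out, so keep it.
            None => true,
        })
        .collect()
}
```

With the maximum values `[10, Null, 30]` from the earlier example and a threshold of 15, only the first container can be skipped. The container with no statistics must be kept, because a null row carries no information.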
fn num_containers(&self) -> usize
Return the number of containers (e.g. Row Groups) being pruned with these statistics.
This value corresponds to the size of the ArrayRef returned by
Self::min_values, Self::max_values, Self::null_counts,
and Self::row_counts.
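One way to guard this invariant in tests is a small length check over whatever arrays an implementation returns. The sketch below is an assumption-laden illustration: `validate_lengths` is a hypothetical helper, and slices of `Option<u64>` stand in for the Arrow arrays.

```rust
/// Checks that every statistics array that is present has exactly
/// `num_containers` rows. Arrays reported as `None` (nothing known for the
/// column) are skipped.
fn validate_lengths(
    num_containers: usize,
    arrays: &[(&str, Option<&[Option<u64>]>)],
) -> Result<(), String> {
    for (name, array) in arrays {
        if let Some(rows) = array {
            if rows.len() != num_containers {
                return Err(format!(
                    "{} returned {} rows, expected {}",
                    name,
                    rows.len(),
                    num_containers
                ));
            }
        }
    }
    Ok(())
}
```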
fn null_counts(&self, column: &Column) -> Option<ArrayRef>
Return the number of null values for the named column in each container
as a UInt64Array.
See Self::min_values for when to return None and null values.
Note: the returned array must contain Self::num_containers rows.
fn row_counts(&self, column: &Column) -> Option<ArrayRef>
Return the number of rows for the named column in each container
as a UInt64Array.
See Self::min_values for when to return None and null values.
Note: the returned array must contain Self::num_containers rows.
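Null counts and row counts are most useful together: when the two are equal for a container, the column holds only nulls there, so a comparison such as `a = 5` cannot be true for any row. The sketch below shows that comparison under stated assumptions: `all_null` is a hypothetical helper, and slices of `Option<u64>` stand in for the UInt64Array values.

```rust
/// For each container, reports whether the column is known to be entirely
/// null: `Some(true)` if null count equals row count, `Some(false)` if it
/// does not, and `None` if either count is unknown for that container.
fn all_null(null_counts: &[Option<u64>], row_counts: &[Option<u64>]) -> Vec<Option<bool>> {
    null_counts
        .iter()
        .zip(row_counts)
        .map(|(nulls, rows)| match (nulls, rows) {
            (Some(nulls), Some(rows)) => Some(nulls == rows),
            // A null row in either array means nothing can be concluded.
            _ => None,
        })
        .collect()
}
```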
fn contained(
    &self,
    column: &Column,
    values: &HashSet<ScalarValue>,
) -> Option<BooleanArray>
Returns a BooleanArray where each row represents information known
about specific literal values in a column.
For example, Parquet Bloom Filters implement this API to communicate
that values are known not to be present in a Row Group.
The returned array has one row for each container, with the following meanings:
- true if the values in column ONLY contain values from values
- false if the values in column are NOT ANY of values
- null if neither of the above holds or is unknown.
If these statistics cannot determine column membership for any
container, return None (the default).
Note: the returned array must contain Self::num_containers rows.
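The three-valued result can be sketched for a source that knows the complete set of distinct values in each container (for instance from a dictionary). This is an assumption made for illustration only: a Bloom filter could report only the false case. The names `contained_one` and `contained_all` are hypothetical, `i64` stands in for ScalarValue, and `Vec<Option<bool>>` stands in for BooleanArray.

```rust
use std::collections::HashSet;

/// Result for a single container. `known` is the complete set of distinct
/// values in the column for that container, or `None` if it is unknown.
fn contained_one(known: Option<&HashSet<i64>>, values: &HashSet<i64>) -> Option<bool> {
    let known = known?; // unknown contents: null
    if known.is_subset(values) {
        Some(true) // the column ONLY contains values from `values`
    } else if known.is_disjoint(values) {
        Some(false) // the column contains none of `values`
    } else {
        None // partial overlap: neither statement holds
    }
}

/// One row per container, or `None` when nothing is known for any container.
fn contained_all(
    containers: &[Option<HashSet<i64>>],
    values: &HashSet<i64>,
) -> Option<Vec<Option<bool>>> {
    if containers.iter().all(|c| c.is_none()) {
        return None;
    }
    Some(
        containers
            .iter()
            .map(|c| contained_one(c.as_ref(), values))
            .collect(),
    )
}
```

For a predicate such as `a IN (1, 2)`, a false row lets the container be skipped outright, while a null row leaves the decision to other statistics.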