pub struct Scanner { /* private fields */ }
Expand description
Dataset Scanner
let dataset = Dataset::open(uri).await.unwrap();
let stream = dataset.scan()
.project(&["col", "col2.subfield"]).unwrap()
.limit(10)
.into_stream();
stream
.map(|batch| batch.num_rows())
.buffered(16)
.sum()
Implementations§
Source§impl Scanner
impl Scanner
pub fn new(dataset: Arc<Dataset>) -> Self
pub fn from_fragment(dataset: Arc<Dataset>, fragment: Fragment) -> Self
Sourcepub fn with_fragments(&mut self, fragments: Vec<Fragment>) -> &mut Self
pub fn with_fragments(&mut self, fragments: Vec<Fragment>) -> &mut Self
Set which fragments should be scanned.
If scan_in_order is set to true, the fragments will be scanned in the order of the vector.
Sourcepub fn project<T: AsRef<str>>(&mut self, columns: &[T]) -> Result<&mut Self>
pub fn project<T: AsRef<str>>(&mut self, columns: &[T]) -> Result<&mut Self>
Projection.
Only select the specified columns. If not specified, all columns will be scanned.
Sourcepub fn project_with_transform(
&mut self,
columns: &[(impl AsRef<str>, impl AsRef<str>)],
) -> Result<&mut Self>
pub fn project_with_transform( &mut self, columns: &[(impl AsRef<str>, impl AsRef<str>)], ) -> Result<&mut Self>
Projection with transform
Only select the specified columns with the given transform.
Sourcepub fn prefilter(&mut self, should_prefilter: bool) -> &mut Self
pub fn prefilter(&mut self, should_prefilter: bool) -> &mut Self
Should the filter run before the vector index is applied
If true then the filter will be applied before the vector index. This means the results will be accurate but the overall query may be more expensive.
If false then the filter will be applied to the nearest results. This means you may get back fewer results than you ask for (or none at all) if the closest results do not match the filter.
Sourcepub fn materialization_style(
&mut self,
style: MaterializationStyle,
) -> &mut Self
pub fn materialization_style( &mut self, style: MaterializationStyle, ) -> &mut Self
Set the materialization style for the scan
This controls when columns are fetched from storage. The default should work well for most cases.
If you know (in advance) a query will return relatively few results (less than 0.1% of the rows) then you may want to experiment with applying late materialization to more (or all) columns.
If you know a query is going to return many rows then you may want to experiment with applying early materialization to more (or all) columns.
Sourcepub fn filter(&mut self, filter: &str) -> Result<&mut Self>
pub fn filter(&mut self, filter: &str) -> Result<&mut Self>
Apply filters
The filters can be presented as the string, as in WHERE clause in SQL.
let dataset = Dataset::open(uri).await.unwrap();
let stream = dataset.scan()
.project(&["col", "col2.subfield"]).unwrap()
.filter("a > 10 AND b < 200").unwrap()
.limit(10)
.into_stream();
Once the filter is applied, Lance will create an optimized I/O plan for filtering.
Sourcepub fn full_text_search(
&mut self,
query: FullTextSearchQuery,
) -> Result<&mut Self>
pub fn full_text_search( &mut self, query: FullTextSearchQuery, ) -> Result<&mut Self>
Filter by full text search The column must be a string column. The query is a string to search for. The search is case-insensitive, BM25 scoring is used.
let dataset = Dataset::open(uri).await.unwrap();
let stream = dataset.scan()
.project(&["col", "col2.subfield"]).unwrap()
.full_text_search("col", "query").unwrap()
.limit(10)
.into_stream();
Sourcepub fn filter_substrait(&mut self, filter: &[u8]) -> Result<&mut Self>
pub fn filter_substrait(&mut self, filter: &[u8]) -> Result<&mut Self>
Set a filter using a Substrait ExtendedExpression message
The message must contain exactly one expression and that expression must be a scalar expression whose return type is boolean.
Sourcepub fn batch_size(&mut self, batch_size: usize) -> &mut Self
pub fn batch_size(&mut self, batch_size: usize) -> &mut Self
Set the batch size.
Sourcepub fn io_buffer_size(&mut self, size: u64) -> &mut Self
pub fn io_buffer_size(&mut self, size: u64) -> &mut Self
Set the I/O buffer size
This is the amount of RAM that will be reserved for holding I/O received from storage before it is processed. This is used to control the amount of memory used by the scanner. If the buffer is full then the scanner will block until the buffer is processed.
Generally this should scale with the number of concurrent I/O threads. The default is 2GiB which comfortably provides enough space for somewhere between 32 and 256 concurrent I/O threads.
This value is not a hard cap on the amount of RAM the scanner will use. Some space is used for the compute (which can be controlled by the batch size) and Lance does not keep track of memory after it is returned to the user.
Currently, if there is a single batch of data which is larger than the io buffer size then the scanner will deadlock. This is a known issue and will be fixed in a future release.
Sourcepub fn batch_readahead(&mut self, nbatches: usize) -> &mut Self
pub fn batch_readahead(&mut self, nbatches: usize) -> &mut Self
Set the prefetch size.
Sourcepub fn fragment_readahead(&mut self, nfragments: usize) -> &mut Self
pub fn fragment_readahead(&mut self, nfragments: usize) -> &mut Self
Set the fragment readahead.
This is only used if scan_in_order
is set to false.
Sourcepub fn scan_in_order(&mut self, ordered: bool) -> &mut Self
pub fn scan_in_order(&mut self, ordered: bool) -> &mut Self
Set whether to read data in order (default: true)
A scan will always read from the disk concurrently. If this property is true then a ready batch (a batch that has been read from disk) will only be returned if it is the next batch in the sequence. Otherwise, the batch will be held until the stream catches up. This means the sequence is returned in order but there may be slightly less parallelism.
If this is false, then batches will be returned as soon as they are available, potentially increasing throughput slightly
If an ordering is defined (using Self::order_by) then the scan will always scan in parallel and any value set here will be ignored.
Sourcepub fn use_scalar_index(&mut self, use_scalar_index: bool) -> &mut Self
pub fn use_scalar_index(&mut self, use_scalar_index: bool) -> &mut Self
Set whether to use scalar index.
By default, scalar indices will be used to optimize a query if available. However, in some corner cases, scalar indices may not be the best choice. This option allows users to disable scalar indices for a query.
Sourcepub fn limit(
&mut self,
limit: Option<i64>,
offset: Option<i64>,
) -> Result<&mut Self>
pub fn limit( &mut self, limit: Option<i64>, offset: Option<i64>, ) -> Result<&mut Self>
Set limit and offset.
If offset is set, the first offset rows will be skipped. If limit is set, only the provided number of rows will be returned. These can be set independently. For example, setting offset to 10 and limit to None will skip the first 10 rows and return the rest of the rows in the dataset.
Sourcepub fn nearest(
&mut self,
column: &str,
q: &Float32Array,
k: usize,
) -> Result<&mut Self>
pub fn nearest( &mut self, column: &str, q: &Float32Array, k: usize, ) -> Result<&mut Self>
Find k-nearest neighbor within the vector column.
pub fn nprobs(&mut self, n: usize) -> &mut Self
pub fn ef(&mut self, ef: usize) -> &mut Self
Sourcepub fn fast_search(&mut self) -> &mut Self
pub fn fast_search(&mut self) -> &mut Self
Only search the data being indexed.
Default value is false.
This is essentially a weak consistency search, only on the indexed data.
Sourcepub fn refine(&mut self, factor: u32) -> &mut Self
pub fn refine(&mut self, factor: u32) -> &mut Self
Apply a refine step to the vector search.
A refine improves query accuracy but also makes search slower, by reading extra elements and using the original vector values to re-rank the distances.
factor
- the factor of extra elements to read. For example, if factor is 2, then the search will read 2x more elements than the requested k before performing the re-ranking. Note: even if the factor is 1, the results will still be re-ranked without fetching additional elements.
Sourcepub fn distance_metric(&mut self, metric_type: MetricType) -> &mut Self
pub fn distance_metric(&mut self, metric_type: MetricType) -> &mut Self
Change the distance MetricType, i.e, L2 or Cosine distance.
Sourcepub fn order_by(
&mut self,
ordering: Option<Vec<ColumnOrdering>>,
) -> Result<&mut Self>
pub fn order_by( &mut self, ordering: Option<Vec<ColumnOrdering>>, ) -> Result<&mut Self>
Sort the results of the scan by one or more columns
If Some, then the resulting stream will be sorted according to the given ordering. This may increase the latency of the first result since all data must be read before the first batch can be returned.
Sourcepub fn use_index(&mut self, use_index: bool) -> &mut Self
pub fn use_index(&mut self, use_index: bool) -> &mut Self
Set whether to use the index if available
Sourcepub fn with_row_id(&mut self) -> &mut Self
pub fn with_row_id(&mut self) -> &mut Self
Instruct the scanner to return the _rowid
meta column from the dataset.
Sourcepub fn with_row_address(&mut self) -> &mut Self
pub fn with_row_address(&mut self) -> &mut Self
Instruct the scanner to return the _rowaddr
meta column from the dataset.
Sourcepub fn use_stats(&mut self, use_stats: bool) -> &mut Self
pub fn use_stats(&mut self, use_stats: bool) -> &mut Self
Set whether to use statistics to optimize the scan (default: true)
This is used for debugging or benchmarking purposes.
Sourcepub async fn schema(&self) -> Result<SchemaRef>
pub async fn schema(&self) -> Result<SchemaRef>
The Arrow schema of the output, including projections and vector / _distance
Sourcepub fn get_filter(&self) -> Result<Option<Expr>>
pub fn get_filter(&self) -> Result<Option<Expr>>
Fetches the currently set filter
Note that this forces the filter to be evaluated and the result will depend on the current state of the scanner (e.g. if with_row_id has been called then _rowid will be available for filtering but not otherwise) and so you may want to call this after setting all other options.
Sourcepub async fn try_into_stream(&self) -> Result<DatasetRecordBatchStream>
pub async fn try_into_stream(&self) -> Result<DatasetRecordBatchStream>
Create a stream from the Scanner.
pub async fn try_into_batch(&self) -> Result<RecordBatch>
Sourcepub async fn count_rows(&self) -> Result<u64>
pub async fn count_rows(&self) -> Result<u64>
Scan and return the number of matching rows
Sourcepub async fn create_plan(&self) -> Result<Arc<dyn ExecutionPlan>>
pub async fn create_plan(&self) -> Result<Arc<dyn ExecutionPlan>>
Create ExecutionPlan
for Scan.
An ExecutionPlan is a graph of operators that can be executed.
The following plans are supported:
- Plain scan without filter or limits.
Scan(projections)
- Scan with filter and/or limits.
Scan(filtered_cols) -> Filter(expr)
-> (*LimitExec(limit, offset))
-> Take(remaining_cols) -> Projection()
- Use KNN Index (with filter and/or limits)
KNNIndex() -> Take(vector) -> FlatRefine()
-> Take(filtered_cols) -> Filter(expr)
-> (*LimitExec(limit, offset))
-> Take(remaining_cols) -> Projection()
- Use KNN flat (brute force) with filter and/or limits
Scan(vector) -> FlatKNN()
-> Take(filtered_cols) -> Filter(expr)
-> (*LimitExec(limit, offset))
-> Take(remaining_cols) -> Projection()
In general, a plan has 5 stages:
- Source (from dataset Scan or from index, may include prefilter)
- Filter
- Sort
- Limit / Offset
- Take remaining columns / Projection
pub async fn explain_plan(&self, verbose: bool) -> Result<String>
Auto Trait Implementations§
impl Freeze for Scanner
impl !RefUnwindSafe for Scanner
impl Send for Scanner
impl Sync for Scanner
impl Unpin for Scanner
impl !UnwindSafe for Scanner
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> Downcast for Twhere
T: Any,
impl<T> Downcast for Twhere
T: Any,
Source§fn into_any(self: Box<T>) -> Box<dyn Any>
fn into_any(self: Box<T>) -> Box<dyn Any>
Box<dyn Trait>
(where Trait: Downcast
) to Box<dyn Any>
. Box<dyn Any>
can
then be further downcast
into Box<ConcreteType>
where ConcreteType
implements Trait
.Source§fn into_any_rc(self: Rc<T>) -> Rc<dyn Any>
fn into_any_rc(self: Rc<T>) -> Rc<dyn Any>
Rc<Trait>
(where Trait: Downcast
) to Rc<Any>
. Rc<Any>
can then be
further downcast
into Rc<ConcreteType>
where ConcreteType
implements Trait
.Source§fn as_any(&self) -> &(dyn Any + 'static)
fn as_any(&self) -> &(dyn Any + 'static)
&Trait
(where Trait: Downcast
) to &Any
. This is needed since Rust cannot
generate &Any
’s vtable from &Trait
’s.Source§fn as_any_mut(&mut self) -> &mut (dyn Any + 'static)
fn as_any_mut(&mut self) -> &mut (dyn Any + 'static)
&mut Trait
(where Trait: Downcast
) to &Any
. This is needed since Rust cannot
generate &mut Any
’s vtable from &mut Trait
’s.Source§impl<T> DowncastSync for T
impl<T> DowncastSync for T
Source§impl<T> FmtForward for T
impl<T> FmtForward for T
Source§fn fmt_binary(self) -> FmtBinary<Self>where
Self: Binary,
fn fmt_binary(self) -> FmtBinary<Self>where
Self: Binary,
self
to use its Binary
implementation when Debug
-formatted.Source§fn fmt_display(self) -> FmtDisplay<Self>where
Self: Display,
fn fmt_display(self) -> FmtDisplay<Self>where
Self: Display,
self
to use its Display
implementation when
Debug
-formatted.Source§fn fmt_lower_exp(self) -> FmtLowerExp<Self>where
Self: LowerExp,
fn fmt_lower_exp(self) -> FmtLowerExp<Self>where
Self: LowerExp,
self
to use its LowerExp
implementation when
Debug
-formatted.Source§fn fmt_lower_hex(self) -> FmtLowerHex<Self>where
Self: LowerHex,
fn fmt_lower_hex(self) -> FmtLowerHex<Self>where
Self: LowerHex,
self
to use its LowerHex
implementation when
Debug
-formatted.Source§fn fmt_octal(self) -> FmtOctal<Self>where
Self: Octal,
fn fmt_octal(self) -> FmtOctal<Self>where
Self: Octal,
self
to use its Octal
implementation when Debug
-formatted.Source§fn fmt_pointer(self) -> FmtPointer<Self>where
Self: Pointer,
fn fmt_pointer(self) -> FmtPointer<Self>where
Self: Pointer,
self
to use its Pointer
implementation when
Debug
-formatted.Source§fn fmt_upper_exp(self) -> FmtUpperExp<Self>where
Self: UpperExp,
fn fmt_upper_exp(self) -> FmtUpperExp<Self>where
Self: UpperExp,
self
to use its UpperExp
implementation when
Debug
-formatted.Source§fn fmt_upper_hex(self) -> FmtUpperHex<Self>where
Self: UpperHex,
fn fmt_upper_hex(self) -> FmtUpperHex<Self>where
Self: UpperHex,
self
to use its UpperHex
implementation when
Debug
-formatted.Source§impl<T> Instrument for T
impl<T> Instrument for T
Source§fn instrument(self, span: Span) -> Instrumented<Self>
fn instrument(self, span: Span) -> Instrumented<Self>
Source§fn in_current_span(self) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self
into a Left
variant of Either<Self, Self>
if into_left
is true
.
Converts self
into a Right
variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self
into a Left
variant of Either<Self, Self>
if into_left(&self)
returns true
.
Converts self
into a Right
variant of Either<Self, Self>
otherwise. Read moreSource§impl<T> Pipe for Twhere
T: ?Sized,
impl<T> Pipe for Twhere
T: ?Sized,
Source§fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> Rwhere
Self: Sized,
fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> Rwhere
Self: Sized,
Source§fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> Rwhere
R: 'a,
fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> Rwhere
R: 'a,
self
and passes that borrow into the pipe function. Read moreSource§fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> Rwhere
R: 'a,
fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> Rwhere
R: 'a,
self
and passes that borrow into the pipe function. Read moreSource§fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
Source§fn pipe_borrow_mut<'a, B, R>(
&'a mut self,
func: impl FnOnce(&'a mut B) -> R,
) -> R
fn pipe_borrow_mut<'a, B, R>( &'a mut self, func: impl FnOnce(&'a mut B) -> R, ) -> R
Source§fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
self
, then passes self.as_ref()
into the pipe function.Source§fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
self
, then passes self.as_mut()
into the pipe
function.Source§fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
self
, then passes self.deref()
into the pipe function.Source§impl<T> Pointable for T
impl<T> Pointable for T
Source§impl<T> Tap for T
impl<T> Tap for T
Source§fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
Borrow<B>
of a value. Read moreSource§fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
BorrowMut<B>
of a value. Read moreSource§fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
AsRef<R>
view of a value. Read moreSource§fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
AsMut<R>
view of a value. Read moreSource§fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
Deref::Target
of a value. Read moreSource§fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
Deref::Target
of a value. Read moreSource§fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self
fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self
.tap()
only in debug builds, and is erased in release builds.Source§fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self
fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self
.tap_mut()
only in debug builds, and is erased in release
builds.Source§fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
.tap_borrow()
only in debug builds, and is erased in release
builds.Source§fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
.tap_borrow_mut()
only in debug builds, and is erased in release
builds.Source§fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
.tap_ref()
only in debug builds, and is erased in release
builds.Source§fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
.tap_ref_mut()
only in debug builds, and is erased in release
builds.Source§fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
.tap_deref()
only in debug builds, and is erased in release
builds.