lance::dataset::scanner

Struct Scanner

pub struct Scanner { /* private fields */ }

Dataset Scanner

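let dataset = Dataset::open(uri).await.unwrap();
let stream = dataset.scan()
    .project(&["col", "col2.subfield"]).unwrap()
    .limit(Some(10), None).unwrap()
    .try_into_stream().await.unwrap();
// Count the rows across all returned batches (requires futures::TryStreamExt).
let num_rows = stream
    .try_fold(0, |acc, batch| async move { Ok(acc + batch.num_rows()) })
    .await
    .unwrap();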

Implementations§

impl Scanner

pub fn new(dataset: Arc<Dataset>) -> Self

pub fn from_fragment(dataset: Arc<Dataset>, fragment: Fragment) -> Self

pub fn with_fragments(&mut self, fragments: Vec<Fragment>) -> &mut Self

Set which fragments should be scanned.

If scan_in_order is set to true, the fragments will be scanned in the order of the vector.

pub fn project<T: AsRef<str>>(&mut self, columns: &[T]) -> Result<&mut Self>

Projection.

Only select the specified columns. If not specified, all columns will be scanned.

pub fn project_with_transform( &mut self, columns: &[(impl AsRef<str>, impl AsRef<str>)], ) -> Result<&mut Self>

Projection with transform

Select only the specified columns, each computed by the given transform. Each entry in columns is an (output column name, transform expression) pair.

pub fn prefilter(&mut self, should_prefilter: bool) -> &mut Self

Set whether the filter runs before the vector index is applied.

If true then the filter will be applied before the vector index. This means the results will be accurate but the overall query may be more expensive.

If false then the filter will be applied to the nearest results. This means you may get back fewer results than you ask for (or none at all) if the closest results do not match the filter.
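The difference between the two settings can be illustrated with a toy brute-force search. This is a minimal sketch in plain Rust, not Lance's implementation: the `Row` type, the one-dimensional "vector", and the functions `knn_prefilter` and `knn_postfilter` are hypothetical names introduced for illustration only.

```rust
/// A toy row: a 1-D "vector" plus a scalar column used by the filter.
/// (Hypothetical type, not part of Lance.)
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u32,
    pub vector: f32,
    pub price: u32,
}

/// Return the `k` rows closest to `query`, nearest first.
fn nearest(rows: &[Row], query: f32, k: usize) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort_by(|a, b| (a.vector - query).abs().total_cmp(&(b.vector - query).abs()));
    sorted.truncate(k);
    sorted
}

/// Prefilter: apply the predicate first, then take the k nearest survivors.
pub fn knn_prefilter(rows: &[Row], query: f32, k: usize, pred: impl Fn(&Row) -> bool) -> Vec<u32> {
    let matching: Vec<Row> = rows.iter().filter(|r| pred(*r)).cloned().collect();
    nearest(&matching, query, k).iter().map(|r| r.id).collect()
}

/// Postfilter: take the k nearest rows first, then drop those failing the predicate.
pub fn knn_postfilter(rows: &[Row], query: f32, k: usize, pred: impl Fn(&Row) -> bool) -> Vec<u32> {
    nearest(rows, query, k)
        .into_iter()
        .filter(|r| pred(r))
        .map(|r| r.id)
        .collect()
}
```

With five rows whose two nearest neighbors both fail the predicate, `knn_prefilter` still returns two matching rows, while `knn_postfilter` returns none, which mirrors the trade-off described above.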

pub fn materialization_style( &mut self, style: MaterializationStyle, ) -> &mut Self

Set the materialization style for the scan

This controls when columns are fetched from storage. The default should work well for most cases.

If you know (in advance) a query will return relatively few results (less than 0.1% of the rows) then you may want to experiment with applying late materialization to more (or all) columns.

If you know a query is going to return many rows then you may want to experiment with applying early materialization to more (or all) columns.

pub fn filter(&mut self, filter: &str) -> Result<&mut Self>

Apply filters

The filter is given as a string, as in the WHERE clause of a SQL query.

let dataset = Dataset::open(uri).await.unwrap();
let stream = dataset.scan()
    .project(&["col", "col2.subfield"]).unwrap()
    .filter("a > 10 AND b < 200").unwrap()
    .limit(Some(10), None).unwrap()
    .try_into_stream().await.unwrap();

Once the filter is applied, Lance will create an optimized I/O plan for filtering.

Filter by full text search. The column must be a string column, and the query is the string to search for. The search is case-insensitive and uses BM25 scoring.

let dataset = Dataset::open(uri).await.unwrap();
let stream = dataset.scan()
    .project(&["col", "col2.subfield"]).unwrap()
    .full_text_search("col", "query").unwrap()
    .limit(Some(10), None).unwrap()
    .try_into_stream().await.unwrap();

pub fn filter_substrait(&mut self, filter: &[u8]) -> Result<&mut Self>

Set a filter using a Substrait ExtendedExpression message

The message must contain exactly one expression and that expression must be a scalar expression whose return type is boolean.

pub fn batch_size(&mut self, batch_size: usize) -> &mut Self

Set the batch size.

pub fn io_buffer_size(&mut self, size: u64) -> &mut Self

Set the I/O buffer size

This is the amount of RAM that will be reserved for holding I/O received from storage before it is processed. This is used to control the amount of memory used by the scanner. If the buffer is full then the scanner will block until the buffer is processed.

Generally this should scale with the number of concurrent I/O threads. The default is 2GiB which comfortably provides enough space for somewhere between 32 and 256 concurrent I/O threads.

This value is not a hard cap on the amount of RAM the scanner will use. Some space is used for the compute (which can be controlled by the batch size) and Lance does not keep track of memory after it is returned to the user.

Currently, if there is a single batch of data which is larger than the io buffer size then the scanner will deadlock. This is a known issue and will be fixed in a future release.
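The sizing guidance above can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that the buffer is shared evenly between I/O threads; Lance does not necessarily partition it this way, and `per_thread_budget` is a hypothetical helper, not part of the library.

```rust
pub const MIB: u64 = 1024 * 1024;
pub const GIB: u64 = 1024 * MIB;

/// Bytes of I/O buffer available to each thread under an even split.
/// Returns `None` when `io_threads` is zero.
pub fn per_thread_budget(io_buffer_size: u64, io_threads: u64) -> Option<u64> {
    io_buffer_size.checked_div(io_threads)
}
```

Under that assumption, the 2GiB default leaves 64MiB per thread at 32 threads and 8MiB per thread at 256 threads. A batch larger than the whole buffer cannot fit at all, which is the deadlock case noted above.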

pub fn batch_readahead(&mut self, nbatches: usize) -> &mut Self

Set the prefetch size, i.e. the number of batches to read ahead.

pub fn fragment_readahead(&mut self, nfragments: usize) -> &mut Self

Set the fragment readahead.

This is only used if scan_in_order is set to false.

pub fn scan_in_order(&mut self, ordered: bool) -> &mut Self

Set whether to read data in order (default: true)

A scan will always read from the disk concurrently. If this property is true then a ready batch (a batch that has been read from disk) will only be returned if it is the next batch in the sequence. Otherwise, the batch will be held until the stream catches up. This means the sequence is returned in order but there may be slightly less parallelism.

If this is false, then batches will be returned as soon as they are available, potentially increasing throughput slightly.

If an ordering is defined (using Self::order_by) then the scan will always scan in parallel and any value set here will be ignored.
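The in-order behavior described above can be modeled with a small reorder buffer. This is a minimal sketch in plain Rust, not Lance's implementation: `InOrderBuffer` and its methods are hypothetical names, and each batch is assumed to carry a sequence number starting at 0.

```rust
use std::collections::BTreeMap;

/// Releases items in sequence order even when they arrive out of order.
/// (Hypothetical type, used only to illustrate ordered scanning.)
pub struct InOrderBuffer<T> {
    /// Sequence number of the next item that may be emitted.
    next: usize,
    /// Items that arrived early and are waiting for their turn.
    pending: BTreeMap<usize, T>,
}

impl<T> InOrderBuffer<T> {
    pub fn new() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }

    /// Accept batch `seq` and return every batch that is now ready, in order.
    pub fn push(&mut self, seq: usize, batch: T) -> Vec<T> {
        self.pending.insert(seq, batch);
        let mut ready = Vec::new();
        // Drain the run of consecutive sequence numbers starting at `next`.
        while let Some(b) = self.pending.remove(&self.next) {
            ready.push(b);
            self.next += 1;
        }
        ready
    }
}
```

If batches 1 and 2 finish reading before batch 0, both are held back; once batch 0 arrives, all three are released together. An unordered scan would simply hand each batch over as it arrives, which is where the small throughput gain comes from.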

pub fn use_scalar_index(&mut self, use_scalar_index: bool) -> &mut Self

Set whether to use scalar index.

By default, scalar indices will be used to optimize a query if available. However, in some corner cases, scalar indices may not be the best choice. This option allows users to disable scalar indices for a query.

pub fn limit( &mut self, limit: Option<i64>, offset: Option<i64>, ) -> Result<&mut Self>

Set limit and offset.

If offset is set, the first offset rows will be skipped. If limit is set, only the provided number of rows will be returned. These can be set independently. For example, setting offset to 10 and limit to None will skip the first 10 rows and return the rest of the rows in the dataset.
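The semantics can be sketched with ordinary iterators. This is an illustrative model only: `apply_limit` is a hypothetical helper, and it uses `usize` instead of the `i64` values taken by the real method, so it does not cover any validation the scanner performs.

```rust
/// Skip `offset` rows (if set), then keep at most `limit` rows (if set).
pub fn apply_limit<T>(
    rows: impl IntoIterator<Item = T>,
    limit: Option<usize>,
    offset: Option<usize>,
) -> Vec<T> {
    let skipped = rows.into_iter().skip(offset.unwrap_or(0));
    match limit {
        Some(n) => skipped.take(n).collect(),
        None => skipped.collect(),
    }
}
```

For a 20-row input, `apply_limit(0..20, None, Some(10))` yields rows 10 through 19, matching the example in the text, and `apply_limit(0..20, Some(3), Some(10))` yields rows 10, 11 and 12.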

pub fn nearest( &mut self, column: &str, q: &Float32Array, k: usize, ) -> Result<&mut Self>

Find the k nearest neighbors within the vector column.

pub fn nprobs(&mut self, n: usize) -> &mut Self

pub fn ef(&mut self, ef: usize) -> &mut Self

Only search the data being indexed.

Default value is false.

This is essentially a weak consistency search, only on the indexed data.

pub fn refine(&mut self, factor: u32) -> &mut Self

Apply a refine step to the vector search.

A refine improves query accuracy but also makes search slower, by reading extra elements and using the original vector values to re-rank the distances.

  • factor - the factor of extra elements to read. For example, if factor is 2, then the search will read 2x more elements than the requested k before performing the re-ranking. Note: even if the factor is 1, the results will still be re-ranked without fetching additional elements.
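The refine step can be illustrated as follows. This is a minimal sketch under stated assumptions, not Lance's implementation: `refine_candidates` is a hypothetical function, the candidate list stands in for what an approximate index would return (sorted by approximate distance), and squared L2 is used as the exact distance.

```rust
/// Squared L2 distance between two vectors of equal length.
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Re-rank approximate search results using the original vectors.
///
/// * `candidates` - (row id, approximate distance) pairs, nearest first.
/// * `originals`  - `originals[id]` is the exact vector of row `id`.
pub fn refine_candidates(
    candidates: &[(usize, f32)],
    originals: &[Vec<f32>],
    query: &[f32],
    k: usize,
    factor: usize,
) -> Vec<usize> {
    // Read `k * factor` candidates from the approximate ranking.
    let fetch = k.saturating_mul(factor).min(candidates.len());
    // Recompute each distance from the original vector values.
    let mut reranked: Vec<(usize, f32)> = candidates[..fetch]
        .iter()
        .map(|&(id, _approx)| (id, l2(&originals[id], query)))
        .collect();
    // Sort by exact distance and keep the best `k`.
    reranked.sort_by(|a, b| a.1.total_cmp(&b.1));
    reranked.truncate(k);
    reranked.into_iter().map(|(id, _)| id).collect()
}
```

With a factor of 1 the same k candidates are returned, only reordered by exact distance. With a larger factor, a row that the approximate ranking placed outside the top k can still make it into the final result.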

pub fn distance_metric(&mut self, metric_type: MetricType) -> &mut Self

Change the distance MetricType, e.g. L2 or cosine distance.

pub fn order_by( &mut self, ordering: Option<Vec<ColumnOrdering>>, ) -> Result<&mut Self>

Sort the results of the scan by one or more columns

If Some, then the resulting stream will be sorted according to the given ordering. This may increase the latency of the first result since all data must be read before the first batch can be returned.

pub fn use_index(&mut self, use_index: bool) -> &mut Self

Set whether to use the index if available

pub fn with_row_id(&mut self) -> &mut Self

Instruct the scanner to return the _rowid meta column from the dataset.

pub fn with_row_address(&mut self) -> &mut Self

Instruct the scanner to return the _rowaddr meta column from the dataset.

pub fn use_stats(&mut self, use_stats: bool) -> &mut Self

Set whether to use statistics to optimize the scan (default: true)

This is used for debugging or benchmarking purposes.

pub async fn schema(&self) -> Result<SchemaRef>

The Arrow schema of the output, including projections and vector / _distance

pub fn get_filter(&self) -> Result<Option<Expr>>

Fetches the currently set filter

Note that this forces the filter to be evaluated and the result will depend on the current state of the scanner (e.g. if with_row_id has been called then _rowid will be available for filtering but not otherwise) and so you may want to call this after setting all other options.

pub async fn try_into_stream(&self) -> Result<DatasetRecordBatchStream>

Create a stream from the Scanner.

pub async fn try_into_batch(&self) -> Result<RecordBatch>

pub async fn count_rows(&self) -> Result<u64>

Scan and return the number of matching rows

pub async fn create_plan(&self) -> Result<Arc<dyn ExecutionPlan>>

Create ExecutionPlan for Scan.

An ExecutionPlan is a graph of operators that can be executed.

The following plans are supported:

  • Plain scan without filter or limits.
Scan(projections)
  • Scan with filter and/or limits.
Scan(filtered_cols) -> Filter(expr)
   -> (*LimitExec(limit, offset))
   -> Take(remaining_cols) -> Projection()
  • Use KNN Index (with filter and/or limits)
KNNIndex() -> Take(vector) -> FlatRefine()
    -> Take(filtered_cols) -> Filter(expr)
    -> (*LimitExec(limit, offset))
    -> Take(remaining_cols) -> Projection()
  • Use KNN flat (brute force) with filter and/or limits
Scan(vector) -> FlatKNN()
    -> Take(filtered_cols) -> Filter(expr)
    -> (*LimitExec(limit, offset))
    -> Take(remaining_cols) -> Projection()

In general, a plan has 5 stages:

  1. Source (from dataset Scan or from index, may include prefilter)
  2. Filter
  3. Sort
  4. Limit / Offset
  5. Take remaining columns / Projection
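The five stages listed above can be mimicked over an in-memory table. This is only a conceptual sketch in plain Rust: `Record` and `run_plan` are hypothetical names, and the real plan is built from DataFusion-style operators such as those shown in the plan shapes above rather than from iterator calls.

```rust
/// A toy record with a filter/sort column and a projected column.
/// (Hypothetical type, not part of Lance.)
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub id: u32,
    pub score: i32,
    pub name: &'static str,
}

/// Run the five stages in order and return the projected `name` column.
pub fn run_plan(
    source: &[Record],
    min_score: i32,
    offset: usize,
    limit: usize,
) -> Vec<&'static str> {
    // 1. Source: read the rows (here, copy them from memory).
    let mut rows: Vec<Record> = source.to_vec();
    // 2. Filter: keep rows that satisfy the predicate.
    rows.retain(|r| r.score >= min_score);
    // 3. Sort: order the surviving rows.
    rows.sort_by_key(|r| r.score);
    rows.into_iter()
        // 4. Limit / offset.
        .skip(offset)
        .take(limit)
        // 5. Projection: keep only the requested column.
        .map(|r| r.name)
        .collect()
}
```

The sort has to see every filtered row before the limit can be applied, which is why an ordering delays the first result.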

pub async fn explain_plan(&self, verbose: bool) -> Result<String>
