[][src]Struct tantivy_fst::raw::Fst

pub struct Fst<Data = Vec<u8>> { /* fields omitted */ }

An acyclic deterministic finite state transducer.

How does it work?

The short answer: it's just like a prefix trie, which compresses keys based only on their prefixes, except that a automaton/transducer also compresses suffixes.

The longer answer is that keys in an automaton are stored only in the transitions from one state to another. A key can be acquired by tracing a path from the root of the automaton to any match state. The inputs along each transition are concatenated. Once a match state is reached, the concatenation of inputs up until that point corresponds to a single key.

But why is it called a transducer instead of an automaton? A finite state transducer is just like a finite state automaton, except that it has output transitions in addition to input transitions. Namely, the value associated with any particular key is determined by summing the outputs along every input transition that leads to the key's corresponding match state.

This is best demonstrated with a couple images. First, let's ignore the "transducer" aspect and focus on a plain automaton.

Consider that your keys are abbreviations of some of the months in the Gregorian calendar:

This example is not tested
jan
feb
mar
apr
may
jun
jul

The corresponding automaton that stores all of these as keys looks like this:

finite state automaton

Notice here how the prefix and suffix of jan and jun are shared. Similarly, the prefixes of jun and jul are shared and the prefixes of mar and may are shared.

All of the keys from this automaton can be enumerated in lexicographic order by following every transition from each node in lexicographic order. Since it is acyclic, the procedure will terminate.

A key can be found by tracing it through the transitions in the automaton. For example, the key aug is known not to be in the automaton by only visiting the root state (because there is no a transition). For another example, the key jax is known not to be in the set only after moving through the transitions for j and a. Namely, after those transitions are followed, there are no transitions for x.

Notice here that looking up a key is proportional the length of the key itself. Namely, lookup time is not affected by the number of keys in the automaton!

Additionally, notice that the automaton exploits the fact that many keys share common prefixes and suffixes. For example, jun and jul are represented with no more states than would be required to represent either one on its own. Instead, the only change is a single extra transition. This is a form of compression and is key to how the automatons produced by this crate are so small.

Let's move on to finite state transducers. Consider the same set of keys as above, but let's assign their numeric month values:

This example is not tested
jan,1
feb,2
mar,3
apr,4
may,5
jun,6
jul,7

The corresponding transducer looks very similar to the automaton above, except outputs have been added to some of the transitions:

finite state transducer

All of the operations with a transducer are the same as described above for automatons. Additionally, the same compression techniques are used: common prefixes and suffixes in keys are exploited.

The key difference is that some transitions have been given an output. As one follows input transitions, one must sum the outputs as they are seen. (A transition with no output represents the additive identity, or 0 in this case.) For example, when looking up feb, the transition f has output 2, the transition e has output 0, and the transition b also has output 0. The sum of these is 2, which is exactly the value we associated with feb.

For another more interesting example, consider jul. The j transition has output 1, the u transition has output 5 and the l transition has output 1. Summing these together gets us 7, which is again the correct value associated with jul. Notice that if we instead looked up the jun key, then the n transition would be followed instead of the l transition, which has no output. Therefore, the jun key equals 1+5+0=6.

The trick to transducers is that there exists a unique path through the transducer for every key, and its outputs are stored appropriately along this path such that the correct value is returned when they are all summed together. This process also enables the data that makes up each value to be shared across many values in the transducer in exactly the same way that keys are shared. This is yet another form of compression!

Bonus: a billion strings

The amount of compression one can get from automata can be absolutely ridiuclous. Consider the particular case of storing all billion strings in the range 0000000001-1000000000, e.g.,

This example is not tested
0000000001
0000000002
...
0000000100
0000000101
...
0999999999
1000000000

The corresponding automaton looks like this:

![finite state automaton - one billion strings] (http://burntsushi.net/stuff/one-billion.png)

Indeed, the on disk size of this automaton is a mere 251 bytes.

Of course, this is a bit of a pathological best case, but it does serve to show how good compression can be in the optimal case.

Also, check out the corresponding transducer that maps each string to its integer value. It's a bit bigger, but still only takes up 896 bytes of space on disk. This demonstrates that output values are also compressible.

Does this crate produce minimal transducers?

For any non-trivial sized set of keys, it is unlikely that this crate will produce a minimal transducer. As far as this author knows, guaranteeing a minimal transducer requires working memory proportional to the number of states. This can be quite costly and is anathema to the main design goal of this crate: provide the ability to work with gigantic sets of strings with constant memory overhead.

Instead, construction of a finite state transducer uses a cache of states. More frequently used states are cached and reused, which provides reasonably good compression ratios. (No comprehensive benchmarks exist to back up this claim.)

It is possible that this crate may expose a way to guarantee minimal construction of transducers at the expense of exorbitant memory requirements.

Bibliography

I initially got the idea to use finite state tranducers to represent ordered sets/maps from Michael McCandless' work on incorporating transducers in Lucene.

However, my work would also not have been possible without the hard work of many academics, especially Jan Daciuk.

Methods

impl<Data: Deref<Target = [u8]>> Fst<Data>[src]

pub fn new(data: Data) -> Result<Fst<Data>>[src]

Open a Fst from a given data.

pub fn get<B: AsRef<[u8]>>(&self, key: B) -> Option<Output>[src]

Retrieves the value associated with a key.

If the key does not exist, then None is returned.

pub fn contains_key<B: AsRef<[u8]>>(&self, key: B) -> bool[src]

Returns true if and only if the given key is in this FST.

pub fn stream(&self) -> Stream[src]

Return a lexicographically ordered stream of all key-value pairs in this fst.

pub fn range(&self) -> StreamBuilder[src]

Return a builder for range queries.

A range query returns a subset of key-value pairs in this fst in a range given in lexicographic order.

pub fn search<A: Automaton>(&self, aut: A) -> StreamBuilder<A>[src]

Executes an automaton on the keys of this map.

pub fn len(&self) -> usize[src]

Returns the number of keys in this fst.

pub fn is_empty(&self) -> bool[src]

Returns true if and only if this fst has no keys.

pub fn size(&self) -> usize[src]

Returns the number of bytes used by this fst.

pub fn op(&self) -> OpBuilder[src]

Creates a new fst operation with this fst added to it.

The OpBuilder type can be used to add additional fst streams and perform set operations like union, intersection, difference and symmetric difference on the keys of the fst. These set operations also allow one to specify how conflicting values are merged in the stream.

pub fn is_disjoint<'f, I, S>(&self, stream: I) -> bool where
    I: for<'a> IntoStreamer<'a, Into = S, Item = (&'a [u8], Output)>,
    S: 'f + for<'a> Streamer<'a, Item = (&'a [u8], Output)>, 
[src]

Returns true if and only if the self fst is disjoint with the fst stream.

stream must be a lexicographically ordered sequence of byte strings with associated values.

pub fn is_subset<'f, I, S>(&self, stream: I) -> bool where
    I: for<'a> IntoStreamer<'a, Into = S, Item = (&'a [u8], Output)>,
    S: 'f + for<'a> Streamer<'a, Item = (&'a [u8], Output)>, 
[src]

Returns true if and only if the self fst is a subset of the fst stream.

stream must be a lexicographically ordered sequence of byte strings with associated values.

pub fn is_superset<'f, I, S>(&self, stream: I) -> bool where
    I: for<'a> IntoStreamer<'a, Into = S, Item = (&'a [u8], Output)>,
    S: 'f + for<'a> Streamer<'a, Item = (&'a [u8], Output)>, 
[src]

Returns true if and only if the self fst is a superset of the fst stream.

stream must be a lexicographically ordered sequence of byte strings with associated values.

pub fn fst_type(&self) -> FstType[src]

Returns the underlying type of this fst.

FstType is a convention used to indicate the type of the underlying transducer.

This crate reserves the range 0-255 (inclusive) but currently leaves the meaning of 0-255 unspecified.

pub fn root(&self) -> Node[src]

Returns the root node of this fst.

pub fn node(&self, addr: CompiledAddr) -> Node[src]

Returns the node at the given address.

Node addresses can be obtained by reading transitions on Node values.

pub fn to_vec(&self) -> Vec<u8>[src]

Returns a copy of the binary contents of this FST.

Trait Implementations

impl<Data> AsRef<Fst<Data>> for Map<Data>[src]

Returns the underlying finite state transducer.

impl<Data> From<Fst<Data>> for Map<Data>[src]

impl<'a, 'f, Data> IntoStreamer<'a> for &'f Fst<Data> where
    Data: Deref<Target = [u8]>, 
[src]

type Item = (&'a [u8], Output)

The type of the item emitted by the stream.

type Into = Stream<'f>

The type of the stream to be constructed.

Auto Trait Implementations

impl<Data> RefUnwindSafe for Fst<Data> where
    Data: RefUnwindSafe

impl<Data> Send for Fst<Data> where
    Data: Send

impl<Data> Sync for Fst<Data> where
    Data: Sync

impl<Data> Unpin for Fst<Data> where
    Data: Unpin

impl<Data> UnwindSafe for Fst<Data> where
    Data: UnwindSafe

Blanket Implementations

impl<T> Any for T where
    T: 'static + ?Sized
[src]

impl<T> Borrow<T> for T where
    T: ?Sized
[src]

impl<T> BorrowMut<T> for T where
    T: ?Sized
[src]

impl<T> From<T> for T[src]

impl<T, U> Into<U> for T where
    U: From<T>, 
[src]

impl<T, U> TryFrom<U> for T where
    U: Into<T>, 
[src]

type Error = Infallible

The type returned in the event of a conversion error.

impl<T, U> TryInto<U> for T where
    U: TryFrom<T>, 
[src]

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.