pub struct DFA<T> { /* private fields */ }
Expand description

A sparse deterministic finite automaton (DFA) with variable sized states.

In contrast to a dense::DFA, a sparse DFA uses a more space efficient representation for its transitions. Consequently, sparse DFAs may use much less memory than dense DFAs, but this comes at a price. In particular, reading the more space efficient transitions takes more work, and consequently, searching using a sparse DFA is typically slower than a dense DFA.

A sparse DFA can be built using the default configuration via the DFA::new constructor. Otherwise, one can configure various aspects of a dense DFA via dense::Builder, and then convert a dense DFA to a sparse DFA using dense::DFA::to_sparse.

In general, a sparse DFA supports all the same search operations as a dense DFA.

Making the choice between a dense and sparse DFA depends on your specific work load. If you can sacrifice a bit of search time performance, then a sparse DFA might be the best choice. In particular, while sparse DFAs are probably always slower than dense DFAs, you may find that they are easily fast enough for your purposes!

Type parameters

A DFA has one type parameter, T, which is used to represent the parts of a sparse DFA. T is typically a Vec<u8> or a &[u8].

The Automaton trait

This type implements the Automaton trait, which means it can be used for searching. For example:

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

let dfa = DFA::new("foo[0-9]+")?;
let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Implementations

Parse the given regular expression using a default configuration and return the corresponding sparse DFA.

If you want a non-default configuration, then use the dense::Builder to set your own configuration, and then call dense::DFA::to_sparse to create a sparse DFA.

Example
use regex_automata::{
    dfa::{Automaton, sparse},
    HalfMatch,
};

let dfa = sparse::DFA::new("foo[0-9]+bar")?;

let expected = HalfMatch::must(0, 11);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345bar")?);

Parse the given regular expressions using a default configuration and return the corresponding multi-DFA.

If you want a non-default configuration, then use the dense::Builder to set your own configuration, and then call dense::DFA::to_sparse to create a sparse DFA.

Example
use regex_automata::{
    dfa::{Automaton, sparse},
    HalfMatch,
};

let dfa = sparse::DFA::new_many(&["[0-9]+", "[a-z]+"])?;
let expected = HalfMatch::must(1, 3);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345bar")?);

Create a new DFA that matches every input.

Example
use regex_automata::{
    dfa::{Automaton, sparse},
    HalfMatch,
};

let dfa = sparse::DFA::always_match()?;

let expected = HalfMatch::must(0, 0);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"")?);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo")?);

Create a new sparse DFA that never matches any input.

Example
use regex_automata::dfa::{Automaton, sparse};

let dfa = sparse::DFA::never_match()?;
assert_eq!(None, dfa.find_leftmost_fwd(b"")?);
assert_eq!(None, dfa.find_leftmost_fwd(b"foo")?);

Cheaply return a borrowed version of this sparse DFA. Specifically, the DFA returned always uses &[u8] for its transitions.

Return an owned version of this sparse DFA. Specifically, the DFA returned always uses Vec<u8> for its transitions.

Effectively, this returns a sparse DFA whose transitions live on the heap.

Returns the memory usage, in bytes, of this DFA.

The memory usage is computed based on the number of bytes used to represent this DFA.

This does not include the stack size used up by this DFA. To compute that, use std::mem::size_of::<sparse::DFA>().

Returns true only if this DFA has starting states for each pattern.

When a DFA has starting states for each pattern, then a search with the DFA can be configured to only look for anchored matches of a specific pattern. Specifically, APIs like Automaton::find_earliest_fwd_at can accept a non-None pattern_id if and only if this method returns true. Otherwise, calling find_earliest_fwd_at will panic.

Note that if the DFA is empty, this always returns false.

Routines for converting a sparse DFA to other representations, such as raw bytes suitable for persistent storage.

Serialize this DFA as raw bytes to a Vec<u8> in little endian format.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Note that unlike a dense::DFA’s serialization methods, this does not add any initial padding to the returned bytes. Padding isn’t required for sparse DFAs since they have no alignment requirements.

Example

This example shows how to serialize and deserialize a DFA:

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// N.B. We use native endianness here to make the example work, but
// using to_bytes_little_endian would work on a little endian target.
let buf = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf)?.0;

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Serialize this DFA as raw bytes to a Vec<u8> in big endian format.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Note that unlike a dense::DFA’s serialization methods, this does not add any initial padding to the returned bytes. Padding isn’t required for sparse DFAs since they have no alignment requirements.

Example

This example shows how to serialize and deserialize a DFA:

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// N.B. We use native endianness here to make the example work, but
// using to_bytes_big_endian would work on a big endian target.
let buf = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf)?.0;

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Serialize this DFA as raw bytes to a Vec<u8> in native endian format.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Note that unlike a dense::DFA’s serialization methods, this does not add any initial padding to the returned bytes. Padding isn’t required for sparse DFAs since they have no alignment requirements.

Generally speaking, native endian format should only be used when you know that the target you’re compiling the DFA for matches the endianness of the target on which you’re compiling DFA. For example, if serialization and deserialization happen in the same process or on the same machine. Otherwise, when serializing a DFA for use in a portable environment, you’ll almost certainly want to serialize both a little endian and a big endian version and then load the correct one based on the target’s configuration.

Example

This example shows how to serialize and deserialize a DFA:

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

let buf = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf)?.0;

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Serialize this DFA as raw bytes to the given slice, in little endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
// N.B. We use native endianness here to make the example work, but
// using write_to_little_endian would work on a little endian target.
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Serialize this DFA as raw bytes to the given slice, in big endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
// N.B. We use native endianness here to make the example work, but
// using write_to_big_endian would work on a big endian target.
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Serialize this DFA as raw bytes to the given slice, in native endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Generally speaking, native endian format should only be used when you know that the target you’re compiling the DFA for matches the endianness of the target on which you’re compiling DFA. For example, if serialization and deserialization happen in the same process or on the same machine. Otherwise, when serializing a DFA for use in a portable environment, you’ll almost certainly want to serialize both a little endian and a big endian version and then load the correct one based on the target’s configuration.

Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Return the total number of bytes required to serialize this DFA.

This is useful for determining the size of the buffer required to pass to one of the serialization routines:

Passing a buffer smaller than the size returned by this method will result in a serialization error.

Example

This example shows how to dynamically allocate enough room to serialize a sparse DFA.

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

let mut buf = vec![0; original_dfa.write_to_len()];
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Safely deserialize a sparse DFA with a specific state identifier representation. Upon success, this returns both the deserialized DFA and the number of bytes read from the given slice. Namely, the contents of the slice beyond the DFA are not read.

Deserializing a DFA using this routine will never allocate heap memory. For safety purposes, the DFA’s transitions will be verified such that every transition points to a valid state. If this verification is too costly, then a DFA::from_bytes_unchecked API is provided, which will always execute in constant time.

The bytes given must be generated by one of the serialization APIs of a DFA using a semver compatible release of this crate. Those include:

The to_bytes methods allocate and return a Vec<u8> for you. The write_to methods do not allocate and write to an existing slice (which may be on the stack). Since deserialization always uses the native endianness of the target platform, the serialization API you use should match the endianness of the target platform. (It’s often a good idea to generate serialized DFAs for both forms of endianness and then load the correct one based on endianness.)

Errors

Generally speaking, it’s easier to state the conditions in which an error is not returned. All of the following must be true:

  • The bytes given must be produced by one of the serialization APIs on this DFA, as mentioned above.
  • The endianness of the target platform matches the endianness used to serialized the provided DFA.

If any of the above are not true, then an error will be returned.

Note that unlike deserializing a dense::DFA, deserializing a sparse DFA has no alignment requirements. That is, an alignment of 1 is valid.

Panics

This routine will never panic for any input.

Example

This example shows how to serialize a DFA to raw bytes, deserialize it and then use it for searching.

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

let initial = DFA::new("foo[0-9]+")?;
let bytes = initial.to_bytes_native_endian();
let dfa: DFA<&[u8]> = DFA::from_bytes(&bytes)?.0;

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);
Example: loading a DFA from static memory

One use case this library supports is the ability to serialize a DFA to disk and then use include_bytes! to store it in a compiled Rust program. Those bytes can then be cheaply deserialized into a DFA structure at runtime and used for searching without having to re-compile the DFA (which can be quite costly).

We can show this in two parts. The first part is serializing the DFA to a file:

use regex_automata::dfa::{Automaton, sparse::DFA};

let dfa = DFA::new("foo[0-9]+")?;

// Write a big endian serialized version of this DFA to a file.
let bytes = dfa.to_bytes_big_endian();
std::fs::write("foo.bigendian.dfa", &bytes)?;

// Do it again, but this time for little endian.
let bytes = dfa.to_bytes_little_endian();
std::fs::write("foo.littleendian.dfa", &bytes)?;

And now the second part is embedding the DFA into the compiled program and deserializing it at runtime on first use. We use conditional compilation to choose the correct endianness. As mentioned above, we do not need to employ any special tricks to ensure a proper alignment, since a sparse DFA has no alignment requirements.

use regex_automata::{
    dfa::{Automaton, sparse},
    HalfMatch,
};

type DFA = sparse::DFA<&'static [u8]>;

fn get_foo() -> &'static DFA {
    use std::cell::Cell;
    use std::mem::MaybeUninit;
    use std::sync::Once;

    #[cfg(target_endian = "big")]
    static BYTES: &[u8] = include_bytes!("foo.bigendian.dfa");
    #[cfg(target_endian = "little")]
    static BYTES: &[u8] = include_bytes!("foo.littleendian.dfa");

    struct Lazy(Cell<MaybeUninit<DFA>>);
    // SAFETY: This is safe because DFA impls Sync.
    unsafe impl Sync for Lazy {}

    static INIT: Once = Once::new();
    static DFA: Lazy = Lazy(Cell::new(MaybeUninit::uninit()));

    INIT.call_once(|| {
        let (dfa, _) = DFA::from_bytes(BYTES)
            .expect("serialized DFA should be valid");
        // SAFETY: This is guaranteed to only execute once, and all
        // we do with the pointer is write the DFA to it.
        unsafe {
            (*DFA.0.as_ptr()).as_mut_ptr().write(dfa);
        }
    });
    // SAFETY: DFA is guaranteed to by initialized via INIT and is
    // stored in static memory.
    unsafe {
        let dfa = (*DFA.0.as_ptr()).as_ptr();
        std::mem::transmute::<*const DFA, &'static DFA>(dfa)
    }
}

let dfa = get_foo();
let expected = HalfMatch::must(0, 8);
assert_eq!(Ok(Some(expected)), dfa.find_leftmost_fwd(b"foo12345"));

Alternatively, consider using lazy_static or once_cell, which will guarantee safety for you.

Deserialize a DFA with a specific state identifier representation in constant time by omitting the verification of the validity of the sparse transitions.

This is just like DFA::from_bytes, except it can potentially return a DFA that exhibits undefined behavior if its transitions contains invalid state identifiers.

This routine is useful if you need to deserialize a DFA cheaply and cannot afford the transition validation performed by from_bytes.

Safety

This routine is unsafe because it permits callers to provide arbitrary transitions with possibly incorrect state identifiers. While the various serialization routines will never return an incorrect DFA, there is no guarantee that the bytes provided here are correct. While from_bytes_unchecked will still do several forms of basic validation, this routine does not check that the transitions themselves are correct. Given an incorrect transition table, it is possible for the search routines to access out-of-bounds memory because of explicit bounds check elision.

Example
use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    HalfMatch,
};

let initial = DFA::new("foo[0-9]+")?;
let bytes = initial.to_bytes_native_endian();
// SAFETY: This is guaranteed to be safe since the bytes given come
// directly from a compatible serialization routine.
let dfa: DFA<&[u8]> = unsafe { DFA::from_bytes_unchecked(&bytes)?.0 };

let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.find_leftmost_fwd(b"foo12345")?);

Trait Implementations

Returns true if and only if the given identifier corresponds to a “special” state. A special state is one or more of the following: a dead state, a quit state, a match state, a start state or an accelerated state. Read more

Returns true if and only if the given identifier corresponds to a dead state. When a DFA enters a dead state, it is impossible to leave. That is, every transition on a dead state by definition leads back to the same dead state. Read more

Returns true if and only if the given identifier corresponds to a quit state. A quit state is like a dead state (it has no transitions other than to itself), except it indicates that the DFA failed to complete the search. When this occurs, callers can neither accept or reject that a match occurred. Read more

Returns true if and only if the given identifier corresponds to a match state. A match state is also referred to as a “final” state and indicates that a match has been found. Read more

Returns true if and only if the given identifier corresponds to a start state. A start state is a state in which a DFA begins a search. All searches begin in a start state. Moreover, since all matches are delayed by one byte, a start state can never be a match state. Read more

Returns true if and only if the given identifier corresponds to an accelerated state. Read more

Transitions from the current state to the next state, given the next byte of input. Read more

Transitions from the current state to the next state, given the next byte of input. Read more

Transitions from the current state to the next state for the special EOI symbol. Read more

Returns the total number of patterns compiled into this DFA. Read more

Returns the total number of patterns that match in this state. Read more

Returns the pattern ID corresponding to the given match index in the given state. Read more

Return the ID of the start state for this DFA when executing a forward search. Read more

Return the ID of the start state for this DFA when executing a reverse search. Read more

Return a slice of bytes to accelerate for the given state, if possible. Read more

Executes a forward search and returns the end position of the first match that is found as early as possible. If no match exists, then None is returned. Read more

Executes a reverse search and returns the start position of the first match that is found as early as possible. If no match exists, then None is returned. Read more

Executes a forward search and returns the end position of the leftmost match that is found. If no match exists, then None is returned. Read more

Executes a reverse search and returns the start of the position of the leftmost match that is found. If no match exists, then None is returned. Read more

Executes an overlapping forward search and returns the end position of matches as they are found. If no match exists, then None is returned. Read more

Executes a forward search and returns the end position of the first match that is found as early as possible. If no match exists, then None is returned. Read more

Executes a reverse search and returns the start position of the first match that is found as early as possible. If no match exists, then None is returned. Read more

Executes a forward search and returns the end position of the leftmost match that is found. If no match exists, then None is returned. Read more

Executes a reverse search and returns the start of the position of the leftmost match that is found. If no match exists, then None is returned. Read more

Executes an overlapping forward search and returns the end position of matches as they are found. If no match exists, then None is returned. Read more

Returns a copy of the value. Read more

Performs copy-assignment from source. Read more

Formats the value using the given formatter. Read more

Auto Trait Implementations

Blanket Implementations

Gets the TypeId of self. Read more

Immutably borrows from an owned value. Read more

Mutably borrows from an owned value. Read more

Returns the argument unchanged.

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

The resulting type after obtaining ownership.

Creates owned data from borrowed data, usually by cloning. Read more

🔬 This is a nightly-only experimental API. (toowned_clone_into)

Uses borrowed data to replace owned data, usually by cloning. Read more

The type returned in the event of a conversion error.

Performs the conversion.

The type returned in the event of a conversion error.

Performs the conversion.