Struct regex_automata::hybrid::LazyStateID

pub struct LazyStateID(_);

A state identifier especially tailored for lazy DFAs.

A lazy state ID logically represents a pointer to a DFA state. In practice, by limiting the number of DFA states it can address, it reserves some bits of its representation to encode some additional information. That additional information is called a “tag.” That tag is used to record whether the state it points to is an unknown, dead, quit, start or match state.
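The tagging idea can be sketched with a small stand-alone type. This is only an illustration: `TaggedId`, its constants and its bit positions are hypothetical and are not the real `LazyStateID` layout, which is private to the crate. It assumes a 32-bit representation with the top five bits reserved for the tags.

```rust
/// Hypothetical tagged identifier, for illustration only. The bit
/// positions below are assumptions, NOT the real `LazyStateID` layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TaggedId(u32);

impl TaggedId {
    /// One tag bit each for unknown, dead, quit, start and match.
    const TAG_BITS: u32 = 5;
    /// Largest state index that still leaves the tag bits free.
    const MAX_INDEX: u32 = (1 << (32 - Self::TAG_BITS)) - 1;

    const UNKNOWN: u32 = 1 << 31;
    const DEAD: u32 = 1 << 30;
    const QUIT: u32 = 1 << 29;
    const START: u32 = 1 << 28;
    const MATCH: u32 = 1 << 27;

    /// Creates an untagged ID, or `None` if the index needs the tag bits.
    fn new(index: u32) -> Option<TaggedId> {
        if index > Self::MAX_INDEX {
            None
        } else {
            Some(TaggedId(index))
        }
    }

    /// Returns a copy of this ID with the given tag bit set.
    fn with_tag(self, tag: u32) -> TaggedId {
        TaggedId(self.0 | tag)
    }

    /// Returns the state index with all tag bits masked off.
    fn index(self) -> u32 {
        self.0 & Self::MAX_INDEX
    }

    /// A single comparison answers "is any tag set?".
    fn is_tagged(self) -> bool {
        self.0 > Self::MAX_INDEX
    }

    fn is_match(self) -> bool {
        self.0 & Self::MATCH != 0
    }

    fn is_dead(self) -> bool {
        self.0 & Self::DEAD != 0
    }
}
```

In this sketch, `TaggedId::new(42)` yields an untagged ID, and calling `with_tag(TaggedId::MATCH)` on it produces an ID for which `is_tagged()` and `is_match()` are both true while `index()` still returns 42. The cost of the scheme is the smaller address space: indices above `MAX_INDEX` cannot be represented.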

When implementing a low-level search routine with a lazy DFA, it is necessary to query the type of the current state to know what to do:

Unknown - The state has not yet been computed. The parameters used to get this state ID must be re-passed to DFA::next_state, which will never return an unknown state ID.
Dead - A dead state only has transitions to itself. It indicates that the search cannot do anything else and should stop with whatever result it has.
Quit - A quit state indicates that the automaton could not answer whether a match exists or not. Correct search implementations must return a MatchError::Quit.
Start - A start state indicates that the automaton will begin searching at a starting state. Branching on this isn’t required for correctness, but a common optimization is to use this to more quickly look for a prefix.
Match - A match state indicates that a match has been found. Depending on the semantics of your search implementation, it may either continue until the end of the haystack or a dead state, or it might quit and return the match immediately.

As an optimization, the is_tagged predicate can be used to determine if a tag exists at all. This is useful to avoid branching on all of the above types for every byte searched.
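The branching described above can be sketched as a single classification function that takes the `is_tagged` fast path first. Everything named here is hypothetical: `StateKind`, the `StateTags` trait, `classify` and `DemoId` are not part of this crate. The trait simply mirrors the six predicates that `LazyStateID` exposes, and the order of the checks (match, dead, quit, unknown, start) is one reasonable choice rather than a requirement.

```rust
/// Hypothetical summary of what a search loop needs to know about a state.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum StateKind {
    Ordinary,
    Unknown,
    Dead,
    Quit,
    Start,
    Match,
}

/// Hypothetical trait mirroring the predicates on `LazyStateID`.
trait StateTags {
    fn is_tagged(&self) -> bool;
    fn is_unknown(&self) -> bool;
    fn is_dead(&self) -> bool;
    fn is_quit(&self) -> bool;
    fn is_start(&self) -> bool;
    fn is_match(&self) -> bool;
}

/// Classifies a state ID, testing `is_tagged` first so that the common
/// case (an ordinary state) costs only one branch.
fn classify<S: StateTags>(sid: &S) -> StateKind {
    if !sid.is_tagged() {
        return StateKind::Ordinary;
    }
    if sid.is_match() {
        StateKind::Match
    } else if sid.is_dead() {
        StateKind::Dead
    } else if sid.is_quit() {
        StateKind::Quit
    } else if sid.is_unknown() {
        StateKind::Unknown
    } else if sid.is_start() {
        StateKind::Start
    } else {
        // Tagged but no predicate matched: treated as ordinary in this sketch.
        StateKind::Ordinary
    }
}

/// Hypothetical stand-in for a real state ID, used only to exercise `classify`.
#[derive(Clone, Copy, Debug)]
struct DemoId(StateKind);

impl StateTags for DemoId {
    fn is_tagged(&self) -> bool {
        self.0 != StateKind::Ordinary
    }
    fn is_unknown(&self) -> bool {
        self.0 == StateKind::Unknown
    }
    fn is_dead(&self) -> bool {
        self.0 == StateKind::Dead
    }
    fn is_quit(&self) -> bool {
        self.0 == StateKind::Quit
    }
    fn is_start(&self) -> bool {
        self.0 == StateKind::Start
    }
    fn is_match(&self) -> bool {
        self.0 == StateKind::Match
    }
}
```

A search loop would call something like `classify` once per byte and `match` on the result. In a hot loop it is usually preferable to inline the checks, but the shape of the branching stays the same.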

Example

This example shows how LazyStateID can be used to implement a correct search routine with minimal branching. In particular, this search routine implements “leftmost” matching, which means that it doesn’t immediately stop once a match is found. Instead, it continues until it reaches a dead state.

Notice also how a correct search implementation deals with CacheErrors returned by some of the lazy DFA routines. When a CacheError occurs, it returns MatchError::GaveUp.

use regex_automata::{
    hybrid::dfa::{Cache, DFA},
    HalfMatch, MatchError, PatternID,
};

fn find_leftmost_first(
    dfa: &DFA,
    cache: &mut Cache,
    haystack: &[u8],
) -> Result<Option<HalfMatch>, MatchError> {
    // The start state is determined by inspecting the position and the
    // initial bytes of the haystack. Note that start states can never
    // be match states (since DFAs in this crate delay matches by 1
    // byte), so we don't need to check if the start state is a match.
    let mut sid = dfa.start_state_forward(
        cache, None, haystack, 0, haystack.len(),
    ).map_err(|_| MatchError::GaveUp { offset: 0 })?;
    let mut last_match = None;
    // Walk all the bytes in the haystack. We can quit early if we see
    // a dead or a quit state. The former means the automaton will
    // never transition to any other state. The latter means that the
    // automaton entered a condition in which its search failed.
    for (i, &b) in haystack.iter().enumerate() {
        sid = dfa
            .next_state(cache, sid, b)
            .map_err(|_| MatchError::GaveUp { offset: i })?;
        if sid.is_tagged() {
            if sid.is_match() {
                last_match = Some(HalfMatch::new(
                    dfa.match_pattern(cache, sid, 0),
                    i,
                ));
            } else if sid.is_dead() {
                return Ok(last_match);
            } else if sid.is_quit() {
                // It is possible to enter into a quit state after
                // observing a match has occurred. In that case, we
                // should return the match instead of an error.
                if last_match.is_some() {
                    return Ok(last_match);
                }
                return Err(MatchError::Quit { byte: b, offset: i });
            }
            // Implementors may also want to check for start states and
            // handle them differently for performance reasons. But it is
            // not necessary for correctness.
        }
    }
    // Matches are always delayed by 1 byte, so we must explicitly walk
    // the special "EOI" transition at the end of the search.
    sid = dfa
        .next_eoi_state(cache, sid)
        .map_err(|_| MatchError::GaveUp { offset: haystack.len() })?;
    if sid.is_match() {
        last_match = Some(HalfMatch::new(
            dfa.match_pattern(cache, sid, 0),
            haystack.len(),
        ));
    }
    Ok(last_match)
}

// We use a greedy '+' operator to show how the search doesn't just stop
// once a match is detected. It continues extending the match. Using
// '[a-z]+?' would also work as expected and stop the search early.
// Greediness is built into the automaton.
let dfa = DFA::new(r"[a-z]+")?;
let mut cache = dfa.create_cache();
let haystack = "123 foobar 4567".as_bytes();
let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
assert_eq!(mat.pattern().as_usize(), 0);
assert_eq!(mat.offset(), 10);

// Here's another example that tests our handling of the special
// EOI transition. This will fail to find a match if we don't call
// 'next_eoi_state' at the end of the search since the match isn't found
// until the final byte in the haystack.
let dfa = DFA::new(r"[0-9]{4}")?;
let mut cache = dfa.create_cache();
let haystack = "123 foobar 4567".as_bytes();
let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
assert_eq!(mat.pattern().as_usize(), 0);
assert_eq!(mat.offset(), 15);

// And note that our search implementation above automatically works
// with multi-DFAs. Namely, `dfa.match_pattern(cache, sid, 0)` selects
// the appropriate pattern ID for us.
let dfa = DFA::new_many(&[r"[a-z]+", r"[0-9]+"])?;
let mut cache = dfa.create_cache();
let haystack = "123 foobar 4567".as_bytes();
let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
assert_eq!(mat.pattern().as_usize(), 1);
assert_eq!(mat.offset(), 3);
let mat = find_leftmost_first(&dfa, &mut cache, &haystack[3..])?.unwrap();
assert_eq!(mat.pattern().as_usize(), 0);
assert_eq!(mat.offset(), 7);
let mat = find_leftmost_first(&dfa, &mut cache, &haystack[10..])?.unwrap();
assert_eq!(mat.pattern().as_usize(), 1);
assert_eq!(mat.offset(), 5);

Implementations

impl LazyStateID

pub const fn is_tagged(&self) -> bool

pub const fn is_unknown(&self) -> bool

pub const fn is_dead(&self) -> bool

pub const fn is_quit(&self) -> bool

pub const fn is_start(&self) -> bool

pub const fn is_match(&self) -> bool

Trait Implementations

impl Clone for LazyStateID

fn clone(&self) -> LazyStateID

fn clone_from(&mut self, source: &Self)

impl Debug for LazyStateID

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl Default for LazyStateID

fn default() -> LazyStateID

impl Hash for LazyStateID

fn hash<__H: Hasher>(&self, state: &mut __H)

fn hash_slice<H>(data: &[Self], state: &mut H) where H: Hasher,

impl Ord for LazyStateID

fn cmp(&self, other: &LazyStateID) -> Ordering

fn max(self, other: Self) -> Self

fn min(self, other: Self) -> Self

fn clamp(self, min: Self, max: Self) -> Self

impl PartialEq<LazyStateID> for LazyStateID

fn eq(&self, other: &LazyStateID) -> bool

fn ne(&self, other: &LazyStateID) -> bool

impl PartialOrd<LazyStateID> for LazyStateID

fn partial_cmp(&self, other: &LazyStateID) -> Option<Ordering>

fn lt(&self, other: &Rhs) -> bool

fn le(&self, other: &Rhs) -> bool

fn gt(&self, other: &Rhs) -> bool

fn ge(&self, other: &Rhs) -> bool

impl Copy for LazyStateID

impl Eq for LazyStateID

impl StructuralEq for LazyStateID

impl StructuralPartialEq for LazyStateID

Auto Trait Implementations

impl RefUnwindSafe for LazyStateID

impl Send for LazyStateID

impl Sync for LazyStateID

impl Unpin for LazyStateID

impl UnwindSafe for LazyStateID

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

impl<T> ToOwned for T where T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>