pub struct Config { /* private fields */ }
dfa-search
or dfa-onepass
) and crate feature dfa-search
and crate feature dfa-build
only.Expand description
The configuration used for compiling a dense DFA.
As a convenience, DFA::config
is an alias for Config::new
. The
advantage of the former is that it often lets you avoid importing the
Config
type directly.
A dense DFA configuration is a simple data object that is typically used
with dense::Builder::configure
.
The default configuration guarantees that a search will never return
a “quit” error, although it is possible for a search to fail if
Config::starts_for_each_pattern
wasn’t enabled (which it is
not by default) and an Anchored::Pattern
mode is requested via
Input
.
Implementations§
source§impl Config
impl Config
sourcepub fn accelerate(self, yes: bool) -> Config
pub fn accelerate(self, yes: bool) -> Config
Enable state acceleration.
When enabled, DFA construction will analyze each state to determine
whether it is eligible for simple acceleration. Acceleration typically
occurs when most of a state’s transitions loop back to itself, leaving
only a select few bytes that will exit the state. When this occurs,
other routines like memchr
can be used to look for those bytes which
may be much faster than traversing the DFA.
Callers may elect to disable this if consistent performance is more desirable than variable performance. Namely, acceleration can sometimes make searching slower than it otherwise would be if the transitions that leave accelerated states are traversed frequently.
See Automaton::accelerator
for an example.
This is enabled by default.
sourcepub fn prefilter(self, pre: Option<Prefilter>) -> Config
pub fn prefilter(self, pre: Option<Prefilter>) -> Config
Set a prefilter to be used whenever a start state is entered.
A Prefilter
in this context is meant to accelerate searches by
looking for literal prefixes that every match for the corresponding
pattern (or patterns) must start with. Once a prefilter produces a
match, the underlying search routine continues on to try and confirm
the match.
Be warned that setting a prefilter does not guarantee that the search will be faster. While it’s usually a good bet, if the prefilter produces a lot of false positive candidates (i.e., positions matched by the prefilter but not by the regex), then the overall result can be slower than if you had just executed the regex engine without any prefilters.
Note that unless Config::specialize_start_states
has been
explicitly set, then setting this will also enable (when pre
is
Some
) or disable (when pre
is None
) start state specialization.
This occurs because without start state specialization, a prefilter
is likely to be less effective. And without a prefilter, start state
specialization is usually pointless.
WARNING: Note that prefilters are not preserved as part of serialization. Serializing a DFA will drop its prefilter.
By default no prefilter is set.
§Example
use regex_automata::{
dfa::{dense::DFA, Automaton},
util::prefilter::Prefilter,
Input, HalfMatch, MatchKind,
};
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["foo", "bar"]);
let re = DFA::builder()
.configure(DFA::config().prefilter(pre))
.build(r"(foo|bar)[a-z]+")?;
let input = Input::new("foo1 barfox bar");
assert_eq!(
Some(HalfMatch::must(0, 11)),
re.try_search_fwd(&input)?,
);
Be warned though that an incorrect prefilter can lead to incorrect results!
use regex_automata::{
dfa::{dense::DFA, Automaton},
util::prefilter::Prefilter,
Input, HalfMatch, MatchKind,
};
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["foo", "car"]);
let re = DFA::builder()
.configure(DFA::config().prefilter(pre))
.build(r"(foo|bar)[a-z]+")?;
let input = Input::new("foo1 barfox bar");
assert_eq!(
// No match reported even though there clearly is one!
None,
re.try_search_fwd(&input)?,
);
sourcepub fn minimize(self, yes: bool) -> Config
pub fn minimize(self, yes: bool) -> Config
Minimize the DFA.
When enabled, the DFA built will be minimized such that it is as small as possible.
Whether one enables minimization or not depends on the types of costs
you’re willing to pay and how much you care about its benefits. In
particular, minimization has worst case O(n*k*logn)
time and O(k*n)
space, where n
is the number of DFA states and k
is the alphabet
size. In practice, minimization can be quite costly in terms of both
space and time, so it should only be done if you’re willing to wait
longer to produce a DFA. In general, you might want a minimal DFA in
the following circumstances:
- You would like to optimize for the size of the automaton. This can manifest in one of two ways. Firstly, if you’re converting the DFA into Rust code (or a table embedded in the code), then a minimal DFA will translate into a corresponding reduction in code size, and thus, also the final compiled binary size. Secondly, if you are building many DFAs and putting them on the heap, you’ll be able to fit more if they are smaller. Note though that building a minimal DFA itself requires additional space; you only realize the space savings once the minimal DFA is constructed (at which point, the space used for minimization is freed).
- You’ve observed that a smaller DFA results in faster match performance. Naively, this isn’t guaranteed since there is no inherent difference between matching with a bigger-than-minimal DFA and a minimal DFA. However, a smaller DFA may make use of your CPU’s cache more efficiently.
- You are trying to establish an equivalence between regular languages. The standard method for this is to build a minimal DFA for each language and then compare them. If the DFAs are equivalent (up to state renaming), then the languages are equivalent.
Typically, minimization only makes sense as an offline process. That is, one might minimize a DFA before serializing it to persistent storage. In practical terms, minimization can take around an order of magnitude more time than compiling the initial DFA via determinization.
This option is disabled by default.
sourcepub fn match_kind(self, kind: MatchKind) -> Config
pub fn match_kind(self, kind: MatchKind) -> Config
Set the desired match semantics.
The default is MatchKind::LeftmostFirst
, which corresponds to the
match semantics of Perl-like regex engines. That is, when multiple
patterns would match at the same leftmost position, the pattern that
appears first in the concrete syntax is chosen.
Currently, the only other kind of match semantics supported is
MatchKind::All
. This corresponds to classical DFA construction
where all possible matches are added to the DFA.
Typically, All
is used when one wants to execute an overlapping
search and LeftmostFirst
otherwise. In particular, it rarely makes
sense to use All
with the various “leftmost” find routines, since the
leftmost routines depend on the LeftmostFirst
automata construction
strategy. Specifically, LeftmostFirst
adds dead states to the DFA
as a way to terminate the search and report a match. LeftmostFirst
also supports non-greedy matches using this strategy where as All
does not.
§Example: overlapping search
This example shows the typical use of MatchKind::All
, which is to
report overlapping matches.
use regex_automata::{
dfa::{Automaton, OverlappingState, dense},
HalfMatch, Input, MatchKind,
};
let dfa = dense::Builder::new()
.configure(dense::Config::new().match_kind(MatchKind::All))
.build_many(&[r"\w+$", r"\S+$"])?;
let input = Input::new("@foo");
let mut state = OverlappingState::start();
let expected = Some(HalfMatch::must(1, 4));
dfa.try_search_overlapping_fwd(&input, &mut state)?;
assert_eq!(expected, state.get_match());
// The first pattern also matches at the same position, so re-running
// the search will yield another match. Notice also that the first
// pattern is returned after the second. This is because the second
// pattern begins its match before the first, is therefore an earlier
// match and is thus reported first.
let expected = Some(HalfMatch::must(0, 4));
dfa.try_search_overlapping_fwd(&input, &mut state)?;
assert_eq!(expected, state.get_match());
§Example: reverse automaton to find start of match
Another example for using MatchKind::All
is for constructing a
reverse automaton to find the start of a match. All
semantics are
used for this in order to find the longest possible match, which
corresponds to the leftmost starting position.
Note that if you need the starting position then
dfa::regex::Regex
will handle this for
you, so it’s usually not necessary to do this yourself.
use regex_automata::{
dfa::{dense, Automaton, StartKind},
nfa::thompson::NFA,
Anchored, HalfMatch, Input, MatchKind,
};
let haystack = "123foobar456".as_bytes();
let pattern = r"[a-z]+r";
let dfa_fwd = dense::DFA::new(pattern)?;
let dfa_rev = dense::Builder::new()
.thompson(NFA::config().reverse(true))
.configure(dense::Config::new()
// This isn't strictly necessary since both anchored and
// unanchored searches are supported by default. But since
// finding the start-of-match only requires anchored searches,
// we can get rid of the unanchored configuration and possibly
// slim down our DFA considerably.
.start_kind(StartKind::Anchored)
.match_kind(MatchKind::All)
)
.build(pattern)?;
let expected_fwd = HalfMatch::must(0, 9);
let expected_rev = HalfMatch::must(0, 3);
let got_fwd = dfa_fwd.try_search_fwd(&Input::new(haystack))?.unwrap();
// Here we don't specify the pattern to search for since there's only
// one pattern and we're doing a leftmost search. But if this were an
// overlapping search, you'd need to specify the pattern that matched
// in the forward direction. (Otherwise, you might wind up finding the
// starting position of a match of some other pattern.) That in turn
// requires building the reverse automaton with starts_for_each_pattern
// enabled. Indeed, this is what Regex does internally.
let input = Input::new(haystack)
.range(..got_fwd.offset())
.anchored(Anchored::Yes);
let got_rev = dfa_rev.try_search_rev(&input)?.unwrap();
assert_eq!(expected_fwd, got_fwd);
assert_eq!(expected_rev, got_rev);
sourcepub fn start_kind(self, kind: StartKind) -> Config
pub fn start_kind(self, kind: StartKind) -> Config
The type of starting state configuration to use for a DFA.
By default, the starting state configuration is StartKind::Both
.
§Example
use regex_automata::{
dfa::{dense::DFA, Automaton, StartKind},
Anchored, HalfMatch, Input,
};
let haystack = "quux foo123";
let expected = HalfMatch::must(0, 11);
// By default, DFAs support both anchored and unanchored searches.
let dfa = DFA::new(r"[0-9]+")?;
let input = Input::new(haystack);
assert_eq!(Some(expected), dfa.try_search_fwd(&input)?);
// But if we only need anchored searches, then we can build a DFA
// that only supports anchored searches. This leads to a smaller DFA
// (potentially significantly smaller in some cases), but a DFA that
// will panic if you try to use it with an unanchored search.
let dfa = DFA::builder()
.configure(DFA::config().start_kind(StartKind::Anchored))
.build(r"[0-9]+")?;
let input = Input::new(haystack)
.range(8..)
.anchored(Anchored::Yes);
assert_eq!(Some(expected), dfa.try_search_fwd(&input)?);
sourcepub fn starts_for_each_pattern(self, yes: bool) -> Config
pub fn starts_for_each_pattern(self, yes: bool) -> Config
Whether to compile a separate start state for each pattern in the automaton.
When enabled, a separate anchored start state is added for each pattern in the DFA. When this start state is used, then the DFA will only search for matches for the pattern specified, even if there are other patterns in the DFA.
The main downside of this option is that it can potentially increase the size of the DFA and/or increase the time it takes to build the DFA.
There are a few reasons one might want to enable this (it’s disabled by default):
- When looking for the start of an overlapping match (using a
reverse DFA), doing it correctly requires starting the reverse search
using the starting state of the pattern that matched in the forward
direction. Indeed, when building a
Regex
, it will automatically enable this option when building the reverse DFA internally. - When you want to use a DFA with multiple patterns to both search for matches of any pattern or to search for anchored matches of one particular pattern while using the same DFA. (Otherwise, you would need to compile a new DFA for each pattern.)
- Since the start states added for each pattern are anchored, if you
compile an unanchored DFA with one pattern while also enabling this
option, then you can use the same DFA to perform anchored or unanchored
searches. The latter you get with the standard search APIs. The former
you get from the various
_at
search methods that allow you specify a pattern ID to search for.
By default this is disabled.
§Example
This example shows how to use this option to permit the same DFA to run both anchored and unanchored searches for a single pattern.
use regex_automata::{
dfa::{dense, Automaton},
Anchored, HalfMatch, PatternID, Input,
};
let dfa = dense::Builder::new()
.configure(dense::Config::new().starts_for_each_pattern(true))
.build(r"foo[0-9]+")?;
let haystack = "quux foo123";
// Here's a normal unanchored search. Notice that we use 'None' for the
// pattern ID. Since the DFA was built as an unanchored machine, it
// use its default unanchored starting state.
let expected = HalfMatch::must(0, 11);
let input = Input::new(haystack);
assert_eq!(Some(expected), dfa.try_search_fwd(&input)?);
// But now if we explicitly specify the pattern to search ('0' being
// the only pattern in the DFA), then it will use the starting state
// for that specific pattern which is always anchored. Since the
// pattern doesn't have a match at the beginning of the haystack, we
// find nothing.
let input = Input::new(haystack)
.anchored(Anchored::Pattern(PatternID::must(0)));
assert_eq!(None, dfa.try_search_fwd(&input)?);
// And finally, an anchored search is not the same as putting a '^' at
// beginning of the pattern. An anchored search can only match at the
// beginning of the *search*, which we can change:
let input = Input::new(haystack)
.anchored(Anchored::Pattern(PatternID::must(0)))
.range(5..);
assert_eq!(Some(expected), dfa.try_search_fwd(&input)?);
sourcepub fn byte_classes(self, yes: bool) -> Config
pub fn byte_classes(self, yes: bool) -> Config
Whether to attempt to shrink the size of the DFA’s alphabet or not.
This option is enabled by default and should never be disabled unless one is debugging a generated DFA.
When enabled, the DFA will use a map from all possible bytes to their
corresponding equivalence class. Each equivalence class represents a
set of bytes that does not discriminate between a match and a non-match
in the DFA. For example, the pattern [ab]+
has at least two
equivalence classes: a set containing a
and b
and a set containing
every byte except for a
and b
. a
and b
are in the same
equivalence class because they never discriminate between a match and a
non-match.
The advantage of this map is that the size of the transition table
can be reduced drastically from #states * 256 * sizeof(StateID)
to
#states * k * sizeof(StateID)
where k
is the number of equivalence
classes (rounded up to the nearest power of 2). As a result, total
space usage can decrease substantially. Moreover, since a smaller
alphabet is used, DFA compilation becomes faster as well.
WARNING: This is only useful for debugging DFAs. Disabling this does not yield any speed advantages. Namely, even when this is disabled, a byte class map is still used while searching. The only difference is that every byte will be forced into its own distinct equivalence class. This is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.
sourcepub fn unicode_word_boundary(self, yes: bool) -> Config
pub fn unicode_word_boundary(self, yes: bool) -> Config
Heuristically enable Unicode word boundaries.
When set, this will attempt to implement Unicode word boundaries as if
they were ASCII word boundaries. This only works when the search input
is ASCII only. If a non-ASCII byte is observed while searching, then a
MatchError::quit
error is returned.
A possible alternative to enabling this option is to simply use an
ASCII word boundary, e.g., via (?-u:\b)
. The main reason to use this
option is if you absolutely need Unicode support. This option lets one
use a fast search implementation (a DFA) for some potentially very
common cases, while providing the option to fall back to some other
regex engine to handle the general case when an error is returned.
If the pattern provided has no Unicode word boundary in it, then this option has no effect. (That is, quitting on a non-ASCII byte only occurs when this option is enabled and a Unicode word boundary is present in the pattern.)
This is almost equivalent to setting all non-ASCII bytes to be quit bytes. The only difference is that this will cause non-ASCII bytes to be quit bytes only when a Unicode word boundary is present in the pattern.
When enabling this option, callers must be prepared to handle
a MatchError
error during search.
When using a Regex
, this corresponds
to using the try_
suite of methods. Alternatively, if
callers can guarantee that their input is ASCII only, then a
MatchError::quit
error will never be
returned while searching.
This is disabled by default.
§Example
This example shows how to heuristically enable Unicode word boundaries in a pattern. It also shows what happens when a search comes across a non-ASCII byte.
use regex_automata::{
dfa::{Automaton, dense},
HalfMatch, Input, MatchError,
};
let dfa = dense::Builder::new()
.configure(dense::Config::new().unicode_word_boundary(true))
.build(r"\b[0-9]+\b")?;
// The match occurs before the search ever observes the snowman
// character, so no error occurs.
let haystack = "foo 123 ☃".as_bytes();
let expected = Some(HalfMatch::must(0, 7));
let got = dfa.try_search_fwd(&Input::new(haystack))?;
assert_eq!(expected, got);
// Notice that this search fails, even though the snowman character
// occurs after the ending match offset. This is because search
// routines read one byte past the end of the search to account for
// look-around, and indeed, this is required here to determine whether
// the trailing \b matches.
let haystack = "foo 123 ☃".as_bytes();
let expected = MatchError::quit(0xE2, 8);
let got = dfa.try_search_fwd(&Input::new(haystack));
assert_eq!(Err(expected), got);
// Another example is executing a search where the span of the haystack
// we specify is all ASCII, but there is non-ASCII just before it. This
// correctly also reports an error.
let input = Input::new("β123").range(2..);
let expected = MatchError::quit(0xB2, 1);
let got = dfa.try_search_fwd(&input);
assert_eq!(Err(expected), got);
// And similarly for the trailing word boundary.
let input = Input::new("123β").range(..3);
let expected = MatchError::quit(0xCE, 3);
let got = dfa.try_search_fwd(&input);
assert_eq!(Err(expected), got);
sourcepub fn quit(self, byte: u8, yes: bool) -> Config
pub fn quit(self, byte: u8, yes: bool) -> Config
Add a “quit” byte to the DFA.
When a quit byte is seen during search time, then search will return
a MatchError::quit
error indicating the
offset at which the search stopped.
A quit byte will always overrule any other aspects of a regex. For
example, if the x
byte is added as a quit byte and the regex \w
is
used, then observing x
will cause the search to quit immediately
despite the fact that x
is in the \w
class.
This mechanism is primarily useful for heuristically enabling certain
features like Unicode word boundaries in a DFA. Namely, if the input
to search is ASCII, then a Unicode word boundary can be implemented
via an ASCII word boundary with no change in semantics. Thus, a DFA
can attempt to match a Unicode word boundary but give up as soon as it
observes a non-ASCII byte. Indeed, if callers set all non-ASCII bytes
to be quit bytes, then Unicode word boundaries will be permitted when
building DFAs. Of course, callers should enable
Config::unicode_word_boundary
if they want this behavior instead.
(The advantage being that non-ASCII quit bytes will only be added if a
Unicode word boundary is in the pattern.)
When enabling this option, callers must be prepared to handle a
MatchError
error during search. When using a
Regex
, this corresponds to using the
try_
suite of methods.
By default, there are no quit bytes set.
§Panics
This panics if heuristic Unicode word boundaries are enabled and any non-ASCII byte is removed from the set of quit bytes. Namely, enabling Unicode word boundaries requires setting every non-ASCII byte to a quit byte. So if the caller attempts to undo any of that, then this will panic.
§Example
This example shows how to cause a search to terminate if it sees a
\n
byte. This could be useful if, for example, you wanted to prevent
a user supplied pattern from matching across a line boundary.
use regex_automata::{dfa::{Automaton, dense}, Input, MatchError};
let dfa = dense::Builder::new()
.configure(dense::Config::new().quit(b'\n', true))
.build(r"foo\p{any}+bar")?;
let haystack = "foo\nbar".as_bytes();
// Normally this would produce a match, since \p{any} contains '\n'.
// But since we instructed the automaton to enter a quit state if a
// '\n' is observed, this produces a match error instead.
let expected = MatchError::quit(b'\n', 3);
let got = dfa.try_search_fwd(&Input::new(haystack)).unwrap_err();
assert_eq!(expected, got);
sourcepub fn specialize_start_states(self, yes: bool) -> Config
pub fn specialize_start_states(self, yes: bool) -> Config
Enable specializing start states in the DFA.
When start states are specialized, an implementor of a search routine using a lazy DFA can tell when the search has entered a starting state. When start states aren’t specialized, then it is impossible to know whether the search has entered a start state.
Ideally, this option wouldn’t need to exist and we could always specialize start states. The problem is that start states can be quite active. This in turn means that an efficient search routine is likely to ping-pong between a heavily optimized hot loop that handles most states and to a less optimized specialized handling of start states. This causes branches to get heavily mispredicted and overall can materially decrease throughput. Therefore, specializing start states should only be enabled when it is needed.
Knowing whether a search is in a start state is typically useful when a prefilter is active for the search. A prefilter is typically only run when in a start state and a prefilter can greatly accelerate a search. Therefore, the possible cost of specializing start states is worth it in this case. Otherwise, if you have no prefilter, there is likely no reason to specialize start states.
This is disabled by default, but note that it is automatically
enabled (or disabled) if Config::prefilter
is set. Namely, unless
specialize_start_states
has already been set, Config::prefilter
will automatically enable or disable it based on whether a prefilter
is present or not, respectively. This is done because a prefilter’s
effectiveness is rooted in being executed whenever the DFA is in a
start state, and that’s only possible to do when they are specialized.
Note that it is plausibly reasonable to disable this option explicitly while enabling a prefilter. In that case, a prefilter will still be run at the beginning of a search, but never again. This in theory could strike a good balance if you’re in a situation where a prefilter is likely to produce many false positive candidates.
§Example
This example shows how to enable start state specialization and then shows how to check whether a state is a start state or not.
use regex_automata::{dfa::{Automaton, dense::DFA}, Input};
let dfa = DFA::builder()
.configure(DFA::config().specialize_start_states(true))
.build(r"[a-z]+")?;
let haystack = "123 foobar 4567".as_bytes();
let sid = dfa.start_state_forward(&Input::new(haystack))?;
// The ID returned by 'start_state_forward' will always be tagged as
// a start state when start state specialization is enabled.
assert!(dfa.is_special_state(sid));
assert!(dfa.is_start_state(sid));
Compare the above with the default DFA configuration where start states are not specialized. In this case, the start state is not tagged at all:
use regex_automata::{dfa::{Automaton, dense::DFA}, Input};
let dfa = DFA::new(r"[a-z]+")?;
let haystack = "123 foobar 4567";
let sid = dfa.start_state_forward(&Input::new(haystack))?;
// Start states are not special in the default configuration!
assert!(!dfa.is_special_state(sid));
assert!(!dfa.is_start_state(sid));
sourcepub fn dfa_size_limit(self, bytes: Option<usize>) -> Config
pub fn dfa_size_limit(self, bytes: Option<usize>) -> Config
Set a size limit on the total heap used by a DFA.
This size limit is expressed in bytes and is applied during determinization of an NFA into a DFA. If the DFA’s heap usage, and only the DFA, exceeds this configured limit, then determinization is stopped and an error is returned.
This limit does not apply to auxiliary storage used during determinization that isn’t part of the generated DFA.
This limit is only applied during determinization. Currently, there is no way to post-pone this check to after minimization if minimization was enabled.
The total limit on heap used during determinization is the sum of the DFA and determinization size limits.
The default is no limit.
§Example
This example shows a DFA that fails to build because of a configured size limit. This particular example also serves as a cautionary tale demonstrating just how big DFAs with large Unicode character classes can get.
use regex_automata::{dfa::{dense, Automaton}, Input};
// 6MB isn't enough!
dense::Builder::new()
.configure(dense::Config::new().dfa_size_limit(Some(6_000_000)))
.build(r"\w{20}")
.unwrap_err();
// ... but 7MB probably is!
// (Note that DFA sizes aren't necessarily stable between releases.)
let dfa = dense::Builder::new()
.configure(dense::Config::new().dfa_size_limit(Some(7_000_000)))
.build(r"\w{20}")?;
let haystack = "A".repeat(20).into_bytes();
assert!(dfa.try_search_fwd(&Input::new(&haystack))?.is_some());
While one needs a little more than 6MB to represent \w{20}
, it
turns out that you only need a little more than 6KB to represent
(?-u:\w{20})
. So only use Unicode if you need it!
As with Config::determinize_size_limit
, the size of a DFA is
influenced by other factors, such as what start state configurations
to support. For example, if you only need unanchored searches and not
anchored searches, then configuring the DFA to only support unanchored
searches can reduce its size. By default, DFAs support both unanchored
and anchored searches.
use regex_automata::{dfa::{dense, Automaton, StartKind}, Input};
// 3MB isn't enough!
dense::Builder::new()
.configure(dense::Config::new()
.dfa_size_limit(Some(3_000_000))
.start_kind(StartKind::Unanchored)
)
.build(r"\w{20}")
.unwrap_err();
// ... but 4MB probably is!
// (Note that DFA sizes aren't necessarily stable between releases.)
let dfa = dense::Builder::new()
.configure(dense::Config::new()
.dfa_size_limit(Some(4_000_000))
.start_kind(StartKind::Unanchored)
)
.build(r"\w{20}")?;
let haystack = "A".repeat(20).into_bytes();
assert!(dfa.try_search_fwd(&Input::new(&haystack))?.is_some());
sourcepub fn determinize_size_limit(self, bytes: Option<usize>) -> Config
pub fn determinize_size_limit(self, bytes: Option<usize>) -> Config
Set a size limit on the total heap used by determinization.
This size limit is expressed in bytes and is applied during determinization of an NFA into a DFA. If the heap used for auxiliary storage during determinization (memory that is not in the DFA but necessary for building the DFA) exceeds this configured limit, then determinization is stopped and an error is returned.
This limit does not apply to heap used by the DFA itself.
The total limit on heap used during determinization is the sum of the DFA and determinization size limits.
The default is no limit.
§Example
This example shows a DFA that fails to build because of a
configured size limit on the amount of heap space used by
determinization. This particular example complements the example for
Config::dfa_size_limit
by demonstrating that not only does Unicode
potentially make DFAs themselves big, but it also results in more
auxiliary storage during determinization. (Although, auxiliary storage
is still not as much as the DFA itself.)
use regex_automata::{dfa::{dense, Automaton}, Input};
// 700KB isn't enough!
dense::Builder::new()
.configure(dense::Config::new()
.determinize_size_limit(Some(700_000))
)
.build(r"\w{20}")
.unwrap_err();
// ... but 800KB probably is!
// (Note that auxiliary storage sizes aren't necessarily stable between
// releases.)
let dfa = dense::Builder::new()
.configure(dense::Config::new()
.determinize_size_limit(Some(800_000))
)
.build(r"\w{20}")?;
let haystack = "A".repeat(20).into_bytes();
assert!(dfa.try_search_fwd(&Input::new(&haystack))?.is_some());
Note that some parts of the configuration on a DFA can have a
big impact on how big the DFA is, and thus, how much memory is
used. For example, the default setting for Config::start_kind
is
StartKind::Both
. But if you only need an anchored search, for
example, then it can be much cheaper to build a DFA that only supports
anchored searches. (Running an unanchored search with it would panic.)
use regex_automata::{
dfa::{dense, Automaton, StartKind},
Anchored, Input,
};
// 200KB isn't enough!
dense::Builder::new()
.configure(dense::Config::new()
.determinize_size_limit(Some(200_000))
.start_kind(StartKind::Anchored)
)
.build(r"\w{20}")
.unwrap_err();
// ... but 300KB probably is!
// (Note that auxiliary storage sizes aren't necessarily stable between
// releases.)
let dfa = dense::Builder::new()
.configure(dense::Config::new()
.determinize_size_limit(Some(300_000))
.start_kind(StartKind::Anchored)
)
.build(r"\w{20}")?;
let haystack = "A".repeat(20).into_bytes();
let input = Input::new(&haystack).anchored(Anchored::Yes);
assert!(dfa.try_search_fwd(&input)?.is_some());
sourcepub fn get_accelerate(&self) -> bool
pub fn get_accelerate(&self) -> bool
Returns whether this configuration has enabled simple state acceleration.
sourcepub fn get_prefilter(&self) -> Option<&Prefilter>
pub fn get_prefilter(&self) -> Option<&Prefilter>
Returns the prefilter attached to this configuration, if any.
sourcepub fn get_minimize(&self) -> bool
pub fn get_minimize(&self) -> bool
Returns whether this configuration has enabled the expensive process of minimizing a DFA.
sourcepub fn get_match_kind(&self) -> MatchKind
pub fn get_match_kind(&self) -> MatchKind
Returns the match semantics set in this configuration.
sourcepub fn get_starts(&self) -> StartKind
pub fn get_starts(&self) -> StartKind
Returns the starting state configuration for a DFA.
sourcepub fn get_starts_for_each_pattern(&self) -> bool
pub fn get_starts_for_each_pattern(&self) -> bool
Returns whether this configuration has enabled anchored starting states for every pattern in the DFA.
sourcepub fn get_byte_classes(&self) -> bool
pub fn get_byte_classes(&self) -> bool
Returns whether this configuration has enabled byte classes or not. This is typically a debugging oriented option, as disabling it confers no speed benefit.
sourcepub fn get_unicode_word_boundary(&self) -> bool
pub fn get_unicode_word_boundary(&self) -> bool
Returns whether this configuration has enabled heuristic Unicode word boundary support. When enabled, it is possible for a search to return an error.
sourcepub fn get_quit(&self, byte: u8) -> bool
pub fn get_quit(&self, byte: u8) -> bool
Returns whether this configuration will instruct the DFA to enter a quit state whenever the given byte is seen during a search. When at least one byte has this enabled, it is possible for a search to return an error.
sourcepub fn get_specialize_start_states(&self) -> bool
pub fn get_specialize_start_states(&self) -> bool
Returns whether this configuration will instruct the DFA to “specialize” start states. When enabled, the DFA will mark start states as “special” so that search routines using the DFA can detect when it’s in a start state and do some kind of optimization (like run a prefilter).
sourcepub fn get_dfa_size_limit(&self) -> Option<usize>
pub fn get_dfa_size_limit(&self) -> Option<usize>
Returns the DFA size limit of this configuration if one was set. The size limit is total number of bytes on the heap that a DFA is permitted to use. If the DFA exceeds this limit during construction, then construction is stopped and an error is returned.
sourcepub fn get_determinize_size_limit(&self) -> Option<usize>
pub fn get_determinize_size_limit(&self) -> Option<usize>
Returns the determinization size limit of this configuration if one was set. The size limit is total number of bytes on the heap that determinization is permitted to use. If determinization exceeds this limit during construction, then construction is stopped and an error is returned.
This is different from the DFA size limit in that this only applies to the auxiliary storage used during determinization. Once determinization is complete, this memory is freed.
The limit on the total heap memory used is the sum of the DFA and determinization size limits.
Trait Implementations§
Auto Trait Implementations§
impl Freeze for Config
impl RefUnwindSafe for Config
impl Send for Config
impl Sync for Config
impl Unpin for Config
impl UnwindSafe for Config
Blanket Implementations§
source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
source§impl<T> CloneToUninit for Twhere
T: Clone,
impl<T> CloneToUninit for Twhere
T: Clone,
source§unsafe fn clone_to_uninit(&self, dst: *mut T)
unsafe fn clone_to_uninit(&self, dst: *mut T)
clone_to_uninit
)