Struct DFA
struct DFA { ... }
A one-pass DFA for executing a subset of anchored regex searches while resolving capturing groups.
A one-pass DFA can be built from an NFA that is one-pass. An NFA is
one-pass when there is never any ambiguity about how to continue a search.
For example, a*a is not one-pass becuase during a search, it's not
possible to know whether to continue matching the a* or to move on to
the single a. However, a*b is one-pass, because for every byte in the
input, it's always clear when to move on from a* to b.
Only anchored searches are supported
In this crate, especially for DFAs, unanchored searches are implemented by
treating the pattern as if it had a (?s-u:.)*? prefix. While the prefix
is one-pass on its own, adding anything after it, e.g., (?s-u:.)*?a will
make the overall pattern not one-pass. Why? Because the (?s-u:.) matches
any byte, and there is therefore ambiguity as to when the prefix should
stop matching and something else should start matching.
Therefore, one-pass DFAs do not support unanchored searches. In addition
to many regexes simply not being one-pass, it implies that one-pass DFAs
have limited utility. With that said, when a one-pass DFA can be used, it
can potentially provide a dramatic speed up over alternatives like the
BoundedBacktracker
and the PikeVM. In particular,
a one-pass DFA is the only DFA capable of reporting the spans of matching
capturing groups.
To clarify, when we say that unanchored searches are not supported, what that actually means is:
- The high level routines,
DFA::is_matchandDFA::captures, always do anchored searches. - Since iterators are most useful in the context of unanchored searches,
there is no
DFA::captures_itermethod. - For lower level routines like
DFA::try_search, an error will be returned if the givenInputis configured to do an unanchored search or search for an invalid pattern ID. (Note that anInputis configured to do an unanchored search by default, so just giving aInput::newis guaranteed to return an error.)
Other limitations
In addition to the configurable heap limit and the requirement that a regex pattern be one-pass, there are some other limitations:
- There is an internal limit on the total number of explicit capturing groups that appear across all patterns. It is somewhat small and there is no way to configure it. If your pattern(s) exceed this limit, then building a one-pass DFA will fail.
- If the number of patterns exceeds an internal unconfigurable limit, then building a one-pass DFA will fail. This limit is quite large and you're unlikely to hit it.
- If the total number of states exceeds an internal unconfigurable limit, then building a one-pass DFA will fail. This limit is quite large and you're unlikely to hit it.
Other examples of regexes that aren't one-pass
One particularly unfortunate example is that enabling Unicode can cause
regexes that were one-pass to no longer be one-pass. Consider the regex
(?-u)\w*\s for example. It is one-pass because there is exactly no
overlap between the ASCII definitions of \w and \s. But \w*\s
(i.e., with Unicode enabled) is not one-pass because \w and \s get
translated to UTF-8 automatons. And while the codepoints in \w and \s
do not overlap, the underlying UTF-8 encodings do. Indeed, because of the
overlap between UTF-8 automata, the use of Unicode character classes will
tend to vastly increase the likelihood of a regex not being one-pass.
How does one know if a regex is one-pass or not?
At the time of writing, the only way to know is to try and build a one-pass DFA. The one-pass property is checked while constructing the DFA.
This does mean that you might potentially waste some CPU cycles and memory by optimistically trying to build a one-pass DFA. But this is currently the only way. In the future, building a one-pass DFA might be able to use some heuristics to detect common violations of the one-pass property and bail more quickly.
Resource usage
Unlike a general DFA, a one-pass DFA has stricter bounds on its resource
usage. Namely, construction of a one-pass DFA has a time and space
complexity of O(n), where n ~ nfa.states().len(). (A general DFA's time
and space complexity is O(2^n).) This smaller time bound is achieved
because there is at most one DFA state created for each NFA state. If
additional DFA states would be required, then the pattern is not one-pass
and construction will fail.
Note though that currently, this DFA uses a fully dense representation. This means that while its space complexity is no worse than an NFA, it may in practice use more memory because of higher constant factors. The reason for this trade off is two-fold. Firstly, a dense representation makes the search faster. Secondly, the bigger an NFA, the more unlikely it is to be one-pass. Therefore, most one-pass DFAs are usually pretty small.
Example
This example shows that the one-pass DFA implements Unicode word boundaries correctly while simultaneously reporting spans for capturing groups that participate in a match. (This is the only DFA that implements full support for Unicode word boundaries.)
# if cfg! // miri takes too long
use ;
let re = DFAnew?;
let = ;
re.captures;
assert_eq!;
assert_eq!;
assert_eq!;
# Ok::
Example: iteration
Unlike other regex engines in this crate, this one does not provide iterator search functions. This is because a one-pass DFA only supports anchored searches, and so iterator functions are generally not applicable.
However, if you know that all of your matches are
directly adjacent, then an iterator can be used. The
util::iter::Searcher type can be used for
this purpose:
# if cfg! // miri takes too long
use ;
let re = DFAnew?;
let = ;
let input = new.anchored;
let mut it = new.into_captures_iter.infallible;
let caps0 = it.next.unwrap;
assert_eq!;
# Ok::
Implementations
impl DFA
fn new(pattern: &str) -> Result<DFA, BuildError>Parse the given regular expression using the default configuration and return the corresponding one-pass DFA.
If you want a non-default configuration, then use the
Builderto set your own configuration.Example
use ; let re = DFAnew?; let = ; re.captures; assert_eq!; # Ok::fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<DFA, BuildError>Like
new, but parses multiple patterns into a single "multi regex." This similarly uses the default regex configuration.Example
use ; let re = DFAnew_many?; let = ; re.captures; assert_eq!; re.captures; assert_eq!; # Ok::fn new_from_nfa(nfa: NFA) -> Result<DFA, BuildError>Like
new, but builds a one-pass DFA directly from an NFA. This is useful if you already have an NFA, or even if you hand-assembled the NFA.Example
This shows how to hand assemble a regular expression via its HIR, compile an NFA from it and build a one-pass DFA from the NFA.
use ; use ; let hir = class; let config = NFAconfig.nfa_size_limit; let nfa = NFAcompiler.configure.build_from_hir?; let re = DFAnew_from_nfa?; let = ; let expected = Some; re.captures; assert_eq!; # Ok::fn always_match() -> Result<DFA, BuildError>Create a new one-pass DFA that matches every input.
Example
use ; let dfa = DFAalways_match?; let mut cache = dfa.create_cache; let mut caps = dfa.create_captures; let expected = must; dfa.captures; assert_eq!; dfa.captures; assert_eq!; # Ok::fn never_match() -> Result<DFA, BuildError>Create a new one-pass DFA that never matches any input.
Example
use DFA; let dfa = DFAnever_match?; let mut cache = dfa.create_cache; let mut caps = dfa.create_captures; dfa.captures; assert_eq!; dfa.captures; assert_eq!; # Ok::fn config() -> ConfigReturn a default configuration for a DFA.
This is a convenience routine to avoid needing to import the
Configtype when customizing the construction of a DFA.Example
This example shows how to change the match semantics of this DFA from its default "leftmost first" to "all." When using "all," non-greediness doesn't apply and neither does preference order matching. Instead, the longest match possible is always returned. (Although, by construction, it's impossible for a one-pass DFA to have a different answer for "preference order" vs "longest match.")
use ; let re = DFAbuilder .configure .build?; let mut cache = re.create_cache; let mut caps = re.create_captures; re.captures; // Normally, the non-greedy repetition would give us a 0..3 match. assert_eq!; # Ok::fn builder() -> BuilderReturn a builder for configuring the construction of a DFA.
This is a convenience routine to avoid needing to import the
Buildertype in common cases.Example
This example shows how to use the builder to disable UTF-8 mode.
# if cfg! // miri takes too long use ; let re = DFAbuilder .syntax .thompson .build?; let = ; let haystack = b"foo\xFFarzz\xE2\x98\xFF\n"; let expected = Some; re.captures; assert_eq!; # Ok::fn create_captures(self: &Self) -> CapturesCreate a new empty set of capturing groups that is guaranteed to be valid for the search APIs on this DFA.
A
Capturesvalue created for a specific DFA cannot be used with any other DFA.This is a convenience function for
Captures::all. See theCapturesdocumentation for an explanation of its alternative constructors that permit the DFA to do less work during a search, and thus might make it faster.fn create_cache(self: &Self) -> CacheCreate a new cache for this DFA.
The cache returned should only be used for searches for this DFA. If you want to reuse the cache for another DFA, then you must call
Cache::resetwith that DFA (or, equivalently,DFA::reset_cache).fn reset_cache(self: &Self, cache: &mut Cache)Reset the given cache such that it can be used for searching with the this DFA (and only this DFA).
A cache reset permits reusing memory already allocated in this cache with a different DFA.
Example
This shows how to re-purpose a cache for use with a different DFA.
# if cfg! // miri takes too long use ; let re1 = DFAnew?; let re2 = DFAnew?; let mut caps1 = re1.create_captures; let mut caps2 = re2.create_captures; let mut cache = re1.create_cache; assert_eq!; // Using 'cache' with re2 is not allowed. It may result in panics or // incorrect results. In order to re-purpose the cache, we must reset // it with the one-pass DFA we'd like to use it with. // // Similarly, after this reset, using the cache with 're1' is also not // allowed. re2.reset_cache; assert_eq!; # Ok::fn get_config(self: &Self) -> &ConfigReturn the config for this one-pass DFA.
fn get_nfa(self: &Self) -> &NFAReturns a reference to the underlying NFA.
fn pattern_len(self: &Self) -> usizeReturns the total number of patterns compiled into this DFA.
In the case of a DFA that contains no patterns, this returns
0.fn state_len(self: &Self) -> usizeReturns the total number of states in this one-pass DFA.
Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.
fn alphabet_len(self: &Self) -> usizeReturns the total number of elements in the alphabet for this DFA.
That is, this returns the total number of transitions that each state in this DFA must have. The maximum alphabet size is 256, which corresponds to each possible byte value.
The alphabet size may be less than 256 though, and unless
Config::byte_classesis disabled, it is typically must less than 256. Namely, bytes are grouped into equivalence classes such that no two bytes in the same class can distinguish a match from a non-match. For example, in the regex^[a-z]+$, the ASCII bytesa-zcould all be in the same equivalence class. This leads to a massive space savings.Note though that the alphabet length does not necessarily equal the total stride space taken up by a single DFA state in the transition table. Namely, for performance reasons, the stride is always the smallest power of two that is greater than or equal to the alphabet length. For this reason,
DFA::strideorDFA::stride2are often more useful. The alphabet length is typically useful only for informational purposes.Note also that unlike dense or sparse DFAs, a one-pass DFA does not have a special end-of-input (EOI) transition. This is because a one-pass DFA handles look-around assertions explicitly (like the
PikeVM) and does not build them into the transitions of the DFA.fn stride2(self: &Self) -> usizeReturns the total stride for every state in this DFA, expressed as the exponent of a power of 2. The stride is the amount of space each state takes up in the transition table, expressed as a number of transitions. (Unused transitions map to dead states.)
The stride of a DFA is always equivalent to the smallest power of 2 that is greater than or equal to the DFA's alphabet length. This definition uses extra space, but possibly permits faster translation between state identifiers and their corresponding offsets in this DFA's transition table.
For example, if the DFA's stride is 16 transitions, then its
stride2is4since2^4 = 16.The minimum
stride2value is1(corresponding to a stride of2) while the maximumstride2value is9(corresponding to a stride of512). The maximum in theory should be8, but because of some implementation quirks that may be relaxed in the future, it is one more than8. (Do note that a maximal stride is incredibly rare, as it would imply that there is almost no redundant in the regex pattern.)Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.
fn stride(self: &Self) -> usizeReturns the total stride for every state in this DFA. This corresponds to the total number of transitions used by each state in this DFA's transition table.
Please see
DFA::stride2for more information. In particular, this returns the stride as the number of transitions, where asstride2returns it as the exponent of a power of 2.Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.
fn memory_usage(self: &Self) -> usizeReturns the memory usage, in bytes, of this DFA.
The memory usage is computed based on the number of bytes used to represent this DFA.
This does not include the stack size used up by this DFA. To compute that, use
std::mem::size_of::<onepass::DFA>().
impl DFA
fn is_match<'h, I: Into<Input<'h>>>(self: &Self, cache: &mut Cache, input: I) -> boolExecutes an anchored leftmost forward search, and returns true if and only if this one-pass DFA matches the given haystack.
This routine may short circuit if it knows that scanning future input will never lead to a different result. In particular, if the underlying DFA enters a match state, then this routine will return
trueimmediately without inspecting any future input. (Consider how this might make a difference given the regexa+on the haystackaaaaaaaaaaaaaaa. This routine can stop after it sees the firsta, but routines likefindneed to continue searching because+is greedy by default.)The given
Inputis forcefully set to useAnchored::Yesif the given configuration wasAnchored::No(which is the default).Panics
This routine panics if the search could not complete. This can occur in the following circumstances:
- When the provided
Inputconfiguration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Patternwithout enablingConfig::starts_for_each_pattern.
When a search panics, callers cannot know whether a match exists or not.
Use
DFA::try_searchif you want to handle these panics as error values instead.Example
This shows basic usage:
use DFA; let re = DFAnew?; let mut cache = re.create_cache; assert!; assert!; # Ok::Example: consistency with search APIs
is_matchis guaranteed to returntruewhenevercapturesreturns a match. This includes searches that are executed entirely within a codepoint:use ; let re = DFAnew?; let mut cache = re.create_cache; assert!; # Ok::Notice that when UTF-8 mode is disabled, then the above reports a match because the restriction against zero-width matches that split a codepoint has been lifted:
use ; let re = DFAbuilder .thompson .build?; let mut cache = re.create_cache; assert!; # Ok::- When the provided
fn find<'h, I: Into<Input<'h>>>(self: &Self, cache: &mut Cache, input: I) -> Option<Match>Executes an anchored leftmost forward search, and returns a
Matchif and only if this one-pass DFA matches the given haystack.This routine only includes the overall match span. To get access to the individual spans of each capturing group, use
DFA::captures.The given
Inputis forcefully set to useAnchored::Yesif the given configuration wasAnchored::No(which is the default).Panics
This routine panics if the search could not complete. This can occur in the following circumstances:
- When the provided
Inputconfiguration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Patternwithout enablingConfig::starts_for_each_pattern.
When a search panics, callers cannot know whether a match exists or not.
Use
DFA::try_searchif you want to handle these panics as error values instead.Example
Leftmost first match semantics corresponds to the match with the smallest starting offset, but where the end offset is determined by preferring earlier branches in the original regular expression. For example,
Sam|Samwisewill matchSaminSamwise, butSamwise|Samwill matchSamwiseinSamwise.Generally speaking, the "leftmost first" match is how most backtracking regular expressions tend to work. This is in contrast to POSIX-style regular expressions that yield "leftmost longest" matches. Namely, both
Sam|SamwiseandSamwise|SammatchSamwisewhen using leftmost longest semantics. (This crate does not currently support leftmost longest semantics.)use ; let re = DFAnew?; let mut cache = re.create_cache; let expected = must; assert_eq!; // Even though a match is found after reading the first byte (`a`), // the leftmost first match semantics demand that we find the earliest // match that prefers earlier parts of the pattern over later parts. let re = DFAnew?; let mut cache = re.create_cache; let expected = must; assert_eq!; # Ok::- When the provided
fn captures<'h, I: Into<Input<'h>>>(self: &Self, cache: &mut Cache, input: I, caps: &mut Captures)Executes an anchored leftmost forward search and writes the spans of capturing groups that participated in a match into the provided
Capturesvalue. If no match was found, thenCaptures::is_matchis guaranteed to returnfalse.The given
Inputis forcefully set to useAnchored::Yesif the given configuration wasAnchored::No(which is the default).Panics
This routine panics if the search could not complete. This can occur in the following circumstances:
- When the provided
Inputconfiguration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Patternwithout enablingConfig::starts_for_each_pattern.
When a search panics, callers cannot know whether a match exists or not.
Use
DFA::try_searchif you want to handle these panics as error values instead.Example
This shows a simple example of a one-pass regex that extracts capturing group spans.
use ; let re = DFAnew?; let = ; re.captures; assert_eq!; assert_eq!; assert_eq!; # Ok::- When the provided
fn try_search(self: &Self, cache: &mut Cache, input: &Input<'_>, caps: &mut Captures) -> Result<(), MatchError>Executes an anchored leftmost forward search and writes the spans of capturing groups that participated in a match into the provided
Capturesvalue. If no match was found, thenCaptures::is_matchis guaranteed to returnfalse.The differences with
DFA::capturesare:- This returns an error instead of panicking if the search fails.
- Accepts an
&Inputinstead of aInto<Input>. This permits reusing the same input for multiple searches, which may be important for latency. - This does not automatically change the
Anchoredmode fromNotoYes. Instead, ifInput::anchoredisAnchored::No, then an error is returned.
Errors
This routine errors if the search could not complete. This can occur in the following circumstances:
- When the provided
Inputconfiguration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Patternwithout enablingConfig::starts_for_each_pattern.
When a search returns an error, callers cannot know whether a match exists or not.
Example: specific pattern search
This example shows how to build a multi-regex that permits searching for specific patterns. Note that this is somewhat less useful than in other regex engines, since a one-pass DFA by definition has no ambiguity about which pattern can match at a position. That is, if it were possible for two different patterns to match at the same starting position, then the multi-regex would not be one-pass and construction would have failed.
Nevertheless, this can still be useful if you only care about matches for a specific pattern, and want the DFA to report "no match" even if some other pattern would have matched.
Note that in order to make use of this functionality,
Config::starts_for_each_patternmust be enabled. It is disabled by default since it may result in higher memory usage.use ; let re = DFAbuilder .configure .build_many?; let = ; let haystack = "123abc"; let input = new.anchored; // A normal multi-pattern search will show pattern 1 matches. re.try_search?; assert_eq!; // If we only want to report pattern 0 matches, then we'll get no // match here. let input = input.anchored; re.try_search?; assert_eq!; # Ok::Example: specifying the bounds of a search
This example shows how providing the bounds of a search can produce different results than simply sub-slicing the haystack.
# if cfg! // miri takes too long use ; // one-pass DFAs fully support Unicode word boundaries! // A sad joke is that a Unicode aware regex like \w+\s is not one-pass. // :-( let re = DFAnew?; let = ; let haystack = "foo123bar"; // Since we sub-slice the haystack, the search doesn't know about // the larger context and assumes that `123` is surrounded by word // boundaries. And of course, the match position is reported relative // to the sub-slice as well, which means we get `0..3` instead of // `3..6`. let expected = Some; let input = new.anchored; re.try_search?; assert_eq!; // But if we provide the bounds of the search within the context of the // entire haystack, then the search can take the surrounding context // into account. (And if we did find a match, it would be reported // as a valid offset into `haystack` instead of its sub-slice.) let expected = None; let input = new.range.anchored; re.try_search?; assert_eq!; # Ok::fn try_search_slots(self: &Self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>]) -> Result<Option<PatternID>, MatchError>Executes an anchored leftmost forward search and writes the spans of capturing groups that participated in a match into the provided
slots, and returns the matching pattern ID. The contents of the slots for patterns other than the matching pattern are unspecified. If no match was found, thenNoneis returned and the contents of allslotsis unspecified.This is like
DFA::try_search, but it accepts a raw slots slice instead of aCapturesvalue. This is useful in contexts where you don't want or need to allocate aCaptures.It is legal to pass any number of slots to this routine. If the regex engine would otherwise write a slot offset that doesn't fit in the provided slice, then it is simply skipped. In general though, there are usually three slice lengths you might want to use:
- An empty slice, if you only care about which pattern matched.
- A slice with
pattern_len() * 2slots, if you only care about the overall match spans for each matching pattern. - A slice with
slot_len()slots, which permits recording match offsets for every capturing group in every pattern.
Errors
This routine errors if the search could not complete. This can occur in the following circumstances:
- When the provided
Inputconfiguration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Patternwithout enablingConfig::starts_for_each_pattern.
When a search returns an error, callers cannot know whether a match exists or not.
Example
This example shows how to find the overall match offsets in a multi-pattern search without allocating a
Capturesvalue. Indeed, we can put our slots right on the stack.use ; let re = DFAnew_many?; let mut cache = re.create_cache; let input = new.anchored; // We only care about the overall match offsets here, so we just // allocate two slots for each pattern. Each slot records the start // and end of the match. let mut slots = ; let pid = re.try_search_slots?; assert_eq!; // The overall match offsets are always at 'pid * 2' and 'pid * 2 + 1'. // See 'GroupInfo' for more details on the mapping between groups and // slot indices. let slot_start = pid.unwrap.as_usize * 2; let slot_end = slot_start + 1; assert_eq!; assert_eq!; # Ok::
impl Clone for DFA
fn clone(self: &Self) -> DFA
impl Debug for DFA
fn fmt(self: &Self, f: &mut Formatter<'_>) -> Result
impl Freeze for DFA
impl RefUnwindSafe for DFA
impl Send for DFA
impl Sync for DFA
impl Unpin for DFA
impl UnsafeUnpin for DFA
impl UnwindSafe for DFA
impl<T> Any for DFA
fn type_id(self: &Self) -> TypeId
impl<T> Borrow for DFA
fn borrow(self: &Self) -> &T
impl<T> BorrowMut for DFA
fn borrow_mut(self: &mut Self) -> &mut T
impl<T> CloneToUninit for DFA
unsafe fn clone_to_uninit(self: &Self, dest: *mut u8)
impl<T> From for DFA
fn from(t: T) -> TReturns the argument unchanged.
impl<T> ToOwned for DFA
fn to_owned(self: &Self) -> Tfn clone_into(self: &Self, target: &mut T)
impl<T, U> Into for DFA
fn into(self: Self) -> UCalls
U::from(self).That is, this conversion is whatever the implementation of
[From]<T> for Uchooses to do.
impl<T, U> TryFrom for DFA
fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>
impl<T, U> TryInto for DFA
fn try_into(self: Self) -> Result<U, <U as TryFrom<T>>::Error>