Struct Config
struct Config { ... }
The configuration used for a Thompson NFA compiler.
Implementations
impl Config
fn new() -> Config
Return a new default Thompson NFA compiler configuration.
fn utf8(self: Self, yes: bool) -> Config
Whether to enable UTF-8 mode during search or not.
A regex engine is said to be in UTF-8 mode when it guarantees that all matches returned by it have spans consisting of only valid UTF-8. That is, it is impossible for a match span to be returned that contains any invalid UTF-8.
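To make the definition concrete, here is a minimal standalone sketch that uses only the standard library and not this crate. The helper name `span_is_valid_utf8` is made up for illustration. It checks the property that UTF-8 mode guarantees for every reported match span: the bytes inside the span are valid UTF-8.

```rust
/// Hypothetical helper (not part of this crate): returns true if the bytes
/// in `haystack[start..end]` form valid UTF-8.
fn span_is_valid_utf8(haystack: &[u8], start: usize, end: usize) -> bool {
    std::str::from_utf8(&haystack[start..end]).is_ok()
}

fn main() {
    // "a☃b" is 5 bytes long: 'a' (1 byte), '☃' (3 bytes), 'b' (1 byte).
    let haystack = "a☃b".as_bytes();
    assert_eq!(haystack.len(), 5);

    // Spans that start and end on codepoint boundaries are valid UTF-8.
    assert!(span_is_valid_utf8(haystack, 0, 5));
    assert!(span_is_valid_utf8(haystack, 1, 4));

    // A span that ends inside the snowman is not valid UTF-8.
    assert!(!span_is_valid_utf8(haystack, 0, 2));
    // Neither is a span that starts inside it.
    assert!(!span_is_valid_utf8(haystack, 2, 5));
}
```

Note that an empty span always passes this check, which is why empty matches need separate treatment.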
UTF-8 mode generally consists of two things:
1. Whether the NFA's states are constructed such that all paths to a match state that consume at least one byte always correspond to valid UTF-8.
2. Whether all paths to a match state that do not consume any bytes should always correspond to valid UTF-8 boundaries.
(1) is a guarantee made by whoever constructs the NFA. If you're parsing a regex from its concrete syntax, then syntax::Config::utf8 can make this guarantee for you. It does it by returning an error if the regex pattern could ever report a non-empty match span that contains invalid UTF-8. So long as syntax::Config::utf8 mode is enabled and your regex successfully parses, then you're guaranteed that the corresponding NFA will only ever report non-empty match spans containing valid UTF-8.
(2) is a trickier guarantee because it cannot be enforced by the NFA state graph itself. Consider, for example, the regex a*. It matches the empty strings in ☃ at positions 0, 1, 2 and 3, where positions 1 and 2 occur within the UTF-8 encoding of a codepoint, and thus correspond to invalid UTF-8 boundaries. Therefore, this guarantee must be made at a higher level than the NFA state graph itself. This crate deals with this case in each regex engine. Namely, when a zero-width match that splits a codepoint is found and UTF-8 mode is enabled, then it is ignored and the engine moves on looking for the next match.
Thus, UTF-8 mode is both a promise that the NFA built only reports non-empty matches that are valid UTF-8, and an instruction to regex engines that empty matches that split codepoints should be banned.
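The snowman case can be modeled without any regex engine. The following is a rough sketch using only the standard library, not how this crate's engines are implemented. The function name `empty_match_offsets` is hypothetical, and the sketch assumes a haystack that is a &str containing no `a`, so that a regex like a* can only match the empty string.

```rust
/// Hypothetical model of the empty matches of a regex like `a*` on a
/// haystack with no `a` in it. Every offset is a candidate; in UTF-8 mode,
/// offsets that split a codepoint are skipped.
fn empty_match_offsets(haystack: &str, utf8_mode: bool) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| !utf8_mode || haystack.is_char_boundary(at))
        .collect()
}

fn main() {
    // The snowman is encoded as three bytes in UTF-8.
    let haystack = "☃";
    assert_eq!(haystack.len(), 3);

    // Without UTF-8 mode, an empty match is possible at every byte offset.
    assert_eq!(empty_match_offsets(haystack, false), vec![0, 1, 2, 3]);

    // With UTF-8 mode, offsets 1 and 2 are skipped because they fall inside
    // the encoding of the snowman.
    assert_eq!(empty_match_offsets(haystack, true), vec![0, 3]);
}
```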
Because UTF-8 mode is fundamentally about avoiding invalid UTF-8 spans, it only makes sense to enable this option when you know your haystack is valid UTF-8. (For example, a &str.) Enabling UTF-8 mode and searching a haystack that contains invalid UTF-8 leads to unspecified behavior.
Therefore, it may make sense to enable syntax::Config::utf8 while simultaneously disabling this option. That would ensure all non-empty match spans are valid UTF-8, but that empty match spans may still split a codepoint or match at other places that aren't valid UTF-8.
In general, this mode is only relevant if your regex can match the empty string. Most regexes don't.
This is enabled by default.
Example
This example shows how UTF-8 mode can impact the match spans that may be reported in certain cases.
    use ;
    let re = new?;
    let = ;
    // UTF-8 mode is enabled by default.
    let mut input = new;
    re.search;
    assert_eq!;
    // Even though an empty regex matches at 1..1, our next match is
    // 3..3 because 1..1 and 2..2 split the snowman codepoint (which is
    // three bytes long).
    input.set_start;
    re.search;
    assert_eq!;
    // But if we disable UTF-8, then we'll get matches at 1..1 and 2..2:
    let re = builder .thompson .build?;
    re.search;
    assert_eq!;
    input.set_start;
    re.search;
    assert_eq!;
    input.set_start;
    re.search;
    assert_eq!;
    input.set_start;
    re.search;
    assert_eq!;
fn reverse(self: Self, yes: bool) -> Config
Reverse the NFA.
An NFA reversal is performed by reversing all of the concatenated sub-expressions in the original pattern, recursively. (Look-around operators are also inverted.) The resulting NFA can be used to match the pattern starting from the end of a string instead of the beginning of a string.
Reversing the NFA is useful for building a reverse DFA, which is most useful for finding the start of a match after its ending position has been found. NFA execution engines typically do not work on reverse NFAs. For example, currently, the Pike VM reports the starting location of matches without a reverse NFA.
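The recursive reversal described above can be illustrated on a toy syntax tree. This is only a sketch under stated assumptions: the `Ast` type and the `reverse` function below are hypothetical and are not this crate's compiler or its internal representation. The sketch shows that concatenations are reversed at every level, alternations keep their branch order, and a start-of-text assertion becomes an end-of-text assertion (and vice versa).

```rust
/// A toy regex syntax tree (hypothetical, for illustration only).
#[derive(Clone, Debug, PartialEq)]
enum Ast {
    Literal(char),
    /// A look-around assertion like `\A`.
    StartText,
    /// A look-around assertion like `\z`.
    EndText,
    Concat(Vec<Ast>),
    Alternate(Vec<Ast>),
}

/// Reverse a pattern: flip the order of every concatenation, recursively,
/// and invert the look-around assertions.
fn reverse(ast: &Ast) -> Ast {
    match ast {
        Ast::Literal(c) => Ast::Literal(*c),
        Ast::StartText => Ast::EndText,
        Ast::EndText => Ast::StartText,
        Ast::Concat(items) => {
            Ast::Concat(items.iter().rev().map(reverse).collect())
        }
        // Branch order is kept, but each branch is itself reversed.
        Ast::Alternate(items) => {
            Ast::Alternate(items.iter().map(reverse).collect())
        }
    }
}

fn main() {
    // Models the pattern `\Aab(c|de)`.
    let ast = Ast::Concat(vec![
        Ast::StartText,
        Ast::Literal('a'),
        Ast::Literal('b'),
        Ast::Alternate(vec![
            Ast::Literal('c'),
            Ast::Concat(vec![Ast::Literal('d'), Ast::Literal('e')]),
        ]),
    ]);
    // Models the reversed pattern `(c|ed)ba\z`.
    let expected = Ast::Concat(vec![
        Ast::Alternate(vec![
            Ast::Literal('c'),
            Ast::Concat(vec![Ast::Literal('e'), Ast::Literal('d')]),
        ]),
        Ast::Literal('b'),
        Ast::Literal('a'),
        Ast::EndText,
    ]);
    assert_eq!(reverse(&ast), expected);
    // Reversing twice yields the original pattern.
    assert_eq!(reverse(&reverse(&ast)), ast);
}
```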
Currently, enabling this setting requires disabling the captures setting. If both are enabled, then the compiler will return an error. It is expected that this limitation will be lifted in the future.
This is disabled by default.
Example
This example shows how to build a DFA from a reverse NFA, and then use the DFA to search backwards.
    use ;
    let dfa = new .thompson .build?;
    let expected = Some;
    assert_eq!;
fn nfa_size_limit(self: Self, bytes: Option<usize>) -> Config
Sets an approximate size limit on the total heap used by the NFA being compiled.
This permits imposing constraints on the size of a compiled NFA. This may be useful in contexts where the regex pattern is untrusted and one wants to avoid using too much memory.
This size limit does not apply to auxiliary heap used during compilation that is not part of the built NFA.
Note that this size limit is applied during compilation in order for the limit to prevent too much heap from being used. However, the implementation may use an intermediate NFA representation that is otherwise slightly bigger than the final public form. Since the size limit may be applied to an intermediate representation, there is not necessarily a precise correspondence between the configured size limit and the heap usage of the final NFA.
There is no size limit by default.
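To illustrate what it means for a limit to be applied during compilation, here is a rough sketch using only the standard library. Everything in it (`State`, `Builder`, `ExceededSizeLimit`, `memory_usage`) is hypothetical and is not this crate's builder. It assumes a simplified model in which heap usage is the number of states times the size of one state, and the limit is checked each time a state is added.

```rust
use std::mem::size_of;

/// A toy NFA state (hypothetical; real states are more varied).
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
struct State {
    byte: u8,
    next: usize,
}

/// The error returned once the configured limit is exceeded.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
struct ExceededSizeLimit {
    limit: usize,
}

/// A toy builder that enforces an optional size limit while states are added.
struct Builder {
    states: Vec<State>,
    size_limit: Option<usize>,
}

impl Builder {
    fn new(size_limit: Option<usize>) -> Builder {
        Builder { states: Vec::new(), size_limit }
    }

    /// Approximate heap used by the states added so far, in bytes.
    fn memory_usage(&self) -> usize {
        self.states.len() * size_of::<State>()
    }

    /// Add a state and return its index, or an error if the limit is
    /// exceeded after adding it.
    fn add(&mut self, state: State) -> Result<usize, ExceededSizeLimit> {
        let id = self.states.len();
        self.states.push(state);
        match self.size_limit {
            Some(limit) if self.memory_usage() > limit => {
                Err(ExceededSizeLimit { limit })
            }
            _ => Ok(id),
        }
    }
}

fn main() {
    // Allow room for exactly four states.
    let limit = 4 * size_of::<State>();
    let mut limited = Builder::new(Some(limit));
    for i in 0..4 {
        assert_eq!(limited.add(State { byte: b'a', next: i + 1 }), Ok(i));
    }
    // The fifth state pushes usage past the limit, so building fails early.
    assert_eq!(
        limited.add(State { byte: b'a', next: 5 }),
        Err(ExceededSizeLimit { limit })
    );

    // With no limit configured, nothing is rejected.
    let mut unlimited = Builder::new(None);
    for i in 0..1000 {
        assert!(unlimited.add(State { byte: b'a', next: i + 1 }).is_ok());
    }
}
```

Because the check runs on the builder's own representation, the number it compares against the limit need not equal the heap usage of the finished NFA, which is the imprecision described above.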
Example
This example demonstrates how Unicode mode can greatly increase the size of the NFA.
    use NFA;
    // 400KB isn't enough!
    NFAcompiler .configure .build .unwrap_err;
    // ... but 500KB probably is.
    let nfa = NFAcompiler .configure .build?;
    assert_eq!;
fn shrink(self: Self, yes: bool) -> Config
Apply best-effort heuristics to shrink the NFA at the expense of more time/memory.
Generally speaking, if one is using an NFA to compile a DFA, then the extra time used to shrink the NFA will be more than made up for during DFA construction (potentially by a lot). In other words, enabling this can substantially decrease the overall amount of time it takes to build a DFA.
A reason to keep this disabled is if you want to compile an NFA and start using it as quickly as possible without needing to build a DFA, and you don't mind using a bit of extra memory for the NFA, e.g., for an NFA simulation or for a lazy DFA.
NFA shrinking is currently most useful when compiling a reverse NFA with large Unicode character classes. In particular, it trades additional CPU time during NFA compilation in favor of generating fewer NFA states.
This is disabled by default because it can increase compile times quite a bit if you aren't building a full DFA.
Example
This example shows that NFA shrinking can lead to substantial space savings in some cases. Notice that, as noted above, we build a reverse NFA and use a pattern with a large Unicode character class.
    use ;
    // Currently we have to disable captures when enabling reverse NFA.
    let config = NFAconfig .which_captures .reverse;
    let not_shrunk = NFAcompiler .configure .build?;
    let shrunk = NFAcompiler .configure .build?;
    // While a specific shrink factor is not guaranteed, the savings can be
    // considerable in some cases.
    assert!;
fn captures(self: Self, yes: bool) -> Config
Whether to include 'Capture' states in the NFA.
Currently, enabling this setting requires disabling the reverse setting. If both are enabled, then the compiler will return an error. It is expected that this limitation will be lifted in the future.
This is enabled by default.
Example
This example demonstrates that some regex engines, like the Pike VM, require capturing states to be present in the NFA to report match offsets.
(Note that since this method is deprecated, the example below uses Config::which_captures to disable capture states.)
    use ;
    let re = builder .thompson .build?;
    let mut cache = re.create_cache;
    assert!;
    assert_eq!;
fn which_captures(self: Self, which_captures: WhichCaptures) -> Config
Configures what kinds of capture groups are compiled into State::Capture states in a Thompson NFA.
Currently, using any option except for WhichCaptures::None requires disabling the reverse setting. If both are enabled, then the compiler will return an error. It is expected that this limitation will be lifted in the future.
This is set to WhichCaptures::All by default. Callers may wish to use WhichCaptures::Implicit in cases where one wants to avoid the overhead of capture states for explicit groups. Usually this occurs when one wants to use the PikeVM only for determining the overall match. Otherwise, the PikeVM could use much more memory than is necessary.
Example
This example demonstrates that some regex engines, like the Pike VM, require capturing states to be present in the NFA to report match offsets.
    use ;
    let re = builder .thompson .build?;
    let mut cache = re.create_cache;
    assert!;
    assert_eq!;
The same applies to the bounded backtracker:
    use ;
    let re = builder .thompson .build?;
    let mut cache = re.create_cache;
    assert!;
    assert_eq!;
fn look_matcher(self: Self, m: LookMatcher) -> Config
Sets the look-around matcher that should be used with this NFA.
A look-around matcher determines how to match look-around assertions. In particular, some assertions are configurable. For example, the (?m:^) and (?m:$) assertions can have their line terminator changed from the default of \n to any other byte.
Example
This shows how to change the line terminator for multi-line assertions.
    use ;
    let mut lookm = new;
    lookm.set_line_terminator;
    let re = builder .thompson .build?;
    let mut cache = re.create_cache;
    // Multi-line assertions now use NUL as a terminator.
    assert_eq!;
    // ... and \n is no longer recognized as a terminator.
    assert_eq!;
fn get_utf8(self: &Self) -> bool
Returns whether this configuration has enabled UTF-8 mode.
fn get_reverse(self: &Self) -> bool
Returns whether this configuration has enabled reverse NFA compilation.
fn get_nfa_size_limit(self: &Self) -> Option<usize>
Return the configured NFA size limit, if it exists, in the number of bytes of heap used.
fn get_shrink(self: &Self) -> bool
Return whether NFA shrinking is enabled.
fn get_captures(self: &Self) -> bool
Return whether NFA compilation is configured to produce capture states.
fn get_which_captures(self: &Self) -> WhichCaptures
Return what kinds of capture states will be compiled into an NFA.
fn get_look_matcher(self: &Self) -> LookMatcher
Return the look-around matcher for this NFA.
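The setters above take the configuration by value and return it, while the getters borrow it, so settings can be chained and then read back. Here is a minimal sketch of that shape using a stand-in type. `SketchConfig` is hypothetical and is not this crate's Config; it mirrors only four of the settings, with the defaults documented above (UTF-8 mode enabled, reverse disabled, no size limit, shrinking disabled).

```rust
/// A stand-in for a by-value builder configuration (hypothetical).
#[derive(Clone, Debug)]
struct SketchConfig {
    utf8: bool,
    reverse: bool,
    nfa_size_limit: Option<usize>,
    shrink: bool,
}

impl SketchConfig {
    /// Start from the documented defaults.
    fn new() -> SketchConfig {
        SketchConfig {
            utf8: true,
            reverse: false,
            nfa_size_limit: None,
            shrink: false,
        }
    }

    // Each setter consumes the configuration and returns the updated value,
    // which is what makes chaining possible.
    fn utf8(mut self, yes: bool) -> SketchConfig {
        self.utf8 = yes;
        self
    }

    fn reverse(mut self, yes: bool) -> SketchConfig {
        self.reverse = yes;
        self
    }

    fn nfa_size_limit(mut self, bytes: Option<usize>) -> SketchConfig {
        self.nfa_size_limit = bytes;
        self
    }

    fn shrink(mut self, yes: bool) -> SketchConfig {
        self.shrink = yes;
        self
    }

    // Getters only borrow the configuration.
    fn get_utf8(&self) -> bool {
        self.utf8
    }

    fn get_reverse(&self) -> bool {
        self.reverse
    }

    fn get_nfa_size_limit(&self) -> Option<usize> {
        self.nfa_size_limit
    }

    fn get_shrink(&self) -> bool {
        self.shrink
    }
}

fn main() {
    // The defaults match what is documented for each setting.
    let default = SketchConfig::new();
    assert!(default.get_utf8());
    assert!(!default.get_reverse());
    assert_eq!(default.get_nfa_size_limit(), None);
    assert!(!default.get_shrink());

    // Settings are chained, and untouched settings keep their defaults.
    let config = SketchConfig::new()
        .reverse(true)
        .shrink(true)
        .nfa_size_limit(Some(500_000));
    assert!(config.get_utf8());
    assert!(config.get_reverse());
    assert!(config.get_shrink());
    assert_eq!(config.get_nfa_size_limit(), Some(500_000));

    // Turning UTF-8 mode off is a single call on the same chain.
    assert!(!config.utf8(false).get_utf8());
}
```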
impl Clone for Config
fn clone(self: &Self) -> Config
impl Debug for Config
fn fmt(self: &Self, f: &mut Formatter<'_>) -> Result
impl Default for Config
fn default() -> Config
impl Freeze for Config
impl RefUnwindSafe for Config
impl Send for Config
impl Sync for Config
impl Unpin for Config
impl UnsafeUnpin for Config
impl UnwindSafe for Config
impl<T> Any for Config
fn type_id(self: &Self) -> TypeId
impl<T> Borrow for Config
fn borrow(self: &Self) -> &T
impl<T> BorrowMut for Config
fn borrow_mut(self: &mut Self) -> &mut T
impl<T> CloneToUninit for Config
unsafe fn clone_to_uninit(self: &Self, dest: *mut u8)
impl<T> From for Config
fn from(t: T) -> T
Returns the argument unchanged.
impl<T> ToOwned for Config
fn to_owned(self: &Self) -> T
fn clone_into(self: &Self, target: &mut T)
impl<T, U> Into for Config
fn into(self: Self) -> U
Calls U::from(self).
That is, this conversion is whatever the implementation of From<T> for U chooses to do.
impl<T, U> TryFrom for Config
fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>
impl<T, U> TryInto for Config
fn try_into(self: Self) -> Result<U, <U as TryFrom<T>>::Error>