Struct Extractor
struct Extractor { ... }
Extracts prefix or suffix literal sequences from Hir expressions.
Literal extraction is based on the following observations:
- Many regexes start with one or a small number of literals.
- Substring search for literals is often much faster (sometimes by an order of magnitude) than a regex search.
Thus, in many cases, one can search for literals to find candidate starting locations of a match, and then only run the full regex engine at each such location instead of over the full haystack.
The main downside of literal extraction is that it can wind up causing a search to be slower overall. For example, if there are many matches or if there are many candidates that don't ultimately lead to a match, then a lot of overhead will be spent in shuffing back-and-forth between substring search and the regex engine. This is the fundamental reason why literal optimizations for regex patterns is sometimes considered a "black art."
Look-around assertions
Literal extraction treats all look-around assertions as-if they match every
empty string. So for example, the regex \bquux\b will yield a sequence
containing a single exact literal quux. However, not all occurrences
of quux correspond to a match a of the regex. For example, \bquux\b
does not match ZquuxZ anywhere because quux does not fall on a word
boundary.
In effect, if your regex contains look-around assertions, then a match of an exact literal does not necessarily mean the regex overall matches. So you may still need to run the regex engine in such cases to confirm the match.
The precise guarantee you get from a literal sequence is: if every literal in the sequence is exact and the original regex contains zero look-around assertions, then a preference-order multi-substring search of those literals will precisely match a preference-order search of the original regex.
Example
This shows how to extract prefixes:
use ;
let hir = parse?;
let got = new.extract;
// All literals returned are "inexact" because none of them reach the
// match state.
let expected = from_iter;
assert_eq!;
# Ok::
This shows how to extract suffixes:
use ;
let hir = parse?;
let got = new.kind.extract;
// Since 'foo' gets to a match state, it is considered exact. But 'bar'
// does not because of the '[A-Z]+', and thus is marked inexact.
let expected = from_iter;
assert_eq!;
# Ok::
Implementations
impl Extractor
fn new() -> ExtractorCreate a new extractor with a default configuration.
The extractor can be optionally configured before calling
Extractor::extractto get a literal sequence.fn extract(self: &Self, hir: &Hir) -> SeqExecute the extractor and return a sequence of literals.
fn kind(self: &mut Self, kind: ExtractKind) -> &mut ExtractorSet the kind of literal sequence to extract from an
Hirexpression.The default is to extract prefixes, but suffixes can be selected instead. The contract for prefixes is that every match of the corresponding
Hirmust start with one of the literals in the sequence returned. Moreover, the order of the sequence returned corresponds to the preference order.Suffixes satisfy a similar contract in that every match of the corresponding
Hirmust end with one of the literals in the sequence returned. However, there is no guarantee that the literals are in preference order.Remember that a sequence can be infinite. For example, unless the limits are configured to be impractically large, attempting to extract prefixes (or suffixes) for the pattern
[A-Z]will return an infinite sequence. Generally speaking, if the sequence returned is infinite, then it is presumed to be unwise to do prefix (or suffix) optimizations for the pattern.fn limit_class(self: &mut Self, limit: usize) -> &mut ExtractorConfigure a limit on the length of the sequence that is permitted for a character class. If a character class exceeds this limit, then the sequence returned for it is infinite.
This prevents classes like
[A-Z]or\pLfrom getting turned into huge and likely unproductive sequences of literals.Example
This example shows how this limit can be lowered to decrease the tolerance for character classes being turned into literal sequences.
use ; let hir = parse?; let got = new.extract; let expected = new; assert_eq!; // Now let's shrink the limit and see how that changes things. let got = new.limit_class.extract; let expected = infinite; assert_eq!; # Ok::fn limit_repeat(self: &mut Self, limit: usize) -> &mut ExtractorConfigure a limit on the total number of repetitions that is permitted before literal extraction is stopped.
This is useful for limiting things like
(abcde){50}, or more insidiously,(?:){1000000000}. This limit prevents any one single repetition from adding too much to a literal sequence.With this limit set, repetitions that exceed it will be stopped and any literals extracted up to that point will be made inexact.
Example
This shows how to decrease the limit and compares it with the default.
use ; let hir = parse?; let got = new.extract; let expected = new; assert_eq!; // Now let's shrink the limit and see how that changes things. let got = new.limit_repeat.extract; let expected = from_iter; assert_eq!; # Ok::fn limit_literal_len(self: &mut Self, limit: usize) -> &mut ExtractorConfigure a limit on the maximum length of any literal in a sequence.
This is useful for limiting things like
(abcde){5}{5}{5}{5}. While each repetition or literal in that regex is small, when all the repetitions are applied, one ends up with a literal of length5^4 = 625.With this limit set, literals that exceed it will be made inexact and thus prevented from growing.
Example
This shows how to decrease the limit and compares it with the default.
use ; let hir = parse?; let got = new.extract; let expected = new; assert_eq!; // Now let's shrink the limit and see how that changes things. let got = new.limit_literal_len.extract; let expected = from_iter; assert_eq!; # Ok::fn limit_total(self: &mut Self, limit: usize) -> &mut ExtractorConfigure a limit on the total number of literals that will be returned.
This is useful as a practical measure for avoiding the creation of large sequences of literals. While the extractor will automatically handle local creations of large sequences (for example,
[A-Z]yields an infinite sequence by default), large sequences can be created through non-local means as well.For example,
[ab]{3}{3}would yield a sequence of length512 = 2^9despite each of the repetitions being small on their own. This limit thus represents a "catch all" for avoiding locally small sequences from combining into large sequences.Example
This example shows how reducing the limit will change the literal sequence returned.
use ; let hir = parse?; let got = new.extract; let expected = new; assert_eq!; // The default limit is not too big, but big enough to extract all // literals from '[ab]{2}{2}'. If we shrink the limit to less than 16, // then we'll get a truncated set. Notice that it returns a sequence of // length 4 even though our limit was 10. This is because the sequence // is difficult to increase without blowing the limit. Notice also // that every literal in the sequence is now inexact because they were // stripped of some suffix. let got = new.limit_total.extract; let expected = from_iter; assert_eq!; # Ok::
impl Clone for Extractor
fn clone(self: &Self) -> Extractor
impl Debug for Extractor
fn fmt(self: &Self, f: &mut Formatter<'_>) -> Result
impl Default for Extractor
fn default() -> Extractor
impl Freeze for Extractor
impl RefUnwindSafe for Extractor
impl Send for Extractor
impl Sync for Extractor
impl Unpin for Extractor
impl UnsafeUnpin for Extractor
impl UnwindSafe for Extractor
impl<T> Any for Extractor
fn type_id(self: &Self) -> TypeId
impl<T> Borrow for Extractor
fn borrow(self: &Self) -> &T
impl<T> BorrowMut for Extractor
fn borrow_mut(self: &mut Self) -> &mut T
impl<T> CloneToUninit for Extractor
unsafe fn clone_to_uninit(self: &Self, dest: *mut u8)
impl<T> From for Extractor
fn from(t: T) -> TReturns the argument unchanged.
impl<T> ToOwned for Extractor
fn to_owned(self: &Self) -> Tfn clone_into(self: &Self, target: &mut T)
impl<T, U> Into for Extractor
fn into(self: Self) -> UCalls
U::from(self).That is, this conversion is whatever the implementation of
[From]<T> for Uchooses to do.
impl<T, U> TryFrom for Extractor
fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>
impl<T, U> TryInto for Extractor
fn try_into(self: Self) -> Result<U, <U as TryFrom<T>>::Error>