Struct AhoCorasickBuilder
struct AhoCorasickBuilder { ... }
A builder for configuring an Aho-Corasick automaton.
Quick advice
- Use
AhoCorasickBuilder::match_kindto configure your searcher withMatchKind::LeftmostFirstif you want to match how backtracking regex engines execute searches forpat1|pat2|..|patN. UseMatchKind::LeftmostLongestif you want to match how POSIX regex engines do it. - If you need an anchored search, use
AhoCorasickBuilder::start_kindto set theStartKind::Anchoredmode sinceStartKind::Unanchoredis the default. Or just useStartKind::Bothto support both types of searches. - You might want to use
AhoCorasickBuilder::kindto set your searcher to always use aAhoCorasickKind::DFAif search speed is critical and memory usage isn't a concern. Otherwise, not setting a kind will probably make the right choice for you. Beware that if you useStartKind::Bothto build a searcher that supports both unanchored and anchored searches and you setAhoCorasickKind::DFA, then the DFA will essentially be duplicated to support both simultaneously. This results in very high memory usage. - For all other options, their defaults are almost certainly what you want.
Implementations
impl AhoCorasickBuilder
fn new() -> AhoCorasickBuilderCreate a new builder for configuring an Aho-Corasick automaton.
The builder provides a way to configure a number of things, including ASCII case insensitivity and what kind of match semantics are used.
fn build<I, P>(self: &Self, patterns: I) -> Result<AhoCorasick, BuildError> where I: IntoIterator<Item = P>, P: AsRef<[u8]>Build an Aho-Corasick automaton using the configuration set on this builder.
A builder may be reused to create more automatons.
Examples
Basic usage:
use ; let patterns = &; let ac = new.build.unwrap; assert_eq!;fn match_kind(self: &mut Self, kind: MatchKind) -> &mut AhoCorasickBuilderSet the desired match semantics.
The default is
MatchKind::Standard, which corresponds to the match semantics supported by the standard textbook description of the Aho-Corasick algorithm. Namely, matches are reported as soon as they are found. Moreover, this is the only way to get overlapping matches or do stream searching.The other kinds of match semantics that are supported are
MatchKind::LeftmostFirstandMatchKind::LeftmostLongest. The former corresponds to the match you would get if you were to try to match each pattern at each position in the haystack in the same order that you give to the automaton. That is, it returns the leftmost match corresponding to the earliest pattern given to the automaton. The latter corresponds to finding the longest possible match among all leftmost matches.For more details on match semantics, see the documentation for
MatchKind.Note that setting this to
MatchKind::LeftmostFirstorMatchKind::LeftmostLongestwill cause some search routines onAhoCorasickto return an error (or panic if you're using the infallible API). Notably, this includes stream and overlapping searches.Examples
In these examples, we demonstrate the differences between match semantics for a particular set of patterns in a specific order:
b,abc,abcd.Standard semantics:
use ; let patterns = &; let haystack = "abcd"; let ac = builder .match_kind // default, not necessary .build .unwrap; let mat = ac.find.expect; assert_eq!;Leftmost-first semantics:
use ; let patterns = &; let haystack = "abcd"; let ac = builder .match_kind .build .unwrap; let mat = ac.find.expect; assert_eq!;Leftmost-longest semantics:
use ; let patterns = &; let haystack = "abcd"; let ac = builder .match_kind .build .unwrap; let mat = ac.find.expect; assert_eq!;fn start_kind(self: &mut Self, kind: StartKind) -> &mut AhoCorasickBuilderSets the starting state configuration for the automaton.
Every Aho-Corasick automaton is capable of having two start states: one that is used for unanchored searches and one that is used for anchored searches. Some automatons, like the NFAs, support this with almost zero additional cost. Other automatons, like the DFA, require two copies of the underlying transition table to support both simultaneously.
Because there may be an added non-trivial cost to supporting both, it is possible to configure which starting state configuration is needed.
Indeed, since anchored searches tend to be somewhat more rare, only unanchored searches are supported by default. Thus,
StartKind::Unanchoredis the default.Note that when this is set to
StartKind::Unanchored, then running an anchored search will result in an error (or a panic if using the infallible APIs). Similarly, when this is set toStartKind::Anchored, then running an unanchored search will result in an error (or a panic if using the infallible APIs). WhenStartKind::Bothis used, then both unanchored and anchored searches are always supported.Also note that even if an
AhoCorasicksearcher is using an NFA internally (which always supports both unanchored and anchored searches), an error will still be reported for a search that isn't supported by the configuration set via this method. This means, for example, that an error is never dependent on which internal implementation of Aho-Corasick is used.Example: anchored search
This shows how to build a searcher that only supports anchored searches:
use ; let ac = builder .match_kind .start_kind .build .unwrap; // An unanchored search is not supported! An error here is guaranteed // given the configuration above regardless of which kind of // Aho-Corasick implementation ends up being used internally. let input = new.anchored; assert!; let input = new.anchored; assert_eq!; let input = new.anchored; assert_eq!; # Ok::Example: unanchored and anchored searches
This shows how to build a searcher that supports both unanchored and anchored searches:
use ; let ac = builder .match_kind .start_kind .build .unwrap; let input = new.anchored; assert_eq!; let input = new.anchored; assert_eq!; let input = new.anchored; assert_eq!; # Ok::fn ascii_case_insensitive(self: &mut Self, yes: bool) -> &mut AhoCorasickBuilderEnable ASCII-aware case insensitive matching.
When this option is enabled, searching will be performed without respect to case for ASCII letters (
a-zandA-Z) only.Enabling this option does not change the search algorithm, but it may increase the size of the automaton.
NOTE: It is unlikely that support for Unicode case folding will be added in the future. The ASCII case works via a simple hack to the underlying automaton, but full Unicode handling requires a fair bit of sophistication. If you do need Unicode handling, you might consider using the
regexcrate or the lower levelregex-automatacrate.Examples
Basic usage:
use AhoCorasick; let patterns = &; let haystack = "foo bar baz"; let ac = builder .ascii_case_insensitive .build .unwrap; assert_eq!;fn kind(self: &mut Self, kind: Option<AhoCorasickKind>) -> &mut AhoCorasickBuilderChoose the type of underlying automaton to use.
Currently, there are four choices:
AhoCorasickKind::NoncontiguousNFAinstructs the searcher to use anoncontiguous::NFA. A noncontiguous NFA is the fastest to be built, has moderate memory usage and is typically the slowest to execute a search.AhoCorasickKind::ContiguousNFAinstructs the searcher to use acontiguous::NFA. A contiguous NFA is a little slower to build than a noncontiguous NFA, has excellent memory usage and is typically a little slower than a DFA for a search.AhoCorasickKind::DFAinstructs the searcher to use adfa::DFA. A DFA is very slow to build, uses exorbitant amounts of memory, but will typically execute searches the fastest.None(the default) instructs the searcher to choose the "best" Aho-Corasick implementation. This choice is typically based primarily on the number of patterns.
Setting this configuration does not change the time complexity for constructing the Aho-Corasick automaton (which is
O(p)wherepis the total number of patterns being compiled). Setting this toAhoCorasickKind::DFAdoes however reduce the time complexity of non-overlapping searches fromO(n + p)toO(n), wherenis the length of the haystack.In general, you should probably stick to the default unless you have some kind of reason to use a specific Aho-Corasick implementation. For example, you might choose
AhoCorasickKind::DFAif you don't care about memory usage and want the fastest possible search times.Setting this guarantees that the searcher returned uses the chosen implementation. If that implementation could not be constructed, then an error will be returned. In contrast, when
Noneis used, it is possible for it to attempt to construct, for example, a contiguous NFA and have it fail. In which case, it will fall back to using a noncontiguous NFA.If
Noneis given, then one may useAhoCorasick::kindto determine which Aho-Corasick implementation was chosen.Note that the heuristics used for choosing which
AhoCorasickKindmay be changed in a semver compatible release.fn prefilter(self: &mut Self, yes: bool) -> &mut AhoCorasickBuilderEnable heuristic prefilter optimizations.
When enabled, searching will attempt to quickly skip to match candidates using specialized literal search routines. A prefilter cannot always be used, and is generally treated as a heuristic. It can be useful to disable this if the prefilter is observed to be sub-optimal for a particular workload.
Currently, prefilters are typically only active when building searchers with a small (less than 100) number of patterns.
This is enabled by default.
fn dense_depth(self: &mut Self, depth: usize) -> &mut AhoCorasickBuilderSet the limit on how many states use a dense representation for their transitions. Other states will generally use a sparse representation.
A dense representation uses more memory but is generally faster, since the next transition in a dense representation can be computed in a constant number of instructions. A sparse representation uses less memory but is generally slower, since the next transition in a sparse representation requires executing a variable number of instructions.
This setting is only used when an Aho-Corasick implementation is used that supports the dense versus sparse representation trade off. Not all do.
This limit is expressed in terms of the depth of a state, i.e., the number of transitions from the starting state of the automaton. The idea is that most of the time searching will be spent near the starting state of the automaton, so states near the start state should use a dense representation. States further away from the start state would then use a sparse representation.
By default, this is set to a low but non-zero number. Setting this to
0is almost never what you want, since it is likely to make searches very slow due to the start state itself being forced to use a sparse representation. However, it is unlikely that increasing this number will help things much, since the most active states have a small depth. More to the point, the memory usage increases superlinearly as this number increases.fn byte_classes(self: &mut Self, yes: bool) -> &mut AhoCorasickBuilderA debug settting for whether to attempt to shrink the size of the automaton's alphabet or not.
This option is enabled by default and should never be disabled unless one is debugging the underlying automaton.
When enabled, some (but not all) Aho-Corasick automatons will use a map from all possible bytes to their corresponding equivalence class. Each equivalence class represents a set of bytes that does not discriminate between a match and a non-match in the automaton.
The advantage of this map is that the size of the transition table can be reduced drastically from
#states * 256 * sizeof(u32)to#states * k * sizeof(u32)wherekis the number of equivalence classes (rounded up to the nearest power of 2). As a result, total space usage can decrease substantially. Moreover, since a smaller alphabet is used, automaton compilation becomes faster as well.WARNING: This is only useful for debugging automatons. Disabling this does not yield any speed advantages. Namely, even when this is disabled, a byte class map is still used while searching. The only difference is that every byte will be forced into its own distinct equivalence class. This is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.
impl Clone for AhoCorasickBuilder
fn clone(self: &Self) -> AhoCorasickBuilder
impl Debug for AhoCorasickBuilder
fn fmt(self: &Self, f: &mut Formatter<'_>) -> Result
impl Default for AhoCorasickBuilder
fn default() -> AhoCorasickBuilder
impl Freeze for AhoCorasickBuilder
impl RefUnwindSafe for AhoCorasickBuilder
impl Send for AhoCorasickBuilder
impl Sync for AhoCorasickBuilder
impl Unpin for AhoCorasickBuilder
impl UnsafeUnpin for AhoCorasickBuilder
impl UnwindSafe for AhoCorasickBuilder
impl<T> Any for AhoCorasickBuilder
fn type_id(self: &Self) -> TypeId
impl<T> Borrow for AhoCorasickBuilder
fn borrow(self: &Self) -> &T
impl<T> BorrowMut for AhoCorasickBuilder
fn borrow_mut(self: &mut Self) -> &mut T
impl<T> CloneToUninit for AhoCorasickBuilder
unsafe fn clone_to_uninit(self: &Self, dest: *mut u8)
impl<T> From for AhoCorasickBuilder
fn from(t: T) -> TReturns the argument unchanged.
impl<T> ToOwned for AhoCorasickBuilder
fn to_owned(self: &Self) -> Tfn clone_into(self: &Self, target: &mut T)
impl<T, U> Into for AhoCorasickBuilder
fn into(self: Self) -> UCalls
U::from(self).That is, this conversion is whatever the implementation of
[From]<T> for Uchooses to do.
impl<T, U> TryFrom for AhoCorasickBuilder
fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>
impl<T, U> TryInto for AhoCorasickBuilder
fn try_into(self: Self) -> Result<U, <U as TryFrom<T>>::Error>