Struct Hir
struct Hir { ... }
A high-level intermediate representation (HIR) for a regular expression.
An HIR value is a combination of a HirKind and a set of Properties.
An HirKind indicates what kind of regular expression it is (a literal,
a repetition, a look-around assertion, etc.), whereas a Properties
describes various facts about the regular expression, such as whether
it matches UTF-8 or whether it matches the empty string.
The HIR of a regular expression represents an intermediate step between
its abstract syntax (a structured description of the concrete syntax) and
an actual regex matcher. The purpose of HIR is to make regular expressions
easier to analyze. In particular, the AST is much more complex than the
HIR. For example, while an AST supports arbitrarily nested character
classes, the HIR will flatten all nested classes into a single set. The HIR
will also "compile away" every flag present in the concrete syntax. For
example, users of HIR expressions never need to worry about case folding;
it is handled automatically by the translator (e.g., by translating
(?i:A) to [aA]).
The specific type of an HIR expression can be accessed via its kind
or into_kind methods. This extra level of indirection exists for two
reasons:
- Construction of an HIR expression must use the constructor methods on
  this Hir type instead of building the HirKind values directly. This
  permits construction to enforce invariants like "concatenations always
  consist of two or more sub-expressions."
- Every HIR expression contains attributes that are defined inductively,
  and can be computed cheaply during the construction process. For example,
  one such attribute is whether the expression must match at the beginning
  of the haystack.
In particular, if you have an HirKind value, then there is intentionally
no way to build an Hir value from it. You instead need to do case
analysis on the HirKind value and build the Hir value using its smart
constructors.
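The smart-constructor pattern described above can be sketched with a toy model. This is not the crate's implementation: `Node`, `NodeKind`, `Props` and the `min_len` attribute are hypothetical names chosen for illustration. The sketch only shows how a private kind field plus constructor functions can enforce the "two or more sub-expressions" invariant and compute an inductive attribute during construction.

```rust
// Toy model only: `Node`, `NodeKind` and `Props` are hypothetical names,
// not the types from the documented crate.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeKind {
    Empty,
    Literal(Vec<u8>),
    Concat(Vec<Node>),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Props {
    /// Length in bytes of the shortest possible match.
    pub min_len: usize,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Node {
    kind: NodeKind, // private: callers must go through the constructors
    props: Props,
}

impl Node {
    pub fn empty() -> Node {
        Node { kind: NodeKind::Empty, props: Props { min_len: 0 } }
    }

    pub fn literal(bytes: &[u8]) -> Node {
        if bytes.is_empty() {
            return Node::empty();
        }
        Node {
            kind: NodeKind::Literal(bytes.to_vec()),
            props: Props { min_len: bytes.len() },
        }
    }

    /// Enforces "a concatenation has two or more sub-expressions".
    pub fn concat(mut subs: Vec<Node>) -> Node {
        match subs.len() {
            0 => Node::empty(),
            1 => subs.pop().unwrap(),
            _ => {
                // Computed inductively from the children's properties.
                let min_len: usize =
                    subs.iter().map(|n| n.props.min_len).sum();
                Node { kind: NodeKind::Concat(subs), props: Props { min_len } }
            }
        }
    }

    pub fn kind(&self) -> &NodeKind {
        &self.kind
    }

    pub fn into_kind(self) -> NodeKind {
        self.kind
    }

    pub fn properties(&self) -> &Props {
        &self.props
    }
}
```

In this sketch, a caller holding a `NodeKind` cannot turn it back into a `Node` without matching on it and calling a constructor, which mirrors the restriction described above.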
UTF-8
If the HIR was produced by a translator with
TranslatorBuilder::utf8 enabled,
then the HIR is guaranteed to match UTF-8 exclusively for all non-empty
matches.
Empty matches, however, can occur at any position. It is the responsibility of the regex engine to determine whether empty matches are permitted between the code units of a single codepoint.
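As a rough illustration of the check a regex engine might perform, the sketch below uses only the standard library. It assumes the haystack is a valid UTF-8 `&str`, and `splits_codepoint` is a hypothetical helper name, not part of any documented API.

```rust
/// Returns true if an empty match at byte offset `at` would fall between
/// the code units of a single codepoint in `haystack`.
///
/// Hypothetical helper for illustration; offsets past the end of the
/// haystack are reported as `false` here.
pub fn splits_codepoint(haystack: &str, at: usize) -> bool {
    at <= haystack.len() && !haystack.is_char_boundary(at)
}
```

For the three-byte snowman "☃", offsets 0 and 3 are codepoint boundaries, while offsets 1 and 2 split the codepoint. An engine that wants UTF-8 semantics would reject empty matches at the latter two.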
Stack space
This type defines its own destructor that uses constant stack space and heap space proportional to the size of the HIR.
Also, an Hir's fmt::Display implementation prints an HIR as a regular
expression pattern string, and uses constant stack space and heap space
proportional to the size of the Hir. The regex it prints is guaranteed to
be semantically equivalent to the original concrete syntax, but it may
look very different (and may not be practically readable by a human).
An Hir's fmt::Debug implementation currently does not use constant
stack space. To keep its output less noisy, the implementation also
suppresses some details, such as the Properties inlined into every Hir
value.
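A destructor with constant stack usage is commonly written with an explicit heap-allocated work list instead of recursion. The sketch below shows that general technique on a hypothetical `Tree` type. It is an assumption about the style of implementation, not the crate's actual code.

```rust
use std::mem;

// Hypothetical toy tree, not the crate's `Hir`.
pub enum Tree {
    Leaf,
    Wrap(Box<Tree>),
    Many(Vec<Tree>),
}

impl Drop for Tree {
    fn drop(&mut self) {
        // Fast path: nothing nested, so the default field drop is shallow.
        match &*self {
            Tree::Leaf => return,
            Tree::Wrap(child) if matches!(**child, Tree::Leaf) => return,
            Tree::Many(children) if children.is_empty() => return,
            _ => {}
        }
        // Move the whole tree onto a heap-allocated work list.
        let mut stack = vec![mem::replace(self, Tree::Leaf)];
        while let Some(mut node) = stack.pop() {
            match &mut node {
                Tree::Leaf => {}
                Tree::Wrap(child) => {
                    stack.push(mem::replace(&mut **child, Tree::Leaf));
                }
                Tree::Many(children) => stack.extend(children.drain(..)),
            }
            // `node` is dropped here. Its children were detached above, so
            // this drop hits the fast path and never recurses deeply.
        }
    }
}

/// Builds a chain of `depth` nested `Wrap` nodes without recursion.
pub fn deep_chain(depth: usize) -> Tree {
    let mut tree = Tree::Leaf;
    for _ in 0..depth {
        tree = Tree::Wrap(Box::new(tree));
    }
    tree
}
```

The work list grows with the size of the tree, which corresponds to the "heap space proportional to the size" trade-off mentioned above.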
Implementations
impl Hir
fn kind(self: &Self) -> &HirKind
Returns a reference to the underlying HIR kind.
fn into_kind(self: Self) -> HirKind
Consumes ownership of this HIR expression and returns its underlying
HirKind.
fn properties(self: &Self) -> &Properties
Returns the properties computed for this Hir.
impl Hir
fn empty() -> Hir
Returns an empty HIR expression.
An empty HIR expression always matches, including the empty string.
fn fail() -> Hir
Returns an HIR expression that can never match anything. That is, the size of the set of strings in the language described by the HIR returned is 0.
This is distinct from Hir::empty in that the empty string matches the HIR returned by Hir::empty. That is, the set of strings in the language described by Hir::empty is non-empty.
Note that currently, the HIR returned uses an empty character class to indicate that nothing can match. An equivalent expression that cannot match is an empty alternation, but all such "fail" expressions are normalized (via smart constructors) to empty character classes. This is because empty character classes can be spelled in the concrete syntax of a regex (e.g., \P{any} or (?-u:[^\x00-\xFF]) or [a&&b]), but empty alternations cannot.
fn literal<B: Into<Box<[u8]>>>(lit: B) -> Hir
Creates a literal HIR expression.
This accepts anything that can be converted into a Box<[u8]>.
Note that there is no mechanism for storing a char or a Box<str> in an HIR. Everything is "just bytes." Whether a Literal (or any HIR node) matches valid UTF-8 exclusively can be queried via Properties::is_utf8.
Example
This example shows that concatenations of Literal HIR values will automatically get flattened and combined together. So for example, even if you concat multiple Literal values that are themselves not valid UTF-8, they might add up to valid UTF-8. This also demonstrates just how "smart" Hir's smart constructors are.
    use regex_syntax::hir::{Hir, HirKind, Literal};
    let literals = vec![
        Hir::literal([0xE2]),
        Hir::literal([0x98]),
        Hir::literal([0x83]),
    ];
    // Each literal, on its own, is invalid UTF-8.
    assert!(literals.iter().all(|hir| !hir.properties().is_utf8()));
    let concat = Hir::concat(literals);
    // But the concatenation is valid UTF-8!
    assert!(concat.properties().is_utf8());
    // And also notice that the literals have been concatenated into a
    // single `Literal`, to the point where there is no explicit `Concat`!
    let expected = HirKind::Literal(Literal(Box::from("☃".as_bytes())));
    assert_eq!(&expected, concat.kind());
Example: building a literal from a char
This example shows how to build a single Hir literal from a char value. Since a Literal is just bytes, we just need to UTF-8 encode a char value:
    use regex_syntax::hir::{Hir, HirKind, Literal};
    let ch = '☃';
    let got = Hir::literal(ch.encode_utf8(&mut [0; 4]).as_bytes());
    let expected = HirKind::Literal(Literal(Box::from("☃".as_bytes())));
    assert_eq!(&expected, got.kind());
fn class(class: Class) -> Hir
Creates a class HIR expression. The class may either be defined over ranges of Unicode codepoints or ranges of raw byte values.
Note that an empty class is permitted. An empty class is equivalent to Hir::fail().
fn look(look: Look) -> Hir
Creates a look-around assertion HIR expression.
fn repetition(rep: Repetition) -> Hir
Creates a repetition HIR expression.
fn capture(capture: Capture) -> Hir
Creates a capture HIR expression.
Note that there is no explicit HIR value for a non-capturing group. Since a non-capturing group only exists to override precedence in the concrete syntax and since an HIR already does its own grouping based on what is parsed, there is no need to explicitly represent non-capturing groups in the HIR.
fn concat(subs: Vec<Hir>) -> Hir
Returns the concatenation of the given expressions.
This attempts to flatten and simplify the concatenation as appropriate.
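To make "flatten and simplify" more concrete, here is a minimal sketch of one way such a constructor could work on a toy expression type. `Expr` and this free-standing `concat` function are hypothetical and far simpler than the real constructor. The sketch only splices nested concatenations into their parent and merges adjacent literals.

```rust
// Toy expression type for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Expr {
    Literal(Vec<u8>),
    Concat(Vec<Expr>),
}

/// Hypothetical flattening constructor: nested concatenations are spliced
/// into the parent, and neighbouring literals are merged into one.
pub fn concat(subs: Vec<Expr>) -> Expr {
    let mut flat: Vec<Expr> = Vec::new();
    for sub in subs {
        // Normalize nested concatenations by re-running the constructor.
        let parts = match sub {
            Expr::Concat(inner) => match concat(inner) {
                Expr::Concat(v) => v,
                other => vec![other],
            },
            other => vec![other],
        };
        for part in parts {
            match part {
                Expr::Literal(next) => {
                    if let Some(Expr::Literal(prev)) = flat.last_mut() {
                        // Merge with the literal that precedes it.
                        prev.extend(next);
                    } else {
                        flat.push(Expr::Literal(next));
                    }
                }
                other => flat.push(other),
            }
        }
    }
    // A single remaining element needs no `Concat` wrapper. (This toy
    // leaves the empty case as an empty `Concat`.)
    if flat.len() == 1 {
        flat.pop().unwrap()
    } else {
        Expr::Concat(flat)
    }
}
```

With this sketch, concatenating `Concat([a])` and `Concat([b, c])` yields the single literal `abc` with no `Concat` node left over.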
Example
This shows a simple example of basic flattening of both concatenations and literals.
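    use regex_syntax::hir::Hir;
    let hir = Hir::concat(vec![
        Hir::concat(vec![Hir::literal([b'a'])]),
        Hir::concat(vec![Hir::literal([b'b'])]),
        Hir::concat(vec![Hir::literal([b'c'])]),
    ]);
    let expected = Hir::literal("abc".as_bytes());
    assert_eq!(expected, hir);
fn alternation(subs: Vec<Hir>) -> Hir
Returns the alternation of the given expressions.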
This flattens and simplifies the alternation as appropriate. This may include factoring out common prefixes or even rewriting the alternation as a character class.
Note that an empty alternation is equivalent to Hir::fail(). (It is not possible for one to write an empty alternation, or even an alternation with a single sub-expression, in the concrete syntax of a regex.)
Example
This is a simple example showing how an alternation might get simplified.
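    use regex_syntax::hir::{Class, ClassUnicode, ClassUnicodeRange, Hir};
    let hir = Hir::alternation(vec![
        Hir::literal([b'a']),
        Hir::literal([b'b']),
        Hir::literal([b'c']),
        Hir::literal([b'd']),
        Hir::literal([b'e']),
        Hir::literal([b'f']),
    ]);
    let expected = Hir::class(Class::Unicode(ClassUnicode::new([
        ClassUnicodeRange::new('a', 'f'),
    ])));
    assert_eq!(expected, hir);
And another example showing how common prefixes might get factored out.
    use regex_syntax::hir::{Class, ClassUnicode, ClassUnicodeRange, Hir};
    let hir = Hir::alternation(vec![
        Hir::literal("abc".as_bytes()),
        Hir::literal("abd".as_bytes()),
    ]);
    let expected = Hir::concat(vec![
        Hir::literal("ab".as_bytes()),
        Hir::class(Class::Unicode(ClassUnicode::new([
            ClassUnicodeRange::new('c', 'd'),
        ]))),
    ]);
    assert_eq!(expected, hir);
Note that these sorts of simplifications are not guaranteed.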
fn dot(dot: Dot) -> Hir
Returns an HIR expression for `.`.
- Dot::AnyChar maps to (?su-R:.).
- Dot::AnyByte maps to (?s-Ru:.).
- Dot::AnyCharExceptLF maps to (?u-Rs:.).
- Dot::AnyCharExceptCRLF maps to (?Ru-s:.).
- Dot::AnyByteExceptLF maps to (?-Rsu:.).
- Dot::AnyByteExceptCRLF maps to (?R-su:.).
Example
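Note that this is a convenience routine for constructing the correct character class based on the value of Dot. There is no explicit "dot" HIR value. It is just an abbreviation for a common character class.
    use regex_syntax::hir::{Class, ClassBytes, ClassBytesRange, Dot, Hir};
    let hir = Hir::dot(Dot::AnyByte);
    let expected = Hir::class(Class::Bytes(ClassBytes::new([
        ClassBytesRange::new(0x00, 0xFF),
    ])));
    assert_eq!(expected, hir);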
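For the byte-oriented variants, the resulting class can be pictured as the full byte range with the line terminators removed. The standard-library-only sketch below computes those ranges. `byte_ranges_except` is a hypothetical helper, and the sketch makes no claim about how the crate builds its classes internally.

```rust
/// Returns the inclusive byte ranges covering every byte not in `excluded`.
/// Hypothetical helper for illustration.
pub fn byte_ranges_except(excluded: &[u8]) -> Vec<(u8, u8)> {
    let mut ranges = Vec::new();
    let mut start: Option<u8> = None;
    for b in 0..=255u8 {
        if excluded.contains(&b) {
            // Close the range that was open just before this byte.
            if let Some(s) = start.take() {
                ranges.push((s, b - 1));
            }
        } else if start.is_none() {
            start = Some(b);
        }
    }
    // Close the final range if it runs to the last byte.
    if let Some(s) = start {
        ranges.push((s, 255));
    }
    ranges
}
```

Excluding only `\n` (as for "any byte except LF") gives the ranges 0x00-0x09 and 0x0B-0xFF. Excluding both `\r` and `\n` (as for "any byte except CRLF") gives 0x00-0x09, 0x0B-0x0C and 0x0E-0xFF.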
impl Clone for Hir
fn clone(self: &Self) -> Hir
impl Debug for Hir
fn fmt(self: &Self, f: &mut Formatter<'_>) -> Result
impl Display for Hir
fn fmt(self: &Self, f: &mut Formatter<'_>) -> Result
impl Drop for Hir
fn drop(self: &mut Self)
impl Eq for Hir
impl Freeze for Hir
impl PartialEq for Hir
fn eq(self: &Self, other: &Hir) -> bool
impl RefUnwindSafe for Hir
impl Send for Hir
impl StructuralPartialEq for Hir
impl Sync for Hir
impl Unpin for Hir
impl UnsafeUnpin for Hir
impl UnwindSafe for Hir
impl<T> Any for Hir
fn type_id(self: &Self) -> TypeId
impl<T> Borrow for Hir
fn borrow(self: &Self) -> &T
impl<T> BorrowMut for Hir
fn borrow_mut(self: &mut Self) -> &mut T
impl<T> CloneToUninit for Hir
unsafe fn clone_to_uninit(self: &Self, dest: *mut u8)
impl<T> From for Hir
fn from(t: T) -> T
Returns the argument unchanged.
impl<T> ToOwned for Hir
fn to_owned(self: &Self) -> T
fn clone_into(self: &Self, target: &mut T)
impl<T> ToString for Hir
fn to_string(self: &Self) -> String
impl<T, U> Into for Hir
fn into(self: Self) -> U
Calls U::from(self).
That is, this conversion is whatever the implementation of From<T> for U chooses to do.
impl<T, U> TryFrom for Hir
fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>
impl<T, U> TryInto for Hir
fn try_into(self: Self) -> Result<U, <U as TryFrom<T>>::Error>