Struct ByteClasses

struct ByteClasses(_)

A representation of byte-oriented equivalence classes.

It is used in a DFA to reduce the size of the transition table, which can have a particularly large impact not only on the total size of a dense DFA but also on compile times.

The essential idea here is that the alphabet of a DFA is shrunk from the usual 256 distinct byte values down to a set of equivalence classes. The guarantee you get is that any byte belonging to the same equivalence class can be treated as if it were any other byte in the same class, and the result of a search wouldn't change.
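As a rough illustration of how such a partition can arise, the following standalone sketch derives a byte-to-class map from the inclusive byte ranges a pattern distinguishes. The helper `classes_from_ranges` is hypothetical and is not part of this crate's API; it only assumes that bytes between two range boundaries never need to be told apart.

```rust
/// Hypothetical helper (not part of regex-automata): assigns a class ID to
/// every byte, given the inclusive byte ranges that a pattern distinguishes.
fn classes_from_ranges(ranges: &[(u8, u8)]) -> [u8; 256] {
    // boundary[b] is true when byte b is the last byte of its class.
    let mut boundary = [false; 256];
    for &(start, end) in ranges {
        if start > 0 {
            boundary[usize::from(start) - 1] = true;
        }
        boundary[usize::from(end)] = true;
    }
    // Walk the bytes in order and start a new class after each boundary.
    let mut classes = [0u8; 256];
    let mut class = 0u8;
    for (b, slot) in classes.iter_mut().enumerate() {
        *slot = class;
        if boundary[b] && b < 255 {
            class += 1;
        }
    }
    classes
}

fn main() {
    // For a pattern like [a-z]+, only the range 'a'..='z' matters.
    let classes = classes_from_ranges(&[(b'a', b'z')]);
    // Bytes inside the range share a class.
    assert_eq!(classes[usize::from(b'a')], classes[usize::from(b'z')]);
    // Bytes outside the range fall into different classes.
    assert_ne!(classes[usize::from(b'a')], classes[usize::from(b'A')]);
    // Three byte classes in total: below 'a', 'a'..='z', above 'z'.
    assert_eq!(usize::from(classes[255]) + 1, 3);
}
```

With this map, a transition table needs one column per class (three here) rather than one per byte value.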

Example

This example shows how to get byte classes from an NFA and ask for the class of various bytes.

use regex_automata::nfa::thompson::NFA;

let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
// 'a' and 'z' are in the same class for this regex.
assert_eq!(classes.get(b'a'), classes.get(b'z'));
// But 'a' and 'A' are not.
assert_ne!(classes.get(b'a'), classes.get(b'A'));

# Ok::<(), Box<dyn std::error::Error>>(())

Implementations

impl ByteClasses

fn empty() -> ByteClasses

Creates a new set of equivalence classes where all bytes are mapped to the same class.

fn singletons() -> ByteClasses

Creates a new set of equivalence classes where each byte belongs to its own equivalence class.

fn set(self: &mut Self, byte: u8, class: u8)

Set the equivalence class for the given byte.

fn get(self: &Self, byte: u8) -> u8

Get the equivalence class for the given byte.

fn get_by_unit(self: &Self, unit: Unit) -> usize

Get the equivalence class for the given haystack unit and return the class as a usize.

fn eoi(self: &Self) -> Unit

Create a unit that represents the "end of input" sentinel based on the number of equivalence classes.
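To make the relationship between byte classes and the sentinel concrete, here is a minimal standalone sketch. `ToyUnit`, `toy_eoi` and `class_of` are hypothetical names, not this crate's types; the sketch assumes the sentinel's class index is simply the number of byte classes, so it sits one past the last byte class.

```rust
/// Hypothetical stand-in for a haystack unit: either a real byte or the
/// "end of input" sentinel. Not the crate's `Unit` type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ToyUnit {
    Byte(u8),
    Eoi(u16),
}

/// Builds the sentinel for an alphabet with `num_byte_classes` byte classes.
fn toy_eoi(num_byte_classes: u16) -> ToyUnit {
    ToyUnit::Eoi(num_byte_classes)
}

/// Maps a unit to its class index as a `usize`.
fn class_of(classes: &[u8; 256], unit: ToyUnit) -> usize {
    match unit {
        // A byte is looked up in the byte-to-class map.
        ToyUnit::Byte(b) => usize::from(classes[usize::from(b)]),
        // The sentinel carries its own index.
        ToyUnit::Eoi(index) => usize::from(index),
    }
}

fn main() {
    // Three byte classes: below 'a', 'a'..='z', above 'z'.
    let mut classes = [0u8; 256];
    for b in 0..=255u8 {
        classes[usize::from(b)] = if b < b'a' {
            0
        } else if b <= b'z' {
            1
        } else {
            2
        };
    }
    let eoi = toy_eoi(3);
    assert_eq!(class_of(&classes, ToyUnit::Byte(b'q')), 1);
    assert_eq!(class_of(&classes, eoi), 3);
}
```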

fn alphabet_len(self: &Self) -> usize

Return the total number of elements in the alphabet represented by these equivalence classes. Equivalently, this returns the total number of equivalence classes.

fn stride2(self: &Self) -> usize

Returns the stride, as a base-2 exponent, required for these equivalence classes.

The stride is always the smallest power of 2 that is greater than or equal to the alphabet length, and the stride2 returned here is the exponent applied to 2 to get the smallest power. This is done so that converting between premultiplied state IDs and indices can be done with shifts alone, which is much faster than integer division.
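The arithmetic described above can be sketched with the standard library alone. The free functions `stride2` and `stride` below are illustrative and take the alphabet length as a plain `usize`; they are not the crate's methods.

```rust
/// Base-2 exponent of the smallest power of two that is greater than or
/// equal to `alphabet_len`. Illustrative only.
fn stride2(alphabet_len: usize) -> usize {
    alphabet_len.next_power_of_two().trailing_zeros() as usize
}

/// The stride itself: 2 raised to `stride2`.
fn stride(alphabet_len: usize) -> usize {
    1 << stride2(alphabet_len)
}

fn main() {
    // An alphabet of 4 classes already is a power of two.
    assert_eq!(stride2(4), 2);
    assert_eq!(stride(4), 4);
    // 257 classes (every byte plus the sentinel) round up to 512.
    assert_eq!(stride2(257), 9);
    assert_eq!(stride(257), 512);

    // Converting between a state index and a premultiplied state ID
    // needs only shifts, never a division.
    let s2 = stride2(4);
    let state_index = 5usize;
    let premultiplied = state_index << s2;
    assert_eq!(premultiplied, 20);
    assert_eq!(premultiplied >> s2, state_index);
}
```

The cost of rounding up is unused columns in each row of the table, which is the trade made for shift-only conversions.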

fn is_singleton(self: &Self) -> bool

Returns true if and only if every byte in this set maps to its own equivalence class. Equivalently, there are 257 equivalence classes: 256 that each contain exactly one byte, plus the singleton class containing the "end of input" sentinel.
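The count of 257 can be checked with a small toy model. `ToyClasses` below is a hypothetical stand-in, not the crate's type; it assumes class IDs are dense (every ID from 0 up to the maximum is used) and that the alphabet has one extra class for the sentinel.

```rust
/// Hypothetical stand-in for a byte class map: index is the byte, value is
/// its class ID. Not the crate's `ByteClasses`.
struct ToyClasses([u8; 256]);

impl ToyClasses {
    /// Every byte maps to the same class.
    fn empty() -> ToyClasses {
        ToyClasses([0; 256])
    }

    /// Every byte maps to its own class.
    fn singletons() -> ToyClasses {
        let mut map = [0u8; 256];
        for b in 0..=255u8 {
            map[usize::from(b)] = b;
        }
        ToyClasses(map)
    }

    /// Number of byte classes plus one for the "end of input" sentinel.
    /// Assumes class IDs are dense, so the count is the maximum ID plus one.
    fn alphabet_len(&self) -> usize {
        let max = self.0.iter().copied().max().unwrap_or(0);
        usize::from(max) + 2
    }

    /// With dense IDs, 256 byte classes means no two bytes share a class.
    fn is_singleton(&self) -> bool {
        self.alphabet_len() == 257
    }
}

fn main() {
    // One byte class plus the sentinel.
    assert_eq!(ToyClasses::empty().alphabet_len(), 2);
    assert!(!ToyClasses::empty().is_singleton());
    // 256 byte classes plus the sentinel.
    assert_eq!(ToyClasses::singletons().alphabet_len(), 257);
    assert!(ToyClasses::singletons().is_singleton());
}
```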

fn iter(self: &Self) -> ByteClassIter<'_>

Returns an iterator over all equivalence classes in this set.

fn representatives<R: core::ops::RangeBounds<u8>>(self: &Self, range: R) -> ByteClassRepresentatives<'_>

Returns an iterator over a sequence of representative bytes from each equivalence class within the range of bytes given.

When the given range is unbounded on both sides, the iterator yields exactly N items, where N is equal to the number of equivalence classes. Each item is an arbitrary representative of a distinct equivalence class.

This is useful when one is determinizing an NFA and the NFA's alphabet hasn't been converted to equivalence classes. Picking an arbitrary byte from each equivalence class then permits a full exploration of the NFA instead of using every possible byte value and thus potentially saves quite a lot of redundant work.
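The saving can be seen in a small standalone sketch that picks one byte per class from a plain byte-to-class array. The functions `lowercase_classes` and `representatives` are hypothetical helpers written for this illustration; unlike the iterator documented here, the sketch ignores the "end of input" sentinel and returns plain bytes.

```rust
use std::ops::RangeInclusive;

/// Hypothetical class map with three classes: below 'a', 'a'..='z', above 'z'.
fn lowercase_classes() -> [u8; 256] {
    let mut classes = [0u8; 256];
    for b in 0..=255u8 {
        classes[usize::from(b)] = if b < b'a' {
            0
        } else if b <= b'z' {
            1
        } else {
            2
        };
    }
    classes
}

/// Returns the first byte seen for each class within `range`.
fn representatives(classes: &[u8; 256], range: RangeInclusive<u8>) -> Vec<u8> {
    let mut seen = [false; 256];
    let mut reps = Vec::new();
    for b in range {
        let class = usize::from(classes[usize::from(b)]);
        if !seen[class] {
            seen[class] = true;
            reps.push(b);
        }
    }
    reps
}

fn main() {
    let classes = lowercase_classes();
    // Three bytes stand in for all 256 when exploring transitions.
    assert_eq!(representatives(&classes, 0..=255), vec![0, b'a', b'{']);
    // Restricting the range restricts the classes that are represented.
    assert_eq!(representatives(&classes, b'A'..=b'z'), vec![b'A', b'a']);
}
```

A determinizer built on this idea would compute transitions for three inputs per state instead of 256.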

Example

This shows what a complete sequence of representatives might look like for a real regex.

use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};

let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let reps: Vec<Unit> = classes.representatives(..).collect();
// Note that the specific byte values yielded are not guaranteed!
let expected = vec![
    Unit::u8(b'\x00'),
    Unit::u8(b'a'),
    Unit::u8(b'{'),
    Unit::eoi(3),
];
assert_eq!(expected, reps);

# Ok::<(), Box<dyn std::error::Error>>(())

Note, though, that you can ask for an arbitrary range of bytes, in which case only representatives for that range are returned:

use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};

let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let reps: Vec<Unit> = classes.representatives(b'A'..=b'z').collect();
// Note that the specific byte values yielded are not guaranteed!
let expected = vec![
    Unit::u8(b'A'),
    Unit::u8(b'a'),
];
assert_eq!(expected, reps);

# Ok::<(), Box<dyn std::error::Error>>(())

fn elements(self: &Self, class: Unit) -> ByteClassElements<'_>

Returns an iterator of the bytes in the given equivalence class.

This is useful when one needs to know the actual bytes that belong to an equivalence class. For example, conceptually speaking, accelerating a DFA state occurs when a state only has a few outgoing transitions. But in reality, what is required is that only a small number of distinct bytes can lead to an outgoing transition. The difference is that any one transition can correspond to an equivalence class, which may contain many bytes. Therefore, DFA state acceleration considers the actual elements in each equivalence class of each outgoing transition.
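The gap between "few transitions" and "few bytes" is easy to see in a standalone sketch. The helpers `lowercase_classes` and `elements` below are hypothetical and operate on a plain byte-to-class array rather than on this crate's types.

```rust
/// Hypothetical class map with three classes: below 'a', 'a'..='z', above 'z'.
fn lowercase_classes() -> [u8; 256] {
    let mut classes = [0u8; 256];
    for b in 0..=255u8 {
        classes[usize::from(b)] = if b < b'a' {
            0
        } else if b <= b'z' {
            1
        } else {
            2
        };
    }
    classes
}

/// Collects every byte that belongs to the given class, in ascending order.
fn elements(classes: &[u8; 256], class: u8) -> Vec<u8> {
    (0..=255u8)
        .filter(|&b| classes[usize::from(b)] == class)
        .collect()
}

fn main() {
    let classes = lowercase_classes();
    let lowercase = elements(&classes, 1);
    // A single transition on class 1 stands for 26 distinct bytes.
    assert_eq!(lowercase.len(), 26);
    assert_eq!(lowercase, (b'a'..=b'z').collect::<Vec<u8>>());
}
```

A state with one outgoing transition on this class therefore looks cheap when counting transitions, but it is not a good candidate when counting the bytes that must be searched for.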

Example

This shows an example of how to get all of the elements in an equivalence class.

use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};

let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let elements: Vec<Unit> = classes.elements(Unit::u8(1)).collect();
let expected: Vec<Unit> = (b'a'..=b'z').map(Unit::u8).collect();
assert_eq!(expected, elements);

# Ok::<(), Box<dyn std::error::Error>>(())

impl Clone for ByteClasses

fn clone(self: &Self) -> ByteClasses

impl Copy for ByteClasses

impl Debug for ByteClasses

fn fmt(self: &Self, f: &mut Formatter<'_>) -> Result

impl Default for ByteClasses

fn default() -> ByteClasses

impl Freeze for ByteClasses

impl RefUnwindSafe for ByteClasses

impl Send for ByteClasses

impl Sync for ByteClasses

impl Unpin for ByteClasses

impl UnsafeUnpin for ByteClasses

impl UnwindSafe for ByteClasses

impl<T> Any for ByteClasses

fn type_id(self: &Self) -> TypeId

impl<T> Borrow for ByteClasses

fn borrow(self: &Self) -> &T

impl<T> BorrowMut for ByteClasses

fn borrow_mut(self: &mut Self) -> &mut T

impl<T> CloneToUninit for ByteClasses

unsafe fn clone_to_uninit(self: &Self, dest: *mut u8)

impl<T> From for ByteClasses

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> ToOwned for ByteClasses

fn to_owned(self: &Self) -> T
fn clone_into(self: &Self, target: &mut T)

impl<T, U> Into for ByteClasses

fn into(self: Self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of [From]<T> for U chooses to do.

impl<T, U> TryFrom for ByteClasses

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto for ByteClasses

fn try_into(self: Self) -> Result<U, <U as TryFrom<T>>::Error>