Enum regex_automata::SparseDFA
source · pub enum SparseDFA<T: AsRef<[u8]>, S: StateID = usize> {
Standard(Standard<T, S>),
ByteClass(ByteClass<T, S>),
// some variants omitted
}
Expand description
A sparse table-based deterministic finite automaton (DFA).
In contrast to a dense DFA, a sparse DFA uses a more space efficient representation for its transition table. Consequently, sparse DFAs can use much less memory than dense DFAs, but this comes at a price. In particular, reading the more space efficient transitions takes more work, and consequently, searching using a sparse DFA is typically slower than a dense DFA.
A sparse DFA can be built using the default configuration via the
SparseDFA::new
constructor. Otherwise,
one can configure various aspects of a dense DFA via
dense::Builder
, and then convert a dense
DFA to a sparse DFA using
DenseDFA::to_sparse
.
In general, a sparse DFA supports all the same operations as a dense DFA.
Making the choice between a dense and sparse DFA depends on your specific work load. If you can sacrifice a bit of search time performance, then a sparse DFA might be the best choice. In particular, while sparse DFAs are probably always slower than dense DFAs, you may find that they are easily fast enough for your purposes!
State size
A SparseDFA
has two type parameters, T
and S
. T
corresponds to
the type of the DFA’s transition table while S
corresponds to the
representation used for the DFA’s state identifiers as described by the
StateID
trait. This type parameter is typically
usize
, but other valid choices provided by this crate include u8
,
u16
, u32
and u64
. The primary reason for choosing a different state
identifier representation than the default is to reduce the amount of
memory used by a DFA. Note though, that if the chosen representation cannot
accommodate the size of your DFA, then building the DFA will fail and
return an error.
While the reduction in heap memory used by a DFA is one reason for choosing
a smaller state identifier representation, another possible reason is for
decreasing the serialization size of a DFA, as returned by
to_bytes_little_endian
,
to_bytes_big_endian
or
to_bytes_native_endian
.
The type of the transition table is typically either Vec<u8>
or &[u8]
,
depending on where the transition table is stored. Note that this is
different than a dense DFA, whose transition table is typically
Vec<S>
or &[S]
. The reason for this is that a sparse DFA always reads
its transition table from raw bytes because the table is compactly packed.
Variants
This DFA is defined as a non-exhaustive enumeration of different types of
dense DFAs. All of the variants use the same internal representation
for the transition table, but they vary in how the transition table is
read. A DFA’s specific variant depends on the configuration options set via
dense::Builder
. The default variant is
ByteClass
.
The DFA
trait
This type implements the DFA
trait, which means it
can be used for searching. For example:
use regex_automata::{DFA, SparseDFA};
let dfa = SparseDFA::new("foo[0-9]+")?;
assert_eq!(Some(8), dfa.find(b"foo12345"));
The DFA
trait also provides an assortment of other lower level methods
for DFAs, such as start_state
and next_state
. While these are correctly
implemented, it is an anti-pattern to use them in performance sensitive
code on the SparseDFA
type directly. Namely, each implementation requires
a branch to determine which type of sparse DFA is being used. Instead,
this branch should be pushed up a layer in the code since walking the
transitions of a DFA is usually a hot path. If you do need to use these
lower level methods in performance critical code, then you should match on
the variants of this DFA and use each variant’s implementation of the DFA
trait directly.
Variants§
Standard(Standard<T, S>)
A standard DFA that does not use byte classes.
ByteClass(ByteClass<T, S>)
A DFA that shrinks its alphabet to a set of equivalence classes instead of using all possible byte values. Any two bytes belong to the same equivalence class if and only if they can be used interchangeably anywhere in the DFA while never discriminating between a match and a non-match.
Unlike dense DFAs, sparse DFAs do not tend to benefit nearly as much from using byte classes. In some cases, using byte classes can even marginally increase the size of a sparse DFA’s transition table. The reason for this is that a sparse DFA already compacts each state’s transitions separate from whether byte classes are used.
Implementations§
source§impl SparseDFA<Vec<u8>, usize>
impl SparseDFA<Vec<u8>, usize>
sourcepub fn new(pattern: &str) -> Result<SparseDFA<Vec<u8>, usize>, Error>
pub fn new(pattern: &str) -> Result<SparseDFA<Vec<u8>, usize>, Error>
Parse the given regular expression using a default configuration and return the corresponding sparse DFA.
The default configuration uses usize
for state IDs and reduces the
alphabet size by splitting bytes into equivalence classes. The
resulting DFA is not minimized.
If you want a non-default configuration, then use the
dense::Builder
to set your own configuration, and then call
DenseDFA::to_sparse
to create a sparse DFA.
Example
use regex_automata::{DFA, SparseDFA};
let dfa = SparseDFA::new("foo[0-9]+bar")?;
assert_eq!(Some(11), dfa.find(b"foo12345bar"));
source§impl<S: StateID> SparseDFA<Vec<u8>, S>
impl<S: StateID> SparseDFA<Vec<u8>, S>
sourcepub fn empty() -> SparseDFA<Vec<u8>, S>
pub fn empty() -> SparseDFA<Vec<u8>, S>
Create a new empty sparse DFA that never matches any input.
Example
In order to build an empty DFA, callers must provide a type hint indicating their choice of state identifier representation.
use regex_automata::{DFA, SparseDFA};
let dfa: SparseDFA<Vec<u8>, usize> = SparseDFA::empty();
assert_eq!(None, dfa.find(b""));
assert_eq!(None, dfa.find(b"foo"));
source§impl<T: AsRef<[u8]>, S: StateID> SparseDFA<T, S>
impl<T: AsRef<[u8]>, S: StateID> SparseDFA<T, S>
sourcepub fn as_ref<'a>(&'a self) -> SparseDFA<&'a [u8], S>
pub fn as_ref<'a>(&'a self) -> SparseDFA<&'a [u8], S>
Cheaply return a borrowed version of this sparse DFA. Specifically, the
DFA returned always uses &[u8]
for its transition table while keeping
the same state identifier representation.
sourcepub fn to_owned(&self) -> SparseDFA<Vec<u8>, S>
pub fn to_owned(&self) -> SparseDFA<Vec<u8>, S>
Return an owned version of this sparse DFA. Specifically, the DFA
returned always uses Vec<u8>
for its transition table while keeping
the same state identifier representation.
Effectively, this returns a sparse DFA whose transition table lives on the heap.
sourcepub fn memory_usage(&self) -> usize
pub fn memory_usage(&self) -> usize
Returns the memory usage, in bytes, of this DFA.
The memory usage is computed based on the number of bytes used to represent this DFA’s transition table. This typically corresponds to heap memory usage.
This does not include the stack size used up by this DFA. To
compute that, used std::mem::size_of::<SparseDFA>()
.
source§impl<T: AsRef<[u8]>, S: StateID> SparseDFA<T, S>
impl<T: AsRef<[u8]>, S: StateID> SparseDFA<T, S>
Routines for converting a sparse DFA to other representations, such as smaller state identifiers or raw bytes suitable for persistent storage.
sourcepub fn to_u8(&self) -> Result<SparseDFA<Vec<u8>, u8>, Error>
pub fn to_u8(&self) -> Result<SparseDFA<Vec<u8>, u8>, Error>
Create a new sparse DFA whose match semantics are equivalent to
this DFA, but attempt to use u8
for the representation of state
identifiers. If u8
is insufficient to represent all state identifiers
in this DFA, then this returns an error.
This is a convenience routine for to_sized::<u8>()
.
sourcepub fn to_u16(&self) -> Result<SparseDFA<Vec<u8>, u16>, Error>
pub fn to_u16(&self) -> Result<SparseDFA<Vec<u8>, u16>, Error>
Create a new sparse DFA whose match semantics are equivalent to
this DFA, but attempt to use u16
for the representation of state
identifiers. If u16
is insufficient to represent all state
identifiers in this DFA, then this returns an error.
This is a convenience routine for to_sized::<u16>()
.
sourcepub fn to_u32(&self) -> Result<SparseDFA<Vec<u8>, u32>, Error>
pub fn to_u32(&self) -> Result<SparseDFA<Vec<u8>, u32>, Error>
Create a new sparse DFA whose match semantics are equivalent to
this DFA, but attempt to use u32
for the representation of state
identifiers. If u32
is insufficient to represent all state
identifiers in this DFA, then this returns an error.
This is a convenience routine for to_sized::<u32>()
.
sourcepub fn to_u64(&self) -> Result<SparseDFA<Vec<u8>, u64>, Error>
pub fn to_u64(&self) -> Result<SparseDFA<Vec<u8>, u64>, Error>
Create a new sparse DFA whose match semantics are equivalent to
this DFA, but attempt to use u64
for the representation of state
identifiers. If u64
is insufficient to represent all state
identifiers in this DFA, then this returns an error.
This is a convenience routine for to_sized::<u64>()
.
sourcepub fn to_sized<A: StateID>(&self) -> Result<SparseDFA<Vec<u8>, A>, Error>
pub fn to_sized<A: StateID>(&self) -> Result<SparseDFA<Vec<u8>, A>, Error>
Create a new sparse DFA whose match semantics are equivalent to
this DFA, but attempt to use A
for the representation of state
identifiers. If A
is insufficient to represent all state identifiers
in this DFA, then this returns an error.
An alternative way to construct such a DFA is to use
DenseDFA::to_sparse_sized
.
In general, picking the appropriate size upon initial construction of
a sparse DFA is preferred, since it will do the conversion in one
step instead of two.
sourcepub fn to_bytes_little_endian(&self) -> Result<Vec<u8>, Error>
pub fn to_bytes_little_endian(&self) -> Result<Vec<u8>, Error>
Serialize a sparse DFA to raw bytes in little endian format.
If the state identifier representation of this DFA has a size different
than 1, 2, 4 or 8 bytes, then this returns an error. All
implementations of StateID
provided by this crate satisfy this
requirement.
sourcepub fn to_bytes_big_endian(&self) -> Result<Vec<u8>, Error>
pub fn to_bytes_big_endian(&self) -> Result<Vec<u8>, Error>
Serialize a sparse DFA to raw bytes in big endian format.
If the state identifier representation of this DFA has a size different
than 1, 2, 4 or 8 bytes, then this returns an error. All
implementations of StateID
provided by this crate satisfy this
requirement.
sourcepub fn to_bytes_native_endian(&self) -> Result<Vec<u8>, Error>
pub fn to_bytes_native_endian(&self) -> Result<Vec<u8>, Error>
Serialize a sparse DFA to raw bytes in native endian format.
Generally, it is better to pick an explicit endianness using either
to_bytes_little_endian
or to_bytes_big_endian
. This routine is
useful in tests where the DFA is serialized and deserialized on the
same platform.
If the state identifier representation of this DFA has a size different
than 1, 2, 4 or 8 bytes, then this returns an error. All
implementations of StateID
provided by this crate satisfy this
requirement.
source§impl<'a, S: StateID> SparseDFA<&'a [u8], S>
impl<'a, S: StateID> SparseDFA<&'a [u8], S>
sourcepub unsafe fn from_bytes(buf: &'a [u8]) -> SparseDFA<&'a [u8], S>
pub unsafe fn from_bytes(buf: &'a [u8]) -> SparseDFA<&'a [u8], S>
Deserialize a sparse DFA with a specific state identifier representation.
Deserializing a DFA using this routine will never allocate heap memory. This is also guaranteed to be a constant time operation that does not vary with the size of the DFA.
The bytes given should be generated by the serialization of a DFA with
either the
to_bytes_little_endian
method or the
to_bytes_big_endian
endian, depending on the endianness of the machine you are
deserializing this DFA from.
If the state identifier representation is usize
, then deserialization
is dependent on the pointer size. For this reason, it is best to
serialize DFAs using a fixed size representation for your state
identifiers, such as u8
, u16
, u32
or u64
.
Panics
The bytes given should be trusted. In particular, if the bytes are not a valid serialization of a DFA, or if the endianness of the serialized bytes is different than the endianness of the machine that is deserializing the DFA, then this routine will panic. Moreover, it is possible for this deserialization routine to succeed even if the given bytes do not represent a valid serialized sparse DFA.
Safety
This routine is unsafe because it permits callers to provide an arbitrary transition table with possibly incorrect transitions. While the various serialization routines will never return an incorrect transition table, there is no guarantee that the bytes provided here are correct. While deserialization does many checks (as documented above in the panic conditions), this routine does not check that the transition table is correct. Given an incorrect transition table, it is possible for the search routines to access out-of-bounds memory because of explicit bounds check elision.
Example
This example shows how to serialize a DFA to raw bytes, deserialize it
and then use it for searching. Note that we first convert the DFA to
using u16
for its state identifier representation before serializing
it. While this isn’t strictly necessary, it’s good practice in order to
decrease the size of the DFA and to avoid platform specific pitfalls
such as differing pointer sizes.
use regex_automata::{DFA, DenseDFA, SparseDFA};
let sparse = SparseDFA::new("foo[0-9]+")?;
let bytes = sparse.to_u16()?.to_bytes_native_endian()?;
let dfa: SparseDFA<&[u8], u16> = unsafe {
SparseDFA::from_bytes(&bytes)
};
assert_eq!(Some(8), dfa.find(b"foo12345"));
Trait Implementations§
source§impl<T: AsRef<[u8]>, S: StateID> DFA for SparseDFA<T, S>
impl<T: AsRef<[u8]>, S: StateID> DFA for SparseDFA<T, S>
source§fn start_state(&self) -> S
fn start_state(&self) -> S
source§fn is_match_state(&self, id: S) -> bool
fn is_match_state(&self, id: S) -> bool
source§fn is_dead_state(&self, id: S) -> bool
fn is_dead_state(&self, id: S) -> bool
source§fn is_match_or_dead_state(&self, id: S) -> bool
fn is_match_or_dead_state(&self, id: S) -> bool
is_match_state(id)
or is_dead_state(id)
must return true. Read moresource§fn is_anchored(&self) -> bool
fn is_anchored(&self) -> bool
source§fn next_state(&self, current: S, input: u8) -> S
fn next_state(&self, current: S, input: u8) -> S
source§unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S
unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S
next_state
, but its implementation may look up the next state
without memory safety checks such as bounds checks. As such, callers
must ensure that the given identifier corresponds to a valid DFA
state. Implementors must, in turn, ensure that this routine is safe
for all valid state identifiers and for all possible u8
values.source§fn is_match_at(&self, bytes: &[u8], start: usize) -> bool
fn is_match_at(&self, bytes: &[u8], start: usize) -> bool
is_match
, but starts the search at the given
offset. Read moresource§fn shortest_match_at(&self, bytes: &[u8], start: usize) -> Option<usize>
fn shortest_match_at(&self, bytes: &[u8], start: usize) -> Option<usize>
shortest_match
, but starts the search at the
given offset. Read moresource§fn find_at(&self, bytes: &[u8], start: usize) -> Option<usize>
fn find_at(&self, bytes: &[u8], start: usize) -> Option<usize>
find
, but starts the search at the given
offset. Read moresource§fn rfind_at(&self, bytes: &[u8], start: usize) -> Option<usize>
fn rfind_at(&self, bytes: &[u8], start: usize) -> Option<usize>
rfind
, but starts the search at the given
offset. Read moresource§fn is_match(&self, bytes: &[u8]) -> bool
fn is_match(&self, bytes: &[u8]) -> bool
source§fn shortest_match(&self, bytes: &[u8]) -> Option<usize>
fn shortest_match(&self, bytes: &[u8]) -> Option<usize>
source§fn find(&self, bytes: &[u8]) -> Option<usize>
fn find(&self, bytes: &[u8]) -> Option<usize>
None
is returned. Read moresource§fn rfind(&self, bytes: &[u8]) -> Option<usize>
fn rfind(&self, bytes: &[u8]) -> Option<usize>
None
is returned. In other words, this has the same
match semantics as find
, but in reverse. Read more