Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs, but now heavily improved.
§Quick start
Add the dependency to your Cargo.toml file:
[dependencies]
simdutf8 = { version = "0.1.3" }
or on ARM64 with Rust Nightly:
[dependencies]
simdutf8 = { version = "0.1.3", features = ["aarch64_neon"] }
Use basic::from_utf8() as a drop-in replacement for std::str::from_utf8().
use simdutf8::basic::from_utf8;
println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());
If you need detailed information on validation failures, use compat::from_utf8() instead.
use simdutf8::compat::from_utf8;
let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));
§APIs
§Basic flavor
Use the basic API flavor for maximum speed. It is fastest on valid UTF-8, but only checks for errors after processing the whole byte sequence and does not provide detailed information if the data is not valid UTF-8. basic::Utf8Error is a zero-sized error struct.
§Compat flavor
The compat flavor is fully API-compatible with std::str::from_utf8(). In particular, compat::from_utf8() returns a compat::Utf8Error, which has valid_up_to() and error_len() methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.
It also fails early: errors are checked on the fly as the string is processed and once an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data. This comes at a slight performance penalty compared to the basic API even if the input is valid UTF-8.
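The two uses of valid_up_to() and error_len() mentioned above can be sketched as follows. This is a minimal illustration, not code from the crate: decode_lossy and validate_chunks are hypothetical helper names, and the sketch calls std::str::from_utf8 so that it builds without dependencies. Since the compat flavor is described as API-compatible, swapping the import for simdutf8::compat::from_utf8 should be the only change needed, but that is an assumption rather than something tested here.

```rust
// Stand-in for `simdutf8::compat::from_utf8` (assumed drop-in, see above).
use std::str::from_utf8;

/// Hypothetical helper: decodes `input`, replacing each invalid byte
/// sequence with U+FFFD, driven by `valid_up_to()` and `error_len()`.
fn decode_lossy(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = input.split_at(err.valid_up_to());
                // The prefix was reported as valid, so this cannot fail.
                out.push_str(from_utf8(valid).expect("prefix is valid UTF-8"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and continue after it.
                    Some(len) => input = &rest[len..],
                    // The input ended in the middle of a character.
                    None => return out,
                }
            }
        }
    }
}

/// Hypothetical helper: validates a stream delivered in chunks.
/// On failure, returns the stream offset of the first invalid byte.
fn validate_chunks<'a>(chunks: impl IntoIterator<Item = &'a [u8]>) -> Result<(), usize> {
    let mut carry: Vec<u8> = Vec::new(); // incomplete character from the previous chunk
    let mut consumed = 0usize; // stream bytes already known to be valid
    for chunk in chunks {
        carry.extend_from_slice(chunk);
        // Drop the borrowed &str so `carry` can be modified below.
        let result = from_utf8(&carry).map(|_| ());
        match result {
            Ok(()) => {
                consumed += carry.len();
                carry.clear();
            }
            Err(err) => {
                let valid = err.valid_up_to();
                if err.error_len().is_some() {
                    // A definite error, not merely a character cut off by the chunk boundary.
                    return Err(consumed + valid);
                }
                // Keep only the incomplete tail for the next chunk.
                consumed += valid;
                carry.drain(..valid);
            }
        }
    }
    // Leftover bytes mean the stream ended in the middle of a character.
    if carry.is_empty() { Ok(()) } else { Err(consumed) }
}
```

The chunked variant copies each chunk into a buffer for simplicity; a production version would likely validate the chunk in place and buffer only the few trailing bytes of an incomplete character.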
§Implementation selection
§X86
The fastest implementation is selected at runtime using the std::is_x86_feature_detected! macro, unless the CPU targeted by the compiler supports the fastest available implementation. So if you compile with RUSTFLAGS="-C target-cpu=native" on a recent x86-64 machine, the AVX 2 implementation is selected at compile-time and runtime selection is disabled.
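To illustrate the mechanism, here is a minimal sketch of runtime selection with std::is_x86_feature_detected!. It is not the crate's actual dispatch code: select_implementation is a hypothetical function that only reports which level would be chosen, and the "fallback" label stands for whatever non-SIMD path is used otherwise.

```rust
/// Hypothetical helper: names the implementation level this sketch
/// would pick on the current machine.
fn select_implementation() -> &'static str {
    // If the compiler already targets AVX 2, no runtime check is needed.
    if cfg!(target_feature = "avx2") {
        return "avx2 (compile time)";
    }
    // Runtime detection is only available on x86 and x86-64 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::is_x86_feature_detected!("sse4.2") {
            return "sse4.2";
        }
    }
    "fallback"
}
```

The result depends on the machine and on the compiler flags, so the same binary can report different levels on different CPUs.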
For no-std support (compiled with --no-default-features) the implementation is always selected at compile time based on the targeted CPU. Use RUSTFLAGS="-C target-feature=+avx2" for the AVX 2 implementation or RUSTFLAGS="-C target-feature=+sse4.2" for the SSE 4.2 implementation.
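Compile-time selection of this kind can be expressed with the cfg! macro and the target_feature predicate, as in the sketch below. The function name compile_time_implementation is hypothetical and the crate's real selection logic may differ; the sketch only shows how the RUSTFLAGS settings above become visible to the code.

```rust
/// Hypothetical helper: reports which implementation a purely
/// compile-time selection would use for the current build.
const fn compile_time_implementation() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "sse4.2") {
        "sse4.2"
    } else {
        // Neither feature was enabled for this build.
        "fallback"
    }
}
```

Because cfg! is evaluated during compilation, the branches that do not apply are constant-folded away and no CPU detection runs in the finished binary.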
§ARM64
For ARM64 support, Nightly Rust is needed and the crate feature aarch64_neon needs to be enabled. Caution: if this feature is not turned on, the non-SIMD std library implementation is used.
§Access to low-level functionality
If you want to be able to call a SIMD implementation directly, use the public_imp feature flag. The validation implementations are then accessible via basic::imp and compat::imp. Traits facilitating streaming validation are available there as well.
§Optimisation flags
Do not use opt-level = "z", which prevents inlining and makes the code quite slow.
§Minimum Supported Rust Version (MSRV)
This crate’s minimum supported Rust version is 1.38.0.
§Algorithm
See Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021 https://arxiv.org/abs/2010.03090
Modules§
- The basic API flavor provides barebones UTF-8 checking at the highest speed.
- The compat API flavor provides full compatibility with std::str::from_utf8() and detailed validation errors.