Expand description
Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs, but now heavily improved.
§Quick start
Add the dependency to your Cargo.toml file:
[dependencies]
simdutf8 = "0.1.5"
Use basic::from_utf8()
as a drop-in replacement for std::str::from_utf8()
.
use simdutf8::basic::from_utf8;
println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());
If you need detailed information on validation failures, use compat::from_utf8()
instead.
use simdutf8::compat::from_utf8;
let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));
§APIs
§Basic flavor
Use the basic
API flavor for maximum speed. It is fastest on valid UTF-8, but only checks
for errors after processing the whole byte sequence and does not provide detailed information if the data
is not valid UTF-8. basic::Utf8Error
is a zero-sized error struct.
§Compat flavor
The compat
flavor is fully API-compatible with std::str::from_utf8()
. In particular, compat::from_utf8()
returns a compat::Utf8Error
, which has valid_up_to()
and
error_len()
methods. The first is useful for verification of streamed data. The
second is useful e.g. for replacing invalid byte sequences with a replacement character.
It also fails early: errors are checked on the fly as the string is processed and once
an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data.
This comes at a slight performance penalty compared to the basic
API even if the input is valid UTF-8.
§Implementation selection
§X86
The fastest implementation is selected at runtime using the std::is_x86_feature_detected!
macro, unless the CPU
targeted by the compiler supports the fastest available implementation.
So if you compile with RUSTFLAGS="-C target-cpu=native"
on a recent x86-64 machine, the AVX 2 implementation is selected at
compile-time and runtime selection is disabled.
For no-std support (compiled with --no-default-features
) the implementation is always selected at compile time based on
the targeted CPU. Use RUSTFLAGS="-C target-feature=+avx2"
for the AVX 2 implementation or RUSTFLAGS="-C target-feature=+sse4.2"
for the SSE 4.2 implementation.
§ARM64
The SIMD implementation is used automatically since Rust 1.61.
§WASM32
For wasm32 support, the implementation is selected at compile time based on the presence of the simd128
target feature.
Use RUSTFLAGS="-C target-feature=+simd128"
to enable the WASM SIMD implementation. WASM, at
the time of this writing, doesn’t have a way to detect SIMD through WASM itself. Although this capability
is available in various WASM host environments (e.g., wasm-feature-detect in the web browser), there is no portable
way from within the library to detect this.
§Access to low-level functionality
If you want to be able to call a SIMD implementation directly, use the public_imp
feature flag. The validation
implementations are then accessible via basic::imp
and compat::imp
. Traits facilitating streaming validation are available
there as well.
§Optimisation flags
Do not use opt-level = "z"
, which prevents inlining and makes
the code quite slow.
§Minimum Supported Rust Version (MSRV)
This crate’s minimum supported Rust version is 1.38.0.
§Algorithm
See Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021 https://arxiv.org/abs/2010.03090
Modules§
- The
basic
API flavor provides barebones UTF-8 checking at the highest speed. - The
compat
API flavor provides full compatibility withstd::str::from_utf8()
and detailed validation errors.