pub fn detect_encoding(bytes: &[u8]) -> Option<(&'static Encoding, usize)>
Available on crate feature `encoding` only.
Automatic encoding detection of XML files using the recommended algorithm.
If an encoding is detected, `Some` is returned with the encoding and the size of the BOM in bytes. The size is non-zero if detection was performed using a BOM, and zero if detection was performed without one.
If the encoding is not recognized, `None` is returned.
Because the `encoding_rs` crate supports only a subset of the encodings covered by the algorithm, only that supported subset is detected: UTF-8, UTF-16 BE and UTF-16 LE.
The algorithm suggests examining up to the first 4 bytes to determine the encoding, according to the following table:
Bytes | Detected encoding |
---|---|
BOM | |
FE FF ## ## | UTF-16, big-endian |
FF FE ## ## | UTF-16, little-endian |
EF BB BF | UTF-8 |
No BOM | |
00 3C 00 3F | UTF-16 BE or ISO-10646-UCS-2 BE or similar 16-bit BE (use declared encoding to find the exact one) |
3C 00 3F 00 | UTF-16 LE or ISO-10646-UCS-2 LE or similar 16-bit LE (use declared encoding to find the exact one) |
3C 3F 78 6D | UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably |
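
The table can be read as a prefix match on the first few bytes. The following is a minimal, self-contained sketch of that lookup, not the crate's actual implementation. `Detected`, `sketch_detect` and `skip_bom` are hypothetical names introduced for illustration; `Detected` stands in for `&'static Encoding`, and the sketch assumes that every no-BOM row collapses to one of the three supported encodings and that the `##` bytes are not inspected.

```rust
/// Hypothetical stand-in for `&'static Encoding`; not a type from `encoding_rs`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Detected {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Sketch of the table above: returns the guessed encoding together with the
/// BOM length in bytes, or zero when the guess came from the `<?` pattern.
fn sketch_detect(bytes: &[u8]) -> Option<(Detected, usize)> {
    match bytes {
        // Rows under "BOM". The `##` bytes are ignored here (an assumption).
        [0xFE, 0xFF, ..] => Some((Detected::Utf16Be, 2)),
        [0xFF, 0xFE, ..] => Some((Detected::Utf16Le, 2)),
        [0xEF, 0xBB, 0xBF, ..] => Some((Detected::Utf8, 3)),
        // Rows under "No BOM": the bytes of `<?` or `<?xm` in each encoding.
        [0x00, 0x3C, 0x00, 0x3F, ..] => Some((Detected::Utf16Be, 0)),
        [0x3C, 0x00, 0x3F, 0x00, ..] => Some((Detected::Utf16Le, 0)),
        [0x3C, 0x3F, 0x78, 0x6D, ..] => Some((Detected::Utf8, 0)),
        // Anything else is not recognized.
        _ => None,
    }
}

/// Uses the returned BOM size to skip the BOM before further processing.
fn skip_bom(bytes: &[u8]) -> &[u8] {
    match sketch_detect(bytes) {
        Some((_, bom_len)) => &bytes[bom_len..],
        None => bytes,
    }
}
```

Returning the BOM size alongside the encoding lets the caller slice the BOM off without re-inspecting the input, which is what `skip_bom` illustrates. With the real function, the same pattern would presumably apply to the `usize` in the returned tuple.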