pub struct Encoding { /* private fields */ }
An encoding as defined in the Encoding Standard.
An encoding defines a mapping from a u8 sequence to a char sequence and, in most cases, vice versa. Each encoding has a name, an output encoding, and one or more labels.
Labels are ASCII-case-insensitive strings that are used to identify an encoding in formats and protocols. The name of the encoding is the preferred label in the case appropriate for returning from the characterSet property of the Document DOM interface.
The output encoding is the encoding used for form submission and URL parsing on Web pages in the encoding. This is UTF-8 for the replacement, UTF-16LE and UTF-16BE encodings and the encoding itself for other encodings.
§Streaming vs. Non-Streaming
When you have the entire input in a single buffer, you can use the methods decode(), decode_with_bom_removal(), decode_without_bom_handling(), decode_without_bom_handling_and_without_replacement() and encode(). (These methods are available to Rust callers only and are not available in the C API.) Unlike the rest of the API available to Rust, these methods perform heap allocations. You should use the Decoder and Encoder objects when your input is split into multiple buffers or when you want to control the allocation of the output buffers.
§Instances
All instances of Encoding are statically allocated and have the 'static lifetime. There is precisely one unique Encoding instance for each encoding defined in the Encoding Standard.
To obtain a reference to a particular encoding whose identity you know at compile time, use a static that refers to the encoding. There is a static for each encoding. The statics are named in all caps with hyphens replaced with underscores (and in C/C++ have _ENCODING appended to the name). For example, if you know at compile time that you will want to decode using the UTF-8 encoding, use the UTF_8 static (UTF_8_ENCODING in C/C++).
Additionally, there are non-reference-typed forms ending with _INIT to work around the problem that statics of the type &'static Encoding cannot be used to initialize items of an array whose type is [&'static Encoding; N].
If you don’t know what encoding you need at compile time and need to dynamically get an encoding by label, use Encoding::for_label(label).
Instances of Encoding can be compared with == (in both Rust and in C/C++).
Implementations§
impl Encoding
pub fn for_label(label: &[u8]) -> Option<&'static Encoding>
Implements the get an encoding algorithm.
If, after ASCII-lowercasing and removing leading and trailing whitespace, the argument matches a label defined in the Encoding Standard, Some(&'static Encoding) representing the corresponding encoding is returned. If there is no match, None is returned.
This is the right method to use if the action upon the method returning None is to use a fallback encoding (e.g. WINDOWS_1252) instead. When the action upon the method returning None is not to proceed with a fallback but to refuse processing, for_label_no_replacement() is more appropriate.
The argument is of type &[u8] instead of &str to save callers that are extracting the label from a non-UTF-8 protocol the trouble of conversion to UTF-8. (If you have a &str, just call .as_bytes() on it.)
Available via the C wrapper.
pub fn for_label_no_replacement(label: &[u8]) -> Option<&'static Encoding>
This method behaves the same as for_label(), except when for_label() would return Some(REPLACEMENT), this method returns None instead.
This method is useful in scenarios where a fatal error is required upon invalid label, because in those cases the caller typically wishes to treat the labels that map to the replacement encoding as fatal errors, too.
It is not OK to use this method when the action upon the method returning None is to use a fallback encoding (e.g. WINDOWS_1252). In such a case, the for_label() method should be used instead in order to avoid unsafe fallback for labels that for_label() maps to Some(REPLACEMENT).
Available via the C wrapper.
pub fn for_bom(buffer: &[u8]) -> Option<(&'static Encoding, usize)>
Performs non-incremental BOM sniffing.
The argument must either be a buffer representing the entire input stream (non-streaming case) or a buffer representing at least the first three bytes of the input stream (streaming case).
Returns Some((UTF_8, 3)), Some((UTF_16LE, 2)) or Some((UTF_16BE, 2)) if the argument starts with the UTF-8, UTF-16LE or UTF-16BE BOM, or None otherwise.
Available via the C wrapper.
pub fn name(&'static self) -> &'static str
Returns the name of this encoding.
This name is appropriate to return as-is from the DOM document.characterSet property.
Available via the C wrapper.
pub fn can_encode_everything(&'static self) -> bool
Checks whether the output encoding of this encoding can encode every char. (Only true if the output encoding is UTF-8.)
Available via the C wrapper.
pub fn is_ascii_compatible(&'static self) -> bool
Checks whether the bytes 0x00…0x7F map exclusively to the characters U+0000…U+007F and vice versa.
Available via the C wrapper.
pub fn is_single_byte(&'static self) -> bool
Checks whether this encoding maps one byte to one Basic Multilingual Plane code point (i.e. byte length equals decoded UTF-16 length) and vice versa (for mappable characters).
Returns true iff this encoding is on the list of Legacy single-byte encodings in the spec or is x-user-defined.
Available via the C wrapper.
pub fn output_encoding(&'static self) -> &'static Encoding
Returns the output encoding of this encoding. This is UTF-8 for UTF-16BE, UTF-16LE and replacement and the encoding itself otherwise.
Available via the C wrapper.
pub fn decode<'a>( &'static self, bytes: &'a [u8], ) -> (Cow<'a, str>, &'static Encoding, bool)
Decode complete input to Cow<'a, str> with BOM sniffing and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).
This method implements the non-streaming version of the decode spec concept.
The second item in the returned tuple is the encoding that was actually used (which may differ from this encoding thanks to BOM sniffing).
The third item in the returned tuple indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).
Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder() when decoding segmented input.
This method performs one or two heap allocations for the backing buffer of the String when unable to borrow. (One allocation if there are no errors and potentially another one in the presence of errors.) The first allocation assumes jemalloc and may not be optimal with allocators that do not use power-of-two buckets. A borrow is performed if decoding UTF-8 and the input is valid UTF-8, if decoding an ASCII-compatible encoding and the input is ASCII-only, or when decoding ISO-2022-JP and the input is entirely in the ASCII state without state transitions.
§Panics
If the size calculation for a heap-allocated backing buffer overflows usize.
Available to Rust only.
pub fn decode_with_bom_removal<'a>( &'static self, bytes: &'a [u8], ) -> (Cow<'a, str>, bool)
Decode complete input to Cow<'a, str> with BOM removal and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).
When invoked on UTF_8, this method implements the non-streaming version of the UTF-8 decode spec concept.
The second item in the returned pair indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).
Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_with_bom_removal() when decoding segmented input.
This method performs one or two heap allocations for the backing buffer of the String when unable to borrow. (One allocation if there are no errors and potentially another one in the presence of errors.) The first allocation assumes jemalloc and may not be optimal with allocators that do not use power-of-two buckets. A borrow is performed if decoding UTF-8 and the input is valid UTF-8, if decoding an ASCII-compatible encoding and the input is ASCII-only, or when decoding ISO-2022-JP and the input is entirely in the ASCII state without state transitions.
§Panics
If the size calculation for a heap-allocated backing buffer overflows usize.
Available to Rust only.
pub fn decode_without_bom_handling<'a>( &'static self, bytes: &'a [u8], ) -> (Cow<'a, str>, bool)
Decode complete input to Cow<'a, str> without BOM handling and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).
When invoked on UTF_8, this method implements the non-streaming version of the UTF-8 decode without BOM spec concept.
The second item in the returned pair indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).
Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_without_bom_handling() when decoding segmented input.
This method performs one or two heap allocations for the backing buffer of the String when unable to borrow. (One allocation if there are no errors and potentially another one in the presence of errors.) The first allocation assumes jemalloc and may not be optimal with allocators that do not use power-of-two buckets. A borrow is performed if decoding UTF-8 and the input is valid UTF-8, if decoding an ASCII-compatible encoding and the input is ASCII-only, or when decoding ISO-2022-JP and the input is entirely in the ASCII state without state transitions.
§Panics
If the size calculation for a heap-allocated backing buffer overflows usize.
Available to Rust only.
pub fn decode_without_bom_handling_and_without_replacement<'a>( &'static self, bytes: &'a [u8], ) -> Option<Cow<'a, str>>
Decode complete input to Cow<'a, str> without BOM handling and with malformed sequences treated as fatal when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).
When invoked on UTF_8, this method implements the non-streaming version of the UTF-8 decode without BOM or fail spec concept.
Returns None if a malformed sequence was encountered and the result of the decode wrapped in Some otherwise.
Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_without_bom_handling() when decoding segmented input.
This method performs a single heap allocation for the backing buffer of the String when unable to borrow. A borrow is performed if decoding UTF-8 and the input is valid UTF-8, if decoding an ASCII-compatible encoding and the input is ASCII-only, or when decoding ISO-2022-JP and the input is entirely in the ASCII state without state transitions.
§Panics
If the size calculation for a heap-allocated backing buffer overflows usize.
Available to Rust only.
pub fn encode<'a>( &'static self, string: &'a str, ) -> (Cow<'a, [u8]>, &'static Encoding, bool)
Encode complete input to Cow<'a, [u8]> with unmappable characters replaced with decimal numeric character references when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).
This method implements the non-streaming version of the encode spec concept. For the UTF-8 encode spec concept, it is slightly more efficient to use string.as_bytes() instead of invoking this method on UTF_8.
The second item in the returned tuple is the encoding that was actually used (which may differ from this encoding thanks to some encodings having UTF-8 as their output encoding).
The third item in the returned tuple indicates whether there were unmappable characters (that were replaced with HTML numeric character references).
Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_encoder() when encoding segmented output.
When encoding to UTF-8 or when encoding an ASCII-only input to an ASCII-compatible encoding, this method returns a borrow of the input without a heap allocation. Otherwise, this method performs a single heap allocation for the backing buffer of the Vec<u8> if there are no unmappable characters and potentially multiple heap allocations if there are. These allocations are tuned for jemalloc and may not be optimal when using a different allocator that doesn’t use power-of-two buckets.
§Panics
If the size calculation for a heap-allocated backing buffer overflows usize.
Available to Rust only.
pub fn new_decoder(&'static self) -> Decoder
Instantiates a new decoder for this encoding with BOM sniffing enabled.
BOM sniffing may cause the returned decoder to morph into a decoder for UTF-8, UTF-16LE or UTF-16BE instead of this encoding.
Available via the C wrapper.
pub fn new_decoder_with_bom_removal(&'static self) -> Decoder
Instantiates a new decoder for this encoding with BOM removal.
If the input starts with bytes that are the BOM for this encoding, those bytes are removed. However, the decoder never morphs into a decoder for another encoding: A BOM for another encoding is treated as (potentially malformed) input to the decoding algorithm for this encoding.
Available via the C wrapper.
pub fn new_decoder_without_bom_handling(&'static self) -> Decoder
Instantiates a new decoder for this encoding with BOM handling disabled.
If the input starts with bytes that look like a BOM, those bytes are not treated as a BOM. (Hence, the decoder never morphs into a decoder for another encoding.)
Note: If the caller has performed BOM sniffing on its own but has not removed the BOM, the caller should use new_decoder_with_bom_removal() instead of this method to cause the BOM to be removed.
Available via the C wrapper.
pub fn new_encoder(&'static self) -> Encoder
Instantiates a new encoder for the output encoding of this encoding.
Available via the C wrapper.
pub fn utf8_valid_up_to(bytes: &[u8]) -> usize
Validates UTF-8.
Returns the index of the first byte that makes the input malformed as UTF-8 or the length of the slice if the slice is entirely valid.
This is currently faster than the corresponding standard library functionality. If this implementation gets upstreamed to the standard library, this method may be removed in the future.
Available via the C wrapper.
pub fn ascii_valid_up_to(bytes: &[u8]) -> usize
Validates ASCII.
Returns the index of the first byte that makes the input malformed as ASCII or the length of the slice if the slice is entirely valid.
Available via the C wrapper.
pub fn iso_2022_jp_ascii_valid_up_to(bytes: &[u8]) -> usize
Validates ISO-2022-JP ASCII-state data.
Returns the index of the first byte that makes the input not representable in the ASCII state of ISO-2022-JP or the length of the slice if the slice is entirely representable in the ASCII state of ISO-2022-JP.
Available via the C wrapper.