encoding

Module types

Interface to the character encoding.

§Raw incremental interface

Methods whose names start with raw_ constitute the raw incremental interface, the lowest-level API available for encoders and decoders. This interface divides the entire input into four parts:

  • Processed bytes do not affect the future result.
  • Unprocessed bytes may affect the future result and, depending on the future input, can become part of a problematic sequence.
  • The problematic byte is the first byte that causes an error condition.
  • Remaining bytes have been neither processed nor read, so the caller should feed them again.

The following figure illustrates an example of successive raw_feed calls:

1st raw_feed   :2nd raw_feed   :3rd raw_feed
----------+----:---------------:--+--+---------
          |    :               :  |  |
----------+----:---------------:--+--+---------
processed  unprocessed             |  remaining
                              problematic

Since these parts can span multiple input sequences passed to raw_feed, raw_feed returns two offsets (one of them optional) with which the caller can track the problematic sequence. The first offset (the first usize in the tuple) points to the first unprocessed byte, or is zero when the unprocessed bytes started before the current call. (The first unprocessed byte can also be at offset 0, which makes no difference to the caller.) The second offset (the upto field in the CodecError struct), if any, points to the first remaining byte.
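As a rough illustration of how the two offsets partition a single input, the sketch below slices one chunk into its parts. The `Parts` struct and the `split_chunk` helper are hypothetical names, not part of the crate, and the second offset is modeled as a plain `Option<usize>` rather than as a field of `CodecError`. When the first offset is zero because the unprocessed bytes started in an earlier call, the `unprocessed` slice only covers the bytes belonging to the current input.

```rust
/// How one input chunk divides up after a single `raw_feed` call.
/// (Hypothetical helper type, not part of the crate.)
#[derive(Debug, PartialEq)]
struct Parts<'a> {
    /// Bytes that no longer affect the result.
    processed: &'a [u8],
    /// Unprocessed bytes; on an error, these end with the problematic byte.
    unprocessed: &'a [u8],
    /// Bytes the caller has to feed again (empty when there is no error).
    remaining: &'a [u8],
}

/// Slices `input` using the first offset and the optional second offset.
/// Assumes both offsets lie within `input`, as the interface describes.
fn split_chunk<'a>(input: &'a [u8], first: usize, upto: Option<usize>) -> Parts<'a> {
    // Without an error, nothing remains: the unprocessed part runs to the end.
    let end = upto.unwrap_or(input.len());
    let first = first.min(end);
    Parts {
        processed: &input[..first],
        unprocessed: &input[first..end],
        remaining: &input[end..],
    }
}
```

For example, an input `b"a\xFFbc"` reported with a first offset of 1 and a second offset of 2 splits into the processed `a`, the problematic `0xFF`, and the remaining `bc`.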

If the caller needs to recover from the error using the problematic sequence, it has to keep the unprocessed bytes across calls: it starts saving unprocessed bytes when the first offset is less than the input length, appends any new unprocessed bytes while the first offset stays zero, and, once the first offset becomes non-zero, discards the saved bytes and starts saving again if the first offset is less than the input length. The caller then checks for the error condition and can use the saved unprocessed bytes for error recovery. Alternatively, if the caller only wants to replace the problematic sequence with a fixed string (such as U+FFFD), it does not need to save the unprocessed bytes at all and can simply emit the fixed string on an error. It still has to feed the input bytes again, starting at the second offset.
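The simpler replacement strategy can be sketched as follows. This is a minimal illustration, not the crate's implementation: `ToyDecoder` is a hypothetical decoder for a tiny UTF-8 subset, `CodecError` is a simplified stand-in that models only `upto` (as a `usize`), and the output is a plain `String` rather than a string writer trait. Only the shape of `raw_feed` and `raw_finish` follows the description above.

```rust
/// Simplified stand-in for the crate's `CodecError`; only `upto` is modeled.
struct CodecError {
    /// Offset of the first remaining byte within the current input.
    upto: usize,
}

/// Hypothetical toy decoder: ASCII passes through, and the two-byte
/// sequence 0xC2 0x80..=0xBF decodes to U+0080..=U+00BF.
struct ToyDecoder {
    /// True when a lone 0xC2 from an earlier call is still unprocessed.
    pending: bool,
}

impl ToyDecoder {
    fn raw_feed(&mut self, input: &[u8], out: &mut String) -> (usize, Option<CodecError>) {
        let mut i = 0;
        if self.pending {
            if input.is_empty() {
                return (0, None);
            }
            self.pending = false;
            if (0x80..=0xBF).contains(&input[0]) {
                out.push(char::from(input[0]));
                i = 1;
            } else {
                // The saved lead byte is problematic; input[0] is remaining.
                return (0, Some(CodecError { upto: 0 }));
            }
        }
        while i < input.len() {
            let b = input[i];
            if b < 0x80 {
                out.push(char::from(b));
                i += 1;
            } else if b == 0xC2 && i + 1 == input.len() {
                // Lead byte at the end of input: unprocessed, not yet an error.
                self.pending = true;
                return (i, None);
            } else if b == 0xC2 && (0x80..=0xBF).contains(&input[i + 1]) {
                out.push(char::from(input[i + 1]));
                i += 2;
            } else {
                // input[i] is problematic; everything after it is remaining.
                return (i, Some(CodecError { upto: i + 1 }));
            }
        }
        (input.len(), None)
    }

    fn raw_finish(&mut self, _out: &mut String) -> Option<CodecError> {
        // A lead byte still pending at the end of the stream is an error.
        if std::mem::take(&mut self.pending) {
            Some(CodecError { upto: 0 })
        } else {
            None
        }
    }
}

/// Feeds each chunk, emitting U+FFFD on every error and re-feeding the
/// input from the second offset. The first offset is ignored here.
fn decode_replacing(dec: &mut ToyDecoder, chunks: &[&[u8]]) -> String {
    let mut out = String::new();
    for chunk in chunks {
        let mut rest: &[u8] = *chunk;
        loop {
            let (_first_unprocessed, err) = dec.raw_feed(rest, &mut out);
            match err {
                Some(e) => {
                    out.push('\u{FFFD}');
                    rest = &rest[e.upto..];
                }
                None => break,
            }
        }
    }
    if dec.raw_finish(&mut out).is_some() {
        out.push('\u{FFFD}');
    }
    out
}
```

A caller would create `ToyDecoder { pending: false }` and pass the chunks in arrival order. A caller that needs the problematic bytes themselves would additionally maintain a buffer driven by the first offset, as described above.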

Structs§

  • Error information from either encoder or decoder.

Enums§

Traits§

  • Byte writer used by encoders. In most cases this will be an owned vector of u8.
  • Character encoding.
  • Decoder converting a byte sequence into a Unicode string. This is a lower level interface, and normally Encoding::decode should be used instead.
  • Encoder converting a Unicode string into a byte sequence. This is a lower level interface, and normally Encoding::encode should be used instead.
  • String writer used by decoders. In most cases this will be an owned string.

Functions§

  • Determines the encoding by looking for a Byte Order Mark (BOM) and decodes a single string in memory. Returns the result and the encoding used.

Type Aliases§

  • A type of the bare function in DecoderTrap values.
  • A type of the bare function in EncoderTrap values.
  • A trait object using dynamic dispatch which is a sendable reference to the encoding, for code where the encoding is not known at compile-time.