Module mz_persist_types::columnar
Columnar understanding of persisted data
For efficiency/performance, we directly expose the columnar structure of
persist’s internal encoding to users during encoding and decoding. For
ergonomics, we wrap the arrow crate we use to read and write parquet data.
Some of the requirements that led to this design:
- Support a separation of data and schema because Row is not self-describing: e.g. a Datum::Null can be one of many possible column types. A RelationDesc is necessary to describe a Row schema.
- Narrow down arrow::datatypes::DataType (the arrow “logical” types) to a set we want to support in persist.
- Associate a parquet::basic::Encoding with each of those types.
- Do dyn Any downcasting of columns once per part, not once per update.
- Unlike arrow, be precise about whether each column is optional or not.
The primary presentation of this abstraction is a sealed trait Data, which
is implemented for the owned version of each type of data that can be stored
in persist: int64, Option<String>, etc.
Under the hood, it’s necessary to store something like a map of name -> column.
A natural instinct is to make Data object safe, but I couldn’t
figure out a way to make that work without severe limitations. As a result,
the DataType enum is introduced with a 1:1 relationship between variants and
implementations of Data. This allows for easy type erasure and guardrails
when downcasting the types back.
Note: The “Data” strategy is roughly how columnation works and the
“DataType” strategy is roughly how arrow works. Doing both of them gets us
the benefits of both, while the downside is code duplication and cognitive
overhead.
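The pairing of a sealed trait with a 1:1 enum can be sketched roughly as follows. This is a minimal illustration using only the standard library, not this module's actual definitions: the `DataType` variants, the `TYPE` constant, `DynColumn`, and `example_columns` are all assumptions made up for the example, and a plain `Vec<T>` stands in for a real column.

```rust
use std::any::Any;
use std::collections::BTreeMap;

mod sealed {
    /// Private supertrait, so code outside this crate cannot implement `Data`.
    pub trait Sealed {}
    impl Sealed for i64 {}
    impl Sealed for Option<String> {}
}

/// Hypothetical enum with exactly one variant per `Data` implementation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    I64,
    OptString,
}

/// Hypothetical sealed trait; each implementor names its matching variant.
pub trait Data: sealed::Sealed + Sized + 'static {
    const TYPE: DataType;
}

impl Data for i64 {
    const TYPE: DataType = DataType::I64;
}

impl Data for Option<String> {
    const TYPE: DataType = DataType::OptString;
}

/// A type-erased column. The tag records which `Vec<T>` the box holds.
pub struct DynColumn {
    typ: DataType,
    col: Box<dyn Any>,
}

impl DynColumn {
    /// Erase the element type, remembering it as a `DataType` tag.
    pub fn new<T: Data>(col: Vec<T>) -> Self {
        DynColumn { typ: T::TYPE, col: Box::new(col) }
    }

    /// Recover the typed column, checking the tag first as a guardrail.
    pub fn downcast<T: Data>(&self) -> Result<&Vec<T>, String> {
        if self.typ != T::TYPE {
            return Err(format!("expected {:?}, found {:?}", T::TYPE, self.typ));
        }
        self.col
            .downcast_ref::<Vec<T>>()
            .ok_or_else(|| "tag and contents disagree".to_string())
    }
}

/// Build a small name -> column map holding two differently typed columns.
pub fn example_columns() -> BTreeMap<String, DynColumn> {
    let mut cols = BTreeMap::new();
    cols.insert("ts".to_string(), DynColumn::new(vec![1i64, 2, 3]));
    cols.insert(
        "name".to_string(),
        DynColumn::new(vec![Some("a".to_string()), None]),
    );
    cols
}
```

Because the tag is checked before the `dyn Any` downcast, asking for the wrong type yields a readable error naming both types rather than a bare `None`.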
The Data trait has associated types for the exclusive “builder” type for the column and for the shared “reader” type. These also implement some common traits to make relationships between types more structured.
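One way to picture the builder/reader split is the sketch below. It is an assumption-laden illustration rather than this crate's real signatures: the trait names `ColumnPush` and `ColumnGet`, the methods `push_value` and `get_value`, the associated type names `Mut` and `Col`, and the helper `build_column` are all hypothetical, with `Vec<i64>` standing in for an exclusive builder and `Arc<[i64]>` for a cheaply cloneable shared reader.

```rust
use std::sync::Arc;

/// Hypothetical trait for builders that accept values of type `T`.
pub trait ColumnPush<T> {
    fn push_value(&mut self, val: T);
}

/// Hypothetical trait for readers that hand back values of type `T`.
pub trait ColumnGet<T> {
    fn get_value(&self, idx: usize) -> T;
}

/// Hypothetical version of `Data` with its two associated column types.
pub trait Data: Sized {
    /// Exclusive "builder" column: mutated by a single owner while writing.
    type Mut: ColumnPush<Self> + Default;
    /// Shared "reader" column: immutable and cheap to clone.
    type Col: ColumnGet<Self> + From<Self::Mut> + Clone;
}

impl ColumnPush<i64> for Vec<i64> {
    fn push_value(&mut self, val: i64) {
        self.push(val);
    }
}

impl ColumnGet<i64> for Arc<[i64]> {
    fn get_value(&self, idx: usize) -> i64 {
        self[idx]
    }
}

impl Data for i64 {
    type Mut = Vec<i64>;
    type Col = Arc<[i64]>;
}

/// Fill a builder, then freeze it into the shared reader form.
pub fn build_column<T: Data>(vals: impl IntoIterator<Item = T>) -> T::Col {
    let mut builder: T::Mut = Default::default();
    for val in vals {
        builder.push_value(val);
    }
    builder.into()
}
```

The bounds on the associated types are what let `build_column` be written once for every `Data` implementor, which is the kind of structure the common traits are meant to provide.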
Finally, the Schema trait maps an implementor of Codec to the underlying column structure. It also provides a PartEncoder and PartDecoder for amortizing any downcasting that does need to happen.
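The amortization idea can be sketched as a decoder that does all of its `dyn Any` downcasting in its constructor and then only indexes typed columns per update. Everything here is hypothetical and standard-library only: `Part`, `Event`, `EventDecoder`, `example_part`, and `decode_all` are illustrative names, not the actual `Schema`, `PartEncoder`, or `PartDecoder` definitions.

```rust
use std::any::Any;
use std::collections::BTreeMap;

/// Hypothetical type-erased part: column name -> boxed `Vec` of values.
pub type Part = BTreeMap<String, Box<dyn Any>>;

/// Hypothetical record type that is stored as two columns.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Event {
    pub ts: i64,
    pub name: String,
}

/// Hypothetical decoder holding concretely typed column references.
pub struct EventDecoder<'a> {
    ts: &'a Vec<i64>,
    name: &'a Vec<String>,
}

impl<'a> EventDecoder<'a> {
    /// All `dyn Any` downcasting happens here, once per part.
    pub fn new(part: &'a Part) -> Result<Self, String> {
        let ts = part
            .get("ts")
            .and_then(|c| c.downcast_ref::<Vec<i64>>())
            .ok_or("missing or mistyped column: ts")?;
        let name = part
            .get("name")
            .and_then(|c| c.downcast_ref::<Vec<String>>())
            .ok_or("missing or mistyped column: name")?;
        if ts.len() != name.len() {
            return Err("columns have different lengths".to_string());
        }
        Ok(EventDecoder { ts, name })
    }

    /// Number of updates in the part.
    pub fn num_rows(&self) -> usize {
        self.ts.len()
    }

    /// Per-update work is plain indexing, with no downcasts.
    /// Writing into `out` lets the caller reuse the `String` allocation.
    pub fn decode(&self, idx: usize, out: &mut Event) {
        out.ts = self.ts[idx];
        out.name.clear();
        out.name.push_str(&self.name[idx]);
    }
}

/// Build a two-row part with an `i64` column and a `String` column.
pub fn example_part() -> Part {
    let mut part = Part::new();
    part.insert("ts".to_string(), Box::new(vec![1i64, 2]));
    part.insert(
        "name".to_string(),
        Box::new(vec!["a".to_string(), "b".to_string()]),
    );
    part
}

/// Decode every row of a part using a single decoder.
pub fn decode_all(part: &Part) -> Result<Vec<Event>, String> {
    let decoder = EventDecoder::new(part)?;
    let mut out = Vec::with_capacity(decoder.num_rows());
    let mut scratch = Event::default();
    for idx in 0..decoder.num_rows() {
        decoder.decode(idx, &mut scratch);
        out.push(scratch.clone());
    }
    Ok(out)
}
```

A schema mismatch surfaces as one error when the decoder is constructed, so the per-row `decode` path has no failure case to check.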
Modules§
- sealed 🔒
Structs§
- A description of a type understood by persist.
- Opaque binary encoded data.
Enums§
- The in-memory rust type of a column of data.
Traits§
- If necessary, whatever information beyond the type of Self needed to produce a columnar schema for this type.
- A decoder for values of a fixed schema.
- An encoder for values of a fixed schema.
- A type that may be retrieved from a column of [T].
- A type that may be added into a column of [T].
- A type understood by persist.
- A stable encoding for a type that gets durably persisted in an arrow::array::FixedSizeBinaryArray.
- A decoder for values of a fixed schema.
- An encoder for values of a fixed schema.
- A description of the structure of a crate::Codec implementor.
- Description of a type that we encode into Persist.
Functions§
- Helper to convert from codec-encoded data to structured data.
- Helper to convert from structured data to codec-encoded data.
- A helper for writing tests that validate that a piece of data roundtrips through the columnar format.