Conversion from Avro schemas to Materialize RelationDescs.
A few notes for posterity on how this conversion happens are in order.
If the schema is an Avro record, we flatten it to its fields, which become the columns of the relation.
Each individual field is then converted to its SQL equivalent. For most types, this conversion is the obvious one. The only non-trivial case is Avro unions.
Since Avro types are not nullable by default, the typical way normal (i.e., nullable)
SQL fields are represented in Avro is by a union of the underlying type with the
singleton type { Null }; in Avro schema notation, this is ["null", "TheType"].
We shall call union types following this pattern Nullability-Pattern Unions.
We shall call all other union types (e.g. ["MyType1", "MyType2"] or ["null", "MyType1", "MyType2"]) Essential Unions.
Since there is an obvious way to represent Nullability-Pattern Unions, but not Essential Unions, in the SQL type system,
we must handle Essential Unions with a bit of a hack (at least until Materialize supports union or sum types, which may be never).
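The distinction between the two kinds of union can be sketched in a few lines. The block below is a minimal illustration, not the module's actual code: `AvroType`, `UnionKind`, and `classify_union` are hypothetical names, the type enum is heavily simplified, and the sketch assumes the null variant may appear in either position of a two-variant union.

```rust
/// Hypothetical, heavily simplified stand-in for an Avro schema node.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AvroType {
    Null,
    Boolean,
    Int,
}

/// Hypothetical classification result for a union's variants.
#[derive(Debug, PartialEq)]
enum UnionKind<'a> {
    /// Null plus exactly one other type: one nullable SQL column.
    NullabilityPattern(&'a AvroType),
    /// Anything else; carries the non-null variants.
    Essential(Vec<&'a AvroType>),
}

fn classify_union(variants: &[AvroType]) -> UnionKind<'_> {
    // Collect every variant that is not the null type.
    let non_null: Vec<&AvroType> = variants
        .iter()
        .filter(|v| **v != AvroType::Null)
        .collect();
    // If something was filtered out, the union contained null.
    let has_null = non_null.len() < variants.len();
    if has_null && non_null.len() == 1 {
        UnionKind::NullabilityPattern(non_null[0])
    } else {
        UnionKind::Essential(non_null)
    }
}
```

Under these assumptions, `classify_union(&[AvroType::Null, AvroType::Int])` is a Nullability-Pattern Union over `Int`, while `[Int, Boolean]` and `[Null, Int, Boolean]` are both Essential Unions with two non-null variants. The real implementation may treat edge cases, such as a single-variant union, differently.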
When an Essential Union appears as one of the fields of a record, we expand
it to n columns in SQL, where n is the number of non-null variants in the union. These
columns will be given names created by pasting their index at the end of the overall name
of the field. For example, if an Essential Union in a field named "Foo" has schema [int, bool], it will expand to the columns "Foo1": int, "Foo2": bool. There is an implicit constraint upheld by the source pipeline that only one such column will be non-null at a time.
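The naming scheme can be illustrated with a small sketch. Everything here is hypothetical (`AvroType`, `SqlType`, `Column`, `essential_union_columns`), and it assumes that indices start at 1 and count only the non-null variants, in declaration order. Every generated column is marked nullable, since at most one of them holds a value for any given row.

```rust
/// Hypothetical, heavily simplified Avro and SQL type enums.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AvroType {
    Null,
    Boolean,
    Int,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum SqlType {
    Bool,
    Int4,
}

/// Hypothetical column description: name, type, nullability.
#[derive(Debug, PartialEq)]
struct Column {
    name: String,
    ty: SqlType,
    nullable: bool,
}

/// Map a non-null Avro variant to a SQL type; null has no column of its own.
fn to_sql(ty: AvroType) -> Option<SqlType> {
    match ty {
        AvroType::Null => None,
        AvroType::Boolean => Some(SqlType::Bool),
        AvroType::Int => Some(SqlType::Int4),
    }
}

/// Expand an Essential Union in field `field` into one column per non-null variant.
fn essential_union_columns(field: &str, variants: &[AvroType]) -> Vec<Column> {
    variants
        .iter()
        .filter_map(|v| to_sql(*v))
        .enumerate()
        .map(|(i, ty)| Column {
            // Assumption: 1-based index pasted onto the field name.
            name: format!("{}{}", field, i + 1),
            ty,
            nullable: true,
        })
        .collect()
}
```

For a field "Foo" with variants `[Int, Boolean]`, this sketch yields "Foo1" for the int variant and "Foo2" for the bool variant.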
When an Essential Union appears elsewhere than as one of the fields of a record, there is nothing we can do, because we expect to be able to turn it into exactly one SQL type, not a series of them. Thus, in these cases, we just bail. For example, it’s not possible to ingest an array or map whose element type is an Essential Union.
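To make the "exactly one SQL type" restriction concrete, here is a rough sketch of a single-column conversion that bails on a nested Essential Union. The types and the function `single_column` are hypothetical, the mapping of Avro arrays to a SQL list type is an assumption made for illustration, and errors are plain strings rather than whatever error type the real code uses.

```rust
/// Hypothetical, simplified Avro schema node.
#[derive(Debug, Clone, PartialEq)]
enum AvroType {
    Null,
    Boolean,
    Int,
    Array(Box<AvroType>),
    Union(Vec<AvroType>),
}

/// Hypothetical, simplified SQL type.
#[derive(Debug, Clone, PartialEq)]
enum SqlType {
    Bool,
    Int4,
    List(Box<SqlType>),
}

/// Convert a node to exactly one (type, nullable) pair, or fail.
fn single_column(node: &AvroType) -> Result<(SqlType, bool), String> {
    match node {
        // A bare null is not handled by this sketch.
        AvroType::Null => Err("bare null is unsupported in this sketch".to_string()),
        AvroType::Boolean => Ok((SqlType::Bool, false)),
        AvroType::Int => Ok((SqlType::Int4, false)),
        AvroType::Array(elem) => {
            // The element must itself reduce to a single SQL type.
            let (elem_ty, _elem_nullable) = single_column(elem)?;
            Ok((SqlType::List(Box::new(elem_ty)), false))
        }
        AvroType::Union(variants) => {
            let non_null: Vec<&AvroType> = variants
                .iter()
                .filter(|v| **v != AvroType::Null)
                .collect();
            if non_null.len() == 1 && non_null.len() < variants.len() {
                // Nullability-Pattern Union: the inner type, made nullable.
                let (ty, _) = single_column(non_null[0])?;
                Ok((ty, true))
            } else {
                // Essential Union: would need several columns, so bail.
                Err(format!(
                    "essential union with {} non-null variants cannot become one column",
                    non_null.len()
                ))
            }
        }
    }
}
```

In this sketch an array of `["null", "int"]` converts to a list of integers, while an array of `["int", "boolean"]` returns an error.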
Structs§
- AvroSchemaResolver
- GlueSchemaCache - Glue-side analogue of SchemaCache.
- SchemaCache - Cache of writer schemas fetched from a Confluent Schema Registry. Held inside WriterSchemaProvider::Confluent; the type is named pub because that variant’s field is reachable through the pub enum, but it has no pub constructor or methods. Only WriterSchemaProvider::confluent can build one.
Enums§
- WriterSchemaKey - Identifier carried in a wire-format header that points at the writer’s schema. Different schema registries key their writer schemas differently: Confluent uses a sequential i32, AWS Glue uses a UUID. Callers do not have to care which kind of key they’re holding; the resolver routes it back to the matching cache.
- WriterSchemaProvider - Provides writer schemas to an AvroSchemaResolver.
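The key-routing idea described for WriterSchemaKey can be sketched as follows. This is an illustration under stated assumptions, not the crate's API: `SchemaKey`, `Caches`, `lookup`, and `parse_confluent_header` are hypothetical names, schemas are stored as plain strings, and the Glue UUID is held as 16 raw bytes to avoid an external dependency. The one external fact relied on is the Confluent wire format, which prefixes each message with a zero magic byte followed by a 4-byte big-endian schema ID.

```rust
use std::collections::HashMap;

/// Hypothetical key type: one variant per registry flavor.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SchemaKey {
    Confluent(i32),
    Glue([u8; 16]),
}

/// Hypothetical pair of caches, one per registry flavor.
#[derive(Default)]
struct Caches {
    confluent: HashMap<i32, String>,
    glue: HashMap<[u8; 16], String>,
}

impl Caches {
    /// Route the key to the cache that matches its variant.
    fn lookup(&self, key: SchemaKey) -> Option<&str> {
        match key {
            SchemaKey::Confluent(id) => self.confluent.get(&id).map(String::as_str),
            SchemaKey::Glue(uuid) => self.glue.get(&uuid).map(String::as_str),
        }
    }
}

/// Split a Confluent-framed message into its schema key and payload.
fn parse_confluent_header(buf: &[u8]) -> Option<(SchemaKey, &[u8])> {
    // Magic byte 0, then a 4-byte big-endian schema ID.
    if buf.len() < 5 || buf[0] != 0 {
        return None;
    }
    let id = i32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]);
    Some((SchemaKey::Confluent(id), &buf[5..]))
}
```

A caller would parse the header once, then pass the resulting key to `lookup` without needing to know which registry produced it.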
Functions§
- get_named_columns 🔒
- get_union_columns 🔒 - Get the series of (one or more) SQL columns corresponding to an Avro union. See module comments for details.
- parse_schema
- registry_name_from_schema_arn 🔒 - Parse the registry name out of a Glue SchemaArn.
- schema_to_relationdesc - Converts an Apache Avro schema into a list of column names and types.
- validate_schema_1 🔒 - Convert an Avro schema to a series of columns and names, flattening the top-level record, if the top node is indeed a record.
- validate_schema_2 🔒 - Get the single column corresponding to a schema node. It is an error if this node should correspond to more than one column (because it is an Essential Union in the sense described in the module docs).
- with_recursion_guard 🔒 - Runs f with node marked as on the current resolution path, bailing if it’s already on the path (a cycle). The mark is cleared on exit so sibling reuse of a named type isn’t flagged.
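The path-marking technique described for with_recursion_guard can be sketched generically. The real function's signature is not shown on this page, so `with_guard` and `count_nodes` below are hypothetical, nodes are identified by plain `usize` indices, and errors are strings.

```rust
use std::collections::HashSet;

/// Run `f` with `node` marked as on the current path; fail if it already is.
fn with_guard<T>(
    on_path: &mut HashSet<usize>,
    node: usize,
    f: impl FnOnce(&mut HashSet<usize>) -> Result<T, String>,
) -> Result<T, String> {
    // `insert` returns false when the node is already on the path: a cycle.
    if !on_path.insert(node) {
        return Err(format!("cycle detected at node {}", node));
    }
    let result = f(on_path);
    // Clear the mark on exit so a sibling may visit the same node again.
    on_path.remove(&node);
    result
}

/// Hypothetical walk over a schema graph where `children[i]` lists the
/// nodes referenced by node `i`. Counts visits, including repeated ones.
fn count_nodes(
    children: &[Vec<usize>],
    node: usize,
    on_path: &mut HashSet<usize>,
) -> Result<usize, String> {
    with_guard(on_path, node, |on_path| {
        let mut total = 1;
        for &child in &children[node] {
            total += count_nodes(children, child, on_path)?;
        }
        Ok(total)
    })
}
```

Because the mark is removed when `f` returns, a diamond-shaped graph in which two siblings reference the same node walks cleanly, while a graph in which a node reaches itself produces an error.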