Module oneshot_source


“Oneshot” sources perform a one-time ingestion of data from an external system. Unlike traditional sources, they do not run continuously. Oneshot sources are generally used for COPY FROM SQL statements.

Reading and parsing data are implemented behind the OneshotSource and OneshotFormat traits, respectively. Users looking to add new sources or formats should only need to add new implementations of these traits.

  • OneshotSource is an interface for listing and reading from an external system, e.g. an HTTP server.
  • OneshotFormat is an interface for how to parallelize and parse data, e.g. CSV.
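The split between the two traits can be illustrated with a deliberately simplified, synchronous sketch. Everything in it is hypothetical: `ToySource`, `ToyFormat`, `InMemorySource`, `NaiveCsv`, and `ingest` are stand-ins invented for this example and are not the signatures of the real OneshotSource or OneshotFormat traits, which are not reproduced on this page.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a source: it can list objects and read their bytes.
trait ToySource {
    /// Handle to one remote object (loosely analogous to `OneshotSource::Object`).
    type Object;
    fn list(&self) -> Vec<Self::Object>;
    fn get(&self, object: &Self::Object) -> Vec<u8>;
}

/// Hypothetical stand-in for a format: it turns raw bytes into rows.
trait ToyFormat {
    type Row;
    fn decode(&self, bytes: &[u8]) -> Vec<Self::Row>;
}

/// In-memory "external system" mapping object names to their contents.
struct InMemorySource {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ToySource for InMemorySource {
    type Object = String;

    fn list(&self) -> Vec<String> {
        self.objects.keys().cloned().collect()
    }

    fn get(&self, object: &String) -> Vec<u8> {
        self.objects.get(object).cloned().unwrap_or_default()
    }
}

/// Naive comma-separated format with no quoting or escaping support.
struct NaiveCsv;

impl ToyFormat for NaiveCsv {
    type Row = Vec<String>;

    fn decode(&self, bytes: &[u8]) -> Vec<Vec<String>> {
        String::from_utf8_lossy(bytes)
            .lines()
            .filter(|line| !line.is_empty())
            .map(|line| line.split(',').map(|field| field.to_string()).collect())
            .collect()
    }
}

/// Generic over both traits, so any source can be paired with any format.
fn ingest<S: ToySource, F: ToyFormat>(source: &S, format: &F) -> Vec<F::Row> {
    source
        .list()
        .iter()
        .flat_map(|object| format.decode(&source.get(object)))
        .collect()
}
```

Because `ingest` only depends on the two traits, adding another source (say, one backed by HTTP) or another format would not require touching it, which is the property the paragraph above describes.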

Given a OneshotSource and a OneshotFormat we build a dataflow structured like the following:

            ┏━━━━━━━━━━━━━━━┓
            ┃    Discover   ┃
            ┃    objects    ┃
            ┗━━━━━━━┯━━━━━━━┛
          ┌───< Distribute >───┐
          │                    │
    ┏━━━━━v━━━━┓         ┏━━━━━v━━━━┓
    ┃  Split   ┃   ...   ┃  Split   ┃
    ┃  Work 1  ┃         ┃  Work n  ┃
    ┗━━━━━┯━━━━┛         ┗━━━━━┯━━━━┛
          │                    │
          ├───< Distribute >───┤
          │                    │
    ┏━━━━━v━━━━┓         ┏━━━━━v━━━━┓
    ┃  Fetch   ┃   ...   ┃  Fetch   ┃
    ┃  Work 1  ┃         ┃  Work n  ┃
    ┗━━━━━┯━━━━┛         ┗━━━━━┯━━━━┛
          │                    │
          ├───< Distribute >───┤
          │                    │
    ┏━━━━━v━━━━┓         ┏━━━━━v━━━━┓
    ┃  Decode  ┃   ...   ┃  Decode  ┃
    ┃  Chunk 1 ┃         ┃  Chunk n ┃
    ┗━━━━━┯━━━━┛         ┗━━━━━┯━━━━┛
          │                    │
          │                    │
    ┏━━━━━v━━━━┓         ┏━━━━━v━━━━┓
    ┃  Stage   ┃   ...   ┃  Stage   ┃
    ┃  Batch 1 ┃         ┃  Batch n ┃
    ┗━━━━━┯━━━━┛         ┗━━━━━┯━━━━┛
          │                    │
          └─────────┬──────────┘
              ┏━━━━━v━━━━┓
              ┃  Result  ┃
              ┃ Callback ┃
              ┗━━━━━━━━━━┛
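
As a rough, single-process analogy for the stages in the diagram, the sketch below splits objects into work requests, distributes them round-robin across workers, then fetches, decodes, and stages them before a final callback. It is not the real dataflow: every name (`WorkRequest`, `split_work`, `distribute`, `fetch`, `decode`, `run`, `result_callback`) is hypothetical, the "format" is assumed to be fixed-width little-endian `u32` records so that splitting on record boundaries is trivially safe, and work is distributed only once rather than between every stage.

```rust
/// Assumed record width: each record is one little-endian u32.
const RECORD_LEN: usize = 4;

/// Hypothetical unit of work: a byte range within one discovered object.
#[derive(Debug, Clone, PartialEq)]
struct WorkRequest {
    object: usize,
    start: usize,
    end: usize,
}

/// "Split work": cut every object into ranges of whole records.
fn split_work(objects: &[Vec<u8>], records_per_request: usize) -> Vec<WorkRequest> {
    let step = records_per_request.max(1) * RECORD_LEN;
    let mut requests = Vec::new();
    for (object, bytes) in objects.iter().enumerate() {
        let mut start = 0;
        while start < bytes.len() {
            let end = (start + step).min(bytes.len());
            requests.push(WorkRequest { object, start, end });
            start = end;
        }
    }
    requests
}

/// "Distribute": assign items to workers round-robin.
fn distribute<T>(items: Vec<T>, workers: usize) -> Vec<Vec<T>> {
    let workers = workers.max(1);
    let mut queues: Vec<Vec<T>> = (0..workers).map(|_| Vec::new()).collect();
    for (i, item) in items.into_iter().enumerate() {
        queues[i % workers].push(item);
    }
    queues
}

/// "Fetch work": read the requested byte range of an object.
fn fetch(objects: &[Vec<u8>], request: &WorkRequest) -> Vec<u8> {
    objects[request.object][request.start..request.end].to_vec()
}

/// "Decode chunk": turn a chunk of bytes into rows (here, plain u32 values).
fn decode(chunk: &[u8]) -> Vec<u32> {
    chunk
        .chunks_exact(RECORD_LEN)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

/// Run split, distribute, fetch, decode, and stage; returns one batch per worker.
/// The `objects` slice plays the role of the already-discovered objects.
fn run(objects: &[Vec<u8>], workers: usize, records_per_request: usize) -> Vec<Vec<u32>> {
    let requests = split_work(objects, records_per_request);
    distribute(requests, workers)
        .into_iter()
        .map(|queue| {
            // "Stage batch": each worker collects its decoded rows into one batch.
            queue
                .iter()
                .flat_map(|request| decode(&fetch(objects, request)))
                .collect::<Vec<u32>>()
        })
        .collect()
}

/// "Result callback": report the total number of staged rows.
fn result_callback(batches: &[Vec<u32>]) -> usize {
    batches.iter().map(Vec::len).sum()
}
```

Calling `run(&objects, 2, 2)` on two small objects and passing the batches to `result_callback` yields the total row count regardless of how the requests were spread across workers, which mirrors how the final stage in the diagram gathers results from all workers.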

Modules

aws_source
AWS S3 OneshotSource.
csv
CSV to Row Decoder.
http_source
Generic HTTP oneshot source that will fetch a file from the public internet.
parquet
Parquet OneshotFormat.
util 🔒
Utility functions for Oneshot sources.

Structs

StorageErrorX
Experimental Error Type.

Enums

ChecksumKind 🔒
Enum wrapper for OneshotSource::Checksum; see SourceKind for more details.
Encoding
Encoding of a OneshotObject.
FormatKind 🔒
An enum wrapper around OneshotFormats.
ObjectFilter 🔒
ObjectKind 🔒
Enum wrapper for OneshotSource::Object; see SourceKind for more details.
RecordChunkKind 🔒
RequestKind 🔒
SourceKind 🔒
An enum wrapper around OneshotSources.
StorageErrorXKind
Experimental Error Type; see StorageErrorX.

Traits

OneshotFormat
Defines a format that we fetch for a “one time” ingestion.
OneshotObject
An object that will be fetched from a OneshotSource.
OneshotSource
Defines a remote system that we can fetch data from for a “one time” ingestion.
StorageErrorXContext 🔒

Functions

render
Render a dataflow to do a “oneshot” ingestion.
render_completion_operator
Render an operator that, given a stream of ProtoBatches, will call our worker_callback to report the results upstream.
render_decode_chunk
Render an operator that, given a stream of OneshotFormat::RecordChunks, will decode these chunks into a stream of Rows.
render_discover_objects
Render an operator that, using a OneshotSource, will discover what objects are available for fetching.
render_fetch_work
Render an operator that, given a stream of OneshotFormat::WorkRequests, will fetch chunks of the remote OneshotSource::Object and return a stream of OneshotFormat::RecordChunks that can be decoded into Rows.
render_split_work
Render an operator that, given a stream of OneshotSource::Objects, will split them into units of work based on the provided OneshotFormat.
render_stage_batches_operator
Render an operator that, given a stream of Rows, will stage them in Persist and return a stream of ProtoBatches that can later be linked into a shard.