Expand description
Code to render the ingestion dataflow of a PostgresSourceConnection. The dataflow consists
of multiple operators in order to take advantage of all the available workers.
§Snapshot
One part of the dataflow deals with snapshotting the tables involved in the ingestion. Each
table that needs a snapshot is assigned to a specific worker which performs a COPY query
and distributes the raw COPY bytes to all workers to decode the text encoded rows.
For all tables that ended up being snapshotted the snapshot reader also emits a rewind request to the replication reader which will ensure that the requested portion of the replication stream is subtracted from the snapshot.
See the snapshot module for more information on the snapshot strategy.
§Replication
The other part of the dataflow deals with reading the logical replication slot, which must happen from a single worker. The minimum amount of processing is performed from that worker and the data is then distributed among all workers for decoding.
See the replication module for more information on the replication strategy.
§Error handling
There are two kinds of errors that can happen during ingestion that are represented as two separate error types:
DefiniteErrors are errors that happen during processing of a specific
collection record at a specific LSN. These are the only errors that can ever end up in the
error collection of a subsource.
Transient errors are any errors that can happen for reasons that are unrelated to the data
itself. This could be authentication failures, connection failures, etc. The only operators
that can emit such errors are the TableReader and the ReplicationReader operators, which
are the ones that talk to the external world. Both of these operators are built with the
AsyncOperatorBuilder::build_fallible method which allows transient errors to be propagated
upwards with the standard ? operator without risking downgrading the capability and producing
bogus frontiers.
The error streams from both of those operators are published to the source status and also trigger a restart of the dataflow.
   ┏━━━━━━━━━━━━━━┓
   ┃    table     ┃
   ┃    reader    ┃
   ┗━┯━━━━━━━━━━┯━┛
     │          │rewind
     │          │requests
     │          ╰────╮
     │             ┏━v━━━━━━━━━━━┓
     │             ┃ replication ┃
     │             ┃   reader    ┃
     │             ┗━┯━━━━━━━━━┯━┛
 COPY│           slot│         │
 data│           data│         │
┏━━━━v━━━━━┓ ┏━━━━━━━v━━━━━┓   │
┃  COPY    ┃ ┃ replication ┃   │
┃ decoder  ┃ ┃   decoder   ┃   │
┗━━━━┯━━━━━┛ ┗━━━━━┯━━━━━━━┛   │
     │snapshot     │replication│
     │updates      │updates    │
     ╰────╮    ╭───╯           │
         ╭┴────┴╮              │
         │concat│              │
         ╰──┬───╯              │
            │ data             │progress
            │ output           │output
            v                  vModules§
- replication 🔒
- Renders the logical replication side of the PostgresSourceConnectioningestion dataflow.
- snapshot 🔒
- Renders the table snapshot side of the PostgresSourceConnectioningestion dataflow.
Structs§
- SlotMetadata 🔒
- The state of a replication slot.
- SourceOutput 🔒Info 
Enums§
- DefiniteError 
- A definite error that always ends up in the collection of a specific table.
- ReplicationError 
- TransientError 
- A transient error that never ends up in the collection of a specific table.
Functions§
- cast_row 🔒
- Casts a text row into the target types
- decode_utf8_ 🔒text 
- Converts raw bytes that are expected to be UTF8 encoded into a Datum::String
- ensure_replication_ 🔒slot 
- fetch_max_ 🔒lsn 
- Fetch the pg_current_wal_lsn, used to report metrics.
- fetch_slot_ 🔒metadata 
- Fetches the minimum LSN at which this slot can safely resume.
- verify_schema 🔒