Module cdc

Source
Expand description

Replicate a table from SQL Server using their Change-Data-Capture (CDC) primitives.

This module provides a CdcStream type that provides the following API for replicating a table:

  1. CdcStream::snapshot returns an initial snapshot of a table and the Lsn at which the snapshot was taken.
  2. CdcStream::into_stream returns a futures::Stream of CdcEvents optionally from the Lsn returned in step 1.

The snapshot process is responsible for identifying an Lsn that corresponds to a point-in-time view of the data for the table(s) being copied. Similarly to MySQL, Microsoft SQL server, as far as we know, does not provide an API to achieve this.

SQL Server SNAPSHOT isolation provides guarantees that a reader will only see writes committed before the transaction began. More specficially, this snapshot is implemented using versions that are visibile based on the transaction sequence number (XSN). The XSN is set at the first read or write, not at BEGIN TRANSACTION, see here. This provides us a suitable starting point for capturing the table data. To force an XSN to be assigned, experiments have shown that a table must be read. We choose a well-known table that we should already have access to, cdc.change_tables, and read a single value from it.

Due to the asynchronous nature of CDC, we can assume that the Lsn returned from any CDC tables or CDC functions will always be stale, in relation to the source table that CDC is tracking. The system table sys.dm_tran_database_transactions will contain an Lsn for any transaction that performs a write operation. Creating a savepoint using SAVE TRANSACTION is sufficient to generate an Lsn in this case.

To ensure that the the point-in-time view is established atomically with collection of the Lsn, we lock the tables to prevent writes from being interleaved between the 2 commands (read to establish XSN and creation of the savepoint).

SQL server supports table locks, but those will only be released once the outermost transaction completes. For this reason, this module uses two connections for the snapshot process. The first connection is used to initiate a transaction and lock the upstream tables under TransactionIsolationLevel::ReadCommitted isolation. While the first connection maintains the locks, the second connection starts a transaction with TransactionIsolationLevel::Snapshot isolation and creates a savepoint. Once the savepoint is created, SQL server has assigned an Lsn and the the first connection rolls back the transaction. The Lsn and snapshot are captured by the second connection within the existing transaction.

After completing the snapshot we use crate::inspect::get_changes_asc which will return all changes between a [lower, upper) bound of Lsns.

Structs§

CdcStream
A stream of changes from a table in SQL Server that has CDC enabled.
Lsn
This type is used to represent the progress of each SQL Server instance in the ingestion dataflow.
StructuredLsn
Structured format of an Lsn.

Enums§

CdcError
CdcEvent
A change event from a CdcStream.
Operation
Identifies what change was made to the SQL Server table tracked by CDC.
RowFilterOption
When querying CDC functions like cdc.fn_cdc_get_all_changes_<capture_instance> this governs what content is returned.