Module mz_compute::render::continual_task

source ·
Expand description

A continual task presents as something like a BEFORE TRIGGER: it watches some input and whenever it changes at time T, executes a SQL txn, writing to some output at the same time T. It can also read anything in materialize as a reference, most notably including the output.

Only reacting to new inputs (and not the full history) makes a CT’s rehydration time independent of the size of the inputs (NB this is not true for references), enabling things like writing UPSERT on top of an append-only shard in SQL (ignore the obvious bug with my upsert impl):

CREATE CONTINUAL TASK upsert (key INT, val INT) ON INPUT append_only AS (
    DELETE FROM upsert WHERE key IN (SELECT key FROM append_only);
    INSERT INTO upsert SELECT key, max(val) FROM append_only GROUP BY key;
)

Unlike a materialized view, the continual task does not update outputs if references later change. This enables things like auditing:

CREATE CONTINUAL TASK audit_log (count INT8) ON INPUT anomalies AS (
    INSERT INTO audit_log SELECT * FROM anomalies;
)

Rough implementation overview:

  • A CT is created and starts at some start_ts optionally later dropped and stopped at some end_ts.
  • A CT takes one or more _input_s. These must be persist shards (i.e. TABLE, SOURCE, MV, but not VIEW).
  • A CT has one or more _output_s. The outputs are (initially) owned by the task and cannot be written to by other parts of the system.
  • The task is run for each time one of the inputs changes starting at start_ts.
  • It is given the changes in its inputs at time T as diffs.
    • These are presented as two SQL relations with just the inserts/deletes.
    • NB: A full collection for the input can always be recovered by also using the input as a “reference” (see below) and applying the diffs.
  • The task logic is expressed as a SQL transaction that does all reads at time T-1 and commits all writes at T
    • Intuition is that the logic runs before the input is written, like a CREATE TRIGGER ... BEFORE.
  • This logic can reference any nameable object in the system, not just the inputs.
    • However, the logic/transaction can mutate only the outputs.
  • Summary of differences between inputs and references:
    • The task receives snapshot + changes for references (like regular dataflow inputs today) but only changes for inputs.
    • The task only produces output in response to changes in the inputs but not in response to changes in the references.
    • Inputs are reclocked by subtracting 1 from their timestamps, references are not.
  • Instead of re-evaluating the task logic from scratch for each input time, we maintain the collection representing desired writes to the output(s) as a dataflow.
  • The task dataflow is tied to a CLUSTER and runs on each REPLICA.
    • HA strategy: multi-replica clusters race to commit and the losers throw away the result.

Structs§

Functions§