Module mz_compute::render::continual_task
source · Expand description
A continual task presents as something like a BEFORE TRIGGER
: it watches
some input and whenever it changes at time T
, executes a SQL txn,
writing to some output at the same time T
. It can also read anything in
materialize as a reference, most notably including the output.
Only reacting to new inputs (and not the full history) makes a CT’s rehydration time independent of the size of the inputs (NB this is not true for references), enabling things like writing UPSERT on top of an append-only shard in SQL (ignore the obvious bug with my upsert impl):
CREATE CONTINUAL TASK upsert (key INT, val INT) ON INPUT append_only AS (
DELETE FROM upsert WHERE key IN (SELECT key FROM append_only);
INSERT INTO upsert SELECT key, max(val) FROM append_only GROUP BY key;
)
Unlike a materialized view, the continual task does not update outputs if references later change. This enables things like auditing:
CREATE CONTINUAL TASK audit_log (count INT8) ON INPUT anomalies AS (
INSERT INTO audit_log SELECT * FROM anomalies;
)
Rough implementation overview:
- A CT is created and starts at some
start_ts
optionally later dropped and stopped at someend_ts
. - A CT takes one or more _input_s. These must be persist shards (i.e. TABLE, SOURCE, MV, but not VIEW).
- A CT has one or more _output_s. The outputs are (initially) owned by the task and cannot be written to by other parts of the system.
- The task is run for each time one of the inputs changes starting at
start_ts
. - It is given the changes in its inputs at time
T
as diffs.- These are presented as two SQL relations with just the inserts/deletes.
- NB: A full collection for the input can always be recovered by also using the input as a “reference” (see below) and applying the diffs.
- The task logic is expressed as a SQL transaction that does all reads at
time
T-1
and commits all writes atT
- Intuition is that the logic runs before the input is written, like a
CREATE TRIGGER ... BEFORE
.
- Intuition is that the logic runs before the input is written, like a
- This logic can reference any nameable object in the system, not just the
inputs.
- However, the logic/transaction can mutate only the outputs.
- Summary of differences between inputs and references:
- The task receives snapshot + changes for references (like regular dataflow inputs today) but only changes for inputs.
- The task only produces output in response to changes in the inputs but not in response to changes in the references.
- Inputs are reclocked by subtracting 1 from their timestamps, references are not.
- Instead of re-evaluating the task logic from scratch for each input time, we maintain the collection representing desired writes to the output(s) as a dataflow.
- The task dataflow is tied to a
CLUSTER
and runs on eachREPLICA
.- HA strategy: multi-replica clusters race to commit and the losers throw away the result.