Repair Role rows left in an inconsistent state by the v80->v81 migration.
§Background
The catalog persist shard requires that every (key, ts) tuple
consolidate to Diff::ONE. Catalog writers retract by re-serializing the
in-memory parsed value through the current proto; this only consolidates
cleanly if the round-trip is byte-exact (database-issues#7179). Whenever
a proto adds a field, that invariant breaks for rows written before the
field existed: the stored row lacks the key entirely while the
re-serialized retraction writes it as explicit null, so the retraction
never cancels its target.
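The failure mode above can be reproduced with a toy model. This is a minimal sketch, not the persist or catalog code: `consolidate`, `serialize_old`, and `serialize_current` are hypothetical helpers, and the JSON-like strings only stand in for the real stored encoding.

```rust
use std::collections::BTreeMap;

/// Toy consolidation: sum diffs per (key, stored bytes) and drop pairs that
/// net to zero. A stand-in for the real consolidation, not a copy of it.
fn consolidate(updates: &[(&str, &str, i64)]) -> Vec<(String, String, i64)> {
    let mut sums: BTreeMap<(String, String), i64> = BTreeMap::new();
    for (key, value, diff) in updates {
        *sums.entry((key.to_string(), value.to_string())).or_insert(0) += diff;
    }
    sums.into_iter()
        .filter(|(_, diff)| *diff != 0)
        .map(|((key, value), diff)| (key, value, diff))
        .collect()
}

/// Hypothetical encoding from before the field existed: the key is absent.
fn serialize_old(name: &str) -> String {
    format!("{{\"name\":\"{}\"}}", name)
}

/// Hypothetical current encoding: the new field is written as explicit null.
fn serialize_current(name: &str) -> String {
    format!("{{\"name\":\"{}\",\"auto_provision_source\":null}}", name)
}
```

In this model, a `+1` written with `serialize_old` and a `-1` written with `serialize_current` for the same role both survive `consolidate`, because the bytes differ even though the parsed values agree. A `+1` and `-1` that both use `serialize_current` cancel to nothing.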
§The specific failure this migration targets
v80_to_v81::upgrade was supposed to backfill auto_provision_source on
every existing Role row. That backfill was gated on an is_cloud
heuristic that required the mz_system cluster to be
ClusterVariant::Managed; on envs where it wasn’t, the heuristic returned
false and the migration silently no-opped. The version bump committed
anyway, but every Role row kept its v80 form.
After v26.18 any DDL touching such a row (ALTER ROLE, role membership
changes, DROP ROLE) parses the row, then writes a retract+insert pair
through current protos that do include the new field. The retraction
doesn’t cancel, and the shard ends up holding three rows per affected
role:
- a stale +1 in the pre-v81 form,
- a dangling -1 in the current form (the retraction that missed),
- a live +1 in the current form reflecting whatever the DDL did.
For DROP ROLE the third row is absent — the role is gone, but the first
two persist forever.
§The repair
Two passes over the snapshot.
Pass 1 — cancel already-dangling retractions. For every Role with the
structural signature of the bug — a dangling -1 plus at least one +1
whose parsed RoleValue equals it, plus at most one other +1 with a
different parsed value — we emit:
- +1 of the dangling row, cancelling the dangling -1.
- -1 of every parsed-equal stale +1, completing the retraction the original DDL intended.
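As an illustration of pass 1, here is a minimal sketch over the rows of a single role key. The `Row` type and `cancel_dangling` function are hypothetical and simplified: the parsed value is a plain string rather than a RoleValue, and the real fingerprint check may be stricter than this one.

```rust
/// One consolidated row for a single role key (toy model).
#[derive(Clone, Debug, PartialEq)]
struct Row {
    stored: String, // exact stored form, used to target the row
    parsed: String, // stand-in for the parsed RoleValue
    diff: i64,
}

/// Returns the updates pass 1 would emit for this key, or None when the rows
/// do not match the fingerprint and should be left for human review.
fn cancel_dangling(rows: &[Row]) -> Option<Vec<(String, i64)>> {
    // Any |diff| > 1 is an unknown shape.
    if rows.iter().any(|r| r.diff.abs() != 1) {
        return None;
    }
    // Exactly one dangling -1 is expected.
    let negatives: Vec<&Row> = rows.iter().filter(|r| r.diff == -1).collect();
    if negatives.len() != 1 {
        return None;
    }
    let dangling = negatives[0];
    // Stale +1 rows parse to the same value as the dangling retraction.
    let stale: Vec<&Row> = rows
        .iter()
        .filter(|r| r.diff == 1 && r.parsed == dangling.parsed)
        .collect();
    // At most one other +1 (the live row) may carry a different value.
    let live = rows
        .iter()
        .filter(|r| r.diff == 1 && r.parsed != dangling.parsed)
        .count();
    if stale.is_empty() || live > 1 {
        return None;
    }
    // +1 cancels the dangling -1; -1 retracts each stale +1.
    let mut updates = vec![(dangling.stored.clone(), 1)];
    updates.extend(stale.iter().map(|r| (r.stored.clone(), -1)));
    Some(updates)
}
```

For the three-row shape described earlier, this sketch emits one `+1` against the dangling form and one `-1` against the stale form, leaving only the live row. A key holding nothing but a dangling `-1` returns `None` and is left untouched.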
Pass 2 — normalize untouched stale rows. Every remaining +1 Role
row whose stored form differs from what re-serializing its parsed value
through the current proto would produce is retracted and re-inserted in
canonical form. Without this, a Role still in pre-v81 form that hasn’t
yet had any DDL run against it would survive the migration unchanged,
and the next ALTER ROLE/DROP ROLE after v83 would manufacture a
fresh dangling -1 against bytes that no migration runs against
anymore. After pass 2, every Role row in the shard has the byte form
that future retractions will also produce, so consolidation cancels.
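Pass 2 can be sketched in the same toy style. The names `canonical_form` and `normalize_stale` are hypothetical, the encoding is a placeholder for re-serializing through the current proto, and the sketch omits the rule that keys with unrepaired shapes are skipped.

```rust
/// Toy stand-in for re-serializing a parsed value through the current proto.
fn canonical_form(parsed: &str) -> String {
    format!("{{\"name\":\"{}\",\"auto_provision_source\":null}}", parsed)
}

/// Takes the remaining +1 rows as (stored form, parsed value) pairs and
/// returns a retract + insert pair for each row not already canonical.
fn normalize_stale(rows: &[(String, String)]) -> Vec<(String, i64)> {
    let mut updates = Vec::new();
    for (stored, parsed) in rows {
        let canonical = canonical_form(parsed);
        if *stored != canonical {
            // Retract the exact stored bytes, then insert the canonical form.
            updates.push((stored.clone(), -1));
            updates.push((canonical, 1));
        }
    }
    updates
}
```

Rows that are already canonical produce no updates, so running the sketch a second time over its own output is a no-op.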
After commit, each affected RoleKey has either one live canonical
+1 or no rows at all (for the dropped case).
Anything that doesn’t fit the fingerprint — no parsed-equal sibling,
multiple distinct live candidates, non-Role kinds, |diff| > 1 — is
logged at WARN and left for human review, and pass 2 also leaves every
+1 for any such key alone (rewriting a subset of an unrepaired key
would just produce a new dangling diff). Better to under-clean and
surface unknown shapes for triage than over-clean and retire live state.
Structs§
- RepairStats 🔒 - Outcome counters for the repair, returned for logging and assertable in tests.
- RolePlusOne 🔒 - A +1 Role row borrowed from the snapshot. Carries both the stored form (so retractions can target the exact row) and the parsed value (so we can compare semantic equality across stored forms).
Constants§
Functions§
- apply_v80_to_v81_backfill 🔒 - Replays the v80_to_v81 migration’s auto_provision_source backfill on a single Role. Mirrors that migration exactly: only acts on cloud envs, only fills the field when it’s currently None, only sets Frontegg for names matching the email heuristic.
- auto_provision_source_for_name 🔒 - Email-name heuristic shared with the original v80_to_v81 migration: a Role whose name matches .+@.+\..+ (case-insensitive) is treated as auto-provisioned via Frontegg. Anything else (e.g., admin, prod_app) is left with auto_provision_source: None.
- compute_repairs 🔒 - Inspect a consolidated snapshot and return the updates needed to converge every affected Role onto a single canonical-form +1 (or zero rows, for the dropped case), and every untouched Role onto canonical form so future writers’ retractions consolidate.
- is_cloud_env 🔒 - Detects whether the snapshot belongs to a Materialize Cloud environment. Extends the original v80_to_v81 heuristic to recognize the additional shape we missed back then:
- try_as_role 🔒 - Returns the parsed Role iff kind_json is one. Returns None for any other kind, or for rows we can’t deserialize as the current Role shape (which we treat as “leave alone”, since losing that row to the repair would be worse than the soft_assert noise).
- upgrade
- version_update_kind 🔒 - Produces the Config row encoding a user-version bump. Identical to the helper in upgrade.rs, duplicated here so this module is self-contained for testing.
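The backfill rules described for apply_v80_to_v81_backfill and auto_provision_source_for_name can be sketched as follows. The types, signatures, and the hand-rolled matcher here are assumptions for illustration only. In particular, the sketch treats the pattern as an unanchored search and assumes role names contain no newlines; the real functions may differ on both points.

```rust
/// Toy mirror of the field's type; only the Frontegg variant is modelled.
#[derive(Clone, Debug, PartialEq)]
enum AutoProvisionSource {
    Frontegg,
}

/// Toy Role with just the two fields the backfill looks at.
#[derive(Clone, Debug, PartialEq)]
struct Role {
    name: String,
    auto_provision_source: Option<AutoProvisionSource>,
}

/// Hand-rolled equivalent of the pattern `.+@.+\..+`: some '@' with at least
/// one character before it, then a '.' at least one character after the '@',
/// then at least one more character. Case does not affect this pattern.
fn looks_like_email(name: &str) -> bool {
    let chars: Vec<char> = name.chars().collect();
    for (i, &c) in chars.iter().enumerate() {
        if c == '@' && i >= 1 {
            // The '.' needs one char between it and '@', and one char after.
            for j in (i + 2)..chars.len().saturating_sub(1) {
                if chars[j] == '.' {
                    return true;
                }
            }
        }
    }
    false
}

/// Assumed signature: email-like names map to Frontegg, all others to None.
fn auto_provision_source_for_name(name: &str) -> Option<AutoProvisionSource> {
    if looks_like_email(name) {
        Some(AutoProvisionSource::Frontegg)
    } else {
        None
    }
}

/// Assumed signature: acts only on cloud envs and only when the field is None.
fn apply_backfill(is_cloud: bool, role: &mut Role) {
    if is_cloud && role.auto_provision_source.is_none() {
        role.auto_provision_source = auto_provision_source_for_name(&role.name);
    }
}
```

Under these assumptions, a role named `alice@example.com` is backfilled to Frontegg on a cloud env, while `admin` and `prod_app` keep `auto_provision_source: None`, and no role is changed on a non-cloud env.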