Repair Role rows left in an inconsistent state by the v80->v81 migration.
§Background
The catalog persist shard requires that every (key, ts) tuple
consolidate to Diff::ONE. Catalog writers retract by re-serializing the
in-memory parsed value through the current proto; this only consolidates
cleanly if the round-trip is byte-exact (database-issues#7179). Whenever
a proto adds a field, that invariant breaks for rows written before the
field existed: the stored row lacks the key entirely while the
re-serialized retraction writes it as explicit null, so the retraction
never cancels its target.
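The mismatch can be reproduced in a few lines. The following is a minimal sketch, not the catalog's actual code: `consolidate`, `retract`, `STORED_V80`, and `RESERIALIZED` are hypothetical names, and plain JSON-like strings stand in for the real serialized rows. It assumes only what the text above states, namely that rows are matched on their serialized bytes.

```rust
use std::collections::BTreeMap;

/// Hypothetical serialized form of a row written before the field existed.
const STORED_V80: &str = r#"{"name":"alice"}"#;
/// Hypothetical form produced by re-serializing through the current proto.
const RESERIALIZED: &str = r#"{"name":"alice","auto_provision_source":null}"#;

/// Sum the diffs per serialized row and drop rows that net to zero.
/// Rows are compared byte for byte, never by parsed value.
fn consolidate(updates: &[(&str, i64)]) -> Vec<(String, i64)> {
    let mut acc: BTreeMap<String, i64> = BTreeMap::new();
    for (row, diff) in updates {
        *acc.entry(row.to_string()).or_insert(0) += diff;
    }
    acc.into_iter().filter(|(_, d)| *d != 0).collect()
}

/// Shard contents after a `+1` of `stored` is retracted with `retraction`.
fn retract(stored: &str, retraction: &str) -> Vec<(String, i64)> {
    consolidate(&[(stored, 1), (retraction, -1)])
}
```

With a byte-exact round trip, `retract(STORED_V80, STORED_V80)` is empty. With the explicit null, `retract(STORED_V80, RESERIALIZED)` keeps both rows, because the two strings are different keys as far as consolidation is concerned.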
§The specific failure this migration targets
v80_to_v81::upgrade was supposed to backfill auto_provision_source on
every existing Role row. That backfill was gated on an is_cloud
heuristic that required the mz_system cluster to be
ClusterVariant::Managed; on envs where it wasn’t, the heuristic returned
false and the migration silently no-opped. The version bump committed
anyway, but every Role row kept its v80 form.
After v26.18 any DDL touching such a row (ALTER ROLE, role membership
changes, DROP ROLE) parses the row, then writes a retract+insert pair
through current protos that do include the new field. The retraction
doesn’t cancel, and the shard ends up holding three rows per affected
role:
- a stale +1 in the pre-v81 form,
- a dangling -1 in the current form (the retraction that missed),
- a live +1 in the current form reflecting whatever the DDL did.
For DROP ROLE the third row is absent — the role is gone, but the first
two persist forever.
§The repair
For every Role with the structural signature of this bug — a dangling -1
plus at least one +1 whose parsed RoleValue equals it, plus at most
one other +1 with a different parsed value — we emit:
- +1 of the dangling row, cancelling the dangling -1.
- -1 of every parsed-equal stale +1, completing the retraction the original DDL intended.
After commit, each affected RoleKey has either one live +1 or no rows
at all (for the dropped case).
Anything that doesn’t fit the fingerprint — no parsed-equal sibling,
multiple distinct live candidates, non-Role kinds, |diff| > 1 — is
logged at WARN and left for human review. Better to under-clean and
surface unknown shapes for triage than over-clean and retire live state.
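The fingerprint check and the emitted updates can be sketched for a single role key as follows. This is an illustration under stated assumptions, not the module's `compute_repairs`: `Outcome`, `parse`, `repair_one_key`, and `after_commit` are hypothetical names, rows are JSON-like strings, and `parse` is a stand-in that only strips the explicit null. Two choices here go beyond the text and are assumptions: a key with no dangling -1 is left alone silently, and a key with more than one dangling -1 is sent to review.

```rust
use std::collections::BTreeMap;

/// Result of inspecting the rows stored under one role key.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Updates to append, as (stored row, diff).
    Repair(Vec<(String, i64)>),
    /// No dangling -1 under this key (assumption: nothing to do).
    Untouched,
    /// Does not match the fingerprint; leave for human review.
    NeedsReview,
}

/// Stand-in for parsing: drops the explicit null so the pre-v81 and current
/// encodings of the same role compare equal. Real code would deserialize.
fn parse(stored: &str) -> String {
    stored.replace(r#","auto_provision_source":null"#, "")
}

fn repair_one_key(rows: &[(&str, i64)]) -> Outcome {
    // |diff| > 1 is outside the fingerprint.
    if rows.iter().any(|(_, d)| d.abs() > 1) {
        return Outcome::NeedsReview;
    }
    let minus_ones: Vec<&str> =
        rows.iter().filter(|(_, d)| *d == -1).map(|(r, _)| *r).collect();
    let dangling = match minus_ones.as_slice() {
        [] => return Outcome::Untouched,
        [one] => *one,
        _ => return Outcome::NeedsReview, // assumption: ambiguous, do not guess
    };
    // Split the +1 rows into parsed-equal (stale) and everything else (live).
    let want = parse(dangling);
    let (stale, other): (Vec<&str>, Vec<&str>) = rows
        .iter()
        .filter(|(_, d)| *d == 1)
        .map(|(r, _)| *r)
        .partition(|r| parse(r) == want);
    if stale.is_empty() || other.len() > 1 {
        return Outcome::NeedsReview;
    }
    // +1 cancels the dangling -1; each -1 retracts a stale +1 exactly as stored.
    let mut fixes: Vec<(String, i64)> = vec![(dangling.to_string(), 1)];
    fixes.extend(stale.iter().map(|r| (r.to_string(), -1)));
    Outcome::Repair(fixes)
}

/// Consolidated contents of the key once `fixes` are appended to `rows`.
fn after_commit(rows: &[(&str, i64)], fixes: &[(String, i64)]) -> Vec<(String, i64)> {
    let mut acc: BTreeMap<String, i64> = BTreeMap::new();
    for (row, diff) in rows {
        *acc.entry(row.to_string()).or_insert(0) += diff;
    }
    for (row, diff) in fixes {
        *acc.entry(row.clone()).or_insert(0) += diff;
    }
    acc.into_iter().filter(|(_, d)| *d != 0).collect()
}
```

The retraction targets the stored bytes of the stale row rather than a re-serialized copy, which is the point of keeping both forms around: only a byte-exact -1 consolidates against the stale +1.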
Structs§
- RepairStats 🔒 - Outcome counters for the repair, returned for logging and assertable in tests.
- RolePlusOne 🔒 - A +1 Role row borrowed from the snapshot. Carries both the stored form (so retractions can target the exact row) and the parsed value (so we can compare semantic equality across stored forms).
Constants§
Functions§
- compute_repairs 🔒 - Inspect a consolidated snapshot and return the updates needed to converge every affected Role onto a single live +1 (or zero rows, for the dropped case).
- try_as_role 🔒 - Returns the parsed Role iff kind_json is one. Returns None for any other kind, or for rows we can’t deserialize as the current Role shape (which we treat as “leave alone”: losing that row to the repair would be worse than the soft_assert noise).
- upgrade
- version_update_kind 🔒 - Produces the Config row encoding a user-version bump. Identical to the helper in upgrade.rs, duplicated here so this module is self-contained for testing.