Module v82_to_v83

Repair Role rows left in an inconsistent state by the v80->v81 migration.

§Background

The catalog persist shard requires that every (key, ts) tuple consolidate to Diff::ONE. Catalog writers retract by re-serializing the in-memory parsed value through the current proto; this only consolidates cleanly if the round-trip is byte-exact (database-issues#7179). Whenever a proto adds a field, that invariant breaks for rows written before the field existed: the stored row lacks the key entirely while the re-serialized retraction writes it as explicit null, so the retraction never cancels its target.
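A minimal sketch of why the byte-exact requirement matters, assuming a simplified model in which each update is just a serialized string plus a diff. The `Update` alias and `consolidate` function are hypothetical stand-ins for illustration, not the catalog's actual types:

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a stored catalog update: the exact serialized
/// form of the row, plus its diff (+1 for an insert, -1 for a retraction).
type Update = (String, i64);

/// Sums diffs per exact serialized form and drops entries that cancel to
/// zero. Two rows only cancel when their serialized forms are byte-identical.
fn consolidate(updates: &[Update]) -> Vec<Update> {
    let mut sums: BTreeMap<&str, i64> = BTreeMap::new();
    for (row, diff) in updates {
        *sums.entry(row.as_str()).or_insert(0) += diff;
    }
    sums.into_iter()
        .filter(|(_, diff)| *diff != 0)
        .map(|(row, diff)| (row.to_string(), diff))
        .collect()
}
```

Under this model, retracting `{"name":"alice"}` with the same bytes leaves nothing behind, while retracting it as `{"name":"alice","auto_provision_source":null}` leaves both a +1 and a -1 in the output. The `name` field and its value are invented for the example.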

§The specific failure this migration targets

v80_to_v81::upgrade was supposed to backfill auto_provision_source on every existing Role row. That backfill was gated on an is_cloud heuristic that required the mz_system cluster to be ClusterVariant::Managed; in environments where it wasn’t, the heuristic returned false and the migration silently did nothing. The version bump committed anyway, but every Role row kept its v80 form.

After v26.18, any DDL touching such a row (ALTER ROLE, role membership changes, DROP ROLE) parses the row, then writes a retract+insert pair through the current protos, which do include the new field. The retraction doesn’t cancel, and the shard ends up holding three rows per affected role:

  • a stale +1 in the pre-v81 form,
  • a dangling -1 in the current form (the retraction that missed),
  • a live +1 in the current form reflecting whatever the DDL did.

For DROP ROLE the third row is absent — the role is gone, but the first two persist forever.

§The repair

For every Role with the structural signature of this bug — a dangling -1 plus at least one +1 whose parsed RoleValue equals it, plus at most one other +1 with a different parsed value — we emit:

  1. +1 of the dangling row, cancelling the dangling -1.
  2. -1 of every parsed-equal stale +1, completing the retraction the original DDL intended.

After commit, each affected RoleKey has either one live +1 or no rows at all (for the dropped case).

Anything that doesn’t fit the fingerprint — no parsed-equal sibling, multiple distinct live candidates, non-Role kinds, |diff| > 1 — is logged at WARN and left for human review. Better to under-clean and surface unknown shapes for triage than over-clean and retire live state.
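The fingerprint check and the two emitted updates can be sketched as follows. This is an illustration under simplifying assumptions, not the module's compute_repairs: the `Row` struct, its fields, and `compute_repairs_sketch` are hypothetical, role keys are plain integers, and stored and parsed forms are plain strings.

```rust
use std::collections::BTreeMap;

/// Hypothetical simplified row: role key, exact stored form, parsed value
/// (the semantic content), and diff.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    key: u64,
    stored: String,
    parsed: String,
    diff: i64,
}

/// Returns (repairs, skipped_keys) for a consolidated snapshot. Keys that do
/// not match the fingerprint are reported in `skipped_keys` and not touched.
fn compute_repairs_sketch(snapshot: &[Row]) -> (Vec<Row>, Vec<u64>) {
    let mut by_key: BTreeMap<u64, Vec<&Row>> = BTreeMap::new();
    for row in snapshot {
        by_key.entry(row.key).or_default().push(row);
    }

    let mut repairs = Vec::new();
    let mut skipped = Vec::new();
    for (key, rows) in by_key {
        // A lone +1 is the healthy shape; nothing to repair.
        if rows.len() == 1 && rows[0].diff == 1 {
            continue;
        }
        let minus: Vec<&Row> = rows.iter().copied().filter(|r| r.diff == -1).collect();
        let plus: Vec<&Row> = rows.iter().copied().filter(|r| r.diff == 1).collect();
        // Require exactly one dangling -1, and no diffs other than +1 / -1.
        if minus.len() != 1 || minus.len() + plus.len() != rows.len() {
            skipped.push(key);
            continue;
        }
        let dangling = minus[0];
        // Split the +1 rows into parsed-equal (stale) and different (live).
        let (stale, live): (Vec<&Row>, Vec<&Row>) =
            plus.into_iter().partition(|r| r.parsed == dangling.parsed);
        if stale.is_empty() || live.len() > 1 {
            skipped.push(key);
            continue;
        }
        // 1. +1 of the dangling row, cancelling the dangling -1.
        repairs.push(Row { diff: 1, ..dangling.clone() });
        // 2. -1 of every parsed-equal stale +1.
        for s in stale {
            repairs.push(Row { diff: -1, ..s.clone() });
        }
    }
    (repairs, skipped)
}
```

Comparing on the parsed value while retracting the stored form is the key design choice here: the stale row and the dangling row differ in bytes but not in meaning, so only a semantic comparison can pair them, and only the exact stored bytes can cancel them.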

Structs§

RepairStats 🔒
Outcome counters for the repair, returned for logging and assertable in tests.
RolePlusOne 🔒
A +1 Role row borrowed from the snapshot. Carries both the stored form (so retractions can target the exact row) and the parsed value (so we can compare semantic equality across stored forms).

Constants§

FROM_VERSION 🔒
TO_VERSION 🔒

Functions§

compute_repairs 🔒
Inspect a consolidated snapshot and return the updates needed to converge every affected Role onto a single live +1 (or zero rows, for the dropped case).
try_as_role 🔒
Returns the parsed Role iff kind_json is one. Returns None for any other kind, or for rows we can’t deserialize as the current Role shape (which we treat as “leave alone” — losing that row to the repair would be worse than the soft_assert noise).
upgrade
version_update_kind 🔒
Produces the Config row encoding a user-version bump. Identical to the helper in upgrade.rs, duplicated here so this module is self-contained for testing.