Module v83_to_v84

Repair Role rows left in an inconsistent state by the v80->v81 migration.

§Background

The catalog persist shard requires that every (key, ts) tuple consolidate to Diff::ONE. Catalog writers retract by re-serializing the in-memory parsed value through the current proto; this only consolidates cleanly if the round-trip is byte-exact (database-issues#7179). Whenever a proto adds a field, that invariant breaks for rows written before the field existed: the stored row lacks the key entirely while the re-serialized retraction writes it as explicit null, so the retraction never cancels its target.
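The failure mode above can be reproduced with a toy consolidation routine. This is a minimal sketch, not the catalog's real code: rows are modeled as raw serialized strings, `consolidate` is a hypothetical helper, and the JSON shapes are invented for illustration.

```rust
use std::collections::BTreeMap;

/// Sums diffs per serialized row and drops rows whose diffs cancel to zero.
/// Hypothetical helper: rows are compared by their exact bytes, as in the
/// invariant described above.
fn consolidate(updates: &[(&str, i64)]) -> BTreeMap<String, i64> {
    let mut acc: BTreeMap<String, i64> = BTreeMap::new();
    for (row, diff) in updates {
        *acc.entry(row.to_string()).or_insert(0) += diff;
    }
    acc.retain(|_, diff| *diff != 0);
    acc
}

/// A row stored before the field existed, and the same value re-serialized
/// through a proto that writes the new field as explicit null.
/// Both shapes are illustrative assumptions.
const STORED: &str = r#"{"name":"alice"}"#;
const RESERIALIZED: &str = r#"{"name":"alice","auto_provision_source":null}"#;
```

Consolidating `[(STORED, 1), (RESERIALIZED, -1)]` leaves both rows behind, because the two strings differ; consolidating `[(RESERIALIZED, 1), (RESERIALIZED, -1)]` leaves nothing.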

§The specific failure this migration targets

v80_to_v81::upgrade was supposed to backfill auto_provision_source on every existing Role row. That backfill was gated on an is_cloud heuristic that required the mz_system cluster to be ClusterVariant::Managed; on envs where it wasn’t, the heuristic returned false and the migration silently no-opped. The version bump committed anyway, but every Role row kept its v80 form.

After v26.18, any DDL touching such a row (ALTER ROLE, role membership changes, DROP ROLE) parses the row, then writes a retract+insert pair through the current protos, which do include the new field. The retraction does not match the stored pre-v81 bytes, so it never cancels, and the shard ends up holding three rows per affected role:

  • a stale +1 in the pre-v81 form,
  • a dangling -1 in the current form (the retraction that missed),
  • a live +1 in the current form reflecting whatever the DDL did.

For DROP ROLE the third row is absent — the role is gone, but the first two persist forever.
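The three-row (or, for DROP ROLE, two-row) state can be simulated end to end. The sketch below assumes a heavily simplified `Role` with a hypothetical `superuser` attribute and hand-written serializers; none of these names come from the real catalog code.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-in for the catalog's Role value.
#[derive(Clone, Debug, PartialEq)]
struct Role {
    name: String,
    superuser: bool,
    auto_provision_source: Option<String>,
}

/// Pre-v81 byte form: the `auto_provision_source` key is absent.
fn serialize_pre_v81(r: &Role) -> String {
    format!("{{\"name\":\"{}\",\"superuser\":{}}}", r.name, r.superuser)
}

/// Current byte form: the key is always written, as `null` when unset.
fn serialize_current(r: &Role) -> String {
    let source = match &r.auto_provision_source {
        Some(s) => format!("\"{}\"", s),
        None => "null".to_string(),
    };
    format!(
        "{{\"name\":\"{}\",\"superuser\":{},\"auto_provision_source\":{}}}",
        r.name, r.superuser, source
    )
}

/// What a DDL writer emits: retract the parsed old value, then insert the
/// new value if the role still exists (`None` models DROP ROLE).
fn ddl_updates(old: &Role, new: Option<&Role>) -> Vec<(String, i64)> {
    let mut out = vec![(serialize_current(old), -1)];
    if let Some(new) = new {
        out.push((serialize_current(new), 1));
    }
    out
}

/// Sums diffs per serialized row and drops rows that cancel to zero.
fn consolidate(updates: Vec<(String, i64)>) -> BTreeMap<String, i64> {
    let mut acc: BTreeMap<String, i64> = BTreeMap::new();
    for (row, diff) in updates {
        *acc.entry(row).or_insert(0) += diff;
    }
    acc.retain(|_, diff| *diff != 0);
    acc
}
```

Starting from a shard that holds only `(serialize_pre_v81(&old), 1)` and appending `ddl_updates(&old, Some(&new))` consolidates to three rows; appending `ddl_updates(&old, None)` instead consolidates to two.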

§The repair

Two passes over the snapshot.

Pass 1: cancel already-dangling retractions. For every Role with the structural signature of the bug (a dangling -1, plus at least one +1 whose parsed RoleValue equals it, plus at most one other +1 with a different parsed value), we emit:

  1. +1 of the dangling row, cancelling the dangling -1.
  2. -1 of every parsed-equal stale +1, completing the retraction the original DDL intended.
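The two emitted updates can be sketched for a single key as follows. This is an illustration under stated assumptions, not the module's actual `compute_repairs`: `Row` and `pass1` are hypothetical names, and the stored form and parsed value are both modeled as plain strings.

```rust
// Hypothetical row model: `stored` is the exact serialized form in the shard,
// `parsed` stands in for the parsed RoleValue, `diff` is the consolidated diff.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    stored: String,
    parsed: String,
    diff: i64,
}

/// Returns the pass-1 updates for all rows of one key, or `None` when the
/// rows do not match the fingerprint and should be left for review.
fn pass1(rows: &[Row]) -> Option<Vec<(String, i64)>> {
    // Any |diff| other than 1 is an unknown shape.
    if rows.iter().any(|r| r.diff.abs() != 1) {
        return None;
    }
    // Exactly one dangling -1 is expected.
    let minus: Vec<&Row> = rows.iter().filter(|r| r.diff == -1).collect();
    if minus.len() != 1 {
        return None;
    }
    let dangling = minus[0];
    // Split the +1 rows into parsed-equal siblings and everything else.
    let (stale, others): (Vec<&Row>, Vec<&Row>) = rows
        .iter()
        .filter(|r| r.diff == 1)
        .partition(|r| r.parsed == dangling.parsed);
    if stale.is_empty() || others.len() > 1 {
        return None;
    }
    // 1. +1 of the dangling row cancels the dangling -1.
    let mut out = vec![(dangling.stored.clone(), 1)];
    // 2. -1 of every parsed-equal stale +1 completes the intended retraction.
    out.extend(stale.iter().map(|r| (r.stored.clone(), -1)));
    Some(out)
}
```

Applied to the three-row ALTER ROLE shape, the returned updates cancel the stale +1 and the dangling -1 and leave the live +1 untouched; applied to the two-row DROP ROLE shape, they leave no rows for the key.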

Pass 2: normalize untouched stale rows. Every remaining +1 Role row whose stored form differs from what re-serializing its parsed value through the current proto would produce is retracted and re-inserted in canonical form. Without this, a Role still in pre-v81 form that has not yet had any DDL run against it would survive the migration unchanged, and the next ALTER ROLE/DROP ROLE after v83 would manufacture a fresh dangling -1 against stored bytes that no migration targets anymore. After pass 2, every Role row in the shard has the byte form that future retractions will also produce, so those retractions consolidate cleanly.
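A minimal sketch of the normalization step, assuming rows are plain strings and the caller supplies a function that re-serializes a stored row into canonical form. `pass2` and `add_missing_key` are hypothetical names; the toy canonicalizer only illustrates the idea and is not how the real proto round-trip works.

```rust
/// For each remaining +1 row, emits a retraction of the stored form and an
/// insertion of the canonical form, but only when the two differ.
fn pass2<F>(plus_ones: &[&str], canonicalize: F) -> Vec<(String, i64)>
where
    F: Fn(&str) -> String,
{
    let mut out = Vec::new();
    for &stored in plus_ones {
        let canonical = canonicalize(stored);
        if canonical != stored {
            out.push((stored.to_string(), -1));
            out.push((canonical, 1));
        }
    }
    out
}

/// Toy canonicalizer (an assumption for illustration): appends the missing
/// key as explicit null to a flat JSON object that lacks it.
fn add_missing_key(stored: &str) -> String {
    if stored.contains("auto_provision_source") {
        stored.to_string()
    } else {
        format!(
            "{},\"auto_provision_source\":null}}",
            stored.trim_end_matches('}')
        )
    }
}
```

Rows already in canonical form produce no updates, so running the pass a second time over its own output emits nothing.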

After commit, each affected RoleKey has either one live canonical +1 or no rows at all (for the dropped case).

Anything that doesn’t fit the fingerprint — no parsed-equal sibling, multiple distinct live candidates, non-Role kinds, |diff| > 1 — is logged at WARN and left for human review, and pass 2 also leaves every +1 for any such key alone (rewriting a subset of an unrepaired key would just produce a new dangling diff). Better to under-clean and surface unknown shapes for triage than over-clean and retire live state.

Structs§

RepairStats 🔒
Outcome counters for the repair, returned for logging and assertable in tests.
RolePlusOne 🔒
A +1 Role row borrowed from the snapshot. Carries both the stored form (so retractions can target the exact row) and the parsed value (so we can compare semantic equality across stored forms).

Constants§

FROM_VERSION 🔒
TO_VERSION 🔒

Functions§

apply_v80_to_v81_backfill 🔒
Replays the v80_to_v81 migration’s auto_provision_source backfill on a single Role. Mirrors that migration exactly: only acts on cloud envs, only fills the field when it’s currently None, only sets Frontegg for names matching the email heuristic.
auto_provision_source_for_name 🔒
Email-name heuristic shared with the original v80_to_v81 migration: a Role whose name matches .+@.+\..+ (case-insensitive) is treated as auto-provisioned via Frontegg. Anything else (e.g., admin, prod_app) is left with auto_provision_source: None.
compute_repairs 🔒
Inspect a consolidated snapshot and return the updates needed to converge every affected Role onto a single canonical-form +1 (or zero rows, for the dropped case), and every untouched Role onto canonical form so future writers’ retractions consolidate.
is_cloud_env 🔒
Detects whether the snapshot belongs to a Materialize Cloud environment. Extends the original v80_to_v81 heuristic to recognize the additional shape we missed back then.
try_as_role 🔒
Returns the parsed Role iff kind_json is one. Returns None for any other kind, or for rows we can’t deserialize as the current Role shape (which we treat as “leave alone” — losing that row to the repair would be worse than the soft_assert noise).
upgrade
version_update_kind 🔒
Produces the Config row encoding a user-version bump. Identical to the helper in upgrade.rs, duplicated here so this module is self-contained for testing.
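The email-name heuristic and the backfill rules described for auto_provision_source_for_name and apply_v80_to_v81_backfill can be sketched without a regex dependency. Everything below is an assumption for illustration: the `Role` and `AutoProvisionSource` types are simplified stand-ins, the function names and signatures are hypothetical, and the hand-rolled matcher ignores the regex detail that `.` does not match a newline. Because the pattern `.+@.+\..+` contains no letters, the case-insensitive flag does not change what it matches.

```rust
// Hypothetical stand-ins; the real catalog types are richer.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AutoProvisionSource {
    Frontegg,
}

#[derive(Clone, Debug, PartialEq)]
struct Role {
    name: String,
    auto_provision_source: Option<AutoProvisionSource>,
}

/// Hand-rolled equivalent of `.+@.+\..+`: at least one character before an
/// '@', then at least one character, a '.', and at least one more character.
fn email_like(name: &str) -> bool {
    let chars: Vec<char> = name.chars().collect();
    for (at, &c) in chars.iter().enumerate() {
        if c != '@' || at == 0 {
            continue;
        }
        let rest = &chars[at + 1..];
        // Look for a '.' that has a character on both sides within `rest`.
        if rest
            .iter()
            .enumerate()
            .any(|(dot, &ch)| ch == '.' && dot >= 1 && dot + 1 < rest.len())
        {
            return true;
        }
    }
    false
}

/// Email-like names map to Frontegg; everything else stays `None`.
fn source_for_name(name: &str) -> Option<AutoProvisionSource> {
    if email_like(name) {
        Some(AutoProvisionSource::Frontegg)
    } else {
        None
    }
}

/// Backfill sketch: acts only on cloud envs and only when the field is unset.
fn backfill(role: &mut Role, is_cloud: bool) {
    if is_cloud && role.auto_provision_source.is_none() {
        role.auto_provision_source = source_for_name(&role.name);
    }
}
```

With these definitions, `alice@example.com` is backfilled to Frontegg on a cloud env, while `admin` and `prod_app` keep `auto_provision_source: None`, and nothing changes on a non-cloud env.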