const DRAIN_CHUNK_ROWS: usize = 16;
Drain rows from staging in chunks of this size (K). Inside each chunk we do three sequential
per-column passes, each long enough for autovectorization (4–8 vector iterations on NEON / SVE2
128-bit / AVX2 / SVE 256-bit), while the outer per-chunk size check bounds overshoot to
K * row_words. With a 1 KiB row (128 words) and K=16, worst-case overshoot is 2,048 words
(16 KiB), well under the 10% slop budget (about 205 KiB) on a 2 MiB target.
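
The overshoot bound can be checked with a small sketch. This is illustrative only, not the crate's actual drain code: `WORD_BYTES`, `worst_case_overshoot_bytes`, and `drain_chunked` are hypothetical names, and the 8-byte word is an assumption inferred from "1 KiB row (128 words)". With those assumptions, K * row_words = 16 * 128 = 2,048 words, or 16 KiB.

```rust
/// Rows drained per chunk (the constant documented here).
const DRAIN_CHUNK_ROWS: usize = 16;
/// Assumption: 8-byte words, so that a 1 KiB row is 128 words.
const WORD_BYTES: usize = 8;

/// Worst-case overshoot in bytes: at most one full chunk past the target.
fn worst_case_overshoot_bytes(row_words: usize) -> usize {
    DRAIN_CHUNK_ROWS * row_words * WORD_BYTES
}

/// Hypothetical drain loop: appends whole chunks of rows from `staging`
/// to `out`, checking the output size once per chunk rather than once
/// per row. Returns the number of rows drained.
fn drain_chunked(
    staging: &[u64],
    row_words: usize,
    target_words: usize,
    out: &mut Vec<u64>,
) -> usize {
    assert!(row_words > 0, "row_words must be non-zero");
    let chunk_words = DRAIN_CHUNK_ROWS * row_words;
    let mut rows = 0;
    for chunk in staging.chunks(chunk_words) {
        // The size check runs only here, so `out` can exceed the target
        // by strictly less than one chunk (K * row_words words).
        if out.len() >= target_words {
            break;
        }
        out.extend_from_slice(chunk);
        rows += chunk.len() / row_words;
    }
    rows
}
```

Calling `worst_case_overshoot_bytes(128)` gives 16,384 bytes. Checking the size per chunk instead of per row keeps the inner passes free of branches, at the cost of this bounded overshoot.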