mz_storage/source/mysql/snapshot.rs
// Copyright Materialize, Inc. and contributors. All rights reserved.
//
// Use of this software is governed by the Business Source License
// included in the LICENSE file.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0.

//! Renders the table snapshot side of the [`MySqlSourceConnection`] dataflow.
//!
//! # Snapshot reading
//!
//! Depending on the `resume_upper` of each entry in `source_outputs`, this dataflow decides
//! which tables to snapshot and performs a simple `SELECT * FROM table` on them in order to get
//! a snapshot. There are a few subtle points about this operation, described below.
//!
//! It is crucial for correctness that we always perform the snapshot of all tables at a specific
//! point in time. This must be true even in the presence of restarts or partially committed
//! snapshots. The consistent point that the snapshot must happen at is discovered and durably
//! recorded during planning of the source and is exposed to this ingestion dataflow via the
//! `initial_gtid_set` field in `MySqlSourceDetails`.
//!
//! Unfortunately MySQL does not provide an API to perform a transaction at a specific point in
//! time. Instead, MySQL allows us to perform a snapshot of a table and tells us at which point
//! in time the snapshot was taken. Using this information we can take a snapshot at an arbitrary
//! point in time and then "rewind" it to the desired `initial_gtid_set`. These two phases are
//! described in the following sections.
//!
//! ## Producing a snapshot at a known point in time.
//!
//! Ideally we would like to start a transaction and ask MySQL to tell us the point in time this
//! transaction is running at. As far as we know there is no such API, so we achieve this using
//! table locks instead.
//!
//! The full set of tables that are meant to be snapshotted is partitioned among the workers.
//! Each worker initiates a connection to the server and acquires a table lock on all the tables
//! that have been assigned to it. By doing so we establish a moment in time where we know no
//! writes are happening to the tables we are interested in. After the locks are taken each
//! worker reads the current upper frontier (`snapshot_upper`) using the `@@gtid_executed` system
//! variable. This frontier establishes an upper bound on any possible write to the tables of
//! interest until the lock is released.
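//!
//! As a sketch, and assuming two hypothetical tables `db.t1` and `db.t2` assigned to this
//! worker, the statements the lock connection issues during this phase look like:
//!
//! ```text
//! LOCK TABLES `db`.`t1` READ, `db`.`t2` READ;
//! SELECT @@global.gtid_executed; -- this becomes `snapshot_upper`
//! ```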
//!
//! Each worker now starts a transaction via a new connection with 'REPEATABLE READ' and
//! 'CONSISTENT SNAPSHOT' semantics. Due to linearizability we know that this transaction's view
//! of the database must be at some time `t_snapshot` such that `snapshot_upper <= t_snapshot`.
//! We don't actually know the exact value of `t_snapshot` and it might be strictly greater than
//! `snapshot_upper`. However, because this transaction will only be used to read the locked
//! tables and we know that `snapshot_upper` is an upper bound on all the writes that have
//! happened to them, we can safely pretend that the transaction's `t_snapshot` is *equal* to
//! `snapshot_upper`. We have therefore succeeded in starting a transaction at a known point in
//! time!
//!
//! At this point it is safe for each worker to unlock the tables and close the initial
//! connection, since the transaction has established its point in time. Each worker can then
//! read the snapshot of the tables it is responsible for and publish it downstream.
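//!
//! Putting both connections together, the handoff is roughly the following sketch (in the code
//! below the transaction characteristics are set through the client library rather than as
//! literal SQL):
//!
//! ```text
//! -- connection 2: pin a view of the database at some t_snapshot >= snapshot_upper
//! SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
//! START TRANSACTION WITH CONSISTENT SNAPSHOT, READ ONLY;
//! -- connection 1: writes may resume; connection 2's view is already fixed
//! UNLOCK TABLES;
//! -- connection 2: read the assigned tables
//! SELECT `c1`, `c2` FROM `db`.`t1`;
//! ```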
//!
//! TODO: Other software products hold the table lock for the duration of the snapshot, and some do
//! not. We should figure out why and if we need to hold the lock longer. This may be because of a
//! difference in how REPEATABLE READ works in some MySQL-compatible systems (e.g. Aurora MySQL).
//!
//! ## Rewinding the snapshot to a specific point in time.
//!
//! Having obtained a snapshot of a table at some `snapshot_upper` we are now tasked with
//! transforming this snapshot into one at `initial_gtid_set`. In other words we have produced a
//! snapshot containing all updates that happened at `t: !(snapshot_upper <= t)` but what we
//! actually want is a snapshot containing all updates that happened at `t: !(initial_gtid <= t)`.
//!
//! If we assume that `initial_gtid_set <= snapshot_upper`, which is a fair assumption since the
//! former is obtained before the latter, then we can observe that the snapshot we produced
//! contains all updates at `t: !(initial_gtid <= t)` (i.e. the snapshot we want) and some
//! additional unwanted updates at `t: initial_gtid <= t && !(snapshot_upper <= t)`. We happen to
//! know exactly what those additional unwanted updates are because they will be obtained by
//! reading the replication stream in the replication operator, and so all we need to do to
//! "rewind" our `snapshot_upper` snapshot to `initial_gtid` is to ask the replication operator
//! to "undo" any updates that fall in the undesirable region.
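//!
//! For example, with a single source UUID abbreviated as `uuid` (illustrative values): if
//! `initial_gtid_set` is `uuid:1-100` and the locks were acquired at a `snapshot_upper` of
//! `uuid:1-110`, the snapshot already reflects transactions 101 through 110. Those are exactly
//! the first transactions the replication operator will read, so undoing their effect on the
//! snapshot yields the desired state as of `uuid:1-100`.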
//!
//! This is exactly what `RewindRequest` is about. It informs the replication operator that a
//! particular table has been snapshotted at `snapshot_upper` and asks for all the updates
//! discovered during replication that happen at `t: initial_gtid <= t && !(snapshot_upper <= t)`
//! to be cancelled. In Differential Dataflow this is as simple as flipping the sign of the diff
//! field.
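//!
//! A sketch of the cancellation for one such unwanted update of `row` at time `t`, where `min`
//! denotes the minimum timestamp:
//!
//! ```text
//! snapshot:    (row, min, +1)  -- included, since !(snapshot_upper <= t)
//! replication: (row, t,   +1)  -- re-delivered by the replication stream
//! rewind:      (row, min, -1)  -- negated diff cancels the snapshot's extra copy
//! ```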
//!
//! The snapshot reader emits updates at the minimum timestamp (by convention) so that they may
//! be negated by the replication operator, which emits the negated updates at the same minimum
//! timestamp when it encounters rows from a table that occur before the GTID frontier in the
//! `RewindRequest` for that table.

use std::collections::{BTreeMap, BTreeSet};
use std::rc::Rc;
use std::sync::Arc;

use differential_dataflow::AsCollection;
use futures::TryStreamExt;
use itertools::Itertools;
use mysql_async::prelude::Queryable;
use mysql_async::{IsolationLevel, Row as MySqlRow, TxOpts};
use mz_mysql_util::{
    ER_NO_SUCH_TABLE, MySqlError, pack_mysql_row, query_sys_var, quote_identifier,
};
use mz_ore::cast::CastFrom;
use mz_ore::future::InTask;
use mz_ore::iter::IteratorExt;
use mz_ore::metrics::MetricsFutureExt;
use mz_repr::{Diff, Row};
use mz_storage_types::errors::DataflowError;
use mz_storage_types::sources::MySqlSourceConnection;
use mz_storage_types::sources::mysql::{GtidPartition, gtid_set_frontier};
use mz_timely_util::antichain::AntichainExt;
use mz_timely_util::builder_async::{OperatorBuilder as AsyncOperatorBuilder, PressOnDropButton};
use mz_timely_util::containers::stack::AccountedStackBuilder;
use timely::dataflow::operators::core::Map;
use timely::dataflow::operators::{CapabilitySet, Concat};
use timely::dataflow::{Scope, Stream};
use timely::progress::Timestamp;
use tracing::{error, trace};

use crate::metrics::source::mysql::MySqlSnapshotMetrics;
use crate::source::RawSourceCreationConfig;
use crate::source::types::{SignaledFuture, SourceMessage, StackedCollection};
use crate::statistics::SourceStatistics;

use super::schemas::verify_schemas;
use super::{
    DefiniteError, MySqlTableName, ReplicationError, RewindRequest, SourceOutputInfo,
    TransientError, return_definite_error, validate_mysql_repl_settings,
};

/// Renders the snapshot dataflow. See the module documentation for more information.
pub(crate) fn render<G: Scope<Timestamp = GtidPartition>>(
    scope: G,
    config: RawSourceCreationConfig,
    connection: MySqlSourceConnection,
    source_outputs: Vec<SourceOutputInfo>,
    metrics: MySqlSnapshotMetrics,
) -> (
    StackedCollection<G, (usize, Result<SourceMessage, DataflowError>)>,
    Stream<G, RewindRequest>,
    Stream<G, ReplicationError>,
    PressOnDropButton,
) {
    let mut builder =
        AsyncOperatorBuilder::new(format!("MySqlSnapshotReader({})", config.id), scope.clone());

    let (raw_handle, raw_data) = builder.new_output::<AccountedStackBuilder<_>>();
    let (rewinds_handle, rewinds) = builder.new_output();
    // Captures DefiniteErrors that affect the entire source, including all outputs
    let (definite_error_handle, definite_errors) = builder.new_output();

    // A global view of all outputs that will be snapshotted by all workers.
    let mut all_outputs = vec![];
    // A map containing only the table infos that this worker should snapshot.
    let mut reader_snapshot_table_info = BTreeMap::new();
    // Maps each MySQL table name to its exports' `SourceStatistics`. The same info exists in
    // reader_snapshot_table_info, but this avoids having to iterate + map each time the
    // statistics are needed.
    let mut export_statistics = BTreeMap::new();
    for output in source_outputs.into_iter() {
        // Determine which outputs need to be snapshotted and which already have been.
        if *output.resume_upper != [GtidPartition::minimum()] {
            // Already has been snapshotted.
            continue;
        }
        all_outputs.push(output.output_index);
        if config.responsible_for(&output.table_name) {
            let export_stats = config
                .statistics
                .get(&output.export_id)
                .expect("statistics have been initialized")
                .clone();
            export_statistics
                .entry(output.table_name.clone())
                .or_insert_with(Vec::new)
                .push(export_stats);

            reader_snapshot_table_info
                .entry(output.table_name.clone())
                .or_insert_with(Vec::new)
                .push(output);
        }
    }

    let (button, transient_errors): (_, Stream<G, Rc<TransientError>>) =
        builder.build_fallible(move |caps| {
            let busy_signal = Arc::clone(&config.busy_signal);
            Box::pin(SignaledFuture::new(busy_signal, async move {
                let [data_cap_set, rewind_cap_set, definite_error_cap_set]: &mut [_; 3] =
                    caps.try_into().unwrap();

                let id = config.id;
                let worker_id = config.worker_id;

                if !all_outputs.is_empty() {
                    // A worker *must* emit a count even if it is not responsible for
                    // snapshotting a table, as the statistics summarization will return null if
                    // any worker hasn't set a value. This also resets the snapshot stats for any
                    // exports that are not snapshotting.
                    for statistics in config.statistics.values() {
                        statistics.set_snapshot_records_known(0);
                        statistics.set_snapshot_records_staged(0);
                    }
                }

                // If this worker has no tables to snapshot then there is nothing to do.
                if reader_snapshot_table_info.is_empty() {
                    trace!(%id, "timely-{worker_id} initializing table reader \
                        with no tables to snapshot, exiting");
                    return Ok(());
                } else {
                    trace!(%id, "timely-{worker_id} initializing table reader \
                        with {} tables to snapshot",
                        reader_snapshot_table_info.len());
                }

                let connection_config = connection
                    .connection
                    .config(
                        &config.config.connection_context.secrets_reader,
                        &config.config,
                        InTask::Yes,
                    )
                    .await?;
                let task_name = format!("timely-{worker_id} MySQL snapshotter");

                let lock_clauses = reader_snapshot_table_info
                    .keys()
                    .map(|t| format!("{} READ", t))
                    .collect::<Vec<String>>()
                    .join(", ");
                let mut lock_conn = connection_config
                    .connect(
                        &task_name,
                        &config.config.connection_context.ssh_tunnel_manager,
                    )
                    .await?;
                if let Some(timeout) = config
                    .config
                    .parameters
                    .mysql_source_timeouts
                    .snapshot_lock_wait_timeout
                {
                    lock_conn
                        .query_drop(format!(
                            "SET @@session.lock_wait_timeout = {}",
                            timeout.as_secs()
                        ))
                        .await?;
                }

                trace!(%id, "timely-{worker_id} acquiring table locks: {lock_clauses}");
                match lock_conn
                    .query_drop(format!("LOCK TABLES {lock_clauses}"))
                    .await
                {
                    // Handle the case where a table we are snapshotting has been dropped or renamed.
                    Err(mysql_async::Error::Server(mysql_async::ServerError {
                        code,
                        message,
                        ..
                    })) if code == ER_NO_SUCH_TABLE => {
                        trace!(%id, "timely-{worker_id} received unknown table error from \
                            lock query");
                        let err = DefiniteError::TableDropped(message);
                        return Ok(return_definite_error(
                            err,
                            &all_outputs,
                            &raw_handle,
                            data_cap_set,
                            &definite_error_handle,
                            definite_error_cap_set,
                        )
                        .await);
                    }
                    e => e?,
                };

                // Record the frontier of future GTIDs based on the executed GTID set at the
                // start of the snapshot.
                let snapshot_gtid_set =
                    query_sys_var(&mut lock_conn, "global.gtid_executed").await?;
                let snapshot_gtid_frontier = match gtid_set_frontier(&snapshot_gtid_set) {
                    Ok(frontier) => frontier,
                    Err(err) => {
                        let err = DefiniteError::UnsupportedGtidState(err.to_string());
                        // If we received a GTID set with non-consecutive intervals this breaks
                        // all our assumptions, so there is nothing else we can do.
                        return Ok(return_definite_error(
                            err,
                            &all_outputs,
                            &raw_handle,
                            data_cap_set,
                            &definite_error_handle,
                            definite_error_cap_set,
                        )
                        .await);
                    }
                };

                // TODO(roshan): Insert metric for how long it took to acquire the locks
                trace!(%id, "timely-{worker_id} acquired table locks at: {}",
                    snapshot_gtid_frontier.pretty());

                let mut conn = connection_config
                    .connect(
                        &task_name,
                        &config.config.connection_context.ssh_tunnel_manager,
                    )
                    .await?;

                // Verify the MySQL system settings are correct for consistent row-based
                // replication using GTIDs.
                match validate_mysql_repl_settings(&mut conn).await {
                    Err(err @ MySqlError::InvalidSystemSetting { .. }) => {
                        return Ok(return_definite_error(
                            DefiniteError::ServerConfigurationError(err.to_string()),
                            &all_outputs,
                            &raw_handle,
                            data_cap_set,
                            &definite_error_handle,
                            definite_error_cap_set,
                        )
                        .await);
                    }
                    Err(err) => Err(err)?,
                    Ok(()) => (),
                };

                trace!(%id, "timely-{worker_id} starting transaction with \
                    consistent snapshot at: {}", snapshot_gtid_frontier.pretty());

                // Start a transaction with REPEATABLE READ and 'CONSISTENT SNAPSHOT' semantics
                // so we can read a consistent snapshot of the table at the specific GTID we read.
                let mut tx_opts = TxOpts::default();
                tx_opts
                    .with_isolation_level(IsolationLevel::RepeatableRead)
                    .with_consistent_snapshot(true)
                    .with_readonly(true);
                let mut tx = conn.start_transaction(tx_opts).await?;
                // Set the session time zone to UTC so that we can read TIMESTAMP columns as UTC.
                // From https://dev.mysql.com/doc/refman/8.0/en/datetime.html: "MySQL converts
                // TIMESTAMP values from the current time zone to UTC for storage, and back from
                // UTC to the current time zone for retrieval. (This does not occur for other
                // types such as DATETIME.)"
                tx.query_drop("set @@session.time_zone = '+00:00'").await?;

                // Configure the query execution time based on the param. We want to be able to
                // override the server value here in case it's set too low, relative to the size
                // of the data we need to copy.
                if let Some(timeout) = config
                    .config
                    .parameters
                    .mysql_source_timeouts
                    .snapshot_max_execution_time
                {
                    tx.query_drop(format!(
                        "SET @@session.max_execution_time = {}",
                        timeout.as_millis()
                    ))
                    .await?;
                }

                // We have started our transaction so we can unlock the tables.
                lock_conn.query_drop("UNLOCK TABLES").await?;
                lock_conn.disconnect().await?;

                trace!(%id, "timely-{worker_id} started transaction");

                // Verify the schemas of the tables we are snapshotting.
                let errored_outputs =
                    verify_schemas(&mut tx, reader_snapshot_table_info.iter().collect()).await?;
                let mut removed_outputs = BTreeSet::new();
                for (output, err) in errored_outputs {
                    // Publish the error for this table and stop ingesting it.
                    raw_handle
                        .give_fueled(
                            &data_cap_set[0],
                            (
                                (output.output_index, Err(err.clone().into())),
                                GtidPartition::minimum(),
                                Diff::ONE,
                            ),
                        )
                        .await;
                    trace!(%id, "timely-{worker_id} stopping snapshot of output {output:?} \
                        due to schema mismatch");
                    removed_outputs.insert(output.output_index);
                }
                for (_, outputs) in reader_snapshot_table_info.iter_mut() {
                    outputs.retain(|output| !removed_outputs.contains(&output.output_index));
                }
                reader_snapshot_table_info.retain(|_, outputs| !outputs.is_empty());

                let snapshot_total = fetch_snapshot_size(
                    &mut tx,
                    reader_snapshot_table_info
                        .iter()
                        .map(|(name, outputs)| {
                            (
                                name.clone(),
                                outputs.len(),
                                export_statistics.get(name).unwrap(),
                            )
                        })
                        .collect(),
                    metrics,
                )
                .await?;

                // This worker has nothing else to do.
                if reader_snapshot_table_info.is_empty() {
                    return Ok(());
                }

                // Read the snapshot data from the tables.
                let mut final_row = Row::default();

                let mut snapshot_staged_total = 0;
                for (table, outputs) in &reader_snapshot_table_info {
                    let mut snapshot_staged = 0;
                    let query = build_snapshot_query(outputs);
                    trace!(%id, "timely-{worker_id} reading snapshot query='{}'", query);
                    let mut results = tx.exec_stream(query, ()).await?;
                    while let Some(row) = results.try_next().await? {
                        let row: MySqlRow = row;
                        snapshot_staged += 1;
                        for (output, row_val) in outputs.iter().repeat_clone(row) {
                            let event = match pack_mysql_row(&mut final_row, row_val, &output.desc)
                            {
                                Ok(row) => Ok(SourceMessage {
                                    key: Row::default(),
                                    value: row,
                                    metadata: Row::default(),
                                }),
                                // Produce a DefiniteError in the stream for any rows that fail to decode.
                                Err(err @ MySqlError::ValueDecodeError { .. }) => {
                                    Err(DataflowError::from(DefiniteError::ValueDecodeError(
                                        err.to_string(),
                                    )))
                                }
                                Err(err) => Err(err)?,
                            };
                            raw_handle
                                .give_fueled(
                                    &data_cap_set[0],
                                    (
                                        (output.output_index, event),
                                        GtidPartition::minimum(),
                                        Diff::ONE,
                                    ),
                                )
                                .await;
                        }
                        // This overcounting maintains existing behavior but will be removed once
                        // readers no longer rely on the value.
                        snapshot_staged_total += u64::cast_from(outputs.len());
                        if snapshot_staged_total % 1000 == 0 {
                            for statistics in export_statistics.get(table).unwrap() {
                                statistics.set_snapshot_records_staged(snapshot_staged);
                            }
                        }
                    }
                    for statistics in export_statistics.get(table).unwrap() {
                        statistics.set_snapshot_records_staged(snapshot_staged);
                    }
                    trace!(%id, "timely-{worker_id} snapshotted {} records from \
                        table '{table}'", snapshot_staged * u64::cast_from(outputs.len()));
                }

                // We are done with the snapshot so now we will emit rewind requests. It is
                // important that this happens after the snapshot has finished because this is
                // what unblocks the replication operator and we want this to happen serially. It
                // might seem like a good idea to read the replication stream concurrently with
                // the snapshot but it actually leads to a lot of data being staged for the
                // future, which needlessly consumes memory in the cluster.
                for (table, outputs) in reader_snapshot_table_info {
                    for output in outputs {
                        trace!(%id, "timely-{worker_id} producing rewind request for {table} \
                            output {}", output.output_index);
                        let req = RewindRequest {
                            output_index: output.output_index,
                            snapshot_upper: snapshot_gtid_frontier.clone(),
                        };
                        rewinds_handle.give(&rewind_cap_set[0], req);
                    }
                }
                *rewind_cap_set = CapabilitySet::new();

                // TODO (maz): Should we remove this to match Postgres?
                if snapshot_staged_total < snapshot_total {
                    error!(%id, "timely-{worker_id} snapshot size {snapshot_total} is somehow \
                        bigger than records staged {snapshot_staged_total}");
                }

                Ok(())
            }))
        });

    // TODO: Split row decoding into a separate operator that can be distributed across all workers

    let errors = definite_errors.concat(&transient_errors.map(ReplicationError::from));

    (
        raw_data.as_collection(),
        rewinds,
        errors,
        button.press_on_drop(),
    )
}

/// Fetches the size of the snapshot on this worker and emits the appropriate metrics and
/// statistics for each table.
async fn fetch_snapshot_size<Q>(
    conn: &mut Q,
    tables: Vec<(MySqlTableName, usize, &Vec<SourceStatistics>)>,
    metrics: MySqlSnapshotMetrics,
) -> Result<u64, anyhow::Error>
where
    Q: Queryable,
{
    let mut total = 0;
    for (table, num_outputs, export_statistics) in tables {
        let stats = collect_table_statistics(conn, &table).await?;
        metrics.record_table_count_latency(table.1, table.0, stats.count_latency);
        for export_stat in export_statistics {
            export_stat.set_snapshot_records_known(stats.count);
            export_stat.set_snapshot_records_staged(0);
        }
        total += stats.count * u64::cast_from(num_outputs);
    }
    Ok(total)
}

/// Builds the SQL query to be used for creating the snapshot using the first entry in `outputs`.
///
/// Expects `outputs` to contain entries for a single table, and to have at least 1 entry.
/// Expects each `MySqlTableDesc` entry to contain all columns described in
/// `information_schema.columns`.
#[must_use]
fn build_snapshot_query(outputs: &[SourceOutputInfo]) -> String {
    let info = outputs.first().expect("MySQL table info");
    for output in &outputs[1..] {
        // The columns are decoded solely based on position, so we just need to ensure that
        // all columns are accounted for.
        assert!(
            info.desc.columns.len() == output.desc.columns.len(),
            "Mismatch in table descriptions for {}",
            info.table_name
        );
    }
    let columns = info
        .desc
        .columns
        .iter()
        .map(|col| quote_identifier(&col.name))
        .join(", ");
    format!("SELECT {} FROM {}", columns, info.table_name)
}

#[derive(Default)]
struct TableStatistics {
    count_latency: f64,
    count: u64,
}

async fn collect_table_statistics<Q>(
    conn: &mut Q,
    table: &MySqlTableName,
) -> Result<TableStatistics, anyhow::Error>
where
    Q: Queryable,
{
    let mut stats = TableStatistics::default();

    let count_row: Option<u64> = conn
        .query_first(format!("SELECT COUNT(*) FROM {}", table))
        .wall_time()
        .set_at(&mut stats.count_latency)
        .await?;
    stats.count = count_row.ok_or_else(|| anyhow::anyhow!("failed to COUNT(*) {table}"))?;

    Ok(stats)
}

#[cfg(test)]
mod tests {
    use super::*;
    use mz_mysql_util::{MySqlColumnDesc, MySqlTableDesc};
    use timely::progress::Antichain;

    #[mz_ore::test]
    fn snapshot_query_duplicate_table() {
        let schema_name = "myschema".to_string();
        let table_name = "mytable".to_string();
        let table = MySqlTableName(schema_name.clone(), table_name.clone());
        let columns = ["c1", "c2", "c3"]
            .iter()
            .map(|col| MySqlColumnDesc {
                name: col.to_string(),
                column_type: None,
                meta: None,
            })
            .collect::<Vec<_>>();
        let desc = MySqlTableDesc {
            schema_name: schema_name.clone(),
            name: table_name.clone(),
            columns,
            keys: BTreeSet::default(),
        };
        let info = SourceOutputInfo {
            output_index: 1, // ignored
            table_name: table.clone(),
            desc,
            text_columns: vec![],
            exclude_columns: vec![],
            initial_gtid_set: Antichain::default(),
            resume_upper: Antichain::default(),
            export_id: mz_repr::GlobalId::User(1),
        };
        let query = build_snapshot_query(&[info.clone(), info]);
        assert_eq!(
            format!(
                "SELECT `c1`, `c2`, `c3` FROM `{}`.`{}`",
                &schema_name, &table_name
            ),
            query
        );
    }
}