
An abstraction presenting as a durable time-varying collection (aka shard)

Persist

Persist is an implementation detail of STORAGE. Its “public” API is used only by STORAGE, and code outside of STORAGE should be talking to STORAGE, not persist. However, having this strong API boundary between the two allows the two teams to execute independently.

Persist’s primary abstraction is a “shard”, which is a durable and definite Time-Varying Collection (TVC). Persist requires that the collection’s “data” is key-value structured (a () unit value is fine) but otherwise allows the key, value, time, and diff to be abstract in terms of the Codec and Codec64 encode/decode traits. As a result, persist is independent of Materialize’s internal data formats: Row etc.
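
As a loose illustration of that shape (this is not the persist API, just the conceptual form of the data), a shard’s contents can be thought of as differential update tuples of ((key, value), time, diff):

// Conceptual shape of the data in a shard: differential update tuples.
// K/V are the key and value (V can be the () unit type when only keys
// matter), T is the timestamp, and D is the diff. Illustration only; this
// is not the crate's actual API.
type Update<K, V, T, D> = ((K, V), T, D);

fn main() {
    let updates: Vec<Update<String, (), u64, i64>> = vec![
        (("alice".to_string(), ()), 1, 1),  // "alice" inserted at time 1
        (("bob".to_string(), ()), 2, 1),    // "bob" inserted at time 2
        (("alice".to_string(), ()), 5, -1), // "alice" retracted at time 5
    ];
    for ((k, v), t, d) in &updates {
        println!("({k:?}, {v:?}) at {t} with diff {d}");
    }
}

In the real API, the key and value types go through Codec and the time and diff types through Codec64, which is what keeps persist independent of Row.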

Along with talking to the outside world, STORAGE pilots persist shards around to present a “STORAGE collection”, which is a durable TVC that handles reshardings and version upgrades. A persist shard is exactly a storage shard and they can be used interchangeably.

More details are available in the persist design doc.

FAQ: What is persist’s throughput?

In general, with proper usage and hardware (and once we finish tuning), persist should be able to saturate 75% or more of the available network bandwidth on both writes and reads.

Experimentally, the s3_pg configuration has sustained 64 MiB/s of goodput for 10+ hours in an open loop benchmark. This is nowhere near our max, but should easily be sufficient for M1. TODO: Add numbers under contention.

cargo run -p mz-persist-client --bin persist_open_loop_benchmark -- --blob_uri=... --consensus_uri=...
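
For a rough sense of what saturating 75% of a network link means, here is a back-of-envelope with an assumed NIC speed (an illustration, not a measurement):

// Back-of-envelope: what 75% of an assumed 10 Gbit/s NIC works out to in
// MiB/s, for comparison against the 64 MiB/s sustained above. The NIC speed
// is an assumption for illustration only.
fn main() {
    let nic_bits_per_s = 10_000_000_000.0_f64; // assumed 10 Gbit/s link
    let target_mib_per_s = nic_bits_per_s * 0.75 / 8.0 / (1024.0 * 1024.0);
    println!("75% of the link is ~{target_mib_per_s:.0} MiB/s"); // ~894 MiB/s
}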

FAQ: What is persist’s latency?

Materialize is not an OLTP database, so our initial tunings are for throughput over latency. There are some tricks we can play in the future to get these latencies down, but here’s a quick idea of where we’re starting.

The vertical axis:

  • mem_mem uses in-memory implementations of “external durability”. These exist for testing, but they’re nice here because they show the overhead of persist itself.
  • file_pg uses files for blob and Postgres for consensus. This is what you might expect in local development or in CI. (These numbers, like the rest, are from a persist-benchmarking ./bin/scratch box. Maybe this one should be from a laptop?)
  • s3_pg uses S3 for blob and AWS Aurora PostgreSQL for consensus. This is what you might expect in production. (Illustrative blob/consensus URIs for these configurations are sketched just after this list.)
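
Each configuration above corresponds to a pair of blob and consensus locations. As a sketch (the URI schemes and every host, path, and bucket name here are assumptions for illustration, not something to copy verbatim):

// Illustrative (blob_uri, consensus_uri) pairs for the three benchmark
// configurations. The URI schemes and all hosts/paths/bucket names are
// assumptions; check the crate for what it actually accepts.
fn main() {
    let configs = [
        ("mem_mem", "mem://", "mem://"),
        ("file_pg", "file:///tmp/persist/blob", "postgres://localhost:5432/consensus"),
        ("s3_pg", "s3://my-bucket/persist-blob", "postgres://my-aurora-host:5432/consensus"),
    ];
    for (name, blob_uri, consensus_uri) in configs {
        println!("{name}: blob={blob_uri} consensus={consensus_uri}");
    }
}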

The horizontal axis:

  • write is an un-contended small write (append/compare_and_append).
  • wtl (write_to_listen) is the total latency between the beginning of a small write and it being emitted by a listener. Think of this as persist’s contribution to the latency between INSERT-ing a row into Materialize and getting it back out with SELECT.
  • The (est) variant is whatever Criterion selects as its “best estimate”, and (p95) is the higher end of Criterion’s confidence interval (not actually a p95, but sorta like one).
  • TODO: Real p50/p95/p99/max.
           write (est)  write (p95)  wtl (est)    wtl (p95)
mem_mem    <1ms         <1ms         5.1ms        5.2ms
file_pg    5.5ms        5.6ms        6.4ms        6.5ms
s3_pg      45ms         47ms         79ms         82ms

These numbers are from our micro-benchmarks.

cargo bench -p mz-persist-client --bench=benches

Larger writes are expected to take the above latency floor plus however long it takes to write the extra data to, e.g., S3. TODO: Get real numbers for larger batches.
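
As a sketch of that expectation (the batch size and the S3 throughput below are made-up numbers for illustration, not measurements):

// Back-of-envelope for a larger write: the small-write latency floor (the
// s3_pg "write (est)" row above) plus the time to ship the batch to S3.
// The batch size and assumed upload throughput are illustrative only.
fn main() {
    let floor_ms = 45.0; // s3_pg small-write latency from the table above
    let batch_mib = 64.0; // hypothetical batch size
    let s3_mib_per_s = 100.0; // assumed effective upload throughput
    let total_ms = floor_ms + batch_mib / s3_mib_per_s * 1000.0;
    println!("expect roughly {total_ms:.0}ms for a {batch_mib} MiB write");
}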

Modules

  • error: Errors for the crate
  • read: Read capabilities and handles
  • write: Write capabilities and handles

Structs

  • PersistClient: A handle for interacting with the set of persist shards made durable at a single PersistLocation.
  • PersistLocation: A location in S3, other cloud storage, or otherwise “durable storage” used by persist.
  • ShardId: An opaque identifier for a persist durable TVC (aka shard).
  • Since: Wrapper for Antichain that represents a since frontier.
  • Upper: Wrapper for Antichain that represents an upper frontier.