An abstraction presenting as a durable time-varying collection (aka shard)
Persist is an implementation detail of STORAGE. Its “public” API is used only by STORAGE and code outside of STORAGE should be talking to STORAGE, not persist. However, having this strong API boundary between the two allows for the two teams to execute independently.
Persist’s primary abstraction is a “shard”, which is a durable and definite
Time-Varying Collection (TVC). Persist requires that the collection’s “data”
is key-value structured (a
() unit value is fine) but otherwise allows the
key, value, time, and diff to be abstract in terms of the Codec and Codec64
encode/decode traits. As a result, persist is independent of Materialize’s
internal data formats:
Along with talking to the outside world, STORAGE pilots persist shards around to present a “STORAGE collection”, which is a durable TVC that handles reshardings and version upgrades. A persist shard is exactly a storage shard and they can be used interchangeably.
More details available in the persist design doc.
In general, with proper usage and hardware (and once we finish tuning), persist should be able to saturate 75% or more of the available network bandwidth on both writes and reads.
s3_pg configuration has sustained 64 MiB/s of goodput for
10+ hours in an open loop benchmark. This is nowhere near our max, but should
easily be sufficient for M1. TODO: Add numbers under contention.
cargo run -p mz-persist-client --bin persist_open_loop_benchmark --blob_uri=... --consensus_uri=...
Materialize is not an OLTP database, so our initial tunings are for throughput over latency. There are some tricks we can play in the future to get these latencies down, but here’s a quick idea of where we’re starting.
The vertical axis:
mem_memuses im-memory implementations of “external durability”. These exist for testing but here they’re nice because they show the overhead of persist itself.
file_pguses files for blob and Postgres for consensus. This is what you might expect in local development or in CI. (These numbers, like the rest, are from a persist-benchmarking ./bin/scratch box. Maybe this one should be from a laptop?)
s3_pguses s3 for blob and AWS Postgres Aurora for consensus. This is what you might expect in production.
The horizontal axis:
writeis an un-contended small write (append/compare_and_append).
wtl(write_to_listen) is the total latency between the beginning of a small write and it being emitted by a listener. Think of this as persist’s contribution to the latency between
INSERT-ing a row into Materialize and getting it back out with
(est)variant is whatever Criterion uses to select its “best estimate” and
(p95)is the higher end of Criterion’s confidence interval (not actually a p95 but sorta like one).
- TODO: Real p50/p95/p99/max.
|write (est)||write (p95)||wtl (est)||wtl (p95)|
These numbers are from our micro-benchmarks.
cargo bench -p mz-persist-client --bench=benches
Larger writes are expected to take the above latency floor plus however long it takes to write the extra data to e.g s3. TODO: Get real numbers for larger batches.
Errors for the crate
An implementation of the public crate interface.
Read capabilities and handles
Write capabilities and handles
A handle for interacting with the set of persist shard made durable at a single PersistLocation.
A location in s3, other cloud storage, or otherwise “durable storage” used by persist.
An opaque identifier for a persist durable TVC (aka shard).
Wrapper for Antichain that represents a Since
Wrapper for Antichain that represents an Upper