Struct mz_persist::indexed::arrangement::Arrangement
pub struct Arrangement { /* private fields */ }
A persistent, compacting data structure containing indexed (Key, Value, Time, i64)
entries.
The data is logically and physically separated into two “buckets”:
unsealed and trace. It first enters and is initially placed into
unsealed, which is a holding pen roughly corresponding to the in-memory
buffer of a differential dataflow arrangement operator. At some point, the
arranged collection is sealed, which advances the upper timestamp of the
collection and logically (but not physically) moves the data into trace.
The trace bucket indexes the data by (key, value, time). At some later point, unsealed_step is called, which physically moves the data from unsealed to trace.
There are two notable differences between a persisted arrangement and a differential in-mem one (besides the obvious durability):
- Because in-mem operations are so much faster than ones on durable storage, the act of advancing the frontier and moving data into trace, one step in differential, is split into separate steps in persist.
- The differential arrangement keeps the data arranged for efficient indexed access (hence the name). Persist also keeps the data arranged the same way, but finishing up the plumbing for indexed access is still a TODO.
Further details below.
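To make the lifecycle concrete, here is a deliberately simplified, self-contained sketch of the two-bucket flow using only std types. ToyArrangement and its methods are illustrative analogues of the write/seal/drain steps described above, not this crate's API:

```rust
/// A deliberately simplified, in-memory illustration of the unsealed/trace
/// split; not the crate's actual implementation.
struct ToyArrangement {
    unsealed: Vec<(String, String, u64, i64)>, // (key, val, time, diff)
    trace: Vec<(String, String, u64, i64)>,
    seal: u64,        // logical frontier: times < seal are closed
    trace_upper: u64, // physical frontier: times < trace_upper are in trace
}

impl ToyArrangement {
    fn new() -> Self {
        ToyArrangement { unsealed: Vec::new(), trace: Vec::new(), seal: 0, trace_upper: 0 }
    }

    /// Incoming updates always land in the unsealed bucket.
    fn write(&mut self, key: &str, val: &str, time: u64, diff: i64) {
        assert!(time >= self.seal, "updates before the seal frontier are rejected");
        self.unsealed.push((key.to_string(), val.to_string(), time, diff));
    }

    /// Sealing only advances the logical frontier; no data moves yet.
    fn seal_to(&mut self, ts: u64) {
        assert!(ts >= self.seal);
        self.seal = ts;
    }

    /// Analogue of `unsealed_step`: physically move closed updates into trace.
    fn step(&mut self) {
        let seal = self.seal;
        let (ready, rest): (Vec<_>, Vec<_>) =
            self.unsealed.drain(..).partition(|(_, _, t, _)| *t < seal);
        self.trace.extend(ready);
        self.unsealed = rest;
        self.trace_upper = seal;
    }
}

fn main() {
    let mut a = ToyArrangement::new();
    a.write("k", "v", 3, 1);
    a.seal_to(5); // logically closes times < 5
    a.step();     // physically moves the update into trace
    assert_eq!(a.trace.len(), 1);
    assert!(a.unsealed.is_empty());
}
```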
Unsealed
Unsealed exists to hold data that has been added to the persistent collection but not yet “seal”ed into a trace. We store incoming data as immutable batches of updates, corresponding to non-empty, sorted intervals of crate::location::SeqNos.
As times get sealed and the corresponding updates get moved into the trace, Unsealed can remove those updates, and eventually, entire batches. The approach to removing sealed data optimizes for the common case, for which we assume that:
- data arrives roughly in order,
- unsealed batches contain data for a small range of distinct times.
Every unsealed batch tracks the minimum and maximum update timestamp contained within its list of updates, and we eagerly drop batches that only contain data prior to the sealed frontier whenever possible. In the expected case, this should be sufficient to ensure that Unsealed maintains a bounded storage footprint. If either of the two assumptions is violated, either because updates arrive out of order or because batches contain data at many distinct timestamps, we periodically try to remove the updates strictly behind the current sealed frontier from a given unsealed batch and replace it with a “trimmed” batch that uses less storage.
This approach intentionally does nothing to physically coalesce multiple unsealed batches into a single unsealed batch. Doing so has many potential downsides; for example, physically merging a batch containing updates 5 seconds ahead of the current sealed frontier with another batch containing updates 5 hours ahead of the current sealed frontier would only hurt 5 seconds later, when the previously unmerged batch would have been dropped. Instead, the merged batch has to be trimmed, which requires an extra read and write. If we end up having significant amounts of data far ahead of the current sealed frontier, we will likely need a different structure that can hold batches of updates organized by overlapping ranges of times and physically merge unsealed batches using an approach similar to trace physical compaction.
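For illustration, the eviction/trimming decision for a single unsealed batch reduces to comparing the batch's tracked timestamp bounds against the sealed frontier. The following sketch uses hypothetical names, not this crate's types:

```rust
/// Illustrative metadata for an unsealed batch; mirrors the idea that each
/// batch tracks the min/max update timestamp it contains (hypothetical names,
/// not the crate's actual types).
struct ToyUnsealedBatchMeta {
    ts_min: u64, // minimum update time in the batch (inclusive)
    ts_max: u64, // maximum update time in the batch (inclusive)
}

/// A batch can be dropped outright once every update in it is strictly
/// behind the sealed frontier.
fn can_evict(b: &ToyUnsealedBatchMeta, seal: u64) -> bool {
    b.ts_max < seal
}

/// A batch that straddles the sealed frontier cannot be dropped, but its
/// already-sealed prefix can be removed by rewriting it as a "trimmed" batch.
fn should_trim(b: &ToyUnsealedBatchMeta, seal: u64) -> bool {
    b.ts_min < seal && b.ts_max >= seal
}

fn main() {
    let seal = 10;
    assert!(can_evict(&ToyUnsealedBatchMeta { ts_min: 2, ts_max: 9 }, seal));
    assert!(should_trim(&ToyUnsealedBatchMeta { ts_min: 5, ts_max: 20 }, seal));
    assert!(!can_evict(&ToyUnsealedBatchMeta { ts_min: 12, ts_max: 20 }, seal));
}
```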
Trace
An append-only list of immutable batches that describe updates corresponding to sorted, contiguous, non-overlapping ranges of times. The since frontier defines a time before which we can compact the history of updates (and before which, correspondingly, we can no longer answer queries).
We can compact the updates prior to the since frontier physically, by combining batches representing consecutive intervals into one large batch representing the union of those intervals, and logically, by forwarding updates at times before the since frontier to the since frontier.
We also want to achieve a balance between the compactness of the representation and the computational effort required to maintain the representation. Specifically, if we have N batches of data already compacted, we don’t want every additional batch to perform O(N) work (i.e. merge with N batches worth of data) in order to get compacted. Instead, we would like to keep a geometrically decreasing (when viewed from oldest to most recent) sequence of batches and perform O(N) work every N calls to append. Thankfully, we can achieve all of this with a few simple rules:
- Batches are able to be compacted once the since frontier is in advance of all of the data in the batch.
- All batches are assigned a nonzero compaction level. When a new batch is appended to the trace it is assigned a compaction level of 0.
- We periodically merge consecutive batches at the same level L representing time intervals [lo, mid) and [mid, hi) into a single batch representing all of the updates in [lo, hi), with level L + 1 iff the new batch contains more data than both of its parents, and L otherwise (sketched below). Once two batches are merged they are removed from the trace and replaced with the merged batch.
- Perform merges for the oldest batches possible first.
NB: this approach assumes that all batches are roughly uniformly sized when they are first appended.
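For illustration, the level-assignment rule from the list above can be sketched as follows; the Toy* names and the merged_len parameter are hypothetical, not this crate's types:

```rust
/// Illustrative trace-batch metadata (hypothetical names, not the crate's types).
#[derive(Debug)]
struct ToyTraceBatch {
    lo: u64,    // inclusive lower bound of the time interval
    hi: u64,    // exclusive upper bound of the time interval
    level: u32, // compaction level
    len: usize, // number of updates after consolidation
}

/// Merge two consecutive batches at the same level covering [lo, mid) and
/// [mid, hi). The merged batch gets level L + 1 iff it contains more data
/// than both of its parents, and stays at L otherwise (e.g. when
/// consolidation cancelled most of the updates).
fn merge(a: &ToyTraceBatch, b: &ToyTraceBatch, merged_len: usize) -> ToyTraceBatch {
    assert_eq!(a.level, b.level);
    assert_eq!(a.hi, b.lo);
    let level = if merged_len > a.len && merged_len > b.len { a.level + 1 } else { a.level };
    ToyTraceBatch { lo: a.lo, hi: b.hi, level, len: merged_len }
}

fn main() {
    let a = ToyTraceBatch { lo: 0, hi: 5, level: 0, len: 100 };
    let b = ToyTraceBatch { lo: 5, hi: 10, level: 0, len: 100 };
    // No cancellation: the merged batch is bigger than both parents, so it levels up.
    assert_eq!(merge(&a, &b, 200).level, 1);
    // Heavy cancellation: the merged batch stays at the parents' level.
    assert_eq!(merge(&a, &b, 50).level, 0);
}
```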
Invariants
- New updates less than the seal frontier are never added to unsealed.
- Unsealed batches have non-overlapping SeqNo ranges.
- All trace updates are less than the seal frontier.
- Trace batches are sorted by time and represent a sorted, consecutive, non-overlapping list of time intervals.
- Individual batches are immutable, and their set of updates, the time interval they describe and their compaction level all remain constant as long as the batch remains in the trace.
- The compaction levels across the list of batches in a trace are weakly decreasing (non-increasing) when iterating from oldest to most recent time intervals.
- TODO: Space usage.
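For illustration, two of these invariants (consecutive, non-overlapping time intervals and non-increasing compaction levels) can be checked mechanically; the helper below is a hypothetical sketch over (lo, hi, level) metadata, not part of this crate:

```rust
/// Illustrative check of two trace invariants, over (lo, hi, level) metadata
/// ordered from oldest to most recent (hypothetical helper, not this crate's code).
fn validate_trace(batches: &[(u64, u64, u32)]) -> Result<(), String> {
    for window in batches.windows(2) {
        let (prev, next) = (window[0], window[1]);
        // Time intervals must be sorted, consecutive, and non-overlapping.
        if prev.1 != next.0 {
            return Err(format!(
                "gap or overlap between [{}, {}) and [{}, {})",
                prev.0, prev.1, next.0, next.1
            ));
        }
        // Compaction levels must be non-increasing from oldest to most recent.
        if next.2 > prev.2 {
            return Err(format!("compaction level increased from {} to {}", prev.2, next.2));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_trace(&[(0, 5, 2), (5, 8, 1), (8, 9, 0)]).is_ok());
    assert!(validate_trace(&[(0, 5, 1), (6, 8, 1)]).is_err()); // gap at [5, 6)
}
```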
Implementations
impl Arrangement
pub fn new(meta: ArrangementMeta) -> Self
Returns an Arrangement re-instantiated with the previously serialized state.
pub fn new_blob_key() -> String
Get a new key to write to the Blob store for this arrangement.
pub fn meta(&self) -> ArrangementMeta
Serializes the state of this Arrangement for later re-instantiation.
pub fn unsealed_seqno_upper(&self) -> SeqNo
An open upper bound on the seqnos of contained updates.
pub fn snapshot<L: BlobRead>(
    &self,
    seqno: SeqNo,
    blob: &BlobCache<L>
) -> Result<ArrangementSnapshot, Error>
Returns a consistent read of all the updates contained in this arrangement.
pub fn unsealed_append<L: Blob>(
    &mut self,
    batch: BlobUnsealedBatch,
    blob: &mut BlobCache<L>
) -> Result<(), Error>
Writes the given batch to Blob storage and logically adds the contained updates to this unsealed.
pub fn unsealed_snapshot(
    &self,
    ts_lower: Antichain<u64>,
    ts_upper: Antichain<u64>
) -> Result<UnsealedSnapshotMeta, Error>
Returns a consistent read of the updates contained in this unsealed matching the given filters (in practice, everything not in Trace).
pub fn unsealed_drain<L: Blob>(
    &mut self,
    blob: &mut BlobCache<L>
) -> Result<(), Error>
Atomically moves all writes in unsealed not in advance of the trace’s seal frontier into the trace and does any necessary resulting eviction work to remove unnecessary batches.
pub fn unsealed_next_drain_req(&self) -> Result<Option<DrainUnsealedReq>, Error>
Get the next available drain work from the unsealed, if some exists.
pub fn drain_unsealed_blocking<B: Blob>(
    blob: &BlobCache<B>,
    req: DrainUnsealedReq
) -> Result<DrainUnsealedRes, Error>
Copies unsealed data matching the specified description into a new trace batch.
pub fn unsealed_handle_drain_response(&mut self, res: DrainUnsealedRes)
Handle an externally completed unsealed drain request.
TODO: Call unsealed_evict at the end of this and return a list of unsealed batches that can now be physically deleted after the drain step is committed to durable storage. This could save us a META write.
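The drain work is deliberately externalized as a request/response pair so that the expensive blob reads and writes can happen off the critical path (e.g. on a background worker). The following is a minimal, self-contained sketch of that shape; the Toy* types are hypothetical and not this crate's DrainUnsealedReq/DrainUnsealedRes:

```rust
/// Hypothetical toy request/response types illustrating how drain work is
/// externalized; not the crate's DrainUnsealedReq/DrainUnsealedRes.
struct ToyDrainReq { lower: u64, upper: u64 }
struct ToyDrainRes { req: ToyDrainReq, new_trace_batch: Vec<(String, u64, i64)> }

struct ToyState { seal: u64, trace_upper: u64, trace: Vec<Vec<(String, u64, i64)>> }

impl ToyState {
    /// Analogue of `unsealed_next_drain_req`: describe outstanding work, if any.
    fn next_drain_req(&self) -> Option<ToyDrainReq> {
        (self.trace_upper < self.seal)
            .then(|| ToyDrainReq { lower: self.trace_upper, upper: self.seal })
    }

    /// Analogue of `unsealed_handle_drain_response`: apply externally completed work.
    fn handle_drain_response(&mut self, res: ToyDrainRes) {
        self.trace.push(res.new_trace_batch);
        self.trace_upper = res.req.upper;
    }
}

/// Analogue of `drain_unsealed_blocking`: expensive copy work that can run on
/// a separate thread because it does not touch the in-memory state.
fn drain_blocking(req: ToyDrainReq, unsealed: &[(String, u64, i64)]) -> ToyDrainRes {
    let batch = unsealed
        .iter()
        .filter(|(_, t, _)| *t >= req.lower && *t < req.upper)
        .cloned()
        .collect();
    ToyDrainRes { req, new_trace_batch: batch }
}

fn main() {
    let unsealed = vec![("k".to_string(), 3, 1)];
    let mut state = ToyState { seal: 5, trace_upper: 0, trace: Vec::new() };
    if let Some(req) = state.next_drain_req() {
        let res = drain_blocking(req, &unsealed); // could run on a worker thread
        state.handle_drain_response(res);
    }
    assert_eq!(state.trace_upper, 5);
}
```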
pub fn unsealed_evict(&mut self) -> Vec<UnsealedBatchMeta>
Remove all batches containing only data strictly before the trace’s physical ts frontier.
Returns a list of batches that can safely be deleted after the eviction is committed to durable storage.
pub fn unsealed_step<L: Blob>(
    &mut self,
    blob: &mut BlobCache<L>
) -> Result<bool, Error>
Take one step towards shrinking the representation of this unsealed.
Returns true if the trace was modified, false otherwise.
pub fn trace_ts_upper(&self) -> Antichain<u64>
The frontier of times that have been physically moved into trace.
While self.seal tracks the frontier of times that have been logically closed and are eligible to be moved into the trace, self.trace_ts_upper() tracks the frontier of times that have actually been physically moved into the trace. self.seal() is required to manage invariants between commands (e.g. a seal request has to be at a time in advance of prior seal requests), whereas self.trace_ts_upper() is required to manage physical reads and writes to the trace (e.g. to determine which times may be added that are not already present).
Invariant:
- self.trace_ts_upper() <= self.seal()
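For illustration, this invariant can be checked with timely's Antichain ordering (assuming, as elsewhere in this crate, that the frontiers are timely Antichains); the concrete values below are hypothetical stand-ins for the two frontiers:

```rust
use timely::progress::Antichain;
use timely::PartialOrder;

fn main() {
    // Hypothetical stand-ins for self.trace_ts_upper() and self.get_seal().
    let trace_ts_upper: Antichain<u64> = Antichain::from_elem(5);
    let seal: Antichain<u64> = Antichain::from_elem(8);
    // The documented invariant: data is physically moved into trace only up to the seal.
    assert!(PartialOrder::less_equal(&trace_ts_upper, &seal));
}
```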
pub fn get_seal(&self) -> Antichain<u64>
A logical upper bound on the times which may currently be added to the trace.
pub fn update_seal(&mut self, ts: u64)
Update the seal frontier to ts.
This function intentionally does not do any checking to see if ts is in advance of the current seal frontier, because we sometimes need to use this to revert a seal update in the event of a storage failure.
pub fn validate_seal(&self, ts: u64) -> Result<(), String>
Checks whether the given seal would be valid to pass to Self::update_seal.
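The split between Self::validate_seal and Self::update_seal supports a validate / durably write / apply pattern, with the unchecked update_seal also serving to roll back on a storage failure. The following is a minimal, self-contained sketch of that shape; ToySeal, seal_to, and durably_write are hypothetical and not part of this crate:

```rust
/// Hypothetical in-memory stand-in for the seal frontier.
struct ToySeal { ts: u64 }

impl ToySeal {
    /// Analogue of `validate_seal`: new seals must not regress prior ones.
    fn validate_seal(&self, ts: u64) -> Result<(), String> {
        if ts >= self.ts {
            Ok(())
        } else {
            Err(format!("seal {} regresses from {}", ts, self.ts))
        }
    }
    /// Analogue of `update_seal`: intentionally unchecked so it can also revert.
    fn update_seal(&mut self, ts: u64) { self.ts = ts; }
}

/// Hypothetical driver: validate, apply in memory, persist, and revert on failure.
fn seal_to(
    state: &mut ToySeal,
    ts: u64,
    durably_write: impl Fn(u64) -> Result<(), String>,
) -> Result<(), String> {
    state.validate_seal(ts)?;
    let prev = state.ts;
    state.update_seal(ts);
    if let Err(e) = durably_write(ts) {
        // Storage failed: revert the in-memory seal with the unchecked setter.
        state.update_seal(prev);
        return Err(e);
    }
    Ok(())
}

fn main() {
    let mut s = ToySeal { ts: 5 };
    assert!(seal_to(&mut s, 10, |_| Ok(())).is_ok());
    assert!(seal_to(&mut s, 12, |_| Err("blob write failed".to_string())).is_err());
    assert_eq!(s.ts, 10); // the failed seal was reverted
}
```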
pub fn since(&self) -> Antichain<u64>
A lower bound on the time at which updates may have been logically compacted together.
pub fn validate_allow_compaction(
    &self,
    since: &Antichain<u64>
) -> Result<(), String>
Checks whether the given since would be valid to pass to Self::allow_compaction.
pub fn allow_compaction(&mut self, since: Antichain<u64>)
Update the compaction frontier to since.
This function intentionally does not do any checking to see if since is in advance of the current compaction frontier, because we sometimes need to use this to revert a compaction update in the event of a storage failure.
pub fn trace_snapshot<B: BlobRead>(&self, blob: &BlobCache<B>) -> TraceSnapshot
Returns a consistent read of all the updates contained in this trace.
pub fn trace_next_compact_req(&self) -> Result<Option<CompactTraceReq>, Error>
Get the next available compaction work from the trace, if some exists.
pub fn trace_handle_compact_response(
    &mut self,
    res: CompactTraceRes
) -> Vec<TraceBatchMeta>
Handle an externally completed trace compaction request.
Returns a list of trace batches that can now be physically deleted after the compaction step is committed to durable storage.
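For illustration, here is a self-contained sketch of the response-handling shape, emphasizing that the replaced batches are only safe to delete from blob storage after the new trace metadata is durably committed; the Toy* types are hypothetical and not this crate's CompactTraceReq/CompactTraceRes:

```rust
/// Hypothetical toy batch metadata; not the crate's TraceBatchMeta.
#[derive(Clone, Debug, PartialEq)]
struct ToyBatch { lo: u64, hi: u64, key: String }

struct ToyTrace { batches: Vec<ToyBatch> }

impl ToyTrace {
    /// Analogue of `trace_handle_compact_response`: swap the two merged inputs
    /// for their merged output and return the replaced batches. The caller must
    /// only delete the returned blobs after the new trace meta is durably written.
    fn handle_compact_response(&mut self, a: ToyBatch, b: ToyBatch, merged: ToyBatch) -> Vec<ToyBatch> {
        self.batches.retain(|x| *x != a && *x != b);
        self.batches.push(merged);
        self.batches.sort_by_key(|x| x.lo);
        vec![a, b]
    }
}

fn main() {
    let a = ToyBatch { lo: 0, hi: 5, key: "blob-a".into() };
    let b = ToyBatch { lo: 5, hi: 10, key: "blob-b".into() };
    let merged = ToyBatch { lo: 0, hi: 10, key: "blob-ab".into() };
    let mut trace = ToyTrace { batches: vec![a.clone(), b.clone()] };
    let deletable = trace.handle_compact_response(a, b, merged);
    // Delete "blob-a"/"blob-b" from blob storage only after META is committed.
    assert_eq!(deletable.len(), 2);
}
```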
Auto Trait Implementations
impl RefUnwindSafe for Arrangement
impl Send for Arrangement
impl Sync for Arrangement
impl Unpin for Arrangement
impl UnwindSafe for Arrangement
Blanket Implementations
impl<T> BorrowMut<T> for T where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> FutureExt for T
fn with_context(self, otel_cx: Context) -> WithContext<Self>
fn with_current_context(self) -> WithContext<Self>
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoRequest<T> for T
fn into_request(self) -> Request<T>
Wrap the input message T in a tonic::Request
impl<T> WithSubscriber for T
fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self> where
    S: Into<Dispatch>,
Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.
fn with_current_subscriber(self) -> WithDispatch<Self>
Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.