Expand description
Dataflow operator for delta joins over partially ordered timestamps.
Given multiple streams of updates (data, time, diff)
that are each
defined over the same partially ordered time
, we want to form the
full cross-join of all relations (we will later apply some filters
and instead equijoin on keys).
The “correct” output is the outer join of these triples, where
- The
data
entries are just tuple’d up together, - The
time
entries are subjected to the latticejoin
operator, - The
diff
entries are multiplied.
One way to produce the correct output is to form independent dataflow fragments for each input stream, such that each intended output is then produced by exactly one of these input streams.
There are several incorrect ways one might do this, but here is one way that I hope is not incorrect:
Each input stream of updates is joined with each other input collection,
where each input update is matched against each other input update that
has a time
that is less-than the input update’s time
, UNDER A TOTAL
ORDER ON time
. The output are the (data, time, diff)
entries that
follow the rules above, except that we additionally preserve the input’s
initial time
as well, for use in subsequent joins with the other input
collections.
There are some caveats about ties, and we should treat each time
for
each input as occurring at distinct times, one after the other (so that
ties are resolved by the index of the input). There is also the matter
of logical compaction, which should not be done in a way that prevents
the correct determination of the total order comparison.
Functions§
- A binary equijoin that responds to updates on only its first input.
- An unsafe variant of
half_join
where theoutput_func
closure takes additional arguments fortime
anddiff
as input and returns an iterator over(data, time, diff)
triplets. This allows for more flexibility, but is more error-prone.