Module dogsdogsdogs::operators::half_join
source · Expand description
Dataflow operator for delta joins over partially ordered timestamps.
Given multiple streams of updates (data, time, diff) that are each
defined over the same partially ordered time, we want to form the
full cross-join of all relations (we will later apply some filters
and instead equijoin on keys).
The “correct” output is the outer join of these triples, where
- The
dataentries are just tuple’d up together, - The
timeentries are subjected to the latticejoinoperator, - The
diffentries are multiplied.
One way to produce the correct output is to form independent dataflow fragments for each input stream, such that each intended output is then produced by exactly one of these input streams.
There are several incorrect ways one might do this, but here is one way that I hope is not incorrect:
Each input stream of updates is joined with each other input collection,
where each input update is matched against each other input update that
has a time that is less-than the input update’s time, UNDER A TOTAL
ORDER ON time. The output are the (data, time, diff) entries that
follow the rules above, except that we additionally preserve the input’s
initial time as well, for use in subsequent joins with the other input
collections.
There are some caveats about ties, and we should treat each time for
each input as occurring at distinct times, one after the other (so that
ties are resolved by the index of the input). There is also the matter
of logical compaction, which should not be done in a way that prevents
the correct determination of the total order comparison.
Functions§
- A binary equijoin that responds to updates on only its first input.
- An unsafe variant of
half_joinwhere theoutput_funcclosure takes additional arguments fortimeanddiffas input and returns an iterator over(data, time, diff)triplets. This allows for more flexibility, but is more error-prone.