Reduction execution planning and dataflow construction.
We build `ReducePlan`s to manage the complexity of planning the generated dataflow for a given reduce expression. The intent here is that creating a `ReducePlan` should capture all of the decision making about what kind of dataflow we need to render and what each operator needs to do, so that actually rendering the plan can be a relatively simple application of (as much as possible) straight-line code.
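To make that split concrete, here is a minimal sketch of the idea, using hypothetical names (`TinyPlan`, `plan`, and `render` are illustrative, not this module's API): all branching happens once at planning time, and rendering just maps each plan variant to a dataflow shape.

```rust
// Hypothetical, heavily simplified illustration of the planning/rendering
// split; the real `ReducePlan` in this module carries far more detail.
enum TinyPlan {
    Accumulable,
    Hierarchical { buckets: Vec<u64> },
    Basic,
}

// All decision making happens once, at planning time.
fn plan(accumulable: bool, hierarchical: bool) -> TinyPlan {
    if accumulable {
        TinyPlan::Accumulable
    } else if hierarchical {
        TinyPlan::Hierarchical { buckets: vec![4096, 256, 16] }
    } else {
        TinyPlan::Basic
    }
}

// Rendering is straight-line: each variant maps directly to a dataflow shape.
fn render(plan: &TinyPlan) -> String {
    match plan {
        TinyPlan::Accumulable => "single accumulating operator".into(),
        TinyPlan::Hierarchical { buckets } => {
            format!("{} bucketing stages, then a final reduce", buckets.len())
        }
        TinyPlan::Basic => "defer to Differential's reduce".into(),
    }
}

fn main() {
    assert_eq!(render(&plan(false, true)), "3 bucketing stages, then a final reduce");
}
```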
Materialize needs to be able to maintain reductions incrementally (roughly, using time proportional to the number of changes in the input) and, ideally, with a memory footprint proportional to the number of reductions being computed. We have to employ several tricks to achieve that, and these tricks constitute most of the complexity involved with planning and rendering reduce expressions. There's some additional complexity involved in handling aggregations with DISTINCT correctly so that we can efficiently suppress duplicate updates.
In order to optimize the performance of our rendered dataflow, we divide all aggregations into three distinct types. Each type gets rendered separately, with its own specialized plan and dataflow. The three types are as follows:
- Accumulable: Accumulable reductions can be computed inline in a Differential update's `difference` field because they basically boil down to tracking counts of things. `sum()` is an example of an accumulable reduction: when some element `x` is removed from the set of elements being summed, we can introduce `-x` to incrementally maintain the sum (see the first sketch after this list). More formally, accumulable reductions correspond to instances of commutative Abelian groups.
- Hierarchical: Hierarchical reductions don't have a meaningful negation like accumulable reductions do, but they are still commutative and associative, which lets us compute the reduction over subsets of the input and then compute the reduction again on those results. For example, `min[2, 5, 1, 10]` is the same as `min[min[2, 5], min[1, 10]]`. When we compute hierarchical reductions this way, we can maintain the computation in sublinear time with respect to the overall input (see the second sketch after this list). `min` and `max` are two examples of hierarchical reductions. More formally, hierarchical reductions correspond to instances of semigroups, in that they are associative, but in order to benefit from being computed hierarchically, they also need to achieve some reduction in data size. A function like "concat-everything-to-a-string" wouldn't benefit from hierarchical evaluation. When the input is append-only, or monotonic, reductions that would otherwise have to be computed hierarchically can instead be computed in place, because we only need to keep the value that's better than the "best" (minimal or maximal, for min and max) value seen so far.
- Basic: Basic reductions are a bit like the Hufflepuffs of this trifecta. They are neither accumulable nor hierarchical (most likely they are associative but don't involve any data reduction), and so for these we can't do much more than just defer to Differential's reduce operator and eat a large maintenance cost.
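The accumulable case can be illustrated with a few lines of standalone Rust. This is a hypothetical sketch of the underlying arithmetic, not Materialize's implementation: each update carries a diff, a retraction is just a negated contribution, so a sum needs no state beyond the running total.

```rust
// A minimal sketch of why SUM is accumulable: the aggregate can live in the
// update's `difference` field, so retractions are just negated contributions.
// (Hypothetical standalone code, not Materialize's actual implementation.)
fn main() {
    // (value, diff) pairs: diff = +1 inserts the value, diff = -1 retracts it.
    let updates: &[(i64, i64)] = &[(2, 1), (5, 1), (10, 1), (5, -1)];

    // Each update folds into the running sum as `value * diff`; removing an
    // element x is the same as adding -x, so no other state is needed.
    let sum: i64 = updates.iter().map(|&(v, d)| v * d).sum();
    assert_eq!(sum, 12); // 2 + 5 + 10 - 5
}
```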
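Similarly, here is a standalone sketch of hierarchical evaluation for min (again hypothetical code, not the actual `BucketedPlan` rendering): reduce fixed-size buckets first, then reduce the per-bucket results, so a single input change only touches its own bucket and the final stage.

```rust
// A minimal sketch of hierarchical evaluation for MIN: reduce small buckets,
// then reduce the per-bucket results. When one input changes, only its bucket
// and the final stage need to be recomputed.
fn main() {
    let input = [2, 5, 1, 10, 7, 3];

    // Stage 1: min within each fixed-size bucket.
    let bucket_mins: Vec<i32> = input
        .chunks(2)
        .map(|bucket| *bucket.iter().min().unwrap())
        .collect();
    assert_eq!(bucket_mins, vec![2, 1, 3]);

    // Stage 2: min of the bucket results equals min of the whole input.
    let overall = *bucket_mins.iter().min().unwrap();
    assert_eq!(overall, *input.iter().min().unwrap());
}
```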
When we render these reductions we want to limit the number of arrangements we produce. When we build a dataflow for a reduce expression containing multiple types of aggregations, we have no choice but to divide the requested aggregations up by type, render each type separately, and then collate those results back into the requested output order. However, if we only need to perform aggregations of a single reduction type, we can specialize: we render the dataflow to compute those aggregations in the correct order and return the output arrangement directly, avoiding the extra collation arrangement.
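A hedged sketch of that decision follows, with illustrative types only (the module's real `ReducePlan` and `CollationPlan` carry much more information): if every requested aggregation has the same reduction type we return that type's plan directly, otherwise we fall back to a collation plan.

```rust
// Hypothetical sketch of the collation decision described above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Kind { Accumulable, Hierarchical, Basic }

enum SketchPlan {
    // All aggregations share one type: render that dataflow directly and
    // return its output arrangement, with no extra collation step.
    SingleType(Kind),
    // Mixed types: render per type, then collate the results back into the
    // requested output order, which costs one additional arrangement.
    Collation { kinds: Vec<Kind> },
}

fn choose(kinds: Vec<Kind>) -> SketchPlan {
    // Assumes at least one aggregation was requested.
    if kinds.windows(2).all(|w| w[0] == w[1]) {
        SketchPlan::SingleType(kinds[0])
    } else {
        SketchPlan::Collation { kinds }
    }
}

fn main() {
    // All aggregations accumulable: no collation arrangement is needed.
    match choose(vec![Kind::Accumulable, Kind::Accumulable]) {
        SketchPlan::SingleType(k) => assert_eq!(k, Kind::Accumulable),
        SketchPlan::Collation { .. } => unreachable!(),
    }
}
```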
Modules§
- proto_accumulable_plan - Nested message and enum types in `ProtoAccumulablePlan`.
- proto_basic_plan - Nested message and enum types in `ProtoBasicPlan`.
- proto_hierarchical_plan - Nested message and enum types in `ProtoHierarchicalPlan`.
- proto_reduce_plan - Nested message and enum types in `ProtoReducePlan`.
- proto_reduction_type - Nested message and enum types in `ProtoReductionType`.
Structs§
- AccumulablePlan - Plan for computing a set of accumulable aggregations.
- BucketedPlan - Plan for computing a set of hierarchical aggregations with non-monotonic inputs.
- CollationPlan - Plan for collating the results of computing multiple aggregation types.
- KeyValPlan - Plan for extracting keys and values in preparation for a reduction.
- MonotonicPlan - Plan for computing a set of hierarchical aggregations with a monotonic input.
- ProtoAccumulablePlan
- ProtoBasicPlan
- ProtoBucketedPlan
- ProtoCollationPlan
- ProtoHierarchicalPlan
- ProtoKeyValPlan
- ProtoMonotonicPlan
- ProtoReducePlan
- ProtoReductionType
- SingleBasicPlan - Plan for rendering a single basic aggregation, possibly fusing a `FlatMap UnnestList` with this aggregation.
Enums§
- BasicPlan - Plan for computing a set of basic aggregations.
- HierarchicalPlan - Plan for computing a set of hierarchical aggregations.
- ReducePlan - A `ReducePlan` provides a concise description of how we will execute a given reduce expression.
- ReductionType - This enum represents the three potential types of aggregations.
Functions§
- any_group_size 🔒 - `expected_group_size` is a `u64`, but instead of a uniform distribution, we want a logarithmic distribution so that we have an even distribution in the number of layers of buckets that a hierarchical plan would have.
- convert_indexes_to_skips - Transforms a vector containing indexes of needed columns into one containing the "skips" an iterator over a `Row` would need to perform to see those values (a sketch of this transform appears after this list).
- reduction_type - Determines whether a function can be accumulated in an update's "difference" field, and whether it can be subjected to recursive (hierarchical) aggregation.
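To illustrate what `convert_indexes_to_skips` computes, here is a hypothetical re-derivation of the transform from its description (the actual function may differ in signature and edge-case handling): to visit ascending column indexes with a forward-only iterator, each skip is the gap between the cursor and the next wanted column.

```rust
// Hypothetical sketch of the index-to-skip transform; assumes the indexes
// are strictly increasing, since a row iterator only moves forward.
fn indexes_to_skips(indexes: &[usize]) -> Vec<usize> {
    let mut skips = Vec::with_capacity(indexes.len());
    let mut pos = 0; // the next column the iterator would yield
    for &idx in indexes {
        // Skip the columns between the cursor and the wanted column, then
        // consume the wanted column itself.
        skips.push(idx - pos);
        pos = idx + 1;
    }
    skips
}

fn main() {
    // To read columns 2, 5, and 6 from a row iterator: skip 2, read one,
    // skip 2 more (columns 3 and 4), read one, then skip 0 and read one.
    assert_eq!(indexes_to_skips(&[2, 5, 6]), vec![2, 2, 0]);
}
```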