Module mz_compute_types::plan::reduce
source · Expand description
Reduction execution planning and dataflow construction.
We build ReducePlan
s to manage the complexity of planning the generated dataflow for a
given reduce expression. The intent here is that each creating a ReducePlan
should capture
all of the decision making about what kind of dataflow do we need to render and what each
operator needs to do, and then actually rendering the plan can be a relatively simple application
of (as much as possible) straight line code.
Materialize needs to be able to maintain reductions incrementally (roughly, using
time proportional to the number of changes in the input) and ideally, with a
memory footprint proportional to the number of reductions being computed. We have to employ
several tricks to achieve that, and these tricks constitute most of the complexity involved
with planning and rendering reduce expressions. There’s some additional complexity involved
in handling aggregations with DISTINCT
correctly so that we can efficiently suppress
duplicate updates.
In order to optimize the performance of our rendered dataflow, we divide all aggregations into three distinct types. Each type gets rendered separately, with its own specialized plan and dataflow. The three types are as follows:
-
Accumulable: Accumulable reductions can be computed inline in a Differential update’s
difference
field because they basically boil down to tracking counts of things.sum()
is an example of an accumulable reduction, and when some elementx
is removed from the set of elements being summed, we can introduce-x
to incrementally maintain the sum. More formally, accumulable reductions correspond to instances of commutative Abelian groups. -
Hierarchical: Hierarchical reductions don’t have a meaningful negation like accumulable reductions do, but they are still commutative and associative, which lets us compute the reduction over subsets of the input, and then compute the reduction again on those results. For example:
min[2, 5, 1, 10]
is the same asmin[ min[2, 5], min[1, 10]]
. When we compute hierarchical reductions this way, we can maintain the computation in sublinear time with respect to the overall input.min
andmax
are two examples of hierarchical reductions. More formally, hierarchical reductions correspond to instances of semigroups, in that they are associative, but in order to benefit from being computed hierarchically, they need to have some reduction in data size as well. A function like “concat-everything-to-a-string” wouldn’t benefit from hierarchical evaluation.When the input is append-only, or monotonic, reductions that would otherwise have to be computed hierarchically can instead be computed in-place, because we only need to keep the value that’s better than the “best” (minimal or maximal for min and max) seen so far.
-
Basic: Basic reductions are a bit like the Hufflepuffs of this trifecta. They are neither accumulable nor hierarchical (most likely they are associative but don’t involve any data reduction) and so for these we can’t do much more than just defer to Differential’s reduce operator and eat a large maintenance cost.
When we render these reductions we want to limit the number of arrangements we produce. When we build a dataflow for a reduction containing multiple types of reductions, we have no choice but to divide up the requested aggregations by type, render each type separately and then take those results and collate them back in the requested output order. However, if we only need to perform aggregations of a single reduction type, we can specialize and render the dataflow to compute those aggregations in the correct order, and return the output arrangement directly and avoid the extra collation arrangement.
Modules§
- Nested message and enum types in
ProtoAccumulablePlan
. - Nested message and enum types in
ProtoBasicPlan
. - Nested message and enum types in
ProtoHierarchicalPlan
. - Nested message and enum types in
ProtoReducePlan
. - Nested message and enum types in
ProtoReductionType
.
Structs§
- Plan for computing a set of accumulable aggregations.
- Plan for computing a set of hierarchical aggregations with non-monotonic inputs.
- Plan for collating the results of computing multiple aggregation types.
- Plan for extracting keys and values in preparation for a reduction.
- Plan for computing a set of hierarchical aggregations with a monotonic input.
- Plan for rendering a single basic aggregation, with possibly fusing a
FlatMap UnnestList
with this aggregation.
Enums§
- Plan for computing a set of basic aggregations.
- Plan for computing a set of hierarchical aggregations.
- A
ReducePlan
provides a concise description for how we will execute a given reduce expression. - This enum represents the three potential types of aggregations.
Functions§
expected_group_size
is a u64, but instead of a uniform distribution, we want a logarithmic distribution so that we have an even distribution in the number of layers of buckets that a hierarchical plan would have.- Transforms a vector containing indexes of needed columns into one containing the “skips” an iterator over a Row would need to perform to see those values.
- Determines whether a function can be accumulated in an update’s “difference” field, and whether it can be subjected to recursive (hierarchical) aggregation.