
Reduction execution planning and dataflow construction. We build ReducePlans to manage the complexity of planning the generated dataflow for a given reduce expression. The intent here is that creating a ReducePlan should capture all of the decision making about what kind of dataflow we need to render and what each operator needs to do, and then actually rendering the plan can be a relatively simple application of (as much as possible) straight-line code.

Materialize needs to be able to maintain reductions incrementally (roughly, using time proportional to the number of changes in the input) and ideally, with a memory footprint proportional to the number of reductions being computed. We have to employ several tricks to achieve that, and these tricks constitute most of the complexity involved with planning and rendering reduce expressions. There’s some additional complexity involved in handling aggregations with DISTINCT correctly so that we can efficiently suppress duplicate updates.

In order to optimize the performance of our rendered dataflow, we divide all aggregations into three distinct types. Each type gets rendered separately, with its own specialized plan and dataflow. The three types are as follows:

  1. Accumulable: Accumulable reductions can be computed inline in a Differential update’s difference field because they basically boil down to tracking counts of things. sum() is an example of an accumulable reduction: when some element x is removed from the set of elements being summed, we can introduce -x to incrementally maintain the sum. More formally, accumulable reductions correspond to instances of Abelian (commutative) groups. (The sketch following this list illustrates the accumulable and hierarchical strategies concretely.)

  2. Hierarchical: Hierarchical reductions don’t have a meaningful negation like accumulable reductions do, but they are still commutative and associative, which lets us compute the reduction over subsets of the input, and then compute the reduction again on those results. For example: min[2, 5, 1, 10] is the same as min[min[2, 5], min[1, 10]]. When we compute hierarchical reductions this way, we can maintain the computation in sublinear time with respect to the overall input. min and max are two examples of hierarchical reductions. More formally, hierarchical reductions correspond to instances of semigroups, in that they are associative, but in order to benefit from being computed hierarchically, they need to have some reduction in data size as well. A function like “concat-everything-to-a-string” wouldn’t benefit from hierarchical evaluation.

    When the input is append-only, or monotonic, reductions that would otherwise have to be computed hierarchically can instead be computed in place, because we only need to keep the single “best” value (minimal or maximal for min and max) seen so far and replace it whenever a better value arrives.

  3. Basic: Basic reductions are a bit like the Hufflepuffs of this trifecta. They are neither accumulable nor hierarchical (most likely they are associative but don’t involve any data reduction) and so for these we can’t do much more than just defer to Differential’s reduce operator and eat a large maintenance cost.
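
To make the accumulable and hierarchical strategies concrete, here is a minimal sketch in plain Rust. It deliberately avoids Differential operators, and the names and signatures are illustrative only, not part of this module: an accumulable sum maintained from signed diffs, a bucketed hierarchical min, and the monotonic in-place variant.

```rust
/// Accumulable: a sum is maintained by folding signed diffs into the running
/// total, which is what Differential's `difference` field lets us do inline.
/// Each update is `(value, diff)`: an insertion carries `diff = +1` and a
/// retraction carries `diff = -1`, so retracting `x` contributes `-x`.
fn maintain_sum(current: i64, updates: &[(i64, i64)]) -> i64 {
    updates
        .iter()
        .fold(current, |total, &(value, diff)| total + value * diff)
}

/// Hierarchical: compute the min over fixed-size buckets, then over the
/// bucket minima. Bucketing is what keeps incremental maintenance cheap: a
/// change to one input only touches its bucket and the (much smaller) layer
/// of bucket minima above it. `bucket_size` must be nonzero.
fn hierarchical_min(values: &[i64], bucket_size: usize) -> Option<i64> {
    values
        .chunks(bucket_size)
        .map(|bucket| bucket.iter().copied().min())
        .min()
        .flatten()
}

/// Monotonic: append-only input never retracts anything, so we can skip the
/// hierarchy and keep only the best value seen so far.
fn monotonic_min(best_so_far: Option<i64>, new_value: i64) -> Option<i64> {
    Some(match best_so_far {
        Some(best) => best.min(new_value),
        None => new_value,
    })
}
```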

When we render these reductions we want to limit the number of arrangements we produce. When we build a dataflow for a reduction containing multiple types of aggregations, we have no choice but to divide up the requested aggregations by type, render each type separately, and then collate those results back into the requested output order. However, if we only need to perform aggregations of a single reduction type, we can specialize: we render the dataflow to compute those aggregations in the correct order and return the output arrangement directly, avoiding the extra collation arrangement.
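
As a rough illustration of that decision, the sketch below shows the shape of the choice between a single specialized plan and a collation plan. The types here are a hypothetical simplification for exposition; the actual plan types in this module carry much more per-variant rendering detail.

```rust
/// Which of the three strategies a single aggregation falls into (mirroring
/// the three types described above).
#[derive(Clone, Copy, PartialEq, Eq)]
enum ReductionType {
    Accumulable,
    Hierarchical,
    Basic,
}

/// A hypothetical plan shape: either one specialized dataflow, or several
/// specialized dataflows whose outputs must be collated back together.
enum PlanSketch {
    /// Every requested aggregation shares one type: render a single
    /// specialized dataflow and return its output arrangement directly.
    Single(ReductionType),
    /// Mixed types: render one dataflow per type and pay for an extra
    /// collation arrangement to restore the requested output order.
    Collation(Vec<ReductionType>),
}

fn plan(aggregations: &[ReductionType]) -> PlanSketch {
    match aggregations {
        // All requested aggregations share one type, so no collation needed.
        [first, rest @ ..] if rest.iter().all(|ty| ty == first) => PlanSketch::Single(*first),
        // Otherwise plan each type separately and collate the results.
        _ => PlanSketch::Collation(aggregations.to_vec()),
    }
}
```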

Enums

  • Plan for computing a set of basic aggregations.
  • Plan for computing a set of hierarchical aggregations.
  • A ReducePlan provides a concise description for how we will execute a given reduce expression.
  • This enum represents the three potential types of aggregations.

Functions

  • expected_group_size is a u64, but instead of drawing it from a uniform distribution, we want a logarithmic distribution so that the number of layers of buckets a hierarchical plan would use is itself evenly distributed.
  • Transforms a vector containing indexes of needed columns into one containing the “skips” an iterator over a Row would need to perform to see those values.
  • Determines whether a function can be accumulated in an update’s “difference” field, and whether it can be subjected to recursive (hierarchical) aggregation.
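
For instance, the “skips” transformation described above amounts to something like the following sketch (the name and signature here are hypothetical, not this module’s actual function):

```rust
/// Given sorted indexes of needed columns, produce how many columns a Row
/// iterator must skip over before reaching each needed value.
fn indexes_to_skips(indexes: &[usize]) -> Vec<usize> {
    let mut skips = Vec::with_capacity(indexes.len());
    let mut position = 0;
    for &index in indexes {
        // Skip the columns between our current position and the one we need.
        skips.push(index - position);
        // After reading the needed column, the iterator sits just past it.
        position = index + 1;
    }
    skips
}

// For example, needed columns [1, 3, 4] become skips [1, 1, 0]: skip one
// column to reach column 1, skip one more (column 2) to reach column 3, and
// column 4 is adjacent.
```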