JDB/Aggregation/Formal
JDB: Layout Columns | Queries Σ | API: Structural Records | Client/Server | Implementation | Examples
A few observations from experience with handling of TPC-H queries. See tpch demo folder, query.ijs file.
Here's a working prototype of aggregation prototype. It allowed to significantly simplify the aggregation plan below.
Contents
Definition of Aggregation
Given a sets of by-columns and residual columns, the latter each having a scalar-valued aggregator (count, sum, avg, min, max, etc.), aggregation replaces by-columns with their compound key (or nub), and their grouping map i.~ reduces each residual column with its aggregator.
Notes.
- i.~ is the simplest form of grouping map that is invariant
under = Self-Classify, which is part of /. Key definition: x u/. y is (=x) u@# y. It allows to combine several different-type columns into homogeneous compound map, to which i.~ can again be applied.
- The result set of aggregation has a natural key of
the by-columns.
Aggregation as Query Stage
In the trivial case of one by-column and one residual
sum(rc) by bc where ...
and rv,bv being filtered values, the result is obtained as
(~.bv) ; bv sum/. rv
However, typically there are multiple heterogeneous by-columns and/or multiple residuals. So the following aggregation plan will use index mapping.
Given a query with { T: target table, W: where filters, A,,1,, R,,1,, ,.., A,,m,, R,,m,,: aggregators with residuals, B,,1,,,..B,,n,,: by-columns }
1. Obtain S : list of column values filtered by W, by-columns first B,,1,,,..,B,,n,,,R,,1,,,..,R,,m,,
2. Build C : compound grouping map from S,,1,,,..,S,,n,, by-column results
- For each j in 1 .. n
- C,,j,, : i.~ S,,j,, is column self index
- C : i.~ C,,1,, ,. ... ,. C,,n,, is compound self-index
3. Build F : aggregator list for all n+m columns by prepending {. for each by-column F : (n # '{.') , A,,1,, , .. , A,,n,,
2. Calculate Z,,k,, : aggregates by F,,k,, of S,,k,, under map C
- For each k in 1 .. n + m
- Z,,k,, : C F,,k,, /. S,,k,, is aggregation by map C
Return results of aggregation Z,,1,, ,.., Z,,n+m,,
Query Notation
The new syntax affects the Select Clause:
Alias:Aggregator Column,... by Column,...
It should be easy to parse first by splitting with reserved word by and next splitting by comma, then isolating aggregator from residuals and finally cutting alias with : in resuduals.
If alias is given, it is used as column name in result set.
Reads__d 'Q1:sum R1,Q2:avg R2 by B1,B2 from t where Gender=Male' --+--+--+-- B1|B2|Q1|Q2 --+--+--+-- V1|V3|12|67 V2|V4|45|89
If alias is not given, and columns are different, their orginal names are used.
Reads__d 'min R1,max R2 by B1,B2 from t' --+--+--+-- B1|B2|R1|R2 --+--+--+-- V1|V3|12|67 V2|V4|45|89
If alias is not given, the same columns are prefixed with name of aggregator.
Reads__d 'min R,max R by B1,B2 from t where Gender=Male' --+--+----+---- B1|B2|minR|maxR --+--+----+---- V1|V3|12 |67 V2|V4|45 |89
More examples.
Read__d 'sum Salary by Department from t where Gender=Male' ----------+------- Department|Salary ----------+------- DEPT1 |1000000 DEPT2 |1500000 Read__d 'SalTotal:sum Salary by Department from t where Gender=Male' ----------+-------- Department|SalTotal ----------+-------- DEPT1 |1000000 DEPT2 |1500000
Note. In the future, column expressions can be used in place of input columns (B,,j,, and R,,k,,), which are applied after where-filter. For example, year_of(dt), uppercase(name), or (price*discount) etc. Result set names of same columns should be prefixed or alias required.
Hierarchical Aggregation
Each by-column can be thought of as a dimension and aggregation result is a sparse cube, each cell containg a residual tuple addressed by the "coordinates" formed with values of by-columns.
The definition above can be called simple or one-level aggregation. In contrast, multi-level (hierarchical, pivot-table, OLAP cube) aggregation, is when more than one set of by-columns is given. It can be executed as a sequence of simple aggregations for each level, where the by-sets above are added to the filter, current by-set is used as simple by-columns and lower levels ignored, which corresponds to NULLs in the dimensions coordinate. The process can be optimized by re-using the aggregations from lower levels.