# JDB/Aggregation/Formal

JDB: Layout Columns | Queries Σ | API: Structural Records | Client/Server | Implementation | Examples

A few observations from experience with handling of TPC-H queries.
See tpch demo folder, `query.ijs` file.

Here's a working prototype of aggregation prototype. It allowed to significantly simplify the aggregation plan below.

## Contents

### Definition of Aggregation

Given a sets of *by*-columns and *residual* columns,
the latter each having a scalar-valued aggregator
(count, sum, avg, min, max, etc.), *aggregation* replaces
by-columns with their compound key (or nub), and their
grouping map `i.~` reduces each residual column with
its aggregator.

Notes.

`i.~`is the simplest form of grouping map that is invariant

under `=` Self-Classify, which is part of `/.`
Key definition: ` x u/. y is (=x) u@# y`.
It allows to combine several different-type columns into
homogeneous compound map, to which `i.~` can again be applied.

- The result set of aggregation has a natural key of

the by-columns.

### Aggregation as Query Stage

In the trivial case of one by-column and one residual

` sum(rc) by bc where ...`

and `rv,bv` being filtered values, the result is obtained as

` (~.bv) ; bv sum/. rv `

However, typically there are multiple heterogeneous by-columns and/or multiple residuals. So the following aggregation plan will use index mapping.

Given a query with { *T*: target table, *W*: where filters,
*A,,1,, R,,1,, ,.., A,,m,, R,,m,,*: aggregators with residuals,
*B,,1,,,..B,,n,,*: by-columns }

1. Obtain *S* : list of column values filtered by *W*, by-columns first *B,,1,,*,..,*B,,n,,*,*R,,1,,*,..,*R,,m,,*

2. Build *C* : compound grouping map from *S,,1,,*,..,*S,,n,,* by-column results

- For each
*j*in 1 ..*n*

- For each

*C,,j,,*:`i.~`*S,,j,,*is column self index*C*:`i.~`*C,,1,,*`,.`...`,.`*C,,n,,*is compound self-index

3. Build *F* : aggregator list for all *n*+*m* columns by prepending `{.` for each by-column
*F* : (*n* `#` '`{.`') , *A,,1,,* , .. , *A,,n,,*

2. Calculate *Z,,k,,* : aggregates by *F,,k,,* of *S,,k,,* under map *C*

- For each
*k*in 1 ..*n*+*m* *Z,,k,,*:*C**F,,k,,*`/.`*S,,k,,*is aggregation by map*C*

Return results of aggregation *Z,,1,,* ,.., *Z,,n+m,,*

### Query Notation

The new syntax affects the Select Clause:

`Alias:Aggregator Column,... by Column,...`

It should be easy to parse first by splitting with reserved word `by` and
next splitting by comma, then isolating aggregator from residuals and
finally cutting alias with `:` in resuduals.

If alias is given, it is used as column name in result set.

Reads__d 'Q1:sum R1,Q2:avg R2 by B1,B2 from t where Gender=Male' --+--+--+-- B1|B2|Q1|Q2 --+--+--+-- V1|V3|12|67 V2|V4|45|89

If alias is not given, and columns are different, their orginal names are used.

Reads__d 'min R1,max R2 by B1,B2 from t' --+--+--+-- B1|B2|R1|R2 --+--+--+-- V1|V3|12|67 V2|V4|45|89

If alias is not given, the same columns are prefixed with name of aggregator.

Reads__d 'min R,max R by B1,B2 from t where Gender=Male' --+--+----+---- B1|B2|minR|maxR --+--+----+---- V1|V3|12 |67 V2|V4|45 |89

More examples.

Read__d 'sum Salary by Department from t where Gender=Male' ----------+------- Department|Salary ----------+------- DEPT1 |1000000 DEPT2 |1500000 Read__d 'SalTotal:sum Salary by Department from t where Gender=Male' ----------+-------- Department|SalTotal ----------+-------- DEPT1 |1000000 DEPT2 |1500000

Note. In the future, column expressions can be used in
place of input columns (*B,,j,,* and *R,,k,,*), which are applied after
where-filter. For example, `year_of(dt)`, `uppercase(name)`,
or `(price*discount)` etc.
Result set names of same columns should be prefixed or
alias required.

### Hierarchical Aggregation

Each by-column can be thought of as a dimension and aggregation result is a sparse cube, each cell containg a residual tuple addressed by the "coordinates" formed with values of by-columns.

The definition above can be called simple or one-level aggregation. In contrast, multi-level (hierarchical, pivot-table, OLAP cube) aggregation, is when more than one set of by-columns is given. It can be executed as a sequence of simple aggregations for each level, where the by-sets above are added to the filter, current by-set is used as simple by-columns and lower levels ignored, which corresponds to NULLs in the dimensions coordinate. The process can be optimized by re-using the aggregations from lower levels.