
Aggregate

Aggregate is a transformation component that groups together multiple input rows into a single output row. Multiple columns can be grouped or aggregated at once.

Default output names are chosen by combining the source column name and the aggregation type. If these defaults are not suitable, consider using a Rename component immediately after the Aggregate component to choose better names.
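
For instance, summing a QUANTITY column grouped by PRODUCT (as in the examples below) produces an output column named sum_Quantity by default. Conceptually, such a configuration corresponds to SQL of roughly this shape; the table name sales is a placeholder, and the exact SQL the component generates may differ:

SELECT
    PRODUCT,
    SUM(QUANTITY) AS sum_Quantity   -- default name combines aggregation type and source column
FROM sales
GROUP BY PRODUCT;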


Properties

The available properties differ slightly depending on the cloud data platform, and are described below for Snowflake, Databricks, and Amazon Redshift in turn.

Snowflake

Name = string

A human-readable name for the component.


Groupings = dual listbox

Select one or more columns from the source table that will form the groupings. The output flow will contain one row for each distinct combination of values within the grouping columns.


Aggregations = column editor

Select the input (source) columns and the aggregation types that should be applied to each. Note that certain aggregation types expect certain data types. For example, you cannot average non-numeric data.

  • Column Name: Select the input (source) column for the summary function.
  • Aggregate Type: The summary function to use for the column. Available functions include:
    • Any Value
    • Approximate Count
    • Approximate Count Distinct (HLL)
    • Array Aggregate
    • Array Aggregate Distinct
    • Average
    • Bit AND Aggregate
    • Bit OR Aggregate
    • Bit XOR Aggregate
    • Bool AND Aggregate
    • Bool OR Aggregate
    • Bool XOR Aggregate
    • Count
    • Count Distinct
    • Hash Aggregate
    • List Aggregate
    • List Aggregate Distinct
    • Max
    • Median
    • Min
    • Mode
    • Skew
    • Standard Deviation - Population
    • Standard Deviation - Sample
    • Sum
    • Variance - Population
    • Variance - Sample

For more information on the Aggregation Types, read the Snowflake Aggregate Functions documentation.
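
As a rough illustration, several of these aggregation types map onto Snowflake aggregate functions along the following lines. The table name sales and the output aliases are placeholders (not necessarily the component's default output names), and the SQL the component actually generates may differ:

SELECT
    PRODUCT,
    AVG(QUANTITY)                  AS avg_QUANTITY,        -- Average
    COUNT(DISTINCT MANAGER)        AS count_dist_MANAGER,  -- Count Distinct
    APPROX_COUNT_DISTINCT(MANAGER) AS approx_MANAGER,      -- Approximate Count Distinct (HLL)
    LISTAGG(MANAGER, ', ')         AS list_MANAGER,        -- List Aggregate
    MEDIAN(QUANTITY)               AS median_QUANTITY      -- Median
FROM sales
GROUP BY PRODUCT;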


Grouping Type = drop-down

Select one of the following methods for grouping rows (a SQL sketch of these clauses follows the list):

  • GROUP_BY: Groups rows with the same group-by-item expressions and computes aggregate functions for the resulting group. A Snowflake GROUP_BY expression can be a column name, a number referencing a position in the SELECT list, or a general expression.
  • GROUP_BY_CUBE: The Snowflake GROUP_BY_CUBE clause is an extension of the GROUP_BY clause, similar to GROUP_BY_ROLLUP. In addition to producing all the rows of a GROUP_BY_ROLLUP, GROUP_BY_CUBE adds all the "cross-tabulation" rows.
  • GROUP_BY_GROUPING_SETS: A Snowflake GROUP_BY_GROUPING_SETS clause is a powerful extension of the GROUP_BY clause that allows computing multiple group-by clauses in a single statement. Each grouping set is a set of dimension columns.
  • GROUP_BY_ROLLUP: A Snowflake GROUP_BY_ROLLUP clause is an extension of the GROUP_BY clause that produces subtotal rows in addition to the grouped rows. Subtotal rows further aggregate the grouped rows, using the same aggregate functions that were used to produce the grouped rows.
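
As a rough sketch, using the placeholder table sales and the columns from the examples below, the four grouping types correspond to standard Snowflake GROUP BY variants:

SELECT PRODUCT, MANAGER, SUM(QUANTITY) AS sum_QUANTITY
FROM sales
GROUP BY PRODUCT, MANAGER;                         -- GROUP_BY

-- The other grouping types swap out the GROUP BY clause above:
-- GROUP BY ROLLUP (PRODUCT, MANAGER)              -- grouped rows, plus subtotals per PRODUCT and a grand total
-- GROUP BY CUBE (PRODUCT, MANAGER)                -- rollup rows, plus cross-tabulation subtotals per MANAGER
-- GROUP BY GROUPING SETS ((PRODUCT), (MANAGER))   -- only the grouping combinations listed explicitly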

Databricks

Name = string

A human-readable name for the component.


Groupings = dual listbox

Select one or more columns from the source table that will form the groupings. The output flow will contain one row for each distinct combination of values within the grouping columns.


Aggregations = column editor

Select the input (source) columns and the aggregation types that should be applied to each. Note that certain aggregation types expect certain data types. For example, you cannot average non-numeric data.

  • Column Name: Select the input (source) column for the summary function.
  • Aggregate Type: The summary function to use for the column. Available functions include:
    • Approximate Count
    • Average
    • Count
    • Count Distinct
    • Max
    • Min
    • Standard Deviation - Sample
    • Standard Deviation - Population
    • Sum
    • Variance - Sample
    • Variance - Population
    • Sum Distinct
    • Skewness
    • Kurtosis

You can add the same source column multiple times, each with a different aggregate type. A separate summary is calculated for each source column and aggregate type combination.

For more information on each available function, read the Databricks SQL functions documentation.
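
As a rough sketch, adding the QUANTITY column several times with different aggregate types corresponds to Databricks SQL along these lines; the table name sales and the output aliases are placeholders, and the SQL the component actually generates may differ:

SELECT
    PRODUCT,
    AVG(QUANTITY)          AS avg_QUANTITY,       -- Average
    SUM(DISTINCT QUANTITY) AS sum_dist_QUANTITY,  -- Sum Distinct
    SKEWNESS(QUANTITY)     AS skew_QUANTITY,      -- Skewness
    KURTOSIS(QUANTITY)     AS kurt_QUANTITY       -- Kurtosis
FROM sales
GROUP BY PRODUCT;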

Amazon Redshift

Name = string

A human-readable name for the component.


Groupings = dual listbox

Select one or more columns from the Groupings list to move them into Selected Groupings. The output flow will contain one row for each distinct combination of values within the grouping columns.


Aggregations = column editor

Select the input (source) columns and the aggregation types that should be applied to each. Note that certain aggregation types expect certain data types. For example, you cannot average non-numeric data.

  • Column Name: Select the input (source) column for the summary function.
  • Aggregate Type: The summary function to use for the column. Available SQL aggregate functions include:
    • Approximate Count: This is a variant of Count Distinct that may have better performance in cases where an exact count is not required. For more information, read the Redshift documentation.
    • Average: Average the values in the column name. This is only sensible for numeric columns.
    • Count: Count the total number of records whose column name is not null.
    • Count Distinct: Count the total number of distinct (unique) values of the column name.
    • Max: Find the Maximum value of the column name.
    • Median: Calculates the median value for the range of values. For more information, read the Redshift Median documentation.
    • Min: Find the Minimum value of the column name.
    • Standard Deviation - Population: Calculates the population standard deviation of the column name. This is only sensible for numeric columns.
    • Standard Deviation - Sample: Calculates the sample standard deviation of the column name. This is only sensible for numeric columns.
    • Sum: Sum the values in the column name. This is only sensible for numeric columns.
    • Variance - Population: Calculates the population variance of the column name. This is only sensible for numeric columns.
    • Variance - Sample: Calculates the sample variance of the column name. This is only sensible for numeric columns.

For more information on the Aggregation Types, read the Redshift Aggregate Functions documentation.
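
As a rough sketch, a few of these aggregation types correspond to Redshift SQL along these lines; the table name sales and the output aliases are placeholders, and the SQL the component actually generates may differ:

SELECT
    PRODUCT,
    APPROXIMATE COUNT(DISTINCT MANAGER) AS approx_MANAGER,   -- Approximate Count
    AVG(QUANTITY)                       AS avg_QUANTITY,     -- Average
    STDDEV_SAMP(QUANTITY)               AS stddev_QUANTITY   -- Standard Deviation - Sample
FROM sales
GROUP BY PRODUCT;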


Examples

Alice and Zoe are managing a merchandise booth on different days. We have some data for their sales on each day.

Input data

+------------+----------+----------+---------+
|    DATE    | PRODUCT  | QUANTITY | MANAGER |
+------------+----------+----------+---------+
| 2023-04-01 | Mugs     |       10 | Alice   |
| 2023-04-01 | Stickers |       20 | Alice   |
| 2023-04-02 | Mugs     |       15 | Alice   |
| 2023-04-02 | Stickers |        5 | Alice   |
| 2023-04-03 | Mugs     |       20 | Zoe     |
| 2023-04-03 | Stickers |       25 | Zoe     |
| 2023-04-04 | Mugs     |       15 | Zoe     |
| 2023-04-04 | Stickers |        5 | Zoe     |
+------------+----------+----------+---------+

Find the total number of each product sold.

One of the simplest and most useful things to do would be to aggregate the sales numbers, grouped by product.

Aggregate component properties:

  • Groupings: PRODUCT
  • Aggregations:
    • Column Name: QUANTITY
    • Aggregation Type: Sum
  • Grouping Type: Group By
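
For reference, this configuration is roughly equivalent to the following SQL (the input table name sales is a placeholder, and the exact SQL the component generates may differ):

SELECT PRODUCT, SUM(QUANTITY) AS sum_Quantity
FROM sales
GROUP BY PRODUCT;
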
Output data

+--------------+----------+
| sum_Quantity | Product  |
+--------------+----------+
|           60 | Mugs     |
|           55 | Stickers |
+--------------+----------+

Find the average sales by each manager, for each product.

We can add multiple groupings, and each becomes an output column from the Aggregate component. This time, we'll use the Average aggregation type.

Aggregate component properties:

  • Groupings:
    • PRODUCT
    • MANAGER
  • Aggregations:
    • Column Name: QUANTITY
    • Aggregation Type: Average
  • Grouping Type: Group By
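
Again for reference, this configuration is roughly equivalent to the following SQL (the input table name sales is a placeholder):

SELECT PRODUCT, MANAGER, AVG(QUANTITY) AS avg_Quantity
FROM sales
GROUP BY PRODUCT, MANAGER;
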
Output data

+--------------+---------+----------+
| avg_Quantity | Manager | Product  |
+--------------+---------+----------+
|    12.500000 | Alice   | Mugs     |
|    12.500000 | Alice   | Stickers |
|    17.500000 | Zoe     | Mugs     |
|    15.000000 | Zoe     | Stickers |
+--------------+---------+----------+

Note that because we've calculated an average, the avg_Quantity data now shows as a real type rather than an integer.

