Distinct
Distinct is a transformation component that generates a SELECT DISTINCT query to pass on only the distinct (or unique) input rows to the next component. A row is only duplicate if all of its included column values match all of those from another row.
This is extremely useful for pulling out all possible values for a given column or for cleaning up duplicate records from a data set.
Properties
Name
= string
A human-readable name for the component.
Columns
= dual listbox
Only these selected columns are kept and passed to the next component. Duplicate records from these columns are removed, leaving only distinct values.
Examples
We have some data about our employees, which we're identifying by ID. Let's see what the Distinct component can do for our data.
Input data
+-------+-------------+----------+
| ID | DEPARTMENT | POSITION |
+-------+-------------+----------+
| 00001 | Marketing | Junior |
| 00002 | Sales | Senior |
| 00003 | Marketing | Senior |
| 00004 | Engineering | Junior |
| 00001 | Marketing | Junior |
| 00005 | Sales | Junior |
+-------+-------------+----------+
Cleaning up duplicates
It looks like we've somehow got a duplicate record for ID 00001. That's not going to help our later transformations so let's clean it up.
If we put every column into the Distinct component then we can get everything back except the duplicates.
Distinct component properties:
- Columns:
- ID
- DEPARTMENT
- POSITION
Output data
+-------+-------------+----------+
| ID | DEPARTMENT | POSITION |
+-------+-------------+----------+
| 00001 | Marketing | Junior |
| 00002 | Sales | Senior |
| 00003 | Marketing | Senior |
| 00004 | Engineering | Junior |
| 00005 | Sales | Junior |
+-------+-------------+----------+
Excellent. We now only have one entry per ID, as expected.
Finding all departments
We'd just like a nice clean list of what departments our organization has. If we feed just the DEPARTMENT column in, the Distinct component will find all unique values for this column.
Distinct component properties:
- Columns: DEPARTMENT
Output data
+-------------+
| DEPARTMENT |
+-------------+
| Marketing |
| Sales |
| Engineering |
+-------------+
Snowflake | Databricks | Amazon Redshift |
---|---|---|
✅ | ✅ | ✅ |