Distinct
The Distinct transformation component lets you remove any duplicate rows from a dataset. A row is considered to be a duplicate if the values in all of the selected columns match all the values in another row.
This component is equivalent to writing a SELECT DISTINCT statement.
Use case
This component can be used to return all possible values for a given column, or to remove duplicate records from a dataset. For example, you can use it to:
- Remove duplicate customer records from your dataset to avoid issues with customer analytics.
- Clean up transaction logs that contain duplicate records due to retries or system errors.
- Make sure your data is ready for aggregation.
Properties
Name
= string
A human-readable name for the component.
Columns
= dual listbox
Only these selected columns are kept and passed to the next component. Duplicate records from these columns are removed, leaving only distinct values.
Examples
We have some data about our employees, which we're identifying by ID. Let's see what the Distinct component can do for our data.
Input data
+-------+-------------+----------+
| ID | DEPARTMENT | POSITION |
+-------+-------------+----------+
| 00001 | Marketing | Junior |
| 00002 | Sales | Senior |
| 00003 | Marketing | Senior |
| 00004 | Engineering | Junior |
| 00001 | Marketing | Junior |
| 00005 | Sales | Junior |
+-------+-------------+----------+
Cleaning up duplicates
It looks like we've somehow got a duplicate record for ID 00001. That's not going to help our later transformations so let's clean it up.
If we put every column into the Distinct component then we can get everything back except the duplicates.
Distinct component properties:
- Columns:
- ID
- DEPARTMENT
- POSITION
Output data
+-------+-------------+----------+
| ID | DEPARTMENT | POSITION |
+-------+-------------+----------+
| 00001 | Marketing | Junior |
| 00002 | Sales | Senior |
| 00003 | Marketing | Senior |
| 00004 | Engineering | Junior |
| 00005 | Sales | Junior |
+-------+-------------+----------+
Excellent. We now only have one entry per ID, as expected.
Finding all departments
We'd just like a nice clean list of what departments our organization has. If we feed just the DEPARTMENT column in, the Distinct component will find all unique values for this column.
Distinct component properties:
- Columns: DEPARTMENT
Output data
+-------------+
| DEPARTMENT |
+-------------+
| Marketing |
| Sales |
| Engineering |
+-------------+