Pipelines tutorial
In this tutorial, we'll look at a simple use case where we have some data sets from our social media platform, which we want to load into a data warehouse and transform. The first data set is data about social media posts, stored in an Excel sheet. The second is data about social media users, stored in a Google Sheet.
First, we'll load our data into the data warehouse. Then, we'll transform the data by converting the data types of some columns in our data. Next, we'll join the two datasets so we can see all our data together: the post title, who wrote it, the country they are from, and how many likes that post received.
We'll then perform some further transformations—first, an aggregation to calculate the sum total of likes for each unique country. Then, we'll rank these countries by their number of likes from most to fewest.
Get started
There are two pipeline types:
- Orchestration pipelines are used to extract and load data.
- Transformation pipelines are used to transform data within the warehouse.
Create an orchestration pipeline
We will first use an orchestration pipeline to load our data, starting on the Designer canvas. For more information about the Designer canvas, read Designer UI basics.
We are going to create a folder called SocialMediaData
, containing an orchestration pipeline called LoadSocialMediaData
.
- Click Files to open the files panel.
- Click the + icon, then click Folder to add a new folder. Name your new folder
SocialMediaData
. - Click ... next to this folder, then Add, then select Orchestration pipeline.
- Name this pipeline
LoadSocialMediaData
.
The pipeline will open on the canvas.
Load posts data using Excel Query
In this step, we will use the Excel Query component to load data from an Excel sheet that is stored in an S3 bucket.
Configure the component
- The Start component will appear on the canvas. This is the starting point for your pipeline flow.
- Click the + icon to the right of the Start component to open the Add component dialog.
- Search for and click the Excel Query component, which we will use to load social media data from an Excel sheet in an S3 bucket. This component is automatically connected to the Start component on the canvas, and the Configuration tab of the component properties panel opens on the right of the canvas.
- In the Connect section of the component properties panel, adjust the following settings:
- In Storage Type, keep the default selection of Amazon S3 Storage.
- Click the Storage URL property, then in S3 Buckets path enter
matillion-docs/Attachments/academy/socialmediadata.xlsx
and click Save.
- In the Configure section of the component properties panel, adjust the following settings:
- In Data Source, select the posts worksheet.
- In Data Selection, click the >> icon to load all columns from this worksheet, then click Save.
- In the Destination section of the component properties panel, adjust the following settings:
- In Target table, name the table
posts
. - In Stage Platform, keep the default selection of Snowflake Managed.
- In Target table, name the table
Validate the pipeline and sample your data
- Make sure the pipeline configuration is valid by clicking Validate on the canvas.
- Sample your data by clicking the Sample data tab at the bottom of the canvas, then click Sample data. You will see the first 25 rows of data loaded using this component.
Load user data using Google Sheets Query
In this step, we will use the Google Sheets Query component to load some test social media data that is stored in a Google Sheet. To do this, copy the Google Sheet below to your own Google account. You may want to bookmark or save this Google Sheet as a favorite, so you can access it easily.
https://docs.google.com/spreadsheets/d/1VYV8-MbMko12pWKeGJGhSu5zpkDDb51Hx3CIrqirTzk/edit?usp=sharing
- In your orchestration pipeline, click the + icon to the right of the Start component to add another component.
- In the Add component dialog, search for and click the Google Sheets Query component.
Authenticate the component
- Click the Authentication property, then click Manage. This will take you to a new tab containing a list of all the OAuth connections in your project.
- Click Add OAuth connection.
- In OAuth name, enter
google-sheets
. - In Provider, select Google.
- In Authentication Type, select OAuth 2.0 Authorization Code Grant.
- Click Sign in with Google.
- Select the Google account associated with the sheet you need to access, then click Allow.
You will be taken back to the OAuth connections list, where your new google-sheets
OAuth connection will appear.
Configure the component
- Return to Designer and reopen your
LoadSocialMediaData
orchestration pipeline. - Click the Google Sheets component.
- Click the Authentication property, then in Choose your profile, select your new google-sheets OAuth.
- Click the < icon to return to the Configuration tab of the component properties panel.
- In the Configure section of the component properties panel, adjust the following settings:
- In Spreadsheet Name, enter
SocialMediaData
. - In Data Source, select SocialMediaData_users.
- In Data Selection, click the >> icon to load all columns from this sheet, then click Save.
- In Row limit, increase the limit to 1000 rows of data.
- In Spreadsheet Name, enter
- In the Destination section of the component properties panel, adjust the following settings:
- In Warehouse, Database and Schema, select [Environment default].
- In Target Table, name the table
users
. - In Stage Platform, select Snowflake Managed.
- Validate the pipeline and sample your data as described for the Excel Query component.
Run the orchestration pipeline
Click Run on the canvas.
This orchestration pipeline will now run. It will load the data from the Excel file and Google Sheet into the target tables in your cloud data warehouse.
At the bottom of the screen, in the Task history tab, you can see the progress of this pipeline, and see whether it runs successfully. You can click a task to see more details.
Create a transformation pipeline
Now that we have our data, it's time to transform it. To do this, we need to create a transformation pipeline in our SocialMediaData
folder, which we're going to call JoinDataSets
.
- Click Files to open the files panel.
- Click ... next to the
SocialMediaData
folder, then Add, then select Transformation pipeline. - Name this pipeline
JoinDataSets
.
The new transformation pipeline will open on the canvas.
Bring your data into the transformation pipeline using Table Input
Transformation pipelines don't begin with a Start component, like orchestration pipelines do. Instead, transformation pipelines need to start with a component that provides data for transformation.
We need to connect the data that we loaded in our LoadSocialMediaData
orchestration pipeline to this new transformation pipeline. To do this, we'll use two Table Input components, one for each table containing our social media data.
- In your transformation pipeline, click the + icon on the canvas to add the first component to this pipeline.
- In the Add component dialog, search for and click the Table Input component.
- In the component properties panel, adjust the following settings:
- In Database and Schema, select [Environment default].
- In Target Table, select the posts table.
- In Column Names, click the >> icon to load all columns from this table, then click Save.
- Click the + icon on the left of the canvas to add another component.
- In the Add component dialog, search for the Table Input component, and drag a new Table Input component onto the canvas.
- In the component properties panel, adjust the following settings:
- In Database and Schema, select [Environment default].
- In Target Table, select the users table.
- In Column Names, click the >> icon to load all columns from this table, then click Save.
- Validate the pipeline and sample your data. Make sure to sample the data for both Table Input components, to make sure both data sets have been brought into the transformation pipeline correctly.
Transform your data
We are now going to use our JoinDataSets
transformation pipeline to convert some data into numbers, join the two data sets, and write our transformed data to a new table in our data warehouse.
Convert the 'Likes' and 'UserID' data to numbers using Convert Type
After sampling the data in our posts
Table Input component, we've noticed that the data in our Likes
and User ID
columns appears as numbers with a decimal point. Let's convert the data in these columns to whole numbers. This is a more accurate way to represent this data—we can't have half a "like". To do this, we will use the Convert Type component.
- Click the + icon to the right of the
posts
Table Input component to add a connected component. - In the Add component dialog, search for and click the Convert Type component.
- In the component properties panel, click the Conversions property, and configure two conversions:
- In the first row, in Column select UserID, and in Type select NUMBER. Leave the Size and Precision fields empty.
- Click the + icon in the bottom left to add another row.
- In the second row, in Column select Likes, and in Type select NUMBER. Leave the Size and Precision fields empty.
- Click Save.
- Validate the pipeline and sample your data.
Join the data from Excel and Google Sheets using Join
We're now going to join our posts
and users
tables together using the Join component.
- Click the + icon to the right of the Convert Type component to add a connected component.
- In the Add component dialog, search for and click the Join component.
- Click and drag the connection ring to the right of the
users
Table Input component to the Join component. - Click the Join component to open its component properties panel.
- Click the Main Table property, and select Convert Type.
- In the Main Table Alias property, enter
posts
. - Click the Joins property and configure how to join the tables:
- In Join Table, select users.
- In Join Alias, enter
users
. - In Join Type, select Inner.
- Click Save.
- Click the Join Expressions property, enter the following expression:
"posts"."UserID" = "users"."UserID"
, then click Save. - Click the Column Mappings property and configure the mappings:
- In the bottom left, click Add all.
- Select the checkboxes for the following columns:
- posts.PostID
- posts.RowId
- posts.UserID
- users.id
- users.UserID
- Click the - icon to remove these columns from the selection, because we don't want them included in our data.
- Click Save.
- Validate the pipeline and sample your data.
Write the combined data to a new table using Rewrite Table
We're now going to create a new table containing our combined data. We'll use the Rewrite Table component to create a new table, and name it SocialMediaOutput
.
- Click the + icon to the right of the Join component to add a connected component.
- In the Add component dialog, search for and click the Rewrite Table component.
- In the component properties panel, in Target Table, enter
SocialMediaOutput
. - Validate the pipeline.
Note
Sampling is not available for the Rewrite Table component, because it uses the data in the component immediately before it.
Run the transformation pipeline
Click Run on the canvas.
This transformation pipeline will now run. It will convert our likes and user ID data to number format, and join our two data sets together. It will then create a table called SocialMediaOutput
and write our data into this table in your data warehouse. You can see the progress of this task in the Task history tab at the bottom of the canvas.
Perform more transformations
Now that we have written our joined data into our new SocialMediaOutput
table, let's try aggregating and ranking our joined data using some more transformation pipeline components. We'll then create another new table containing this aggregated and ranked data.
Find the total likes per country using Aggregate
Let's use the Aggregate component to find the total number of likes for each country in our data set.
- Click the + icon to the right of the Join component to add a connected component.
- In the Add component dialog, search for and click the Aggregate component.
- In the properties panel, click the Groupings property. In the dialog, select Country from the list on the left, and click the → icon to move it into the Selected Groupings list. Click Save.
- Click the Aggregations property. In Column Name, select Likes. In Aggregate Type, select Sum. Click Save.
- Validate the pipeline and sample your data.
You'll see that almost all our data has disappeared, leaving a data set containing the total number of likes for each country.
Rank countries by total number of likes using Rank
Now let's sort our data. We'll use the Rank component to sort our data by country, in descending order according to the total number of likes.
- Click the + icon to the right of the Aggregate component to add a connected component.
- In the Add component dialog, search for and click the Rank component.
- In the component properties panel, click the Ordering Within Partitions property and configure the ordering settings:
- In Input Column, select sum_Likes.
- In Ordering, select Descending.
- Click Save.
- Click the Functions property and configure a function:
- In Window Function, select Rank.
- In Output Column, enter
CountryRank
. - Click Save.
- Validate the pipeline and sample your data.
You'll see that our data is now ranked in descending order by the number of total likes per country. Additionally, we have added a new column containing each country's rank.
Write the transformed data to a new table using Rewrite Table
We're now going to create a new table called RankedSocialMediaLikes
containing this aggregated, ranked data. This table will be separate from our SocialMediaOutput
table, which contains all our posts
and users
data.
- Click the + icon to the right of the Rank component to add a connected component.
- In the Add component dialog, search for and click the Rewrite Table component.
- In the component properties panel, in Target Table, enter
RankedSocialMediaLikes
. - Validate the pipeline.
Our transformation pipeline is now finished. It's a good idea to edit the names of your Rewrite Table components so that you know which tables they each write to. To rename a component, click the component, then click the edit icon next to the component name.
Automate these pipelines using And and Run Transformation
We can now link our LoadSocialMediaData
orchestration pipeline to our JoinDataSets
transformation pipeline, so that when we run the orchestration pipeline, it automatically runs our transformation pipeline. We'll use the And and Run Transformation components to do this.
- Open your
LoadSocialMediaData
orchestration pipeline. - Click the + icon to the right of the Google Sheets component to add a connected component.
- In the Add component dialog, search for and click the And component.
- Click and drag the connection ring to the right of the Excel Query component to the And component.
- Click the + icon to the right of the And component to add a connected component.
- In the Add component dialog, search for and click the Run Transformation component.
- In the component properties panel, in Transformation Pipeline, select SocialMediaData/JoinDataSets.
- Validate the pipeline.
- Run the pipeline.
When you run this orchestration pipeline, it will now run both the LoadSocialMediaData
and JoinDataSets
pipelines. You will only see the orchestration pipeline in the Task history tab at the bottom of the canvas, but if you click this task, you will see that the trasnformation pipeline is run as the final step of this pipeline.