Create a streaming pipeline
A streaming pipeline is a collection of source and target configuration details, plus optional advanced settings, that enables your streaming agent to monitor and consume changes in your source database and stream those changes to your destination.
Read Streaming pipelines for a deeper look at the concepts and architecture of streaming pipelines in the Data Productivity Cloud.
Prerequisites
To run a streaming pipeline, you need:
- A streaming agent. Agents execute your streaming pipelines and act as a bridge between Matillion and your data network while preserving your data sovereignty. Read Create an agent to get started.
- A streaming-configured database. You'll need to make sure your source database is set up for streaming.
- An account with a destination service.
- A cloud platform account with either AWS or Azure for managing secrets. Read Secrets overview for more information about using managed secrets in the Data Productivity Cloud.
- An AWS Secrets Manager or Azure Key Vault service.
Note
Your secret manager and agent must be hosted in the same cloud platform. You can't use an AWS agent with Azure Key Vault, and so on.
Get started
- Log in to the Matillion Hub.
- Click ☰ → Designer.
- Select your project.
- Choose the Streaming tab.
- Click Add streaming pipeline.
The rest of this guide will explain how to set up your streaming pipeline.
Pipeline details
To start creating your streaming pipeline, first complete these fields:
Name
= string
A unique, descriptive name for your streaming pipeline.
Agent
= drop-down
A running streaming agent. To create a new streaming agent, read Create an agent.
Source
= drop-down
Your source database. Supported sources include Postgres, Oracle, MySQL, Microsoft SQL Server, and Db2 for IBM i.
Destination
= drop-down
Your destination service. Choose from:
- Snowflake
- Amazon S3
- Azure Blob
Once you complete these fields, click Continue.
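If you want to check your choices before opening the wizard, the four fields above can be captured as a small record. The following Python sketch is not a Matillion API: the PipelineDetails class, its field names, the validate helper, and the example values are all hypothetical, and the source and destination sets simply repeat the options listed above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Options copied from the lists in this guide.
SUPPORTED_SOURCES = {"Postgres", "Oracle", "MySQL", "Microsoft SQL Server", "Db2 for IBM i"}
SUPPORTED_DESTINATIONS = {"Snowflake", "Amazon S3", "Azure Blob"}


@dataclass(frozen=True)
class PipelineDetails:
    """Hypothetical record of the Pipeline details fields (not a Matillion API)."""

    name: str
    agent: str
    source: str
    destination: str

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the details look complete."""
        problems = []
        if not self.name.strip():
            problems.append("Name must not be empty.")
        if not self.agent.strip():
            problems.append("Agent must not be empty.")
        if self.source not in SUPPORTED_SOURCES:
            problems.append(f"Unsupported source: {self.source}")
        if self.destination not in SUPPORTED_DESTINATIONS:
            problems.append(f"Unsupported destination: {self.destination}")
        return problems


details = PipelineDetails(
    name="orders-cdc",          # hypothetical pipeline name
    agent="streaming-agent-1",  # hypothetical agent name
    source="Postgres",
    destination="Snowflake",
)
print(details.validate())  # an empty list means no problems were found
```

The check only mirrors the lists in this guide. The wizard itself remains the authority on which agents, sources, and destinations are available in your project.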
Destination connection
In this section, you'll connect to your destination.
This is only required if your destination is Snowflake, as described in Streaming to a Snowflake destination.
Destination configuration
In this section, you'll configure your destination settings.
Complete the properties in this section as described in:
- Streaming to a Snowflake destination
- Streaming to an Azure Blob Storage destination
- Streaming to an Amazon S3 destination
Source setup
In this section, you'll configure your connection to your source database. Use the links below to visit the guidance for your database, then return to this page.
Click Connect to establish the connection to your source database.
Pipeline configuration
In this section, choose which schemas and tables to include in your streaming pipeline, and complete additional configuration settings.
Select tables
= dialog
Tables available from your defined source database will be displayed here. These tables are defined and created by the user in the source database and thus can't be described here.
Choose the schemas and tables to include in the streaming pipeline.
Replication type
= drop-down
Set your streaming pipeline's replication type. Read Replication types to learn more.
Dates and times strategy
= drop-down
Choose the date, time, and timestamp processing strategy of your streaming pipeline.
- Snowflake Native Types (e.g. DATETIME): This action will convert the source date/time value to an appropriate Snowflake type. If it's not possible to convert, the pipeline will output the value as a string. This is the default setting.
- Integers from Epoch: This action will load all date/time values as integers in Snowflake. You can calculate the date and time from each integer by adding it to the epoch value. This strategy applies to all date/time fields in the target table. Although it adds a conversion step on your side, it maintains a very accurate date/time value and avoids any errors that conversion in the pipeline might cause.
Read Date and times strategy to learn more.
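With the Integers from Epoch strategy, turning a loaded integer back into a date or time means adding it to the Unix epoch (1970-01-01 00:00:00 UTC). The Python sketch below assumes the unit of the integer: days for date values and milliseconds for timestamp values. These units are assumptions made for illustration, so confirm the actual unit for your source and column type before relying on them. The function names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

# The Unix epoch, as a date and as a timezone-aware timestamp.
EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_from_epoch_days(days: int) -> date:
    """Convert an integer number of days since the epoch to a date (assumed unit: days)."""
    return EPOCH_DATE + timedelta(days=days)


def timestamp_from_epoch_millis(millis: int) -> datetime:
    """Convert integer milliseconds since the epoch to a UTC timestamp (assumed unit: ms)."""
    return EPOCH_TS + timedelta(milliseconds=millis)


print(date_from_epoch_days(19723))                 # 2024-01-01
print(timestamp_from_epoch_millis(1704067200000))  # 2024-01-01 00:00:00+00:00
```

Because the arithmetic uses integers and timedelta, no floating-point rounding is involved, which is the accuracy benefit this strategy is meant to preserve.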
Snapshot options
= boolean
All source connectors have two selectable snapshot configurations: initial snapshot and on-demand snapshot.
- Initial snapshot allows you to take an initial snapshot of the specified tables when the pipeline is started. The initial snapshot captures the current state of the data in those tables at the beginning of the pipeline execution. By enabling this option, you ensure that the pipeline starts with the most recent data available.
- On-demand snapshot allows you to trigger snapshots for specific tables in the pipeline manually. This means that new tables can be added to the capture list after the pipeline's initial startup, and snapshots for those tables can be triggered as needed. On-demand snapshots are useful when you want to capture specific points in time or when you need to dynamically expand the set of tables being captured without losing historical data.
Read Snapshots for more details of each snapshot type.
Signal table
= dialog
Choose a signal table. Signal tables should be pipeline-specific to avoid accidental overlap of snapshot executions.
Read Signal tables to learn more.
This parameter is only available when the On-demand snapshot checkbox is set to ✅.
Advanced settings
= dialog
This property is optional.
Set up any additional key:value parameters for your streaming pipeline using the documentation of your source connector.
Note
If you experience out of memory (OOM) errors on the Streaming agent, a common cause is the history.dat file growing very large over time. To solve this problem, edit the Streaming pipeline to add the following parameter to the Advanced settings property:
- Parameter: matillion.compact-history
- Value: true
This setting will cause the history.dat file to compact automatically when it grows too large.
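Because advanced settings are entered as key:value pairs, it can help to keep them as a mapping alongside your other pipeline notes and copy them into the dialog from there. A minimal Python sketch follows, in which the function and variable names are hypothetical and only the matillion.compact-history parameter comes from this guide:

```python
from __future__ import annotations


def format_advanced_settings(settings: dict[str, str]) -> list[str]:
    """Render each setting as a 'key: value' line, sorted by key for stable output."""
    return [f"{key}: {value}" for key, value in sorted(settings.items())]


# The only parameter taken from this guide; add others from your source connector's documentation.
advanced_settings = {"matillion.compact-history": "true"}

for line in format_advanced_settings(advanced_settings):
    print(line)
```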
Finish setting up
Click Save pipeline to finish creating your streaming pipeline. You will be redirected to the Streaming tab, where you'll see your new streaming pipeline with basic metadata listed, including:
- The name of the pipeline.
- The status of the pipeline.
- The source database.
- The destination.
View your pipeline overview
In the Name column, click the name of your new streaming pipeline to view an overview of your streaming pipeline.
This overview page offers more details about your pipeline, summarizing your pipeline's configuration and clarifying the current status of your pipeline.
- Click Edit to return to the configuration wizard for your pipeline.
- Click Run to start your pipeline.
- Click Stop to halt your pipeline.