Streaming pipelines overview
Editions
This feature is for customers on our Enterprise edition only. Visit Matillion pricing to learn more about each edition.
Streaming pipelines provide a complete end-to-end solution for near-real-time data ingestion, allowing you to capture data from a source database and write it to either cloud storage or a cloud data warehouse.
Streaming pipelines can be used together with the Data Productivity Cloud's batch pipelines to give an end-to-end Extract, Load, Transform (ELT) process. You can use a Streaming pipeline to Extract data from a source and Load it into a cloud data warehouse, and then use a transformation pipeline created in Designer to Transform the data for downstream purposes.
You access the Streaming feature via a Data Productivity Cloud project.
Core concepts
The following key terms are used throughout the documentation and user interface.
Sources
The source is the database from which you are capturing data changes. The following sources are supported:
Destinations
The destination is a cloud data warehouse or cloud storage location where the agent will deliver change event data captured from the source database. The following destinations are supported:
You can also use a Data Productivity Cloud pre-built pipeline to process change event data delivered to cloud storage and load it into Snowflake, Databricks, or Amazon Redshift.
Agents
The agent is a service that is deployed and configured within your cloud infrastructure. It is responsible for securely connecting to both the source database and the destination data warehouse or cloud storage, using credentials it retrieves from a secrets management application. Read Create streaming agent for details.
Pipelines
A pipeline is a collection of configuration details, including source and destination configurations, that enable the Streaming agent to monitor and consume database changes, delivering them to the appropriate data warehouse or cloud storage location. Since there is a 1-to-1 relationship between a Streaming agent and a pipeline, multiple agents are required to configure and run multiple pipelines. Read Create a Streaming pipeline for details.
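To make "a collection of configuration details" concrete, the following sketch shows the kind of information a pipeline bundles together. This is purely illustrative and is not Matillion's actual configuration format; the keys, names, and values are hypothetical, and in practice you define all of this through the Data Productivity Cloud UI.

```python
# Hypothetical illustration of what a Streaming pipeline's configuration bundles together.
# This is NOT Matillion's actual configuration format; pipelines are configured in the UI.
pipeline_config = {
    "name": "orders-cdc",
    "source": {
        "type": "postgresql",
        "secret_name": "source-db-credentials",   # credentials held in a secrets service
        "tables": ["public.orders", "public.customers"],
    },
    "destination": {
        "type": "cloud_storage",
        "bucket": "example-cdc-landing",
        "prefix": "streaming/orders-cdc/",
    },
}
```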
Key features of the Streaming pipeline architecture
Streaming pipelines use a Hybrid SaaS approach, with software agents hosted within your own cloud provider infrastructure controlling the data flow. The agent serves as a bridge between the source database and the destination data warehouse or cloud storage location, capturing and handling the data that flows through the Streaming pipeline. By keeping the agents within your own infrastructure, you retain control over data security and data sovereignty. Each agent controls the operation of a single Streaming pipeline, meaning you can scale to stream from multiple data sources by installing additional agents.
This approach provides the following benefits:
- Data stays within your private cloud: The Streaming agent is deployed in your private cloud infrastructure, which means that the change data that's being streamed from the source and written to the destination by the Streaming pipeline remains within your infrastructure, and you retain control over data security and data sovereignty.
- Continuous operation: Once the Streaming pipeline agent is configured and started, it operates autonomously, requiring minimal intervention. The agent continuously monitors all changes occurring in the source database and delivers those changes to the designated destination data warehouse or cloud storage. This ensures a continuous and reliable change data capture process.
- Pipeline monitoring: The Data Productivity Cloud user interface allows you to monitor the status of each Streaming pipeline, ensuring visibility of the data capture process.
- Accurate representation of the source database: Streaming pipelines read database change logs from the source database to provide a consistent and up-to-date picture of the tables you have selected in that database. This ensures that the captured data reflects the latest changes to those tables.
- Minimal resource usage: The log consumption approach used by Streaming pipelines keeps the resource usage on the source database to a minimum. The Streaming pipeline doesn't query the source table in full for each change, it "listens" for record changes, resulting in a low impact on the source database.
- Database snapshots: Snapshotting the source database of a Streaming pipeline allows you to establish an up-to-date view of the database, giving you a baseline from which streaming changes begin. A snapshot can be taken at any time after the pipeline has started, to allow the loading of historical data. You can specify the historical data that the snapshot will load, and you can control the timing of the snapshotting to minimize disruption to ongoing streaming.
To enable configuration and monitoring through the Data Productivity Cloud interface, the Streaming agent establishes a secure connection back to the Matillion SaaS platform. This connection ensures secure communication between the agent and the platform for configuration updates, status monitoring, and agent management, though none of your streaming data is ever passed outside your infrastructure.
The Streaming agent requires access to a cloud secrets service, which is a secure storage system for storing authentication and connection credentials (known as "secrets"). These secrets are used by the agent to authenticate itself with the source database and establish a secure connection for capturing the data changes. For details of how the Data Productivity Cloud uses secrets, read Secrets overview.
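As an illustration of the kind of lookup involved, here is a minimal sketch using boto3 and AWS Secrets Manager, assuming AWS is your cloud provider; the secret name and field names are hypothetical, and the agent performs this retrieval itself rather than you doing it in code.

```python
# Minimal sketch: reading database credentials from AWS Secrets Manager with boto3.
# Assumes AWS as the cloud provider; the secret name and fields are hypothetical.
import json
import boto3

client = boto3.client("secretsmanager", region_name="us-east-1")

# The Streaming agent performs an equivalent lookup itself; this only illustrates
# the shape of a stored secret, not the agent's internal API.
response = client.get_secret_value(SecretId="source-db-credentials")
credentials = json.loads(response["SecretString"])

print(credentials["username"])  # e.g. the database user the agent connects as
```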
This architecture is illustrated in the following diagram.
When you create a Streaming pipeline, the following happens:
- Connecting to the source database: The Streaming agent establishes a connection with the source database, allowing it to capture data changes and track them in near real time.
- Snapshotting the source database: An optional database snapshot captures a point-in-time view of your source database, to establish a current view of the data as a baseline from which to begin capturing changes to the data. Read Overview of snapshotting for details.
- Processing change events: Once connected to the source database, the agent processes the change events occurring in the database. The agent reads the database change logs, captures the data changes, and transforms them into manageable units that are written out in near real time.
- Buffering the changes: The changes captured from the source database are buffered within the Streaming agent before being written out to the destination. This buffering allows for efficient handling and processing of the changes before writing them to files (see the illustrative sketch after this list). The files containing the changes are written when one of the following conditions is met:
    - Time threshold: When the oldest change in a partition reaches a specified time threshold, the changes within that partition are written to a file. This ensures that changes are not held in the buffer for an extended period.
    - File size limit: When the size of the changes within a partition exceeds a specified limit, the changes are written to a file. This helps maintain manageable file sizes.
- Writing to the destination: Once processed, the change batches are sent to the destination data warehouse or cloud storage.
- If required, you can make use of our pre-built pipelines to load streamed data from cloud storage to a cloud data warehouse.
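The buffering behavior described above can be pictured with a small sketch. The following Python snippet is purely illustrative; the class name, threshold values, and flush mechanics are assumptions rather than the agent's actual implementation. It shows how a partition buffer might flush when either the oldest buffered change exceeds a time threshold or the accumulated size exceeds a file size limit.

```python
import time

# Illustrative only: a per-partition buffer that flushes on either a time or size threshold.
# The threshold values and flush target are hypothetical, not the agent's real settings.
class PartitionBuffer:
    def __init__(self, max_age_seconds=60, max_bytes=64 * 1024 * 1024):
        self.max_age_seconds = max_age_seconds
        self.max_bytes = max_bytes
        self.changes = []          # buffered change records
        self.size_bytes = 0
        self.oldest_ts = None      # arrival time of the oldest buffered change

    def add(self, record: bytes):
        if self.oldest_ts is None:
            self.oldest_ts = time.monotonic()
        self.changes.append(record)
        self.size_bytes += len(record)
        if self._should_flush():
            self.flush()

    def _should_flush(self) -> bool:
        too_old = (time.monotonic() - self.oldest_ts) >= self.max_age_seconds
        too_big = self.size_bytes >= self.max_bytes
        return too_old or too_big

    def flush(self):
        # In the real pipeline the buffered changes are written to an Avro file in the
        # destination; here we just clear the buffer. A real agent would also flush on
        # a timer, even when no new changes arrive.
        print(f"writing {len(self.changes)} changes ({self.size_bytes} bytes) to a file")
        self.changes.clear()
        self.size_bytes = 0
        self.oldest_ts = None
```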
Format of change records
Change records are batched into files for transfer to the destination. These files use the Avro format, which is optimized for ingestion into a cloud data warehouse. The following example shows the common structure of a change record within the Avro file. Variations may occur for different sources.
```json
{
    "before": null,
    "after": {
        "actor_id": "70ac0033-c25b-7687-5a86-6861c08cabdd",
        "first_name": "john",
        "last_name": "smith",
        "last_update": 1635638400123456,
        "version": 0
    },
    "metadata": {
        "connector": "postgresql",
        "db": "postgres_db",
        "key": ["actor_id"],
        "lsn": 37094192,
        "name": "matillion",
        "op": "r",
        "schema": "public",
        "table": "actor",
        "ts_ms": 1635638400234,
        "txId": 543,
        "version": "1.7.1.Final"
    }
}
```
The `op` field contains the type of the change for this record:
- `c`: Create
- `r`: Read (during snapshotting; read Overview of snapshotting for details)
- `u`: Update
- `d`: Delete
The `before` and `after` fields contain the values in that record as they were before and after the change was applied, and as such the fields will differ by table. In cases where a record is created, the value for the `before` field will be empty (`null`); in cases where a record is deleted, the value for the `after` field will be empty (`null`).
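To make the `op`, `before`, and `after` semantics concrete, here is a minimal sketch of how a consumer might apply a parsed change record to an in-memory table keyed by the record's primary key. The record structure follows the example above; the function and variable names are assumptions for illustration and are not part of the product.

```python
# Illustrative consumer: apply one parsed change record (shaped like the example above)
# to an in-memory table keyed by the record's primary key. Names are hypothetical.
def apply_change(table: dict, record: dict) -> None:
    metadata = record["metadata"]
    op = metadata["op"]
    key_columns = metadata["key"]          # e.g. ["actor_id"]

    if op in ("c", "r", "u"):              # create, snapshot read, update
        row = record["after"]
        key = tuple(row[col] for col in key_columns)
        table[key] = row                   # upsert the post-change image
    elif op == "d":                        # delete
        row = record["before"]
        key = tuple(row[col] for col in key_columns)
        table.pop(key, None)               # remove the row if present

# Example usage with the sample record shown above:
# table = {}
# apply_change(table, parsed_record)
```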
Schema drift
Schema drift occurs when your data source changes its metadata. Fields, columns, and data types can be added, removed, or changed on the fly. If you don't account for schema drift, your data flow becomes vulnerable to changes in the upstream data source.
Streaming pipelines can adapt to these source schema changes and propagate them to the destination. These "self-healing" pipelines not only save time but also significantly reduce the number of job failures and delays caused by schema drift, ensuring that data continues to flow, uninterrupted, from source to destination.
In a Streaming pipeline, the agent loads data to your chosen destination cloud storage area, and the data is then transformed to accommodate schema changes. Changes are driven by data rather than by DDL, which means that any changes made to the schema of the source tables won't be reflected in the target until the data changes.
The Streaming pipeline handles specific schema changes in accordance with the following principles:
- Added tables: You must create a new Streaming pipeline to use a new source table; it's not automatically included in the existing pipeline.
- Deleted tables: If a table is deleted/dropped in your source, it's not dropped on the destination platform but is retained for historical purposes. You can manually drop this table in the destination if needed.
- Added columns: The Streaming pipeline will pick up any new columns added to selected tables if the new column contains data.
- Deleted columns: If a column is dropped from the source table, it will remain in your destination table and historical data will be retained up to the point of dropping. Values for this column after it is dropped will be inserted as NULL.
- Data type changes: Data type changes will be accommodated, but if they aren't compatible with the destination cloud data platform (that is, they are data types the destination doesn't support), they are handled in the following way:
    - The column will be renamed as `<column_name>_<datetime>`, where `<datetime>` has the format `_yyyymmddhhmmss` and will be the same for all columns in the same table in the same Streaming pipeline execution.
    - The new column will be NULL up to the date of change, which needs to be considered for downstream dependencies such as views and reports.
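As an illustration of the renaming convention, the sketch below shows what a `<column_name>_<datetime>` rename looks like for a single execution timestamp. The column name and timestamp are hypothetical, and the rename is performed by the pipeline itself rather than by you.

```python
from datetime import datetime, timezone

# Illustrative only: what the <column_name>_<datetime> rename looks like.
# The pipeline performs this rename itself; the column name here is hypothetical.
execution_time = datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc)
suffix = execution_time.strftime("_%Y%m%d%H%M%S")   # the _yyyymmddhhmmss part

renamed_column = f"last_update{suffix}"
print(renamed_column)  # last_update_20240115093000
```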
Next steps
- Read Streaming agent installation to learn how to install and configure the Streaming agent in your cloud infrastructure.
- Read Create a Streaming pipeline to learn how to set up a Streaming pipeline.