Streaming pipelines overview
Editions
This feature is for customers on our Enterprise edition only. Visit Matillion pricing to learn more about each edition.
Streaming pipelines provide a complete end-to-end solution for near-real-time data ingestion, allowing you to capture data from a source database and write it to cloud storage or a data warehouse.
Streaming is a feature of Matillion's Data Productivity Cloud platform, and can be used alongside the Designer feature to give you more control and flexibility over your data. You access the Streaming feature via a Data Productivity Cloud project.
Matillion's Streaming pipeline architecture provides the following benefits:
- Data stays within your private cloud: The Streaming agent is deployed in your private cloud infrastructure (AWS or Azure), which means that the change data that's being streamed from the source and written to the destination by the Streaming pipeline remains within your infrastructure.
- Continuous operation: Once the Streaming agent is configured and started, it operates autonomously, requiring minimal intervention. The agent continuously monitors all changes occurring in the source database and delivers those changes to the designated destination data warehouse or cloud storage. This ensures a continuous and reliable change data capture process.
- Pipeline monitoring: The Data Productivity Cloud user interface allows you to monitor the status of each Streaming pipeline, ensuring visibility of the data capture process.
- Accurate representation of the source database: Streaming pipelines leverage low-level logs from the source database to provide a consistent and up-to-date picture of the tables you have selected in that database. This ensures that the captured data reflects the latest changes to those tables.
- Minimal resource usage: The log consumption approach used by Streaming pipelines keeps the resource usage on the source database to a minimum. The Streaming pipeline doesn't query the source table in full for each change, it "listens" for record changes, resulting in a low impact on the source database.
- Database snapshots: A snapshot allows you to establish an up-to-date view of the database before streaming begins, giving you a baseline from which to begin streaming changes.
Core concepts
The following key terms are used throughout the documentation and user interface.
- Source: The source is the database from which you are capturing data changes. The following sources are supported:
- Destination: The destination is a cloud data warehouse or cloud storage location where the agent will deliver change event data captured from the source database. The following destinations are supported:
- Agent: The agent is a service that is deployed and configured within your cloud infrastructure and is responsible for securely connecting, using credentials held in a secrets management application, to both the source database and the destination data warehouse or cloud storage. Read Streaming agent installation for details.
- Pipeline: A pipeline is a collection of configuration details, including source and destination configurations, that enable the Streaming agent to monitor and consume database changes, delivering them to the appropriate data warehouse or cloud storage location. Since there is a 1-to-1 relationship between a Streaming agent and a pipeline, multiple agents are required to configure and run multiple pipelines.
Key features of the Streaming pipeline architecture
Streaming pipelines use a Hybrid SaaS approach, with software agents hosted within your own AWS or Azure infrastructure controlling the data flow. The agent serves as a bridge between the source database and the destination data warehouse or cloud storage location, capturing and handling the data that flows through the Streaming pipeline. By keeping the agents within your own infrastructure, you retain control over data security and data sovereignty. Each agent controls the operation of a single Streaming pipeline, meaning you can scale to stream from multiple data sources by installing additional agents.
To enable configuration and monitoring through the Data Productivity Cloud interface, the Streaming agent establishes a secure connection back to the Matillion SaaS platform. This connection ensures secure communication between the agent and the platform for configuration updates, status monitoring, and agent management, though none of your streaming data is ever passed outside your infrastructure.
The Streaming agent requires access to a cloud secrets service, a secure store for the authentication and connection credentials (known as "secrets") that the agent uses to authenticate with the source database and establish a secure connection for capturing data changes. The secrets service can be either AWS Secrets Manager or Azure Key Vault, hosted in your own cloud infrastructure. For details of how the Data Productivity Cloud uses secrets, read Secrets overview.
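As an illustration of how an agent-style service might retrieve credentials from AWS Secrets Manager, the Python sketch below uses boto3 to fetch and parse a JSON secret. The secret name, region, and function name are hypothetical placeholders, not the agent's actual configuration:

import json
import boto3

# Fetch a JSON-formatted database credential from AWS Secrets Manager.
# The secret name and region below are illustrative placeholders.
def get_source_credentials(secret_name="streaming/source-db", region="us-east-1"):
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])  # e.g. {"username": "...", "password": "..."}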
When you create a Streaming pipeline, the following happens:
- Connecting to the source database: The Streaming agent establishes a connection with the source database, allowing it to capture data changes and track them in near real time.
- Snapshotting the source database: A database snapshot captures a point-in-time view of your source database, to establish a current view of the data before streaming begins. This snapshot is then the baseline from which to begin capturing changes to the data. Snapshotting is optional and must be selected during pipeline configuration if required.
- Processing change events: Once connected to the source database, the agent processes the change events occurring in the database. Data changes are captured, transformed into manageable units, and written out in near real time.
- Buffering the changes: The changes captured from the source database are buffered within the Streaming agent before being written out to the destination. This buffering allows for efficient handling and processing of the changes before writing them to files. The files containing the changes are written when one of the following conditions is met (a sketch of this flush logic follows this list):
- Time threshold: When the oldest change in a partition reaches a specific time threshold, the changes within that partition are written to a file. This ensures that changes are not held in the buffer for an extended period.
- File size limit: When the size of changes within a partition exceeds a specific limit, the changes are written to a file. This helps maintain manageable file sizes.
- Writing to the destination: Once processed, the change batches are sent to the destination data warehouse or cloud storage.
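The buffering behavior described above can be pictured as a per-partition buffer that flushes on whichever condition is met first. The following Python sketch is illustrative only; the class name and default thresholds are assumptions, not the agent's internals:

import time

# Illustrative per-partition buffer: flush when the oldest buffered change
# exceeds a time threshold, or when the buffered size exceeds a limit.
class PartitionBuffer:
    def __init__(self, max_age_seconds=60, max_bytes=5 * 1024 * 1024):
        self.max_age_seconds = max_age_seconds  # time threshold
        self.max_bytes = max_bytes              # file size limit
        self.records = []
        self.size_bytes = 0
        self.oldest_ts = None

    def add(self, record_bytes):
        if self.oldest_ts is None:
            self.oldest_ts = time.monotonic()   # arrival time of the oldest change
        self.records.append(record_bytes)
        self.size_bytes += len(record_bytes)

    def should_flush(self):
        if not self.records:
            return False
        too_old = time.monotonic() - self.oldest_ts >= self.max_age_seconds
        too_big = self.size_bytes >= self.max_bytes
        return too_old or too_big

    def flush(self):
        batch, self.records = self.records, []
        self.size_bytes, self.oldest_ts = 0, None
        return batch  # in the real agent, this batch is written to a file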
Format of change records
Change records are batched into files for transfer to the destination. These files use the Avro format, which is optimized for ingestion into a cloud data warehouse. The following example shows the common structure of a change record within the Avro file. Variations may occur for different sources.
{
    "before": null,
    "after": {
        "actor_id": "70ac0033-c25b-7687-5a86-6861c08cabdd",
        "first_name": "john",
        "last_name": "smith",
        "last_update": 1635638400123456,
        "version": 0
    },
    "metadata": {
        "connector": "postgresql",
        "db": "postgres_db",
        "key": ["actor_id"],
        "lsn": 37094192,
        "name": "matillion",
        "op": "r",
        "schema": "public",
        "table": "actor",
        "ts_ms": 1635638400234,
        "txId": 543,
        "version": "1.7.1.Final"
    }
}
The op field contains the type of the change for this record:
- r: Read (during snapshot, if the snapshot option was selected in pipeline configuration)
- c: Create
- d: Delete
- u: Update
The before and after fields contain the values in that record as they were before and after the change was applied, and as such the fields will differ by table. In cases where a record is created, the value for the before field will be empty (null); in cases where a record is deleted, the value for the after field will be empty (null).
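To make the op semantics concrete, the Python sketch below reads a change-record file with the fastavro library and replays it into an in-memory table keyed on the record's key columns. The file name and the in-memory table are illustrative assumptions; real pipelines deliver these files to the destination warehouse or cloud storage for ingestion:

from fastavro import reader

# Replay change records into an in-memory dict keyed on the key columns.
def apply_change(record, table):
    meta = record["metadata"]
    key_cols = meta["key"]                 # e.g. ["actor_id"]
    if meta["op"] in ("r", "c", "u"):      # snapshot read, create, update
        row = record["after"]              # post-change image of the record
        table[tuple(row[c] for c in key_cols)] = row
    elif meta["op"] == "d":                # delete: "after" is null
        row = record["before"]             # key comes from the pre-change image
        table.pop(tuple(row[c] for c in key_cols), None)

table = {}
with open("changes.avro", "rb") as f:      # illustrative file name
    for change in reader(f):
        apply_change(change, table)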
Next steps
- Read Streaming agent installation to learn how to install and configure the agent in your AWS or Azure infrastructure.
- Read Create a Streaming pipeline to learn how to set up your first Streaming pipeline.