CDC Pipeline Overview

A pipeline in Matillion CDC is the complete definition of a single CDC setup: data is replicated from your source database and the resulting changes arrive in your cloud storage area.

Process

CDC pipelines can be understood as a set of four sequential steps, pictured together in the illustrative sketch after the list below: launching your Streaming agent, setting up your source database, configuring your destination storage area, and managing your pipeline settings.

  1. Launch a Streaming agent: The Matillion Streaming agent runs in your cloud provider and orchestrates the CDC process by managing the pipeline's data loading tasks.
  2. Set up your source: Connect to your source database or service and configure the database to support CDC.
  3. Configure your destination: Connect to your destination storage service, usually Amazon S3 or Azure Blob Storage.
  4. Configure your pipeline settings: Configure overall pipeline behavior, such as the frequency at which it runs.
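
As a purely illustrative sketch, the four configuration areas can be pictured as fields of a single pipeline definition. The classes and field names below are hypothetical and do not reflect Matillion's actual configuration schema; the real pipeline is configured through the user interface.

    from dataclasses import dataclass

    # Hypothetical sketch of the four configuration areas of a CDC pipeline.
    # All names are illustrative; the real pipeline is set up in the UI.

    @dataclass
    class AgentConfig:
        name: str              # a registered Streaming agent ("Connected")

    @dataclass
    class SourceConfig:
        host: str
        database: str
        schema: str
        tables: list[str]      # the tables to replicate

    @dataclass
    class DestinationConfig:
        provider: str          # e.g. "s3" or "azure_blob"
        container: str         # storage container receiving change files

    @dataclass
    class PipelineSettings:
        name: str              # unique pipeline name
        snapshotting: bool     # full snapshotting on or off

    @dataclass
    class CdcPipeline:
        agent: AgentConfig
        source: SourceConfig
        destination: DestinationConfig
        settings: PipelineSettings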

Additionally, if you have access to Matillion ETL, you can choose to complete your CDC journey by using the stored change data to update a cloud data warehouse table using a Matillion ETL shared job.

For details of how to create and configure a pipeline, read Create and Manage a CDC pipeline.


Agent

Every CDC pipeline requires a Matillion Streaming agent to orchestrate the data loading tasks. Agents run on containers in your cloud platform and can be set up using our provided templates.

The Agents dashboard lists currently registered agents and shows their status.

  • A Streaming agent can be assigned to only a single CDC pipeline at a time, so each pipeline requires its own agent.
  • Agents must be installed within the cloud service provider of your choosing before a pipeline can be created.
  • Agent status must be "Connected" for a pipeline to succeed; a basic check that the agent container is running is sketched after this list.
  • Agent installation can be a technically involved process. Read the documentation for agent installation and consult your cloud administrator for help and permissions where required.
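
Since agents run as containers, one hedged way to confirm an agent container is up, if you launched it with Docker, is the sketch below. It uses the Docker SDK for Python; the container name matillion-cdc-agent is an assumption, so substitute the name from your own deployment. The Agents dashboard remains the definitive status view.

    import docker                       # pip install docker
    from docker.errors import NotFound

    # Sketch: confirm a locally launched agent container is running.
    # "matillion-cdc-agent" is an assumed name; use the name from your
    # own agent deployment template.
    client = docker.from_env()
    try:
        container = client.containers.get("matillion-cdc-agent")
        print(f"Agent container status: {container.status}")  # e.g. "running"
    except NotFound:
        print("Agent container not found; has it been launched?")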

Source

Your CDC pipeline requires you to select a database, a schema, and the tables that you want to replicate. Source databases must be configured for CDC before they can be included in a pipeline. Please read the CDC Sources Overview and consult your database administrator as required.
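
As one hedged example, if your source is PostgreSQL, CDC relies on logical replication; the sketch below (using psycopg2, with placeholder connection details) checks the wal_level setting before the database is added to a pipeline. The exact requirements differ per source, so treat the CDC Sources Overview as authoritative.

    import psycopg2  # pip install psycopg2-binary

    # Sketch: check that a PostgreSQL source allows logical replication,
    # which CDC commonly requires. All connection values are placeholders.
    conn = psycopg2.connect(
        host="source-db.example.com",
        dbname="inventory",
        user="cdc_user",
        password="********",
    )
    with conn.cursor() as cur:
        cur.execute("SHOW wal_level;")
        (wal_level,) = cur.fetchone()
        print(f"wal_level = {wal_level}")  # expect "logical" for CDC
    conn.close()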

You should also be aware that:

  • The Streaming agent will require connection details, including the host name, database name, and login credentials, to connect to the source database.
  • You are required to store login passwords in a secrets manager (AWS or Azure) that should be set up in advance; a minimal AWS example is sketched after this list.
  • Some data sources allow you to specify JDBC connection parameters. The documentation for your source data model will provide information about supported JDBC connection settings.
  • Your choice of source tables may limit your transformation options when moving data from storage to a data warehouse.
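
Because the source password must live in a secrets manager before the pipeline is created, a minimal sketch of the AWS case is shown below, using boto3. The secret name, region, and password value are all assumptions; Azure Key Vault offers an equivalent flow through the azure-keyvault-secrets library.

    import boto3  # pip install boto3

    # Sketch: store the source database password in AWS Secrets Manager
    # ahead of pipeline creation. Secret name and region are assumptions.
    client = boto3.client("secretsmanager", region_name="us-east-1")
    client.create_secret(
        Name="cdc/source-db-password",         # hypothetical secret name
        SecretString="my-source-db-password",  # never hard-code in practice
    )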

Destination

The destinations for CDC pipelines are cloud storage containers that receive the change data. This data can then be used to update a cloud data warehouse using a Matillion ETL shared job.

  • You will need to know the details of your storage container, including its name, the account name, and the password.
  • You are required to store login passwords in a secrets manager (AWS or Azure) that should be set up in advance.
  • Verify in advance that you have access to your storage container; a quick access check is sketched after this list.
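
For an Amazon S3 destination, a check along the lines of the sketch below (boto3; the bucket name is a placeholder) confirms that your credentials can reach the container before you create the pipeline.

    import boto3  # pip install boto3
    from botocore.exceptions import ClientError

    # Sketch: confirm access to the destination S3 bucket. The bucket
    # name is a placeholder for your actual container.
    s3 = boto3.client("s3")
    try:
        s3.head_bucket(Bucket="my-cdc-destination-bucket")
        print("Destination bucket is reachable.")
    except ClientError as err:
        print(f"Cannot access bucket: {err}")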

Settings

Pipeline settings are configured during pipeline creation and can be changed at a later date through the Manage Pipelines screen.

  • Pipeline Name: Provide a unique name for the pipeline.
  • Snapshotting: Turn full snapshotting off or on. For more detailed information on snapshotting options, see the Advanced Settings article for your chosen CDC source.

Managing pipelines

Once created, pipelines are managed through the Data Loader pipeline dashboard. From here, you can view the status of all pipelines, and you can start, stop, or delete any pipeline.

Note

When deleting and recreating a CDC pipeline, you must clear out the files that the pipeline places in your cloud storage. If you don't, the new pipeline will recognize the existing offset.dat file and will therefore skip the snapshot phase.
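
For an S3 destination, clearing out the old pipeline's files could look like the hedged sketch below, using boto3. The bucket name and prefix are placeholders for wherever your pipeline writes its files, including offset.dat.

    import boto3  # pip install boto3

    # Sketch: remove a deleted pipeline's files (including offset.dat) so
    # a recreated pipeline starts with a fresh snapshot. Bucket and prefix
    # are placeholders for your actual storage layout.
    s3 = boto3.client("s3")
    bucket = "my-cdc-destination-bucket"
    prefix = "old-pipeline/"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": objects})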