CDC shared jobs overview
Change Data Capture (CDC) and processing data from your chosen sources—using a CDC agent to load data to your chosen cloud storage area—is supported within Matillion ETL by three shared jobs. These shared jobs facilitate two data-loading use cases.
Data is loaded to your chosen cloud storage area by the CDC agent installed on your source database. The data is loaded by the agent into your choice of Amazon S3, Google Cloud Storage, or Azure Blob Storage. The CDC shared jobs can then read that data from cloud storage, and that data is loaded into target tables on your chosen data platform. The process can load a single table or all tables, based on the chosen configuration and shared job.
These shared jobs do not work with the Manage Change Data Capture feature in Matillion ETL. Instead, these shared jobs are for the CDC loading feature in Data Loader.
The processes within the shared jobs gather information by reading the files from cloud storage (including metadata regarding the source, any change events, and also whether any source table changes have occurred). This information is then used to facilitate a workflow to load agent-provided data into the target, based on the files loaded into cloud storage.
CDC shared jobs come prepackaged with the latest versions of Matillion ETL and are updated with each version. For access to faster shared job updates, the latest can be downloaded from the Matillion Exchange using the below link.
When using Amazon Redshift, a shared job won't extract columns that contain upper case letters. You may see an error message regarding NULL values or the job may still complete, but with NULL values in place of the data that wasn't extracted. To resolve this issue, you must set the Redshift parameter,
enable_case_sensitive_identifierto True. You can do this by altering the user, or updating the Redshift Parameter Group. See here for examples.
Schema drift support is also available and can be toggled using the appropriate job parameter
Schema Drift Action, which is fully described in the appropriate configuration guide for your use case. Additionally, see the section further down on schema drift support, which explains how this functionality is implemented.
There may be nuances in terms of the data typing conversion and support between sources to each cloud data platform. These should be considered when creating your workflow with regard to what each data type will be presented as in your target table or tables. This is covered below.
Blob Storage prerequisites
If the CDC files are stored in Azure Blob Storage, the following Python 3 packages must be installed on the Matillion ETL instance:
Below you will find two use cases. Configure the shared jobs dependent on your use case for your cloud data platform as per the relevant section. The suggested default is Use Case 1 (Sync All Tables), as this is a self contained process, bearing in mind the caveat surrounding key requirements for the Copy Table transformation type, which is explained in the configuration help below.
Use Case 1 - Sync All Tables
The purpose of this use case is to process CDC data from a Snowflake external table for all source tables, and then keep matching target tables in sync with the source tables.
First, place the Sync All Tables shared job onto the job canvas of an orchestration job from the Shared Jobs palette.
The Sync All Tables shared job contains a call to the Create or Refresh External Table shared job by default on Snowflake, so a standalone process to create or refresh the external table does not need to be additionally added to your workflow when using the Sync All Tables shared job on Snowflake.
Use Case 2 - Sync Single Table
The purpose of this use case is to process CDC data from a Snowflake external table for a single source table, and then keep a target table in sync with the source table.
First, place the Sync Single Table shared job onto the job canvas of an orchestration job from the Shared Jobs palette.
(Snowflake only) To ensure that the external table exposes the data from Amazon S3, Google Cloud Storage, or Azure Blob Storage, the shared job Create or Refresh External Table must be run to provide an updated view of the data in your chosen cloud storage area.
Therefore, the Create or Refresh External Table shared job is added to the orchestration job canvas. These shared jobs are linked, as in the following workflow.
Shared job links
Schema drift support
Schema drift support accommodates changes made to the source data, such as:
- Missing columns as a result of changes in the source schema. Missing columns are loaded as NULL.
- Data type changes for specified columns in the shared job configuration.
- Tables no longer present in the source schema. Any missing tables will no longer be loaded. However, your shared job will fail. All other tables specified as part of the configuration in the shared job will be loaded. If this scenario occurs, edit your shared job Table and Columns grid variable to remove the missing table.
To learn more, read Schema Drift.
Data type changes
Data type changes will also be accommodated, but if these are not compatible changes for the target cloud data platform, the current column will be renamed as
<column_name>_datetime and the column repurposed as the new data type. The format of the datetime extension is
_yyyymmddhhmmss, for example,
_20210113110334, and will be the same for all columns in the same table in the same shared job configuration. The new column will be NULL up to the date of change. This should be considered for downstream dependencies such as views and reports.
Data type tables can be found at Data typing with CDC shared jobs.
Databricks runtime releases
The CDC shared jobs in Matillion ETL currently support LTS 9.1 or later.
Resetting a pipeline
On rare occasion, such as major CDC agent changes, it may be required to fully reset your CDC pipeline. Do this by following these steps:
- Ensure your existing agent is running the latest version.
- Stop the CDC pipeline. The pipeline does not need to be deleted and can be resumed later.
- Stop the scheduled shared job executions in Matillion ETL.
- Clear all files in the storage location specified in the CDC pipeline. This includes any offset and history files.
- Import the latest .melt file containing the shared jobs from Matillion Exchange, using Project → Manage Shared Jobs → Import.
- Clear all pipeline tables in the warehouse. The Drop CDC Tables shared job on the Exchange can be used to identify all tables created by CDC, and optionally drop them.
- In all Matillion ETL orchestration jobs that call CDC shared jobs to be updated:
- Add the updated CDC shared jobs to the canvas as appropriate.
- Copy across all the properties from your old configurations.
- Delete any outdated shared jobs.
- Connect the new shared job to the workflow.
- Resume the pipeline and enable shared job schedules.