Sync All Tables shared job
Process CDC data for all source tables, and then keep matching target tables in sync with the source tables.
Properties
Snowflake properties
Cloud Storage URL
= string
The URL of the location where the CDC change event files are stored. The URL should have the format s3://<bucket>/<prefix>/ or azure://<account>.blob.core.windows.net/<container>/<prefix>/.
Warehouse
= string
The Snowflake warehouse used to process the data.
Target Database
= string
The Snowflake database containing the external table with change events and the target table.
Target Schema
= string
The Snowflake schema containing the external table with change events and the target table.
Snowflake External Stage
= string
The name of the existing external stage that contains the files output by the CDC pipeline. For a full explanation of how and why you would create a Snowflake external stage, read Manage Stages.
External Table
= string
The external table in Snowflake containing the CDC change events.
Concurrency
= drop-down
How to iterate over the source tables for which there is CDC event data:
- Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default.
- Concurrent: Iterations are run concurrently.
Target Table Prefix
= string
A prefix to prepend to the source table name to generate the target table name. If no target table prefix is specified, the target table will have the same name as the source table.
Fully Qualify Target Table (Recommended)
= boolean
Optionally includes the source database and schema in the target table name.
- Y: The target table name will have the format <source_database>_<source_schema>_<source_table>. This is the default setting.
- N: The target table name will have the format <source_table>.
It's strongly recommended to leave this set to Y, unless you are confident that your source table names will always be unique.
If the target table name isn't fully qualified, and there are source tables with the same name in different databases or schemas, this job will see them as the same table and attempt to merge their metadata and contents.
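To make the collision risk concrete, here is a small sketch with hypothetical source names showing how the two settings resolve target table names:

```python
# Hypothetical example: the same table name exists in two source databases.
sources = [("sales_db", "public", "orders"), ("billing_db", "public", "orders")]

# Y (fully qualified): each source maps to its own target table.
qualified = {f"{db}_{schema}_{table}" for db, schema, table in sources}
print(qualified)    # {'sales_db_public_orders', 'billing_db_public_orders'}

# N (not qualified): both sources map to a single target table, and the job
# would attempt to merge their metadata and contents into it.
unqualified = {table for _db, _schema, table in sources}
print(unqualified)  # {'orders'}
```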
Transformation Type
= drop-down
The type of transformation used when applying the change events to the target table.
- Copy Table: The target table will be maintained as a copy of the source table.
- Copy Table with Soft Deletes: Same as Copy Table, but records deleted in the source table will be retained in the target table.
- Change Log: All change events will be extracted and appended to the target table.
Note
A primary key is required on the source table for Copy Table and Copy Table with Soft Deletes transformations. The primary key is used by the shared job to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to Change Log.
Note
Change Log is disabled in this version of the shared job and will be enabled in a future release.
Warning
If you intend to change the transformation type for an existing or running workflow, you must drop the target tables before changing it. The transformation types are mutually exclusive and can't be applied to the same target tables.
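To illustrate the difference between the transformation types, here is a simplified, hypothetical sketch using Python dicts. The shared job operates on tables and merge operations, not dicts, and the event shape below is illustrative only:

```python
# Illustrative change events: "c"=create, "u"=update, "d"=delete, keyed by primary key.
events = [
    {"op": "c", "key": 1, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "key": 1, "after": {"id": 1, "name": "Ada Lovelace"}},
    {"op": "d", "key": 2, "after": None},
]

# Change Log: every event is appended to the target; nothing is merged.
change_log = list(events)

# Copy Table: only the final event per primary key needs to be applied,
# because its "after" object already carries the net result of earlier changes.
latest = {e["key"]: e for e in events}

target = {2: {"id": 2, "name": "Bob"}}      # pretend existing target rows
for key, event in latest.items():
    if event["op"] == "d":
        target.pop(key, None)               # Copy Table removes the row;
        # Copy Table with Soft Deletes would keep the row and flag it as deleted.
    else:
        target[key] = event["after"]        # merge inserts/updates by primary key

print(target)   # {1: {'id': 1, 'name': 'Ada Lovelace'}}
```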
Append All Metadata
= boolean
Append all metadata columns to the target table. Options are:
- Y: Append all metadata columns.
- N: Only append metadata columns required by this job.
For a list of metadata columns, and how the field values are derived, read Append Metadata.
Azure Key Vault URL
= string
Required for Azure Blob Storage
The URL of the Azure Key Vault used to store the Azure Blob Storage connection string. This is also called the vault's "DNS Name".
Azure Blob Storage Account Key Secret
= string
Required for Azure Blob Storage
Name of the Azure Key Vault secret that stores the Azure Blob Storage account key.
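For reference, the vault URL and secret name configured above are the same values you would use to read the account key yourself. A minimal sketch using the azure-identity and azure-keyvault-secrets Python packages, with hypothetical names:

```python
# Sketch: reading a storage account key from Azure Key Vault.
# The vault URL and secret name below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://my-cdc-vault.vault.azure.net/"  # Azure Key Vault URL (the vault's "DNS Name")
secret_name = "blob-storage-account-key"             # Azure Blob Storage Account Key Secret

client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
account_key = client.get_secret(secret_name).value
print(f"Retrieved account key of length {len(account_key)}")
```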
Schema Drift Action
= drop-down
Select which action should be taken when schema changes are detected in the source table and require an update to the target table.
- Update Target: The target table is updated according to the new schema. This is the default setting.
- Fail Job: This job will fail, causing no changes to the target table.
Logging Enabled
= string
Enable logging of row counts from transformation jobs. Options are:
- Y: Log row counts.
- N: Don't log row counts.
Note
Logged row counts may not match the number of rows processed by the Streaming agent. For the Copy Table and Copy Table with Soft Deletes transformation types, if a given record has more than one change event in a job execution, only the final change event will be processed. The "after" object of the final change event contains the net changes from all the previous events.
Log Table
= string
The name of the table in Snowflake containing the logged events.
Trim All String Columns
= string
Remove trailing whitespace from string columns. Options are:
- Y: Remove trailing whitespace.
- N: Don't remove trailing whitespace.
Tag Processed Files
= string
Once the Avro files have been processed, update the tags or metadata in cloud storage. This enables a lifecycle management policy to be applied to files to reduce storage costs. Options are:
- Y: Update tags/metadata.
- N: Don't update tags/metadata.
For S3 and Azure Blob Storage, a tag of matillion_cdc_processed = true will be added to each file. If a file already has the maximum 10 tags, the tagging process will be skipped.
For Google Cloud Storage, the Custom-Time metadata will be updated to be the time the file was processed. If the Custom-Time has already been set to a value in the future, the metadata update will be skipped.
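As an example of the lifecycle management this enables, the sketch below (using boto3, with a hypothetical bucket name and retention period) expires S3 objects that carry the matillion_cdc_processed = true tag after 30 days:

```python
# Sketch: an S3 lifecycle rule that expires processed CDC files after 30 days.
# The bucket name and retention period are hypothetical.
# Note: this call replaces any existing lifecycle configuration on the bucket.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-cdc-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-processed-cdc-files",
                "Status": "Enabled",
                # Only objects tagged by the shared job are affected.
                "Filter": {"Tag": {"Key": "matillion_cdc_processed", "Value": "true"}},
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```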
Bytes to Decimal Function
= string
For some source databases (e.g. PostgreSQL and Oracle), numeric columns can be defined without specifying a scale or precision, and these columns are able to contain values with different scales and precisions. When these columns are extracted to the Avro files, they are denoted as the type VariableScaleDecimal.
A VariableScaleDecimal value is a variant object with two properties: the value stored as a byte array, and a numeric scale. Snowflake SQL can't natively convert these byte arrays back to the original numeric format. If your source tables contain such columns, a User Defined Function is required by the shared job to convert the values.
If no value is specified for this parameter, the shared job will not attempt to convert the byte arrays, and will load them to columns of type variant.
If a value is specified, the shared job will check whether the user defined function already exists. If it doesn't exist, the job will attempt to create it. If the job can't create the function due to insufficient privileges, the job will fail, and the function must be created in the Snowflake Console.
If the parameter is set to a value like BYTES_TO_DECIMAL, the job will look for the function in the environment default database and schema.
If the parameter is set to a value like CDC_DB.PUBLIC.BYTES_TO_DECIMAL, the job will look for the function in the PUBLIC schema of the CDC_DB database.
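The conversion such a function needs to perform is small: the byte array typically holds the unscaled value as a big-endian, two's-complement integer (the form produced by Java's BigInteger.toByteArray), which is then divided by 10 to the power of the scale. A minimal Python sketch of that logic, assuming that encoding (an illustration, not the shared job's implementation):

```python
# Sketch of the arithmetic a bytes-to-decimal conversion performs, assuming the
# unscaled value is a big-endian, signed (two's-complement) byte array.
from decimal import Decimal

def bytes_to_decimal(value: bytes, scale: int) -> Decimal:
    unscaled = int.from_bytes(value, byteorder="big", signed=True)
    return Decimal(unscaled) / (Decimal(10) ** scale)

# Example: bytes 0x30 0x39 encode the integer 12345; with scale 2 that is 123.45.
print(bytes_to_decimal(b"\x30\x39", 2))   # 123.45
```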
Use Source Schemas
= string
Optionally create the target tables in a schema with the same name as the source schema. If the schema doesn't already exist, the job will try to create it.
- Y: Create the target tables in a schema with the same name as the source schema.
- N: Create the target tables in the schema specified by the Target Schema parameter.
Note
Whichever option is selected, the external tables will still be created in the schema specified by the Target Schema parameter.
Primary Key Override
= column editor
Optionally provide a list of primary key columns for the source tables.
By default, the job will read the primary key columns from the CDC Avro files. However, if the source table does not have a primary key defined in its DDL, a list of unique columns can be specified here to enable Copy Table transformations.
Note
The values for source_database, source_schema, source_table, and source_column are case sensitive and must match the names in the source database.
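For illustration only, a hypothetical override for a keyless table might look like the following rows (shown here as Python dicts; in the job they are entered in the column editor):

```python
# Hypothetical Primary Key Override rows. Values are case sensitive and must
# match the names used in the source database.
primary_key_override = [
    {"source_database": "sales_db", "source_schema": "public",
     "source_table": "orders", "source_column": "order_id"},
    # For a composite key, add one row per column.
    {"source_database": "sales_db", "source_schema": "public",
     "source_table": "orders", "source_column": "order_line_no"},
]
```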
Delta Lake on Databricks properties
Cloud Storage URL
= string
The URL of the location where the CDC change event files are stored. The URL should have the format s3://<bucket>/<prefix>/ or azure://<account>.blob.core.windows.net/<container>/<prefix>/.
Target Database
= string
The Databricks database containing the external table with change events and the target table.
Concurrency
= drop-down
How to iterate over the source tables for which there is CDC event data:
- Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default.
- Concurrent: Iterations are run concurrently.
Stage Table Prefix
= string
A prefix to prepend to the source table name to generate the stage table names.
Target Table Prefix
= string
A prefix to prepend to the source table name to generate the target table name. If no Target Table Prefix is specified, the target table will have the same name as the source table.
Fully Qualify Target Table (Recommended)
= boolean
Optionally includes the source database and schema in the target table name.
- Y: The target table name will have the format <source_database>_<source_schema>_<source_table>. This is the default setting.
- N: The target table name will have the format <source_table>.
It's strongly recommended to leave this set to Y, unless you are confident that your source table names will always be unique.
If the target table name isn't fully qualified, and there are source tables with the same name in different databases or schemas, this job will see them as the same table and attempt to merge their metadata and contents.
Transformation Type
= drop-down
The type of transformation used when applying the change events to the target table.
- Copy Table: The target table will be maintained as a copy of the source table.
- Copy Table with Soft Deletes: Same as Copy Table, but records deleted in the source table will be retained in the target table.
- Change Log: All change events will be extracted and appended to the target table.
Note
A primary key is required on the source table for Copy Table and Copy Table with Soft Deletes transformations. The primary key is used by the shared job to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to Change Log.
Note
Change Log is disabled in this version of the shared job and will be enabled in a future release.
Warning
If you intend to change the transformation type for an existing or running workflow, you must drop the target tables before changing it. The transformation types are mutually exclusive and can't be applied to the same target tables.
Append All Metadata
= boolean
Append all metadata columns to the target table. Options are:
- Y: Append all metadata columns.
- N: Only append metadata columns required by this job.
For a list of metadata columns, and how the field values are derived, read Append Metadata.
Azure Key Vault URL
= string
Required for Azure Blob Storage
The URL of the Azure Key Vault used to store the Azure Blob Storage connection string. This is also called the vault's "DNS Name".
Azure Blob Storage Account Key Secret
= string
Required for Azure Blob Storage
Name of the Azure Key Vault secret that stores the Azure Blob Storage account key.
Schema Drift Action
= drop-down
Select which action should be taken when schema changes are detected in the source table and require an update to the target table.
- Update Target: The target table is updated according to the new schema. This is the default setting.
- Fail Job: This job will fail, causing no changes to the target table.
Note
Schema drift handling is disabled in this version of the shared job and will be enabled in a future release.
Catalog
= string
The name of the Databricks catalog containing the databases.
For workspaces that do not have Unity Catalog enabled, this should be set to hive_metastore or [Environment Default].
Tag Processed Files
= string
Once the Avro files have been processed, update the tags or metadata in cloud storage. This enables a lifecycle management policy to be applied to files to reduce storage costs. Options are:
- Y: Update tags/metadata.
- N: Don't update tags/metadata.
For S3 and Azure Blob Storage, a tag of matillion_cdc_processed = true will be added to each file. If a file already has the maximum 10 tags, the tagging process will be skipped.
Bytes to Decimal Function
= string
For some source databases (e.g. PostgreSQL and Oracle), numeric columns can be defined without specifying a scale or precision, and these columns are able to contain values with different scales and precisions. When these columns are extracted to the Avro files, they are denoted as the type VariableScaleDecimal.
A VariableScaleDecimal value is a struct with two properties: the value stored as a byte array, and a numeric scale. Databricks SQL can't natively convert these byte arrays back to the original numeric format. If your source tables contain such columns, a Python User Defined Function is required by the shared job to convert the values.
If no value is specified for this parameter, the shared job will not attempt to convert the byte arrays, and will load them to columns of type variant.
If a value is specified, the shared job will check whether the user defined function already exists. If it does not exist, the job will attempt to create it. If the job cannot create the function due to insufficient privileges, the job will fail and the function must be created in the Databricks Console.
If the parameter is set to a value like bytes_to_decimal, the job will look for the function in the environment default catalog and schema.
If the parameter is set to a value like unity_catalog.cdc.bytes_to_decimal, the job will look for the function in the cdc schema of the unity_catalog catalog.
Note
Python User Defined Functions are only supported in schemas in a Unity Catalog. Also, the endpoint or cluster must be Databricks Runtime 13.2 or later, or a serverless or pro SQL warehouse. If your environment does not meet these requirements, the job will not attempt to create or use the User Defined Function.
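The conversion logic itself is the same as described for Snowflake above. As a rough illustration only, the snippet below registers equivalent logic as a session-scoped PySpark UDF. This is not the catalog-level Python UDF the shared job expects (that must be created in a Unity Catalog schema), and it returns a string to avoid fixing a single decimal precision and scale:

```python
# Rough sketch of equivalent conversion logic as a session-scoped PySpark UDF,
# for illustration only; the shared job expects a catalog-level Python UDF.
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def bytes_to_decimal_sketch(value, scale):
    if value is None or scale is None:
        return None
    unscaled = int.from_bytes(bytes(value), byteorder="big", signed=True)
    return str(Decimal(unscaled) / (Decimal(10) ** scale))

# Register for use from SQL, returning a string to preserve variable scale.
spark.udf.register("bytes_to_decimal_sketch", bytes_to_decimal_sketch, StringType())
```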
Amazon Redshift properties
Cloud Storage URL
= string
The URL of the location where the CDC change event files are stored. The URL should have the format s3://<bucket>/<prefix>/ or azure://<account>.blob.core.windows.net/<container>/<prefix>/.
Concurrency
= drop-down
How to iterate over the source tables for which there is CDC event data:
- Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default.
- Concurrent: Iterations are run concurrently.
IAM Role
= string
The Amazon Resource Name (ARN) for an IAM role with the necessary permissions to access the S3 bucket containing the CDC event data.
Region
= drop-down
(Optional) The AWS region of the Amazon S3 bucket containing the CDC event data, if it resides in a different region from the Redshift cluster.
Target Schema
= string
The Redshift schema containing the external table with change events and the target table.
Target Table Prefix
= string
A prefix to prepend to the source table name to generate the target table name. If no target table prefix is specified, the target table will have the same name as the source table.
Fully Qualify Target Table (Recommended)
= boolean
Optionally includes the source database and schema in the target table name.
- Y: The target table name will have the format <source_database>_<source_schema>_<source_table>. This is the default setting.
- N: The target table name will have the format <source_table>.
It's strongly recommended to leave this set to Y, unless you are confident that your source table names will always be unique.
If the target table name isn't fully qualified, and there are source tables with the same name in different databases or schemas, this job will see them as the same table and attempt to merge their metadata and contents.
Transformation Type
= drop-down
The type of transformation used when applying the change events to the target table.
- Copy Table: The target table will be maintained as a copy of the source table.
- Copy Table with Soft Deletes: Same as Copy Table, but records deleted in the source table will be retained in the target table.
- Change Log: All change events will be extracted and appended to the target table.
Note
A primary key is required on the source table for Copy Table and Copy Table with Soft Deletes transformations. The primary key is used by the shared job to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to Change Log.
Note
Change Log is disabled in this version of the shared job and will be enabled in a future release.
Warning
If you intend to change the transformation type for an existing or running workflow, you must drop the target tables before changing it. The transformation types are mutually exclusive and can't be applied to the same target tables.
Append All Metadata
= boolean
Append all metadata columns to the target table. Options are:
- Y: Append all metadata columns.
- N: Only append metadata columns required by this job.
For a list of metadata columns, and how the field values are derived, read Append Metadata.
Schema Drift Action
= drop-down
Select which action should be taken when schema changes are detected in the source table and require an update to the target table.
- Update Target: The target table is updated according to the new schema. This is the default setting.
- Fail Job: This job will fail, causing no changes to the target table.
Maximum number of files per table
= string
The maximum number of files that will be processed for each source table by a single job execution. The default is 1000.
Tag Processed Files
= string
Once the Avro files have been processed, update the tags in cloud storage. This enables a lifecycle management policy to be applied to files to reduce storage costs. Options are:
- Y: Update tags.
- N: Don't update tags.
For S3, a tag of matillion_cdc_processed = true will be added to each file. If a file already has the maximum 10 tags, the tagging process will be skipped.
Columns to Exclude
= column editor
(Optional) List of columns in the Avro files which should not be loaded into Redshift.
For example, if the source table contains columns which are too large to load into Redshift, those columns could be listed here under source_database, source_schema, source_table, and source_column.
Google BigQuery properties
Cloud Storage URL
= string
The URL of the location where the CDC change event files are stored. The URL should have the format gs://<bucket>/<prefix>/.
Target Project
= drop-down
The GCP project containing the target table.
Target Dataset
= drop-down
The BigQuery dataset containing the target table.
Concurrency
= drop-down
How to iterate over the source tables for which there is CDC event data:
- Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default.
- Concurrent: Iterations are run concurrently.
Stage Table Prefix
= string
A prefix to prepend to the source table name to generate the stage table names.
Target Table Prefix
= string
A prefix to prepend to the source table name to generate the target table name. If no Target Table Prefix is specified, the target table will have the same name as the source table.
Fully Qualify Target Table (Recommended)
= boolean
Optionally includes the source database and schema in the target table name.
- Y: The target table name will have the format <source_database>_<source_schema>_<source_table>. This is the default setting.
- N: The target table name will have the format <source_table>.
It's strongly recommended to leave this set to Y, unless you are confident that your source table names will always be unique.
If the target table name isn't fully qualified, and there are source tables with the same name in different databases or schemas, this job will see them as the same table and attempt to merge their metadata and contents.
Transformation Type
= drop-down
The type of transformation used when applying the change events to the target table.
- Copy Table: The target table will be maintained as a copy of the source table.
- Copy Table with Soft Deletes: Same as Copy Table, but records deleted in the source table will be retained in the target table.
- Change Log: All change events will be extracted and appended to the target table.
Note
A primary key is required on the source table for Copy Table and Copy Table with Soft Deletes transformations. The primary key is used by the shared job to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to Change Log.
Note
Change Log is disabled in this version of the shared job and will be enabled in a future release.
Warning
If you intend to change the transformation type for an existing or running workflow, you must drop the target tables before changing it. The transformation types are mutually exclusive and can't be applied to the same target tables.
Append All Metadata
= boolean
Append all metadata columns to the target table. Options are:
- Y: Append all metadata columns.
- N: Only append metadata columns required by this job.
For a list of metadata columns, and how the field values are derived, read Append Metadata.
Schema Drift Action
= drop-down
Select which action should be taken when schema changes are detected in the source table and require an update to the target table.
- Update Target: The target table is updated according to the new schema. This is the default setting.
- Fail Job: This job will fail, causing no changes to the target table.
Tag Processed Files
= string
Once the Avro files have been processed, update the metadata in cloud storage. This enables a lifecycle management policy to be applied to files to reduce storage costs. Options are:
- Y: Update metadata.
- N: Don't update metadata.
The Custom-Time metadata will be updated to be the time the file was processed. If the Custom-Time has already been set to a value in the future, the metadata update will be skipped.
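As an example of the lifecycle management this enables, the sketch below (using the google-cloud-storage Python client, with a hypothetical bucket name and retention period, and assuming a client version that supports the daysSinceCustomTime condition) deletes objects 30 days after the Custom-Time set by the shared job:

```python
# Sketch: a GCS lifecycle rule that deletes files 30 days after their Custom-Time.
# The bucket name and retention period are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-cdc-bucket")
# Delete objects 30 days after the Custom-Time stamped by the shared job.
bucket.add_lifecycle_delete_rule(days_since_custom_time=30)
bucket.patch()
```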
Bytes to Decimal Function
= string
For some source databases (e.g. PostgreSQL and Oracle), numeric columns can be defined without specifying a scale or precision, and these columns are able to contain values with different scales and precisions. When these columns are extracted to the Avro files, they are denoted as the type VariableScaleDecimal.
A VariableScaleDecimal value is a struct with two properties: the value stored as a byte array, and a numeric scale. If your source tables contain such columns, a User Defined Function is required by the shared job to convert the values.
If no value is specified for this parameter, the shared job will not attempt to convert the byte arrays, and will load them to columns of type STRUCT<scale INT64, value BYTES>.
If a value is specified, the shared job will check whether the user defined function already exists. If it does not exist, the job will attempt to create it. If the job cannot create the function due to insufficient privileges, the job will fail and the function must be created in the BigQuery console.
If the parameter is set to a value like bytes_to_decimal, the job will look for the function in the environment default project and dataset.
If the parameter is set to a value like my_project.cdc.bytes_to_decimal, the job will look for the function in the cdc dataset of the my_project project.
Note
If the original decimal value contains more than 15 digits, BigQuery will start to use floating point approximations during the conversion. This means the resulting decimal number will only be approximately the same as the value in the source table.
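The effect described in this note is ordinary IEEE-754 double precision: a 64-bit float carries roughly 15 to 16 significant decimal digits. A quick Python illustration:

```python
# Doubles hold ~15-16 significant decimal digits; larger values get rounded.
exact = 1234567890123456789                 # 19 significant digits
approx = float(exact)
print(approx)                               # 1.2345678901234568e+18
print(int(approx) == exact)                 # False -- the low-order digits were lost
```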
Tables created
The ETL process creates some transient staging tables during a pipeline run; these are deleted after the run, regardless of whether it succeeds.
Permanent tables are also created to support the pipeline process, including the target table that contains your source data.
Read Tables created.
Snowflake | Delta Lake on Databricks | Amazon Redshift | Google BigQuery | Azure Synapse Analytics
--- | --- | --- | --- | ---
✅ | ✅ | ✅ | ✅ | ❌