
Sync All Tables shared job

Process CDC data for all source tables, and then keep matching target tables in sync with the source tables.


Properties

Snowflake

Cloud Storage URL = string

The URL of the location where the CDC change event files are stored. This should have the format: s3://<bucket>/<prefix>/ or azure://<account>.blob.core.windows.net/<container>/<prefix>/


Warehouse = string

The Snowflake warehouse used to process the data.


Target Database = string

The Snowflake database containing the external table with change events and the target table.


Target Schema = string

The Snowflake schema containing the external table with change events and the target table.


Snowflake External Stage = string

The name of the existing external stage that contains the files output by the CDC pipeline. For a full explanation of how and why you would create a Snowflake external stage, read Manage Stages.


External Table = string

The external table in Snowflake containing the CDC change events.


Concurrency = drop-down

How to iterate over the source tables for which there is CDC event data:

  • Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default.
  • Concurrent: Iterations are run concurrently.

Target Table Prefix = string

A prefix to prepend to the source table name to generate the target table name. If no target table prefix is specified, the target table will have the same name as the source table.


Fully Qualify Target Table (Recommended) = boolean

Optionally includes the source database and schema in the target table name.

  • Y: The target table name will have the format <source_database>_<source_schema>_<source_table>. This is the default setting.
  • N: The target table name will have the format <source_table>.

It's strongly recommended to leave this set to Y, unless you are confident that your source table names will always be unique.

If the target table name isn't fully qualified, and there are source tables with the same name in different databases or schemas, this job will see them as the same table and attempt to merge their metadata and contents.
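
As an illustration of the naming rules above, the minimal Python sketch below derives a target table name from the source identifiers, the optional Target Table Prefix, and this setting. The function, its arguments, and the placement of the prefix at the front of the generated name are assumptions for illustration, not the shared job's actual implementation.

    def target_table_name(source_database: str, source_schema: str, source_table: str,
                          prefix: str = "", fully_qualify: bool = True) -> str:
        """Derive the target table name as described above (illustrative only)."""
        if fully_qualify:
            # Y: <source_database>_<source_schema>_<source_table>, so identically
            # named tables in different databases or schemas stay distinct.
            base = f"{source_database}_{source_schema}_{source_table}"
        else:
            # N: just <source_table>; collisions are possible across schemas.
            base = source_table
        return f"{prefix}{base}"  # prefix placement is an assumption, see note above

    # Example: a "cdc_" prefix with fully qualified naming
    print(target_table_name("sales", "public", "orders", prefix="cdc_"))
    # cdc_sales_public_orders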


Transformation Type = drop-down

The type of transformation used when applying the change events to the target table.

  • Copy Table: The target table will be maintained as a copy of the source table.
  • Copy Table with Soft Deletes: Same as Copy Table, but records deleted in the source table will be retained in the target table.
  • Change Log: All change events will be extracted and appended to the target table.

Note

A primary key is required on the source table for Copy Table and Copy Table with Soft Deletes transformations. The primary key is used by the shared job to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to Change Log.

Note

Change Log is disabled in this version of the shared job and will be enabled in a future release.

Warning

If you intend to change the type of transformation for an already existing/running workflow, you must also drop the target tables before changing the transformation type. The transformation types are mutually exclusive and can't be applied to the same target tables.


Append All Metadata = boolean

Append all metadata columns to the target table. Options are:

  • Y: Append all metadata columns.
  • N: Only append metadata columns required by this job.

For a list of metadata columns, and how the field values are derived, read Append Metadata.


Azure Key Vault URL = string

Required for Azure Blob Storage

The URL of the Azure Key Vault used to store the Azure Blob Storage connection string. This is also called the vault's "DNS Name".


Azure Blob Storage Account Key Secret = string

Required for Azure Blob Storage

Name of the Azure Key Vault secret that stores the Azure Blob Storage account key.


Schema Drift Action = drop-down

Select which action should be taken when schema changes are detected in the source table and require an update to the target table.

  • Update Target: The target table is updated according to the new schema. This is the default setting.
  • Fail Job: This job will fail, causing no changes to the target table.

Logging Enabled = string

Enable logging of row counts from transformation jobs. Options are:

  • Y: Log row counts.
  • N: Don't log row counts.

Note

Logged row counts may not match the number of rows processed by the CDC agent. For the Copy Table and Copy Table with Soft Deletes transformation types, if a given record has more than one change event in a job execution, only the final change event will be processed. The "after" object of the final change event contains the net changes from all the previous events.
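
The Python sketch below illustrates why the counts can differ: within one batch, only the last change event per primary key is applied. The event structure and helper function are hypothetical simplifications (for example, delete events are ignored here), not the shared job's actual logic.

    def last_event_per_key(change_events, key_columns):
        """Keep only the final change event for each primary key (illustrative only)."""
        # change_events is assumed to be ordered oldest to newest, and each event
        # is assumed to carry the row's key values in its "after" object.
        latest = {}
        for event in change_events:
            key = tuple(event["after"][col] for col in key_columns)
            latest[key] = event  # later events overwrite earlier ones
        return list(latest.values())

    events = [
        {"op": "c", "after": {"id": 1, "qty": 5}},
        {"op": "u", "after": {"id": 1, "qty": 7}},  # net change for id=1
        {"op": "c", "after": {"id": 2, "qty": 3}},
    ]
    print(len(last_event_per_key(events, ["id"])))  # 2 rows processed, not 3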


Log Table = string

The name of the table in Snowflake containing the logged events.


Trim All String Columns = string

Remove trailing whitespace from string columns. Options are:

  • Y: Remove trailing whitespace.
  • N: Don't remove trailing whitespace.

Tag Processed Files = string

Once the Avro files have been processed, update the tags or metadata in cloud storage. This enables a lifecycle management policy to be applied to files to reduce storage costs. Options are:

  • Y: Update tags/metadata.
  • N: Don't update tags/metadata.

For S3 and Azure Blob Storage, a tag of matillion_cdc_processed = true will be added to each file. If a file already has the maximum 10 tags, the tagging process will be skipped.

For Google Cloud Storage, the Custom-Time metadata will be updated to be the time the file was processed. If the Custom-Time has already been set to a value in the future, the metadata update will be skipped.
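
If you want to check or reproduce this behaviour outside the shared job, a rough Python sketch using boto3 and google-cloud-storage might look like the following. The bucket and object names are placeholders, and the guard conditions mirror the rules above (skip at ten existing S3 tags; skip a future Custom-Time on Google Cloud Storage).

    from datetime import datetime, timezone

    import boto3
    from google.cloud import storage

    def tag_s3_object(bucket: str, key: str) -> None:
        """Add matillion_cdc_processed=true, skipping objects already at 10 tags."""
        s3 = boto3.client("s3")
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
        if len(tags) >= 10 or any(t["Key"] == "matillion_cdc_processed" for t in tags):
            return  # S3 allows at most 10 tags per object; don't tag twice
        tags.append({"Key": "matillion_cdc_processed", "Value": "true"})
        s3.put_object_tagging(Bucket=bucket, Key=key, Tagging={"TagSet": tags})

    def set_gcs_custom_time(bucket: str, name: str) -> None:
        """Set Custom-Time to now, unless it is already set to a future time."""
        blob = storage.Client().bucket(bucket).get_blob(name)
        now = datetime.now(timezone.utc)
        if blob.custom_time and blob.custom_time > now:
            return
        blob.custom_time = now
        blob.patch()  # push the metadata change to Google Cloud Storage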


Bytes to Decimal Function = string

For some source databases (e.g. PostgreSQL and Oracle), numeric columns can be defined without specifying a scale or precision, and these columns are able to contain values with different scales and precisions. When these columns are extracted to the Avro files, they are denoted as the type VariableScaleDecimal.

A VariableScaleDecimal value is a variant object with two properties: the value stored as a byte array, and a numeric scale. Snowflake SQL can't natively convert these byte arrays back to the original numeric format. If your source tables contain such columns, a User Defined Function is required by the shared job to convert the values.

If no value is specified for this parameter, the shared job will not attempt to convert the byte arrays, and will load them to columns of type variant.

If a value is specified, the shared job will check whether the user defined function already exists. If it doesn't exist, the job will attempt to create it. If the job can't create the function due to insufficient privileges, the job will fail, and the function must be created in the Snowflake Console.

If the parameter is set to a value like BYTES_TO_DECIMAL, the job will look for the function in the environment default database and schema.

If the parameter is set to a value like CDC_DB.PUBLIC.BYTES_TO_DECIMAL, the job will look for the function in the PUBLIC schema of the CDC_DB database.
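
For reference, the conversion such a function has to perform is small. The Python sketch below assumes the Debezium-style VariableScaleDecimal encoding, where the value is the unscaled number as a big-endian two's-complement byte array; it illustrates the logic only and is not the UDF the shared job creates.

    from decimal import Decimal

    def variable_scale_decimal(value: bytes, scale: int) -> Decimal:
        """Rebuild the original decimal from a VariableScaleDecimal's two parts."""
        # The byte array holds the unscaled integer in big-endian two's complement
        # (assumed encoding, see note above).
        unscaled = int.from_bytes(value, byteorder="big", signed=True)
        # Shifting the decimal point left by `scale` digits restores the value.
        return Decimal(unscaled).scaleb(-scale)

    print(variable_scale_decimal(b"\x30\x39", 2))  # bytes for 12345, scale 2 -> 123.45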


Use Source Schemas = string

Optionally create the target tables in a schema with the same name as the source schema. If the schema doesn't already exist, the job will try to create it.

  • Y: Create the target tables in a schema with the same name as the source schema.
  • N: Create the target tables in the schema specified by the Target Schema parameter.

Note

Whichever option is selected, the external tables will still be created in the schema specified by the Target Schema parameter.


Primary Key Override = column editor

Optionally provide a list of primary key columns for the source tables.

By default, the job will read the primary key columns from the CDC Avro files. However, if the source table does not have a primary key defined in its DDL, a list of unique columns can be specified here to enable Copy Table transformations.

Note

The values for the source_database, source_schema, source_table, and source_column are case sensitive, and must match the source database.


Delta Lake on Databricks

Cloud Storage URL = string

The URL of the location where the CDC change event files are stored. This should have the format: s3://<bucket>/<prefix>/ or azure://<account>.blob.core.windows.net/<container>/<prefix>/


Target Database = string

The Databricks database containing the external table with change events and the target table.


Concurrency = drop-down

How to iterate over the source tables for which there is CDC event data:

  • Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default.
  • Concurrent: Iterations are run concurrently.

Stage Table Prefix = string

A prefix to prepend to the source table name to generate the stage table names.


Target Table Prefix = string

A prefix to prepend to the source table name to generate the target table name. If no Target Table Prefix is specified, the target table will have the same name as the source table.


Fully Qualify Target Table (Recommended) = boolean

Optionally includes the source database and schema in the target table name.

  • Y: The target table name will have the format <source_database>_<source_schema>_<source_table>. This is the default setting.
  • N: The target table name will have the format <source_table>.

It's strongly recommended to leave this set to Y, unless you are confident that your source table names will always be unique.

If the target table name isn't fully qualified, and there are source tables with the same name in different databases or schemas, this job will see them as the same table and attempt to merge their metadata and contents.


Transformation Type = drop-down

The type of transformation used when applying the change events to the target table.

  • Copy Table: The target table will be maintained as a copy of the source table.
  • Copy Table with Soft Deletes: Same as Copy Table, but records deleted in the source table will be retained in the target table.
  • Change Log: All change events will be extracted and appended to the target table.

Note

A primary key is required on the source table for Copy Table and Copy Table with Soft Deletes transformations. The primary key is used by the shared job to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to Change Log.

Note

Change Log is disabled in this version of the shared job and will be enabled in a future release.

Warning

If you intend to change the type of transformation for an already existing/running workflow, you must also drop the target tables before changing the transformation type. The transformation types are mutually exclusive and can't be applied to the same target tables.


Append All Metadata = boolean

Append all metadata columns to the target table. Options are:

  • Y: Append all metadata columns.
  • N: Only append metadata columns required by this job.

For a list of metadata columns, and how the field values are derived, read Append Metadata.


Azure Key Vault URL = string

Required for Azure Blob Storage

The URL of the Azure Key Vault used to store the Azure Blob Storage connection string. This is also called the vault's "DNS Name".


Azure Blob Storage Account Key Secret = string

Required for Azure Blob Storage

Name of the Azure Key Vault secret that stores the Azure Blob Storage account key.


Schema Drift Action = drop-down

Select which action should be taken when schema changes are detected in the source table and require an update to the target table.

  • Update Target: The target table is updated according to the new schema. This is the default setting.
  • Fail Job: This job will fail, causing no changes to the target table.

Note

Schema drift is disabled in this version of the shared job and will be enabled in a future release.


Catalog = string

The name of the Databricks catalog containing the databases.

For workspaces that do not have Unity Catalog enabled, this should be set to hive_metastore or [Environment Default].


Tag Processed Files = string

Once the Avro files have been processed, update the tags or metadata in cloud storage. This enables a lifecycle management policy to be applied to files to reduce storage costs. Options are:

  • Y: Update tags/metadata.
  • N: Don't update tags/metadata.

For S3 and Azure Blob Storage, a tag of matillion_cdc_processed = true will be added to each file. If a file already has the maximum 10 tags, the tagging process will be skipped.


Bytes to Decimal Function = string

For some source databases (e.g. PostgreSQL and Oracle), numeric columns can be defined without specifying a scale or precision, and these columns are able to contain values with different scales and precisions. When these columns are extracted to the Avro files, they are denoted as the type VariableScaleDecimal.

A VariableScaleDecimal value is a struct with two properties: the value stored as a byte array, and a numeric scale. Databricks SQL can't natively convert these byte arrays back to the original numeric format. If your source tables contain such columns, a Python User Defined Function is required by the shared job to convert the values.

If no value is specified for this parameter, the shared job will not attempt to convert the byte arrays, and will load them to columns of type variant.

If a value is specified, the shared job will check whether the user defined function already exists. If it does not exist, the job will attempt to create it. If the job cannot create the function due to insufficient privileges, the job will fail and the function must be created in the Databricks Console.

If the parameter is set to a value like bytes_to_decimal, the job will look for the function in the environment default catalog and schema.

If the parameter is set to a value like unity_catalog.cdc.bytes_to_decimal, the job will look for the function in the cdc schema of the unity_catalog catalog.

Note

Python User Defined Functions are only supported in schemas in a Unity Catalog. Also, the endpoint or cluster must be Databricks Runtime 13.2 or later, or a serverless or pro SQL warehouse. If your environment does not meet these requirements, the job will not attempt to create or use the User Defined Function.


Amazon Redshift

Cloud Storage URL = string

The URL of the location where the CDC change event files are stored. This should have the format: s3://<bucket>/<prefix>/ or azure://<account>.blob.core.windows.net/<container>/<prefix>/


Concurrency = drop-down

How to iterate over the source tables for which there is CDC event data:

  • Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default.
  • Concurrent: Iterations are run concurrently.

IAM Role = string

The Amazon Resource Name (ARN) for an IAM role with the necessary permissions to access the S3 bucket containing the CDC event data.


Region = drop-down

(Optional) AWS Region of the Amazon S3 bucket containing the CDC event data if it resides in a different region to the Redshift cluster.


Target Schema = string

The Redshift schema containing the external table with change events and the target table.


Target Table Prefix = string

A prefix to prepend to the source table name to generate the target table name. If no target table prefix is specified, the target table will have the same name as the source table.


Fully Qualify Target Table (Recommended) = boolean

Optionally includes the source database and schema in the target table name.

  • Y: The target table name will have the format <source_database>_<source_schema>_<source_table>. This is the default setting.
  • N: The target table name will have the format <source_table>.

It's strongly recommended to leave this set to Y, unless you are confident that your source table names will always be unique.

If the target table name isn't fully qualified, and there are source tables with the same name in different databases or schemas, this job will see them as the same table and attempt to merge their metadata and contents.


Transformation Type = drop-down

The type of transformation used when applying the change events to the target table.

  • Copy Table: The target table will be maintained as a copy of the source table.
  • Copy Table with Soft Deletes: Same as Copy Table, but records deleted in the source table will be retained in the target table.
  • Change Log: All change events will be extracted and appended to the target table.

Note

A primary key is required on the source table for Copy Table and Copy Table with Soft Deletes transformations. The primary key is used by the shared job to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to Change Log.

Note

Change Log is disabled in this version of the shared job and will be enabled in a future release.

Warning

If you intend to change the type of transformation for an already existing/running workflow, you must also drop the target tables before changing the transformation type. The transformation types are mutually exclusive and can't be applied to the same target tables.


Append All Metadata = boolean

Append all metadata columns to the target table. Options are:

  • Y: Append all metadata columns.
  • N: Only append metadata columns required by this job.

For a list of metadata columns, and how the field values are derived, read Append Metadata.


Schema Drift Action = drop-down

Select which action should be taken when schema changes are detected in the source table and require an update to the target table.

  • Update Target: The target table is updated according to the new schema. This is the default setting.
  • Fail Job: This job will fail, causing no changes to the target table.


Maximum number of files per table = string

The maximum number of files that will be processed for each source table by a single job execution. The default is 1000.


Tag Processed Files = string

Once the Avro files have been processed, update the tags in cloud storage. This enables a lifecycle management policy to be applied to files to reduce storage costs. Options are:

  • Y: Update tags.
  • N: Don't update tags.

For S3, a tag of matillion_cdc_processed = true will be added to each file. If a file already has the maximum 10 tags, the tagging process will be skipped.


Columns to Exclude = column editor

(Optional) List of columns in the Avro files which should not be loaded into Redshift.

For example, if the source table contains columns which are too large to load into Redshift, the columns could be listed here under source_database, source_schema, source_table, and source_column.


Google BigQuery

Cloud Storage URL = string

The URL of the location where the CDC change event files are stored. This should have the format: gs://<bucket>/<prefix>/


Target Project = drop-down

The GCP project containing the target table.


Target Dataset = drop-down

The BigQuery dataset containing the target table.


Concurrency = drop-down

How to iterate over the source tables for which there is CDC event data:

  • Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default.
  • Concurrent: Iterations are run concurrently.

Stage Table Prefix = string

A prefix to prepend to the source table name to generate the stage table names.


Target Table Prefix = string

A prefix to prepend to the source table name to generate the target table name. If no Target Table Prefix is specified, the target table will have the same name as the source table.


Fully Qualify Target Table (Recommended) = boolean

Optionally includes the source database and schema in the target table name.

  • Y: The target table name will have the format <source_database>_<source_schema>_<source_table>. This is the default setting.
  • N: The target table name will have the format <source_table>.

It's strongly recommended to leave this set to Y, unless you are confident that your source table names will always be unique.

If the target table name isn't fully qualified, and there are source tables with the same name in different databases or schemas, this job will see them as the same table and attempt to merge their metadata and contents.


Transformation Type = drop-down

The type of transformation used when applying the change events to the target table.

  • Copy Table: The target table will be maintained as a copy of the source table.
  • Copy Table with Soft Deletes: Same as Copy Table, but records deleted in the source table will be retained in the target table.
  • Change Log: All change events will be extracted and appended to the target table.

Note

A primary key is required on the source table for Copy Table and Copy Table with Soft Deletes transformations. The primary key is used by the shared job to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to Change Log.

Note

Change Log is disabled in this version of the shared job and will be enabled in a future release.

Warning

If you intend to change the type of transformation for an already existing/running workflow, you must also drop the target tables before changing the transformation type. The transformation types are mutually exclusive and can't be applied to the same target tables.


Append All Metadata = boolean

Append all metadata columns to the target table. Options are:

  • Y: Append all metadata columns.
  • N: Only append metadata columns required by this job.

For a list of metadata columns, and how the field values are derived, read Append Metadata.


Schema Drift Action = drop-down

Select which action should be taken when schema changes are detected in the source table and require an update to the target table.

  • Update Target: The target table is updated according to the new schema. This is the default setting.
  • Fail Job: This job will fail, causing no changes to the target table.

Tag Processed Files = string

Once the Avro files have been processed, update the metadata in cloud storage. This enables a lifecycle management policy to be applied to files to reduce storage costs. Options are:

  • Y: Update metadata.
  • N: Don't update metadata.

The Custom-Time metadata will be updated to be the time the file was processed. If the Custom-Time has already been set to a value in the future, the metadata update will be skipped.


Bytes to Decimal Function = string

For some source databases (e.g. PostgreSQL and Oracle), numeric columns can be defined without specifying a scale or precision, and these columns are able to contain values with different scales and precisions. When these columns are extracted to the Avro files, they are denoted as the type VariableScaleDecimal.

A VariableScaleDecimal value is a struct with two properties: the value stored as a byte array, and a numeric scale. If your source tables contain such columns, a User Defined Function is required by the shared job to convert the values.

If no value is specified for this parameter, the shared job will not attempt to convert the byte arrays, and will load them to columns of type STRUCT<scale INT64, value BYTES>.

If a value is specified, the shared job will check whether the user defined function already exists. If it does not exist, the job will attempt to create it. If the job cannot create the function due to insufficient privileges, the job will fail and the function must be created in the BigQuery console.

If the parameter is set to a value like bytes_to_decimal, the job will look for the function in the environment default project and dataset.

If the parameter is set to a value like my_project.cdc.bytes_to_decimal, the job will look for the function in the cdc dataset of the my_project project.

Note

If the original decimal value contains more than 15 digits, BigQuery will start to use floating point approximations during the conversion. This means the resulting decimal number will only be approximately the same as the value in the source table.
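
To see the kind of approximation the note describes, compare a 19-digit decimal with its round trip through a 64-bit float in plain Python; the specific value is arbitrary and only illustrates why results beyond roughly 15 significant digits come back approximate.

    from decimal import Decimal

    exact = Decimal("12345678901234567.89")  # 19 significant digits
    approx = Decimal(float(exact))           # round trip through a 64-bit float
    print(exact)   # 12345678901234567.89
    print(approx)  # the trailing digits are lost to floating point rounding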


Tables created

The ETL process creates some transient staging tables during a pipeline run that are deleted after a run, regardless of success.

Tables are also created permanently to support the pipeline process, including the target table that contains your source data.

Read Tables created.

