Schema drift
Schema Drift occurs when your data source changes its metadata. On the fly, fields, columns, and types can be added, removed, or changed. If you don't account for schema drift, your data flow becomes vulnerable to changes in the upstream data source.
Data Loader pipelines can easily adapt to these schema changes and propagate them to the destination, without assistance from IT or the need to code. These "self-healing" pipelines not only save time, they significantly reduce the number of job failures and delays due to schema drift. Users can ensure that data continues to flow, uninterrupted, from source to target.
CDC Pipelines
In a CDC pipeline, any columns that are added or removed when the data changes will be visible. Changes are driven by data rather than by DDL. As a result, any changes you make to the schema of the source tables will not be reflected until the data changes. This encompasses new (empty) tables and new (empty) columns.
Matillion ETL shared jobs are available that handle schema drift, which will accommodate the conditions below. The agent loads data to your chosen cloud storage area which is then transformed to accommodate these conditions. If you intend to transform the data from your chosen cloud storage area yourself, the below are useful points to consider.
- Adding a Table: You must create a new pipeline to add a new table. The ability to edit a pipeline will be considered in a future release.
- Delete a Table: If a table is deleted/dropped in your source, it is not removed from the destination platform and will be retained for historical purposes. You can manually drop this table if needed.
- Adding a Column: The pipeline will pick up any new columns added to selected tables if the new column contains data.
- Deleting a Column: If a column is dropped from the source table, it will remain in your target table. However, it will not be loaded - historical data will be retained and from the point of deletion, the values for this column will be inserted as NULL.
- Data Type change: See below Data Type changes section for more detail when using a Matillion ETL Shared job to use CDC data from your storage to update a data warehouse.
Data Type Changes
Data Type changes are accommodated, but if these are not compatible changes for the target cloud platform, the current column will be renamed as column_name
, _datetime , and the column re-purposed as the new datatype. The format of the datetime extension is mmddhhmmss
, for example, 20210113110334
and will be the same for all columns in the same table in the same Pipeline execution. The new column will be NULL up to the date of change - this should be considered for downstream dependencies such as views, reports, etc.