Schema drift
Schema drift occurs when your data source changes its metadata: fields, columns, and data types can be added, removed, or changed on the fly. If you don't account for schema drift, your data flow becomes vulnerable to changes in the upstream data source.
Streaming pipelines can adapt to these source schema changes and propagate them to the destination. These "self-healing" pipelines not only save time but also significantly reduce job failures and delays caused by schema drift, ensuring that data continues to flow, uninterrupted, from source to destination.
In a streaming pipeline, the agent loads data into your chosen destination cloud storage area, where it's then transformed to accommodate schema changes. Changes are driven by data rather than by DDL, which means that any changes made to the schema of the source tables won't be reflected in the target until the data changes.
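As a rough illustration of this data-driven behavior, the sketch below (hypothetical names, deliberately simplified type inference) shows how a new column might be noticed from the records themselves rather than from a DDL statement: the column is only picked up once a record actually carries data for it.

```python
# Minimal sketch of data-driven schema drift detection (hypothetical names).
# A new source column is only noticed when a record carries a value for it,
# not when the source table's DDL changes.

known_schema = {"id": "INTEGER", "name": "STRING"}

def infer_type(value):
    """Very rough type inference, for illustration only."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"

def detect_drift(record, schema):
    """Return columns present in the record but missing from the schema."""
    return {
        col: infer_type(val)
        for col, val in record.items()
        if col not in schema and val is not None  # only columns containing data
    }

# A record arrives with a new, populated column "email":
new_columns = detect_drift(
    {"id": 1, "name": "Ada", "email": "ada@example.com"}, known_schema
)
known_schema.update(new_columns)
print(new_columns)  # {'email': 'STRING'}
```

A real pipeline would of course do far richer type mapping and buffering, but the key point survives: detection is triggered by the data, so an empty new column goes unnoticed until it's populated.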
The streaming pipeline handles specific changes in accordance with the following principles:
- Added tables: You must create a new streaming pipeline to use a new source table; it's not automatically included in the existing pipeline.
- Deleted tables: If a table is deleted/dropped in your source, it's not dropped on the destination platform but will be retained for historical purposes. You can manually drop this table in the destination if needed.
- Added columns: The streaming pipeline will pick up any new columns added to selected tables if the new column contains data.
- Deleted columns: If a column is dropped from the source table, it remains in your destination table, and historical data is retained up to the point the column was dropped. After that point, values in this column are inserted as NULL.
- Data type changes: Data type changes will be accommodated, but if they aren't compatible changes for the destination cloud data platform (i.e. they are data types the destination does not support), this is handled in the following way:
  - The column will be renamed as <column_name>_<datetime>, where <datetime> has the format yyyymmddhhmmss and will be the same for all columns in the same table in the same streaming pipeline execution.
  - The new column will be NULL up to the date of change, which needs to be considered for downstream dependencies such as views and reports.
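The renaming rule for incompatible data type changes can be sketched as follows. This is an illustrative assumption, not the product's actual implementation: the helper name is hypothetical, and it assumes the shared yyyymmddhhmmss suffix is derived from a single timestamp for the whole pipeline execution.

```python
# Hypothetical sketch of the rename applied on an incompatible data type change.
# The timestamp suffix has the format yyyymmddhhmmss and is shared by every
# renamed column in the same table within one pipeline execution.
from datetime import datetime, timezone

def renamed_columns(columns, execution_time):
    """Map each changed column to its archived name <column_name>_<datetime>."""
    suffix = execution_time.strftime("%Y%m%d%H%M%S")  # yyyymmddhhmmss
    return {col: f"{col}_{suffix}" for col in columns}

run_started = datetime(2024, 5, 1, 12, 30, 0, tzinfo=timezone.utc)
print(renamed_columns(["amount", "quantity"], run_started))
# {'amount': 'amount_20240501123000', 'quantity': 'quantity_20240501123000'}
```

Because every column changed in the same table during the same execution gets the same suffix, downstream views can locate all columns archived by one run from a single timestamp.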