
Data typing with CDC shared jobs

This page describes the data types used when the CDC Sync Single Table shared job loads CDC data into Snowflake, Amazon Redshift, Delta Lake on Databricks, and Google BigQuery.


Snowflake data types

Primitive data types

| Avro schema type | Snowflake type |
| --- | --- |
| string | VARCHAR(16777216) |
| int | NUMBER(38,0) |
| long | NUMBER(38,0) |
| float | FLOAT |
| double | FLOAT |
| boolean | BOOLEAN |
| bytes | VARIANT |

Complex data types

| Avro schema type | Snowflake type | Notes |
| --- | --- | --- |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p <= 38` and `0 <= s <= p` | NUMBER(p, s) | N/A |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `p > 38` and `0 <= s <= p`, or p or s not provided | VARIANT | Snowflake can't represent a decimal whose precision exceeds 38 or whose precision or scale is not provided. When reading such a column, Snowflake simply extracts the raw hexadecimal string from the Avro file. |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p < s <= 38` | N/A | In Snowflake, it's invalid for the scale to be greater than the precision. If an Avro file contains a decimal with scale greater than precision (which is valid in Oracle), Snowflake can't read the file, and the file will be silently skipped. |
| `{"type": "int", "logicalType": "date"}` | NUMBER(38, 0) | N/A |
| `{"type": "int", "logicalType": "time-millis"}` | NUMBER(38, 0) | N/A |
| `{"type": "long", "logicalType": "timestamp-millis"}` | NUMBER(38, 0) | N/A |
| `{"type": "array", "items": ["null", "element_type"]}` | VARIANT | The data will be represented in a structure like `[<element>, <element>, ...]`. |
| `{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]}` | VARIANT | The data will be represented in a structure like `{"scale": 5, "value": "7048860DDF79"}` (see the decoding sketch after this table). |
| `{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]}` | VARIANT | The data will be represented in a structure like `{"srid": null, "wkb": "01010000000000000000000040000000000000F03F", "x": 2.000000000000000e+00, "y": 1.000000000000000e+00}` (see the decoding sketch after this table). |

Amazon Redshift data types

Warning

A single string value loaded to SUPER is limited to the maximum VARCHAR length of 65535 bytes.

Primitive data types

| Avro schema type | Redshift type | Notes |
| --- | --- | --- |
| string | VARCHAR | All VARCHAR columns are created with the maximum size. |
| int | INT | N/A |
| long | BIGINT | N/A |
| float | REAL (FLOAT4) | N/A |
| double | FLOAT8 | N/A |
| boolean | BOOL | N/A |
| bytes | VARCHAR | All VARCHAR columns are created with the maximum size. Matillion ETL for Redshift can't create VARBINARY columns, hence the mapping to VARCHAR. This type can't be mapped to SUPER because some numeric types (from Kafka Connect) arrive as bytes and must remain strings in case they are key columns. |

Logical data types

A logical type is an Avro primitive or complex type with extra attributes to represent a derived type. The attribute logicalType must always be present for a logical type, and is a string with the name of one of the logical types listed later in this section. Other attributes may be defined for particular logical types.

The Matillion ETL for Redshift shared job handles logical types by loading their underlying Avro data type. Custom parsing of some logical data types can also be performed, as detailed in Casting logical data types, below.

| Logical type | Avro schema type | Format |
| --- | --- | --- |
| decimal | bytes | `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` |
| date | int | `{"type": "int", "logicalType": "date"}` |
| time-millis | int | `{"type": "int", "logicalType": "time-millis"}` |
| timestamp-millis | long | `{"type": "long", "logicalType": "timestamp-millis"}` |
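As a minimal illustration (not part of the shared job) of what loading the underlying Avro data type means, the following Python prints the raw values these logical types carry; the dates and times are arbitrary examples:

```python
# Minimal sketch, not part of the shared job: the raw values that land in the
# target when logical types are loaded as their underlying Avro data type.
# The dates and times below are arbitrary examples.
from datetime import date, datetime, timezone

d = date(2021, 3, 15)
print((d - date(1970, 1, 1)).days)    # date             -> 18701 (int: days since the Unix epoch)

print((14 * 3600 + 30 * 60) * 1000)   # time-millis      -> 52200000 (int: milliseconds since midnight)

ts = datetime(2021, 3, 15, 14, 30, tzinfo=timezone.utc)
print(int(ts.timestamp() * 1000))     # timestamp-millis -> 1615818600000 (long: milliseconds since the epoch)

# decimal: the bytes hold the big-endian two's-complement unscaled value;
# with scale 2, the bytes 0x01E240 (= 123456) represent 1234.56.
print(int.from_bytes(b"\x01\xe2\x40", "big", signed=True))  # -> 123456
```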

Complex data types

| Avro schema type | Redshift type | Notes |
| --- | --- | --- |
| union | N/A | Represents a collection of possible data types, in the format `[<DataType1>, ... <DataTypeN>]`. By convention, a union is only used to indicate that a field is nullable, so its format will always be `["null", <DataType>]`; Matillion ETL then decodes the `<DataType>` accordingly (see the sketch after this table). |
| array | SUPER | Each value should be a valid JSON_ARRAY. |
| enum | VARCHAR | All VARCHAR columns are created with the maximum size. |
| map | SUPER | Each value should be a valid JSON_OBJECT. |
| record | SUPER | A record holds a fixed number of fields. By convention, these types are used to describe custom schema data types, where the name/namespace identify what the underlying fields mean. Format: `{"type": "record", "name": "<CustomDataType>", "namespace": "<CustomDataNamespace>", "fields": [<AVRO FIELD DEFINITIONS>, ...]}`, where `<namespace>.<name>` and `<name>` are valid identifiers for the underlying custom record type. |
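As a minimal sketch (not the shared job's implementation) of the union convention described above, decoding a nullable field reduces to discarding the `"null"` branch:

```python
# Minimal sketch, not the shared job's implementation: unwrapping the nullable
# union convention ["null", <DataType>] described in the table above.
def unwrap_nullable(avro_type):
    """Return the effective type of a field, discarding the "null" branch."""
    if isinstance(avro_type, list):  # a union
        branches = [t for t in avro_type if t != "null"]
        if len(branches) != 1:
            raise ValueError(f"unexpected union shape: {avro_type}")
        return branches[0]
    return avro_type


print(unwrap_nullable(["null", "bytes"]))                                # -> bytes
print(unwrap_nullable(["null", {"type": "int", "logicalType": "date"}]))
# -> {'type': 'int', 'logicalType': 'date'}
```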

Delta Lake on Databricks data types

Primitive data types

| Avro schema type | Databricks type |
| --- | --- |
| string | string |
| int | int |
| long | long |
| float | double |
| double | double |
| boolean | boolean |
| bytes | binary |

Complex data types

| Avro schema type | Databricks type | Notes |
| --- | --- | --- |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p <= 38` and `0 <= s <= p` | decimal(p, s) | N/A |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `p > 38` and `0 <= s <= p` | N/A | If an Avro file contains a decimal with precision greater than 38, Databricks will throw the following error: `org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38` |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p < s <= 38` | N/A | If an Avro file contains a decimal with scale greater than precision (which is valid in Oracle), Databricks will throw the following error: `com.databricks.sql.io.FileReadException: Error while reading file` |
| `{"type": "int", "logicalType": "date"}` | int | N/A |
| `{"type": "int", "logicalType": "time-millis"}` | int | N/A |
| `{"type": "long", "logicalType": "timestamp-millis"}` | long | N/A |
| `{"type": "array", "items": ["null", "element_type"]}` | `array<element_type>` | N/A |
| `{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]}` | `struct<scale:int, value:binary>` | N/A |
| `{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]}` | `struct<x:double, y:double, wkb:binary, srid:int>` | N/A |

Google BigQuery data types

Primitive data types

| Avro schema type | BigQuery type |
| --- | --- |
| string | STRING |
| int | INT64 |
| long | INT64 |
| float | FLOAT64 |
| double | FLOAT64 |
| boolean | BOOL |
| bytes | BYTES |

Extended data types

| Avro schema type | BigQuery type | Notes |
| --- | --- | --- |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p - s <= 29` and `0 <= s <= 9` | NUMERIC | If the precision and scale fit in a NUMERIC type, NUMERIC is used. If either the precision or the scale is out of bounds for NUMERIC but within the bounds of BIGNUMERIC, BIGNUMERIC is used (see the sketch after this table). |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p - s <= 38` and `0 <= s <= 38` | BIGNUMERIC | If the precision and scale fit in a NUMERIC type, NUMERIC is used. If either the precision or the scale is out of bounds for NUMERIC but within the bounds of BIGNUMERIC, BIGNUMERIC is used. |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `p - s > 38` or `s > 38` | N/A | BigQuery can't process decimal numbers with more than 38 digits before or after the decimal point. The shared job will raise an error and fail the iteration. |
| `{"type": "int", "logicalType": "date"}` | INT64 | N/A |
| `{"type": "int", "logicalType": "time-millis"}` | INT64 | N/A |
| `{"type": "long", "logicalType": "timestamp-millis"}` | INT64 | N/A |
| `{"type": "array", "items": ["null", "element_type"]}` | `ARRAY<element_type>` | N/A |
| `{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]}` | `STRUCT<scale INT64, value BYTES>` | N/A |
| `{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]}` | `STRUCT<x FLOAT64, y FLOAT64, wkb BYTES, srid INT64>` | N/A |
| `{"type": "record", "name": "Geometry", "namespace": "io.debezium.data.geometry", "fields": [{"name": "wkb", "type": "bytes"}, {"name": "srid", "type": ["null", "int"], "default": null}]}` | `STRUCT<wkb BYTES, srid INT64>` | N/A |
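The following is a minimal sketch (not the shared job's implementation) of the NUMERIC/BIGNUMERIC selection described in the decimal rows above; the choice depends only on the integer digits (p - s) and the scale (s):

```python
# Minimal sketch, not the shared job's implementation: choosing between
# NUMERIC and BIGNUMERIC using the bounds stated in the decimal rows above.
def bigquery_decimal_type(precision: int, scale: int) -> str:
    integer_digits = precision - scale
    if 1 <= integer_digits <= 29 and 0 <= scale <= 9:
        return "NUMERIC"
    if 1 <= integer_digits <= 38 and 0 <= scale <= 38:
        return "BIGNUMERIC"
    raise ValueError(
        f"decimal({precision},{scale}) needs more than 38 digits before or "
        "after the decimal point, which BigQuery can't represent"
    )


print(bigquery_decimal_type(18, 4))   # -> NUMERIC
print(bigquery_decimal_type(50, 12))  # -> BIGNUMERIC (scale exceeds NUMERIC's 9)
print(bigquery_decimal_type(38, 0))   # -> BIGNUMERIC (38 integer digits exceed NUMERIC's 29)
```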

Casting logical data types

When the source table contains columns with the logical data types timestamp-millis, time-millis, or date, the CDC shared job casts them to timestamp, time, or date columns in the target table.

To use this functionality, you must include the parameter time.precision.mode = connect when you create your pipeline in Data Loader.
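For illustration, here is a minimal sketch (not the shared job's implementation) of what these casts amount to, assuming the representations delivered with time.precision.mode = connect: days since the Unix epoch for date, milliseconds since midnight for time, and milliseconds since the epoch for timestamp.

```python
# Minimal sketch, not the shared job's implementation: casting the raw values
# delivered with time.precision.mode = connect, i.e. days since the Unix epoch
# (date), milliseconds since midnight (time), and milliseconds since the
# epoch (timestamp).
from datetime import date, datetime, time, timedelta, timezone


def cast_date(days: int) -> date:
    return date(1970, 1, 1) + timedelta(days=days)


def cast_time(millis: int) -> time:
    return (datetime(1970, 1, 1) + timedelta(milliseconds=millis)).time()


def cast_timestamp(millis: int) -> datetime:
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


print(cast_date(18701))               # -> 2021-03-15
print(cast_time(52200000))            # -> 14:30:00
print(cast_timestamp(1615818600000))  # -> 2021-03-15 14:30:00+00:00
```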

Note

Ensure that you are using revision 2 of the shared job, identified by (rev.2) in the job name.