Data typing with CDC shared jobs
This page describes the data types used when the CDC Sync Single Table shared job loads CDC data into Snowflake, Amazon Redshift, Delta Lake on Databricks, and Google BigQuery.
Snowflake data types
Primitive data types
Avro schema type | Snowflake type |
---|---|
string | VARCHAR(16777216) |
int | NUMBER(38,0) |
long | NUMBER(38,0) |
float | FLOAT |
double | FLOAT |
boolean | BOOLEAN |
bytes | VARIANT |
Complex data types
Avro schema type | Snowflake type | Notes |
---|---|---|
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where 1 <= p <= 38 and 0 <= s <= p | NUMBER(p, s) | N/A |
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where p > 38 and 0 <= s <= p, or p or s not provided | VARIANT | Snowflake can't represent a decimal where precision or scale is not provided. When reading such a column, Snowflake will simply extract the raw hexadecimal string from the Avro file. |
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where 1 <= p < s <= 38 | N/A | In Snowflake, it's invalid to have the scale greater than the precision. If an Avro file contains a decimal with scale greater than precision (which is valid in Oracle), Snowflake won't be able to read the file, and the file will be silently skipped. |
{"type": "int", "logicalType": "date"} | NUMBER(38, 0) | N/A |
{"type": "int", "logicalType": "times-millis"} | NUMBER(38, 0) | N/A |
{"type": "long", "logicalType": "timestamp-millis"} | NUMBER(38, 0) | N/A |
{"type": "array", "items": ["null", "element_type"]} | VARIANT | The data will be represented in a structure like: [<element>, <element>, ...] |
{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]} | VARIANT | The data will be represented in a structure like: {"scale": 5, "value": "7048860DDF79"} |
{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]} | VARIANT | The data will be represented in a structure like: {"srid": null, "wkb": "01010000000000000000000040000000000000F03F", "x": 2.000000000000000e+00, "y": 1.000000000000000e+00} |
Amazon Redshift data types
Warning
A single string value loaded to SUPER is limited to the maximum VARCHAR length of 65535 bytes.
Primitive data types
Avro schema type | Redshift type | Notes |
---|---|---|
string | VARCHAR | All VARCHAR columns are created with the maximum size. |
int | INT | N/A |
long | BIGINT | N/A |
float | REAL (FLOAT4) | N/A |
double | FLOAT8 | N/A |
boolean | BOOL | N/A |
bytes | VARCHAR | All VARCHAR columns are created with the maximum size. Matillion ETL for Redshift can't create VARBINARY columns, hence the mapping to VARCHAR. This type can't be mapped to SUPER because some numeric types (from Kafka Connect) come through as bytes and must remain strings in case they're a key column. |
Logical data types
A logical type is an Avro primitive or complex type with extra attributes to represent a derived type. The attribute logicalType must always be present for a logical type, and is a string with the name of one of the logical types listed later in this section. Other attributes may be defined for particular logical types.
The Matillion ETL for Redshift shared job will handle logicalTypes by using their underlying Avro data type. Custom parsing of some logical data types can also be performed, as detailed in Casting logical data types, below.
Logical type | Avro schema type | Format |
---|---|---|
decimal | bytes | { "type": "bytes", "scale": s, "precision": p, "logicalType": "decimal" } |
date | int | { "type": "int", "logicalType": "date" } |
times-millis | int | { "type": "int", "logicalType": "times-millis" } |
timestamp-millis | long | { "type": "long", "logicalType": "timestamp-millis" } |
Complex data types
Avro schema type | Redshift type | Notes |
---|---|---|
union | N/A | Represents a collection of possible data types, format: [<DataType1>, ... <DataTypeN>]. By convention, a union is only used to indicate that a field is NULLABLE, that is, its format will always be ['null', <DataType>], and Matillion ETL then decodes the <DataType> accordingly. |
array | SUPER | Each value should be a valid JSON_ARRAY. |
enum | VARCHAR | All VARCHAR columns are created with the maximum size. |
map | SUPER | Each value should be a valid JSON_OBJECT. |
record | SUPER | A record holds a fixed number of fields. By convention, these types are used to describe custom schema data types, where the name/namespace identify what the underlying fields mean. Format: {"type": "record", "name": "<CustomDataType>", "namespace": "<CustomDataNamespace>", "fields": [ <AVRO FIELD DEFINITIONS>, ... ]} Where <namespace>.<name> and <name> are valid identifiers for the underlying record custom type. |
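To make the SUPER expectations above concrete, here is a small sketch of the kind of values an array, a map, and a record column would carry, together with a check that each is valid JSON of the expected shape (a JSON array for array, a JSON object for map and record). The column names and sample values are invented for illustration.

```python
import json

# Invented sample values shaped like the table above: arrays arrive as
# JSON arrays; maps and records arrive as JSON objects.
sample_super_values = {
    "array_column": '["alpha", "beta", null]',
    "map_column": '{"en": "hello", "fr": "bonjour"}',
    "record_column": '{"x": 2.0, "y": 1.0, '
                     '"wkb": "01010000000000000000000040000000000000F03F", "srid": null}',
}

for column, raw in sample_super_values.items():
    parsed = json.loads(raw)  # raises json.JSONDecodeError if the value isn't valid JSON
    shape = "JSON_ARRAY" if isinstance(parsed, list) else "JSON_OBJECT"
    print(f"{column}: valid {shape}")
```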
Delta Lake on Databricks data types
Primitive data types
Avro schema type | Databricks type |
---|---|
string | string |
int | int |
long | long |
float | double |
double | double |
boolean | boolean |
bytes | binary |
Complex data types
Avro schema type | Databricks type | Notes |
---|---|---|
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where 1 <= p <= 38 and 0 <= s <= p | decimal(p, s) | N/A |
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where p > 38 and 0 <= s <= p | N/A | If an Avro file contains a decimal with precision greater than 38, Databricks will throw the following error: org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38 |
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where 1 <= p < s <= 38 | N/A | If an Avro file contains a decimal with scale greater than precision (which is valid in Oracle), Databricks will throw the following error: com.databricks.sql.io.FileReadException: Error while reading file |
{"type": "int", "logicalType": "date"} | int | N/A |
{"type": "int", "logicalType": "times-millis"} | int | N/A |
{"type": "long", "logicalType": "timestamp-millis"} | long | N/A |
{"type": "array", "items": ["null", "element_type"]} | array<element_type> | N/A |
{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]} | struct<scale:int, value:binary> | N/A |
{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]} | struct<x:double, y:double, wkb:binary, srid:int> | N/A |
Google BigQuery data types
Primitive data types
Avro schema type | BigQuery type |
---|---|
string | STRING |
int | INT64 |
long | INT64 |
float | FLOAT64 |
double | FLOAT64 |
boolean | BOOL |
bytes | BYTES |
Extended data types
Avro schema type | BigQuery type | Notes |
---|---|---|
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where 1 <= p - s <= 29 and 0 <= s <= 9 | NUMERIC | If the precision and scale will fit in a NUMERIC type, NUMERIC will be used. If at least one of the precision or scale is out of bounds for NUMERIC, but within the bounds for BIGNUMERIC, BIGNUMERIC will be used. |
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where 1 <= p - s <= 38 and 0 <= s <= 38 | BIGNUMERIC | If the precision and scale will fit in a NUMERIC type, NUMERIC will be used. If at least one of the precision or scale is out of bounds for NUMERIC, but within the bounds for BIGNUMERIC, BIGNUMERIC will be used. |
{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"} where p - s > 38 or s > 38 | N/A | BigQuery can't process decimal numbers with more than 38 digits before or after the decimal point. The shared job will raise an error and fail the iteration. |
{"type": "int", "logicalType": "date"} | INT64 | N/A |
{"type": "int", "logicalType": "times-millis"} | INT64 | N/A |
{"type": "long", "logicalType": "timestamp-millis"} | INT64 | N/A |
{"type": "array", "items": ["null", "element_type"]} | ARRAY< element_type > | N/A |
{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]} | STRUCT< scale INT64, value BYTES > | N/A |
{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]} | STRUCT< x FLOAT64, y FLOAT64, wkb BYTES, srid INT64 > | N/A |
{"type": "record", "name": "Geometry", "namespace": "io.debezium.data.geometry", "fields": [{"name": "wkb", "type": "bytes"}, {"name": "srid", "type": ["null", "int"], "default": null}]} | STRUCT< wkb BYTES, srid INT64 > | N/A |
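The NUMERIC/BIGNUMERIC choice above boils down to a small decision rule: up to 29 digits before and 9 after the decimal point fits NUMERIC, up to 38 of each fits BIGNUMERIC, and anything larger fails the iteration. The sketch below is an illustrative restatement of those bounds, not the shared job's actual logic.

```python
def bigquery_decimal_type(precision: int, scale: int) -> str:
    """Illustrative restatement of the bounds in the table above.

    precision - scale is the number of digits before the decimal point,
    and scale is the number of digits after it.
    """
    integer_digits = precision - scale
    if integer_digits <= 29 and scale <= 9:
        return "NUMERIC"
    if integer_digits <= 38 and scale <= 38:
        return "BIGNUMERIC"
    # Mirrors the table: the shared job raises an error and fails the iteration.
    raise ValueError(f"decimal({precision},{scale}) fits neither NUMERIC nor BIGNUMERIC")

print(bigquery_decimal_type(38, 9))   # NUMERIC
print(bigquery_decimal_type(50, 20))  # BIGNUMERIC
# bigquery_decimal_type(80, 38) would raise: 42 digits before the decimal point
```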
Casting logical data types
When the source table contains columns with logical data types timestamp-millis, times-millis, or date, the CDC shared job will cast them to columns with data types timestamp, time, or date in the target table.
To use this functionality, you must include the parameter time.precision.mode = connect when you create your pipeline in Data Loader.
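For context on what this cast does: the underlying Avro values are integers, where date counts days since the Unix epoch, the millisecond time type counts milliseconds since midnight, and timestamp-millis counts milliseconds since the Unix epoch. The sketch below uses invented values to show how those integers translate into the date, time, and timestamp values that land in the target table.

```python
from datetime import date, datetime, timedelta, timezone

# Invented raw values as they would appear in the Avro files.
raw_date = 18993                      # days since 1970-01-01
raw_time_millis = 34200000            # milliseconds since midnight
raw_timestamp_millis = 1641032622000  # milliseconds since the Unix epoch

as_date = date(1970, 1, 1) + timedelta(days=raw_date)
as_time = (datetime.min + timedelta(milliseconds=raw_time_millis)).time()
as_timestamp = datetime.fromtimestamp(raw_timestamp_millis / 1000, tz=timezone.utc)

print(as_date)       # 2022-01-01
print(as_time)       # 09:30:00
print(as_timestamp)  # 2022-01-01 10:23:42+00:00
```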
Note
Ensure that you are using revision 2 of the shared job, identified by (rev.2) in the job name.