
Data typing with CDC shared jobs

This page describes the data types used when the CDC Sync Single Table shared job loads CDC data into Snowflake, Amazon Redshift, Delta Lake on Databricks, and Google BigQuery.


Snowflake data types

Primitive data types

| Avro schema type | Snowflake type |
| --- | --- |
| string | VARCHAR(16777216) |
| int | NUMBER(38,0) |
| long | NUMBER(38,0) |
| float | FLOAT |
| double | FLOAT |
| boolean | BOOLEAN |
| bytes | VARIANT |

Complex data types

| Avro schema type | Snowflake type | Notes |
| --- | --- | --- |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p <= 38` and `0 <= s <= p` | NUMBER(p, s) | N/A |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `p > 38` and `0 <= s <= p`, or p or s not provided | VARIANT | Snowflake can't represent a decimal whose precision exceeds 38 or whose precision or scale is not provided. When reading such a column, Snowflake simply extracts the raw hexadecimal string from the Avro file. |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p < s <= 38` | N/A | In Snowflake, it's invalid for the scale to be greater than the precision. If an Avro file contains a decimal with scale greater than precision (which is valid in Oracle), Snowflake can't read the file, and the file will be silently skipped. |
| `{"type": "int", "logicalType": "date"}` | NUMBER(38, 0) | N/A |
| `{"type": "int", "logicalType": "time-millis"}` | NUMBER(38, 0) | N/A |
| `{"type": "long", "logicalType": "timestamp-millis"}` | NUMBER(38, 0) | N/A |
| `{"type": "array", "items": ["null", "element_type"]}` | VARIANT | The data will be represented in a structure like `[<element>, <element>, ...]`. |
| `{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]}` | VARIANT | The data will be represented in a structure like `{"scale": 5, "value": "7048860DDF79"}` (see the decoding sketch after this table). |
| `{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]}` | VARIANT | The data will be represented in a structure like `{"srid": null, "wkb": "01010000000000000000000040000000000000F03F", "x": 2.000000000000000e+00, "y": 1.000000000000000e+00}` (see the decoding sketch after this table). |

Amazon Redshift data types

Warning

A single string value loaded to SUPER is limited to the maximum VARCHAR length of 65535 bytes.

Primitive data types

| Avro schema type | Redshift type | Notes |
| --- | --- | --- |
| string | VARCHAR | All VARCHAR columns are created with the maximum size. |
| int | INT | N/A |
| long | BIGINT | N/A |
| float | REAL (FLOAT4) | N/A |
| double | FLOAT8 | N/A |
| boolean | BOOL | N/A |
| bytes | VARCHAR | All VARCHAR columns are created with the maximum size. Matillion ETL for Redshift can't create VARBINARY columns, hence the mapping to VARCHAR. This type can't be mapped to SUPER because some numeric types (from Kafka Connect) arrive as bytes and must remain strings in case they are key columns. |

Logical data types

A logical type is an Avro primitive or complex type with extra attributes to represent a derived type. The attribute logicalType must always be present for a logical type, and is a string with the name of one of the logical types listed later in this section. Other attributes may be defined for particular logical types.

The Matillion ETL for Redshift shared job handles logical types by loading their underlying Avro data type. Custom parsing of some logical data types can also be performed, as detailed in Casting logical data types, below.

| Logical type | Avro schema type | Format |
| --- | --- | --- |
| decimal | bytes | `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` |
| date | int | `{"type": "int", "logicalType": "date"}` |
| time-millis | int | `{"type": "int", "logicalType": "time-millis"}` |
| timestamp-millis | long | `{"type": "long", "logicalType": "timestamp-millis"}` |
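As a minimal illustration (not part of the shared job) of what loading the underlying Avro data type means, the following Python prints the raw values these logical types carry; the dates and times are arbitrary examples:

```python
# Minimal sketch, not part of the shared job: the raw values that land in the
# target when logical types are loaded as their underlying Avro data type.
# The dates and times below are arbitrary examples.
from datetime import date, datetime, timezone

d = date(2021, 3, 15)
print((d - date(1970, 1, 1)).days)    # date             -> 18701 (int: days since the Unix epoch)

print((14 * 3600 + 30 * 60) * 1000)   # time-millis      -> 52200000 (int: milliseconds since midnight)

ts = datetime(2021, 3, 15, 14, 30, tzinfo=timezone.utc)
print(int(ts.timestamp() * 1000))     # timestamp-millis -> 1615818600000 (long: milliseconds since the epoch)

# decimal: the bytes hold the big-endian two's-complement unscaled value;
# with scale 2, the bytes 0x01E240 (= 123456) represent 1234.56.
print(int.from_bytes(b"\x01\xe2\x40", "big", signed=True))  # -> 123456
```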

Complex data types

| Avro schema type | Redshift type | Notes |
| --- | --- | --- |
| union | N/A | Represents a collection of possible data types, in the format `[<DataType1>, ... <DataTypeN>]`. By convention, a union is only used to indicate that a field is nullable, so its format will always be `["null", <DataType>]`; Matillion ETL then decodes the `<DataType>` accordingly (see the sketch after this table). |
| array | SUPER | Each value should be a valid JSON_ARRAY. |
| enum | VARCHAR | All VARCHAR columns are created with the maximum size. |
| map | SUPER | Each value should be a valid JSON_OBJECT. |
| record | SUPER | A record holds a fixed number of fields. By convention, these types are used to describe custom schema data types, where the name/namespace identify what the underlying fields mean. Format: `{"type": "record", "name": "<CustomDataType>", "namespace": "<CustomDataNamespace>", "fields": [<AVRO FIELD DEFINITIONS>, ...]}`, where `<namespace>.<name>` and `<name>` are valid identifiers for the underlying custom record type. |
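As a minimal sketch (not the shared job's implementation) of the union convention described above, decoding a nullable field reduces to discarding the `"null"` branch:

```python
# Minimal sketch, not the shared job's implementation: unwrapping the nullable
# union convention ["null", <DataType>] described in the table above.
def unwrap_nullable(avro_type):
    """Return the effective type of a field, discarding the "null" branch."""
    if isinstance(avro_type, list):  # a union
        branches = [t for t in avro_type if t != "null"]
        if len(branches) != 1:
            raise ValueError(f"unexpected union shape: {avro_type}")
        return branches[0]
    return avro_type


print(unwrap_nullable(["null", "bytes"]))                                # -> bytes
print(unwrap_nullable(["null", {"type": "int", "logicalType": "date"}]))
# -> {'type': 'int', 'logicalType': 'date'}
```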

Delta Lake on Databricks data types

Primitive data types

| Avro schema type | Databricks type |
| --- | --- |
| string | string |
| int | int |
| long | long |
| float | double |
| double | double |
| boolean | boolean |
| bytes | binary |

Complex data types

| Avro schema type | Databricks type | Notes |
| --- | --- | --- |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p <= 38` and `0 <= s <= p` | decimal(p, s) | N/A |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `p > 38` and `0 <= s <= p` | N/A | If an Avro file contains a decimal with precision greater than 38, Databricks will throw the following error: `org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38` |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p < s <= 38` | N/A | If an Avro file contains a decimal with scale greater than precision (which is valid in Oracle), Databricks will throw the following error: `com.databricks.sql.io.FileReadException: Error while reading file` |
| `{"type": "int", "logicalType": "date"}` | int | N/A |
| `{"type": "int", "logicalType": "time-millis"}` | int | N/A |
| `{"type": "long", "logicalType": "timestamp-millis"}` | long | N/A |
| `{"type": "array", "items": ["null", "element_type"]}` | `array<element_type>` | N/A |
| `{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]}` | `struct<scale:int, value:binary>` | N/A |
| `{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]}` | `struct<x:double, y:double, wkb:binary, srid:int>` | N/A |

Google BigQuery data types

Primitive data types

| Avro schema type | BigQuery type |
| --- | --- |
| string | STRING |
| int | INT64 |
| long | INT64 |
| float | FLOAT64 |
| double | FLOAT64 |
| boolean | BOOL |
| bytes | BYTES |

Extended data types

| Avro schema type | BigQuery type | Notes |
| --- | --- | --- |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p - s <= 29` and `0 <= s <= 9` | NUMERIC | If the precision and scale fit in a NUMERIC type, NUMERIC is used. If either the precision or the scale is out of bounds for NUMERIC but within the bounds of BIGNUMERIC, BIGNUMERIC is used (see the sketch after this table). |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `1 <= p - s <= 38` and `0 <= s <= 38` | BIGNUMERIC | If the precision and scale fit in a NUMERIC type, NUMERIC is used. If either the precision or the scale is out of bounds for NUMERIC but within the bounds of BIGNUMERIC, BIGNUMERIC is used. |
| `{"type": "bytes", "scale": s, "precision": p, "logicalType": "decimal"}` where `p - s > 38` or `s > 38` | N/A | BigQuery can't process decimal numbers with more than 38 digits before or after the decimal point. The shared job will raise an error and fail the iteration. |
| `{"type": "int", "logicalType": "date"}` | INT64 | N/A |
| `{"type": "int", "logicalType": "time-millis"}` | INT64 | N/A |
| `{"type": "long", "logicalType": "timestamp-millis"}` | INT64 | N/A |
| `{"type": "array", "items": ["null", "element_type"]}` | `ARRAY<element_type>` | N/A |
| `{"type": "record", "name": "VariableScaleDecimal", "namespace": "io.debezium.data", "fields": [{"name": "scale", "type": "int"}, {"name": "value", "type": "bytes"}]}` | `STRUCT<scale INT64, value BYTES>` | N/A |
| `{"type": "record", "name": "Point", "namespace": "io.debezium.data.geometry", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "wkb", "type": ["null", "bytes"], "default": null}, {"name": "srid", "type": ["null", "int"], "default": null}]}` | `STRUCT<x FLOAT64, y FLOAT64, wkb BYTES, srid INT64>` | N/A |
| `{"type": "record", "name": "Geometry", "namespace": "io.debezium.data.geometry", "fields": [{"name": "wkb", "type": "bytes"}, {"name": "srid", "type": ["null", "int"], "default": null}]}` | `STRUCT<wkb BYTES, srid INT64>` | N/A |
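The following is a minimal sketch (not the shared job's implementation) of the NUMERIC/BIGNUMERIC selection described in the decimal rows above; the choice depends only on the integer digits (p - s) and the scale (s):

```python
# Minimal sketch, not the shared job's implementation: choosing between
# NUMERIC and BIGNUMERIC using the bounds stated in the decimal rows above.
def bigquery_decimal_type(precision: int, scale: int) -> str:
    integer_digits = precision - scale
    if 1 <= integer_digits <= 29 and 0 <= scale <= 9:
        return "NUMERIC"
    if 1 <= integer_digits <= 38 and 0 <= scale <= 38:
        return "BIGNUMERIC"
    raise ValueError(
        f"decimal({precision},{scale}) needs more than 38 digits before or "
        "after the decimal point, which BigQuery can't represent"
    )


print(bigquery_decimal_type(18, 4))   # -> NUMERIC
print(bigquery_decimal_type(50, 12))  # -> BIGNUMERIC (scale exceeds NUMERIC's 9)
print(bigquery_decimal_type(38, 0))   # -> BIGNUMERIC (38 integer digits exceed NUMERIC's 29)
```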

Casting logical data types

When the source table contains columns with the logical data types timestamp-millis, time-millis, or date, the CDC shared job casts them to timestamp, time, or date columns in the target table.

To use this functionality, you must include the parameter time.precision.mode = connect when you create your pipeline in Data Loader.
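For illustration, here is a minimal sketch (not the shared job's implementation) of what these casts amount to, assuming the representations delivered with time.precision.mode = connect: days since the Unix epoch for date, milliseconds since midnight for time, and milliseconds since the epoch for timestamp.

```python
# Minimal sketch, not the shared job's implementation: casting the raw values
# delivered with time.precision.mode = connect, i.e. days since the Unix epoch
# (date), milliseconds since midnight (time), and milliseconds since the
# epoch (timestamp).
from datetime import date, datetime, time, timedelta, timezone


def cast_date(days: int) -> date:
    return date(1970, 1, 1) + timedelta(days=days)


def cast_time(millis: int) -> time:
    return (datetime(1970, 1, 1) + timedelta(milliseconds=millis)).time()


def cast_timestamp(millis: int) -> datetime:
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


print(cast_date(18701))               # -> 2021-03-15
print(cast_time(52200000))            # -> 14:30:00
print(cast_timestamp(1615818600000))  # -> 2021-03-15 14:30:00+00:00
```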

Note

Ensure that you are using revision 2 of the shared job, identified by (rev.2) in the job name.