PostgreSQL streaming connector

The PostgreSQL connector can monitor and capture row-level changes within PostgreSQL schemas in a non-intrusive and performant manner. It does this by reading the raw output stream of the pgoutput logical decoding plugin. The connector produces a change event record for every insert, update, and delete that occurs in any table monitored by the streaming agent.

Note

We are using the Debezium PostgreSQL connector to connect to this source. For more information, read Debezium connector for PostgreSQL.


How the connector works

PostgreSQL normally purges write-ahead log (WAL) segments after a period of time, which means a complete history of changes is not available to the connector. To work around this, the connector initially performs a consistent snapshot of all schemas and tables that are being monitored. This allows the connector to establish a base state, after which streaming can begin.

Note

Depending on the number of rows within your schema, this may take a while to complete.

Once the snapshot stage has completed, the connector will move to the streaming stage. All changes that have occurred since the snapshot started will be captured during the streaming process. No changes will be missed.

The streaming stage continues to monitor and consume changes as they occur within the database, and any change events produced are exported to your selected destination in a consistent and predictable pattern. The connector maintains an acceptable time lag behind the source database; this lag can be monitored through the pipeline dashboard.

The connector is tolerant of failures. As the connector reads changes and produces events, it records the WAL position for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), upon restart, the connector continues reading the WAL where it last stopped.

If the connector stopped at any point during the snapshot stage, a new snapshot will be taken on restart of the pipeline.


Prerequisites

Review the following information about supported versions, limitations, and database rules.

Versions

The Data Productivity Cloud supports PostgreSQL versions 12, 13, 14, and 15 (including minor versions). PostgreSQL versions 12 and above contain the pgoutput plugin by default to capture logs natively.
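
If you're unsure which version your source database is running, you can check it directly on the source:

SHOW server_version;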

Limitations

  • The numeric values NaN, Infinity, and -Infinity are not supported.
  • Amazon Aurora Serverless database clusters do not support streaming. Only provisioned Amazon Aurora database clusters support streaming.

Database rules

  • Your PostgreSQL database must be configured for streaming. Read Configuring PostgreSQL database to learn more.
  • Your PostgreSQL database must be hosted on the primary server, and this server must be active.
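
As a quick sanity check (the full configuration steps are described in Configuring PostgreSQL database), you can confirm that logical decoding is enabled and that you're connected to the primary server:

SHOW wal_level;             -- should return 'logical'
SELECT pg_is_in_recovery(); -- should return 'false' on the primary server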

Source setup

Refer to this section when you create a streaming pipeline.

Server address = string

The server address of your PostgreSQL database.


Port = integer

The port number used to access your PostgreSQL database. The default is 5432.


Database name = string

The name of your PostgreSQL database.


Username = string

The username used to log in to the specified database.


Secrets Manager = drop-down

Choose the service you use to manage secrets.


Secret name = string

The name of the secret in your secret manager service that references your PostgreSQL password.


JDBC parameters and specific connection settings = column editor

This parameter is optional.

Specify any parameter:value pairs as part of an advanced connection. Click Save to finish adding parameters.
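
For example, to require an encrypted connection and cap the connection timeout, you might add pairs such as the following. Both are standard PostgreSQL JDBC driver parameters, but check the driver documentation and your agent configuration to confirm which options apply in your environment:

sslmode:require
connectTimeout:30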

Click Connect to establish the connection to your source database.


Unavailable field values

When a field (column) is not part of the replica identity for a table and its value is large, PostgreSQL may use TOAST to store the value. In change records where a TOASTed value hasn't been modified as part of the change, that field will contain the following placeholder value: __value_not_modified__.

TOAST values aren't treated any differently from the actual values when replicating changes into the target tables. This means that it's possible for the replica table to inadvertently have rows where the value is updated to this placeholder value, and the actual value of the row is lost.

To avoid this case, set the replica identity of the table to FULL, which ensures all values are always included in the change records. The replica identity setting for a table can be changed with an ALTER TABLE statement:

ALTER TABLE <the_table> REPLICA IDENTITY FULL;
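
If you're unsure which replica identity a table currently uses, you can check the pg_class catalog; relreplident is d for default, f for full, n for nothing, and i for index:

SELECT relreplident
FROM pg_class
WHERE oid = '<the_table>'::regclass;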

Create a signal table

Follow this section to create a PostgreSQL signal table.

  1. Create a PostgreSQL signal table with the following SQL statement:

    CREATE TABLE IF NOT EXISTS <schema>.<table> (
    id VARCHAR(42) PRIMARY KEY,
    type VARCHAR(32) NOT NULL,
    data VARCHAR(2048) NULL
    );
    

    Where <schema> and <table> are suitable names for the schema and signal table.

    The signal table should have three columns:

    • id: An identifier for each signal execution of type VARCHAR(42).
    • type: Indicates the type of signal being sent of type VARCHAR(32).
    • data: Stores additional data related to the signal of type VARCHAR(2048).
  2. If the signal table is not already included in the streaming publication, you should add it using the following SQL:

    ALTER PUBLICATION matillion_cdc_publication ADD TABLE <schema>.<table>;
    

    Where <schema> and <table> are the names you chose when creating your signal table.

  3. Grant the necessary privileges to the database role configured for running the pipeline:

    -- Grant the streaming role the ability to access the schema
    GRANT USAGE ON SCHEMA <schema> TO <streaming_role>;
    
    -- Grant the streaming role the ability to insert records into the signal table
    GRANT INSERT ON <schema>.<table> TO <streaming_role>;
    

    Where <schema> and <table> are the names you chose when creating your signal table, and <streaming_role> is the user configured for streaming.

Once you set up the signal table, on-demand snapshots can be enabled, and new tables added to the pipeline after the initial startup will be snapshotted. If on-demand snapshots are disabled, any new tables added to the pipeline will capture changes from that point onwards but won't sync their historical data with the target, and the pipeline will immediately have a STREAMING status.
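
The Data Productivity Cloud issues snapshot signals for you, but for reference, the underlying Debezium signalling mechanism works by inserting rows into the signal table. A manually issued on-demand snapshot signal would look something like the following sketch, where the id value is an arbitrary unique string and public.orders is a hypothetical table to snapshot:

INSERT INTO <schema>.<table> (id, type, data)
VALUES (
    'ad-hoc-snapshot-1',
    'execute-snapshot',
    '{"data-collections": ["public.orders"], "type": "incremental"}'
);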


Advanced settings

Advanced settings are optional. They are entered as parameter:value pairs in the Add advanced settings dialog. Some parameters you might want to add are discussed below.

slot.name

slot.name is the name of the PostgreSQL logical decoding slot that's created for streaming changes from a particular plugin for a particular database/schema. A valid replication slot name must contain only digits, lowercase characters, and underscores, and must be no more than 63 characters long.

By default, the slot name is matillion_cdc. A replication slot is a PostgreSQL feature that facilitates streaming replication: the agent connects to the slot by referencing its name and consumes the captured changes in real time. When the agent connects, it looks for a replication slot with the configured name (matillion_cdc by default) and creates one if it doesn't already exist.

Note

This setting must be used if you want to have more than one pipeline using the same source PostgreSQL database.
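
For example, a second pipeline reading from the same database could be given its own slot by adding a pair such as slot.name:matillion_cdc_pipeline2 (a hypothetical name) to the advanced settings. To see which replication slots already exist on the source, you can query pg_replication_slots:

SELECT slot_name, plugin, active
FROM pg_replication_slots;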

time.precision.mode

time.precision.mode controls how the connector maps PostgreSQL temporal types:

  • When the time.precision.mode property is set to adaptive, which is the default setting, the connector determines the literal type and semantic type for temporal columns based on their data type definition. This approach guarantees that the events accurately reflect the values stored in the database.
  • When the time.precision.mode property is set to connect, the connector uses Kafka Connect's built-in logical types to represent temporal values. This can be beneficial when consumers can only handle the standard logical types and cannot handle variable-precision time values. However, choosing this mode may lead to a loss of precision for PostgreSQL columns that have a fractional second precision greater than 3.
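
For example, to use the simpler mapping, add the following pair in the Add advanced settings dialog:

time.precision.mode:connect

With this mode, a value stored in a TIMESTAMP(6) column as 2024-01-01 12:00:00.123456 would be represented in the change event with millisecond precision only (2024-01-01 12:00:00.123).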

Configuring new tables

When adding tables to a pipeline, they must be correctly configured to allow the on-demand snapshot to take place and to ensure future changes on those tables will be streamed. Read Configure your PostgreSQL database for a more comprehensive outline of these requirements.

Add tables to the streaming publication

To enable on-demand snapshots and streaming for new tables, you must add the tables to the database publication. Use the following SQL statements:

For adding tables individually:

ALTER PUBLICATION matillion_cdc_publication ADD TABLE <schema>.<table>;

For adding all tables in a schema (supported in PostgreSQL 15 and later):

ALTER PUBLICATION matillion_cdc_publication ADD TABLES IN SCHEMA <schema>;

Replace <schema> and <table> with the names of each new table.
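
To confirm which tables are currently included in the publication, you can query pg_publication_tables:

SELECT schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'matillion_cdc_publication';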

Grant USAGE privileges for new schemas

If any of the new tables are in schemas that previously did not have tables captured in the pipeline, grant USAGE privileges to the database role configured for streaming. Use the following SQL statement:

GRANT USAGE ON SCHEMA <schema> TO <streaming_role>;

Replace <schema> with the schema of the new tables and <streaming_role> with the user configured for streaming. Repeat this step for every new schema.
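
To verify that the grant took effect, you can check the schema privilege for the role:

SELECT has_schema_privilege('<streaming_role>', '<schema>', 'USAGE');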

Grant SELECT privileges on new tables

To run on-demand snapshots, the user configured for streaming must have SELECT privileges on the new tables. Use the following SQL statements:

For granting SELECT permissions to tables individually:

GRANT SELECT ON <schema>.<table> TO <streaming_role>;

For granting SELECT permissions on all tables in a schema:

GRANT SELECT ON ALL TABLES IN SCHEMA <schema> TO <streaming_role>;

Replace <schema> and <table> with the names of each new table and <streaming_role> with the user configured for streaming.
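
Similarly, you can verify the SELECT privilege on an individual table:

SELECT has_table_privilege('<streaming_role>', '<schema>.<table>', 'SELECT');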

By following these steps, you ensure that new tables added to the pipeline will be properly configured for on-demand snapshots and streaming, allowing historical data to be captured without losing synchronization with the target. If on-demand snapshots are disabled, new tables will only capture changes from the point of addition and the pipeline will have a STREAMING status immediately.