Skip to content

PostgreSQL connector

The CDC agent's PostgreSQL connector can monitor and capture row-level changes within PostgreSQL schemas in a non-intrusive and performant manner, and achieves this by using the raw output stream from the pgoutput decoding plugin. The connector produces a change event record for every insert, update, and delete event occurring in any tables being monitored by the CDC agent.

Note

We are using the Debezium PostgreSQL connector to connect to this source. For more information, read Debezium connector for PostgreSQL.


Versions

Data Loader CDC supports PostgreSQL versions 12, 13, 14, and 15 (including minor versions). PostgreSQL versions 12 and above contain the pgoutput plugin by default to capture logs natively.

Note

To upgrade your PostgreSQL database, read Upgrading PostgreSQL databases.


Limitations

Change data capture has the following limitations:

  • The numeric values NaN, Infinity, and -Infinity are not supported.

Database rules

The following rules must be applied to the PostgreSQL database:

Note

Aurora Serverless database clusters do not support CDC. Only provisioned Aurora database clusters support CDC.


Connect to PostgreSQL

When selecting PostgreSQL as a source during pipeline creation, you will be required to provide the following information:

Property Description
Server address The server address of your PostgreSQL database.
Port The port number used to access your PostgreSQL database.
Database Name The name of your PostgreSQL Container Database installation.
Username The username used to log in to the specified database.
Secret Provider Choose the provider for your secrets manager that contains your database password.
Secret Name Enter the name of the secret that corresponds to your database password. Read Manage secrets to learn how to add a named secret to your provider's secret manager.

Tables

Property Description
All Tables Tables available from your defined PostgreSQL source will be displayed in Matillion CDC. These tables are defined and created by the user in their PostgreSQL database and thus can't be described here.
Tables to extract and load Selecting and moving a table to this column will include it in the CDC pipeline.

How the connector works

PostgreSQL normally purges write-ahead log segments (WAL) after a period of time. This means a complete history of changes are not available for the connector. To overcome this scenario, the connector will initially perform a consistent snapshot for all schemas and tables that are being monitored. This will allow the connector to establish a base state after which streaming can begin.

Note

Depending on the number of rows within your schema, this may take a while to complete.

If required, the snapshot stage can be excluded if you wish to omit historic data from your CDC pipeline by disabling snapshots in the Pipeline Settings.

Once the snapshot stage has completed, the connector will move to the streaming stage. All changes that have occurred since the snapshot started will be captured during the streaming process - no changes will be missed.

The streaming stage will continue to monitor and consume changes as and when they're occurring within the database and any produced change events will be exported to your selected data lake in a consistent and predictable pattern. The connector will maintain an acceptable time lag behind the source database - this lag can be monitored through the Data Loader UI.

The connector is tolerant of failures. As the connector reads changes and produces events, it records the WAL position for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), upon restart, the connector continues reading the WAL where it last stopped.

If the connector stopped at any point during the snapshot stage, a new snapshot will be taken on restart of the pipeline.


Unavailable field values

When a field is not part of the replica identity for a table and the value is large, Postgres may use TOAST to store the value. For change records where a TOASTed value hasn't been modified as part of the change, that field within the change record will contain the following value: __value_not_modified__.

Note

For CDC Agent version 2.77.1 and earlier, this value is __debezium_unavailable_value.

TOAST values aren't treated any differently from the actual values when replicating changes into the target tables. This means that it's possible for the replica table to inadvertently have rows where the value is updated to this placeholder value, and the actual value of the row is lost.

To avoid this case, set the replica identity of the table to FULL, which ensures all values are always included in the change records. The replica identity setting for a table can be changed with an alter table statement:

ALTER TABLE <the_table> REPLICA IDENTITY FULL;

Next steps

You need to configure and set up your PostgreSQL database. For detailed information on how to do this, read Configuring PostgreSQL database guide.