Configuring PostgreSQL database
The CDC Agent's PostgreSQL connector is able to monitor and consume changes as they're occurring within your PostgreSQL database. To achieve this, the connector uses PostgreSQLs logical decoding mechanism which is tracking and extracting all persistent changes to a database's tables. This process will therefore need to claim a replication slot to enable CDC. To ensure you are setup for CDC, please continue reading this setup guide.
The PostgreSQL connector currently supports PostgreSQL versions 10 and above. These versions contain the
pgoutput logical decoding plugin natively, so no further plugin installations are required.
WAL considerations for PostgreSQL
The WAL (Write-Ahead Log) doesn't take up a constant amount of disk space. It can grow as changes come in, and shrink over time, or as checkpoints are made. The PostgreSQL database will automatically attempt to reclaim disk space in blocks when that data is no longer needed.
In some cases, the WAL may grow as changes come in, and shrink over time, or as checkpoints are made. While the WAL is shared by all databases, tables and schemas, your CDC agent may only track a select few. This means the WAL can quickly fill with untracked changes, particularly if your untracked resources have high traffic while your tracked resources have low traffic. Configuring the heartbeat for your CDC agent so your agent makes regular updates to a table in the database can help avoid this issue.
To complete the steps detailed in this guide you will need:
- An account on PostgreSQL. This is most commonly a super user account, but it doesn't have to be.
- An understanding of which database, and schemas you will be targeting for CDC.
- A list of tables which you want to monitor, and capture changes for.
Setting up PostgreSQL for CDC
The initial requirement is to ensure that the
max_replication_slots are set appropriately to allow the connector to establish a connection, and claim a replication slot.
There are two different approaches to configuring these values, which are explained in the two options below:
Option 1: Update the
This configuration file can be updated to set all three properties. For logical decoding, the
wal_levelmust be set to
logical, and there must be one available replication slot, which can be claimed by the connector. An example configuration is included below:
# REPLICATION wal_level = logical -- instructs the server to use logical decoding with the write-ahead log max_wal_senders = 1 -- use a maximum of 1 separate process for processing WAL changes| default: 10 max_replication_slots = 1 -- allow a maximum of 1 replication slot to be created for streaming WAL changes| default: 10
SHOW config_filesnippet via PostgreSQL to view the location of this file.
Once the above values are finalized and saved, a restart is required to apply any changes made.
Option 2: Update the system properties via SQL
An alternative method can be used to update these properties using SQL. Once updated, a restart is still required to apply the changes. An example SQL script is provided below. Once the following is applied, the database server will be ready for CDC:
SHOW wal_level -- can be used with max_wal_senders and max_replication_slots ALTER SYSTEM SET wal_level = logical -- can be used with max_wal_senders and max_replication_slots
Configuration instructions vary slightly on RDS instances. To configure your RDS PostgreSQL server, the RDS instance parameter
rds.logical_replicationneeds to be set to
1. Applying this change will require a restart. You may decide to apply the changes immediately or, if this is not possible, wait for a planned maintenance window. Once applied, you can query
SHOW wal_levelto ensure the property is set to
For PostgreSQLs write-ahead logs, the default setting for
0- this setting configures the size of WAL files held for standby servers. If your retention value is too low, and you experience an outage, change data capture may not be able to recover. WAL segments may already have moved beyond a stored position. This is typically indicated with a startup error as follows:
ERROR: requested WAL segment 000000010000000000000001 has already been removed. If this were to occur, it will be necessary for the CDC agent to take a new snapshot of the database.
Replication slots are guaranteed to retain all WAL segments. It's important to closely monitor replication slots to avoid excessive disk consumption, and other conditions that may occur, such as catalog bloat, if a replication slot stays unused for too long. It's recommended that you review the official documentation for the mechanics and configuration of write-ahead logs.
Setting up permissions
Once the server configuration has been completed you will need to setup an appropriate role within the database for the CDC agent to use.
It's not recommended to give elevated privileges or superuser permissions for this role, and the master account shouldn't be used for CDC purposes. Roles and users are synonymous in PostgreSQL. For more information, refer to the PostgreSQL documentation.
To setup a role with the correct permissions, a superuser will need to complete the following steps:
Create a new role with
CREATE ROLE cdcrole WITH REPLICATION LOGIN PASSWORD 'xxx' -- a suitable password should be provided
Replication permissions for RDS PostgreSQL require a different process. Refer to step 5 for more information.
Assign the role
USAGEpermissions on any schema you are intending to use within your CDC pipeline.
GRANT USAGE ON SCHEMA public TO cdcrole;
If you need to perform a snapshot, assign the
SELECTpermissions on any tables you wish to perform CDC against. This is required for the
snapshotstage of the process. If you don't need to perform a snapshot,
SELECTpermissions aren't required.
-- Grant the new user with permissions to run SELECT on the CDC marked tables GRANT SELECT ON public.phone TO cdcrole; GRANT SELECT ON public.customer TO cdcrole; -- You can also grant ALL TABLES to the user if required GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdcrole;
A new publication will need to be created with the name
matillion_cdc_publication. All CDC marked tables should be added to this publication. This publication must exist for the
Snapshottingstage of the process. An example is included below:
CREATE PUBLICATION matillion_cdc_publication FOR TABLE public.phone, public.customer -- The default publish setting is appropriate in this case
For RDS replication permissions you must have
superuseraccess to perform the following command. Your RDS instance will automatically create an
rds_replicationrole. You will not be able to assign the
REPLICATIONoption directly to your new user, instead you can grant the
rds_replicationrole to your user.
CREATE ROLE cdcrole WITH LOGIN PASSWORD 'xxx' -- Assign the rds_replication role to the new user GRANT rds_replication TO cdcrole
PostgreSQL pipeline check for replication slot
A replication slot is a feature provided by the PostgreSQL database management system, to facilitate streaming replication and CDC. A unique identifier (slot name) associated with a replication slot will be generated, and the default name will be
matillion_cdc. The agent can connect to the replication slot by referencing the slot name, and consume the captured changes in real-time. When the agent connects, it will look for the replication slot name,
matillion_cdc, and if it doesn't find it, then the agent will create one.
If your pipeline's status didn't change to
Snapshotting, and remained in the
Not Running status, this may have been caused by the Replication Slot creation not completing as expected.
One of the reasons for this occurring is an exclusive lock held on one of the tables included in the pipeline selection, this can cause the agent to wait for the replication slot to be fully generated. One of the ways this can be identified, is by running the following commands.
To identify active SQL queries within the PostgreSQL instance, use
SELECT * FROM pg_stat_activity
If the above command shows the
CREATE_REPLICATION_SLOTcommand is still running, the
pg_locksview needs to be examined using the related PID. The queries in steps 2 and 3 use the
pg_locksview, which shows active processes, and the locks they hold within the PostgreSQL instance:
SELECT * FROM pg_locks WHERE pid = ?
If any locks have been identified as present for the given PID, use the associated transaction ID from the above SQL command to interrogate the
SELECT * FROM pg_locks WHERE transactionid = ?
When executing the above command for a given transaction ID, if there are multiple locks returned by separate processes, It has likely identified a situation where the replication slot generation is blocked by another process. You can look back at the
pg_stat_activityview, using the PID returned from this command to identify why the execution hasn't completed.
SELECT * FROM pg_stat_activity WHERE pid = ?
From the output of this command, it should be obvious if other activities are holding locks on the tables included in the pipeline.
You're required to wait for the exclusive lock process to complete. Your pipeline's status will then be set to "Snapshotting".
Now the PostgreSQL database has been configured, you can connect the CDC agent to your database. For more information, read Data Loader Pipeline UI.