Configuring PostgreSQL database

The CDC agent's PostgreSQL connector can monitor and consume changes as they're occurring within your PostgreSQL database. To achieve this, the connector uses PostgreSQL's logical decoding mechanism, which is tracking and extracting all persistent changes to a database's tables. This process will therefore need to claim a replication slot to enable CDC. To ensure you are set up for CDC, continue reading this setup guide.

These steps are based on advice that can be found in the PostgreSQL documentation, particularly concerning replication, and the write-ahead log.

Note

The PostgreSQL connector currently supports PostgreSQL versions 10 and above. These versions contain the pgoutput logical decoding plugin natively, so no further plugin installations are required.

WAL considerations for PostgreSQL

The WAL (Write-Ahead Log) doesn't take up a constant amount of disk space. It can grow as changes come in, and shrink over time, or as checkpoints are made. The PostgreSQL database will automatically attempt to reclaim disk space in blocks when that data is no longer needed.

While the WAL is shared by all databases, tables, and schemas, your CDC agent may only track a select few. This means the WAL can quickly fill with untracked changes, particularly if your untracked resources have high traffic while your tracked resources have low traffic. Configuring the heartbeat for your CDC agent so your agent makes regular updates to a table in the database can help to avoid this issue.

Prerequisites

To complete the steps detailed in this guide you will need:

A PostgreSQL account. This is most commonly a superuser account, but it doesn't have to be.
An understanding of which database and schemas you will be targeting for CDC.
A list of tables that you want to monitor, and capture changes for.

Setting up PostgreSQL for CDC

The initial requirement is to ensure that the wal_level, max_wal_senders, and max_replication_slots are set appropriately to allow the connector to establish a connection, and claim a replication slot.

There are two different approaches to configuring these values, which are explained in the two options below:

Option 1: Update the `postgresql.conf` file

This configuration file can be updated to set all three properties. For logical decoding, the wal_level must be set to logical, and there must be one available replication slot, which can be claimed by the connector. An example configuration is included below:

# REPLICATION
wal_level = logical   -- instructs the server to use logical decoding with the write-ahead log
max_wal_senders = 1   --  use a maximum of 1 separate process for processing WAL changes| default: 10
max_replication_slots = 1   -- allow a maximum of 1 replication slot to be created for streaming WAL changes| default: 10

Run the SHOW config_file snippet via PostgreSQL to view the location of this file.
Once the above values are finalized and saved, a restart is required to apply any changes made.

Option 2: Update the system properties via SQL

An alternative method can be used to update these properties using SQL. Once updated, a restart is still required to apply the changes. An example SQL script is provided below. Once the following is applied, the database server will be ready for CDC:
```
SHOW wal_level   -- can be used with max_wal_senders and max_replication_slots
ALTER SYSTEM SET wal_level = logical   -- can be used with max_wal_senders and max_replication_slots
```
Configuration instructions vary slightly on RDS instances. To configure your RDS PostgreSQL server, the RDS instance parameter rds.logical_replication needs to be set to 1. Applying this change will require a restart. You may decide to apply the changes immediately or, if this is not possible, wait for a planned maintenance window. Once applied, you can query SHOW wal_level to ensure the property is set to logical.
For PostgreSQL's write-ahead logs, the default setting for wal_keep_segments is 0. This setting configures the size of WAL files held for standby servers. If your retention value is too low, and you experience an outage, CDC may not be able to recover. WAL segments may already have moved beyond a stored position. This is typically indicated with a startup error as follows: ERROR: requested WAL segment 000000010000000000000001 has already been removed. If this were to occur, it would be necessary for the CDC agent to take a new snapshot of the database.

Note

Replication slots are guaranteed to retain all WAL segments. It's important to closely monitor replication slots to avoid excessive disk consumption, and other conditions that may occur, such as catalog bloat, if a replication slot stays unused for too long. It's recommended that you review the official documentation for the mechanics and configuration of write-ahead logs.

Setting up permissions

Once the server configuration has been completed you will need to set up an appropriate role within the database for the CDC agent to use.

It's not recommended to give elevated privileges or superuser permissions for this role, and the master account shouldn't be used for CDC purposes. Roles and users are synonymous in PostgreSQL. For more information, refer to the PostgreSQL documentation.

To set up a role with the correct permissions, a superuser will need to complete the following steps:

Create a new role with REPLICATION and LOGIN permissions.
```
CREATE ROLE cdcrole WITH REPLICATION LOGIN PASSWORD 'xxx' -- a suitable password should be provided
```
Note

Replication permissions for RDS PostgreSQL require a different process. Refer to step 5 for more information.
Assign the role USAGE permissions on any schema you are intending to use within your CDC pipeline.
```
GRANT USAGE ON SCHEMA public TO cdcrole;
```

If you need to perform a snapshot, assign the SELECT permissions on any tables you wish to perform CDC against. This is required for the snapshot stage of the process. If you don't need to perform a snapshot, SELECT permissions aren't required.

-- Grant the new user with permissions to run SELECT on the CDC marked tables
GRANT SELECT ON public.phone TO cdcrole;
GRANT SELECT ON public.customer TO cdcrole;

-- You can also grant ALL TABLES to the user if required
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdcrole;

A new publication will need to be created with the name matillion_cdc_publication. All CDC-marked tables should be added to this publication. This publication must exist for the Snapshotting stage of the process. An example is included below:
```
CREATE PUBLICATION matillion_cdc_publication FOR TABLE public.phone, public.customer
-- The default publish setting is appropriate in this case
```
For RDS replication permissions you must have superuser access to perform the following command. Your RDS instance will automatically create an rds_replication role. You will not be able to assign the REPLICATION option directly to your new user. Instead, you can grant the rds_replication role to your user.
```
CREATE ROLE cdcrole WITH LOGIN PASSWORD 'xxx'
-- Assign the rds_replication role to the new user
GRANT rds_replication TO cdcrole
```

PostgreSQL pipeline check for replication slot

A replication slot is a feature provided by the PostgreSQL database management system to facilitate streaming replication and CDC. A unique identifier (slot name) associated with a replication slot will be generated, and the default name will be matillion_cdc. The agent can connect to the replication slot by referencing the slot name, and consume the captured changes in real time. When the agent connects, it will look for the replication slot name, matillion_cdc, and if it doesn't find it, then the agent will create one.

If your pipeline's status didn't change to Snapshotting, and remained in the Not Running status, this may have been caused by the Replication Slot creation not completing as expected.

One of the reasons for this occurring is an exclusive lock held on one of the tables included in the pipeline selection—this can cause the agent to wait for the replication slot to be fully generated. One of the ways this can be identified is by running the following commands.

To identify active SQL queries within the PostgreSQL instance, use pg_stat_activity:
```
SELECT * 
FROM pg_stat_activity
```
If the above command shows the CREATE_REPLICATION_SLOT command is still running, the pg_locks view needs to be examined using the related PID. The queries in steps 2 and 3 use the pg_locks view, which shows active processes, and the locks they hold within the PostgreSQL instance:
```
SELECT * 
FROM pg_locks 
WHERE pid = ?
```
If any locks have been identified as present for the given PID, use the associated transaction ID from the above SQL command to interrogate the pg_locks view again:
```
SELECT *
FROM pg_locks
WHERE transactionid = ?
```
When executing the above command for a given transaction ID, if there are multiple locks returned by separate processes, it has likely identified a situation where the replication slot generation is blocked by another process. You can look back at the pg_stat_activity view, using the PID returned from this command to identify why the execution hasn't completed.
```
SELECT *
FROM pg_stat_activity
WHERE pid = ?
```

From the output of this command, it should be clear if other activities are holding locks on the tables included in the pipeline.

Note

You're required to wait for the exclusive lock process to complete. Your pipeline's status will then be set to "Snapshotting".

Upgrading PostgreSQL databases

Upgrading a PostgreSQL database involves moving data from an older version of PostgreSQL to a newer version, while ensuring data integrity and compatibility. The process may vary slightly depending on your operating system, so it's important to check your PostgreSQL connector version, and the official PostgreSQL documentation for your specific version and platform.

Upgrading a PostgreSQL database is important for the following reasons:

Security
Bug fixes
Performance improvements
Feature enhancements
Compatibility
End of life (EOL) support
Data integrity
Regulatory compliance

For more information about how to upgrade your PostgreSQL database, read Upgrading PostgreSQL.

Next steps

Now the PostgreSQL database has been configured, you can connect the CDC agent to your database. For more information, read Data Loader Pipeline UI.