Configure your PostgreSQL database
The PostgreSQL streaming connector can monitor and consume changes as they're occurring within your PostgreSQL database. To achieve this, the connector uses PostgreSQL's logical decoding mechanism, which is tracking and extracting all persistent changes to a database's tables. This process will therefore need to claim a replication slot to enable streaming.
These steps are based on advice that can be found in the PostgreSQL documentation, particularly concerning replication, and the write-ahead log.
Note
The PostgreSQL connector currently supports PostgreSQL versions 12 and above. These versions contain the pgoutput
logical decoding plugin natively, so no further plugin installations are required.
WAL considerations for PostgreSQL
The WAL (Write-Ahead Log) doesn't take up a constant amount of disk space. It can grow as changes come in, and shrink over time, or as checkpoints are made. The PostgreSQL database will automatically attempt to reclaim disk space in blocks when that data is no longer needed.
While the WAL is shared by all databases, tables, and schemas, your streaming pipeline may only track a select few. This means the WAL can quickly fill with untracked changes, particularly if your untracked resources have high traffic while your tracked resources have low traffic.
Prerequisites
To complete the steps detailed in this guide you will need:
- A PostgreSQL account. This is most commonly a superuser account, but it doesn't have to be.
- An understanding of which database and schemas you will be targeting for streaming.
- A list of tables that you want to monitor, and capture changes for.
Setting up PostgreSQL for streaming
The initial requirement is to ensure that the wal_level
, max_wal_senders
, and max_replication_slots
are set appropriately to allow the connector to establish a connection, and claim a replication slot.
There are two different approaches to configuring these values, which are explained in the two options below:
Option 1: Update the postgresql.conf
file
-
This configuration file can be updated to set all three properties. For logical decoding, the
wal_level
must be set tological
, and there must be one available replication slot, which can be claimed by the connector. An example configuration is included below:# REPLICATION wal_level = logical -- instructs the server to use logical decoding with the write-ahead log max_wal_senders = 1 -- use a maximum of 1 separate process for processing WAL changes| default: 10 max_replication_slots = 1 -- allow a maximum of 1 replication slot to be created for streaming WAL changes| default: 10
-
Run the
SHOW config_file
snippet via PostgreSQL to view the location of this file. - Once the above values are finalized and saved, a restart is required to apply any changes made.
Option 2: Update the system properties via SQL
-
An alternative method can be used to update these properties using SQL. Once updated, a restart is still required to apply the changes. An example SQL script is provided below. Once the following is applied, the database server will be ready for streaming:
SHOW wal_level; -- can be used with max_wal_senders and max_replication_slots ALTER SYSTEM SET wal_level = logical; -- can be used with max_wal_senders and max_replication_slots
-
Configuration instructions vary slightly on RDS instances. To configure your RDS PostgreSQL server, the RDS instance parameter
rds.logical_replication
needs to be set to1
. Applying this change will require a restart. You may decide to apply the changes immediately or, if this is not possible, wait for a planned maintenance window. Once applied, you can querySHOW wal_level
to ensure the property is set tological
. -
For PostgreSQL's write-ahead logs, the default setting for
wal_keep_segments
is0
. This setting configures the size of WAL files held for standby servers. If your retention value is too low, and you experience an outage, streaming may not be able to recover. WAL segments may already have moved beyond a stored position. This is typically indicated with a startup error as follows:ERROR: requested WAL segment 000000010000000000000001 has already been removed
. If this were to occur, it would be necessary for the streaming agent to take a new snapshot of the database.Note
Replication slots are guaranteed to retain all WAL segments. It's important to closely monitor replication slots to avoid excessive disk consumption, and other conditions that may occur, such as catalog bloat, if a replication slot stays unused for too long. It's recommended that you review the official documentation for the mechanics and configuration of write-ahead logs.
Setting up permissions
Once the server configuration has been completed you will need to set up an appropriate role within the database for the streaming agent to use.
It's not recommended to give elevated privileges or superuser permissions for this role, and the master account shouldn't be used for streaming purposes. Roles and users are synonymous in PostgreSQL. For more information, refer to the PostgreSQL documentation.
To set up a role with the correct permissions, a superuser will need to complete the following steps:
-
Create a new role with
REPLICATION
andLOGIN
permissions.CREATE ROLE streaming_role WITH REPLICATION LOGIN PASSWORD 'xxx'; -- a suitable password should be provided
Note
Replication permissions for RDS PostgreSQL require a different process. Refer to step 5 for more information.
-
Assign the role
USAGE
permissions on any schema you are intending to use within your streaming pipeline.GRANT USAGE ON SCHEMA public TO streaming_role;
-
If you need to perform a snapshot, assign the
SELECT
permissions on any tables you wish to perform streaming against. This is required for thesnapshot
stage of the process. If you don't need to perform a snapshot,SELECT
permissions aren't required.-- Grant the new user with permissions to run SELECT on the streaming-marked tables GRANT SELECT ON public.phone TO streaming_role; GRANT SELECT ON public.customer TO streaming_role; -- You can also grant ALL TABLES to the user if required GRANT SELECT ON ALL TABLES IN SCHEMA public TO streaming_role;
-
A new publication will need to be created with the name
matillion_cdc_publication
. All streaming-marked tables should be added to this publication. This publication must exist for the snapshotting stage of the process. An example is included below:CREATE PUBLICATION matillion_cdc_publication FOR TABLE public.phone, public.customer; -- The default publish setting is appropriate in this case
-
For RDS replication permissions you must have superuser access to perform the following command. Your RDS instance will automatically create an
rds_replication
role. You will not be able to assign theREPLICATION
option directly to your new user. Instead, you can grant therds_replication
role to your user.CREATE ROLE streaming_role WITH LOGIN PASSWORD 'xxx'; -- Assign the rds_replication role to the new user GRANT rds_replication TO streaming_role;
PostgreSQL pipeline check for replication slot
A replication slot is a feature provided by the PostgreSQL database management system to facilitate streaming replication and change data capture. A unique identifier (slot name) associated with a replication slot will be generated, and the default name will be matillion_cdc
. The agent can connect to the replication slot by referencing the slot name, and consume the captured changes in real time. When the agent connects, it will look for the replication slot name, matillion_cdc
, and if it doesn't find it, then the agent will create one.
If your pipeline's status didn't change to Snapshotting
, and remained in the Not Running
status, this may have been caused by the Replication Slot creation not completing as expected.
One of the reasons for this occurring is an exclusive lock held on one of the tables included in the pipeline selection—this can cause the agent to wait for the replication slot to be fully generated. One of the ways this can be identified is by running the following commands.
-
To identify active SQL queries within the PostgreSQL instance, use
pg_stat_activity
:SELECT * FROM pg_stat_activity;
-
If the above command shows the
CREATE_REPLICATION_SLOT
command is still running, thepg_locks
view needs to be examined using the related PID. The queries in steps 2 and 3 use thepg_locks
view, which shows active processes, and the locks they hold within the PostgreSQL instance:SELECT * FROM pg_locks WHERE pid = ?;
-
If any locks have been identified as present for the given PID, use the associated transaction ID from the above SQL command to interrogate the
pg_locks
view again:SELECT * FROM pg_locks WHERE transactionid = ?;
-
When executing the above command for a given transaction ID, if there are multiple locks returned by separate processes, it has likely identified a situation where the replication slot generation is blocked by another process. You can look back at the
pg_stat_activity
view, using the PID returned from this command to identify why the execution hasn't completed.SELECT * FROM pg_stat_activity WHERE pid = ?;
From the output of this command, it should be clear if other activities are holding locks on the tables included in the pipeline.
Note
You're required to wait for the exclusive lock process to complete. Your pipeline's status will then be set to "Snapshotting".
Upgrading PostgreSQL databases
Upgrading a PostgreSQL database involves moving data from an older version of PostgreSQL to a newer version, while ensuring data integrity and compatibility. The process may vary slightly depending on your operating system, so it's important to check your PostgreSQL connector version, and the official PostgreSQL documentation for your specific version and platform.
Upgrading a PostgreSQL database is important for the following reasons:
- Security
- Bug fixes
- Performance improvements
- Feature enhancements
- Compatibility
- End of life (EOL) support
- Data integrity
- Regulatory compliance
For more information about how to upgrade your PostgreSQL database, read Upgrading PostgreSQL.