Skip to content

PostgreSQL snapshots

When setting up a pipeline using the PostgreSQL connector, you can enable or disable snapshotting features. Snapshotting can be modified after a pipeline has started by stopping the pipeline and editing its configuration.

The PostgreSQL connector offers two snapshot configurations:

  • Initial snapshot.
  • On demand snapshot.

Initial snapshot

This configuration allows you to take an initial snapshot of the specified tables when the pipeline is started. The initial snapshot captures the current state of the data in those tables at the beginning of the pipeline execution. By enabling this option, you ensure that the pipeline starts with the most recent data available. The initial snapshot setting will only effect a pipeline on its first startup.

  1. Initial snapshot effect: The initial snapshot setting only impacts the pipeline during its first startup.
  2. First startup determination: A pipeline is considered to be in its first startup under the following conditions:
    • The target location is empty.
    • The last known location in the database logs is unknown or invalid.
  3. Initial snapshots enabled: If initial snapshots are enabled, the connector will fetch all historical data from the configured tables. During the initial snapshot process, the pipeline will have a SNAPSHOTTING status. If the pipeline is stopped before the snapshot concludes, the initial snapshot will start again when the pipeline is restarted.
  4. Snapshot completion and streaming: Once the initial snapshot is finished, the pipeline transitions to a STREAMING status. After this point, any changes that occur in the database will be streamed, ensuring that the target remains in sync with the source database.
  5. Initial snapshots disabled: If the initial snapshot option is disabled, the pipeline will immediately have a STREAMING status when it is started. Only changes made to the database from that point on will be streamed to the target.

On-demand snapshot

Enabling on demand snapshot allows you to trigger snapshots for specific tables in the pipeline manually. This means that new tables can be added to the capture list after the pipeline's initial startup, and snapshots for those tables can be triggered as needed. On demand snapshots are useful when you want to capture specific points in time or when you need to dynamically expand the set of tables being captured without losing historical data.

Note

On Demand Snapshots require a minimum agent version of 2.65.0

  1. Dynamic table addition: Enabling on demand snapshots lets you trigger snapshots for newly added tables in the pipeline. This ensures that historical data from the newly added tables is captured without any loss.
  2. On Demand snapshot process: While an on demand snapshot is in progress, the pipeline will display a SNAPSHOTTING status. If the pipeline is stopped during the snapshot process, the snapshot will resume on pipeline restart.
  3. Snapshot completion and streaming: After the on demand snapshot is finished, the pipeline returns to a STREAMING status. At this point, all tables in the target location will be in sync with the source.
  4. Signal table requirement: To enable on demand snapshots for PostgreSQL, a signal table must be configured in the source database. This table is tracked by the connector and automatically triggers snapshots when it receives a signal.

Note

  • An on-demand snapshot cannot be triggered on tables without a primary key.
  • Schema changes while a pipeline has a SNAPSHOTTING status should be avoided and will cause the pipeline to fail if they occur.

Setting up a signal table

  • A signal table should be pipeline-specific to avoid accidental overlap of snapshot executions.
  • The connection details provided for the source must have write permissions to the signal table.
  • The pipeline will insert a row into this table to trigger a snapshot.
  • The table should be created with the correct structure.
  • Supplemental logging must be enabled for the signal table.

1. Table creation

Create the signal table with the following SQL Statement:

CREATE TABLE IF NOT EXISTS <schema>.<table> (
id VARCHAR(42) PRIMARY KEY,
type VARCHAR(32) NOT NULL,
data VARCHAR(2048) NULL
);

The signal table should have three columns:

  • id: An identifier for each signal execution of type VARCHAR(42).
  • type: Indicates the type of signal being sent of type VARCHAR(32).
  • data: Stores additional data related to the signal of type VARCHAR(2048).

2. Add signal table to CDC publication

If the signal table is not already included in the CDC publication, you should add it using the following SQL:

ALTER PUBLICATION matillion_cdc_publication ADD TABLE <schema>.<table>;

Replace <schema> and <table> with the appropriate schema and table names for your signal table.

3. Database role privileges

Grant the necessary privileges to the database role configured for running the pipeline:

-- Grant the cdcrole the ability to access the schema
GRANT USAGE ON SCHEMA <schema> TO <cdcrole>

-- Grant the cdcrole the ability to insert records to the signal table
GRANT INSERT ON <schema>.<table> TO <cdcrole>

Replace <schema> and <table> with the names used in the create statement, and <cdcrole> with the user configured for CDC.

By correctly configuring the signal table, on demand snapshots can be enabled, and new tables added to the pipeline after the initial startup will be snapshotted. If on demand snapshots are disabled, any new tables added to the pipeline will capture changes from that point onwards but will not sync their historical data with the target, and the pipeline will immediately have a STREAMING status.


Configuring new tables

When adding tables to a pipeline, they must be correctly configured to allow the on demand snapshot to take place and to ensure change data capture will stream future changes on those tables. Read Configuring PostgreSQL database for a more comprehensive outline of these requirements.

1. Adding tables to the CDC publication

To enable on demand snapshots and CDC for new tables, you must add the tables to the database publication. Use the following SQL statements:

For adding tables individually:

ALTER PUBLICATION matillion_cdc_publication ADD TABLE <schema>.<table>;

For adding all tables in a schema:

ALTER PUBLICATION matillion_cdc_publication ADD TABLES IN SCHEMA <schema>;

Replace <schema> and <table> with the names of each new table.

2. Granting USAGE privileges for new schemas

If any of the new tables are in schemas that previously did not have tables captured in the pipeline, grant USAGE privileges to the database role configured for CDC. Use the following SQL statement:

GRANT USAGE ON SCHEMA <schema> TO <cdcrole>;

Replace <schema> with the schema of the new tables and <cdcrole> with the user configured for CDC. Repeat this step for every new schema.

3. Granting SELECT privileges on new tables

To run on demand snapshots, the user configured for CDC must have SELECT privileges on the new tables. Use the following SQL statements:

For granting SELECT permissions to tables individually:

GRANT SELECT ON <schema>.<table> TO <cdcrole>;

For granting SELECT permissions on all tables in a schema:

GRANT SELECT ON ALL TABLES IN SCHEMA public TO <cdcrole>;

Replace <schema> and <table> with the names of each new table and <cdcrole> with the user configured for CDC.

By following these steps, you ensure that new tables added to the pipeline will be properly configured for on demand snapshots and CDC, allowing historical data to be captured without losing sync with the target. If on demand snapshots are disabled, new tables will only capture changes from the point of addition and the pipeline will have a STREAMING status immediately.