Configuring and managing snapshots
A snapshot is a feature of Streaming pipelines that allows you to establish an up-to-date view of the source database, giving you a baseline from which to begin streaming changes.
This article details the process of configuring and managing snapshots for your streaming pipelines. Before beginning, you should ensure you understand the concepts behind the feature, as explained in Overview of snapshots.
Configuring a snapshot
Snapshots are configured separately for each Streaming pipeline. Before a snapshot can be added to a pipeline, the pipeline itself must be configured and in a Streaming state. Read Create a Streaming pipeline for details.
To configure a snapshot for a pipeline:
- Log in to the Data Productivity Cloud.
- Click ☰ → Designer.
- Select your project.
- Click the Streaming tab.
- Select the pipeline you want to configure a snapshot for. This will open the pipeline dashboard. The dashboard page shows details of the pipeline configuration, as well as metrics on the progress of streaming and snapshotting.
Note
If the pipeline is not in Streaming status, you may need to further configure the pipeline. Read Manage Streaming pipeline for details.
- On the pipeline dashboard, click the Snapshotting tab. This will show the snapshot queue, which will initially be empty.
- Click Request a snapshot.
- The Select schemas dialog shows all the tables that are part of the pipeline definition. Select the tables you want to snapshot. For each selected table, the dialog will display an SQL query that identifies what will be included in the snapshot. For example:
SELECT * FROM "my_schema"."my_table"
You can edit this query if you want to snapshot only part of the table. See Snapshotting part of a table for details.
Note
You can't snapshot a table that hasn't been included in the pipeline definition.
- Enter a Streaming catch up duration. This is the number of minutes you want to elapse between one queued snapshot request completing and the next one starting. This delay allows the pipeline to perform a certain amount of streaming between snapshot requests, to catch up to the most recent change events. Read Queueing and delaying snapshots for a discussion of how and why to use this feature.
- Click Snapshot n table(s).
The snapshot will begin executing immediately. If you selected more than one table to snapshot, one table snapshot will begin and the other tables will be added to the snapshot queue.
Snapshot management
You manage a pipeline's snapshots from the pipeline dashboard. All snapshot status information is under the Snapshotting tab of the dashboard.
The snapshot queue
Multiple snapshots can be configured for a single pipeline, and they will be queued and performed sequentially. When one snapshot completes, it will be removed from the queue and the next snapshot will begin.
If your snapshot request includes multiple tables, the request will automatically be broken down into single-table snapshot requests that are queued individually. This increases the resilience of the snapshot process, as a disruption (for example, due to network issues) can only ever result in a single table failing to snapshot and having to retry, with the remainder of the queued tables being unaffected.
You can configure a delay between the execution of different requests in the queue. This allows streaming to resume during the delay between snapshots, to catch up to the most recent change events.
To view the queue, click the Snapshotting tab on the pipeline dashboard. The queue is only visible when the pipeline is running.
By default, the queue is sorted into an arbitrary order.
Snapshot events
When a snapshot request starts, completes, or fails, a Snapshot event is shown on the Snapshotting tab of the pipeline dashboard. The event description includes relevant details, such as the number of rows that were captured in a completed snapshot.
You can choose to receive fail event notifications via email. Select the option on the pipeline dashboard to enable email notifications.
Repeating a snapshot
If you want to repeat a snapshot which has already completed, you need to configure a new snapshot with the same details, following the process given above.
Multiple snapshot requests for the same data will result in duplicate rows in your target data, so be sure this is what you want before repeating a snapshot. Be equally aware of this when breaking down a table into multiple requests for snapshotting: poorly written SQL queries could produce requests containing overlapping data, which again will result in duplicated rows in the target.
Note
If you are using streaming simply for table synchronization, then repeating a snapshot has essentially no effect, as it will just overwrite the existing rows in the target.
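To illustrate the overlap risk described above, the following sketch uses the same illustrative table and a hypothetical `year` column (adapt both to your own schema). Queries with mutually exclusive predicates partition the data cleanly; queries whose predicates overlap will snapshot some rows twice:

```sql
-- Mutually exclusive predicates: no row matches both requests,
-- so no duplicate rows are created in the target.
SELECT * FROM "my_schema"."my_table" WHERE year <= 2023
SELECT * FROM "my_schema"."my_table" WHERE year >= 2024

-- Overlapping predicates: rows where year = 2023 match both requests
-- and will be snapshotted twice, duplicating them in the target.
SELECT * FROM "my_schema"."my_table" WHERE year <= 2023
SELECT * FROM "my_schema"."my_table" WHERE year >= 2023
```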
Recovering from a failed snapshot
There are three scenarios in which a snapshot can fail.
- A Streaming pipeline or agent encounters an irrecoverable error or is stopped while a snapshot is in progress.
- Upon restarting the pipeline, the disrupted snapshot will be restarted.
- Any remaining snapshots in the queue will execute in sequence after the completion of the restarted snapshot.
- The pipeline dashboard displays an event for the snapshot failure.
- A snapshot fails while the Streaming pipeline is still running. This can happen if the connection to the source database is temporarily interrupted, or if the SQL statement used to break down the snapshot of a large table is invalid.
- Failed snapshots will not automatically retry. You must manually recreate a failed snapshot for it to execute.
- The Streaming pipeline will continue to execute snapshots that remain in the snapshot queue.
- A persistent failure, such as the source database being unavailable, can cause multiple snapshots to fail, as each one in the queue is triggered in turn.
- The pipeline dashboard displays events for each snapshot failure.
- The table being snapshotted is removed from the pipeline configuration while the snapshot request is still in the queue.
- The snapshot will fail as soon as it attempts to start and will be removed from the queue.
- The following snapshot in the queue will be executed, in sequence.
- A fail event is displayed on the pipeline dashboard page, indicating that the snapshot completed but no table matching the request was identified in the pipeline.
- The snapshot will not be automatically retried.
For further information on why a snapshot failed, you can review the agent logs around the time of the snapshot failure event.
Snapshotting part of a table
Snapshotting a large table can take a long time, which increases the risk of disruption and makes recovery harder. To mitigate this, you can break down the snapshot of especially large tables into multiple discrete requests.
For example, if you have a source table with date column values ranging from 2022 to 2024, your first request could snapshot records for 2024, your second could snapshot 2023, and your final request could snapshot 2022.
Alternatively, you may only want to snapshot part of a given table, and exclude superfluous data. For example, you could create a single request to snapshot records dated 2024, and ignore 2023 and 2022. This gives you full control over which records you choose to snapshot, and improves snapshot efficiency and the usefulness of your target data.
You can break down a snapshot into as many or as few requests as you wish.
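Following the date-range example above, and assuming the illustrative table has a `year` column, the three discrete requests might look like this:

```sql
-- Request 1: snapshot the most recent records first
SELECT * FROM "my_schema"."my_table" WHERE year = 2024
-- Request 2
SELECT * FROM "my_schema"."my_table" WHERE year = 2023
-- Request 3
SELECT * FROM "my_schema"."my_table" WHERE year = 2022
```

Because the three predicates are mutually exclusive, no row is captured by more than one request and no duplicates are created in the target.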
To snapshot just part of a table, you need to edit the SQL query displayed when you select the table while configuring the snapshot.
For example, if your table has a numeric "year" column, and you only want the snapshot to include rows where the year is equal to 2024, you could write the following query:
SELECT * FROM "my_schema"."my_table" WHERE year = 2024
Errors in your query will cause the snapshot to fail, and the error will be shown in the event list on the dashboard.
Warning
This operation assumes knowledge of basic SQL syntax, and incorrect queries will adversely affect your target data. Use extra caution when writing these queries, and always test thoroughly before use, as even valid SQL can produce an unexpected result that causes your pipeline to fail.
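One low-risk way to test a query is to run its predicate as a row count directly against the source database before submitting the snapshot request. This confirms both that the SQL is valid and that the filter matches the number of rows you expect (table and column names follow the earlier example):

```sql
-- Should return the number of rows the snapshot request will capture
SELECT COUNT(*) FROM "my_schema"."my_table" WHERE year = 2024
```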
Adding more tables
At any point after a pipeline has started streaming you can add new source tables to the pipeline configuration, as described in the documentation for each streaming source.
Once the new table has been added to the pipeline configuration and the pipeline has been restarted, you can create a new snapshot for the table by following the process for creating a snapshot given above.
Re-baselining out-of-sync data
If a Streaming pipeline is stopped, either deliberately or through failure, for longer than the source database's change log retention period, there is a high probability that the source and target will become, and remain, out of sync. This means that when streaming is started again, changes may have been missed. In this case, a new baseline will need to be established to retain the synchronisation between source and target.
To re-baseline data, configure a new snapshot of the source database.
This out-of-sync state may also be caused by stopping a pipeline and making a change to the destination configuration, such as changing the file prefix for a cloud storage destination. In this case, you will need to configure a new snapshot to re-baseline and re-load snapshotted data into the new storage destination.