Overview of snapshotting

Snapshotting the source database of a Streaming pipeline allows you to establish an up-to-date view of the database, giving you a baseline from which streaming changes begin.

A database snapshot captures a static view of the database at a certain point in time. Streaming can only capture database change operations that occur after the streaming pipeline is started, which means that no data existing in the source prior to the beginning of streaming will be copied to the target database. If you also require this historical data in the target, use snapshotting to synchronize the target with the source as of the point where streaming started.

The true value of a Streaming pipeline is realized by getting the pipeline into a streaming state as soon as possible. Large initial snapshots can delay the start of streaming by hours or even days, as streaming and snapshotting operations can't be performed in parallel. To avoid this undesirable situation, Data Productivity Cloud takes an approach that requires the pipeline to be configured and actively streaming before snapshotting begins. You can then control the timing of snapshotting to minimize streaming disruption.

To begin configuring your snapshots, read Configuring and managing snapshots.

Note

This feature requires that the streaming agent you are running is version 2.111.0 or later.

To determine what version your agent is, locate the agent on the Agents screen, click the ... menu, and then click Agent details. The version is listed at the bottom of the list of agent parameters.


Scheduling snapshots efficiently

When snapshotting a large number of tables, or a number of large tables, the snapshot can take several hours or even days. While a snapshot is running, streaming is paused for that pipeline. In some cases, the snapshot may run for longer than the retention period of the source database logs, meaning the offset position is lost and the pipeline can't re-enter a streaming state. Matillion's implementation of snapshotting avoids these problems by letting you break a snapshot down into a series of smaller requests, which can be interleaved with regular streaming.

Snapshot requests can be configured at any point after the pipeline has been started. The pipeline will continue to stream until a queued request is due, at which point it will pause streaming to allow the snapshot request to complete. Once all pending snapshot requests have been completed, the pipeline will then continue to stream until the next scheduled request.

The contents of a single snapshot request are completely configurable, and can consist of a single table, several tables, or selected parts of a larger table. If several tables are included in a single request, the request will be broken down into multiple single-table requests for processing purposes.
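
The decomposition of a multi-table request into single-table requests can be pictured with a short sketch. This is a conceptual illustration only; the `SnapshotRequest` class and `decompose` function are hypothetical and not part of the Data Productivity Cloud API:

```python
# Conceptual sketch only: these names are illustrative, not the product API.
from dataclasses import dataclass


@dataclass
class SnapshotRequest:
    tables: list[str]


def decompose(request: SnapshotRequest) -> list[SnapshotRequest]:
    """Break a multi-table request into single-table requests for processing."""
    return [SnapshotRequest(tables=[t]) for t in request.tables]


queued = decompose(SnapshotRequest(tables=["orders", "customers", "invoices"]))
print([r.tables for r in queued])  # [['orders'], ['customers'], ['invoices']]
```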

Advantages of this approach to streaming snapshots are:

  • The snapshot has greater recoverability. If the pipeline is disrupted while a snapshot request is in progress, the snapshot process will resume from the beginning of that specific request, with no effect on requests already completed before the disruption.
  • Because you configure the snapshots after the pipeline has started streaming, you can ensure that your streaming configuration is working as expected before proceeding with the snapshot.
  • Breaking down large tables into smaller snapshot requests helps prevent log staleness, because streaming resumes between snapshot requests and can catch up to the most recent change events.
  • Breaking down large tables gives you flexibility when it comes to snapshotting only the data you require. For example, in a large table you may only need to snapshot data from after a specific cut-off date, and ignore the older data.
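
One way to think about breaking a large table down is as a series of windowed reads. The sketch below, which is an assumption about how you might plan such requests rather than anything the product does for you, splits a date range into smaller windows, each of which could back one snapshot request:

```python
# Hypothetical planning helper: splits a date range into smaller windows,
# each of which could define one snapshot request over a large table.
from datetime import date, timedelta


def date_chunks(start: date, end: date, days: int):
    """Yield [lower, upper) date windows of at most `days` days each."""
    lower = start
    while lower < end:
        upper = min(lower + timedelta(days=days), end)
        yield lower, upper
        lower = upper


# Data before the cut-off date (here, 2023-01-01) is simply never requested.
chunks = list(date_chunks(date(2023, 1, 1), date(2023, 4, 1), days=30))
print(len(chunks))  # 3 windows covering the 90-day range
```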

Queueing and delaying snapshots

If you break a snapshot down into multiple requests, the requests will be queued and performed sequentially. When one snapshot completes, it will be removed from the queue and the next snapshot will begin.

You can configure a delay between the execution of different requests in the queue. The delay allows streaming to be restarted between each snapshot, to catch up to the most recent change events.

If no delay is configured, all queued snapshot requests will be executed consecutively, leaving no time for streaming to occur until all snapshots have been completed. This would increase the risk of log staleness and reduce recoverability, negating the benefit of the queuing approach.
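
The queueing behavior above can be sketched as a simple loop. The function and event names below are placeholders for illustration, not the agent's real implementation; the point is that a positive delay opens a streaming window between consecutive requests, while a zero delay runs them back to back:

```python
# Conceptual sketch of draining a snapshot-request queue with a delay
# between requests. Names are illustrative, not the agent's actual code.
from collections import deque


def drain_queue(requests: deque, delay_seconds: int) -> list[str]:
    """Process queued snapshot requests sequentially. With a positive delay,
    streaming resumes between requests so the pipeline can catch up."""
    log = []
    while requests:
        log.append(f"snapshot:{requests.popleft()}")  # streaming paused here
        if requests and delay_seconds > 0:
            log.append(f"stream:{delay_seconds}s")    # catch-up window
    return log


events = drain_queue(deque(["orders", "customers", "invoices"]), delay_seconds=300)
# With delay_seconds=0, the three snapshots would run consecutively,
# with no streaming until the queue is empty.
```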