Snapshots
A database snapshot captures a static view of your database at a specific point in time. For streaming pipelines, a snapshot establishes an up-to-date view of the database before streaming begins, giving you a baseline from which to stream changes.
All source connectors have two snapshot options: initial snapshot and on-demand snapshot, described in detail below. You will select which of these options to use when you create the streaming pipeline.
Initial snapshot
The initial snapshot option takes a snapshot of the specified tables when the pipeline is started. The snapshot captures the current state of the data in those tables at the beginning of pipeline execution, so enabling this option ensures that the pipeline starts with the most recent data available.
- An initial snapshot only impacts the pipeline during its first startup.
- A pipeline is considered to be in its first startup under the following conditions:
  - The target location is empty.
  - The last-known location in the database logs is unknown or invalid.
- If initial snapshots are enabled, the connector will fetch all historical data from the configured tables. During the initial snapshot process, the pipeline will have a SNAPSHOTTING status. If the pipeline is stopped before the snapshot concludes, the initial snapshot will start again when the pipeline is restarted.
- Once the initial snapshot is finished, the pipeline transitions to a STREAMING status. After this point, any changes that occur in the database will be streamed, ensuring that the target remains synchronized with the source database.
- If the initial snapshot option is disabled, the pipeline will immediately have a STREAMING status when it's started. Only changes made to the database from that point on will be streamed to the target.
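The startup behavior described above can be sketched as a small decision function. This is only an illustrative model, not the product's implementation: `PipelineState`, `is_first_startup`, and `initial_status` are hypothetical names, and the sketch assumes that either of the two listed conditions is enough to count as a first startup, which the text does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    SNAPSHOTTING = "SNAPSHOTTING"
    STREAMING = "STREAMING"


@dataclass
class PipelineState:
    # Hypothetical fields used only for this sketch.
    target_is_empty: bool
    last_log_position: Optional[str]  # None stands for "unknown or invalid"


def is_first_startup(state: PipelineState) -> bool:
    # Assumption: either condition alone marks a first startup.
    return state.target_is_empty or state.last_log_position is None


def initial_status(state: PipelineState, initial_snapshot_enabled: bool) -> Status:
    # An initial snapshot only runs when the option is enabled
    # and the pipeline is in its first startup.
    if initial_snapshot_enabled and is_first_startup(state):
        return Status.SNAPSHOTTING
    # Otherwise the pipeline streams changes from this point on.
    return Status.STREAMING


if __name__ == "__main__":
    fresh = PipelineState(target_is_empty=True, last_log_position=None)
    print(initial_status(fresh, initial_snapshot_enabled=True).value)
    print(initial_status(fresh, initial_snapshot_enabled=False).value)
```

Under this model, a pipeline stopped partway through its initial snapshot has no valid log position yet, so it enters SNAPSHOTTING again on restart, which matches the restart behavior described above.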
On-demand snapshot
The on-demand snapshot option lets you manually trigger snapshots for specific tables in the pipeline. This means that new tables can be added to the capture list after the pipeline's initial startup, and snapshots for those tables can be triggered as needed. On-demand snapshots are useful when you want to capture specific points in time or when you need to dynamically expand the set of tables being captured without losing historical data.
- Enabling on-demand snapshots lets you trigger snapshots for newly added tables in the pipeline. This ensures that historical data from the newly added tables is captured without any loss.
- While an on-demand snapshot is in progress, the pipeline will display a SNAPSHOTTING status. If the pipeline is stopped during the snapshot process, the snapshot will resume on pipeline restart.
- After the on-demand snapshot is finished, the pipeline returns to a STREAMING status. At this point, all tables in the target location will be synchronized with the source.
- To enable on-demand snapshots, you must configure a signal table in the source database. The connector tracks this table and automatically triggers a snapshot when the table receives a signal.
Note
- An on-demand snapshot cannot be triggered on tables that don't have a primary key.
- Avoid schema changes while a pipeline has a SNAPSHOTTING status. A schema change during a snapshot causes the pipeline to fail.
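To make the signal table mechanism more concrete, here is a minimal sketch of writing a snapshot request into such a table. The actual signal table schema and signal format are not described in this section, so everything specific here is an assumption: the table name `pipeline_signal`, its `id`, `type`, and `data` columns, the `execute-snapshot` signal type, and the `request_snapshot` helper are all hypothetical. SQLite is used only so the example runs on its own. The sketch also checks for a primary key first, reflecting the restriction in the note above.

```python
import json
import sqlite3
import uuid


def has_primary_key(conn: sqlite3.Connection, table: str) -> bool:
    # SQLite-specific check: PRAGMA table_info reports pk > 0
    # for every column that is part of the primary key.
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return any(row[5] > 0 for row in rows)


def request_snapshot(conn: sqlite3.Connection, tables: list) -> str:
    # Tables without a primary key cannot be snapshotted on demand.
    missing = [t for t in tables if not has_primary_key(conn, t)]
    if missing:
        raise ValueError(f"no primary key on: {', '.join(missing)}")
    signal_id = str(uuid.uuid4())
    # Hypothetical signal row; the real column names and payload may differ.
    conn.execute(
        "INSERT INTO pipeline_signal (id, type, data) VALUES (?, ?, ?)",
        (signal_id, "execute-snapshot", json.dumps({"tables": tables})),
    )
    conn.commit()
    return signal_id


# Stand-in source database with a hypothetical signal table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipeline_signal (id TEXT PRIMARY KEY, type TEXT NOT NULL, data TEXT)"
)
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE audit_log (message TEXT)")  # no primary key

if __name__ == "__main__":
    print(request_snapshot(conn, ["orders"]))
```

In this sketch, calling `request_snapshot(conn, ["audit_log"])` raises a `ValueError` because `audit_log` has no primary key. Consult your connector's documentation for the real signal table layout before adapting anything like this.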