Vacuum
Vacuum is an orchestration component that performs a vacuum operation on a list of tables. Vacuum is a housekeeping task. On Amazon Redshift, it physically reorganizes table data according to its sort key and reclaims space left over from deleted rows. On Databricks, it removes data files that are no longer referenced by the table and are older than the retention period. Vacuum is almost always used at the end of an orchestration pipeline.
For more information about the vacuum process, read:
- Databricks VACUUM documentation.
- AWS VACUUM documentation.
Properties
Databricks
Name
= string
A human-readable name for the component.
Catalog
= drop-down
Select a Databricks Unity Catalog. The special value, [Environment Default], will use the catalog specified in the Data Productivity Cloud environment setup. Selecting a catalog will determine which schemas (databases) are available in the next parameter.
Schema (Database)
= drop-down
The Databricks schema. The special value, [Environment Default], will use the schema defined in the environment. Read Create and manage schemas to learn more.
Tables to Vacuum
= dual listbox
Select which tables to vacuum.
Retention Period
= integer
The retention threshold. The default is 7, with the unit specified in Retention Unit.
Retention Unit
= drop-down
Select the unit of the Retention Period. Options are Day, Hour, or Week. The default is Day.
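To make the relationship between Retention Period and Retention Unit concrete, here is a minimal Python sketch that converts the two values into hours and builds a Databricks `VACUUM ... RETAIN n HOURS` statement. The SQL the component actually issues is not documented here, so treat this as an assumption; the function names and the `main`, `sales`, and `orders` identifiers are hypothetical.

```python
# Hypothetical helpers for illustration only; they are not part of the component.
HOURS_PER_UNIT = {"Hour": 1, "Day": 24, "Week": 168}


def retention_hours(period: int = 7, unit: str = "Day") -> int:
    """Convert a Retention Period and Retention Unit into a number of hours."""
    if period < 0:
        raise ValueError("Retention Period must not be negative")
    if unit not in HOURS_PER_UNIT:
        raise ValueError(f"Unknown Retention Unit: {unit!r}")
    return period * HOURS_PER_UNIT[unit]


def quote(identifier: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def databricks_vacuum_sql(catalog: str, schema: str, table: str,
                          period: int = 7, unit: str = "Day") -> str:
    """Build a VACUUM statement (assumed form) for one table."""
    name = ".".join(quote(part) for part in (catalog, schema, table))
    return f"VACUUM {name} RETAIN {retention_hours(period, unit)} HOURS"


# With the defaults (7 and Day), the retention threshold is 168 hours.
print(databricks_vacuum_sql("main", "sales", "orders"))
```

With the default values, the sketch retains 168 hours (7 days) of history; a Retention Period of 2 with a Retention Unit of Week would retain 336 hours.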
Amazon Redshift
Name
= string
A human-readable name for the component.
Schema
= drop-down
Select the table schema. The special value, [Environment Default], will use the schema defined in the environment. For more information on using multiple schemas, read Schemas.
Tables to Vacuum
= dual listbox
The tables to vacuum.
Only one vacuum can run at any one time across an entire Amazon Redshift cluster, so a vacuum may fail because of concurrent workloads. This is usually harmless if the same tables will be vacuumed again on the next run of the pipeline. In that case, consider joining the "Failure" link of the component to an End Success component so that a failed vacuum does not fail the whole pipeline.
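The idea of tolerating a failed vacuum can be sketched outside the pipeline designer as well. The Python below is only an analogy for joining the "Failure" link to End Success: it records each failure and carries on rather than aborting. The `run_vacuum` callable and the table names are hypothetical stand-ins for whatever actually executes the statement.

```python
from typing import Callable, Iterable, List, Tuple


def vacuum_tolerating_failures(
    tables: Iterable[str],
    run_vacuum: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Vacuum each table, recording failures instead of raising them."""
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for table in tables:
        try:
            run_vacuum(table)
        except Exception as error:  # a failed vacuum is treated as non-fatal
            failed.append((table, str(error)))
        else:
            succeeded.append(table)
    return succeeded, failed


def fake_run_vacuum(table: str) -> None:
    """Hypothetical runner that simulates a clash with another vacuum."""
    if table == "orders":
        raise RuntimeError("another vacuum is already running")


done, skipped = vacuum_tolerating_failures(["customers", "orders"], fake_run_vacuum)
print("vacuumed:", done)
print("left for the next run:", skipped)
```

This approach only makes sense when, as noted above, the skipped tables will be vacuumed again on a later run.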
Vacuum Options
= drop-down
The component reclaims disk space occupied by deleted rows in a table, using the method selected here:
- None: Runs the default vacuum operation. This is equivalent to FULL in the current AWS implementation.
- FULL: Sorts the table and reclaims disk space. If the target table is already more than 95% sorted, the sort is skipped, which makes this equivalent to DELETE ONLY.
- SORT ONLY: Sorts the table but does not reclaim disk space. This is quick, but the space occupied by deleted rows is not recovered.
- DELETE ONLY: Reclaims disk space without sorting the table, and is consequently quicker than the methods that sort.
- REINDEX: Analyzes interleaved sort keys, then performs a FULL vacuum.
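The options above correspond to keywords of the Amazon Redshift `VACUUM` command. As a rough sketch, the Python below maps each drop-down value to a statement of the form `VACUUM [option] table`. How the component builds its SQL internally is an assumption, and the function name and the `public` and `orders` identifiers are hypothetical.

```python
# Drop-down value -> keyword placed after VACUUM (empty means the default).
REDSHIFT_VACUUM_KEYWORDS = {
    "None": "",
    "FULL": "FULL",
    "SORT ONLY": "SORT ONLY",
    "DELETE ONLY": "DELETE ONLY",
    "REINDEX": "REINDEX",
}


def quote(identifier: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def redshift_vacuum_sql(schema: str, table: str, option: str = "None") -> str:
    """Build a VACUUM statement (assumed form) for one table and option."""
    if option not in REDSHIFT_VACUUM_KEYWORDS:
        raise ValueError(f"Unknown vacuum option: {option!r}")
    target = f"{quote(schema)}.{quote(table)}"
    parts = ["VACUUM", REDSHIFT_VACUUM_KEYWORDS[option], target]
    # Drop the empty keyword so "None" yields a plain VACUUM statement.
    return " ".join(part for part in parts if part)


for option in REDSHIFT_VACUUM_KEYWORDS:
    print(redshift_vacuum_sql("public", "orders", option))
```

Because "None" maps to an empty keyword, it produces a plain `VACUUM` statement, which Redshift treats as a full vacuum.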
Snowflake | Databricks | Amazon Redshift |
---|---|---|
❌ | ✅ | ✅ |