Optimize
Optimize the layout of Databricks data with the Optimize orchestration component. You can optionally optimize a subset of data or colocate data by column. If you don't specify colocation, bin-packing optimization is performed.
Bin-packing optimization is idempotent: if the operation is run twice on the same dataset, the second run has no effect. Bin-packing aims to produce data files that are evenly balanced by size on disk, but not necessarily by the number of tuples per file. The two measures are, however, often correlated.
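The idempotence of bin-packing can be illustrated with a toy model. The sketch below is not the algorithm Databricks uses. It assumes a simple first-fit-decreasing strategy, and the function name `bin_pack`, the `target_size` parameter, and the file sizes are all hypothetical. It only shows that compacting files that are already compacted leaves the file sizes unchanged.

```python
from typing import List


def bin_pack(file_sizes: List[int], target_size: int) -> List[int]:
    """Toy first-fit-decreasing compaction (an illustrative assumption,
    not the Databricks implementation). Returns the output file sizes."""
    bins: List[int] = []
    # Place the largest files first.
    for size in sorted(file_sizes, reverse=True):
        for i, used in enumerate(bins):
            # Add to the first output file that still has room.
            if used + size <= target_size:
                bins[i] = used + size
                break
        else:
            # No existing output file has room, so start a new one.
            bins.append(size)
    return sorted(bins, reverse=True)


# Hypothetical file sizes in arbitrary units.
sizes = [70, 20, 50, 30, 10, 60]
once = bin_pack(sizes, target_size=100)
twice = bin_pack(once, target_size=100)
print(once)   # [100, 90, 50]
print(twice)  # [100, 90, 50]
```

In this model no two output files can be merged without exceeding the target size, so the second run has nothing left to do.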
Z-Ordering is not idempotent, but it aims to be an incremental operation. The time taken for Z-Ordering isn't guaranteed to decrease over multiple runs. Z-Ordering aims to produce data files that are evenly balanced by the number of tuples, but not necessarily by size on disk. The two measures are often correlated, but when they are not, the time taken by individual optimization tasks can be skewed.
Properties
Name
= string
A human-readable name for the component.
Catalog
= drop-down
Select a Databricks Unity Catalog. The special value [Environment Default] uses the catalog specified in the Data Productivity Cloud environment setup. The selected catalog determines which schemas (databases) are available in the next parameter.
Schema (Database)
= drop-down
The Databricks schema. The special value [Environment Default] uses the schema defined in the environment. Read Create and manage schemas to learn more.
Table
= drop-down
The table to be optimized. Only one table can be selected per instance of the component.
Partition
= expression editor
A condition on one or more partition columns that selects the subset of data to optimize. The default is none.
Z Order
= column editor
The columns by which to colocate data (Z-Ordering). This list should exclude any partition columns. The default is none, in which case bin-packing optimization is performed.
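The properties above map naturally onto the Databricks SQL `OPTIMIZE` statement, which accepts an optional `WHERE` predicate and an optional `ZORDER BY` column list. The following is a minimal sketch of that mapping. It is an assumption about how the properties relate to SQL, not the SQL the component is documented to generate, and the function name `build_optimize_statement`, its parameters, and the example table and column names are hypothetical.

```python
from typing import Optional, Sequence


def build_optimize_statement(
    table: str,
    partition_predicate: Optional[str] = None,
    z_order_columns: Sequence[str] = (),
    catalog: Optional[str] = None,
    schema: Optional[str] = None,
) -> str:
    """Assemble a Databricks SQL OPTIMIZE statement (hypothetical helper)."""
    # Qualify the table with catalog and schema when they are provided.
    qualified = ".".join(part for part in (catalog, schema, table) if part)
    statement = f"OPTIMIZE {qualified}"
    # Partition property: restrict the optimization to a subset of data.
    if partition_predicate:
        statement += f" WHERE {partition_predicate}"
    # Z Order property: colocate data by these columns.
    # With no columns, the statement performs bin-packing only.
    if z_order_columns:
        statement += " ZORDER BY (" + ", ".join(z_order_columns) + ")"
    return statement


# Hypothetical table, partition column, and Z-Order column.
print(build_optimize_statement(
    "events",
    partition_predicate="event_date >= '2024-01-01'",
    z_order_columns=["user_id"],
    catalog="main",
    schema="analytics",
))
# OPTIMIZE main.analytics.events WHERE event_date >= '2024-01-01' ZORDER BY (user_id)
```

The sketch does not quote or validate identifiers, so it is suitable only for illustrating how the properties fit together.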
Snowflake | Databricks | Amazon Redshift |
---|---|---|
❌ | ✅ | ❌ |