
Snowflake optimization for Matillion ETL

This guide outlines best practices for configuring and managing Snowflake virtual warehouses to achieve optimal performance and throughput with Matillion ETL jobs. By understanding how Matillion ETL leverages Snowflake compute, you can build data pipelines that are scalable, efficient, and cost-effective.


How Matillion ETL uses Snowflake warehouses

A Snowflake virtual warehouse is a cluster of compute resources responsible for executing all data loading, transformation, and query tasks. In Matillion ETL, every SQL statement generated by a component, from a simple Filter to a complex Join, is pushed down to and executed in a Snowflake warehouse.

Key concepts:

  • Separation of compute and storage: In Snowflake, compute (warehouses) is independent of storage. This means you can scale compute resources up or down based on performance needs without impacting the underlying data.
  • Matillion ETL as the control plane: Matillion ETL orchestrates the jobs, while the Snowflake warehouse performs the actual processing. Misconfigured warehouses are a common source of performance bottlenecks in Matillion ETL.

Snowflake configuration

Performance optimization begins with your Snowflake account. These settings, managed by a Snowflake administrator, provide the foundation for efficient operation.

Creating and optimizing your warehouse

Begin by setting up a dedicated warehouse for your Matillion ETL workloads. Selecting the appropriate size requires balancing performance and cost. Snowflake uses a t-shirt sizing model (for example, X-Small, Small, Medium), where each step up doubles both compute capacity and credit usage.

Example: Creating a warehouse via SQL

This statement creates a new, medium-sized warehouse that will auto-suspend after 5 minutes of inactivity and resume automatically when a query is submitted.

CREATE OR REPLACE WAREHOUSE MATILLION_ETL_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 300 -- Time in seconds; 300 seconds = 5 minutes
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE
  COMMENT = 'Dedicated warehouse for Matillion ETL workloads';

For more information, read the Snowflake documentation on creating warehouses.

Configuring the warehouse for concurrency and queuing

When a warehouse receives more queries than it can handle at once, the extra queries are placed in a queue. To reduce delays, you can enable multi-cluster warehouses, which allow Snowflake to automatically add additional clusters to handle increased load.

  • Multi-cluster warehouses: Automatically scale out by adding more clusters of the same size as demand increases. This is ideal for handling concurrent Matillion ETL jobs.
  • Scaling policies:
    • Standard: Adds new clusters quickly to reduce queuing time. Best for unpredictable, high-concurrency workloads.
    • Economy: Adds new clusters gradually to save credits. Best suited for workloads that can tolerate occasional queuing.

Example: Altering a warehouse to be multi-cluster

This statement modifies the warehouse to scale between 1 and 4 clusters using the "Standard" scaling policy.

ALTER WAREHOUSE MATILLION_ETL_WH SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';
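
You can confirm the new configuration at any time by inspecting the warehouse; the output includes the size, state, cluster counts, and scaling policy, among other details.

-- Check the warehouse size, state, cluster counts, and scaling policy
SHOW WAREHOUSES LIKE 'MATILLION_ETL_WH';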

For more information, read the Snowflake documentation on altering warehouses.

Setting permissions

Your Matillion ETL role in Snowflake requires specific permissions to interact with the warehouse. At a minimum, the role needs USAGE to use the warehouse, and OPERATE to resume or suspend it.

Example: Granting warehouse privileges

Replace MATILLION_ROLE with the actual role your Matillion ETL instance uses.

GRANT USAGE ON WAREHOUSE MATILLION_ETL_WH TO ROLE MATILLION_ROLE;
GRANT OPERATE ON WAREHOUSE MATILLION_ETL_WH TO ROLE MATILLION_ROLE;
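
To verify that the role has the privileges it needs, you can list the grants on the warehouse:

-- List all privileges granted on the warehouse, including USAGE and OPERATE
SHOW GRANTS ON WAREHOUSE MATILLION_ETL_WH;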

For a complete list of required permissions, read Snowflake role privileges with Matillion ETL.


Matillion ETL configuration

Once Snowflake is configured, you must configure your Matillion ETL environment to use the warehouse resources correctly.

Setting the default warehouse in your environment

Each Matillion ETL environment should be configured to point to a default Snowflake warehouse. All jobs within that project and environment will use this warehouse unless explicitly overridden within the job itself.

Follow these steps in your Matillion ETL instance:

  1. Click Project in the top left menu.
  2. Select Environments.
  3. Choose the environment you want to configure, for example "development" or "production".
  4. In the Data Warehouse properties, click the Default Warehouse drop-down menu, and select the warehouse you created. For example, MATILLION_ETL_WH.
  5. Click Test to validate the connection and permissions. If the test succeeds, a green tick will appear.
  6. Click OK to save.

For more information, read Environments.

Configuring Matillion-side concurrency

If you're a Matillion ETL Enterprise customer, you can control the number of concurrent connections the instance opens to Snowflake. This setting should be aligned with your warehouse's capacity.

Concurrent Connections is located in the Environments panel. This setting specifies the maximum number of parallel statements that Matillion ETL will attempt to execute in Snowflake. If your Snowflake warehouse is configured for high concurrency (for example, a large or multi-cluster warehouse), you can increase this value. If you notice queries queuing in Snowflake, consider lowering the value or increasing your warehouse's capacity. It's best to start with the default setting and adjust based on monitoring results. For more information, read Job concurrency.


Dynamic warehouse usage in Matillion ETL jobs

For workloads with varying computational needs, you can change warehouse settings dynamically within an orchestration job. The most common pattern is to scale a warehouse up before a large transformation, and scale it back down upon completion to manage costs. This is accomplished using the Alter Warehouse component.

Example: A job that temporarily resizes a warehouse

This orchestration job includes the following steps (a SQL sketch of the underlying resize statements follows the list):

  1. Start.
  2. Alter warehouse (scale up): Increases the warehouse size to X-LARGE.
  3. Run transformation job: Executes a transformation job that requires high compute power.
  4. Alter warehouse (scale down): Decreases the warehouse size back to MEDIUM.
  5. End Success/End Failure.
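
Under the hood, steps 2 and 4 come down to warehouse resize statements; the Alter Warehouse component issues the equivalent SQL for you. The following is a minimal sketch, assuming the warehouse created earlier (WAIT_FOR_COMPLETION is optional and simply blocks until the additional compute is provisioned):

-- Step 2: scale up before the heavy transformation and wait for the resize to finish
ALTER WAREHOUSE MATILLION_ETL_WH SET WAREHOUSE_SIZE = 'XLARGE' WAIT_FOR_COMPLETION = TRUE;

-- Step 4: scale back down once the transformation has finished
ALTER WAREHOUSE MATILLION_ETL_WH SET WAREHOUSE_SIZE = 'MEDIUM';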

Using the Alter Warehouse component in your Matillion ETL instance

  1. Drag the Alter Warehouse component onto the orchestration canvas.
  2. In the Properties tab, select the Warehouse you want to alter.
  3. Set the desired properties. To resize, choose a new Warehouse Size. You can also use this component to suspend, resume, or change multi-cluster settings.
  4. Link this component in your Matillion ETL job flow as needed.

For more information, read Alter Warehouse.

Cost warning: The 60-second rule

Before using this pattern, keep in mind Snowflake’s minimum billing policy:

  • Minimum Billing: Each time a warehouse is resumed or resized, Snowflake charges for at least 60 seconds of usage.
  • When to use this pattern: This approach is most cost-effective for transformation jobs that run several minutes or longer, where the performance benefits of scaling up outweigh the minimum billing cost.
  • When to avoid this pattern: For short, resource-intensive queries (for example, under two minutes), resizing to a larger warehouse may be more expensive than simply running on a smaller, consistently-sized warehouse. Always consider the expected runtime before applying this pattern.

Cost management

Effectively managing your Snowflake warehouses is key to controlling your Snowflake credit consumption.

  • Credits and Sizing: Snowflake charges based on credits consumed, with per-second billing (minimum of 60 seconds). Each warehouse size has a different credit-per-hour rate:
    • X-Small: 1 credit/hour.
    • Small: 2 credits/hour.
    • Medium: 4 credits/hour.
    • Large: 8 credits/hour.
    • ...and so on, with each step up doubling the credit rate. For example, a Medium warehouse running for 30 minutes consumes 2 credits.
  • Auto-Suspend is Critical: The AUTO_SUSPEND parameter is your most important cost-control tool. A suspended warehouse consumes zero credits. For Matillion ETL workloads, a low suspend time (for example, 1-5 minutes) is highly recommended to avoid paying for idle compute time.
  • Dynamic Sizing for Cost Efficiency: The "scale up, scale down" pattern is a powerful cost-saving strategy for long-running jobs. Remember that each RESIZE or RESUME operation incurs a 60-second minimum bill at the new warehouse size. Ensure your transformation's duration justifies this minimum cost before implementing the pattern.

For more information about Snowflake costs, read Understanding overall cost.
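
To see what the warehouse is actually costing you, query Snowflake's metering history. The following is a minimal sketch, assuming your role can read the SNOWFLAKE.ACCOUNT_USAGE schema (data in this view can lag by a few hours):

-- Credits consumed by the Matillion warehouse over the last 7 days
SELECT
  warehouse_name,
  DATE_TRUNC('day', start_time) AS usage_day,
  SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE warehouse_name = 'MATILLION_ETL_WH'
  AND start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY usage_day;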


Monitoring performance

To identify bottlenecks, you need to monitor how your Matillion ETL jobs are performing on your Snowflake warehouse:

  • Use Snowflake's query history: The best place to start is the Activity > Query History page in Snowflake's Snowsight UI.
  • Key Metrics to watch:
    • Queued Overload Time: The amount of time a query waits for warehouse resources to become available. If this value is consistently high, it indicates your warehouse is undersized or requires additional clusters to handle concurrent workloads.
    • Execution Time: The duration a query spends actively running. Consistently long execution times may suggest that a larger warehouse is needed or that the query could be optimized. The example query after this list shows how to retrieve both metrics in bulk.
  • Correlating Matillion tasks with Snowflake queries:
    • In Matillion ETL, find the timestamp of a long-running component in the Tasks panel.
    • In Snowflake's Query History, filter by your MATILLION_ETL_WH warehouse and the user or role that Matillion ETL connects with.
    • Find the query that corresponds to the timestamp from Matillion. The SQL text will match the code generated by the Matillion component. This allows you to pinpoint exactly which operations are causing delays or high costs.
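
The same metrics can be pulled in bulk from the ACCOUNT_USAGE.QUERY_HISTORY view. The following is a minimal sketch, assuming your role can read the SNOWFLAKE database; times are reported in milliseconds, and the view can lag behind real time:

-- Most queued and slowest queries on the Matillion warehouse over the last 24 hours
SELECT
  query_id,
  user_name,
  warehouse_name,
  queued_overload_time,              -- ms spent waiting for warehouse capacity
  execution_time,                    -- ms spent actively executing
  LEFT(query_text, 100) AS query_snippet
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'MATILLION_ETL_WH'
  AND start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY queued_overload_time DESC, execution_time DESC
LIMIT 20;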

Advanced: Workload management

For more complex environments, isolating different types of workloads onto separate warehouses can prevent resource contention and improve predictability.

The concept

A long-running, resource-intensive transformation job shouldn't block dozens of smaller, faster data loading jobs. By assigning each workload to its own warehouse, you ensure they don't compete for compute resources.

Example setup

  • LOADING_WH: A SMALL, multi-cluster warehouse (1–5 clusters) optimized for high-concurrency ingestion tasks, such as loading from S3 or running multiple Database Query components in parallel.
  • TRANSFORM_WH: A LARGE, single-cluster warehouse designed for heavy data transformations (for example, building fact and dimension tables) that require significant memory and compute power but occur less frequently. A SQL sketch for creating both warehouses follows this list.
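
The following is a minimal sketch of how these two warehouses could be created, using the names and sizes above. Multi-cluster settings require a Snowflake edition that supports them, and the sizes, cluster counts, and suspend times should be tuned to your own workload:

-- High-concurrency, multi-cluster warehouse for ingestion jobs
CREATE OR REPLACE WAREHOUSE LOADING_WH
  WAREHOUSE_SIZE = 'SMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 5
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  COMMENT = 'Matillion ETL loading workloads';

-- Larger, single-cluster warehouse for heavy transformations
CREATE OR REPLACE WAREHOUSE TRANSFORM_WH
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE
  COMMENT = 'Matillion ETL transformation workloads';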

How to implement in Matillion ETL

Create separate Matillion ETL environments for loading and transformation, each with its own default warehouse. Alternatively, you can use the Alter Warehouse component within a job to switch the session to the appropriate warehouse before running a specific set of tasks.


Considerations for Matillion ETL clusters

Running Matillion ETL in a high-availability (HA) cluster requires additional planning for your Snowflake warehouse configuration to handle the increased load.

  • Concurrency is Multiplied: The Concurrent Connections setting in a Matillion ETL environment is a per-node limit, so the total potential load on your Snowflake warehouse is this value multiplied by the number of nodes in your cluster (for example, 8 concurrent connections on a 2-node cluster can generate 16 concurrent queries).
  • Multi-Cluster Warehouses are Essential: To handle the multiplied load, using a multi-cluster warehouse is highly recommended. Ensure your warehouse's MAX_CLUSTER_COUNT is high enough to absorb the peak load from all Matillion ETL nodes combined to prevent queuing. A STANDARD scaling policy is often best for these environments to ensure resources are added quickly.
  • Workload Management is More Critical: The risk of a single warehouse becoming a bottleneck increases with a cluster. The advanced strategy of using separate warehouses for different workloads becomes even more important for maintaining performance and preventing job contention.

For more information about Matillion ETL clusters, read Designing a job for a high availability cluster and HA cluster.