Scaling best practices

The Matillion Data Productivity Cloud executes pipelines via an agent. This works by decomposing your pipelines into tasks, which are then distributed across the instances (nodes) of an agent.

When hosting agents within your VPC or VNet (also known as a Hybrid SaaS solution), you need to right-size your agents to obtain the level of performance and concurrency you need.

This guide provides details of the key considerations.

Note

If you're using our Full SaaS offering, we advise you to contact us to discuss scaling your agent.

Tasks

Tasks are the smallest unit of work an agent can execute and consist of items such as:

A single orchestration component execution.
A specific execution of a transformation pipeline.

Note

Using Designer also generates tasks, for actions such as running a sample operation or loading a list of tables or columns. However, these design-time tasks are not limited in any way.

Agent instances

In the Matillion Data Productivity Cloud, work is executed via agents, and each agent is made up of agent instances. In practice, this is implemented using containers. An agent is a named collection of containers, with each container being known as an agent instance.

When pipeline tasks are sent to an agent, they will be sent to any agent instance that has capacity. If there is no current capacity, then the pipeline task will be queued. When an instance subsequently has capacity, the pipeline task will be sent to that instance and be executed.

Note

Agents should be configured with a minimum of 2 agent instances to ensure the automatic upgrade process does not cause a service outage—since agent instances will be upgraded in a staggered fashion.

Agent instance capacity

To protect the stability of the agent instances under load, an agent instance won't take on a new task if:

The CPU load is deemed too high.
The available free memory is too low.
The agent instance is already executing 20 tasks.

Tasks that can't execute because there is no available agent instance will queue until an agent instance becomes available.

Scaling for load

Horizontal scaling

Each Data Productivity Cloud agent instance is limited to 20 concurrent tasks, regardless of the amount of resources assigned to the agent instance. As a result, a high level of concurrency in your pipelines can cause tasks to be queued, which makes the overall pipeline execution take longer.

Horizontal scaling involves adding more agent instances. By adding more instances, you increase the number of tasks that can run in parallel—two agent instances allow 40 concurrent tasks to be executed, and so on—reducing task queuing.
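As a rough illustration of the arithmetic above, the following sketch estimates how many agent instances a given peak concurrency calls for. The function name instances_needed and both variables are hypothetical. The 20-task limit and the minimum of 2 instances are taken from this guide. The sketch assumes the task-count limit is the only constraint, so instances under heavy CPU or memory load may accept fewer tasks than it suggests.

```shell
# Limit of concurrent tasks per agent instance, as described in this guide.
TASKS_PER_INSTANCE=20
# Recommended minimum number of agent instances, as described in this guide.
MIN_INSTANCES=2

# instances_needed <peak_concurrent_tasks>
# Prints the peak divided by TASKS_PER_INSTANCE, rounded up,
# and never less than MIN_INSTANCES.
instances_needed() {
  local peak="$1"
  local needed=$(( (peak + TASKS_PER_INSTANCE - 1) / TASKS_PER_INSTANCE ))
  if [ "$needed" -lt "$MIN_INSTANCES" ]; then
    needed=$MIN_INSTANCES
  fi
  echo "$needed"
}
```

For example, instances_needed 45 prints 3, because two instances cover only 40 concurrent tasks and the remaining 5 would otherwise queue.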

The method for adding agent instances varies depending on your container orchestrator, so follow the scaling instructions for the orchestrator you use.

Cost implications

Adding more agent instances does not result in extra charges from Matillion. Our credit charges are based on task execution time. Task queuing time does not consume credits.

However, running extra containers (i.e. agent instances) is likely to increase the infrastructure cost from your container orchestrator (e.g. AWS Fargate).

As such, choosing the number of agent instances means balancing the performance you want against your infrastructure cost and budget.

Transformation tasks - low load

Since transformation tasks generate SQL that is then executed by your cloud data warehouse, these tasks do not require a large amount of CPU time or memory on the agent instances. With this in mind, if your workload is "transformation heavy", a smaller agent (with a low number of agent instances) will likely suffice.

Data ingestion and scripting - high load

Components that move or ingest data, as well as those that run custom scripts such as Python or Bash, place a high CPU and memory burden on agent instances. If your workloads involve high-volume data ingestion or custom scripting, you'll need to run a larger number of agent instances.

Further considerations

Scale up delay

Once you have edited the agent service to start more agent instances, there is a delay of approximately 4 minutes for the agent instances to start and dial back to the Matillion Data Productivity Cloud to begin accepting tasks.

Scaling AWS agents from within a pipeline

If using AWS ECS Fargate to run your agents, you can scale up and down within a pipeline. The AWS command line tools are available within the Bash Script component and this can be used to edit the ECS service to change the desired number of instances.

Bash scripts executed in Hybrid SaaS environments obtain the IAM permissions assigned to the agent. If these permissions include amending an ECS Fargate service, then a script can be used to change the number of agent instances.

The following script can be used within a Bash Script orchestration component to do this:

###
# This script alters the desired task count for a Matillion agent.
# Set the variables below to the values shown in the AWS ECS console.
# Note: new agent instances usually take around 4 minutes to be available for task processing.
###

# Replace each placeholder, keeping the quotes.
AWS_REGION="<AWS region e.g. eu-west-1>"
AWS_ECS_SERVICE="<ECS Fargate Service Name>"
AWS_ECS_CLUSTER="<ECS Cluster Name>"
DESIRED_AGENT_COUNT=2

aws ecs update-service --service "$AWS_ECS_SERVICE" --desired-count "$DESIRED_AGENT_COUNT" --region "$AWS_REGION" --cluster "$AWS_ECS_CLUSTER"

This script needs the following permission in the IAM role attached to the Task Definition as the Task Role for the agent:

ecs:UpdateService

You must add this permission before running the script, as it's not added by default.
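Because scaling from a pipeline changes a shared service, it can help to validate the requested count before calling AWS. The following is a minimal sketch of such a guard, not part of the Matillion product. The function name scale_agent and the variable MIN_AGENT_COUNT are hypothetical, and the floor of 2 simply mirrors the recommended minimum number of agent instances in this guide.

```shell
# Hypothetical floor, mirroring the recommended minimum of 2 agent instances.
MIN_AGENT_COUNT=2

# scale_agent <region> <cluster> <service> <desired_count>
# Validates the desired count, then asks ECS to update the service.
scale_agent() {
  local region="$1" cluster="$2" service="$3" desired="$4"

  # Reject anything that is not a whole number.
  case "$desired" in
    '' | *[!0-9]*)
      echo "desired count must be a whole number, got '$desired'" >&2
      return 1
      ;;
  esac

  # Refuse to drop below the floor, so staggered upgrades keep one instance up.
  if [ "$desired" -lt "$MIN_AGENT_COUNT" ]; then
    echo "refusing to scale below $MIN_AGENT_COUNT agent instances" >&2
    return 1
  fi

  aws ecs update-service \
    --region "$region" \
    --cluster "$cluster" \
    --service "$service" \
    --desired-count "$desired"
}
```

A call such as scale_agent "$AWS_REGION" "$AWS_ECS_CLUSTER" "$AWS_ECS_SERVICE" 4 would then replace the bare aws command. Adjust or remove the floor if your environment deliberately runs fewer instances.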

Note

New agent instances take around 4 minutes to become available. Please consider this when scheduling your scaling events.
Be aware that the resize affects every user and pipeline relying on the agent, not just the pipeline that triggers it.
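Rather than waiting a fixed 4 minutes after a scale-up, a pipeline could poll ECS until the extra containers are running. The sketch below is one possible approach under stated assumptions. The function names, the default timeout of 600 seconds, and the default polling interval of 15 seconds are all hypothetical. It also assumes the agent's task role has the ecs:DescribeServices permission, which is needed in addition to ecs:UpdateService. Note that the ECS running count only shows that the containers have started. An instance may still need a short time to connect to the Data Productivity Cloud before it accepts tasks.

```shell
# running_agent_count <region> <cluster> <service>
# Prints the number of running ECS tasks for the service.
running_agent_count() {
  aws ecs describe-services \
    --region "$1" --cluster "$2" --services "$3" \
    --query 'services[0].runningCount' --output text
}

# wait_for_agents <region> <cluster> <service> <desired> [timeout_s] [interval_s]
# Polls until at least <desired> tasks are running or the timeout passes.
# interval_s must be at least 1, otherwise the loop never times out.
wait_for_agents() {
  local region="$1" cluster="$2" service="$3" desired="$4"
  local timeout="${5:-600}" interval="${6:-15}"
  local waited=0 running

  while [ "$waited" -le "$timeout" ]; do
    running=$(running_agent_count "$region" "$cluster" "$service")
    if [ "$running" -ge "$desired" ]; then
      echo "$running agent instances running"
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done

  echo "timed out after ${timeout}s waiting for $desired agent instances" >&2
  return 1
}
```

Calling wait_for_agents "$AWS_REGION" "$AWS_ECS_CLUSTER" "$AWS_ECS_SERVICE" "$DESIRED_AGENT_COUNT" straight after the update-service command lets later components in the pipeline start only once the extra capacity exists. The AWS CLI also provides aws ecs wait services-stable, which may be a simpler alternative if its built-in polling behaviour suits your pipeline.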

Scaling Snowflake

If you're not seeing the performance expected—even when using a Matillion Full SaaS solution of sufficient size for your workloads—it might be that your Snowflake warehouse defined in the Matillion environment needs scaling. Read Monitoring Warehouse Load to learn more.

If a warehouse is overloaded with many parallel queries, the queries will queue. A long queue time shown in the Snowflake graph indicates that pipeline performance will benefit from scaling the warehouse.

Matillion recommends enabling multi-cluster warehousing if this is available in your Snowflake account. Using this mechanism, Snowflake will horizontally scale the warehouse by starting and stopping instances of the warehouse automatically. Matillion has found that this improves concurrent performance more than simply increasing the size of the warehouse.