Performance monitor
The Performance tab in the lower-right panel displays important performance metrics for your Matillion ETL instance. These metrics provide valuable real-time insight into system performance, enabling more informed decision-making during specific events.
Supported versions
The performance monitor is automatically enabled in versions 1.74 and above. You can turn off this feature by clicking Admin → Manage Optional Features and unticking the corresponding checkbox. Click OK and accept the server restart.
You can also enable this feature from version 1.68.13 and later by reading this Knowledge Base article: How To: Enable performance monitor tab in the bottom right task panel.
Usage
The Latest column in the Performance tab shows the latest recorded value of each metric. Performance metrics are sampled and updated in this column every few seconds.
Performance metrics are collected per node. Customers with high availability (HA) instances will see different values depending on which node they have connected to.
Click View Analytics to open a panel showing a graph of the performance history for the selected metric. By default, this graph shows the last five minutes of performance data. Use the drop-down menu in the upper-right of the graph to change this period to last hour or last three hours.
Use the Toggle metrics... drop-down to add multiple metrics to the same graph. For example, you could select both "METL heap" and "METL non-heap" to show both values as differently colored lines on a single graph.
Warning
Excessive sampling of performance metrics by multiple users can itself affect performance. Therefore, in situations where large numbers of users are involved, we recommend using role-based permissions to limit which users can access the Request Performance Metrics feature. Read Groups and Permissions for information on how to change this permission. By default, access to this feature is granted to all users.
Metrics
METL uptime
The uptime for the Matillion ETL instance, displayed as years:weeks:hours:minutes:seconds. This is the length of time since Java (i.e. Tomcat) was started on this instance.
CPU usage
Displays the average CPU usage across all CPU cores. 100% represents 100% usage of all CPU cores.
We would expect this to have a low value while the Matillion ETL instance is idle; to have a high value when performing calculation-intensive activities such as validation and staging; and to have moderate values when carrying out push-down operations such as transformations.
Generally for Matillion ETL, the limiting factor on performance will be network I/O or CPU, so this is an important figure to monitor. Some considerations:
- It may be worth staggering job runs so that they have their highest CPU periods at different times; this will improve overall throughput.
- If it isn't possible to stagger job runs, then a larger virtual machine (VM) with more resources may improve completion times.
- If the CPU never reaches 100%, then it may be possible for you to use a smaller VM, which will decrease your costs.
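The CPU figure can be thought of as the non-idle share of time across all cores between two samples. As a minimal sketch (assuming Linux-style `/proc/stat` counters, which is how tools such as `top` compute this; the sample values below are made up):

```python
# Sketch: derive average CPU usage (%) across all cores from two samples of
# the aggregate "cpu" line in /proc/stat. Field layout per proc(5):
# cpu  user nice system idle iowait irq softirq steal ...

def cpu_usage_percent(sample_before: str, sample_after: str) -> float:
    """Percentage of non-idle CPU time between two '/proc/stat' cpu lines."""
    def parse(line):
        fields = [int(x) for x in line.split()[1:]]
        idle = fields[3] + fields[4]      # idle + iowait jiffies
        total = sum(fields)
        return idle, total

    idle1, total1 = parse(sample_before)
    idle2, total2 = parse(sample_after)
    delta_total = total2 - total1
    delta_idle = idle2 - idle1
    if delta_total == 0:
        return 0.0
    return 100.0 * (delta_total - delta_idle) / delta_total

# Two synthetic samples one second apart: 400 jiffies elapsed in total,
# 300 of them idle, so average usage across all cores is 25%.
before = "cpu  100 0 100 1000 0 0 0 0"
after  = "cpu  150 0 150 1300 0 0 0 0"
print(cpu_usage_percent(before, after))  # -> 25.0
```

A reading of 100% here means every core was fully busy for the whole sampling interval.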
Memory usage
The following memory usage is monitored:
- METL heap: The memory usage by the "Matillion" part of Matillion ETL. This will generally increase and decrease quite often over time.
- METL non-heap: The memory usage by the "database connectivity" part of Matillion ETL, including JDBC caches. This will typically increase when database drivers are first loaded, but then remain relatively stable.
- METL committed: The amount of memory currently claimed by running Java processes, not including the Java virtual machine itself. This is the best representation of the memory used by Matillion ETL. The amounts of heap and non-heap memory used will tend to increase over time. When the total quantity used approaches the committed limit, Java will tend to "garbage collect". If garbage collection doesn't release sufficient memory for the current processes, Java will request that more memory be committed to it by the operating system (OS). If the committed memory exceeds what is required, then Java may perform memory compaction and release committed memory back to the OS.
- Memory free: The amount of memory that is immediately available on the Matillion ETL instance.
Note
Linux will use most of the available memory for caching disk activity, so this will usually be quite a small value for a long-running Matillion ETL instance; 1% would be typical.
- Memory total: The memory available to the Matillion ETL instance. This is either the raw instance size or the containerized limit, depending on how Matillion ETL has been installed.
- Memory usage: The percentage of available memory being used.
Memory usage is given in bytes, KB, or MB.
The most important of these for Matillion ETL stability is committed memory. If it isn't possible to allocate more when required, then the server will crash with an out-of-memory error. Matillion ETL instances with a locally hosted database have less of the total memory available to commit than those with an external database.
Memory usage will usually be correlated with job runs. If any job has a particularly high memory requirement, it may be effective to stagger it so that more jobs can be run on the Matillion ETL instance. It may also be worthwhile to use a larger VM with more memory available.
If memory usage is never high, then you might be able to improve hosting costs by using a smaller VM.
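The free, total, and usage figures relate to each other in a simple way: usage is the proportion of total memory not reported as free. A sketch of that calculation, assuming Linux `/proc/meminfo`-style counters (the snapshot values below are illustrative, not from a real instance):

```python
# Sketch: compute "Memory free", "Memory total", and "Memory usage" from a
# /proc/meminfo-style snapshot. Values in /proc/meminfo are given in kB.

def memory_stats(meminfo: str) -> dict:
    kv = {}
    for line in meminfo.strip().splitlines():
        key, rest = line.split(":", 1)
        kv[key] = int(rest.split()[0])    # first token after the colon is kB
    total = kv["MemTotal"]
    free = kv["MemFree"]
    return {
        "total_kb": total,
        "free_kb": free,
        "usage_pct": round(100.0 * (total - free) / total, 1),
    }

# Illustrative snapshot: note the ~1% MemFree, which is typical because
# Linux keeps most otherwise-idle RAM busy caching disk activity.
sample = """MemTotal:       16384000 kB
MemFree:          163840 kB
Buffers:          512000 kB
Cached:          8192000 kB"""
print(memory_stats(sample))
```

Note that a high usage percentage on its own isn't alarming for the reason given in the note above; committed memory is the figure to watch for stability.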
Disk usage and database size
Displays a percentage for local disks, and a number of megabytes (MB) for the database. As the database may be external, Matillion ETL can't calculate a percentage for it.
Matillion ETL requires some free space for its operation. Disk usage will change over time:
- Temporarily, due to staging operations and some in-progress components that require temporary files, such as the Excel Query component. These will reduce again as jobs complete.
- Permanently, usually due to log storage. This might be on disk, such as the Catalina logs, or in the database, such as task run histories.
If disk usage reaches a high value, Matillion ETL will not be able to execute jobs. You should make some disk space available by clearing or compacting logs, or by increasing the disk space available to the Matillion ETL instance. Similarly, if this value is never high, the disk space available can be reduced to save costs.
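The local-disk figure is a percentage-used value, which can be reproduced with the standard library (the path here is illustrative; check whichever filesystems your instance actually uses):

```python
# Sketch: percentage-used for a local filesystem, analogous to the
# monitor's local-disk figure.
import shutil

def disk_usage_percent(path: str = "/") -> float:
    usage = shutil.disk_usage(path)
    return round(100.0 * usage.used / usage.total, 1)

print(f"Root filesystem: {disk_usage_percent('/')}% used")
```

The database figure can't be expressed this way because, for an external database, Matillion ETL has no view of the total space available, hence the raw MB value.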
Log files usage
The size of the Tomcat logging directory.
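This metric is effectively a recursive directory size. A minimal sketch of the equivalent check (point it at your own Tomcat logs directory; the exact path depends on your installation):

```python
# Sketch: total size in bytes of all files under a directory, analogous to
# the "Log files usage" metric for the Tomcat logging directory.
import os

def directory_size_bytes(path: str) -> int:
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):        # skip broken symlinks
                total += os.path.getsize(fp)
    return total
```

Clearing or compacting these logs is one of the quickest ways to recover disk space, as noted above.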
Network usage
This includes the following metrics:
- Network input
- Network output
- Network input total
- Network output total
These metrics are measured in bits per second (b/s, Kb/s, or Mb/s). The values are taken from Linux network statistics, and will include non-Java activity such as Bash Script and Python Script components.
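The per-second rates are derived from kernel byte counters (such as those in `/proc/net/dev`), sampled twice and converted from bytes to bits. A sketch of that conversion, with made-up counter values:

```python
# Sketch: turn two readings of a cumulative byte counter into a rate in
# bits per second, as used for the network input/output metrics.

def bits_per_second(bytes_before: int, bytes_after: int, interval_s: float) -> float:
    """Convert a byte-counter delta over an interval into bits per second."""
    return (bytes_after - bytes_before) * 8 / interval_s

# 125,000 bytes received over 1 second -> 1,000,000 b/s (1 Mb/s)
print(bits_per_second(1_000_000, 1_125_000, 1.0))  # -> 1000000.0
```

The "total" variants are simply the cumulative counters themselves, without the division by the sampling interval.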
The network profile will vary depending on which jobs are being run, in much the same way as CPU usage does. Orchestration jobs will typically require more network use than transformation jobs: orchestration is quite active in coordinating resources, whereas transformation is generally pushed down to the cloud platform while Matillion ETL awaits the results.
If any single job has particularly high network activity, it may be worth staggering it away from other jobs to maximize throughput.
Lower network activity than expected may represent transient network conditions, or may mean that activity is being limited by some other resource. It would be worth comparing network usage with CPU usage to ensure that throughput isn't being limited by CPU resource. Depending on how your Matillion ETL instance is hosted, it may be possible to increase the network adapter size to improve throughput.
Network activity should never show as completely idle, but should stay under 1000 b/s when no jobs are running.
Network totals may be useful for auditing purposes.
Pending messages and websocket usage
Shows the number of pending messages, and websocket usage as a percentage for the busiest connected client.
The Matillion ETL server interacts with connected clients by sending websocket messages between the server and each client. The messages are sent asynchronously, and it's normal to see some messages pending when they're created faster than they can be downloaded. This is usually the case when first joining a project.
Normally the number of pending messages should be quite low, and zero is optimal, indicating that messages are being downloaded as quickly as they are generated. Temporary small increases may represent a network issue for a client, and aren't worrying unless they don't decrease again.
We normally limit the websocket queue to one thousand items, to prevent out-of-memory issues on the server. However, large values increase the latency between client and server, and may cause lagginess. If the percentage usage is persistently high, we suggest that you contact support so that we can investigate the issue.
Thread usage
These metrics are simple counts of the number of threads. The status indications are based on the Java thread state. Read the Java documentation for details.
This includes the following metrics:
Metric | Description | Java thread state
---|---|---
Running threads | The thread is currently executing/executable. | RUNNABLE |
Waiting threads | The thread is waiting indefinitely for another thread to perform an action. | WAITING |
Sleeping threads | The thread is waiting for a specified time period for another thread to perform an action. | TIMED_WAITING |
Blocked threads | The thread is blocked and waiting for "monitor lock". | BLOCKED |
Zombie threads | The thread hasn't started, or has exited but its resources haven't been reclaimed. | NEW and TERMINATED |
The appearance of zombie threads is a concern, as it indicates a problem with the underlying Java virtual machine. Otherwise, threads starting, waiting, and completing indicates normal Matillion ETL operation.
Typical "idle" thread distribution in Matillion ETL is about 20 running, 0 waiting, 50 sleeping, 100 blocked, and 0 zombie, depending on the Matillion ETL version. More threads will start running when tasks are executing. The more jobs running concurrently, the more they will have to synchronize, leading to more blocked, waiting, and sleeping threads. As long as they all leave those states as other threads progress, all is well.
If there are more runnable threads than processors available in your Matillion ETL instance, then it might be possible to increase throughput with more CPU cores; it may be worth cross-referencing this number with total CPU load during heavy usage.
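These counts correspond to the standard Java `Thread.State` values, which also appear in a thread dump produced by `jstack`. As an illustration only (the monitor reads these counts internally, not by parsing dumps; the dump text below is a trimmed, made-up example), the same tally can be sketched like this:

```python
# Sketch: tally Java thread states from jstack-style thread dump text.
# Lines of interest look like:  java.lang.Thread.State: TIMED_WAITING (sleeping)
from collections import Counter
import re

def thread_state_counts(dump: str) -> Counter:
    return Counter(re.findall(r"java\.lang\.Thread\.State: (\w+)", dump))

dump = """
"http-nio-8080-exec-1" #25 daemon
   java.lang.Thread.State: RUNNABLE
"Quartz_Worker-1" #31
   java.lang.Thread.State: TIMED_WAITING (sleeping)
"pool-2-thread-1" #40
   java.lang.Thread.State: WAITING (parking)
"""
print(thread_state_counts(dump))
```

Mapping to the table above: RUNNABLE counts as running, WAITING as waiting, TIMED_WAITING as sleeping, BLOCKED as blocked, and NEW or TERMINATED as zombie.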
Total quartz threads
Quartz (Quartz Enterprise Job Scheduler) is the scheduling library that Matillion ETL uses to manage scheduled jobs, message queues, API requests, and to share the triggered jobs between clusters. This metric shows how many Quartz threads are running.
Running jobs and pending jobs
The number of jobs that are pending or running. These counts are for the whole cluster, not the individual node.
Temporary files
The number and size of temporary file allocations.