Designing a job for a high availability cluster
When a Matillion ETL job runs on a high availability (HA) cluster and a node fails (for example, because of a network failure or an instance crash), another node will detect the failure within a few seconds and, provided the job was invoked by the Matillion ETL job scheduler, re-submit it from the start. Jobs invoked any other way (through the UI, via the API, or via SQS) are not restarted. Using Matillion ETL on HA clusters is an Enterprise Mode feature, meaning it is only available to customers on Enterprise Mode instance sizes. For more information, read the Instance Sizes Guide.
Making sure your job is suitable for execution in such an environment revolves around two concepts:
- Transaction Control
- Idempotence
Either concept, or more commonly a mixture of both, will result in durable jobs that will complete even in the event of total failure of an instance.
Transaction Control
The idea of a database transaction is that if an error occurs, data is rolled back to the state it was in before the transaction began. A transaction can span many operations, and on Redshift this includes DDL statements as well as DML operations. Only if all operations succeed will any other database users see the changes made during that transaction. Matillion implements transactions as separate Begin, Commit, and Rollback components; if you do not use any of these components in your job, each individual statement behaves as if it were its own transaction.
A rollback can be triggered by failure of the instance, by a network outage, or by running the Rollback component.
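The Begin, Commit, and Rollback components map onto ordinary SQL transaction control on the target database. The following is a minimal sketch of that pattern, assuming hypothetical fact_sales, stg_sales, and etl_audit tables; it is not the exact SQL that Matillion generates.

```sql
-- Hypothetical sketch: two dependent writes wrapped in one transaction.
-- If the node fails before COMMIT, neither change becomes visible to
-- other sessions and the open transaction is rolled back.
BEGIN;

INSERT INTO fact_sales
SELECT * FROM stg_sales;

UPDATE etl_audit
SET    last_loaded = GETDATE()
WHERE  table_name = 'fact_sales';

COMMIT;
```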
Advantages of Transaction Control:
- It's simple to implement.
- It's handled by the target database.
Disadvantages of Transaction Control:
- TRUNCATE is not safe inside a transaction. In Matillion, a Truncate that runs during a transaction is implemented as DELETE FROM, which then requires a VACUUM to reclaim the unused space (see the sketch after this list).
- While the transaction is open, both old and new data must be kept. (Other users see the old data until a commit.)
- Only affects the database data and not storage areas (for example, S3).
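To illustrate the TRUNCATE caveat above, the hypothetical snippet below contrasts truncating a table outside a transaction with the DELETE FROM and VACUUM pattern that applies when the same step runs inside one. It is a sketch of general Redshift behaviour rather than Matillion's exact generated SQL; stg_orders is a placeholder table name.

```sql
-- Outside a transaction: TRUNCATE is fast, commits immediately,
-- and reclaims the space straight away (but cannot be rolled back).
TRUNCATE TABLE stg_orders;

-- Inside a transaction: the equivalent step becomes DELETE FROM,
-- which can be rolled back but leaves dead rows behind.
BEGIN;
DELETE FROM stg_orders;
-- ... load and transform steps ...
COMMIT;

-- The space from the deleted rows is only reclaimed by a later VACUUM.
VACUUM stg_orders;
```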
Idempotence
The concept of idempotence is that if a job fails and is re-run, it doesn't matter where it failed, because re-running the whole job will bring everything back into a consistent state.
For example, suppose a staging table is truncated, a file is loaded into it, the data is transformed, and finally a target table is updated. It doesn't matter how far a failed run got before the final table update, because re-running the job from the start has the same overall effect. So provided the final table update is done inside a transaction (a table update actually executes multiple statements), the rest of the job does not have to be.
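A minimal sketch of that pattern in SQL is shown below, using hypothetical stg_customers and dim_customers tables and a placeholder S3 location. Everything before the final target update can safely be repeated any number of times; only the last step needs transaction control.

```sql
-- Steps 1-3 are idempotent: re-running them from scratch always leaves
-- the staging table in the same state.
TRUNCATE TABLE stg_customers;                        -- 1. start from empty

COPY stg_customers                                   -- 2. reload the file(s)
FROM 's3://example-bucket/customers/'                --    placeholder bucket
IAM_ROLE 'arn:aws:iam::123456789012:role/example'    --    placeholder role
FORMAT AS CSV;

UPDATE stg_customers                                 -- 3. transform in place
SET    email = LOWER(email);

-- 4. Only the final target update needs transaction control, because it
--    executes multiple statements that must succeed or fail together.
BEGIN;
DELETE FROM dim_customers
WHERE  customer_id IN (SELECT customer_id FROM stg_customers);
INSERT INTO dim_customers
SELECT * FROM stg_customers;
COMMIT;
```

Re-running this whole script after a failure at any point produces the same final state as a single successful run.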
Advantages of Idempotence:
- On failure, you can re-run the job with confidence that nothing will get worse.
- A good design principle is to design for failure.
- It works for any resource, not just the target database.
Disadvantages of Idempotence:
- You may need additional components that appear to be unnecessary. For example, truncating a staging table that should, under normal circumstances, already be empty can seem wasteful.
- It can sometimes be quite difficult to get just right.
Transformation Jobs
Warning
Transformation jobs cannot safely be run in parallel with themselves on an HA cluster: if the same transformation job is executed more than once concurrently, unexpected behaviour will occur.
- Differently named transformation jobs are distinct entities and can be run in parallel with one another.
- The same transformation job cannot be successfully executed more than once at a given moment, even though nothing prevents you from attempting it.
This is because the variable values used by one run of a transformation job can be confused with those of a concurrent run of the same job.
To work around this limitation, it is recommended that users package or reference their transformation jobs within a Shared Job when they intend to (or risk) running transformation jobs in parallel on an HA cluster.
Keep in mind that this does not guarantee the jobs will, in practice, run in parallel, as that depends on how they are distributed across the nodes in the cluster, which cannot be controlled.
Summary
Knowing that a job can simply be re-run in the event of a failure is a good safety net, and even without High Availability failover (where the re-run is automatic) it is still a good design goal to strive for. It makes supporting the job in production easier, because DevOps engineers won't need to know what the jobs do.
Understanding the limitations of HA jobs and how to design your workflows accordingly will save a lot of time troubleshooting issues later.