NRT replication In Redshift

Overview

Amazon Redshift offers a fast and scalable data processing platform at around 1/10th the cost of a typical equivalent on-premises solution. This makes it perfect for integrating multiple different source systems, to produce timely, consistent, joined-up reporting. Matillion ETL of course provides many mechanisms which can help you easily load data from a wide variety of sources and formats into Redshift.

However, by its very nature, Redshift is physically separate from those multiple source systems. There are a variety of tools, such as DMS or the AWS Data Pipeline, which you can use to configure replication - but these require a certain amount of administration to set up and maintain.

By combining a few simple out of the box techniques, you can use Matillion ETL to load your data into Redshift's massively parallel processing engine, while still giving your users access to a near real time copy of data from source systems.

A full technical article on microbatching in Matillion ETL can be found here...

Building a microbatch graphically

Let's imagine that you have an operational system which is continually recording sales transactions.

The business needs a constant flow of up-to-the-minute analytics to keep track of what's happening. Trying to run aggregate reports directly on this operational system puts it under great strain, because it's a workload that's difficult for operational systems to handle.

The solution is to run those large, complex queries on a copy of the data, held in Redshift, and to keep that copy as close as possible to the real data. The key to making this work is to very frequently copy just the few new sales records over to Redshift. This is a microbatch.

Matillion ETL's graphical job builder interface enables you to construct and document the basic logic to make this happen, using just three components:

On the left there is a source system query component , which is driven by a check on Redshift to find the latest sale transaction that we have already loaded. Both these queries run quickly on their respective target due to Redshift's power and the source system's ability to index.

On the right there is a Redshift Append component , which efficiently adds the few new sales records to the Redshift table.

Together, these three components represent the E, L and T steps of an "ELT" microbatch in miniature. On the left, the E and L acquire the new data from source. On the right, the T step adds it into its destination table in Redshift.

Using Matillion ETL's close AWS integration to maximise efficiency.

Now that the logic is in place to acquire and copy new sales transactions, the next step is to make this happen often, and with minimal delay. This will ensure that queries running on Redshift will have access to the very latest data.

A solution that's often proposed is to schedule the microbatch at regular intervals - for example five minutes apart. But that presents a couple of practical problems:

When the source system is very quiet, five minutes may be much too long, and it would be possible to fit multiple microbatches into this timeslot. This is a missed opportunity to be more up-to-date.
When the source system is very busy, five minutes may be too short, since the source system may take longer to respond with all the new data. This can lead to all kinds of undesirable problems such as missing batches, and having waiting batches start to stack up.

The answer is to make the scheduling much more dynamic. Using Matillion ETL's tight integration with the AWS Simple Queueing Service it's easy to make sure that the next microbatch starts at exactly the same time as the previous one finishes. Only one component is needed:

This component is the final step. It enqueues a short message into SQS which instructs Matillion ETL to launch the next microbatch.This link ensures that the microbatches always occur at exactly the right time: never too soon or too late.

In Matillion ETL, you'll see a task history start to build up like this:

The source system takes only a couple of seconds to identify and extract the new records, thanks to its efficient indexing. Redshift takes only a couple of seconds to add the new data and adjust the high water mark ready for the next run.

Implementation

A couple of simple extra features are very useful to make the system a little easier to manage at runtime.

Switching it off

Matillion ETL's Environment Variable subsystem provides an easy way to switch off the microbatch loop, which would otherwise simply run forever.

A single environment variable, normally set to "on", is checked every time, using an "IF" component.

When manually switched to "off", the microbatch simply ends successfully (following the orange line), rather than going on to request the next run (following the blue line).

Avoiding maintenance windows

All systems reserve time windows for maintenance activities such as backups. This applies to Redshift, RDS, and equally to in-house systems such as the sales transaction database in this example.

Matillion ETL's IF component can be made to branch based on a simple JavaScript expression, such as checking if the time is approaching a known maintenance window.

This works very similarly to the on/off branch shown above. If the time is approaching 1am the IF component simply ends the microbatch successfully (following the orange line again) without launching the next iteration.

Later on, at 2am, Matillion ETL's built-in scheduler will automatically re-start the process for the next day.

Putting it all together

The completed Matillion ETL job looks like this.

Matillion ETL's graphical working environment makes it easy to construct a data handling job like this, and ensure it's easy to understand and maintain in the future.

Using Matillion ETL, this business can now have access to sophisticated analytics, driven by Redshift, using data that's always bang up to date.