Tracking loaded files
Tracking Loaded Files
A common use case for Matillion ETL is to use Matillion's SQS integration to trigger the loading of a file from S3 into Redshift using lambda. Further details of this are here.
Suppose some transformations should be performed on the loaded data and these should only be done if one or more other files have also been loaded. This article looks at how to track the files loaded and only run the transformation if all data is available.First create a table to track which files have been loaded. This table can then be updated by the Orchestration jobs running and can be checked before a Transformation job is run. Create this table in a Transformation job in Matillion ETL as shown below using a Fixed Flow and Rewrite table component.
The table has 2 columns:
- The name of the file loaded
- An indicator to show if it has been loaded
The Fixed Flow component can also be used to populate the file names expected to be loaded. The default value for the loaded indicator should be false.
This transformation job only needs to be run once to initially create the table.
The next step is to update the table every time the triggered Orchestration job has been run. This can be done in another Transformation job:
This job takes the table created above and updates the indicator column to be true for the associated file name. The file name is passed into the job as a variable from the parent job.
In order to determine whether to run the Transformation which takes data from all the files, the tracking table needs to be queried. This can be done from a separate Orchestration job as this will need to be run after each load file Orchestration job.
First a Python script is used to obtain the number of records in the tracking table where the file hasn't been loaded. The result of this is written to a Matillion ETL variable using the code below.
The IF component then checks if that variable is equal to 0, if so the transform can be run and all flags need to be reset to false using a simple Transformation job:
If not the job ends successfully without the transform and waits for all other files to be loaded.