Executing Python scripts outside of Matillion

Matillion ETL can run Python scripts as part of a job. This is used for setting variable values, running quick queries on the target database using database cursors, or connecting to other services. However, there are some circumstances where the amount of processing required by the Python script means it can be more sensible to run the script elsewhere to minimize the impact on the rest of the Matillion ETL server. This can be because the Python script requires a large amount of processing power, which may affect the resources on the EC2 instance used to host the Matillion ETL application, or because of corporate requirements around security or execution isolation.

Examples of where this is the case include processing large files, such as converting XML to JSON or zipping/unzipping large files, or making API calls that require or return a lot of data. In this article, we will look at a simple Python script used to convert an XML file to a JSON file so it can be loaded into the target database. This script, while simple, can require a lot of memory and processing power to convert large files.

Below are some options for running Python scripts outside of the Matillion ETL instance using the Bash Script component.


SSH

It's possible to run the script remotely on another EC2 instance by connecting to it over SSH. If required, the AWS CLI can be used to start and stop the instance on demand (see the sketch after the example). The SSH command connects to the instance using a key stored on the Matillion ETL server, and the script is run using the python command. In this example, the Python script is passed through to the remote server using a HereDoc. Below is an example script:

ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON  
import xmltodict, json, boto3

s3 = boto3.resource('s3')
s3.meta.client.download_file('mtln-test', 'example.xml', '/tmp/example.xml')

with open("/tmp/example.xml", "r") as myfile:
    xml_data = myfile.read()

o = xmltodict.parse(xml_data)
#print json.dumps(o)
open('/tmp/example.json', 'w').write(json.dumps(o))
s3.meta.client.upload_file('/tmp/example.json', 'mtln-test', 'example.json')
print 'file converted successfully'

END_OF_PYTHON

The result of the Python script will be printed in the Bash window:

Run Bash Script component
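
As mentioned above, the AWS CLI can be used to start and stop the remote instance on demand so that it only runs while needed. Below is a minimal sketch; the instance ID is a placeholder, and note that SSH may take a few more seconds to become available after the instance reports as running:

#start the remote instance and wait until it reports as running
aws ec2 start-instances --instance-ids i-0123456789abcdef0 --region eu-west-1
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0 --region eu-west-1

#...run the SSH command shown above...

#stop the instance once the script has finished
aws ec2 stop-instances --instance-ids i-0123456789abcdef0 --region eu-west-1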

Pros

  • Control over the memory allocated on the remote server to the Python script.
  • All libraries can be installed and maintained on the remote server.
  • Version of Python can be changed as required.

Cons

  • Cost and maintenance implications of a new server.
  • Cannot reuse the Matillion ETL connection to Redshift to run commands via a database cursor, although a new connection could be created.
  • Variables are not available in the global namespace, nor can they be updated; i.e., it's not possible to reuse the "context" object Matillion ETL provides. However, referencing variables using the syntax ${this_syntax} will resolve to the variable value before the Bash script executes.

Lambda

AWS Lambda allows you to run code without worrying about servers. You simply pay for the compute time you use. To run a Python script in AWS Lambda, a Lambda function must first be created and then run when required. An example function is shown below:

Lambda function
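
As a minimal sketch of the kind of function shown above, the helloworld.py script assumed in the walkthrough below could be written as follows. The handler must be a function whose name matches the --handler flag passed when the function is created (index.handler below, i.e. a function called handler):

#write a placeholder handler to /tmp/helloworld.py (sketch only)
cat << 'END_OF_PYTHON' > /tmp/helloworld.py
def handler(event, context):
    #the real function would do the XML to JSON work here
    print 'hello world'
    return 'hello world'
END_OF_PYTHON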

The AWS CLI supports creating a Python function in Lambda and then running it. These CLI commands can be run from a Bash Script component within Matillion ETL. The example script below assumes the helloworld.py Python script exists in the /tmp directory on the Matillion ETL server and walks through the steps required to create a Lambda function from it and run it.

Note

The script below redefines the function on every run. If the code does not change, the function can simply be run using the invoke command:

#Lambda requires the handler file to be called index.py so first rename the python script
cp /tmp/helloworld.py index.py
#zip function up
zip function.zip index.py

#delete function if already exists
aws lambda delete-function --function-name helloworld --region eu-west-1

#create function in Lambda
aws lambda create-function --function-name helloworld --zip-file fileb://function.zip --handler index.handler --runtime python2.7 --role arn:aws:iam::557249109363:role/lambda-cli-role --region eu-west-1

#run function
aws lambda invoke --function-name helloworld --log-type Tail /tmp/outputfile.txt --region eu-west-1

#delete again
aws lambda delete-function --function-name helloworld --region eu-west-1
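
The function's return value is written to the output file given to the invoke command, and when --log-type Tail is used the invoke response also carries the last few kilobytes of the execution log, base64-encoded. A sketch of reading both:

#read the function's return value
cat /tmp/outputfile.txt

#decode the tail of the execution log from the invoke response
aws lambda invoke --function-name helloworld --log-type Tail /tmp/outputfile.txt --query 'LogResult' --output text --region eu-west-1 | base64 --decode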

Lambda function in Matillion ETL

Note

The AWS credentials for the environment used to run this job require the IAM policy AWSLambdaBasicExecutionRole.
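
If the policy is missing, it can be attached with the AWS CLI. A sketch, assuming the environment's credentials come from an IAM role; the role name is a placeholder (use attach-user-policy instead for an IAM user):

#attach the managed policy to the role behind the environment's credentials
aws iam attach-role-policy --role-name matillion-etl-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole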

Pros

  • No need to worry about servers.
  • Just pay for the compute resources used; there's no chance of accidentally leaving a server on.

Cons

  • Script can't run for more than 15 minutes.
  • Requires access to the IAM policy AWSLambdaBasicExecutionRole.
  • Dependencies must be installed by packaging them locally with the Lambda function or by using the console (see the sketch after this list).
  • Can't reuse the Matillion ETL connection to Redshift to run commands via a database cursor.
  • Output from the function is not displayed in the task history; instead, it has to be read from a separate file, as shown earlier.
  • Requires AWS Lambda, which has a cost implication.
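
For the dependency packaging mentioned above, a minimal sketch (the directory layout is illustrative):

#install the dependencies into a local folder and zip them with the handler
pip install xmltodict -t package/
cp index.py package/
cd package && zip -r ../function.zip . && cd ..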

Example with Variables

It is often a requirement that part of the Python script is made dynamic. This is achieved using variables in Matillion ETL. In the SSH example discussed earlier, the S3 bucket and file names were hard-coded, but it is unlikely that these names will always stay the same.

It is also unrealistic to edit the Python script every time the file or bucket names change. Instead, job variables can be set up in the Matillion ETL job and referenced in the HereDoc, which pushes the variable values to the remote server:

Manage job variables

ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON

import xmltodict, json, boto3

s3 = boto3.resource('s3')
s3.meta.client.download_file('$s3Bucket', '$xmlFile', '/tmp/$xmlFile')

with open("/tmp/$xmlFile", "r") as myfile:
    xml_data=myfile.read()

o = xmltodict.parse(xml_data)

#print json.dumps(o)
open('/tmp/$xmlFile.json', 'w').write(json.dumps(o))
s3.meta.client.upload_file('/tmp/$xmlFile.json', '$s3Bucket', '$xmlFile.json')
print 's3://$s3Bucket/$xmlFile.json'

END_OF_PYTHON

Note

  • The script has also been edited to return the bucket and file name created.
  • The output from the script run can also be pushed to an output file, which can then be read by another Matillion ETL component and passed into another variable value, if required.
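
For example, redirecting the HereDoc's output to a file captures the S3 path printed by the script so a later component can read it; the output file name below is arbitrary:

#capture everything the remote script prints
ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON > /tmp/conversion_output.txt
#...script as above...
END_OF_PYTHON
cat /tmp/conversion_output.txt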

Generating Python file

Some of the options above require that the Python script to run is available as a file on the Matillion ETL server. However, you may wish to write that file from within Matillion ETL. Matillion offers two components for doing this.

Python script component

The natural place to write the Python code is the Python Script component. Here you can take advantage of the syntax highlighting, and you can also use this approach to convert Python scripts already written into files to be run externally.

This approach uses a Bash Script component to export the job containing the Python script via the Matillion ETL API, parse out the relevant Python, and write the result to a file in /tmp. It requires a number of inputs to make the API call and identify the correct script.

Export job using Bash script
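
A hedged sketch of the approach is below. The export URL follows the v1 REST API shape, but the exact path and the layout of the exported JSON can differ between Matillion versions, and every name here (credentials, group, project, version, job, and component) is a placeholder:

#export the job definition via the Matillion ETL REST API
curl -s -u "${apiUser}:${apiPassword}" -o /tmp/job_export.json "http://localhost:8080/rest/v1/group/name/${groupName}/project/name/${projectName}/version/name/${versionName}/job/name/${jobName}/export"

#search the export generically for the named component's script and write it out
python << 'END_OF_PYTHON'
import json

TARGET = 'Convert XML'  #name of the Python Script component (case-sensitive)

def walk(node):
    #yield every dict anywhere in the exported JSON tree
    if isinstance(node, dict):
        yield node
        for value in node.values():
            for child in walk(value):
                yield child
    elif isinstance(node, list):
        for item in node:
            for child in walk(item):
                yield child

with open('/tmp/job_export.json') as f:
    export = json.load(f)

script = None
for d in walk(export):
    #layout assumption: the component dict holds its name and its script
    #among its string values
    values = [v for v in d.values() if isinstance(v, basestring)]
    if TARGET in values:
        candidates = [v for v in values if '\n' in v or 'import' in v]
        if candidates:
            script = candidates[0]
            break

if script is None:
    raise SystemExit('Python Script component not found: ' + TARGET)

with open('/tmp/xmltojson.py', 'w') as out:
    out.write(script)
print 'written /tmp/xmltojson.py'
END_OF_PYTHON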

When the script has run, it will either report that the file was created successfully, throw an error explaining why the API call was unsuccessful, or throw an error stating that the name of the Python Script component was incorrect.

Note

Variables are case-sensitive.

Considerations

Consider the following when developing Python scripts for use in Matillion ETL:

  • Running Python scripts may require considerable compute resources.
  • Using the remote options discussed ensures the script can be initiated from Matillion ETL but run in isolation.
  • The result of the script can easily be fed back to Matillion ETL. There are, however, some cases where it is best to run the script directly within Matillion ETL, for example any Jython code.
  • For a large .xml file, the Python script can take up to 10 minutes to run.

Warning

Running the Python script against large .xml files on the Matillion ETL server would affect the CPU and memory available to run your Matillion ETL jobs and could also have an adverse effect on your fellow instance users.

Bash component

A different approach is to write the Python code directly into a Bash Script component and use this to generate a file. This can be done with a HereDoc, as in the example below.

Bash - manage script
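
The screenshot above writes the script with a HereDoc along these lines; a sketch mirroring the earlier example (quoting the delimiter stops the shell expanding anything inside the script):

#write the Python script to /tmp straight from the Bash Script component
cat << 'END_OF_PYTHON' > /tmp/xmltojson.py
import xmltodict, json, boto3

s3 = boto3.resource('s3')
s3.meta.client.download_file('mtln-test', 'example.xml', '/tmp/example.xml')

with open('/tmp/example.xml', 'r') as myfile:
    xml_data = myfile.read()

o = xmltodict.parse(xml_data)
open('/tmp/example.json', 'w').write(json.dumps(o))
s3.meta.client.upload_file('/tmp/example.json', 'mtln-test', 'example.json')
END_OF_PYTHON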

The advantage of this approach is that it avoids the need for the API call and is also faster at creating the file. It also means users aren't tempted to "Run" the code from within Matillion ETL.