Python Script additional settings

This page covers additional settings and information related to the Python Script component.

Variables

Environment variables

The Python Script component creates a set of new variables with the same name, type, and default value as those listed in the environment variables list. Environment variables can therefore be used directly within the script. The ${variable} syntax is not required; you may simply use variable.

Because the Python script already contains Python counterparts of the environment variables, avoid reusing those names for your own variables, especially for values of a different type.

Context variables

Variables in the Python script are discarded when the script ends. If you need to push values back to environment variables for use in other components later in the job, use the special context object, like so:

context.updateVariable("variable", "new value")

Both arguments are strings. The value must be a string that parses as the target variable's type.
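As a minimal sketch of the string requirement above, the following assumes a hypothetical numeric environment variable named row_count. Inside Matillion ETL the context object is supplied for you; the StandInContext class is an assumption that exists only so the snippet can run elsewhere, and it is not the real implementation.

```python
# Hypothetical stand-in for the context object that Matillion ETL provides.
# It only records calls and is NOT the real implementation.
class StandInContext(object):
    def __init__(self):
        self.updated = {}

    def updateVariable(self, key, value):
        self.updated[key] = value


context = StandInContext()  # omit this line inside Matillion ETL

# "row_count" is an assumed numeric environment variable. Inside the
# component its Python counterpart would already exist.
row_count = 41
row_count = row_count + 1

# Convert the new value to a string that parses as the target type.
context.updateVariable("row_count", str(row_count))
```

Converting with str() at the call site keeps the script's own variable numeric while still passing a string to the context.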

The full list of context methods is given below.

updateVariable(key, value)
updateGridVariable(key, value)
getGridVariable(key)
cursor() (Jython only.)
isCancelled() (Returns a Boolean. Useful on Jython only.)

Grid variables

Grid Variables can also be accessed through the Python Script component. Details on using grid variables in this manner can be found in Grid Variables.
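The sketch below illustrates the two grid methods listed earlier. It assumes that a grid is represented as a list of rows, each row being a list of column values; that shape, the grid name table_list, and the StandInContext class are all assumptions made for illustration, so check the Grid Variables page for the authoritative behaviour.

```python
# Hypothetical stand-in for the Matillion ETL context object.
class StandInContext(object):
    def __init__(self):
        self.grids = {}

    def updateGridVariable(self, key, value):
        self.grids[key] = value

    def getGridVariable(self, key):
        return self.grids[key]


context = StandInContext()  # omit this line inside Matillion ETL

# Assumption: a grid is a list of rows, each row a list of column values.
# "table_list" is a hypothetical two-column grid variable.
context.updateGridVariable(
    "table_list",
    [["flights", "2020"], ["airports", "2021"]],
)

# Read the grid back and pick out the first column of every row.
table_names = [row[0] for row in context.getGridVariable("table_list")]
print(table_names)
```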

Database access (Jython only)

To access the database defined in the current environment, use the cursor object provided.

cursor = context.cursor()

The cursor object is described in the Python DB-API V2 and implemented via the zxJDBC package in Matillion ETL. The connection is made automatically for you using the current environment defined in Matillion ETL, and this connection will be closed automatically after the script terminates.

This feature is provided for convenience, not for retrieving large amounts of data. After executing a query, iterate the cursor to retrieve the results one row at a time, and avoid fetchall(), which may lead to out-of-memory issues.
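The row-at-a-time pattern described above can be sketched as follows. So that the snippet runs outside Matillion ETL, it uses the standard library sqlite3 module as a stand-in, since it also follows the Python DB-API V2; inside the component you would obtain the cursor from context.cursor() instead. The table contents are hypothetical.

```python
import sqlite3

# Stand-in connection so the sketch runs outside Matillion ETL.
# Inside the component you would instead use: cursor = context.cursor()
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table and data, for illustration only.
cursor.execute("create table flights (carrier text, delay integer)")
cursor.executemany(
    "insert into flights values (?, ?)",
    [("AA", 5), ("BA", 12), ("AA", 7)],
)

cursor.execute("select carrier, delay from flights")

# Iterate the cursor so rows are consumed one at a time, rather than
# collecting the whole result set into a list with fetchall().
total_delay = 0
for carrier, delay in cursor:
    total_delay += delay

print(total_delay)
connection.close()
```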

If you execute database updates, don't try to commit or roll back. Transactions are handled for you, either automatically (Auto-commit mode) or manually using the Begin/Commit/Rollback components. This is a change from previous versions, so older scripts may still contain commit() calls, which should be removed.

Task cancellation

Jython

Scripts are never forcibly killed. If you want a long-running script to respond to task cancellation, the script must check for cancellation and act accordingly, ensuring any resources are cleaned up. Cancellation can be checked by querying the context: context.isCancelled()

Because the cancellation is handled within the script, the component will still end successfully, as no uncaught exception has been thrown. To have a cancellation also mark the script task as a failure, raise an exception. For example:

if context.isCancelled():
    raise Exception("Script cancelled during loop")
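Extending the check above, a long-running loop might combine the cancellation test with resource cleanup. This is only a sketch: the StandInContext class is a hypothetical substitute for the real context object, and it reports cancellation on the third check so the behaviour can be seen outside Matillion ETL.

```python
# Hypothetical stand-in for the Matillion ETL context object. It reports
# cancellation on the third check so the sketch can run elsewhere.
class StandInContext(object):
    def __init__(self, cancel_on_check):
        self.checks = 0
        self.cancel_on_check = cancel_on_check

    def isCancelled(self):
        self.checks += 1
        return self.checks >= self.cancel_on_check


context = StandInContext(cancel_on_check=3)  # omit inside Matillion ETL

processed = []
cleaned_up = False
outcome = "completed"

try:
    try:
        for item in range(10):  # placeholder for long-running work
            if context.isCancelled():
                raise Exception("Script cancelled during loop")
            processed.append(item)
    finally:
        # Runs whether the loop finished or was cancelled.
        cleaned_up = True
except Exception as error:
    # Inside the component you would let the exception propagate so the
    # task is marked as failed. It is caught here only for the demo.
    outcome = str(error)

print(outcome)
```

Placing the cleanup in a finally block means it runs on both the normal and the cancelled path, while the exception still decides whether the task is reported as failed.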

Python2 and Python3

A Timeout property is available in the component when the interpreter is set to Python2 or Python3. If a script runs longer than its timeout (in seconds), it is forcibly killed, similar to the BASH component.

Additional modules

Python2

Install additional Python modules using pip. SSH into the Matillion ETL instance and run as root:

sudo pip install moduleName

Python directories can be found with the following Python code:

from distutils.sysconfig import get_python_lib
print(get_python_lib())

Jython

Install additional Jython modules using pip2. SSH into the Matillion ETL instance and run pip2 with the Jython site-packages path as the target. For example:

pip2 install moduleName --target=/usr/share/emerald/WEB-INF/lib/Lib/site-packages

Jython directories can be found with the following Jython code:

import site
print(site.getsitepackages())

Python3

Install additional Python modules using pip. SSH into the Matillion ETL instance and, as root, first install pip for the Python 3 version in use.

Python3.6 (Matillion ETL version 1.40 and later)

sudo yum install python36-pip

Then, to install modules such as Boto3:

sudo pip-3.6 install boto3

Python3.4 (Matillion ETL version 1.39 and earlier)

sudo yum install python34-pip

Then, to install modules such as Boto3:

sudo pip-3.4 install boto3

If a module still fails to import after installation, with an error such as ModuleNotFoundError: No module named moduleName, follow these steps (the requests library is used as the example module):

Display the pip version: sudo python3 -m pip --version
Install/upgrade pip: sudo python3 -m pip install --upgrade pip
Install the requests library: sudo python3 -m pip install requests

After running these commands, import requests in the Python script should no longer raise ModuleNotFoundError.
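When diagnosing an import error like the one above, it can help to confirm which interpreter the script runs under and whether that interpreter can locate the module. The following Python 3 sketch uses only the standard library; is_importable is a hypothetical helper name, not part of Matillion ETL.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the running interpreter can locate module_name."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter path helps confirm that pip installed the module for
# the same Python that runs the script.
print(sys.executable)
print(is_importable("json"))      # standard library module
print(is_importable("requests"))  # result depends on the installation
```

If the helper reports that a module cannot be found even though pip says it is installed, the module was probably installed for a different Python version.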

Uploading your own modules

As well as pip, you may also upload your own modules to the instance. In such cases, you must include the location of the modules in the Python search path, and this location must be readable by the Tomcat user. The general format is:

import sys
sys.path.append('/path/to/directory/with/python/modules/and/packages')

For most users, the directories should be:

Jython: /usr/share/emerald/WEB-INF/lib/Lib/site-packages
Python2: /usr/lib/python2.7/dist-packages
Python3: /usr/lib/python3.6/dist-packages

For example:

import sys
sys.path.append('/usr/lib/python2.7/dist-packages')
import requests
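To see the search-path mechanism end to end, the sketch below writes a small module into a temporary directory, adds that directory to sys.path, and imports it. The temporary directory stands in for the upload location on the instance, and the names my_helpers and double are hypothetical.

```python
import os
import sys
import tempfile

# A temporary directory stands in for the upload location. On the
# instance this would be a directory readable by the Tomcat user.
module_dir = tempfile.mkdtemp()

# "my_helpers" and "double" are hypothetical names for illustration.
with open(os.path.join(module_dir, "my_helpers.py"), "w") as handle:
    handle.write("def double(value):\n    return value * 2\n")

# Add the directory to the search path only once.
if module_dir not in sys.path:
    sys.path.append(module_dir)

import my_helpers

print(my_helpers.double(21))
```

Checking for the directory before appending avoids growing sys.path with duplicate entries.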

Note

Regardless of whether a module is installed with pip or manually, it must not rely on external C modules if it is to run successfully on the embedded Jython interpreter. Such modules should, however, work on Python2 and Python3.

Python script examples

Example 1

This example moves all the objects within an Amazon S3 bucket into another S3 bucket. You may wish to do this following an S3 Load, to ensure those same files are not loaded again by subsequent runs of this same job. The target bucket could also use Amazon S3 Glacier to reduce the cost of storing the already loaded files.

import boto3
s3_resource = boto3.resource('s3')
new_bucket_name = "targetBucketName"
bucket_to_copy = "sourceBucketName"

s3bucket = s3_resource.Bucket(bucket_to_copy)

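for obj in s3bucket.objects.all():
    key = obj.key
    copy_source = {'Bucket': bucket_to_copy, 'Key': key}
    # Copy the object to the target bucket under the same key
    s3_resource.meta.client.copy(copy_source, new_bucket_name, key)
    # Remove the source object so that the copy becomes a move
    obj.delete()
    print(key)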

The Python script imports the Boto3 module and uses it to move the files. The script copies each object to the other bucket and then removes the source object. A similar script could instead rename the objects and leave them within the same bucket. In the script editor, a list of available variables is given on the left of the window, and these can be used in the code written on the right. Clicking Run executes the script as though the component had been run from the Matillion ETL UI, and the output of the code is shown beneath it.

Example 2

The example script below shows a database query that retrieves a single (aggregate) row of data, and stores the result into a variable for use elsewhere in Matillion ETL.

cursor = context.cursor()
cursor.execute("select count(*) from flights")
result = cursor.fetchone()
print(result)
context.updateVariable("total_count", str(result[0]))