Python Script
Run a Python script. The script is executed in-process by an interpreter of the user's choice (Jython, Python2 or Python3). Any output written via print statements will appear as the task completion message, and so output should be brief.
While it is valid to handle exceptions within the script using try/except, any uncaught exceptions will cause the component to be marked as failed and its failure link to be followed.
Note
Once you have created and logged into your Matillion ETL instance, you need working instance credentials to run most tasks (for example, executing jobs and running Python scripts). Read Manage Credentials to learn more about credentials. To ensure that instance credentials access is managed correctly at all times, we always advise that customers limit scopes (permissions) where applicable.
Note
- Python scripts within this component are executed by the underlying virtual machine (VM) hosting Matillion ETL, and use the memory and compute of this VM. This component is not designed for large scale data processing scripts involving PySpark or other such compute-intensive packages.
- We strongly recommend that any customers wishing to run Boto3 do so using Python 3, not Jython.
- We urge customers to avoid uninstalling Python or pip via the
sudo yum remove python-pip
command. This command may erase and remove Matillion ETL and disrupt your workflows.
You may import any modules from the Python Standard Library. Matillion ETL does not uninstall any customer-installed Python libraries. Matillion ETL runs as a Tomcat user and care must be taken to ensure this user has sufficient access to resources.
Note
For Jython, Python 2, and Python 2.7, the 'Boto' and 'Boto3' APIs are made available to enable interaction with the rest of AWS. The AWS credentials defined in Matillion ETL are automatically made available, therefore it is not recommended (or necessary) to put security keys in the script.
Warning
Calling sys.exit()
from a Jython script will shut down Matillion ETL and is to be avoided. Jython scripts can be safely terminated by allowing the script to run to the end or by using quit()
.
Properties
Name
= string
A human-readable name for the component.
Script
= code editor
The Python script to execute.
Interpreter
= drop-down
Select the Python version interpreter. The default is Python 3. Read Manage Interpreters to learn how to set a version as an interpreter.
Timeout
= integer
The number of seconds to wait for script termination. After the set number of seconds has elapsed, the script is forcibly terminated. Doesn't work with Jython. The default is 300 seconds (5 minutes).
User
= drop-down
Set the user type. For legacy Jobs, this property will not be available. The default setting is Restricted to prevent accidental or inexperienced coding errors that could damage the Matillion ETL instance.
- Privileged: a privileged user will have access to credentials and Tomcat folders.
- Restricted: restricted users do not have access to credentials stored in the instance, nor any Tomcat folders.
If FORCE_RESTRICTED_USER_FOR_SCRIPT_COMPONENTS has been selected in the Manage optional features dialog, Restricted is the only available option.
This property is only visible if a restricted user has been created.
Strategy
Runs the python script, redirecting any output it produces into the task message.
Enabling the User property
To enable the User property, follow these steps:
- ssh into the Matillion ETL instance.
-
Create a .sh file and paste the following script into the
.sh
file:#!/bin/bash # add new restricted user to tomcat group useradd -r restricteduser usermod -a -G restricteduser tomcat SUDOERS_FILE="/etc/sudoers.d/matillion-sudo" TEMP_FILE=$(mktemp) # sudo configuration echo "# User rules for tomcat" | sudo tee -a "$TEMP_FILE" echo "Cmnd_Alias EMD = /sbin/service tomcat restart, /sbin/service tomcat restart, /bin/systemctl restart tomcat.service, /bin/systemctl restart tomcat.service, /sbin/sv restart tomcat, /sbin/sv restart tomcat, /bin/mkdir /usr/share/*, /usr/bin/yum -y check-update matillion-*, /usr/bin/yum -y update matillion-*, /bin/touch /usr/share/*, /bin/chmod 755 /usr/share/*, /bin/chown tomcat\\:tomcat /usr/share/emerald/oom, /bin/chown tomcat\\:tomcat /usr/share/emerald/oom/*, /usr/sbin/logrotate --force /etc/logrotate.d/tomcat, /usr/sbin/logrotate --force /etc/logrotate.d/tomcat, /usr/bin/su restricteduser -c /bin/bash /tmp/interpreter-input-*.tmp, /usr/bin/su restricteduser -c /usr/bin/python /tmp/interpreter-input-*.tmp, /usr/bin/su restricteduser -c /usr/bin/python3 /tmp/interpreter-input-*.tmp" | sudo tee -a "$TEMP_FILE" echo "tomcat ALL=(ALL) NOPASSWD:EMD" | sudo tee -a "$TEMP_FILE" # Check if the configuration is valid sudo visudo -c -f "$TEMP_FILE" # If the configuration is valid, replace the sudoers file if [ $? -eq 0 ]; then sudo cp "$TEMP_FILE" "$SUDOERS_FILE" echo "Sudoers file updated successfully." else echo "Error: Invalid sudoers configuration. Please check and correct the syntax." fi # Clean up temporary file rm "$TEMP_FILE"
-
Save and close this .sh file and then run the following command:
chmod +x {FILE_NAME}
- The above command makes the
.sh
file executable. Once you have done that, run the following command:./{FILE_NAME}
- The newly created
.sh
file should run successfully. - In Matillion ETL, click Admin and then click Restart Server. Once the server restarts, the User property should be selectable on the Python Script component.
Restricting component availability
You can, if required, configure the Python Script component to be unavailable on your Matillion ETL instance. To do this, follow these steps:
- SSH into the instance.
- Open the file emerald.properties as a root user (sudo).
- Locate
ALLOW_PYTHON_COMPONENTS
and set the value tofalse
. - Save and close the file and restart the server.
Note
- By default,
ALLOW_PYTHON_COMPONENTS
is set totrue
. - When setting variables in Emerald.properties, the
MTLN_
prefix is not required.
Variables
When run, the Python Script component creates a set of new variables of the same name, type, and default value as those defined as Matillion ETL job variables and environment variables. Thus, your job and environment variables can be used within the script.
For a full discussion of using variables in Python scripts, read Variables.
Database Access (Jython only)
To access the database defined in the current environment, use the 'cursor' object provided.
cursor = context.cursor()
The cursor object is described in the Python DB-API V2 and implemented via the "zxJDBC" package in Matillion ETL. The connection is made automatically for you using the current environment defined in Matillion ETL, and this connection will be closed automatically after the script terminates.
This feature is provided for convenience, and is not designed for retrieving large amounts of data. After executing a query, you should iterate the cursor to retrieve the results one row at a time, and avoid using fetchall()
which may lead to out-of-memory issues.
If you execute database updates, you should not try to commit or rollback. Transactions are handled for you, either automatically (Auto-commit mode) or manually using the Begin/Commit/Rollback components. This is a change compared to previous versions, so older scripts may still have commit() calls in them—these should be removed.
Grid Variables
Similar to Variables, Grid Variables can also be accessed through the Python Script component. Details on using Grid Variables in this manner can be found in the Grid Variables documentation.
Additional modules for Jython and Python 2
For Python3 additional modules, click here.
Additional Python modules may be installed by running the pip command. Log in to the instance with SSH and run the command as root:
sudo pip install modulename
As well as pip, you may also upload your own modules to the instance. In that case, you must include the location of the modules in the Python search path, and this location must be readable by the 'tomcat' user. The general format is as below:
import sys
sys.path.append('/path/to/directory/with/python/modules/and/packages')
Python directories can also be found with the following python code:
from distutils.sysconfig import get_python_lib;
print(get_python_lib())
For most users, the directories should be:
- Jython:
/usr/share/emerald/WEB-INF/lib/Lib/site-packages
- Python2:
/usr/lib/python2.7/dist-packages
- Python3:
/usr/lib/python3.6/dist-packages
For example:
import sys
sys.path.append('/usr/lib/python2.7/dist-packages')
import requests
Note
Regardless of whether a module is installed with pip or manually, it must not rely on external C modules to run successfully on the embedded Jython interpreter. However, such scripts should work on Python2 and Python3.
Additional modules for Python 3
Additional modules may be installed by using the Package Installer for Python (pip). To do this, log in to the instance via SSH and run the command as root to begin the installation.
We recommend following the Python documentation for this process as the specifics may depend on your chosen Python version.
-
Install pip.
sudo yum install python-pip
-
If you already have pip, ensure it is up to date.
sudo python3.8 -m pip install --upgrade pip
For version 3.6.8 of Python, use
sudo python3 -m pip install --upgrade pip
. -
Install your modules. For example, Boto3:
sudo pip install boto3
Task Cancellation (Jython Only)
Scripts are never forcibly killed. If you want a long-running script to respond to task cancellation, the script must check for cancellation and act accordingly, ensuring any resources are cleaned up. Cancellation can be checked by querying the context: context.isCancelled()
.
Since the cancellation is being handled within the script, the component will still end successfully, since no uncaught exception has been thrown. For cancellation to also mark the script task as a failure, raise an exception. For example:
if context.isCancelled():
raise Exception("Script cancelled during loop")
Task Cancellation (Python2 and Python3 Only)
A Timeout property is made available in the component if set to Python2 or Python3. If a script runs longer than its timeout (in seconds) it is forcibly killed—similar to the Bash component.
Example 1
This example moves all the objects within an S3 bucket into another S3 bucket. You may wish to do this following an S3 Load, to ensure those same files are not loaded again by subsequent runs of this same job. The target bucket could also use Amazon Glacier to reduce the cost of storing the already loaded files.
import boto3
s3_resource = boto3.resource('s3')
new_bucket_name = "targetBucketName"
bucket_to_copy = "sourceBucketName"
s3bucket = s3_resource.Bucket(bucket_to_copy)
for obj in s3bucket.objects.all():
files = obj.key
copy_source = {'Bucket': bucket_to_copy,'Key': files}
s3_resource.meta.client.copy(copy_source, new_bucket_name, files)
print(files)
The Python script imports the "Boto" module and uses it to move the files. In fact, the script copies the objects to the other bucket, and then removes the source object. A similar script could instead rename the objects and leave them within the same bucket. A list of available variables is given on the left of the window, and used in code written on the right. The script can be executed by clicking Run as though the component had been run on the Matillion ETL UI. The output of the code is shown beneath after running.
Example 2
The example script below shows a database query that retrieves a single (aggregate) row of data, and stores the result into a variable for use elsewhere in Matillion ETL.
cursor = context.cursor()
cursor.execute("select count(*) from flights")
result = cursor.fetchone()
print result
context.updateVariable("total_count", str(result[0]))
Snowflake | Delta Lake on Databricks | Amazon Redshift | Google BigQuery | Azure Synapse Analytics |
---|---|---|---|---|
✅ | ✅ | ✅ | ✅ | ✅ |