Managing Python on a Matillion ETL virtual machine (VM)

Overview

Note

This guide was provided by Matillion's solution architect team.

Python is a popular programming language for data engineers. Recognizing this, Matillion ETL offers a Python Script component, which allows Matillion users to integrate their own Python scripts into their Matillion ETL jobs. When using Python scripts in Matillion ETL jobs, an important consideration is whether the script is appropriate to run on a Matillion ETL instance, which is typically sized for an ELT workload. Read Executing Python Scripts Outside of Matillion for more insights into that topic.

When Matillion ETL users contact Matillion Support about their use of Python, the issues raised often relate to the management of Python on their Matillion ETL instances. This guide highlights some best practices for managing your Python environment on a Matillion ETL instance.



Best Practices

  • Do not uninstall Python on a Matillion ETL instance. An alternative is to disable the availability of the Python Script component.
  • Do not change the version of Python installed on a Matillion ETL instance.
  • Maintain an inventory of any manually installed Python libraries, including the version of the Python library.
  • When automating the deployment of new Matillion ETL VMs, account for any Python library dependencies as part of the automated deployment.
  • When using the Python Script component, use the Python3 interpreter.
  • Do not change the version of pip installed on a Matillion ETL VM unless a Python library dependency requires it. If you must update pip, ensure you have a current backup of the Matillion ETL VM.


Python Interpreters and Versions

The Python Script component in Matillion ETL allows a user to run a Python script against different Python interpreters: Jython, Python2, and Python3.

Jython is a Java implementation of Python, based on Python2, for running Python scripts within a Java application. Matillion ETL is written in Java, which is where the support for this interpreter comes from. Historically, the main benefit of the Jython interpreter has been that it allows the Python script to use the connection to the cloud data warehouse (CDW) that the Matillion ETL job creates at runtime. This feature is typically used to assign values to Matillion variables during a job's execution.

Python2 and Python3 run outside of Java and thus do not have this capability; a Python2 or Python3 script must create its own connection to the CDW. Matillion ETL now has a Query Result To Scalar component and a Query Result To Grid component, which provide a simpler way to populate variables within a Matillion ETL job and remove the need for Jython in this particular use case.

Python2 was sunset on 1st January 2020 and is no longer being maintained. As of version 1.61.6, Matillion ETL comes with Python 2.7.5 installed.

For the above reasons, we recommend using Python3 for any new Python Script development. Matillion ETL continues to support the Jython and Python2 interpreters, primarily to support users who have existing Jython or Python2 dependencies in their jobs. However, it is possible for support of the Jython or Python2 interpreters to be removed in a future release of Matillion ETL.

For any users who do have dependencies on Jython or Python2 in their jobs, we recommend transitioning those over to Python3, so as to best future-proof your Matillion ETL jobs. As of version 1.61.6, Matillion ETL comes with Python 3.6.8 installed.
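When auditing jobs for Jython or Python2 dependencies, it can help to confirm which interpreter a given Python Script component actually runs under. The following is a minimal sketch, not part of Matillion ETL itself; the function name describe_interpreter is hypothetical, and the code only uses the standard library so that it should behave the same under Jython, Python2, and Python3.

```python
import platform
import sys


def describe_interpreter():
    """Return a one-line summary of the running interpreter and its version."""
    # platform.python_implementation() reports names such as 'CPython' or 'Jython'.
    implementation = platform.python_implementation()
    # sys.version_info holds (major, minor, micro, ...); keep the first three parts.
    version = "%d.%d.%d" % tuple(sys.version_info[:3])
    return "%s %s" % (implementation, version)


# Written to the component's output so the interpreter shows up in the task log.
print(describe_interpreter())
```

Pasting this into a Python Script component and switching the interpreter setting between runs is one way to record which scripts still depend on Jython or Python2 before transitioning them.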

We do not recommend changing the version of Python that is installed on your Matillion ETL instance. When needed, updates to the version of Python will be bundled along with new versions of Matillion ETL. Updating or changing the version of Python2 or Python3 on a Matillion ETL instance can result in issues or instability with the Matillion ETL application itself.

In addition to the above recommendation, please also note that Python should not be uninstalled from the Matillion ETL instance. Prior to version 1.60, the Matillion ETL application had dependencies on Python, and removing Python from a Matillion ETL instance can result in uninstalling the Matillion ETL application as well. As an alternative to uninstalling Python from a Matillion ETL instance, consider disabling the Python Script component instead.



PIP Versions

Python installations include a set of standard libraries by default. Matillion ETL users frequently want to install additional Python libraries on their Matillion ETL instances. This can be done using pip, Python's package manager. Read Additional Modules to learn how to add Python libraries to a Matillion ETL instance.

As of version 1.61.6, Matillion ETL comes with version 8.1.2 of pip for Python2 and version 9.0.3 of pip for Python3. Typically, the version of pip that ships with Matillion ETL should be sufficient for most use cases and we do not recommend updating the version of pip. Matillion ETL releases are tested based on the version of pip that is included with Matillion ETL. Installing a different version of pip could result in unexpected behavior within Matillion ETL. If you find a need to upgrade the version of pip on your Matillion ETL instance, please ensure you have a current backup of your instance that can be restored if necessary.



Migrating Matillion Instances

There can be many reasons to launch a new Matillion ETL instance and migrate your existing Matillion ETL instance over to it. The most common scenario is updating to a new version of Matillion ETL, where our recommended best practice is to launch a new Matillion instance and then use Matillion's Migrate feature. When migrating, note that Python libraries are included in the list of resources that are not automatically migrated. The steps below outline the best practice for migrating the Python libraries that your Matillion ETL jobs depend on to a new Matillion ETL instance.

  1. Confirm versions of Python and Pip on the CURRENT Matillion ETL instance.
    1. SSH on to the Matillion instance
    2. Run the following commands and note the output:
      1. Current version of Python2 installed: sudo python --version
      2. Current version of Pip installed for Python2: sudo pip --version
      3. Current version of Python3 installed: sudo python3 --version
      4. Current version of Pip installed for Python3: sudo pip3 --version
  2. Confirm the Python Libraries installed and the version of each on the CURRENT Matillion ETL instance.
    1. SSH on to the Matillion instance
    2. Run the following commands and note the output:
      1. List of installed Python2 libraries: sudo pip freeze
      2. List of installed Python3 libraries: sudo pip3 freeze
  3. Confirm versions of Python and Pip on the NEW Matillion ETL instance.
    1. SSH on to the Matillion instance
    2. Run the following commands and compare the output against the CURRENT Matillion ETL instance. The NEW Matillion ETL instance should have the same or a later version of Python2 and Python3 as the CURRENT Matillion ETL instance.
      1. Python2: sudo python --version
      2. Pip for Python2: sudo pip --version
      3. Python3: sudo python3 --version
      4. Pip for Python3: sudo pip3 --version
  4. Confirm the Python Libraries installed and the version of each on the NEW Matillion ETL instance.
    1. SSH on to the Matillion instance
    2. Run the following commands and note the output. Note that by default, a new Matillion instance will have some Python libraries installed.
      1. Python2 Libraries: sudo pip freeze
      2. Python3 Libraries: sudo pip3 freeze
  5. Compare the list of Python Libraries on the CURRENT Matillion instance against the list of Python Libraries on the NEW Matillion instance. Based on the comparison, identify the Python libraries that need to be installed on the new instance.
  6. Install each required Python library on the NEW Matillion instance, making sure to install exactly the same version of the library that is installed on the CURRENT Matillion instance.
    1. Python2 - Install a library of a specific version:
      1. sudo pip install [modulename]==[version number]
      2. Example: sudo pip install boto3==1.14.53
    2. Python3 - Install a library of a specific version:
      1. sudo pip3 install [modulename]==[version number]
      2. Example: sudo pip3 install boto3==1.17.45
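
The comparison in step 5 and the install commands in step 6 can be sketched in Python. This is a minimal illustration, not a Matillion tool: the function names parse_freeze and install_commands are hypothetical, and it assumes the output of pip freeze from each instance has been saved as text with one name==version entry per line. Entries without a pinned version are skipped.

```python
def parse_freeze(text):
    """Map each package name (lower-cased) to its pinned version."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and entries that are not pinned with '=='.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def install_commands(current_text, new_text, pip="pip3"):
    """List pip commands for libraries missing or at another version on NEW."""
    current = parse_freeze(current_text)
    new = parse_freeze(new_text)
    commands = []
    for name in sorted(current):
        # A library qualifies if NEW lacks it or has a different version.
        if new.get(name) != current[name]:
            commands.append(
                "sudo %s install %s==%s" % (pip, name, current[name])
            )
    return commands
```

Pass pip="pip" for the Python2 lists and the default pip="pip3" for the Python3 lists. The result is a candidate list to review rather than something to run blindly, because a new instance ships with some libraries by default and replacing those versions may not be desirable.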


Advanced Topics

The sections below cover advanced topics related to using Python with Matillion ETL. Many of them link to other documentation topics.



Disabling Python on a Matillion ETL instance

It is possible to disable the Python Script feature on a Matillion instance. To do so:

  1. SSH on to the Matillion instance
  2. Make a backup of a configuration file that will be edited:
    sudo cp /usr/share/emerald/WEB-INF/classes/Emerald.properties /usr/share/emerald/WEB-INF/classes/Emerald.properties.backup
  3. Edit the same configuration file:
    1. File: /usr/share/emerald/WEB-INF/classes/Emerald.properties
    2. Add the following to the end of the file: ALLOW_PYTHON_COMPONENTS=false
  4. Restart the Matillion service.
    1. From an SSH session:
      1. sudo service tomcat stop
      2. sudo service tomcat start
    2. From a Matillion UI session:
      1. Click Admin → Restart Server
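
The edit in step 3 can be sketched as a small text transformation. This is an illustrative sketch only: the function name disable_python_components is hypothetical, and it assumes the file uses plain key=value lines. It goes slightly beyond the step above by dropping any earlier ALLOW_PYTHON_COMPONENTS line before appending the new one, so that applying it twice leaves a single entry.

```python
def disable_python_components(properties_text):
    """Return the properties text with ALLOW_PYTHON_COMPONENTS=false at the end."""
    key = "ALLOW_PYTHON_COMPONENTS"
    kept = []
    for line in properties_text.splitlines():
        # Drop any earlier setting of the same key so only one value remains.
        if line.split("=", 1)[0].strip() == key:
            continue
        kept.append(line)
    kept.append(key + "=false")
    return "\n".join(kept) + "\n"
```

To apply it, read /usr/share/emerald/WEB-INF/classes/Emerald.properties, pass the contents through the function, and write the result back with sufficient privileges, after taking the backup from step 2. The restart in step 4 is still required.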


Restricted Users

By default, when a Python Script component executes, it runs as an external process on the Matillion ETL server directly, with the same privileges as the web server. It is possible to restrict the execution of the Python script to a user of lesser privileges. Read How to place restrictions on Bash and Python components.



Virtual Environments

Advanced Python users sometimes ask whether Matillion ETL supports Python Virtual Environments. As of version 1.61.6 of Matillion ETL, Python Virtual Environments are not supported. However, a Matillion ETL Shared Job on the Matillion Exchange can be used to emulate Python Virtual Environments: the Pip3 Install Shared Job serves this purpose.



Executing Python Scripts external to Matillion ETL

As discussed at the beginning of this article, when running Python scripts, you should consider the resources required to run that Python code. If the Python code needs more resources than are available on a Matillion ETL instance, there are several ways to handle this.

Executing Python Scripts Outside of Matillion describes several techniques for handling this scenario.

Another approach to this scenario is described in our blog article about serverless file conversion with Matillion ETL. The pattern described in this blog post leverages another Shared Job on the Matillion Exchange. The Run Serverless Shared Job offers the capability to dynamically create and execute your own Python code via an AWS Lambda function.