Migration: Python

The Data Productivity Cloud includes a Python Script component, but also offers several other ways to perform tasks that would have required a Python Script in Matillion ETL. These include native components such as Print Variables, which handle common scripted tasks directly, and the option to run your Python scripts within your Snowflake environment, benefiting from scalable Snowflake compute resources, additional library support, and database connectivity.

The Python Script component itself works differently in the Data Productivity Cloud because there is no underlying virtual machine. The component requires a Hybrid SaaS deployment, and you need to bear the following in mind:

  • Python scripts can't assume there is a filesystem that will persist after the script completes. Unlike Matillion ETL, which runs on a Linux VM with a persistent disk, Data Productivity Cloud agents are made up of multiple containers, and there is no guarantee that the process that writes a file runs on the same container as the later process that consumes it. One pattern is to write intermediate results to cloud object storage instead; a sketch of this approach follows this list.
  • Data Productivity Cloud agents have much lower CPU and memory resources than Matillion ETL VMs and aren't designed for compute-intensive operations, so large scripts may run very slowly or cause the agent to become unstable.
  • Only Python 3 is available in the Data Productivity Cloud. If a component uses the Python 2 or Jython interpreter, the migration tool will warn you that Python 3 is the only option, and you may need to update the script to be compatible with Python 3.
  • Third-party packages can't be installed with pip or apt, because the filesystem on the agent is immutable. There is a mechanism to supply packages via Amazon S3 or Azure Blob, but it has some limitations. Read Loading additional Python libraries for details.
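
For example, a script that currently writes a staging file to local disk can instead stream its output straight to object storage. The following is a minimal sketch, assuming boto3 is available to the script (it may need to be supplied as an additional library, as described above); the bucket and key names are placeholders.

```python
import io

import boto3  # assumed to be supplied as an additional library


def upload_report(csv_text: str) -> None:
    """Send results straight to S3 so nothing relies on the agent's local disk."""
    s3 = boto3.client("s3")
    # Stream the content from memory; placeholder bucket and key names.
    s3.upload_fileobj(
        io.BytesIO(csv_text.encode("utf-8")),
        "my-example-bucket",
        "reports/output.csv",
    )


upload_report("id,amount\n1,9.99\n")
```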

Migration path

Our recommendations for migrating Python Script components, in order of preference, are:

  1. Replace with native components where possible. The Data Productivity Cloud has new components, such as Print Variables, that natively perform some of the common tasks Python scripts were used for in Matillion ETL, with more (for example, Move File, Delete File, and Send Email with Attachment) coming soon.
  2. In a Snowflake environment, consider refactoring the pipeline to use the Python Pushdown component; a brief sketch of this style of script follows this list of recommendations. Advantages of this component over the Python Script component are:

    • Runs on scalable Snowflake warehouses.
    • Direct access to Snowflake database connections.
    • Designed for heavy data processing, including pandas.
    • Full access to Data Productivity Cloud variables.
    • Many packages accessible by default.
    • Can be secured with network access control.

    However, when choosing to use Python Pushdown, consider the following:

    • May require some Python code refactoring.
    • Some initial setup of your Snowflake account will be required.
    • Some initial configuration of network access will be required.
    • This feature is only available on Snowflake.
  3. Use the Python Script component. This option can be low friction, but has some limitations:

    • The script may need to be refactored so it doesn't rely on a filesystem.
    • Available for Hybrid SaaS deployments only.
    • Only Python 3 is supported.
      • Components with Python 2 scripts will migrate successfully, but the scripts may need to be made compatible with Python 3 before they will run. Read Porting Python 2 Code to Python 3 for details; a small example follows this list.
      • Components with Jython scripts will have to be replaced with the Python Pushdown component (available in Snowflake environments only).
  4. Use the Bash Pushdown component to run a Python script on your own Linux machine via Bash. Advantages of this approach are:

    • You can set the CPU and memory on the Linux VM as needed.
    • You can install any packages or third-party applications you need on the Linux VM.

    However, when choosing to use Bash Pushdown, consider the following:

    • You need to set up, secure, update, and manage the Linux machine yourself.
    • You need network access from the Data Productivity Cloud agent to the Linux machine running the script.
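
To illustrate option 2, the sketch below shows the kind of Snowpark-based processing that suits Python Pushdown: the work runs on a Snowflake warehouse rather than on the agent. The table and column names are placeholders, and the exact way the component supplies the Snowpark session is an assumption here, so check the Python Pushdown documentation for the precise interface.

```python
from snowflake.snowpark import Session


def summarise_orders(session: Session) -> None:
    # Pull the table into pandas using the warehouse's compute, not the agent's.
    orders = session.table("ORDERS").to_pandas()

    # Heavier pandas work is practical here because it runs inside Snowflake.
    daily = orders.groupby("ORDER_DATE", as_index=False)["AMOUNT"].sum()

    # Write the aggregate back to a Snowflake table.
    session.write_pandas(daily, "DAILY_ORDER_TOTALS", auto_create_table=True)
```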
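For option 3, the most common Python 2 constructs that stop a script from running under Python 3 are print statements and integer division. A minimal illustration:

```python
# Python 2:  print "rows loaded:", count
# Python 3:  print() is a function.
count = 10
print("rows loaded:", count)

# Python 2's / truncated integer division; Python 3 returns a float,
# so use // where a whole-number result is intended.
batches = count // 3
print("batches:", batches)
```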

Automatic variables

The Data Productivity Cloud doesn't support directly accessing automatic variables through the Python Script component.

If you require this functionality, you can use an Update Scalar component to write the values to user-defined variables, which can then be passed to the script.
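
As an illustration, once Update Scalar has copied the automatic value into a user-defined variable, the script only needs to read that variable. The sketch below assumes the variable is exposed to the script by name (run_timestamp is a placeholder); check the Python Script component documentation for how variables are passed to scripts.

```python
# Assumed: the user-defined variable populated by Update Scalar is available
# to the script by name. The fallback keeps the snippet runnable on its own.
run_timestamp = globals().get("run_timestamp", "1970-01-01 00:00:00")
print("Pipeline started at:", run_timestamp)
```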