Upgrade: Python
The Data Productivity Cloud includes a Python Script component, but also offers several other options for performing tasks that would have required a Python Script in Matillion ETL. These include native components such as Print Variables, which handle common scripting tasks directly. You can also run your Python scripts within your Snowflake environment using the Python Pushdown component, benefiting from scalable Snowflake compute resources, additional library support, and database connectivity.
The Python Script component itself works differently in the Data Productivity Cloud because there is no underlying virtual machine. The component requires a Hybrid SaaS deployment, and you need to bear the following in mind:
- Python scripts can't assume there is a filesystem that will persist after the script completes. Unlike Matillion ETL, which runs on a Linux VM with a persistent disk, Data Productivity Cloud agents are made up of multiple containers, and there is no guarantee that the process that writes a file runs on the same container as the later process that consumes it. Stage intermediate data in cloud storage rather than on the local disk (see the sketch after this list).
- Data Productivity Cloud agents have much lower CPU and memory resources than Matillion ETL VMs and aren't designed for compute-intensive operations, so large scripts may run very slowly or cause the agent to become unstable.
- Only Python 3 is available in the Data Productivity Cloud. If you have a component that uses the Python 2 or Jython interpreters, the migration tool will warn you that Python 3 is the only option, and you may need to update your scripts to be compatible with Python 3.
- Jython scripts can be automatically converted to Python 3 during migration, but there are some important differences in operation, including use of the cursor object and grid variables, as discussed below.
- Python 2 scripts can be automatically converted to Python 3 during migration, as discussed below.
- Third-party packages can't be installed with pip or apt, as the file system on the agent is immutable. There is a mechanism to supply packages via Amazon S3 or Azure Blob, but it has some limitations. Read Loading additional Python libraries for details.
- The version of Python currently used by the Data Productivity Cloud is 3.10. If you are migrating a version of Python later than this, you should use the Python Pushdown component instead (available in Snowflake projects only).
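As an illustration of working without a persistent local filesystem, here is a minimal sketch that stages intermediate data in object storage rather than on disk. It assumes the boto3 package has been supplied to the agent as an additional library, and the bucket name is hypothetical:

```python
import io

import boto3  # assumed to be supplied via the additional-libraries mechanism

# Build the intermediate data in memory instead of writing it to the local disk,
# which may not be visible to a later task running in a different container.
buffer = io.BytesIO(b"id,value\n1,42\n")

# Upload to a hypothetical bucket; any later task can read the object back from S3.
s3 = boto3.client("s3")
s3.upload_fileobj(buffer, "my-staging-bucket", "staging/intermediate.csv")
```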
Upgrade path
Our recommendations for migration of Python Script components are (in order of preference):
- Replace with native components where possible. The Data Productivity Cloud has new components that will natively perform some of the common tasks Python scripts were used for in Matillion ETL, such as Print Variables, with more coming soon (Move File, Delete File, Send Email with Attachment, for example).
- In a Snowflake project, consider refactoring the pipeline to use the Python Pushdown component. Advantages of this component over the Python Script component are:
  - Runs on scalable Snowflake warehouses.
  - Direct access to Snowflake database connections.
  - Designed for heavy data processing, including pandas.
  - Full access to Data Productivity Cloud variables.
  - Many packages accessible by default.
  - Can be secured with network access control.
  However, when choosing to use Python Pushdown, consider the following:
  - May require some Python code refactoring.
  - Some initial setup of your Snowflake account will be required.
  - Some initial configuration of network access will be required.
  - This feature is only available in Snowflake projects.
  Read Upgrading to Python Pushdown, below, for details on how to convert Python Script components to Python Pushdown during migration.
- Use the Python Script component. This option can be low friction, but has some limitations:
  - The script may need to be refactored so it doesn't rely on a filesystem.
  - Available for Hybrid SaaS deployments only.
  - Only Python 3 is supported. See below for Python 2 and Jython, and also read Porting Python 2 Code to Python 3 in the Python documentation for further details.
- Use the Bash Pushdown component to run a Python script on your own Linux machine via Bash. Advantages of this approach are:
  - You can set the CPU and memory on the Linux VM as needed.
  - You can install any packages or third-party applications you need on the Linux VM.
  However, when choosing to use Bash Pushdown, consider the following:
  - You need to set up, secure, update, and manage the Linux machine yourself.
  - You need network access from the Data Productivity Cloud agent to the compute source.
Upgrading Python 2 scripts to Python 3
The Data Productivity Cloud supports Python 3 only. If you have a Python 2 script in your Matillion ETL job, you can choose to have it automatically converted to Python 3 during migration.
When converting a Python 2 script to Python 3, the migration tool uses the 2to3 utility that's part of the standard Python library. This tool can handle many of the more common changes between Python 2 and Python 3, but you may need to make additional manual changes to the script to ensure it works as expected. Read the Python documentation for more details.
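As an illustration, here is a hypothetical fragment showing the kind of changes 2to3 makes, together with one change (integer division) that the tool does not make for you:

```python
# Hypothetical Python 2 original:
#     print "Processing %d rows" % row_count
#     if settings.has_key('target'):
#         ratio = matched / total
#     for key, value in settings.iteritems():
#         print key, value

# Python 3 equivalent after running 2to3, plus a manual fix for the division:
row_count, matched, total = 10, 3, 4
settings = {'target': 'stage_table'}

print("Processing %d rows" % row_count)   # print is a function in Python 3
if 'target' in settings:                  # dict.has_key() no longer exists
    ratio = matched // total              # 2to3 leaves "/" alone; use "//" if integer division was intended
for key, value in settings.items():       # dict.iteritems() no longer exists
    print(key, value)
```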
When you import a Python 2 script to the Data Productivity Cloud, the migration report will identify it in the Manual refactor section and alert you that an upgrade is needed.
To perform the script upgrade:
- Click Edit preferences in the Importing files panel.
- Toggle Convert Python 2.x to 3 to On. The default setting for this option is Off, meaning that no scripts will be converted unless you explicitly enable it.
- Click Apply & re-run.
- If you examine the migration report again, it will now show "Your Python script has been auto-converted using the 2to3 tool."
- Click Import to complete the import process, and continue as described in Import to the Data Productivity Cloud.
After conversion, both the original and the converted versions of the scripts are stored in the Data Productivity Cloud branch. In the branch's Files panel, navigate to the .matillion → migration → <migration date> → <pipeline name> folder. There you will find two versions of each converted script, labelled _before and _after. This allows you to compare the two versions and see what changes were made during the conversion. These _before and _after scripts are not used by the migrated pipeline component, so you can delete them both when you no longer need them for verification purposes.
Warning
Matillion can't guarantee the converted script will work as expected. You should always review the converted script and test it thoroughly.
Using automatic variables in a Python script
The Data Productivity Cloud doesn't support directly accessing automatic variables through the Python Script component.
If you require this functionality, you can use an Update Scalar component to write the values to user-defined variables, which can then be passed to the script.
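For example, a minimal sketch of this pattern, assuming an upstream Update Scalar component has copied an automatic variable into a hypothetical user-defined variable named run_history_id, and that the script can reference pipeline variables by name and write them back with context.updateVariable(), as in Matillion ETL:

```python
# run_history_id is a hypothetical user-defined variable populated by an
# upstream Update Scalar component from an automatic variable.
print("Run history ID passed in from Update Scalar: %s" % run_history_id)

# Optionally derive a value and write it back to another user-defined variable.
context.updateVariable('run_label', 'run_%s' % run_history_id)
```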
Upgrading Jython scripts
The Data Productivity Cloud supports Python 3 only. If you have a Jython script in your Matillion ETL job, you can choose to have it automatically converted to Python 3 during migration.
As Jython is essentially Python 2, the migration tool will treat Jython scripts in the same way as Python 2 scripts.
When converting a Jython script to Python 3, the migration tool uses the 2to3 utility that's part of the standard Python library. This tool can handle many of the more common changes between Jython and Python 3, but you may need to make additional manual changes to the script to ensure it works as expected. Read the Python documentation for more details.
When you import a Jython script to the Data Productivity Cloud, the migration report will identify it in the Manual refactor section and alert you that an upgrade is needed.
To perform the script upgrade:
- Click Edit preferences in the Importing files panel.
- Toggle Convert Python 2.x to 3 to On. The default setting for this option is Off, meaning that no scripts will be converted unless you explicitly enable it.
- Click Apply & re-run.
- If you examine the migration report again, it will now show "Your Python script has been auto-converted using the 2to3 tool."
- Click Import to complete the import process, and continue as described in Import to the Data Productivity Cloud.
After conversion, both the original and the converted versions of the scripts are stored in the Data Productivity Cloud branch. In the branch's Files panel, navigate to the .matillion → migration → <migration date> → <pipeline name> folder. There you will find two versions of each converted script, labelled _before and _after. This allows you to compare the two versions and see what changes were made during the conversion. These _before and _after scripts are not used by the migrated pipeline component, so you can delete them both when you no longer need them for verification purposes.
Jython scripts that use the cursor object
In Jython in Matillion ETL, you could use the context.cursor() object to access a database cursor. This object isn't directly available to Python 3 scripts in the Data Productivity Cloud; instead, the Data Productivity Cloud uses project variables to store database connection details and replicate the functionality of the cursor object.
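For reference, a typical Matillion ETL Jython fragment of this kind looks something like the following (the table name is hypothetical):

```python
# Matillion ETL (Jython): context.cursor() returns a cursor bound to the
# environment's database connection.
cursor = context.cursor()
cursor.execute('SELECT COUNT(*) FROM staging_orders')
row_count = cursor.fetchone()[0]
print(row_count)
```

The project variables described below supply the connection details the converted script needs to reproduce this behavior.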
As part of this conversion, the migration tool creates several project variables to store database connection details. These will be shown in the migration report with the comment "Variable created to facilitate the use of context.cursor() within Python Script components."
Note
It is important that you do not delete any of these variables from your project, as they are required for the converted script to work.
We recommend that you create the environments that will use this functionality prior to importing your Jython scripts. The project variables are created based on the environment configuration at the time of import. If you create additional environments after the import, the project will not have the necessary variables to support them, and you will need to add them manually.
The project variables created during import are listed below. You would have to create these manually if you did not create an environment prior to import. The variables required depend on the cloud platform and data warehouse you are using.
Cloud platform
- Azure
  - mtln_azure_key_vault_uri: When using an environment with an Azure agent, if this variable has an empty value, the default key vault provided in the DEFAULT_KEYVAULT environment variable will be used. Read Configuring a key vault for Azure agent for further details.
- AWS
  - N/A
Data warehouse
- Snowflake
  - Connection parameters:
    - mtln_snowflake_account
    - mtln_snowflake_username
    - mtln_snowflake_role
    - mtln_snowflake_warehouse
    - mtln_snowflake_database
    - mtln_snowflake_schema
  - Key pair authentication:
    - mtln_snowflake_private_key_secret_name
    - mtln_snowflake_passphrase_secret_name
    - mtln_snowflake_passphrase_secret_key
  - Password authentication:
    - mtln_snowflake_password_secret_name
    - mtln_snowflake_password_secret_key
- Redshift
  - Connection parameters:
    - mtln_redshift_host
    - mtln_redshift_database
    - mtln_redshift_port
    - mtln_redshift_username
  - Password authentication:
    - mtln_redshift_password_secret_name
    - mtln_redshift_password_secret_key
- Databricks
  - Connection parameters:
    - mtln_databricks_host
    - mtln_databricks_http_path: The value for this variable must be populated manually. You can find this value by following https://docs.databricks.com/aws/en/integrations/compute-details.
    - mtln_databricks_catalog
    - mtln_databricks_schema
  - Personal access token authentication:
    - mtln_databricks_access_token_secret_name
    - mtln_databricks_access_token_secret_key
  - OAuth authentication:
    - mtln_databricks_client_id
    - mtln_databricks_client_secret_secret_name
    - mtln_databricks_client_secret_secret_key
Jython scripts that use grid variables
Grid variables are handled differently in Data Productivity Cloud Python scripts than in Matillion ETL Jython scripts. You need to be aware of this and make appropriate changes when converting any Jython scripts to Python as part of a migration to the Data Productivity Cloud.
In a Data Productivity Cloud Python script, modifying list data retrieved from a grid variable can unexpectedly alter the data source for subsequent reads within the same script. This is an expected effect of Python's standard behavior of handling lists as mutable objects passed by reference, meaning changes can affect the original structure. To ensure modifications are isolated, you should create an independent copy of the list (for example, using copy.deepcopy()) before altering it in the Python script.
The differences in behavior can be summarized as follows:
- Jython in Matillion ETL: context.getGridVariable() returns a new (shallow) copy of the list object each time it is called. Assigning this to a variable creates another reference to that new copy. Therefore, modifying the local variable (copy_list) doesn't affect subsequent retrievals of the grid variable.
- Python in the Data Productivity Cloud: context.getGridVariable() returns a reference to the same underlying object representation upon subsequent calls within the script. Simple assignment (copy_list = ...) creates another reference pointing to that exact same object. Consequently, modifying the list via any reference (such as copy_list) changes the single object, and subsequent retrievals read that change.
Therefore, it is crucial to follow Python 3 standard practice in Data Productivity Cloud Python scripts and use copy.deepcopy() (or appropriate alternatives like list comprehensions for nested structures) when you intend to modify retrieved grid variable data without side effects, whereas you may not have done this in your Matillion ETL Jython scripts.
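The following sketch shows the safe pattern in a Data Productivity Cloud Python script (the grid variable name is hypothetical):

```python
import copy

# Retrieve the grid variable and take an independent copy before modifying it,
# so that later reads of the grid variable are unaffected.
grid_rows = context.getGridVariable('load_targets')
working_rows = copy.deepcopy(grid_rows)

working_rows.append(['extra_table', 'extra_schema'])  # only the copy changes

# Reading the grid variable again still returns the original, unmodified rows.
original_rows = context.getGridVariable('load_targets')
print(len(original_rows), len(working_rows))
```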
Upgrading to Python Pushdown
Python Pushdown is an orchestration component that lets you execute a Python script using the Snowpark service in your Snowflake account.
If you have a job that contains a Python Script component, follow these steps in the Data Productivity Cloud to configure the imported pipeline to use Python Pushdown:
- Click Edit preferences in the Importing files panel.
- Toggle Convert to Python Pushdown to On. The default setting for this option is Off, meaning that no Python Script components will be converted unless you explicitly enable it.
- This will present you with a set of properties needed to configure the connection to the Snowpark service. Complete these as follows:
  - Warehouse: The Snowflake warehouse to use for executing the Python script. The special value [Environment Default] uses the warehouse defined in the Data Productivity Cloud environment. Read Overview of Warehouses to learn more.
  - Python Version: Select the Python version you want to use for your script. The default setting is currently 3.10. Available Python versions may change as Snowflake Snowpark adds or removes support for specific versions.
  - Script Timeout: The number of seconds to wait for script termination. After the set number of seconds has elapsed, the script is forcibly terminated. The default is 360 seconds (6 minutes). The script timeout can't exceed the limit defined in the Snowflake internal query timeouts. For more information, read Query Timeouts in Snowflake.
  The selected values will be applied to all Python Pushdown components in the pipelines being imported. Other Python Pushdown properties will have to be configured on the individual components. See the Python Pushdown documentation for a description of all properties.
- If you are using a Python 2 or Jython script, it must be converted to Python 3 to run in Snowpark. To enable automatic conversion during migration, toggle Convert Python 2.x to 3 to On.
- Click Apply & re-run.
- If you examine the migration report, it will now show "Python Script component has been converted to a Python Pushdown component."
- Click Import to complete the import process, and continue as described in Import to the Data Productivity Cloud.