Upgrade: Python
The Data Productivity Cloud includes a Python Script component, but also offers several other options for performing tasks that would have required a Python Script in Matillion ETL. These include native components such as Print Variables, which give you a simple way to perform common tasks without a script, and the option to run your Python scripts within your Snowflake environment, benefiting from scalable Snowflake compute resources, additional library support, and database connectivity.
The Python Script component itself works differently in the Data Productivity Cloud because there is no underlying virtual machine. The component requires a Hybrid SaaS deployment, and you need to bear the following in mind:
- Python scripts can't assume there is a filesystem that will persist after the script completes. Unlike Matillion ETL, which runs on a Linux VM with a persistent disk, Data Productivity Cloud agents are made up of multiple containers, so there is no guarantee that the process that writes a file runs on the same container as the later process that consumes it (see the sketch after this list).
- Data Productivity Cloud agents have much lower CPU and memory resources than Matillion ETL VMs and aren't designed for compute-intensive operations, so large scripts may run very slowly or cause the agent to become unstable.
- Only Python 3 is available in the Data Productivity Cloud. If you have a component that uses the Python 2 or Jython interpreters, the migration tool will warn you that Python 3 is the only option, and you may need to update your scripts to be compatible with Python 3.
  - Jython scripts can be automatically converted to Python 3 during migration, but there are some important differences in operation, including use of the cursor object and grid variables, as discussed below.
  - Python 2 scripts can be automatically converted to Python 3 during migration, as discussed below.
- Third-party packages can't be installed with `pip` or `apt`, as the file system on the agent is immutable. There is a mechanism to supply packages via Amazon S3 or Azure Blob, but it has some limitations. Read Loading additional Python libraries for details.
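For example, rather than writing an intermediate file to local disk for a later component to pick up, a script can pass small values through pipeline variables and send larger artifacts straight to cloud storage. The following is a minimal sketch, assuming the `boto3` package is available to your agent; the bucket, key, and variable names are placeholders, and the `context` object is supplied by the Python Script component.

```python
import boto3

# Pass small values to downstream components through a pipeline variable
# rather than a temporary file on local disk. 'row_count' is a placeholder
# user-defined variable name.
context.updateVariable('row_count', '42')

# Write larger intermediate output directly to cloud storage, because the
# container that runs the next component may not share this filesystem.
s3 = boto3.client('s3')
s3.put_object(
    Bucket='my-staging-bucket',    # placeholder bucket name
    Key='staging/output.csv',      # placeholder object key
    Body=b'id,name\n1,alpha\n2,beta\n'
)
```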
Note
The version of Python currently used by the Data Productivity Cloud is 3.10. If your scripts require a later version of Python than this, you should use the Python Pushdown component instead (available in Snowflake projects only).
Upgrade path
Our recommendations for migration of Python Script components are (in order of preference):
- Replace with native components where possible. The Data Productivity Cloud has new components that natively perform some of the common tasks Python scripts were used for in Matillion ETL, such as Print Variables, with more coming soon (for example, Move File, Delete File, and Send Email with Attachment).
- In a Snowflake project, consider refactoring the pipeline to use the Python Pushdown component. Advantages of this component over the Python Script component are:
  - Runs on scalable Snowflake warehouses.
  - Direct access to Snowflake database connections.
  - Designed for heavy data processing, including `pandas`.
  - Full access to Data Productivity Cloud variables.
  - Many packages accessible by default.
  - Can be secured with network access control.
  However, when choosing to use Python Pushdown, consider the following:
  - May require some Python code refactoring.
  - Some initial setup of your Snowflake account will be required.
  - Some initial configuration of network access will be required.
  - This feature is only available in Snowflake projects.
- Use the Python Script component. This option can be low friction, but has some limitations:
  - The script may need to be refactored so it doesn't rely on a filesystem.
  - Available for Hybrid SaaS deployments only.
  - Only Python 3 is supported.
    - Components with Python 2 scripts will migrate successfully, but the scripts may need to be made compatible with Python 3 before they will run. See below, and also read Porting Python 2 Code to Python 3 in the Python documentation for further details.
    - Components with Jython scripts will have to be replaced with the Python Pushdown component (available in Snowflake projects only).
- Use the Bash Pushdown component to run a Python script on your own Linux machine via Bash. Advantages of this approach are:
  - You can set the CPU and memory on the Linux VM as needed.
  - You can install any packages or third-party applications you need on the Linux VM.
  However, when choosing to use Bash Pushdown, consider the following:
  - You need to set up, secure, update, and manage the Linux machine yourself.
  - You need network access from the Data Productivity Cloud agent to the compute source.
Upgrading Python 2 scripts to Python 3
The Data Productivity Cloud supports Python 3 only. If you have a Python 2 script in your Matillion ETL job, you can choose to have it automatically converted to Python 3 during migration.
When converting a Python 2 script to Python 3, the migration tool uses the 2to3 utility that's part of the standard Python library. This tool can handle many of the more common changes between Python 2 and Python 3, but you may need to make additional manual changes to the script to ensure it works as expected. Read the Python documentation for more details.
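As an illustration, the snippet below shows a hypothetical Python 2 loop (in comments) and the equivalent code that 2to3 typically produces; the exact output can vary depending on the script and the Python version.

```python
totals = {'alpha': 3, 'beta': 5}   # sample data for illustration

# Python 2 original:
#   for name, value in totals.iteritems():
#       print "%s: %d" % (name, value)
#
# The same loop after 2to3 conversion, now valid Python 3:
for name, value in totals.items():
    print("%s: %d" % (name, value))
```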
When you import a Python 2 script to the Data Productivity Cloud, the migration report will identify it in the Manual refactor section and alert you that an upgrade is needed.
To perform the script upgrade:
- Click Edit preferences in the Importing files panel.
- Toggle Convert Python 2.x to 3 to On. The default setting for this option is Off, meaning that no scripts will be converted unless you explicitly enable it.
- Click Apply & re-run.
- If you again examine the migration report, it will now show "Your Python script has been auto-converted using the 2to3 tool."
- Click Import to complete the import process, and continue as described in Import to the Data Productivity Cloud.
After conversion, both the original and the converted versions of the scripts are stored in the Data Productivity Cloud branch. In the branch's Files panel, navigate to the `.matillion` → `migration` → `<migration date>` → `<pipeline name>` folder. There you will find two versions of each converted script, labelled `_before` and `_after`. This allows you to compare the two versions and see what changes were made during the conversion. These `_before` and `_after` scripts are not used by the migrated pipeline component, so you can delete them both when you no longer need them for verification purposes.
Warning
Matillion can't guarantee the converted script will work as expected. You should always review the converted script and test it thoroughly.
Using automatic variables in a Python script
The Data Productivity Cloud doesn't support directly accessing automatic variables through the Python Script component.
If you require this functionality, you can use an Update Scalar component to write the values to user-defined variables, which can then be passed to the script.
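For example, you might add an Update Scalar component immediately before the Python Script component to copy an automatic variable's value into a user-defined variable, then read that variable in the script. A minimal sketch, assuming a user-defined variable named `pipeline_name_copy` that the Update Scalar component has populated:

```python
# Read the user-defined variable populated by the Update Scalar component;
# 'pipeline_name_copy' is a placeholder variable name.
pipeline_name = context.getVariable('pipeline_name_copy')
print('Running inside pipeline: {}'.format(pipeline_name))
```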
Upgrading Jython scripts
The Data Productivity Cloud supports Python 3 only. If you have a Jython script in your Matillion ETL job, you can choose to have it automatically converted to Python 3 during migration.
As Jython is essentially Python 2, the migration tool will treat Jython scripts in the same way as Python 2 scripts.
When converting a Jython script to Python 3, the migration tool uses the 2to3 utility that's part of the standard Python library. This tool can handle many of the more common changes between Jython and Python 3, but you may need to make additional manual changes to the script to ensure it works as expected. Read the Python documentation for more details.
When you import a Jython script to the Data Productivity Cloud, the migration report will identify it in the Manual refactor section and alert you that an upgrade is needed.
To perform the script upgrade:
- Click Edit preferences in the Importing files panel.
- Toggle Convert Python 2.x to 3 to On. The default setting for this option is Off, meaning that no scripts will be converted unless you explicitly enable it.
- Click Apply & re-run.
- If you again examine the migration report, it will now show "Your Python script has been auto-converted using the 2to3 tool."
- Click Import to complete the import process, and continue as described in Import to the Data Productivity Cloud.
After conversion, both the original and the converted versions of the scripts are stored in the Data Productivity Cloud branch. In the branch's Files panel, navigate to the `.matillion` → `migration` → `<migration date>` → `<pipeline name>` folder. There you will find two versions of each converted script, labelled `_before` and `_after`. This allows you to compare the two versions and see what changes were made during the conversion. These `_before` and `_after` scripts are not used by the migrated pipeline component, so you can delete them both when you no longer need them for verification purposes.
Jython scripts that use the cursor object
In Jython in Matillion ETL, you could use the `context.cursor()` object to access a database cursor. Python 3 does not support this functionality, but the Data Productivity Cloud uses project variables to store database connection details and replicate the functionality of the cursor object.
As part of this conversion, the migration tool creates several project variables to store database connection details. These will be shown in the migration report with the comment "Variable created to facilitate the use of context.cursor() within Python Script components."
Note
It is important that you do not delete any of these variables from your project, as they are required for the converted script to work.
We recommend that you create the environments that will use this functionality prior to importing your Jython scripts. The project variables are created based on the environment configuration at the time of import. If you create additional environments after the import, the project will not have the necessary variables to support them, and you will need to add them manually.
The project variables created during import are listed below. You would have to create these manually if you did not create an environment prior to import. The variables required depend on the cloud platform and data warehouse you are using.
Cloud platform
- Azure:
  - `mtln_azure_key_vault_uri`: When using an environment with an Azure agent, if this variable has an empty value, the default key vault provided in the `DEFAULT_KEYVAULT` environment variable will be used. Read Configuring a key vault for Azure agent for further details.
- AWS:
  - N/A
Data warehouse
- Snowflake:
  - Connection parameters:
    - `mtln_snowflake_account`
    - `mtln_snowflake_username`
    - `mtln_snowflake_role`
    - `mtln_snowflake_warehouse`
    - `mtln_snowflake_database`
    - `mtln_snowflake_schema`
  - Key pair authentication:
    - `mtln_snowflake_private_key_secret_name`
    - `mtln_snowflake_passphrase_secret_name`
    - `mtln_snowflake_passphrase_secret_key`
  - Password authentication:
    - `mtln_snowflake_password_secret_name`
    - `mtln_snowflake_password_secret_key`
- Redshift:
  - Connection parameters:
    - `mtln_redshift_host`
    - `mtln_redshift_database`
    - `mtln_redshift_port`
    - `mtln_redshift_username`
  - Password authentication:
    - `mtln_redshift_password_secret_name`
    - `mtln_redshift_password_secret_key`
- Databricks:
  - Connection parameters:
    - `mtln_databricks_host`
    - `mtln_databricks_http_path`: The value for this variable must be populated manually. You can find this value by following https://docs.databricks.com/aws/en/integrations/compute-details.
    - `mtln_databricks_catalog`
    - `mtln_databricks_schema`
  - Personal access token authentication:
    - `mtln_databricks_access_token_secret_name`
    - `mtln_databricks_access_token_secret_key`
  - OAuth authentication:
    - `mtln_databricks_client_id`
    - `mtln_databricks_client_secret_secret_name`
    - `mtln_databricks_client_secret_secret_key`
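For illustration, a converted script typically keeps working with the cursor in much the same way as it did in Matillion ETL, with the project variables listed above supplying the connection details behind the scenes. A minimal sketch for a Snowflake environment, assuming the cursor follows the usual `execute`/`fetchone` pattern and using a placeholder table name:

```python
# The converted Python 3 script still obtains its cursor from the context;
# the mtln_snowflake_* project variables above provide the connection details
# that make this possible. 'my_table' is a placeholder table name.
cursor = context.cursor()
cursor.execute('SELECT COUNT(*) FROM my_table')
row_count = cursor.fetchone()[0]
print('my_table contains {} rows'.format(row_count))
```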
Jython scripts that use grid variables
Grid variables are handled differently by Python scripts in the Data Productivity Cloud than by Jython scripts in Matillion ETL. You need to be aware of this and make appropriate changes when converting any Jython scripts to Python as part of a migration to the Data Productivity Cloud.
In a Data Productivity Cloud Python script, modifying list data retrieved from a grid variable can unexpectedly alter the data source for subsequent reads within the same script. This is an expected effect of Python's standard behavior of handling lists as mutable objects passed by reference, meaning changes can affect the original structure. To ensure modifications are isolated, you should create an independent copy of the list (for example, using `copy.deepcopy()`) before altering it in the Python script.
The differences in behavior can be summarized as follows:
- Jython in Matillion ETL: `context.getGridVariable()` returns a new (shallow) copy of the list object each time it is called. Assigning this to a variable creates another reference to that new copy. Therefore, modifying the local variable (`copy_list`) doesn't affect subsequent retrievals of the grid variable.
- Python in the Data Productivity Cloud: `context.getGridVariable()` returns a reference to the same underlying object on subsequent calls within the script. Simple assignment (`copy_list = ...`) creates another reference pointing to that exact same object. Consequently, modifying the list via any reference (such as `copy_list`) changes the single object, and subsequent retrievals read that change.
Therefore, it is crucial to follow standard Python 3 practice in Data Productivity Cloud Python scripts and use `copy.deepcopy()` (or appropriate alternatives such as list comprehensions for nested structures) when you intend to modify retrieved grid variable data without side effects, whereas you may not have needed to do this in your Matillion ETL Jython scripts.
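A minimal sketch of the recommended pattern, using a placeholder grid variable name:

```python
import copy

# In the Data Productivity Cloud, this returns a reference to the same
# underlying object on every call within the script.
grid_data = context.getGridVariable('my_grid')   # 'my_grid' is a placeholder name

# Take an independent copy before modifying, so that later reads of the grid
# variable within this script are not affected by the changes below.
copy_list = copy.deepcopy(grid_data)
copy_list.append(['new', 'row'])

# Reading the grid variable again still returns the original, unmodified data.
unchanged = context.getGridVariable('my_grid')
```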