Migration: Python

The Data Productivity Cloud includes a Python Script component, but also offers several other options for performing tasks that would have required a Python script in Matillion ETL. These include native components such as Print Variables, which handle common scripting tasks directly, and the option to run your Python scripts within your Snowflake environment, benefiting from scalable Snowflake compute resources, additional library support, and database connectivity.

The Python Script component itself works differently in the Data Productivity Cloud because there is no underlying virtual machine. The component requires a Hybrid SaaS deployment, and you need to bear the following in mind:

  • Python scripts can't assume there is a filesystem that will persist after the script completes. Unlike Matillion ETL, which runs on a Linux VM with a persistent disk, Data Productivity Cloud agents are made up of multiple containers. There is no guarantee that the process that writes the file runs on the same container as the later process that consumes it.
  • Data Productivity Cloud agents have much lower CPU and memory resources than Matillion ETL VMs and aren't designed for compute-intensive operations, so large scripts may run very slowly or cause the agent to become unstable.
  • Only Python 3 is available in the Data Productivity Cloud. If you have a component that uses the Python 2 or Jython interpreter, the migration tool will warn you that Python 3 is the only option. You may need to update your Python 2 or Jython scripts to be compatible with Python 3. A Jython script can be refactored into a Python script that will run in the Data Productivity Cloud, but there are some important differences in operation, discussed below.
  • Third-party packages can't be installed with pip or apt, as the file system on the agent is immutable. There is a mechanism to supply packages via Amazon S3 or Azure Blob, but it has some limitations. Read Loading additional Python libraries for details.

Note

The version of Python currently used by the Data Productivity Cloud is 3.10.


Migration path

Our recommendations for migration of Python Script components are (in order of preference):

  1. Replace with native components where possible. The Data Productivity Cloud has new components, such as Print Variables, that natively perform some of the common tasks Python scripts were used for in Matillion ETL, with more coming soon (for example, Move File, Delete File, and Send Email with Attachment).
  2. In a Snowflake environment, consider refactoring the pipeline to use the Python Pushdown component. Advantages of this component over the Python Script component are:

    • Runs on scalable Snowflake warehouses.
    • Direct access to Snowflake database connections.
    • Designed for heavy data processing, including pandas.
    • Full access to Data Productivity Cloud variables.
    • Many packages accessible by default.
    • Can be secured with network access control.

    However, when choosing to use Python Pushdown, consider the following:

    • May require some Python code refactoring.
    • Some initial setup of your Snowflake account will be required.
    • Some initial configuration of network access will be required.
    • This feature is only available on Snowflake.
  3. Use the Python Script component. This option can be low friction, but has some limitations:

    • The script may need to be refactored so it doesn't rely on a filesystem.
    • Available for Hybrid SaaS deployments only.
    • Only Python 3 is supported.
      • Components with Python 2 scripts will migrate successfully, but the scripts may need to be made compatible with Python 3 before they will run (see the short example after this list). Read Porting Python 2 Code to Python 3 for details.
      • Components with Jython scripts will have to be replaced with the Python Pushdown component (available in Snowflake environments only).
  4. Use the Bash Pushdown component to run a Python script on your own Linux machine via Bash. Advantages of this approach are:

    • You can set the CPU and memory on the Linux VM as needed.
    • You can install any packages or third-party applications you need on the Linux VM.

    However, when choosing to use Bash Pushdown, consider the following:

    • You need to set up, secure, update, and manage the Linux machine yourself.
    • You need network access from the Data Productivity Cloud agent to the compute source.
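
As an illustration of the kind of Python 2 to Python 3 change mentioned in option 3 above, the print statement and integer division are two of the most common differences. A minimal sketch, with count and total as placeholder values:

# Python 2 (Matillion ETL)
print "row count is", count
average = total / count    # integer division when both operands are ints

# Python 3 (Data Productivity Cloud)
print("row count is", count)
average = total // count   # use // to keep the Python 2 integer-division behavior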

Using automatic variables in a Python script

The Data Productivity Cloud doesn't support directly accessing automatic variables through the Python Script component.

If you require this functionality, you can use an Update Scalar component to write the values to user-defined variables, which can then be passed to the script.
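
For example, if an Update Scalar component copies an automatic value into a user-defined variable (hypothetically named pipeline_run_id here), the Python Script component can read that variable directly by name, in the same way the project variables are used in the cursor setup examples later in this section:

# pipeline_run_id is a hypothetical user-defined variable populated upstream
# by an Update Scalar component; it is available to the script as a Python name
print('Current run:', pipeline_run_id)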


Migrating Jython scripts that use the cursor object

Python in the Data Productivity Cloud does not provide a method to obtain a cursor object, as Jython in Matillion ETL does. If you have a Jython script that accesses the target cloud data warehouse via a cursor, you have a number of options for refactoring the job to run as a Data Productivity Cloud pipeline.

Recommended best practice is to refactor the pipelines to use Query Result to Grid, Query Result to Scalar, or SQL Script components in the Data Productivity Cloud instead of a Python script. However, this may require considerable refactoring work during the migration, and you may prefer to use one of the following instead:

  • In the Python Pushdown component (available in Snowflake environments only), you can use the session.sql function to replicate the context.cursor() function from Matillion ETL. This method is described in detail below.
  • In the Python Script component (available in Hybrid SaaS deployments only), you can add a block of code to the start of the script to replicate the context.cursor() function from Matillion ETL. This method is described in detail below.

Python Pushdown method

Note

The Python Pushdown component is available on Snowflake only.

In Jython scripts in Matillion ETL, you could use the context object to access a database cursor. For example:

cursor = context.cursor()
cursor.execute('select count(*) from mytable')
rowcount = cursor.fetchone()[0]

This allows the Jython script to query the database directly and then use the query results within the same script. This functionality is not available in the Python 3 environment used by the Data Productivity Cloud, but it can be replicated using the session.sql function and the .collect() method, as follows:

rows = session.sql('select count(*) from mytable').collect()
rowcount = rows[0][0]

The session.sql function runs the query, and the .collect() method returns the result as a list of rows, assigned here to the local rows variable; the scalar count is then read from the first column of the first row. Other types of query, such as create, can also be used within the session.sql function.
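
The same pattern extends to queries that return multiple rows and to DDL statements. As an illustrative sketch (mytable and mytable_copy are placeholder names):

# .collect() returns a list of rows; individual values are accessed by position
rows = session.sql('select id, name from mytable').collect()
for row in rows:
    print(row[0], row[1])

# DDL also runs through session.sql; .collect() triggers execution
session.sql('create table if not exists mytable_copy like mytable').collect()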

Python Script method

Note

  • This method requires the Python Script component, so it will only work in a Hybrid SaaS deployment; the Python Script component is not available in Full SaaS deployments.
  • The method uses the agent's cloud credentials to access either AWS Secrets Manager or Azure Key Vault.
  • We recommend creating the necessary variables for this method as project variables.

The exact method depends on your cloud platform and cloud data warehouse, as described in the following sections.

Snowflake on AWS, using a private key

Create and define the following project variables:

  • snowflake_account: The Snowflake account. Read Specify data warehouse credentials for details.
  • snowflake_username: The Snowflake username. Read Specify data warehouse credentials for details.
  • snowflake_role: The Snowflake role. Read Select data warehouse defaults for details.
  • snowflake_warehouse: The Snowflake warehouse. Read Specify data warehouse credentials for details.
  • snowflake_database: The Snowflake database. Read Specify data warehouse credentials for details.
  • snowflake_schema: The Snowflake schema. Read Specify data warehouse credentials for details.
  • snowflake_private_key_secret_name: The name of the secret in AWS Secrets Manager containing the Snowflake private key.

Add the following code to the top of your Python script:

################### cursor setup ###################
import boto3
import json
import re
import snowflake.connector

def extract_key_content(key_string):
    pattern = r'-----BEGIN PRIVATE KEY-----\s*(.*?)\s*-----END PRIVATE KEY-----'
    match = re.search(pattern, key_string, re.DOTALL)
    if match:
        return match.group(1).replace('\n', '').replace(' ', '')
    return None

def dpc_cursor():
    client = boto3.client('secretsmanager')

    secret_value_response = client.get_secret_value(SecretId=snowflake_private_key_secret_name)
    snowflake_private_key = extract_key_content(secret_value_response['SecretString'])

    connection = snowflake.connector.connect(
        account=snowflake_account,
        user=snowflake_username,
        private_key=snowflake_private_key,
        role=snowflake_role,
        warehouse=snowflake_warehouse,
        database=snowflake_database,
        schema=snowflake_schema
    )

    return connection.cursor()

context.cursor = dpc_cursor
################### end of cursor setup ###################
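
With this setup in place, the rest of the script can continue to use the Matillion ETL cursor pattern unchanged. A minimal sketch (mytable is a placeholder table name); the same usage applies to the Azure, Databricks, and Redshift variants below:

cursor = context.cursor()
cursor.execute('select count(*) from mytable')
rowcount = cursor.fetchone()[0]
print(rowcount)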

Snowflake on Azure, using a private key

When the agent is installed on Azure, two additional Python packages need to be added to the agent. Add the following custom Python packages using the instructions in Installing Python libraries.

  • azure-identity
  • azure-keyvault-secrets

Create and define the following project variables:

  • snowflake_account: The Snowflake account. Read Specify data warehouse credentials for details.
  • snowflake_username: The Snowflake username. Read Specify data warehouse credentials for details.
  • snowflake_role: The Snowflake role. Read Select data warehouse defaults for details.
  • snowflake_warehouse: The Snowflake warehouse. Read Specify data warehouse credentials for details.
  • snowflake_database: The Snowflake database. Read Specify data warehouse credentials for details.
  • snowflake_schema: The Snowflake schema. Read Specify data warehouse credentials for details.
  • azure_key_vault_url: The URL of an Azure Key Vault that contains the secret for the private key.
  • snowflake_private_key_secret_name: The name of the secret in Azure Key Vault containing the Snowflake private key.

Add the following code to the top of your Python script:

################### cursor setup ###################
from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient
import os
import re
import snowflake.connector

def extract_key_content(key_string):
    pattern = r'-----BEGIN PRIVATE KEY-----\s*(.*?)\s*-----END PRIVATE KEY-----'
    match = re.search(pattern, key_string, re.DOTALL)
    if match:
        return match.group(1).replace('\n', '').replace(' ', '')
    return None

def dpc_cursor():
    client_id = os.environ["AZURE_CLIENT_ID"]
    credential = ManagedIdentityCredential(client_id=client_id)
    key_vault_client = SecretClient(
        vault_url=azure_key_vault_url,
        credential=credential
    )
    key_vault_secret = key_vault_client.get_secret(snowflake_private_key_secret_name)
    snowflake_private_key = extract_key_content(key_vault_secret.value)

    connection = snowflake.connector.connect(
        account=snowflake_account,
        user=snowflake_username,
        private_key=snowflake_private_key,
        role=snowflake_role,
        warehouse=snowflake_warehouse,
        database=snowflake_database,
        schema=snowflake_schema
    )

    return connection.cursor()

context.cursor = dpc_cursor
################### end of cursor setup ###################

Databricks on AWS, using a personal access token

Create and define the following project variables:

  • databricks_server_hostname: The server hostname value for your cluster or SQL warehouse. For more details, read the Databricks documentation.
  • databricks_http_path: The HTTP path value for your cluster or SQL warehouse. For more details, read the Databricks documentation.
  • databricks_pat_secret_name: The name of the secret in AWS Secrets Manager containing the Databricks personal access token.
  • databricks_pat_secret_key: The key within the secret identified by databricks_pat_secret_name that has the personal access token as its value.

Add the following code to the top of your Python script:

################### cursor setup ###################
import boto3
from databricks import sql as dbks_sql
import json

def dpc_cursor():
    client = boto3.client('secretsmanager')

    secret_value_response = client.get_secret_value(SecretId=databricks_pat_secret_name)
    secret_dictionary = json.loads(secret_value_response['SecretString'])
    access_token = secret_dictionary[databricks_pat_secret_key]

    connection = dbks_sql.connect(
        server_hostname=databricks_server_hostname,
        http_path=databricks_http_path,
        access_token=access_token,
    )

    return connection.cursor()

context.cursor = dpc_cursor
################### end of cursor setup ###################

Databricks on Azure, using a personal access token

When the agent is installed on Azure, two additional Python packages need to be added to the agent. Add the following custom Python packages using the instructions in Installing Python libraries.

  • azure-identity
  • azure-keyvault-secrets

Create and define the following project variables:

  • databricks_server_hostname: The server hostname value for your cluster or SQL warehouse. For more details, read the Databricks documentation.
  • databricks_http_path: The HTTP path value for your cluster or SQL warehouse. For more details, read the Databricks documentation.
  • azure_key_vault_url: The URL of an Azure Key Vault that contains the Databricks personal access token.
  • databricks_pat_secret_name: The name of the secret in the Azure Key Vault that contains the personal access token.

Add the following code to the top of your Python script:

################### cursor setup ###################
from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient
from databricks import sql as dbks_sql
import os

def dpc_cursor():
    client_id = os.environ["AZURE_CLIENT_ID"]
    credential = ManagedIdentityCredential(client_id=client_id)

    key_vault_client = SecretClient(
        vault_url=azure_key_vault_url,
        credential=credential
    )
    key_vault_secret = key_vault_client.get_secret(databricks_pat_secret_name)
    access_token = key_vault_secret.value

    connection = dbks_sql.connect(
        server_hostname=databricks_server_hostname,
        http_path=databricks_http_path,
        access_token=access_token,
    )

    return connection.cursor()

context.cursor = dpc_cursor
################### end of cursor setup ###################

Redshift on AWS, using username and password

Create and define the following project variables:

  • redshift_endpoint: The Redshift endpoint. Read Specify data warehouse credentials for details.
  • redshift_database: The Redshift database. Read Specify data warehouse credentials for details.
  • redshift_port: The Redshift port. Read Specify data warehouse credentials for details.
  • redshift_username: The Redshift username. Read Specify data warehouse credentials for details.
  • redshift_password_secret_name: The name of the secret in AWS Secrets Manager containing the Redshift password.
  • redshift_password_secret_key: The key within the secret identified by redshift_password_secret_name that has the Redshift password as its value.

Add the following code to the top of your Python script:

################### cursor setup ###################
import boto3
import json
import redshift_connector

def dpc_cursor():
    client = boto3.client('secretsmanager')

    secret_value_response = client.get_secret_value(SecretId=redshift_password_secret_name)
    secret_dictionary = json.loads(secret_value_response['SecretString'])
    redshift_password = secret_dictionary[redshift_password_secret_key]

    connection = redshift_connector.connect(
        host=redshift_endpoint,
        database=redshift_database,
        port=redshift_port,
        user=redshift_username,
        password=redshift_password
    )

    return connection.cursor()

context.cursor = dpc_cursor
################### end of cursor setup ###################

Migrating Jython scripts that use grid variables

Python scripts in the Data Productivity Cloud handle grid variables differently from Jython scripts in Matillion ETL. You need to be aware of this and make appropriate changes when converting Jython scripts to Python as part of a migration to the Data Productivity Cloud.

In a Data Productivity Cloud Python script, modifying list data retrieved from a grid variable can unexpectedly alter the data source for subsequent reads within the same script. This is an expected effect of Python's standard behavior of handling lists as mutable objects passed by reference, meaning changes can affect the original structure. To ensure modifications are isolated, you should create an independent copy of the list (for example, using copy.deepcopy()) before altering it in the Python script.

The differences in behavior can be summarized as follows:

  • Jython in Matillion ETL: context.getGridVariable() returns a new (shallow) copy of the list object each time it is called. Assigning this to a variable creates another reference to that new copy. Therefore, modifying the local variable (copy_list) doesn't affect subsequent retrievals of the grid variable.
  • Python in the Data Productivity Cloud: context.getGridVariable() returns a reference to the same underlying object representation upon subsequent calls within the script. Simple assignment (copy_list = ...) creates another reference pointing to that exact same object. Consequently, modifying the list via any reference (such as copy_list) changes the single object, and subsequent retrievals read that change.

Therefore, it is crucial to follow Python 3 standard practice in Data Productivity Cloud Python scripts and use copy.deepcopy() (or appropriate alternatives like list comprehensions for nested structures) when you intend to modify retrieved grid variable data without side effects, whereas you may not have done this in your Matillion ETL Jython scripts.
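
A minimal sketch of this practice (my_grid is a hypothetical grid variable name):

import copy

# In the Data Productivity Cloud this returns a reference to the grid variable's underlying list
grid_rows = context.getGridVariable('my_grid')

# Work on an independent copy so later reads of the grid variable are unaffected
local_rows = copy.deepcopy(grid_rows)
local_rows.append(['extra', 'row'])

# Reading the grid variable again still returns the original, unmodified data
unchanged_rows = context.getGridVariable('my_grid')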