Project quickstart guide

The Data Productivity Cloud uses data pipelines to extract your data from diverse data sources and load that data into your chosen cloud data platform (warehouse). Therefore, before you run your data pipelines, you need a connection to a suitable cloud data platform account. These connections are housed in projects.

The Data Productivity Cloud currently supports Snowflake, Databricks, and Amazon Redshift.

When you create a project, you're setting up a workspace in the Data Productivity Cloud that will group the following resources:

  • Branches for version control and collaborative working.
  • Environment connections to your cloud data platform.
  • Secret definitions to reference secrets such as passwords, API keys, and so on.
  • Cloud credentials to connect to objects and services on AWS, Azure, and GCP.
  • OAuth connections to access your data at third-party services such as Facebook, Salesforce, and so on.
  • Schedules to run your pipelines at your preferred intervals.
  • Access permissions to your project for other members of your Data Productivity Cloud organization.

Create a project

When you create a project, set the following details in the corresponding fields:

  • A Project name.
  • An optional Description of your project.
  • The Data platform you wish to connect to.

Click Continue.


Select a project configuration

Select how you want your project to be configured:

  • Matillion managed
  • Advanced settings

For the purpose of this guide, select Matillion managed. Matillion will set up and manage the following infrastructure:

  • Git repository
  • Secrets
  • Agents

Note

Select Advanced settings if you want to configure and set up a third-party Git repository or deploy a Hybrid SaaS agent. For more information about project settings, read Projects.

Click Continue.


Create an environment

Enter an environment name.

Click Continue.

From this point forward, follow the instructions specific to your cloud data platform: Snowflake, Databricks, or Amazon Redshift.


Specify data warehouse credentials

Snowflake

  • Account: Enter your Snowflake account name and region (part of the URL you use to log in to Snowflake). Uses the format [accountName].[region_id].
  • Username: Your Snowflake username.
  • Password: For the Full SaaS deployment model only. Your Snowflake password.
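
If you want to sanity-check these credentials before entering them, you can test a connection outside the Data Productivity Cloud. The following minimal sketch assumes the snowflake-connector-python package; the account identifier, username, and password are placeholders.

```python
# Minimal sketch: verify Snowflake credentials with snowflake-connector-python.
# The account identifier, username, and password are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount.eu-west-1",  # [accountName].[region_id] format
    user="MY_USER",
    password="MY_PASSWORD",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_ACCOUNT(), CURRENT_REGION(), CURRENT_USER()")
    print(cur.fetchone())
finally:
    conn.close()
```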

Databricks

  • Instance name: Your Databricks instance name. Read the Databricks documentation to learn how to determine your instance name.
  • Username: Your Databricks username.
  • Password: Your Databricks password.

The instance name is the first part of the URL when you log in to your Databricks deployment, and is visible in the browser address bar. For example, the URL https://cust-success.cloud.databricks.com/explore/data/hive_metastore contains the instance name cust-success.cloud.databricks.com.

As a best practice, we also recommend using a personal access token for authentication. Read Databricks personal access token authentication for details. A personal access token is required to use a personal staging location (PSL), which is an option when selecting a Staging location in query components such as Database Query.
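
To confirm that your instance name and personal access token work before configuring the environment, you can call the Databricks REST API directly. This is a minimal sketch, not part of the Data Productivity Cloud setup; the instance name and token below are placeholders.

```python
# Minimal sketch: confirm a Databricks instance name and personal access token
# by listing clusters through the REST API. All values are placeholders.
import requests

INSTANCE = "cust-success.cloud.databricks.com"  # your Databricks instance name
TOKEN = "dapiXXXXXXXXXXXXXXXX"                  # your personal access token

resp = requests.get(
    f"https://{INSTANCE}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_name"], cluster["state"])
```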

Amazon Redshift

Note

Before you create a project for Amazon Redshift, you must first create cloud provider credentials, because a default S3 bucket is required for staging data.

Because Amazon Redshift runs exclusively on AWS, the Specify AWS cloud credentials page is displayed. Cloud provider credentials are required so that a default S3 bucket can be selected on the next page.

  • Endpoint: The physical address of the leader node. This will be either a name or an IP address.
  • Port: This is usually 5439 or 5432, but it can be configured differently when setting up your Amazon Redshift cluster.
  • Use SSL: Select this to encrypt communications between the Data Productivity Cloud and Amazon Redshift. Some Amazon Redshift clusters may be configured to require this.
  • Username: The username for the environment connection.
  • Password: Your Amazon Redshift password.
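
If you want to verify the endpoint, port, and SSL settings before entering them, you can test a connection with a standard PostgreSQL client library. A minimal sketch using psycopg2, with placeholder connection details:

```python
# Minimal sketch: verify Amazon Redshift connection details with psycopg2.
# Endpoint, database, username, and password are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com",  # Endpoint
    port=5439,                                                     # Port
    dbname="dev",
    user="my_user",
    password="my_password",
    sslmode="require",  # equivalent to selecting Use SSL
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT current_database(), current_user")
        print(cur.fetchone())
finally:
    conn.close()
```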

Click Continue.


Select data warehouse defaults

Snowflake

  • Default role: The default Snowflake role for this environment connection. Read Overview of Access Control to learn more.
  • Default warehouse: The default Snowflake warehouse for this environment connection. Read Overview of Warehouses to learn more.
  • Default database: The default Snowflake database for this environment connection. Read Database, Schema, and Share DDL to learn more.
  • Default schema: The default Snowflake schema for this environment connection. Read Database, Schema, and Share DDL to learn more.
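
If you're unsure which values to choose, you can list the roles, warehouses, and databases visible to your Snowflake user. A minimal sketch using snowflake-connector-python, with placeholder credentials:

```python
# Minimal sketch: list the roles, warehouses, and databases visible to a
# Snowflake user, to help choose environment defaults. Credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount.eu-west-1",
    user="MY_USER",
    password="MY_PASSWORD",
)
try:
    cur = conn.cursor()
    for stmt in ("SHOW ROLES", "SHOW WAREHOUSES", "SHOW DATABASES"):
        cur.execute(stmt)
        name_col = [col[0] for col in cur.description].index("name")
        print(stmt, "->", sorted(row[name_col] for row in cur.fetchall()))
finally:
    conn.close()
```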

Databricks

  • Compute: The Databricks cluster that Data Productivity Cloud will connect to.
  • Catalog: Choose a Databricks Unity Catalog to connect to.
  • Schema (Database): Choose a Databricks schema (database) to connect to.

The Default compute drop-down shows all clusters and SQL warehouses. The drop-down also shows the status (Running, Stopped, Starting, or Error) of each compute resource. Selecting a Stopped compute resource will trigger it to start, and change the displayed status to Starting. Starting a Databricks compute resource can take a few minutes, and during this time you won't be able to continue the configuration. Once the compute resource has started, you can select a catalog.

The Default catalog drop-down shows both the hive_metastore as a top-level catalog and catalogs governed by Unity Catalog. Matillion recommends using Unity Catalog for access to our full suite of features. Read Work with Unity Catalog and the legacy Hive metastore for details.

Note

Databricks sometimes uses the terms Schema and Database interchangeably in its documentation. We always use the term Schema in component parameters.
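
If you're unsure which catalog and schema to choose, you can list them using the databricks-sql-connector package. A minimal sketch with placeholder connection details (the http_path comes from your compute resource's connection details):

```python
# Minimal sketch: list catalogs and schemas with databricks-sql-connector.
# Hostname, http_path, token, and catalog name are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="cust-success.cloud.databricks.com",  # instance name
    http_path="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",     # from the compute's connection details
    access_token="dapiXXXXXXXXXXXXXXXX",                  # personal access token
) as conn:
    with conn.cursor() as cur:
        cur.execute("SHOW CATALOGS")
        print("Catalogs:", [row[0] for row in cur.fetchall()])
        cur.execute("SHOW SCHEMAS IN my_catalog")  # placeholder catalog name
        print("Schemas:", [row[0] for row in cur.fetchall()])
```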

Amazon Redshift

  • Default database: The database you created when setting up your Amazon Redshift cluster. If you run with multiple databases, choose the one you want to use for this environment.
  • Default schema: This is public by default, but if you have configured multiple schemas within your Amazon Redshift database, specify the schema you want to use.
  • S3 bucket: Your default Amazon S3 bucket, used for staging data.
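
To confirm that the default S3 bucket is reachable with your AWS cloud credentials, you can run a quick check with boto3. A minimal sketch; the bucket name is a placeholder:

```python
# Minimal sketch: confirm the default S3 staging bucket is reachable with your
# AWS credentials, using boto3. The bucket name is a placeholder.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "my-default-staging-bucket"

try:
    s3.head_bucket(Bucket=bucket)
    print(f"OK: {bucket} is accessible")
except ClientError as err:
    print(f"Cannot access {bucket}: {err}")
```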

Click Finish to create your project. You can now add a branch and begin building data pipelines in Designer.