Delta Lake on Databricks configuration for Matillion ETL
Overview
This guide explains how to configure new Databricks resources for use with Matillion ETL when you launch from the Hub.
Refer to the following sections to help you set up your Delta Lake configuration:
Workspace
A Databricks workspace is an environment where you can access all of your Databricks assets. The workspace lets you organize objects (notebooks, libraries, and experiments) into folders. Furthermore, the workspace provides access to data and computational resources such as clusters.
When you sign up to Databricks, you are given access to your first workspace. For more information, read Sign up for a free Databricks trial.
To create new Databricks workspaces, you will first need to use your Databricks Account API to select a custom plan that allows for multiple workspaces per account. For more information, read Create a new workspace using the Account API.
Workspace ID
When you are logged in to your Databricks workspace, the browser's address bar shows a URL that looks like this:
https://XXX-XXXXXXX-XXXX.cloud.databricks.com
Where XXX-XXXXXXX-XXXX is a sequence of letters and numbers, separated by hyphens. This portion of the URL is your Databricks workspace ID.
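As a minimal sketch of the rule above, the following Python snippet pulls the workspace ID out of a copied URL. The function name workspace_id_from_url and the sample URL are illustrative assumptions, not part of Matillion ETL or Databricks; the snippet only handles URLs of the cloud.databricks.com form shown here.

```python
from urllib.parse import urlparse

# Hostname suffix taken from the URL pattern shown in this guide.
_SUFFIX = ".cloud.databricks.com"


def workspace_id_from_url(url: str) -> str:
    """Return the part of the hostname that precedes .cloud.databricks.com."""
    host = urlparse(url).hostname
    if not host or not host.endswith(_SUFFIX) or host == _SUFFIX:
        raise ValueError(f"not a Databricks workspace URL: {url!r}")
    return host[: -len(_SUFFIX)]


# Hypothetical URL, used only to illustrate the format.
example_id = workspace_id_from_url(
    "https://abc-1234567-wxyz.cloud.databricks.com/?o=1#notebook"
)
print(example_id)
```

Any path, query string, or fragment that follows the hostname is ignored, so the URL can be pasted as it appears in the browser.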
You will need your Databricks workspace ID when establishing a Databricks cluster connection in the Matillion ETL Create Project wizard. For more information about creating a Matillion ETL project, read Create Project (Delta Lake on Databricks).
Databricks Clusters
A Databricks cluster is "a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analysis, and machine learning".
Matillion ETL requires an existing Databricks cluster. Provided that a secure connection is established when you create a project in Matillion ETL, every available cluster in your Databricks workspace can be selected in step 4 of the Create Project wizard, Delta Lake Defaults.
To create a Databricks cluster, follow these steps:
- In the left-hand menu, select Clusters. This opens the Clusters page.
- Click the + Create Cluster button.
- Complete the New Cluster form. For guidance, read Create a cluster. On the right-hand side of the New Cluster page, you can select either UI or JSON for the cluster setup.
- Once you are happy with your cluster's configuration, click Create Cluster. You can select your new cluster from the Clusters menu. To learn more about clusters in Databricks, read Clusters.
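For readers who prefer the JSON option mentioned in the steps above, the sketch below builds a small cluster definition in Python and prints it as JSON. The field names (cluster_name, spark_version, node_type_id, num_workers, autotermination_minutes) follow the Databricks Clusters API; the helper build_cluster_spec and every value passed to it are placeholder assumptions, so substitute a runtime version and node type that your own workspace offers.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id, num_workers,
                       autotermination_minutes=60):
    """Assemble a minimal cluster definition as a plain dictionary."""
    if num_workers < 0:
        raise ValueError("num_workers must be zero or greater")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Shut the cluster down after this many idle minutes.
        "autotermination_minutes": autotermination_minutes,
    }


# Placeholder values: check which runtime versions and node types
# are available in your workspace before using them.
spec = build_cluster_spec(
    name="matillion-etl-cluster",
    spark_version="13.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=2,
)
print(json.dumps(spec, indent=2))
```

The printed JSON is intended only as a starting point to compare against the form's JSON view; the New Cluster form may add or require further fields depending on your workspace.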
Databricks Notebook
In Databricks, a notebook is a web-based interface to a document that contains runnable code (such as SQL), visualizations, and narrative text. To learn how to create a new notebook, read Create a notebook.
Databricks Database
In Databricks, a database is a collection of tables, and a table is a collection of structured data. You can query tables with Spark APIs and Spark SQL. For more information, read Databases and tables.
Follow these steps to create a database:
- Open a new or existing Databricks notebook.
- In the notebook, run the CREATE DATABASE command. To learn more about the command, including specific parameter information, read CREATE DATABASE.
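The steps above can be sketched in a Python notebook cell as follows. The helper create_database_sql and the database name matillion_demo are hypothetical; the snippet only assembles the statement text, and the final commented line shows how it would typically be run through the spark session that Databricks notebooks provide.

```python
import re
from typing import Optional

# Conservative pattern for unquoted database names (an assumption of this sketch).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_database_sql(name: str, comment: Optional[str] = None) -> str:
    """Build a CREATE DATABASE statement for the given database name."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsupported database name: {name!r}")
    sql = f"CREATE DATABASE IF NOT EXISTS {name}"
    if comment is not None:
        # Reject quotes and backslashes rather than attempt to escape them.
        if "'" in comment or "\\" in comment:
            raise ValueError("comment must not contain quotes or backslashes")
        sql += f" COMMENT '{comment}'"
    return sql


statement = create_database_sql("matillion_demo", comment="Database for Matillion ETL")
print(statement)

# Inside a Databricks notebook, the statement could then be executed with:
# spark.sql(statement)
```

IF NOT EXISTS makes the cell safe to rerun, since the statement does nothing when the database is already present.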
When you create a project in Matillion ETL, step 4 of the Create Project wizard requires a working Databricks database.