Set up Delta Lake on Databricks
This page is a guide to configuring your Databricks account to use Delta Lake as a destination within Data Loader.
A Delta Lake on Databricks destination can be set up on the fly either as part of the pipeline building process or on its own. Ingested data is first staged in an Amazon S3 bucket before being batched and loaded to the Delta Lake on Databricks destination.
Supported runtimes
Data Loader supports the following Databricks runtime versions:
| Category | Values |
| --- | --- |
| Runtime versions | 11.3 LTS and upwards |
For more information on Databricks releases, read Databricks runtime releases.
Prerequisites
- An Amazon Web Services (AWS) account with a Delta Lake on Databricks (AWS) deployment.
Instructions for configuring a Delta Lake on Databricks (AWS) deployment are outside the scope of this article; our instructions assume that you have Delta Lake on Databricks (AWS) up and running. Refer to the Databricks documentation.
- An existing Amazon S3 bucket that is:
  - In the same AWS account as the Delta Lake on Databricks deployment.
  - In the same region as your Data Loader account.
- Permissions to manage S3 buckets in AWS:
  - The AWS user must be able to add and modify bucket policies in the AWS account or accounts where the S3 bucket and Databricks deployment reside.
  - The AWS account should have the permissions granted by DatabricksRolePolicy. To connect to an S3 bucket, it should have the permissions granted by databricks-s3-access.
- You have access to an available workspace in Databricks.
- You have permissions on a cluster and on the relevant databases (schemas) you want to use in the created workspace.
- Databricks workspace URL in the format: <deployment name>.cloud.databricks.com
- You have access to one or more Databricks clusters to facilitate operations on your Delta Lake tables.
- The database credentials (hostname, port, and HTTP path), WorkspaceID and Personal Access Token (PAT) of the Databricks instance.
Configuring Delta Lake on Databricks as a Destination
Once the prerequisites have been satisfied, perform the following steps to configure Delta Lake on Databricks as a Destination:
Step 1: Configure S3 bucket access in AWS
The S3 bucket you use must be in the same region as your Data Loader account. Using a bucket in another region will result in an error.
Grant Data Loader access to the Amazon S3 bucket
To allow Data Loader to access the S3 bucket, you'll need to add a bucket policy using the AWS console. Follow the steps below to add the bucket policy; a scripted alternative is sketched after these steps.
- Sign in to your Amazon Web Services (AWS) account as a user with privileges that allows you to manage S3 buckets.
- Click Services near the top-left corner of the page.
- Under the Storage option, click S3.
- A page listing all buckets currently in use will be displayed. Click the name of the bucket that will be used with Databricks.
- Click the Permissions tab.
- In the Permissions tab, click the Edit button within the Bucket Policy section.
- In the Bucket policy editor, paste the bucket policy for your Data Loader data pipeline region and replace <YOUR-BUCKET-NAME> with the name of your S3 bucket.
- When finished, click Save.
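If you prefer to script this step instead of using the console, the following sketch applies a bucket policy with boto3. The policy body is only a placeholder, not the official policy: copy the actual bucket policy for your Data Loader pipeline region, and treat <YOUR-BUCKET-NAME> and <DATA-LOADER-PRINCIPAL-ARN> as values you must supply.

```python
# Minimal sketch: attach the Data Loader bucket policy with boto3 instead of the console.
# The statements below are illustrative placeholders; use the real policy published
# for your Data Loader pipeline region and substitute your own values.
import json
import boto3

BUCKET_NAME = "<YOUR-BUCKET-NAME>"                      # the S3 staging bucket
DATA_LOADER_PRINCIPAL = "<DATA-LOADER-PRINCIPAL-ARN>"   # principal from the region-specific policy

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataLoaderStaging",
            "Effect": "Allow",
            "Principal": {"AWS": DATA_LOADER_PRINCIPAL},
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET_NAME}",
                f"arn:aws:s3:::{BUCKET_NAME}/*",
            ],
        }
    ],
}

# Apply the policy to the staging bucket.
s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET_NAME, Policy=json.dumps(policy))
```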
Grant Databricks access to the Amazon S3 bucket
Next, you'll configure your AWS account to allow access from Databricks by creating an IAM role and policy. This is required for loading data into Delta Lake on Databricks (AWS).
Follow steps 1-4 in the Databricks documentation to create an IAM policy and role for Databricks.
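Those console steps can also be scripted. The sketch below assumes boto3 and uses placeholder values for the Databricks account ID, external ID, role name, and policy statements; take the exact trust relationship and access policy from the Databricks documentation referenced above.

```python
# Minimal sketch (not the exact Databricks-documented policy) of creating the
# cross-account IAM role with boto3. All bracketed values are placeholders.
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing Databricks to assume the role; the account ID and
# external ID come from the Databricks documentation / account console.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<DATABRICKS-ACCOUNT-ID>:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "<YOUR-EXTERNAL-ID>"}},
        }
    ],
}

iam.create_role(
    RoleName="databricks-s3-access",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Inline policy granting the role access to the staging bucket.
bucket_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject",
                       "s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": [
                "arn:aws:s3:::<YOUR-BUCKET-NAME>",
                "arn:aws:s3:::<YOUR-BUCKET-NAME>/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="databricks-s3-access",
    PolicyName="databricks-s3-access-policy",
    PolicyDocument=json.dumps(bucket_access_policy),
)
```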
Step 2: Connect your Databricks Warehouse
Option 1: Create a Databricks cluster
- Log in to your Databricks account.
- In the Databricks console, go to Data Science & Engineering → Create → Cluster.
- Enter a Cluster name of your choice.
- In the Databricks Runtime Version field, select a version listed in Supported runtimes. This is required for Delta Lake on Databricks (AWS) to work with Data Loader.
- Expand the Advanced options section and select the Spark tab.
- In the Spark Config box, paste the following configuration, which is needed to read the staged data from your S3 account:
  spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation true
- When finished, click Create Cluster to create your cluster. (A scripted equivalent using the Databricks REST API is sketched after these steps.)
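If you manage clusters through the Databricks Clusters REST API rather than the console, a rough equivalent of the steps above is sketched below. The workspace URL, token, runtime string, node type, and worker count are placeholder assumptions; only the Spark configuration key mirrors the console instructions.

```python
# Rough sketch of the console steps above using the Databricks Clusters API
# (POST /api/2.0/clusters/create). All bracketed values are placeholders.
import requests

WORKSPACE_URL = "https://<deployment name>.cloud.databricks.com"
TOKEN = "<YOUR-PERSONAL-ACCESS-TOKEN>"  # see Step 4

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "data-loader-delta-lake",
        "spark_version": "11.3.x-scala2.12",  # example string for an 11.3 LTS runtime
        "node_type_id": "i3.xlarge",          # placeholder instance type
        "num_workers": 2,                     # placeholder sizing
        # Same Spark configuration as the console instructions above.
        "spark_conf": {
            "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation": "true"
        },
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])  # ID of the newly created cluster
```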
Option 2: Create a Databricks SQL endpoint
- Log in to your Databricks account.
- In the Databricks console, select SQL from the drop-down.
- Click Create → SQL Endpoint.
- In the New SQL Endpoint window:
- Specify a Name for the endpoint.
- Select your Cluster Size.
- Configure other endpoint options, as required.
- Click Create.
Step 3: Get the Databricks credentials
Once you have a cluster that you want to load data to, you must obtain the cluster details to be provided while configuring Databricks in Data Loader. To do this:
- Click Compute in the left navigation bar of the Databricks console.
- Click the cluster you want to use.
- In the Configuration tab, scroll down to the Advanced Options window, and select JDBC/ODBC.
- Make a note of the following values. You will need them when configuring the destination in Data Loader.
- Server hostname
- Port
- HTTP Path
Step 4: Generate an Access Token
- In the Databricks console, click Settings in the left navigation bar, and then click User Settings.
- Click the Access Tokens tab.
- Click Generate New Token.
- Optionally, provide a description in the Comment field and set the token Lifetime (expiration period).
- Click Generate.
- Copy the generated token. This token is used in place of a username when connecting Databricks as a destination in Data Loader.
To learn more about Databricks personal access tokens, read How to generate a new Databricks Token.
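Before moving on, you can optionally verify the server hostname and HTTP path from Step 3 together with the new token. The sketch below assumes the databricks-sql-connector Python package; the bracketed values stand in for your own credentials.

```python
# Optional connectivity check with the databricks-sql-connector package
# (pip install databricks-sql-connector). Replace the placeholders with the
# values gathered in Steps 3 and 4.
from databricks import sql

with sql.connect(
    server_hostname="<deployment name>.cloud.databricks.com",
    http_path="<HTTP-PATH-FROM-THE-JDBC/ODBC-TAB>",
    access_token="<YOUR-PERSONAL-ACCESS-TOKEN>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())  # expect a single row containing 1 if the connection works
```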
Step 5: Get the WorkspaceID
- Log in to your Databricks account.
- In the browser's address bar, you will see a URL that looks something like this:
  https://XXX-XXXXXXX-XXXX.cloud.databricks.com
  where XXX-XXXXXXX-XXXX is a sequence of letters and numbers separated by hyphens. This portion of the URL is your Databricks WorkspaceID (see the sketch below).
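Illustrative only: if you have the workspace URL as a string, the WorkspaceID is the first label of the hostname, as in this small sketch.

```python
# Illustrative sketch: extract the WorkspaceID (the first hostname label) from the URL.
url = "https://XXX-XXXXXXX-XXXX.cloud.databricks.com"
workspace_id = url.removeprefix("https://").split(".")[0]
print(workspace_id)  # -> XXX-XXXXXXX-XXXX
```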
Step 6: Get the Database
In Databricks, a database is a collection of tables. These tables can be queried with Spark APIs and Spark SQL. For more information, read Databases and tables in the Databricks documentation.
To create a database, open a new or existing Databricks notebook:
In the notebook, run the CREATE DATABASE
command. To learn more about the command, including specific parameter information, read Create Database in the Databricks documentation.
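As a minimal sketch, assuming a Python notebook cell and a hypothetical database name, the statement can be run through the notebook's built-in spark session:

```python
# Minimal sketch for a Python notebook cell; data_loader_db is a hypothetical name.
# `spark` is the SparkSession that Databricks notebooks provide automatically.
# In a SQL cell, you could run the CREATE DATABASE statement directly instead.
spark.sql("CREATE DATABASE IF NOT EXISTS data_loader_db")
spark.sql("SHOW DATABASES").show()  # confirm the database now exists
```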
Step 7: Connect to Databricks
The next step is to connect to Databricks in Data Loader. Please read Connect to Databricks in the Databricks documentation.