Databricks
Before you run any pipelines in the Data Productivity Cloud, you need a connection to a suitable cloud data platform account. This topic discusses the basics of connecting the Data Productivity Cloud to Databricks.
Full SaaS or Hybrid SaaS?
The Data Productivity Cloud can run in either a Full SaaS or a Hybrid SaaS architecture.
- Databricks on AWS is compatible with both Full SaaS and Hybrid SaaS.
- Databricks on Azure currently supports Full SaaS only. Hybrid SaaS support for Azure is coming soon.
Compute types
The Data Productivity Cloud supports the following Databricks compute types:
- All-purpose compute (Databricks runtimes 10.4 and above are supported).
- Classic SQL warehouses.
- Serverless SQL warehouses (recommended).
Read Compute in the Databricks documentation for more information.
Authentication to Databricks
Currently, the Data Productivity Cloud supports these authentication methods when connecting to Databricks:
- Username/password
- Personal access token
Note
To use personal access token authentication in pipeline components, enter the literal string "token" as the username and the actual value of the token as the password.
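The convention in the note above can be sketched as a small helper. This is a minimal illustration only: the `Credentials` type and the `pat_credentials` function are hypothetical names, not part of the Data Productivity Cloud or any Databricks library, and the token in the usage line is a placeholder.

```python
from typing import NamedTuple


class Credentials(NamedTuple):
    """Username/password pair as entered in a pipeline component (hypothetical type)."""
    username: str
    password: str


def pat_credentials(token_value: str) -> Credentials:
    """Map a personal access token onto the username/password fields.

    Per the note above, the username is the literal string "token"
    and the password is the token value itself.
    """
    if not token_value:
        raise ValueError("A personal access token value is required.")
    return Credentials(username="token", password=token_value)


# Usage with a placeholder value (not a real token):
creds = pat_credentials("<your-personal-access-token>")
```

In practice the token value should come from a secret store rather than being written into pipeline configuration directly.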
Catalog types
We recommend using workspaces with Unity Catalog enabled. The Data Productivity Cloud also supports Hive catalogs, but many of its advanced features (such as Unity Catalog staging), along with future features, will rely on Unity Catalog workspaces.
Feature support
Some features will only work with specific Databricks runtimes and configurations:
| Feature | Minimum Databricks runtime | Notes |
|---|---|---|
| Unity Catalog Volumes staging | 13.4+ | |
| Run Notebook | 10.4+ | If you are using a serverless or classic SQL warehouse, you can only run SQL notebooks with the Run Notebook component. |
| Personal Staging | 10.4+ | You must use personal access token authentication to use personal staging. This feature is being deprecated. |
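The runtime minimums in the table above can be checked programmatically before a pipeline is built. The following is a minimal sketch, not an official utility: the function names are hypothetical, and it assumes the runtime is supplied as a string beginning with `major.minor` (for example `10.4`, or a longer form such as `13.3.x-scala2.12`). It checks only the minimum runtime and ignores the additional conditions in the Notes column, such as the authentication method or warehouse type.

```python
from typing import Dict, List, Tuple

# Minimum runtimes copied from the feature support table above.
MIN_RUNTIME: Dict[str, Tuple[int, int]] = {
    "Unity Catalog Volumes staging": (13, 4),
    "Run Notebook": (10, 4),
    "Personal Staging": (10, 4),
}


def parse_runtime(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a runtime string that starts with 'major.minor'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def supported_features(runtime_version: str) -> List[str]:
    """Return the features whose minimum runtime is met by runtime_version."""
    runtime = parse_runtime(runtime_version)
    # Tuples compare element by element, so (13, 3) < (13, 4) < (14, 0).
    return [name for name, minimum in MIN_RUNTIME.items() if runtime >= minimum]


print(supported_features("10.4"))  # ['Run Notebook', 'Personal Staging']
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version strings or floats, where `10.4` would sort before `9.1` as text.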
S3 buckets and Azure Blob storage
If you wish to load data from, or stage via, S3 buckets or Azure Blob storage, you must create AWS or Azure cloud credentials and associate them with your environment.
You should also make sure that the instance profile attached to your Databricks compute resources has access to the same AWS or Azure storage.