Databricks Lakehouse
Overview
This destination syncs data to Delta Lake on Databricks Lakehouse. Each stream is written to its own Delta table.
This connector requires a JDBC driver to connect to the Databricks cluster. To use the driver and the connector, you must agree to the JDBC ODBC driver license. This means that you may only use this connector to connect third-party applications to Apache Spark SQL within a Databricks offering using the ODBC and/or JDBC protocols.
Currently, this connector requires 30+ MB of memory per stream. When syncing multiple streams, it may run into an out-of-memory error if the allocated memory is too small. This performance bottleneck is tracked in this issue. Once it is resolved, the connector should be able to sync an almost infinite number of streams with less than 500 MB of memory.
Getting started
Databricks AWS Setup
1. Create a Databricks Workspace
- Follow the Databricks guide Create a workspace using the account console.
IMPORTANT: Don't forget to create a cross-account IAM role for your workspaces.
TIP: Alternatively, use the Databricks quickstart to create a new workspace.
2. Create a metastore and attach it to workspace
IMPORTANT: The metastore should be in the same region as the workspaces you want to use to access the data. Make sure that this matches the region of the cloud storage bucket you created earlier.
Setup storage bucket and IAM role in AWS
Follow Configure a storage bucket and IAM role in AWS to set up an AWS bucket with the necessary permissions.
Create metastore
- Log in to the Databricks account console with admin permissions.
- Go to the Data tab and click the Create metastore button.
- Provide all necessary data and click Create:
  - Name
  - Region: the metastore should be in the same region as the workspace.
  - S3 bucket path: created at the Setup storage bucket and IAM role in AWS step.
  - IAM role ARN: created at the Setup storage bucket and IAM role in AWS step. Example: arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AWS_IAM_ROLE_NAME>
- Select the workspaces in the Assign to workspaces tab and click Assign.
3. Create Databricks SQL Warehouse
TIP: If you use a Databricks cluster, skip this step.
- Open the Workspaces tab and click on the created workspace to open its console.
- Create a SQL warehouse:
  - Switch to the SQL tab
  - Click the New button
  - Choose SQL Warehouse
- After the SQL warehouse is created, its Connection details can be used to connect Airbyte (see the next step).
4. Databricks SQL Warehouse connection details
TIP: If you use a Databricks cluster, skip this step.
- Open the workspace console.
- Go to the SQL Warehouses section and open the created warehouse.
- Open the Connection details tab.
IMPORTANT: The Server hostname, Port, and HTTP path are used for the Airbyte connection.
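The same connection details can also be used by any JDBC client to verify connectivity to the warehouse. Below is a minimal sketch in Java, assuming the Databricks JDBC driver is on the classpath; the hostname, HTTP path, and the DATABRICKS_TOKEN environment variable are hypothetical placeholders you would replace with your own values from the Connection details tab.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DatabricksJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Placeholders: copy these values from the warehouse's Connection details tab.
        String serverHostname = "dbc-xxxxxxxx-xxxx.cloud.databricks.com";
        String httpPath = "/sql/1.0/warehouses/xxxxxxxxxxxxxxxx";
        // Hypothetical: a Databricks personal access token supplied via an env var.
        String personalAccessToken = System.getenv("DATABRICKS_TOKEN");

        // URL format assumed for the Databricks JDBC driver; older Simba Spark
        // drivers use a jdbc:spark:// prefix with a similar parameter list.
        String url = "jdbc:databricks://" + serverHostname + ":443;"
                + "httpPath=" + httpPath + ";"
                + "AuthMech=3;UID=token;PWD=" + personalAccessToken;

        // Run a trivial query to confirm the connection details are correct.
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1 AS connectivity_check")) {
            while (rs.next()) {
                System.out.println("connectivity_check = " + rs.getInt(1));
            }
        }
    }
}
```

The same Server hostname, Port, and HTTP path values go into the corresponding fields of the Airbyte destination configuration.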
5. Create Databricks Cluster
TIP: If you use a Databricks SQL Warehouse, skip this step.
- Open the Workspaces tab and click on the created workspace to open its console.
- Create a cluster:
  - Switch to Data Science & Engineering
  - Click the New button
  - Choose Cluster