Monday, 13 April 2026
DBX Workspace and Job with GitHub Actions
Monday, 30 March 2026
Databricks Serverless Jobs with Terraform
Databricks Serverless Jobs using Terraform (Step-by-Step Guide)
Part 1: Create Databricks Workspace
Step 1: Account-Level Provider
provider "databricks" {
alias = "account"
host = "https://accounts.cloud.databricks.com"
account_id = var.account_id
username = var.username
password = var.password
}
Step 2: Create Workspace
resource "databricks_mws_workspaces" "workspace" {
provider = databricks.account
account_id = var.account_id
aws_region = var.aws_region
workspace_name = "demo-workspace"
credentials_id = var.credentials_id
storage_configuration_id = var.storage_config_id
}
Step 3: Configure Workspace Provider
provider "databricks" {
host = databricks_mws_workspaces.workspace.workspace_url
token = var.workspace_token
}
Part 2: Job WITH Notebook
Step 1: Create Notebook
resource "databricks_notebook" "notebook" {
path = "/Shared/demo-notebook"
language = "PYTHON"
content_base64 = base64encode(<<EOF
print("Hello from Notebook Job")
EOF
)
}
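Terraform's base64encode corresponds to standard Base64, so the inline notebook source above round-trips cleanly through Python's standard base64 module. A quick sanity check:

```python
import base64

# The notebook body that Terraform base64-encodes above
notebook_source = 'print("Hello from Notebook Job")\n'

encoded = base64.b64encode(notebook_source.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

assert decoded == notebook_source  # the round-trip is lossless
```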
Step 2: Create Job
resource "databricks_job" "notebook_job" {
name = "notebook-job"
task {
task_key = "task1"
notebook_task {
notebook_path = databricks_notebook.notebook.path
}
environment_key = "serverless_env"
}
environment {
key = "serverless_env"
spec {
client = "1"
}
}
schedule {
quartz_cron_expression = "0 0 0 * * ?"
timezone_id = "UTC"
}
}
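The Quartz cron expression in the schedule block has six fields (second, minute, hour, day-of-month, month, day-of-week), so "0 0 0 * * ?" fires daily at midnight in the configured timezone. A quick field breakdown in Python:

```python
# Split the Quartz expression from the schedule block into its six fields
expr = "0 0 0 * * ?"
names = ["second", "minute", "hour", "day_of_month", "month", "day_of_week"]
fields = dict(zip(names, expr.split()))

# Fires at second 0, minute 0, hour 0 of every day; "?" means
# "no specific day-of-week" in Quartz syntax
assert fields == {"second": "0", "minute": "0", "hour": "0",
                  "day_of_month": "*", "month": "*", "day_of_week": "?"}
```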
Part 3: Job WITHOUT Notebook (Python Script)
Step 1: Create Python Script
resource "databricks_workspace_file" "script" {
path = "/Shared/demo-script.py"
content_base64 = base64encode(<<EOF
print("Hello from Python Script Job")
EOF
)
}
Step 2: Create Job
resource "databricks_job" "python_job" {
name = "python-job"
task {
task_key = "task1"
spark_python_task {
python_file = databricks_workspace_file.script.path
}
environment_key = "serverless_env"
}
environment {
key = "serverless_env"
spec {
client = "1"
}
}
schedule {
quartz_cron_expression = "0 0 0 * * ?"
timezone_id = "UTC"
}
}
▶️ Execution Steps
terraform init
terraform plan
terraform apply
Key Takeaways
- Workspace is created at the account level
- Jobs and notebooks are workspace-level resources
- Serverless jobs require an environment_key
- Use Python scripts for production workloads
Pro Tip
For production workloads, avoid notebooks and use Python scripts or packaged jobs with CI/CD pipelines.
Tuesday, 24 March 2026
Databricks Data Engineer Associate - Complete Guide
Databricks Data Engineer Associate - Complete Step-by-Step Guide
Step 1: Databricks Fundamentals
Theory
- Lakehouse = Data Lake + Data Warehouse
- Built on Apache Spark
- Uses Delta Lake for reliability
Core Components
- Workspace
- Cluster
- Notebook
- Jobs
Practical
spark.range(10).show()
Key Concept: Driver = brain, Workers = execution
Step 2: Apache Spark
Theory
- DataFrames are distributed tables
- Lazy evaluation (execution happens only on action)
Transformations vs Actions
| Type | Example |
|---|---|
| Transformation | filter, select |
| Action | show, count |
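Lazy evaluation can be sketched with plain Python generators (an analogy, not Spark itself): the "transformation" builds a plan and nothing runs until an "action" consumes it.

```python
executed = []

def source():
    # record each row as it is actually read from "storage"
    for r in [10, 35, 50]:
        executed.append(r)
        yield r

def transform(rows):
    # like a Spark transformation: returns a lazy plan, does no work yet
    return (r for r in rows if r > 30)

plan = transform(source())   # nothing has executed yet
assert executed == []        # lazy: no rows read so far

result = list(plan)          # the "action": forces execution
assert executed == [10, 35, 50]
assert result == [35, 50]
```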
Practical
Read Data
df = spark.read.format("csv").option("header", True).load("/FileStore/data.csv")
Transform
df2 = df.filter(df.age > 30).select("name", "age")
Aggregate
df3 = df.groupBy("city").count()
Join
df.join(df2, "id", "inner")
Step 3: Delta Lake (Critical)
Theory
- Provides ACID transactions
- Supports updates, deletes, and merges
- Supports time travel
Practical
Create Table
df.write.format("delta").save("/delta/table1")
Read Table
df = spark.read.format("delta").load("/delta/table1")
Update
UPDATE table1 SET age = 40 WHERE id = 1;
Delete
DELETE FROM table1 WHERE id = 2;
Merge (Important)
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
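The MERGE semantics can be sketched as a toy upsert in plain Python (illustrative only, not Delta's implementation): matched rows are updated from the source, unmatched source rows are inserted.

```python
def merge(target, source, key="id"):
    """Toy upsert mirroring MERGE: WHEN MATCHED THEN UPDATE,
    WHEN NOT MATCHED THEN INSERT (illustrative only)."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        by_key[row[key]] = dict(row)
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"id": 1, "age": 30}, {"id": 2, "age": 25}]
source = [{"id": 2, "age": 26}, {"id": 3, "age": 40}]
assert merge(target, source) == [{"id": 1, "age": 30},
                                 {"id": 2, "age": 26},
                                 {"id": 3, "age": 40}]
```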
Time Travel
SELECT * FROM table1 VERSION AS OF 2;
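Time travel reads can be pictured as selecting one snapshot from a version history. A toy sketch (Delta actually reconstructs each version from its transaction log, not a dict like this):

```python
# One snapshot per table version, mimicking VERSION AS OF reads
snapshots = {
    0: [{"id": 1, "age": 30}, {"id": 2, "age": 20}],  # initial write
    1: [{"id": 1, "age": 40}, {"id": 2, "age": 20}],  # after UPDATE
    2: [{"id": 1, "age": 40}],                        # after DELETE
}

def version_as_of(table, n):
    return table[n]

assert version_as_of(snapshots, 2) == [{"id": 1, "age": 40}]
```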
Step 4: Data Ingestion
Theory
- Batch = one-time processing
- Streaming = continuous processing
Practical
Batch
df = spark.read.json("/data/input")
df.write.format("delta").save("/data/output")
Streaming
df = spark.readStream.format("json").load("/input")
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk")
    .start("/output"))
Important: Checkpointing prevents data loss
Step 5: ETL Pipeline (Medallion Architecture)
| Layer | Purpose |
|---|---|
| Bronze | Raw data |
| Silver | Cleaned data |
| Gold | Aggregated data |
Practical
Bronze
df.write.format("delta").save("/bronze/data")
Silver
df_clean = df.filter("age IS NOT NULL")
df_clean.write.format("delta").save("/silver/data")
Gold
df.groupBy("city").count().write.format("delta").save("/gold/data")
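The three layers above can be sketched end-to-end with plain Python data (illustrative; the real layers are the Delta writes shown above, not in-memory lists):

```python
# Bronze: raw records as ingested
bronze = [{"name": "a", "age": 34, "city": "NYC"},
          {"name": "b", "age": None, "city": "NYC"},
          {"name": "c", "age": 41, "city": "SF"}]

# Silver: drop rows with null age (mirrors filter("age IS NOT NULL"))
silver = [r for r in bronze if r["age"] is not None]

# Gold: count rows per city (mirrors groupBy("city").count())
gold = {}
for r in silver:
    gold[r["city"]] = gold.get(r["city"], 0) + 1

assert gold == {"NYC": 1, "SF": 1}
```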
Step 6: Databricks SQL
Practical
Create Table
CREATE TABLE users USING DELTA LOCATION '/delta/users';
Query
SELECT city, COUNT(*) FROM users GROUP BY city;
Temp View
CREATE TEMP VIEW temp_users AS SELECT * FROM users;
Step 7: Jobs & Automation
- Create jobs from notebooks
- Schedule using cron
- Supports task dependencies
Step 8: Performance Optimization
Practical
Caching
df.cache()
Partitioning
df.write.partitionBy("city").format("delta").save("/data")
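partitionBy("city") lays the files out in Hive-style city=<value> directories. A sketch of the resulting paths (the part-file names here are hypothetical; the real files are written by Spark):

```python
rows = [{"city": "NYC", "name": "a"}, {"city": "SF", "name": "b"}]

# One subdirectory per partition value under the table root
layout = sorted({f"/data/city={r['city']}/part-0000.parquet" for r in rows})

assert layout == ["/data/city=NYC/part-0000.parquet",
                  "/data/city=SF/part-0000.parquet"]
```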
Step 9: Security
- Unity Catalog for governance
- Table-level permissions
Final Project (Recommended)
- Ingest JSON → Bronze
- Clean → Silver
- Aggregate → Gold
- Query using SQL
Final Checklist
- Spark transformations
- Delta MERGE, UPDATE, DELETE
- Streaming basics
- ETL pipelines
- Jobs & scheduling
Pro Tip
If you already have experience with AWS, EMR, or streaming systems, focus mainly on:
- Delta Lake
- Databricks UI
Saturday, 21 March 2026
Databricks Workspace Resources – Complete Guide
Databricks Workspace Resources – Full Guide
A Databricks workspace provides an environment where you can create, organize, and manage compute resources, data objects, automation workflows, analytics assets, and machine learning components. The Databricks UI supports creating notebooks, queries, dashboards, jobs, pipelines, experiments, models, and more through the + New menu and the workspace sidebar [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). Workspace objects include notebooks, jobs, libraries, data files, experiments, and more [2](https://www.youtube.com/watch?v=cNFKzWpRvsw).
Summary Table of Creatable Databricks Workspace Resources
| Resource | Description | How to Create (UI Steps) |
|---|---|---|
| Notebook | Interactive document for Python, SQL, R, Scala code execution [2](https://www.youtube.com/watch?v=cNFKzWpRvsw). | 1. Click + New → Notebook. 2. Enter notebook name. 3. Choose language. 4. Select compute (cluster not covered). 5. Click Create [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). |
| Query | SQL query used for dashboards & alerts [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). | 1. Click + New → Query. 2. SQL Editor opens. 3. Select SQL Warehouse. 4. Write SQL and click Save. |
| Dashboard | Visual BI dashboard created from queries [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). | 1. Open a saved query. 2. Click Add to Dashboard. 3. Create new or choose existing. 4. Arrange visuals → Save. |
| Alert | Condition-based SQL alert [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). | 1. Open a SQL Query. 2. Click Create Alert. 3. Add condition + recipients. 4. Save. |
| Repo | Git-connected source repo [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). | 1. Click + New → Repo. 2. Choose Git provider. 3. Paste repository URL. 4. Authenticate and click Create. |
| File | Workspace-level file (CSV, Python script, config) [2](https://www.youtube.com/watch?v=cNFKzWpRvsw). | 1. Open Workspace browser. 2. Click Add → File Upload. 3. Upload file. |
| Library | Install Python/JAR packages for use in notebooks/jobs [2](https://www.youtube.com/watch?v=cNFKzWpRvsw). | 1. Go to Workspace → Libraries. 2. Click Install New. 3. Upload wheel/JAR or specify PyPI package. 4. Click Install. |
| Job | Automation for notebooks, scripts, JARs, pipelines [2](https://www.youtube.com/watch?v=cNFKzWpRvsw). | 1. Click Jobs in sidebar. 2. Click Create Job. 3. Name the job. 4. Click Add Task and choose task type. 5. Configure task details. 6. Assign compute (cluster selection only). 7. Add schedule if needed. 8. Click Create [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). |
| Pipeline | DLT / Lakeflow ETL pipelines (triggered or continuous) [3](https://docs.databricks.com/aws/en/getting-started/concepts). | 1. Click Jobs & Pipelines. 2. Click Create Pipeline. 3. Enter name. 4. Select pipeline mode. 5. Add SQL/Python pipeline code. 6. Select target catalog/schema. 7. Configure settings. 8. Click Create. |
| Experiment | MLflow tracking experiment for ML models [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). | 1. Click + New → Experiment. 2. Enter name and location. 3. Click Create. |
| Model | MLflow model stored in Model Registry [4](https://devstacktips.com/development/programming-languages/2025/06/06/mastering-databricks-jobs-api-build-and-orchestrate-complex-data-pipelines/). | 1. Open an MLflow run. 2. Click Register Model. 3. Select or create model name. 4. Register. |
| Serving Endpoint | Real-time inference endpoint for ML models [1](https://learn.microsoft.com/en-us/azure/databricks/jobs/pipeline). | 1. Click + New → Serving Endpoint. 2. Select model. 3. Configure autoscaling. 4. Click Create Endpoint. |
Visual Diagram of All Databricks Workspace Resources
The following diagram shows how notebooks, pipelines, jobs, dashboards, alerts, and ML workflows connect logically inside a Databricks workspace.
```mermaid
flowchart TD
    A[Notebook] --> B[Job]
    B --> C[Pipeline]
    C --> D[Tables / Data Assets]
    A --> E[Experiment]
    E --> F[Model]
    F --> G[Serving Endpoint]
    B --> H[Dashboards]
    H --> I[Alerts]
    J[Repo] --> A
    K[Files / Libraries] --> A
```
Thursday, 19 March 2026
Databricks Serverless Job with JAR from S3 via Volume
Databricks Serverless Job with JAR (S3 → Volume → Notebook → Job)
Goal
- Upload JAR to S3
- Create Databricks Volume
- Copy JAR to Volume
- Create Notebook
- Create Job via UI
- Run and validate
1️⃣ Sample Test JAR
HelloSpark.java
package com.example;
import org.apache.spark.sql.SparkSession;
public class HelloSpark {
public static void main(String[] args) {
SparkSession spark = SparkSession.builder()
.appName("Test JAR Job")
.getOrCreate();
long count = spark.range(1, 100).count();
System.out.println("Count is: " + count);
spark.stop();
}
}
pom.xml
<project>
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>hello-spark</artifactId>
<version>1.0</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.5.0</version>
<scope>provided</scope>
</dependency>
</dependencies>
</project>
Build JAR
mvn clean package
Output: target/hello-spark-1.0.jar
☁️ 2️⃣ Upload JAR to S3
aws s3 cp target/hello-spark-1.0.jar s3://my-artifact-bucket/libs/
3️⃣ Databricks UI Setup
Step 1: Create Storage Credential
- Go to: Data → Credentials
- Click: Create Credential
- Name: my_cred
- IAM Role ARN: your role ARN
Step 2: Create External Location
- Go to: Data → External Locations
- Name: my_ext_loc
- URL: s3://my-volume-bucket/
- Credential: my_cred
Step 3: Create Volume
- Go to: Catalog → Schema
- Create Volume: my_volume
4️⃣ Copy JAR to Volume
volume_path = "/Volumes/my_catalog/my_schema/my_volume/"
dbutils.fs.cp(
"s3://my-artifact-bucket/libs/hello-spark-1.0.jar",
volume_path + "hello-spark.jar"
)
display(dbutils.fs.ls(volume_path))
5️⃣ Notebook Example
print("Running Databricks Job with JAR")
df = spark.range(1, 10)
display(df)
⚙️ 6️⃣ Create Job (UI)
- Go to: Workflows → Jobs → Create Job
- Job Name: test-jar-job
- Task Type: Notebook
- Select Notebook
Add Library
/Volumes/my_catalog/my_schema/my_volume/hello-spark.jar
Compute
Serverless
▶️ 7️⃣ Run Job
- Click Run Now
- Check logs under Runs
✅ 8️⃣ Expected Output
Count is: 99
9️⃣ Test Scenarios
Positive
- JAR loads successfully
- Notebook executes
- Volume accessible
Negative
- No Volume permission → Access Denied
- Wrong IAM Role → S3 Access Denied
- Missing JAR → File not found
10️⃣ Enterprise Best Practices
- Use separate S3 bucket for artifacts
- Use Unity Catalog Volumes for governance
- Restrict S3 access to specific prefixes
- Enable audit logging
Final Flow
S3 (JAR) → Volume → Notebook → Job → Output
Databricks Job with S3, Volume and JAR (Serverless)
Databricks Serverless Job with S3 + Volume + Notebook
Architecture
S3 (JAR / Libraries) → Databricks Volume → Notebook → Databricks Job
- S3 stores JAR files and artifacts
- Volume (Unity Catalog) provides governed access
- Notebook runs logic
- Job executes workload
IAM Role & Policy
IAM Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3ReadArtifacts",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-artifact-bucket",
"arn:aws:s3:::my-artifact-bucket/*"
]
},
{
"Sid": "S3VolumeAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::my-volume-bucket/*"
]
}
]
}
Trust Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<DATABRICKS_ACCOUNT_ID>:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "<EXTERNAL_ID>"
}
}
}
]
}
Unity Catalog Setup
Create Storage Credential
CREATE STORAGE CREDENTIAL my_cred WITH IAM_ROLE = 'arn:aws:iam::123456789012:role/databricks-role';
Create External Location
CREATE EXTERNAL LOCATION my_ext_loc URL 's3://my-volume-bucket/' WITH (STORAGE CREDENTIAL my_cred);
Create Volume
CREATE EXTERNAL VOLUME my_catalog.my_schema.my_volume LOCATION 's3://my-volume-bucket/vol/';
Notebook Example
volume_path = "/Volumes/my_catalog/my_schema/my_volume/"
# Copy JAR from S3 to Volume
dbutils.fs.cp(
"s3://my-artifact-bucket/libs/my-app.jar",
volume_path + "my-app.jar"
)
# List files
display(dbutils.fs.ls(volume_path))
⚙️ Using JAR in Job
Option 1: Add JAR in Notebook
spark.sparkContext.addJar("/Volumes/my_catalog/my_schema/my_volume/my-app.jar")
Option 2: Job Configuration (Recommended)
/Volumes/my_catalog/my_schema/my_volume/my-app.jar
Databricks Job JSON
{
"name": "jar-test-job",
"tasks": [
{
"task_key": "run-notebook",
"notebook_task": {
"notebook_path": "/Workspace/Users/test/notebook"
},
"libraries": [
{
"jar": "/Volumes/my_catalog/my_schema/my_volume/my-app.jar"
}
]
}
]
}
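This job spec is plain JSON, so it can be built and validated with Python's standard json module before posting it to the Jobs API (for example the jobs/create endpoint):

```python
import json

# Build the same job spec as a Python dict, then serialize it for the API call
job = {
    "name": "jar-test-job",
    "tasks": [{
        "task_key": "run-notebook",
        "notebook_task": {"notebook_path": "/Workspace/Users/test/notebook"},
        "libraries": [
            {"jar": "/Volumes/my_catalog/my_schema/my_volume/my-app.jar"}
        ],
    }],
}

payload = json.dumps(job, indent=2)
roundtrip = json.loads(payload)
assert roundtrip["tasks"][0]["libraries"][0]["jar"].startswith("/Volumes/")
```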
Test Scenarios
Positive Tests
- JAR loads successfully
- Notebook executes without error
- Volume is accessible
Negative Tests
- No permission on volume → Access Denied
- Invalid IAM role → Storage credential failure
- Missing JAR → Job failure
✅ Summary
| Component | Purpose |
|---|---|
| S3 | Stores JAR and artifacts |
| IAM Role | Grants access to S3 |
| Storage Credential | Connects Databricks to AWS |
| External Location | Maps S3 to Databricks |
| Volume | Secure file access layer |
| Notebook | Executes logic |
| Job | Runs the workflow |
Wednesday, 18 March 2026
S3 Bucket Security for Databricks on AWS – Do You Need a Bucket Policy
S3 Bucket Security for Databricks on AWS – Do You Need a Bucket Policy?
Short Answer
Yes — if you don’t have a bucket policy (or any explicit deny), any IAM principal in that AWS account with s3:* (or sufficient S3 permissions) can access the bucket, including data used by Databricks.
Why This Happens
AWS authorization follows this rule:
- Access is allowed if there is at least one ALLOW and no explicit DENY
So if:
- An IAM role/user has s3:*
- The bucket has no restrictive bucket policy
Then access is granted.
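That evaluation rule can be modeled as a toy simulator (a simplification of the real AWS policy engine, which also matches resources, conditions, and SCPs):

```python
def is_allowed(statements, principal, action):
    """Toy model of the AWS rule quoted above: access requires
    at least one Allow and no explicit Deny."""
    def matches(stmt):
        return stmt["Principal"] in (principal, "*") and action in stmt["Action"]
    explicit_deny = any(s["Effect"] == "Deny" and matches(s) for s in statements)
    any_allow = any(s["Effect"] == "Allow" and matches(s) for s in statements)
    return any_allow and not explicit_deny

# An identity policy grants s3:* to some other team's role; with no
# bucket policy the statement list has only that Allow, so access is granted.
identity_policy = [{"Effect": "Allow", "Principal": "other-team-role",
                    "Action": ["s3:*"]}]
assert is_allowed(identity_policy, "other-team-role", "s3:*") is True
```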
What This Means
Without Bucket Policy
- Databricks role → ✅ Access (expected)
- Any other IAM role with S3 permissions → ❗ Also has access
This includes:
- Admin roles
- Other application roles
- Over-permissioned users
Security Risk
- ❌ No data isolation
- ❌ Violates zero-trust principles
- ❌ Compliance risk (PII, GDPR, etc.)
Example: Another team’s EC2 role with s3:* can read your Databricks data.
Recommended Fix – Bucket Policy
Allow Only Databricks Role
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<ACCOUNT_ID>:role/<UC_ROLE>"
},
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::my-data-bucket",
"arn:aws:s3:::my-data-bucket/*"
]
}
Deny Everyone Else (Critical)
{
"Effect": "Deny",
"NotPrincipal": {
"AWS": "arn:aws:iam::<ACCOUNT_ID>:role/<UC_ROLE>"
},
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::my-data-bucket",
"arn:aws:s3:::my-data-bucket/*"
]
}
Security Model
IAM Role Policy → Defines WHAT actions are allowed
+
Bucket Policy → Defines WHO can access the bucket
Key Insight
- IAM policy answers: “What can this role do?”
- Bucket policy answers: “Who is allowed to access this bucket?”
Without a bucket policy, you lose resource-level protection.
Final Answer
- ✔ Yes — without a bucket policy, any IAM role with s3:* can access your Databricks S3 data
- ✔ Use a bucket policy to restrict access
- ✔ Allow only the Unity Catalog role
- ✔ Add explicit deny for all other principals
This ensures a secure, enterprise-grade, zero-trust setup for Databricks on AWS.
Databricks Serverless on AWS – IAM Roles, Policies, and Security Best Practices
Databricks Serverless on AWS – IAM Roles, Policies, and Security Best Practices
Overview
In Databricks Serverless on AWS, IAM roles are required to securely enable cross-account access between the Databricks control plane and your AWS account. Unlike traditional clusters, serverless removes the need for instance profiles and instead relies on Unity Catalog and cross-account roles.
1. Cross-Account Role (Control Plane Role)
Purpose
- Allows Databricks control plane to access AWS resources
- Used for workspace validation, metadata access, and configuration
- Does NOT perform data modifications
Trust Policy (External ID Required)
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DatabricksAssumeRole",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<DATABRICKS_ACCOUNT_ID>:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "<UNIQUE_EXTERNAL_ID>"
}
}
}
]
}
Why External ID is Required
- Prevents the Confused Deputy Problem
- Ensures only your Databricks workspace can assume the role
- Mandatory for secure cross-account access
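The effect of the sts:ExternalId condition can be sketched as a simple check (a toy model; STS performs the real comparison during AssumeRole):

```python
def assume_role(trusted_external_id, presented_external_id):
    """Toy sketch of the trust-policy condition: the caller must
    present the exact external ID or the assume fails."""
    if presented_external_id != trusted_external_id:
        raise PermissionError("AssumeRole denied: ExternalId mismatch")
    return {"role": "cross-account-role", "credentials": "<temporary>"}

# A confused-deputy style caller with the wrong external ID is rejected
try:
    assume_role("expected-id", "attacker-guess")
    rejected = False
except PermissionError:
    rejected = True
assert rejected

# The legitimate workspace, configured with the matching ID, succeeds
assert assume_role("expected-id", "expected-id")["role"] == "cross-account-role"
```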
Permissions Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3ReadAccessForValidation",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": "arn:aws:s3:::my-data-bucket"
},
{
"Sid": "GlueReadAccess",
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetTable",
"glue:GetTables",
"glue:GetPartitions"
],
"Resource": "*"
},
{
"Sid": "CloudWatchReadLogs",
"Effect": "Allow",
"Action": [
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
],
"Resource": "*"
}
]
}
Why These Permissions
- s3:ListBucket – Validate bucket existence
- s3:GetBucketLocation – Ensure region alignment
- glue:Get* – Read metadata for tables
- logs:Describe* – Optional monitoring and debugging
Security Note: No write access is granted to ensure least privilege.
2. Unity Catalog Storage Credential Role (Data Access Role)
Purpose
- Provides data access for serverless compute
- Used by Unity Catalog for governance
- Replaces instance profile roles
Trust Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DatabricksUnityCatalogAccess",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<DATABRICKS_ACCOUNT_ID>:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "<UNIQUE_EXTERNAL_ID>"
}
}
}
]
}
Permissions Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3DataAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::my-data-bucket/*"
},
{
"Sid": "S3ListAccess",
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::my-data-bucket"
}
]
}
Why These Permissions
- s3:GetObject – Read data
- s3:PutObject – Write data
- s3:DeleteObject – Cleanup/overwrite
- s3:ListBucket – Required for query planning
Optional: KMS Permissions
If your S3 bucket uses encryption:
{
"Effect": "Allow",
"Action": [
"kms:Decrypt",
"kms:Encrypt",
"kms:GenerateDataKey"
],
"Resource": "<KMS_KEY_ARN>"
}
Why Needed
- Decrypt data during reads
- Encrypt data during writes
What You Should NOT Include
- s3:* – Too broad
- iam:* – Security risk
- ec2:* – Not required in serverless
- glue:* (write) – Prevent schema tampering
Architecture Summary
Databricks Serverless Compute
│
▼
Assume Role (with External ID)
│
├── Cross-Account Role → Metadata access
└── Storage Credential Role → S3 data access
Key Takeaways
- External ID is mandatory for security
- No instance profile role in serverless
- Cross-account role = control plane access
- Unity Catalog role = data plane access
- Strict least privilege must be enforced
Final Summary
- Cross-account role (read-only, control plane)
- Unity Catalog storage credential role (data access)
- Optional KMS permissions
This setup ensures a secure, scalable, and enterprise-ready Databricks Serverless architecture on AWS.
Friday, 13 March 2026
Databricks Roles Full Reference Matrix
Databricks Roles – Full Reference Matrix
This table includes Workspace Roles, Account Roles, and Unity Catalog Roles with exact capabilities.
| Role | Category | Capabilities / Permissions | Notes |
|---|---|---|---|
| Workspace Admin | Workspace | | Full control of workspace; does NOT grant automatic data access in Unity Catalog |
| User | Workspace | | Cannot manage other users or workspace settings |
| Can Manage / Job Creator | Workspace | | Limited admin; cannot manage other users or workspace-wide settings |
| Viewer | Workspace | | No write permissions |
| Account Admin | Account | | Full control over account; workspace-level roles must still be respected |
| Billing / Support Roles | Account | | Cannot manage workspace or data; read-only account permissions |
| Metastore Admin | Unity Catalog | | Full control over UC metadata; does NOT give workspace admin rights |
| Catalog Owner | Unity Catalog | | Limited to one catalog; cannot manage other catalogs |
| Schema Owner | Unity Catalog | | Cannot manage catalog-level permissions |
| Volume Owner | Unity Catalog | | Access to volume paths only |
| Data Access Roles (SELECT / MODIFY / USAGE) | Unity Catalog | | Applied per-object; separate from workspace admin rights |
Databricks on AWS – Least Privilege Permission Matrix
Databricks on AWS – Least Privilege Permission Matrix
This matrix describes the minimal AWS and Databricks permissions required to create or manage common platform resources when using Databricks on AWS. The goal is to follow enterprise least-privilege security principles.
| Resource | Primary Owner Role | Required AWS Permissions | Required Databricks Permissions | Purpose | Security / Least Privilege Notes |
|---|---|---|---|---|---|
| Workspace | Platform Admin | iam:CreateRole, iam:AttachRolePolicy, ec2:CreateVpc, s3:CreateBucket | Account Admin | Create Databricks workspace | Automate using Terraform and restrict to platform team |
| Cross Account IAM Role | AWS Cloud Admin | iam:CreateRole, iam:PutRolePolicy, sts:AssumeRole | None | Allows Databricks control plane access | Trust only Databricks account |
| Root Storage (DBFS) | AWS Cloud Admin | s3:CreateBucket, s3:PutBucketPolicy | None | Workspace default storage | Enable encryption and versioning |
| Unity Catalog Metastore | Data Platform Admin | s3:GetObject, s3:PutObject, s3:ListBucket | Metastore Admin | Central governance metadata store | Dedicated metastore bucket |
| Metastore Assignment | Platform Admin | None | Account Admin | Attach metastore to workspace | Single metastore per region recommended |
| Storage Credential | Data Platform Admin | iam:PassRole, sts:AssumeRole | CREATE STORAGE CREDENTIAL | Connect Unity Catalog to S3 | IAM role should allow only specific S3 path |
| External Location | Data Governance Admin | s3:GetObject, s3:PutObject | CREATE EXTERNAL LOCATION | Expose S3 path to Unity Catalog | Use path-level permissions |
| Catalog | Data Governance Admin | Access to storage location | CREATE CATALOG | Top governance layer | One catalog per domain recommended |
| Schema | Data Owner | None | CREATE SCHEMA | Database container | Grant schema-level privileges |
| Delta Table | Data Engineer | S3 read/write | CREATE TABLE | Structured table storage | Use Unity Catalog governance |
| External Table | Data Engineer | S3 read | CREATE TABLE | Reference external dataset | Avoid direct S3 access |
| Notebook | Data Engineer / Analyst | None | Workspace Editor | Analytics code | Store production code in Git |
| Git Repo Integration | Developer | None | Workspace Editor | Version control integration | Use GitHub / GitLab PAT |
| Job / Workflow | Data Engineer | None | CREATE JOB | Automated pipelines | Define jobs as code |
| Cluster | Platform Admin | ec2:RunInstances, iam:PassRole | CREATE CLUSTER | Compute resource | Restrict using cluster policies |
| SQL Warehouse | Data Engineer | None | CREATE SQL WAREHOUSE | Serverless SQL analytics | Limit compute size via policies |
| Cluster Policy | Platform Admin | None | CREATE CLUSTER POLICY | Restrict compute usage | Important governance control |
| Feature Store Table | ML Engineer | S3 read/write | CREATE TABLE | Machine learning features | Stored as Delta tables |
| ML Model Registry | ML Engineer | S3 artifact storage | CREATE MODEL | Track ML model versions | Store artifacts in secure bucket |
| Streaming Checkpoints | Data Engineer | s3:PutObject, s3:GetObject | Job permission | Streaming progress tracking | Separate checkpoint directory |
| Unity Catalog Volume | Data Platform Admin | S3 access | CREATE VOLUME | File storage governance | Alternative to DBFS |
| Audit Logs | Security Team | S3 write | Account Admin | Security auditing | Send logs to SIEM |
| PrivateLink Networking | AWS Cloud Admin | ec2:CreateVpcEndpoint | Account Admin | Private connectivity | Required for highly secure environments |
| DBFS File Upload | User | s3:PutObject | Workspace User | Temporary file storage | Avoid for production data |
Databricks Architecture Matrix with DR and Terraform Resources
Databricks Architecture Matrix (Serverless on AWS)
This document shows where major Databricks components live when using Serverless on AWS, including Disaster Recovery strategies and Terraform resources used for automation.
Control Plane Components
| Component | Purpose | Where It Runs | Plane | DR Best Practice | Terraform Resource |
|---|---|---|---|---|---|
| Workspace | Main analytics workspace | Databricks SaaS | Control Plane | Create secondary workspace in another region | databricks_mws_workspaces |
| Users | User identity | Databricks account | Control Plane | Use centralized IdP | databricks_user |
| Groups | Access management | Databricks account | Control Plane | Manage via SCIM | databricks_group |
| Group Membership | User-group association | Databricks account | Control Plane | Recreate from IaC | databricks_group_member |
| Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Store notebooks in Git | databricks_notebook |
| Repos (Git Integration) | Source code integration | Workspace metadata | Control Plane | Keep Git remote as source of truth | databricks_repo |
| Job Scheduler | Pipeline scheduling | Databricks control services | Control Plane | Define jobs as code | databricks_job |
| Cluster Configuration | Compute definition | Databricks control services | Control Plane | Recreate clusters via IaC | databricks_cluster |
| SQL Warehouse | Serverless SQL endpoint | Databricks control services | Control Plane | Recreate warehouse in DR region | databricks_sql_endpoint |
| Unity Catalog Metastore | Metadata store | Databricks metadata service | Control Plane | Replicate configuration | databricks_metastore |
| Unity Catalog Catalog | Top level data container | Databricks governance service | Control Plane | Recreate catalogs | databricks_catalog |
| Unity Catalog Schema | Database layer | Databricks governance service | Control Plane | Recreate schema structure | databricks_schema |
| Permissions | Access control policies | Databricks governance service | Control Plane | Store as code | databricks_grants |
| Model Registry | ML model version tracking | Databricks metadata services | Control Plane | Replicate model metadata | databricks_mlflow_model |
| Feature Store Metadata | ML feature definitions | Databricks metadata services | Control Plane | Store definitions in Git | databricks_feature_table |
Data Plane Components (AWS)
| Component | Purpose | Where It Runs | Plane | DR Best Practice | Terraform Resource |
|---|---|---|---|---|---|
| S3 Data Lake | Primary storage | AWS S3 | Data Plane | Enable cross-region replication | aws_s3_bucket |
| Delta Tables | Structured data storage | S3 | Data Plane | Replicate bucket | aws_s3_bucket |
| DBFS Root Storage | Databricks filesystem | S3 | Data Plane | Enable bucket versioning | aws_s3_bucket |
| MLflow Artifact Storage | Stores ML models | S3 | Data Plane | Replicate artifact bucket | aws_s3_bucket |
| Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint folders | aws_s3_bucket |
| Feature Store Data | ML training features | S3 | Data Plane | Enable replication | aws_s3_bucket |
| Execution Logs | Spark logs | S3 | Data Plane | Central logging system | aws_s3_bucket |
| Serverless Spark Compute | Job execution | AWS compute | Data Plane | Use multi-region workspace | N/A (managed by Databricks) |
| Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required | N/A |
Architecture Flow
Databricks Architecture Matrix with DR Best Practices
Databricks Architecture Matrix (Serverless on AWS) with DR Best Practices
This document explains where major Databricks components reside when running Databricks Serverless on AWS and the recommended Disaster Recovery (DR) strategy for each component.
Control Plane Components
| Component | Purpose | Where It Runs | Plane | DR Best Practice |
|---|---|---|---|---|
| Workspace UI | User interface for notebooks and jobs | Databricks SaaS | Control Plane | Create secondary workspace in another region |
| Workspace APIs | Automation APIs | Databricks SaaS | Control Plane | Automate infrastructure using Terraform |
| Users & Groups | User identity management | Databricks account services | Control Plane | Use centralized IdP like Okta/Azure AD |
| Authentication / SSO | Login via external identity provider | Databricks account services | Control Plane | Configure SSO redundancy at IdP level |
| Permissions / RBAC | Access control policies | Databricks control services | Control Plane | Store policies as code using Terraform |
| Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Sync notebooks with Git repositories |
| Notebook Outputs | Charts and query results | Workspace storage | Control Plane | Do not rely on outputs; regenerate from data |
| Workspace Files | Files uploaded to workspace | Workspace storage | Control Plane | Store important files in S3 or Git |
| Repos (Git Integration) | Git source control integration | Workspace metadata | Control Plane | Maintain source code in GitHub/GitLab |
| Job Scheduler | Schedules workflows | Databricks orchestration service | Control Plane | Define jobs using Infrastructure-as-Code |
| Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane | Export workflows via API and Terraform |
| SQL Query Planner | SQL optimization engine | Databricks query services | Control Plane | No DR needed (managed by Databricks) |
| SQL Warehouse Management | Serverless SQL management | Databricks control services | Control Plane | Recreate warehouses in secondary region |
| Unity Catalog | Central governance system | Databricks governance service | Control Plane | Replicate catalogs configuration using scripts |
| Metastore | Metadata storage | Databricks metadata services | Control Plane | Export metadata periodically |
| Data Lineage | Tracks data relationships | Databricks governance services | Control Plane | Export lineage metadata via APIs |
| Audit Logs | Security logs | Databricks governance services | Control Plane | Send logs to centralized SIEM storage |
| Cluster Management | Compute lifecycle management | Databricks control services | Control Plane | Recreate clusters via automation |
| Feature Store Metadata | Feature definitions | Databricks metadata services | Control Plane | Backup definitions in Git |
| Model Registry Metadata | ML model tracking | Databricks metadata services | Control Plane | Replicate registry configuration |
| Lakehouse Monitoring Metadata | Dataset monitoring metrics | Databricks monitoring services | Control Plane | Export monitoring metrics |
| Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane | Recreate vector indexes from embeddings |
Data Plane Components (Customer AWS)
| Component | Purpose | Where It Runs | Plane | DR Best Practice |
|---|---|---|---|---|
| Serverless Spark Compute | Executes jobs | AWS compute | Data Plane | Deploy in multi-region workspace |
| SQL Warehouse Compute | SQL query execution | AWS compute | Data Plane | Provision warehouses in secondary region |
| Delta Table Data | Table storage | S3 | Data Plane | Enable S3 cross-region replication |
| Managed Tables | Managed table storage | S3 | Data Plane | Use versioned S3 buckets |
| External Tables | External dataset storage | S3 | Data Plane | Replicate underlying S3 storage |
| DBFS Root | Databricks filesystem | S3 | Data Plane | Enable bucket replication |
| Unity Catalog Managed Storage | Catalog table storage | S3 | Data Plane | Cross-region replication |
| Unity Catalog Volumes | Governed file storage | S3 | Data Plane | Replicate S3 buckets |
| MLflow Model Artifacts | ML models | S3 | Data Plane | Replicate artifact bucket |
| Feature Store Data | ML feature datasets | S3 | Data Plane | S3 replication and versioning |
| Vector Search Index Data | Embedding storage | S3 | Data Plane | Rebuild indexes from replicated embeddings |
| Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint directories |
| Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required (recomputed) |
| Job Execution Logs | Spark logs | S3 | Data Plane | Send logs to centralized logging system |
| ML Training Data | Training datasets | S3 | Data Plane | Multi-region S3 replication |
| Delta Transaction Logs | Table version metadata | S3 | Data Plane | Protect using S3 versioning |
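Several of the S3-backed components above share the same DR control: cross-region replication. As a minimal sketch (the role ARN and bucket names are hypothetical placeholders), this is the shape of the configuration you would pass to boto3's `put_bucket_replication`; versioning must already be enabled on both buckets:

```python
def replication_config(role_arn: str, dest_bucket_arn: str) -> dict:
    """Build an S3 cross-region replication configuration.

    Pass the result to boto3, e.g.:
        s3.put_bucket_replication(Bucket="source-bucket",
                                  ReplicationConfiguration=cfg)
    """
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [
            {
                "ID": "dr-replication",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }

cfg = replication_config(
    "arn:aws:iam::111122223333:role/s3-replication",  # hypothetical role
    "arn:aws:s3:::demo-dr-bucket-us-west-2",          # hypothetical DR bucket
)
```

The same rule applies unchanged to the Delta table, checkpoint, artifact, and feature-store buckets listed above.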
Architecture Flow
Databricks Architecture Matrix
Databricks Architecture Matrix (Serverless on AWS)
This document explains where major Databricks components reside when running Databricks Serverless on AWS. Databricks architecture is divided into two planes:
- Control Plane – Managed by Databricks
- Data Plane – Runs in the customer AWS account
Control Plane Components
| Component | Purpose / What It Does | Where It Runs or Is Stored | Plane |
|---|---|---|---|
| Workspace UI | Web interface to access notebooks, jobs, dashboards | Databricks SaaS infrastructure | Control Plane |
| Workspace APIs | REST APIs for automation, Terraform, CLI | Databricks SaaS | Control Plane |
| Users & Groups | Identity and user management | Databricks account services | Control Plane |
| Authentication / SSO | Integrates with IdP such as Okta or Azure AD | Databricks account services | Control Plane |
| Permissions / RBAC | Access control policies | Databricks control services | Control Plane |
| Notebook Source Code | Python / SQL / Scala notebooks | Workspace storage | Control Plane |
| Notebook Outputs | Charts and result previews | Workspace storage | Control Plane |
| Workspace Files | Files uploaded to workspace | Workspace storage | Control Plane |
| Repos (Git Integration) | GitHub / Git integration | Workspace metadata | Control Plane |
| Job Scheduler | Schedules pipelines and jobs | Databricks orchestration service | Control Plane |
| Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane |
| SQL Query Planner | Optimizes SQL queries | Databricks query services | Control Plane |
| SQL Warehouse Management | Manages serverless SQL endpoints | Databricks control services | Control Plane |
| Unity Catalog | Central governance system | Databricks governance service | Control Plane |
| Metastore | Stores catalog and table metadata | Databricks metadata services | Control Plane |
| Data Lineage | Tracks data dependencies | Databricks governance services | Control Plane |
| Audit Logs | Security and governance logs | Databricks governance services | Control Plane |
| Cluster Management | Manages compute lifecycle | Databricks control services | Control Plane |
| Feature Store Metadata | ML feature definitions | Databricks metadata services | Control Plane |
| Model Registry Metadata | Tracks ML model versions | Databricks metadata services | Control Plane |
| Lakehouse Monitoring Metadata | Tracks dataset quality | Databricks monitoring services | Control Plane |
| Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane |
Data Plane Components (Customer AWS Account)
| Component | Purpose / What It Does | Where It Runs or Is Stored | Plane |
|---|---|---|---|
| Serverless Spark Compute | Runs notebooks and jobs | AWS compute instances | Data Plane |
| Databricks SQL Warehouse Compute | Executes SQL queries | AWS compute instances | Data Plane |
| Delta Table Data | Actual table data | S3 | Data Plane |
| Managed Tables | Databricks managed tables | S3 | Data Plane |
| External Tables | Tables referencing external datasets | S3 | Data Plane |
| DBFS Root | Databricks File System root | S3 bucket | Data Plane |
| Unity Catalog Managed Storage | Table storage governed by Unity Catalog | S3 | Data Plane |
| Unity Catalog Volumes | File governance storage | S3 | Data Plane |
| MLflow Model Artifacts | ML models and artifacts | S3 | Data Plane |
| Feature Store Data | ML feature datasets | S3 | Data Plane |
| Vector Search Index Data | Vector embeddings | S3 | Data Plane |
| Streaming Checkpoints | Streaming job progress | S3 | Data Plane |
| Temporary Spark Shuffle Data | Intermediate processing data | Local disk / S3 | Data Plane |
| Job Execution Logs | Spark execution logs | S3 | Data Plane |
| ML Training Data | Training datasets | S3 | Data Plane |
| Delta Transaction Logs | Table versioning metadata | S3 | Data Plane |
Architecture Flow
Key Architecture Rule
| Type | Location |
|---|---|
| Metadata | Databricks Control Plane |
| Data | Customer AWS S3 |
| Compute | Customer AWS |
| Governance Policies | Unity Catalog (Control Plane) |
Monday, 26 January 2026
Databricks APIs – Overview and Python Examples
Databricks APIs – Architecture, Types, and Python Examples
Databricks provides a comprehensive set of REST APIs to automate platform setup, workspace administration, data governance, compute management, and analytics workflows. These APIs are commonly used for infrastructure automation, CI/CD pipelines, and application onboarding.
Common Python Setup
import requests
import json
DATABRICKS_HOST = "https://<databricks-instance>"
TOKEN = "<DATABRICKS_TOKEN>"
HEADERS = {
"Authorization": f"Bearer {TOKEN}",
"Content-Type": "application/json"
}
1. Account API
Purpose: Manage Databricks accounts and workspaces.
Documentation: Databricks Account API
Create a Workspace
url = f"{DATABRICKS_HOST}/api/2.0/accounts/<ACCOUNT_ID>/workspaces"
payload = {
"workspace_name": "dev-workspace",
"aws_region": "us-east-1",
"credentials_id": "cred-id",
"storage_configuration_id": "storage-id",
"network_id": "network-id"
}
response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())
2. SCIM API
Purpose: Manage users, groups, and service principals.
Documentation: Databricks SCIM API
Create a Service Principal
url = f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/ServicePrincipals"
payload = {
"displayName": "my-app-sp"
}
response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())
3. Unity Catalog API
Purpose: Centralized data governance for catalogs, schemas, and tables.
Documentation: Unity Catalog API
Create a Catalog
url = f"{DATABRICKS_HOST}/api/2.1/unity-catalog/catalogs"
payload = {
"name": "sales_catalog",
"comment": "Catalog for sales domain"
}
response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())
Grant Catalog Permission
url = f"{DATABRICKS_HOST}/api/2.1/unity-catalog/permissions/catalogs/sales_catalog"
payload = {
"changes": [
{
"principal": "data_analysts",
"add": ["USE_CATALOG"]
}
]
}
response = requests.patch(url, headers=HEADERS, json=payload)
print(response.json())
4. Workspace API
Purpose: Manage clusters, jobs, notebooks, and workspace objects.
Documentation: Workspace API
Create a Cluster
url = f"{DATABRICKS_HOST}/api/2.0/clusters/create"
payload = {
"cluster_name": "demo-cluster",
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 1,
"autotermination_minutes": 30
}
response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())
5. Jobs API
Purpose: Orchestrate batch and streaming workloads.
Documentation: Jobs API
Create a Job
url = f"{DATABRICKS_HOST}/api/2.1/jobs/create"
payload = {
"name": "sample-job",
"tasks": [
{
"task_key": "run_notebook",
"notebook_task": {
"notebook_path": "/Shared/sample_notebook"
},
"new_cluster": {
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 1
}
}
]
}
response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())
6. Repos API
Purpose: Integrate Git repositories.
Documentation: Repos API
Create a Repo
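The code body for this example is missing from the post. As a sketch in the style of the previous sections, assuming the standard `POST /api/2.0/repos` endpoint (the Git remote URL and workspace path below are placeholders):

```python
def create_repo_request(host: str):
    """Build the URL and JSON payload for the Repos API create call."""
    url = f"{host}/api/2.0/repos"
    payload = {
        "url": "https://github.com/<org>/<repo>.git",  # placeholder Git remote
        "provider": "gitHub",
        "path": "/Repos/Shared/demo-repo",             # placeholder workspace path
    }
    return url, payload

url, payload = create_repo_request("https://<databricks-instance>")
# To execute for real:
# response = requests.post(url, headers=HEADERS, json=payload)
# print(response.json())
```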
Databricks APIs – Types & Summary
Types of Databricks APIs
Databricks provides a rich set of APIs to manage both the platform and workspace workloads. These APIs are categorized based on their scope and functionality, and they are critical for automation, CI/CD, governance, and onboarding of applications.
1. Account-Level APIs (Control Plane)
These APIs manage the Databricks account itself. They allow platform engineers to create and configure workspaces, set up Unity Catalog (metastores), manage networking, storage credentials, and service principals.
Official Databricks Account API Docs
2. Workspace-Level APIs (Data Plane)
These APIs operate inside a single workspace to manage data workloads such as:
- Clusters, Jobs, and Libraries
- DBFS file storage
- Secrets & instance pools
- SQL Warehouses
Official Workspace REST API Docs
3. Unity Catalog / Metastore APIs
These APIs manage metadata, governance, and data access across multiple workspaces:
- Create catalogs, schemas, tables, and external locations
- Grant permissions at table, column, or catalog level
- Attach or detach workspaces to a metastore
4. Repos API
Used to manage Git repositories integrated with Databricks (GitHub, GitLab, Azure DevOps). Enables CI/CD automation for notebooks.
5. Tokens & Authentication APIs
Used to manage personal access tokens (PATs) and service principal tokens for automation pipelines.
6. SCIM API
Manages users, groups, and service principals for identity management and enterprise compliance. Databricks implements the SCIM 2.0 standard.
7. SQL API
Enables programmatic execution of SQL queries and management of SQL endpoints / warehouses.
8. MLflow API
Manages the machine learning lifecycle including experiments, runs, and model registry.
Summary Table of Databricks APIs
| API Type | Scope | Purpose | Official Documentation |
|---|---|---|---|
| Account API | Account | Platform setup & governance: workspaces, metastore, network, credentials, service principals | Docs |
| Workspace REST API | Workspace | Data plane workloads: clusters, jobs, DBFS, libraries, secrets | Docs |
| SCIM API | Workspace / Account | Identity management: users, groups, service principals | Docs |
| Unity Catalog / Metastore API | Account + Workspaces | Data governance, catalogs, schemas, tables, permissions, external locations | Docs |
| Repos API | Workspace | Git repository integration for CI/CD | Docs |
| Tokens / Authentication API | Account / Workspace | Manage PATs & service principal tokens | Docs |
| SQL API | Workspace | Programmatic SQL execution & SQL endpoint management | Docs |
| MLflow API | Workspace | Machine learning lifecycle: experiments, runs, model registry | Docs |
For the full index of all Databricks APIs and SDKs: Databricks API Reference
Friday, 16 January 2026
Enterprise Databricks on AWS – Zero Trust, Unity Catalog & Audit-Ready Architecture
Enterprise Databricks on AWS: Zero-Trust, Unity Catalog & Audit-Ready Architecture
This document explains how to design and implement Databricks on AWS using Zero-Trust principles, Unity Catalog enforced security, cross-account data sharing, and an audit-ready architecture.
1. Zero-Trust Databricks Deployment (AWS)
What Zero-Trust Means for Databricks
- No public IPs
- No inbound internet access
- Explicit identity-based access only
- All access is authenticated, authorized, and logged
Core AWS Components
- Dedicated VPC per Databricks workspace
- Private subnets only
- VPC Endpoints (PrivateLink)
- IAM roles with least privilege
- Security Groups with deny-by-default
VPC Design
VPC (10.0.0.0/16)
├── Private Subnet A (10.0.1.0/24) - Databricks Compute
├── Private Subnet B (10.0.2.0/24) - Databricks Compute
├── VPC Endpoint Subnet
└── No Internet Gateway
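As a sanity check before applying the layout, the stdlib `ipaddress` module can verify that the subnet plan fits inside the VPC CIDR and that subnets do not overlap (a sketch; substitute your actual ranges):

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = [ipaddress.ip_network(c) for c in ("10.0.1.0/24", "10.0.2.0/24")]

# Every private subnet must fall inside the VPC CIDR and not overlap another.
assert all(s.subnet_of(vpc) for s in subnets)
assert not subnets[0].overlaps(subnets[1])

print(subnets[0].num_addresses)  # 256 addresses per /24
```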
Required VPC Endpoints
- com.amazonaws.<region>.s3
- com.amazonaws.<region>.sts
- com.amazonaws.<region>.logs
- com.amazonaws.<region>.monitoring
- Databricks Control Plane PrivateLink endpoints
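The `<region>` placeholder expands mechanically per deployment region; a small helper (sketch) that generates the AWS service endpoint names from the list above:

```python
def endpoint_services(region: str) -> list:
    """Region-qualified AWS VPC endpoint service names for Databricks."""
    return [f"com.amazonaws.{region}.{svc}"
            for svc in ("s3", "sts", "logs", "monitoring")]

print(endpoint_services("us-east-1"))
```

The Databricks control-plane PrivateLink endpoint names are region-specific and published by Databricks, so they are not generated here.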
Security Groups
- No inbound rules
- Outbound only to:
- VPC endpoints
- Databricks control plane CIDRs
2. Unity Catalog Enforced Security
Why Unity Catalog Is Mandatory for Enterprises
- Centralized governance
- Fine-grained RBAC (catalog, schema, table, column, row)
- Cross-workspace data sharing
- Built-in auditing
Unity Catalog Core Objects
Metastore
├── Catalog (prod_sales)
│   ├── Schema (orders)
│   │   └── Table (transactions)
Metastore Setup (AWS)
- Create S3 bucket for UC storage
- Enable versioning & encryption (SSE-KMS)
- Attach IAM role to Databricks
S3 Bucket Policy:
- Allow Databricks IAM Role
- Deny public access
- Enforce TLS
RBAC Example
Group: analytics_team
Permissions:
- USE CATALOG prod_sales
- USE SCHEMA prod_sales.orders
- SELECT ON TABLE prod_sales.orders.transactions
Row-Level Security (Dynamic Views)
CREATE VIEW prod_sales.orders.secure_transactions AS
SELECT * FROM prod_sales.orders.transactions
WHERE is_account_group_member(region); -- assumes group names match region values
3. Cross-Account Data Sharing (Unity Catalog)
Use Case
- Producer account owns raw data
- Consumer account reads curated data
- No data copy
Architecture
Account A (Producer)
└── Unity Catalog Metastore
└── Shared Catalog
Account B (Consumer)
└── Databricks Workspace
└── Read-only access
How Sharing Works
- Delta Sharing protocol
- IAM role trust between accounts
- Read-only permissions
Security Guarantees
- No write access
- All queries logged
- Column and row filters enforced
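When the consumer is not a Databricks workspace, the open Delta Sharing protocol authenticates with a small profile file instead. A sketch of its JSON shape (the endpoint and token values are placeholders):

```python
import json

# Delta Sharing profile file ("config.share") — values are placeholders.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://<sharing-server>/delta-sharing/",
    "bearerToken": "<token>",
}
print(json.dumps(profile, indent=2))
```

Databricks-to-Databricks sharing (the case described above) does not need this file; the metastores exchange credentials directly.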
4. Audit-Ready Architecture
Audit Requirements Covered
- Who accessed what data
- When queries were run
- From which workspace
- Using which identity
Audit Logs
- Databricks audit logs → S3
- CloudTrail for IAM & API calls
- S3 access logs
Audit Log Flow
Databricks → S3 (Audit Logs)
AWS CloudTrail → S3
S3 → SIEM / Athena / OpenSearch
What Auditors Love
- No shared credentials
- Identity-based access
- Immutable logs
- Separation of duties
5. End-to-End Control Summary
| Layer | Control |
|---|---|
| Network | Private VPC, PrivateLink, no internet |
| Identity | IAM + Databricks SCIM groups |
| Compute | Cluster policies & group binding |
| Data | Unity Catalog RBAC + RLS |
| Audit | Centralized logs in S3 |
Final Outcome
- Zero-trust Databricks deployment
- Centralized governance via Unity Catalog
- Secure cross-account data sharing
- Fully audit-ready enterprise platform
This architecture scales cleanly across Dev / Test / Prod, supports regulated workloads, and aligns with financial-grade security standards.
Databricks on AWS – Networking, Security & PrivateLink Architecture (Deep Dive)
Databricks on AWS – Complete Networking & Security Architecture Guide
This document explains how Databricks is deployed securely on AWS, focusing on:
- VPC & subnet design
- Control plane vs data plane
- IAM roles & instance profiles
- Security groups & traffic flow
- PrivateLink (frontend & backend)
1️⃣ Databricks Architecture Overview
Control Plane vs Data Plane
| Plane | Owned By | What Runs Here |
|---|---|---|
| Control Plane | Databricks | UI, REST APIs, Jobs scheduler, notebooks metadata |
| Data Plane | Customer AWS Account | Clusters, Spark executors, DBFS root, data access |
2️⃣ VPC Design (Customer-Managed)
Why Customer-Managed VPC?
- Network isolation
- PrivateLink support
- Compliance (SOC2, PCI, HIPAA)
Recommended VPC Layout
VPC (10.0.0.0/16)
│
├── Private Subnet A (10.0.1.0/24)
│ └── Databricks Workers
│
├── Private Subnet B (10.0.2.0/24)
│ └── Databricks Workers
│
├── Public Subnet (optional)
│ └── NAT Gateway
│
└── VPC Endpoints
├── S3
├── STS
├── Kinesis (optional)
└── Databricks PrivateLink
3️⃣ Subnets & Routing
Private Subnets
- No public IPs
- Route to NAT Gateway (only if needed)
- Preferred: VPC endpoints instead of NAT
Route Table (Private Subnet)
0.0.0.0/0 → NAT Gateway (optional)
pl-xxxxxx → Databricks PrivateLink
s3 → Gateway Endpoint
4️⃣ Security Groups (CRITICAL)
Databricks Cluster Security Group
| Direction | Port | Source | Purpose |
|---|---|---|---|
| Inbound | All | Self | Worker ↔ Worker communication |
| Outbound | 443 | 0.0.0.0/0 or VPC endpoints | Control plane, S3, APIs |
5️⃣ IAM Roles & Instance Profiles
Why IAM Roles?
- No access keys on clusters
- Least privilege data access
- Auditable via CloudTrail
Databricks EC2 Role
Trust Policy:
  Service: ec2.amazonaws.com
Permissions Policy
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::prod-data",
"arn:aws:s3:::prod-data/*"
]
}
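To keep policies like this least-privilege across many buckets, it can help to generate them rather than hand-edit JSON. A sketch (the bucket name is the example one from above):

```python
import json

def s3_read_write_statement(bucket: str) -> dict:
    """Least-privilege S3 statement scoped to one bucket and its objects."""
    return {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
    }

policy = {
    "Version": "2012-10-17",
    "Statement": [s3_read_write_statement("prod-data")],
}
print(json.dumps(policy, indent=2))
```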
Instance Profile
- IAM Role → Instance Profile
- Attached to Databricks clusters
6️⃣ PrivateLink Architecture
Frontend PrivateLink
- Users access Databricks UI privately
- No public internet exposure
Backend PrivateLink
- Clusters talk to control plane privately
- No NAT gateway required
Required VPC Endpoints
| Endpoint | Type |
|---|---|
| Databricks Control Plane | Interface |
| S3 | Gateway |
| STS | Interface |
| CloudWatch | Interface |
7️⃣ Traffic Flow (End-to-End)
User Browser
  ↓ (PrivateLink)
Databricks Control Plane
  ↓ (PrivateLink)
Cluster Driver (Private Subnet)
  ↓
S3 via VPC Endpoint
8️⃣ Common Enterprise Decisions
| Decision | Recommendation |
|---|---|
| Public vs Private workspace | Private (PrivateLink) |
| NAT Gateway | Avoid if endpoints available |
| IAM Users | Never |
| Data access | IAM Roles + Unity Catalog |
9️⃣ What This Enables Next
- Zero-trust Databricks deployment
- Unity Catalog enforced security
- Cross-account data sharing
- Audit-ready architecture
10️⃣ Typical Enterprise Follow-Up Topics
- Terraform modules for networking
- Private DNS for Databricks
- Multi-account AWS architecture
- Cost & network optimization
Thursday, 15 January 2026
Databricks REST API – Complete Enterprise Automation Guide (Python + AWS)
Databricks REST API – Complete Enterprise Automation Guide
This guide documents the most commonly used Databricks REST API endpoints with working Python examples for enterprise automation on AWS.
0️⃣ Authentication & Base Configuration
Account-Level APIs
Base URL: https://accounts.cloud.databricks.com
Auth: Account PAT
Workspace-Level APIs
Base URL: https://dbc-xxxx.cloud.databricks.com
Auth: Workspace PAT
import requests
ACCOUNT_ID = "xxxx"
ACCOUNT_HOST = "https://accounts.cloud.databricks.com"
WORKSPACE_HOST = "https://dbc-xxxx.cloud.databricks.com"
ACCOUNT_HEADERS = {
"Authorization": "Bearer ACCOUNT_TOKEN",
"Content-Type": "application/json"
}
WORKSPACE_HEADERS = {
"Authorization": "Bearer WORKSPACE_TOKEN",
"Content-Type": "application/json"
}
1️⃣ Identity & SCIM APIs
| Endpoint | Purpose |
|---|---|
| POST /scim/v2/Users | Create user |
| GET /scim/v2/Users | List users |
| POST /scim/v2/Groups | Create group |
| PATCH /scim/v2/Groups/{id} | Add/remove members |
Create User
url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Users"
payload = {
"userName": "alice@company.com",
"displayName": "Alice",
"active": True
}
requests.post(url, headers=ACCOUNT_HEADERS, json=payload).raise_for_status()
Create Group
url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Groups"
payload = {"displayName": "data-engineers"}
group = requests.post(url, headers=ACCOUNT_HEADERS, json=payload).json()
2️⃣ Workspace (Account-Level) APIs
| Endpoint | Description |
|---|---|
| POST /workspaces | Create workspace |
| GET /workspaces | List workspaces |
| POST /permissionassignments | Assign groups to workspace |
Create Workspace
url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/workspaces"
payload = {
"workspace_name": "prod",
"aws_region": "us-east-1",
"credentials_id": "cred-123",
"storage_configuration_id": "storage-123",
"network_id": "network-123"
}
workspace = requests.post(url, headers=ACCOUNT_HEADERS, json=payload).json()
3️⃣ Cluster APIs
| Endpoint | Description |
|---|---|
| POST /clusters/create | Create cluster |
| GET /clusters/list | List clusters |
| POST /clusters/start | Start cluster |
| POST /clusters/delete | Delete cluster |
Create Cluster
url = f"{WORKSPACE_HOST}/api/2.0/clusters/create"
payload = {
"cluster_name": "engineering",
"spark_version": "13.3.x-scala2.12",
"node_type_id": "m5.xlarge",
"num_workers": 2
}
cluster = requests.post(url, headers=WORKSPACE_HEADERS, json=payload).json()
Set Cluster Permissions
url = f"{WORKSPACE_HOST}/api/2.0/permissions/clusters/{cluster['cluster_id']}"
payload = {
"access_control_list": [
{
"group_name": "data-engineers",
"permission_level": "CAN_ATTACH_TO"
}
]
}
requests.patch(url, headers=WORKSPACE_HEADERS, json=payload)
4️⃣ Jobs API
| Endpoint | Purpose |
|---|---|
| POST /jobs/create | Create job |
| POST /jobs/run-now | Run job |
| GET /jobs/list | List jobs |
Create Job
url = f"{WORKSPACE_HOST}/api/2.0/jobs/create"
payload = {
"name": "etl-job",
"new_cluster": {
"spark_version": "13.3.x-scala2.12",
"node_type_id": "m5.large",
"num_workers": 2
},
"notebook_task": {
"notebook_path": "/Shared/etl"
}
}
job = requests.post(url, headers=WORKSPACE_HEADERS, json=payload).json()
5️⃣ SQL & Warehouses API
| Endpoint | Description |
|---|---|
| POST /sql/warehouses | Create SQL warehouse |
| POST /sql/statements | Execute SQL |
Execute SQL
url = f"{WORKSPACE_HOST}/api/2.0/sql/statements"
payload = {
"statement": "SELECT current_user(), current_date()",
"warehouse_id": "wh-123"
}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)
6️⃣ DBFS & Workspace APIs
| Endpoint | Description |
|---|---|
| POST /dbfs/put | Upload file |
| GET /workspace/list | List notebooks |
| POST /workspace/import | Import notebook |
Upload File to DBFS
url = f"{WORKSPACE_HOST}/api/2.0/dbfs/put"
payload = {
"path": "/tmp/data.txt",
"contents": "SGVsbG8="
}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)
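The `contents` field must be base64-encoded; the `SGVsbG8=` above decodes to `Hello`. A sketch of preparing the payload from local bytes:

```python
import base64

def dbfs_put_payload(path: str, data: bytes) -> dict:
    """Base64-encode file contents for POST /api/2.0/dbfs/put."""
    return {"path": path, "contents": base64.b64encode(data).decode("ascii")}

payload = dbfs_put_payload("/tmp/data.txt", b"Hello")
print(payload["contents"])  # SGVsbG8=
```

Note that `dbfs/put` with inline contents is limited to small files; larger uploads use the create/add-block/close streaming endpoints.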
7️⃣ Unity Catalog APIs (Most Used)
| Endpoint | Description |
|---|---|
| POST /unity-catalog/catalogs | Create catalog |
| POST /unity-catalog/schemas | Create schema |
| POST /unity-catalog/tables | Create table |
| PATCH /unity-catalog/permissions | Grant access |
Create Catalog
url = f"{WORKSPACE_HOST}/api/2.1/unity-catalog/catalogs"
payload = {"name": "finance"}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)
Grant Table Access
url = f"{WORKSPACE_HOST}/api/2.1/unity-catalog/permissions/table/finance.payments.txns"
payload = {
"changes": [{
"principal": "data-scientists",
"add": ["SELECT"]
}]
}
requests.patch(url, headers=WORKSPACE_HEADERS, json=payload)
8️⃣ Tokens, Secrets, Repos
| Endpoint | Use |
|---|---|
| POST /token/create | Create PAT |
| POST /secrets/scopes/create | Create secret scope |
| POST /repos | Create repo |
9️⃣ Enterprise Best Practices
- Terraform for bootstrap & security
- Python APIs for day-2 operations
- Unity Catalog for ALL data access
- No IAM-based data access
Next Topics You Can Publish
- Databricks CI/CD pipelines
- API error handling & retries
- Zero-trust data architecture
- Cross-account Unity Catalog sharing
Enterprise Databricks on AWS – Identity, Workspace Isolation, Unity Catalog & RBAC (Terraform-Only)
Enterprise Databricks on AWS – Terraform-First Architecture
This article explains how to build a fully automated, enterprise-grade Databricks platform on AWS using Terraform only, covering:
- SCIM & Identity automation
- Workspace creation and isolation
- Unity Catalog metastore & data isolation
- Catalog, schema, table-level RBAC
- Row-level security using dynamic views
- Cross-account AWS data sharing
High-Level Enterprise Architecture
AWS Account (Databricks Account)
│
├── Account Console
│ ├── SCIM Users & Groups (Terraform)
│ ├── Unity Catalog Metastore (Terraform)
│ └── Workspaces (Dev / QA / Prod)
│
├── AWS Account A (Prod Data)
│ ├── S3 UC Managed Location
│ └── IAM Role (External Location)
│
├── AWS Account B (Analytics)
│ └── Read-only access via UC Sharing
│
└── Azure AD / Okta
└── Identity Source (SSO + SCIM)
1. SCIM Group Automation with Terraform
Why SCIM Matters
SCIM ensures that Databricks users and groups are never created manually. Azure AD (or Okta) remains the source of truth.
Terraform – Databricks Account Provider
provider "databricks" {
alias = "account"
host = "https://accounts.cloud.databricks.com"
account_id = var.databricks_account_id
}
Create Groups (Mirrors Azure AD)
resource "databricks_group" "data_engineers" {
provider = databricks.account
display_name = "data-engineers"
}
resource "databricks_group" "data_scientists" {
provider = databricks.account
display_name = "data-scientists"
}
Assign Users (SCIM)
resource "databricks_user" "alice" {
provider = databricks.account
user_name = "alice@company.com"
}
resource "databricks_group_member" "alice_engineers" {
provider = databricks.account
group_id = databricks_group.data_engineers.id
member_id = databricks_user.alice.id
}
2. Workspace Creation & Environment Isolation
Enterprise Workspace Strategy
- One workspace per environment
- Dev cannot modify Prod
- Shared metastore across workspaces
Create Workspace (AWS)
resource "databricks_mws_workspaces" "prod" {
provider = databricks.account
workspace_name = "prod-workspace"
aws_region = "us-east-1"
credentials_id = databricks_mws_credentials.this.credentials_id
storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
}
Attach Groups to Workspace
resource "databricks_mws_permission_assignment" "prod_admins" {
provider = databricks.account
workspace_id = databricks_mws_workspaces.prod.workspace_id
principal_id = databricks_group.data_engineers.id
permissions = ["ADMIN"]
}
3. Unity Catalog Metastore – Terraform-Only
Create Metastore
resource "databricks_metastore" "main" {
provider = databricks.account
name = "enterprise-metastore"
region = "us-east-1"
storage_root = "s3://uc-metastore-root/"
}
Attach Metastore to Workspace
resource "databricks_metastore_assignment" "prod" {
provider = databricks.account
workspace_id = databricks_mws_workspaces.prod.workspace_id
metastore_id = databricks_metastore.main.id
}
4. Unity Catalog RBAC as Code (grants.tf)
Create Catalogs per Domain
resource "databricks_catalog" "finance" {
name = "finance"
}
Create Schemas
resource "databricks_schema" "payments" {
name = "payments"
catalog_name = databricks_catalog.finance.name
}
Grant Permissions
resource "databricks_grants" "finance_read" {
catalog = databricks_catalog.finance.name
grant {
principal = "data-scientists"
privileges = ["USE_CATALOG"]
}
}
5. Row-Level Security (Dynamic Views)
Use Case
- US team sees US data
- EU team sees EU data
Dynamic View
CREATE OR REPLACE VIEW finance.payments.secure_payments AS
SELECT * FROM finance.payments.raw
WHERE is_account_group_member(concat(region, '_team')); -- US rows visible only to members of the us_team group
6. Cross-Account AWS Sharing with Unity Catalog
Producer Account (Prod)
CREATE SHARE finance_share;
ALTER SHARE finance_share ADD TABLE finance.payments.raw;
Consumer Account
CREATE CATALOG finance_shared
USING SHARE <provider_name>.finance_share;
7. Decision Diagrams for Architects
Identity Decision
Azure AD
├── Manual Users ❌
└── SCIM + SSO ✅
Data Access Decision
IAM Policies ❌
Unity Catalog Grants ✅
Security Model
Workspace ACLs → Compute
Unity Catalog → Data
What This Enables Next
- Prod data read-only from Dev
- Cluster RBAC enforced
- Auditor-friendly access logs
- Multi-account AWS sharing
Suggested Multi-Post Series
- Identity & SCIM Automation
- Workspace Isolation Strategy
- Unity Catalog Deep Dive
- RBAC & Data Security Patterns
- Cross-Account Data Sharing