Friday, 13 March 2026

Databricks Roles – Full Reference Matrix

This reference covers Workspace Roles, Account Roles, and Unity Catalog Roles with their exact capabilities.

Workspace Admin (Workspace)
  • Manage users and groups
  • Assign workspace roles
  • Create/manage clusters
  • Restart/terminate all clusters
  • Create/manage jobs and workflows
  • Create SQL warehouses
  • Manage secrets, libraries, instance profiles
  • Access DBFS (read/write)
  • Run notebooks and jobs
  Notes: Full control of the workspace; does NOT grant automatic data access in Unity Catalog

User (Workspace)
  • Create/edit/run own notebooks
  • Create/run jobs
  • Create clusters (if allowed by cluster policies)
  • Access DBFS (read/write)
  • Use SQL warehouses (if permitted)
  Notes: Cannot manage other users or workspace settings

Can Manage / Job Creator (Workspace)
  • Create/manage own jobs and clusters
  • Run notebooks
  • Upload files to DBFS
  Notes: Limited admin; cannot manage other users or workspace-wide settings

Viewer (Workspace)
  • Read-only access to notebooks and dashboards
  • View clusters and jobs
  • Read access to DBFS (if allowed)
  Notes: No write permissions

Account Admin (Account)
  • Create and delete workspaces
  • Assign workspace admins
  • Manage metastore assignments
  • Access account-wide audit logs
  • Manage billing/usage
  Notes: Full control over the account; workspace-level roles must still be respected

Billing / Support Roles (Account)
  • View usage and billing
  • Access technical support
  Notes: Cannot manage workspaces or data; read-only account permissions

Metastore Admin (Unity Catalog)
  • Create catalogs and schemas
  • Create storage credentials and external locations
  • Assign catalog-level permissions
  • Grant/revoke data access
  Notes: Full control over UC metadata; does NOT grant workspace admin rights

Catalog Owner (Unity Catalog)
  • Manage a catalog and its contained schemas
  • Grant/revoke access at the catalog level
  Notes: Limited to one catalog; cannot manage other catalogs

Schema Owner (Unity Catalog)
  • Manage a schema and its contained tables/views
  • Grant/revoke access at the schema level
  Notes: Cannot manage catalog-level permissions

Volume Owner (Unity Catalog)
  • Manage managed volumes (file storage)
  • Grant/revoke access to volumes
  Notes: Access to volume paths only

Data Access Roles: SELECT / MODIFY / USAGE (Unity Catalog)
  • Read/write/query specific tables, views, and volumes
  • Granular privileges granted per object
  Notes: Applied per object; separate from workspace admin rights
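
The data-access roles above are applied as per-object grants. A minimal sketch of how such grants can be composed; the principal names and three-level object names below are hypothetical examples, not from this document:

```python
# Sketch: composing Unity Catalog GRANT statements for per-object data access.
# Principals ("analysts", "etl_service") and objects are hypothetical.

def grant_statement(privilege: str, securable_type: str, securable: str, principal: str) -> str:
    """Render one Unity Catalog GRANT statement as SQL text."""
    return f"GRANT {privilege} ON {securable_type} {securable} TO `{principal}`"

# Grant read on one table, write on the same table for the ETL identity,
# and read on one volume -- nothing broader.
grants = [
    grant_statement("SELECT", "TABLE", "sales.finance.transactions", "analysts"),
    grant_statement("MODIFY", "TABLE", "sales.finance.transactions", "etl_service"),
    grant_statement("READ VOLUME", "VOLUME", "sales.finance.raw_files", "analysts"),
]

for g in grants:
    print(g)
```

Each statement would be run by an owner of the securable or a metastore admin; none of them confers any workspace-level rights, which is exactly the separation the matrix describes.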

Databricks on AWS – Least Privilege Permission Matrix

This matrix describes the minimal AWS and Databricks permissions required to create or manage common platform resources when using Databricks on AWS. The goal is to follow enterprise least-privilege security principles.

Resource | Primary Owner Role | Required AWS Permissions | Required Databricks Permissions | Purpose | Security / Least Privilege Notes
Workspace | Platform Admin | iam:CreateRole, iam:AttachRolePolicy, ec2:CreateVpc, s3:CreateBucket | Account Admin | Create Databricks workspace | Automate using Terraform and restrict to platform team
Cross-Account IAM Role | AWS Cloud Admin | iam:CreateRole, iam:PutRolePolicy, sts:AssumeRole | None | Allows Databricks control plane access | Trust only the Databricks account
Root Storage (DBFS) | AWS Cloud Admin | s3:CreateBucket, s3:PutBucketPolicy | None | Workspace default storage | Enable encryption and versioning
Unity Catalog Metastore | Data Platform Admin | s3:GetObject, s3:PutObject, s3:ListBucket | Metastore Admin | Central governance metadata store | Dedicated metastore bucket
Metastore Assignment | Platform Admin | None | Account Admin | Attach metastore to workspace | Single metastore per region recommended
Storage Credential | Data Platform Admin | iam:PassRole, sts:AssumeRole | CREATE STORAGE CREDENTIAL | Connect Unity Catalog to S3 | IAM role should allow only a specific S3 path
External Location | Data Governance Admin | s3:GetObject, s3:PutObject | CREATE EXTERNAL LOCATION | Expose S3 path to Unity Catalog | Use path-level permissions
Catalog | Data Governance Admin | Access to storage location | CREATE CATALOG | Top governance layer | One catalog per domain recommended
Schema | Data Owner | None | CREATE SCHEMA | Database container | Grant schema-level privileges
Delta Table | Data Engineer | S3 read/write | CREATE TABLE | Structured table storage | Use Unity Catalog governance
External Table | Data Engineer | S3 read | CREATE TABLE | Reference external dataset | Avoid direct S3 access
Notebook | Data Engineer / Analyst | None | Workspace Editor | Analytics code | Store production code in Git
Git Repo Integration | Developer | None | Workspace Editor | Version control integration | Use GitHub / GitLab PAT
Job / Workflow | Data Engineer | None | CREATE JOB | Automated pipelines | Define jobs as code
Cluster | Platform Admin | ec2:RunInstances, iam:PassRole | CREATE CLUSTER | Compute resource | Restrict using cluster policies
SQL Warehouse | Data Engineer | None | CREATE SQL WAREHOUSE | Serverless SQL analytics | Limit compute size via policies
Cluster Policy | Platform Admin | None | CREATE CLUSTER POLICY | Restrict compute usage | Important governance control
Feature Store Table | ML Engineer | S3 read/write | CREATE TABLE | Machine learning features | Stored as Delta tables
ML Model Registry | ML Engineer | S3 artifact storage | CREATE MODEL | Track ML model versions | Store artifacts in a secure bucket
Streaming Checkpoints | Data Engineer | s3:PutObject, s3:GetObject | Job permission | Streaming progress tracking | Separate checkpoint directory
Unity Catalog Volume | Data Platform Admin | S3 access | CREATE VOLUME | File storage governance | Alternative to DBFS
Audit Logs | Security Team | S3 write | Account Admin | Security auditing | Send logs to SIEM
PrivateLink Networking | AWS Cloud Admin | ec2:CreateVpcEndpoint | Account Admin | Private connectivity | Required for highly secure environments
DBFS File Upload | User | s3:PutObject | Workspace User | Temporary file storage | Avoid for production data
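
The Storage Credential row above requires an IAM role that allows only a specific S3 path. A sketch of what such a least-privilege policy document can look like; the bucket name and prefix are hypothetical:

```python
import json

# Sketch: a least-privilege S3 policy for a Unity Catalog storage credential
# role, scoped to a single bucket prefix. Names below are hypothetical.
BUCKET = "company-uc-data"
PREFIX = "finance/"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Object access only under the agreed prefix.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        },
        {   # Listing is restricted to the same prefix via a condition.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}*"]}},
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Attaching this policy (rather than a bucket-wide or account-wide one) to the role behind the storage credential keeps the credential's blast radius to one path, which is the point of the matrix's least-privilege notes.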

Databricks Architecture Matrix with DR and Terraform Resources

Databricks Architecture Matrix (Serverless on AWS)

This document shows where major Databricks components live when using Serverless on AWS, including Disaster Recovery strategies and Terraform resources used for automation.

Control Plane Components

Component | Purpose | Where It Runs | Plane | DR Best Practice | Terraform Resource
Workspace | Main analytics workspace | Databricks SaaS | Control Plane | Create secondary workspace in another region | databricks_mws_workspaces
Users | User identity | Databricks account | Control Plane | Use centralized IdP | databricks_user
Groups | Access management | Databricks account | Control Plane | Manage via SCIM | databricks_group
Group Membership | User-group association | Databricks account | Control Plane | Recreate from IaC | databricks_group_member
Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Store notebooks in Git | databricks_notebook
Repos (Git Integration) | Source code integration | Workspace metadata | Control Plane | Keep Git remote as source of truth | databricks_repo
Job Scheduler | Pipeline scheduling | Databricks control services | Control Plane | Define jobs as code | databricks_job
Cluster Configuration | Compute definition | Databricks control services | Control Plane | Recreate clusters via IaC | databricks_cluster
SQL Warehouse | Serverless SQL endpoint | Databricks control services | Control Plane | Recreate warehouse in DR region | databricks_sql_endpoint
Unity Catalog Metastore | Metadata store | Databricks metadata service | Control Plane | Replicate configuration | databricks_metastore
Unity Catalog Catalog | Top-level data container | Databricks governance service | Control Plane | Recreate catalogs | databricks_catalog
Unity Catalog Schema | Database layer | Databricks governance service | Control Plane | Recreate schema structure | databricks_schema
Permissions | Access control policies | Databricks governance service | Control Plane | Store as code | databricks_grants
Model Registry | ML model version tracking | Databricks metadata services | Control Plane | Replicate model metadata | databricks_mlflow_model
Feature Store Metadata | ML feature definitions | Databricks metadata services | Control Plane | Store definitions in Git | databricks_feature_table

Data Plane Components (AWS)

Component | Purpose | Where It Runs | Plane | DR Best Practice | Terraform Resource
S3 Data Lake | Primary storage | AWS S3 | Data Plane | Enable cross-region replication | aws_s3_bucket
Delta Tables | Structured data storage | S3 | Data Plane | Replicate bucket | aws_s3_bucket
DBFS Root Storage | Databricks filesystem | S3 | Data Plane | Enable bucket versioning | aws_s3_bucket
MLflow Artifact Storage | Stores ML models | S3 | Data Plane | Replicate artifact bucket | aws_s3_bucket
Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint folders | aws_s3_bucket
Feature Store Data | ML training features | S3 | Data Plane | Enable replication | aws_s3_bucket
Execution Logs | Spark logs | S3 | Data Plane | Central logging system | aws_s3_bucket
Serverless Spark Compute | Job execution | AWS compute | Data Plane | Use multi-region workspace | N/A (managed by Databricks)
Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required | N/A
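
Several rows above recommend cross-region bucket replication as the DR practice. A sketch of the configuration that implements it; the bucket names, role ARN, and account ID are hypothetical, and the dict follows the shape expected by boto3's put_bucket_replication:

```python
# Sketch: S3 cross-region replication configuration for a data-lake bucket.
# All names and ARNs below are hypothetical examples.

SOURCE_BUCKET = "company-datalake-us-east-1"
DEST_BUCKET_ARN = "arn:aws:s3:::company-datalake-us-west-2"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication"

replication_config = {
    "Role": REPLICATION_ROLE_ARN,
    "Rules": [
        {
            "ID": "dr-replicate-all",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},                              # empty filter = whole bucket
            "DeleteMarkerReplication": {"Status": "Enabled"},
            "Destination": {"Bucket": DEST_BUCKET_ARN},
        }
    ],
}

# Applying it would look roughly like this (versioning must already be
# enabled on both buckets before S3 accepts the configuration):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_replication(
#     Bucket=SOURCE_BUCKET,
#     ReplicationConfiguration=replication_config,
# )
```

In practice this is usually declared in Terraform alongside the aws_s3_bucket resources listed in the table, rather than applied imperatively.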

Architecture Flow

Databricks Control Plane
    Workspace, Unity Catalog Metastore, Jobs, Clusters, SQL Warehouses
        |
        |  Secure API
        v
AWS Data Plane
    Serverless Spark Compute, Delta Tables, ML Models, Streaming Checkpoints
        |
        v
Amazon S3 Data Lake

Databricks Architecture Matrix (Serverless on AWS) with DR Best Practices

This document explains where major Databricks components reside when running Databricks Serverless on AWS and the recommended Disaster Recovery (DR) strategy for each component.

Control Plane Components

Component | Purpose | Where It Runs | Plane | DR Best Practice
Workspace UI | User interface for notebooks and jobs | Databricks SaaS | Control Plane | Create secondary workspace in another region
Workspace APIs | Automation APIs | Databricks SaaS | Control Plane | Automate infrastructure using Terraform
Users & Groups | User identity management | Databricks account services | Control Plane | Use centralized IdP like Okta/Azure AD
Authentication / SSO | Login via external identity provider | Databricks account services | Control Plane | Configure SSO redundancy at IdP level
Permissions / RBAC | Access control policies | Databricks control services | Control Plane | Store policies as code using Terraform
Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Sync notebooks with Git repositories
Notebook Outputs | Charts and query results | Workspace storage | Control Plane | Do not rely on outputs; regenerate from data
Workspace Files | Files uploaded to workspace | Workspace storage | Control Plane | Store important files in S3 or Git
Repos (Git Integration) | Git source control integration | Workspace metadata | Control Plane | Maintain source code in GitHub/GitLab
Job Scheduler | Schedules workflows | Databricks orchestration service | Control Plane | Define jobs using Infrastructure-as-Code
Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane | Export workflows via API and Terraform
SQL Query Planner | SQL optimization engine | Databricks query services | Control Plane | No DR needed (managed by Databricks)
SQL Warehouse Management | Serverless SQL management | Databricks control services | Control Plane | Recreate warehouses in secondary region
Unity Catalog | Central governance system | Databricks governance service | Control Plane | Replicate catalog configuration using scripts
Metastore | Metadata storage | Databricks metadata services | Control Plane | Export metadata periodically
Data Lineage | Tracks data relationships | Databricks governance services | Control Plane | Export lineage metadata via APIs
Audit Logs | Security logs | Databricks governance services | Control Plane | Send logs to centralized SIEM storage
Cluster Management | Compute lifecycle management | Databricks control services | Control Plane | Recreate clusters via automation
Feature Store Metadata | Feature definitions | Databricks metadata services | Control Plane | Back up definitions in Git
Model Registry Metadata | ML model tracking | Databricks metadata services | Control Plane | Replicate registry configuration
Lakehouse Monitoring Metadata | Dataset monitoring metrics | Databricks monitoring services | Control Plane | Export monitoring metrics
Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane | Recreate vector indexes from embeddings
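
The "export metadata periodically" and "define jobs as code" practices above can be scripted. A sketch that snapshots job definitions so they can be recreated in a DR workspace; the workspace URL and token handling are assumptions, and only the pure extraction step runs here:

```python
import json

# Sketch: backing up control-plane job definitions for DR. The extraction
# function is pure; the (commented) fetch would use the Databricks Jobs API.

def job_backup_payload(jobs_list_response: dict) -> str:
    """Keep only job_id and settings -- enough to recreate each job later."""
    backup = [
        {"job_id": j["job_id"], "settings": j.get("settings", {})}
        for j in jobs_list_response.get("jobs", [])
    ]
    return json.dumps(backup, indent=2, sort_keys=True)

# Fetching the live job list would look roughly like (URL/token hypothetical):
# import requests
# resp = requests.get(
#     "https://<workspace-url>/api/2.1/jobs/list",
#     headers={"Authorization": f"Bearer {token}"},
# )
# with open("jobs_backup.json", "w") as f:
#     f.write(job_backup_payload(resp.json()))

sample = {"jobs": [{"job_id": 101, "settings": {"name": "nightly_etl"}}]}
print(job_backup_payload(sample))
```

Writing the snapshot to a replicated S3 bucket on a schedule gives the DR workspace something to restore from, complementing the Terraform definitions that are already source-controlled.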

Data Plane Components (Customer AWS)

Component | Purpose | Where It Runs | Plane | DR Best Practice
Serverless Spark Compute | Executes jobs | AWS compute | Data Plane | Deploy in multi-region workspace
SQL Warehouse Compute | SQL query execution | AWS compute | Data Plane | Provision warehouses in secondary region
Delta Table Data | Table storage | S3 | Data Plane | Enable S3 cross-region replication
Managed Tables | Managed table storage | S3 | Data Plane | Use versioned S3 buckets
External Tables | External dataset storage | S3 | Data Plane | Replicate underlying S3 storage
DBFS Root | Databricks filesystem | S3 | Data Plane | Enable bucket replication
Unity Catalog Managed Storage | Catalog table storage | S3 | Data Plane | Cross-region replication
Unity Catalog Volumes | Governed file storage | S3 | Data Plane | Replicate S3 buckets
MLflow Model Artifacts | ML models | S3 | Data Plane | Replicate artifact bucket
Feature Store Data | ML feature datasets | S3 | Data Plane | S3 replication and versioning
Vector Search Index Data | Embedding storage | S3 | Data Plane | Rebuild indexes from replicated embeddings
Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint directories
Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required (recomputed)
Job Execution Logs | Spark logs | S3 | Data Plane | Send logs to centralized logging system
ML Training Data | Training datasets | S3 | Data Plane | Multi-region S3 replication
Delta Transaction Logs | Table version metadata | S3 | Data Plane | Protect using S3 versioning
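
The last row recommends protecting Delta transaction logs with S3 versioning. A minimal sketch of what enabling it looks like; the bucket name is hypothetical and the live call is left commented, since versioning is also a precondition for the replication rules discussed elsewhere in this document:

```python
# Sketch: enabling S3 versioning on the bucket that holds Delta tables
# (and therefore their _delta_log/ transaction logs). Name is hypothetical.

versioning_config = {"Status": "Enabled"}

# import boto3
# boto3.client("s3").put_bucket_versioning(
#     Bucket="company-datalake-us-east-1",
#     VersioningConfiguration=versioning_config,
# )
```

With versioning on, an accidental overwrite or delete of a transaction-log object can be rolled back to the previous object version instead of losing table history.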

Architecture Flow

Databricks Control Plane (Managed by Databricks)
    Workspace UI, Authentication, Unity Catalog, Metastore Metadata, Query Planner, Job Scheduler
        |
        |  Secure API
        v
AWS Data Plane (Customer Account)
    Serverless Spark Compute, SQL Warehouses, Delta Tables, ML Models, Spark Temp Storage
        |
        v
Amazon S3 Data Lake

Databricks Architecture Matrix (Serverless on AWS)

This document explains where major Databricks components reside when running Databricks Serverless on AWS. Databricks architecture is divided into two planes:

  • Control Plane – Managed by Databricks
  • Data Plane – Runs in the customer AWS account

Control Plane Components

Component | Purpose / What It Does | Where It Runs or Is Stored | Plane
Workspace UI | Web interface to access notebooks, jobs, dashboards | Databricks SaaS infrastructure | Control Plane
Workspace APIs | REST APIs for automation, Terraform, CLI | Databricks SaaS | Control Plane
Users & Groups | Identity and user management | Databricks account services | Control Plane
Authentication / SSO | Integrates with IdP such as Okta or Azure AD | Databricks account services | Control Plane
Permissions / RBAC | Access control policies | Databricks control services | Control Plane
Notebook Source Code | Python / SQL / Scala notebooks | Workspace storage | Control Plane
Notebook Outputs | Charts and result previews | Workspace storage | Control Plane
Workspace Files | Files uploaded to workspace | Workspace storage | Control Plane
Repos (Git Integration) | GitHub / Git integration | Workspace metadata | Control Plane
Job Scheduler | Schedules pipelines and jobs | Databricks orchestration service | Control Plane
Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane
SQL Query Planner | Optimizes SQL queries | Databricks query services | Control Plane
SQL Warehouse Management | Manages serverless SQL endpoints | Databricks control services | Control Plane
Unity Catalog | Central governance system | Databricks governance service | Control Plane
Metastore | Stores catalog and table metadata | Databricks metadata services | Control Plane
Data Lineage | Tracks data dependencies | Databricks governance services | Control Plane
Audit Logs | Security and governance logs | Databricks governance services | Control Plane
Cluster Management | Manages compute lifecycle | Databricks control services | Control Plane
Feature Store Metadata | ML feature definitions | Databricks metadata services | Control Plane
Model Registry Metadata | Tracks ML model versions | Databricks metadata services | Control Plane
Lakehouse Monitoring Metadata | Tracks dataset quality | Databricks monitoring services | Control Plane
Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane

Data Plane Components (Customer AWS Account)

Component | Purpose / What It Does | Where It Runs or Is Stored | Plane
Serverless Spark Compute | Runs notebooks and jobs | AWS compute instances | Data Plane
Databricks SQL Warehouse Compute | Executes SQL queries | AWS compute instances | Data Plane
Delta Table Data | Actual table data | S3 | Data Plane
Managed Tables | Databricks managed tables | S3 | Data Plane
External Tables | Tables referencing external datasets | S3 | Data Plane
DBFS Root | Databricks File System root | S3 bucket | Data Plane
Unity Catalog Managed Storage | Table storage governed by Unity Catalog | S3 | Data Plane
Unity Catalog Volumes | Governed file storage | S3 | Data Plane
MLflow Model Artifacts | ML models and artifacts | S3 | Data Plane
Feature Store Data | ML feature datasets | S3 | Data Plane
Vector Search Index Data | Vector embeddings | S3 | Data Plane
Streaming Checkpoints | Streaming job progress | S3 | Data Plane
Temporary Spark Shuffle Data | Intermediate processing data | Local disk / S3 | Data Plane
Job Execution Logs | Spark execution logs | S3 | Data Plane
ML Training Data | Training datasets | S3 | Data Plane
Delta Transaction Logs | Table versioning metadata | S3 | Data Plane

Architecture Flow

Databricks Control Plane (Managed by Databricks)
    Workspace UI, Authentication, Unity Catalog, Metastore Metadata, Query Planner, Job Scheduler, Notebook Code
        |
        |  Secure API
        v
AWS Data Plane (Customer AWS Account)
    Serverless Spark Compute, SQL Warehouses, Delta Tables, DBFS Storage, ML Models, Spark Temp Storage
        |
        v
Amazon S3 (Customer Data Lake)

Key Architecture Rule

Type | Location
Metadata | Databricks Control Plane
Data | Customer AWS S3
Compute | Customer AWS
Governance Policies | Unity Catalog (Control Plane)