Friday, 13 March 2026

Databricks Roles – Full Reference Matrix

This table includes Workspace Roles, Account Roles, and Unity Catalog Roles with exact capabilities.

Workspace Admin (Workspace)
  • Manage users and groups
  • Assign workspace roles
  • Create/manage clusters
  • Restart/terminate all clusters
  • Create/manage jobs and workflows
  • Create SQL warehouses
  • Manage secrets, libraries, instance profiles
  • Access DBFS (read/write)
  • Run notebooks and jobs
Notes: Full control of the workspace; does NOT grant automatic data access in Unity Catalog.

User (Workspace)
  • Create/edit/run own notebooks
  • Create/run jobs
  • Create clusters (if allowed by cluster policies)
  • Access DBFS (read/write)
  • Use SQL warehouses (if permitted)
Notes: Cannot manage other users or workspace settings.

Can Manage / Job Creator (Workspace)
  • Create/manage own jobs and clusters
  • Run notebooks
  • Upload files to DBFS
Notes: Limited admin; cannot manage other users or workspace-wide settings.

Viewer (Workspace)
  • Read-only access to notebooks and dashboards
  • View clusters and jobs
  • Read access to DBFS (if allowed)
Notes: No write permissions.

Account Admin (Account)
  • Create and delete workspaces
  • Assign workspace admins
  • Manage metastore assignments
  • Access account-wide audit logs
  • Manage billing / usage
Notes: Full control over the account; workspace-level roles must still be respected.

Billing / Support Roles (Account)
  • View usage and billing
  • Access technical support
Notes: Cannot manage workspaces or data; read-only account permissions.

Metastore Admin (Unity Catalog)
  • Create catalogs and schemas
  • Create storage credentials and external locations
  • Assign catalog-level permissions
  • Grant/revoke data access
Notes: Full control over UC metadata; does NOT grant workspace admin rights.

Catalog Owner (Unity Catalog)
  • Manage catalog and contained schemas
  • Grant/revoke access at catalog level
Notes: Limited to one catalog; cannot manage other catalogs.

Schema Owner (Unity Catalog)
  • Manage schema and contained tables/views
  • Grant/revoke access at schema level
Notes: Cannot manage catalog-level permissions.

Volume Owner (Unity Catalog)
  • Manage managed volumes (file storage)
  • Grant/revoke access to volumes
Notes: Access to volume paths only.

Data Access Roles (SELECT / MODIFY / USAGE) (Unity Catalog)
  • Read/write/query specific tables, views, volumes
  • Can be granted granular privileges via grants
Notes: Applied per object; separate from workspace admin rights.
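As a concrete illustration of the per-object data access roles, grants can be issued through the Unity Catalog permissions API. The helper below is a minimal sketch that only builds the request payload; the host, token, and the `data_analysts` group are placeholders, and the actual PATCH call is shown commented out.

```python
def build_grant_changes(principal, privileges_to_add):
    """Build the `changes` payload accepted by the Unity Catalog
    permissions PATCH endpoint (/api/2.1/unity-catalog/permissions/...)."""
    return {"changes": [{"principal": principal, "add": list(privileges_to_add)}]}

# Example: give an analysts group read access on one table. This is
# per-object data access; it implies no workspace admin rights.
payload = build_grant_changes("data_analysts", ["SELECT"])
# requests.patch(f"{host}/api/2.1/unity-catalog/permissions/table/main.sales.tx",
#                headers=HEADERS, json=payload)
print(payload)
```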

Databricks on AWS – Least Privilege Permission Matrix

This matrix describes the minimal AWS and Databricks permissions required to create or manage common platform resources when using Databricks on AWS. The goal is to follow enterprise least-privilege security principles.

Resource | Primary Owner Role | Required AWS Permissions | Required Databricks Permissions | Purpose | Security / Least Privilege Notes
Workspace | Platform Admin | iam:CreateRole, iam:AttachRolePolicy, ec2:CreateVpc, s3:CreateBucket | Account Admin | Create Databricks workspace | Automate using Terraform and restrict to platform team
Cross-Account IAM Role | AWS Cloud Admin | iam:CreateRole, iam:PutRolePolicy, sts:AssumeRole | None | Allows Databricks control plane access | Trust only the Databricks account
Root Storage (DBFS) | AWS Cloud Admin | s3:CreateBucket, s3:PutBucketPolicy | None | Workspace default storage | Enable encryption and versioning
Unity Catalog Metastore | Data Platform Admin | s3:GetObject, s3:PutObject, s3:ListBucket | Metastore Admin | Central governance metadata store | Dedicated metastore bucket
Metastore Assignment | Platform Admin | None | Account Admin | Attach metastore to workspace | Single metastore per region recommended
Storage Credential | Data Platform Admin | iam:PassRole, sts:AssumeRole | CREATE STORAGE CREDENTIAL | Connect Unity Catalog to S3 | IAM role should allow only specific S3 paths
External Location | Data Governance Admin | s3:GetObject, s3:PutObject | CREATE EXTERNAL LOCATION | Expose S3 path to Unity Catalog | Use path-level permissions
Catalog | Data Governance Admin | Access to storage location | CREATE CATALOG | Top governance layer | One catalog per domain recommended
Schema | Data Owner | None | CREATE SCHEMA | Database container | Grant schema-level privileges
Delta Table | Data Engineer | S3 read/write | CREATE TABLE | Structured table storage | Use Unity Catalog governance
External Table | Data Engineer | S3 read | CREATE TABLE | Reference external dataset | Avoid direct S3 access
Notebook | Data Engineer / Analyst | None | Workspace Editor | Analytics code | Store production code in Git
Git Repo Integration | Developer | None | Workspace Editor | Version control integration | Use GitHub / GitLab PAT
Job / Workflow | Data Engineer | None | CREATE JOB | Automated pipelines | Define jobs as code
Cluster | Platform Admin | ec2:RunInstances, iam:PassRole | CREATE CLUSTER | Compute resource | Restrict using cluster policies
SQL Warehouse | Data Engineer | None | CREATE SQL WAREHOUSE | Serverless SQL analytics | Limit compute size via policies
Cluster Policy | Platform Admin | None | CREATE CLUSTER POLICY | Restrict compute usage | Important governance control
Feature Store Table | ML Engineer | S3 read/write | CREATE TABLE | Machine learning features | Stored as Delta tables
ML Model Registry | ML Engineer | S3 artifact storage | CREATE MODEL | Track ML model versions | Store artifacts in secure bucket
Streaming Checkpoints | Data Engineer | s3:PutObject, s3:GetObject | Job permission | Streaming progress tracking | Separate checkpoint directory
Unity Catalog Volume | Data Platform Admin | S3 access | CREATE VOLUME | File storage governance | Alternative to DBFS
Audit Logs | Security Team | S3 write | Account Admin | Security auditing | Send logs to SIEM
PrivateLink Networking | AWS Cloud Admin | ec2:CreateVpcEndpoint | Account Admin | Private connectivity | Required for highly secure environments
DBFS File Upload | User | s3:PutObject | Workspace User | Temporary file storage | Avoid for production data
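The "only specific S3 paths" note for storage credentials can be made concrete. The sketch below builds a least-privilege IAM policy document scoped to one bucket prefix; the bucket and prefix names are hypothetical, and attaching the policy to a role is left to your IaC tooling.

```python
import json

def s3_least_privilege_policy(bucket, prefix):
    """Build an IAM policy document that allows object read/write only
    under one S3 prefix, plus listing restricted to that prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }

# Hypothetical metastore bucket, scoped to its governance prefix only.
policy = s3_least_privilege_policy("uc-metastore-bucket", "metastore")
print(json.dumps(policy, indent=2))
```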

Databricks Architecture Matrix with DR and Terraform Resources

Databricks Architecture Matrix (Serverless on AWS)

This document shows where major Databricks components live when using Serverless on AWS, including Disaster Recovery strategies and Terraform resources used for automation.

Control Plane Components

Component | Purpose | Where It Runs | Plane | DR Best Practice | Terraform Resource
Workspace | Main analytics workspace | Databricks SaaS | Control Plane | Create secondary workspace in another region | databricks_mws_workspaces
Users | User identity | Databricks account | Control Plane | Use centralized IdP | databricks_user
Groups | Access management | Databricks account | Control Plane | Manage via SCIM | databricks_group
Group Membership | User-group association | Databricks account | Control Plane | Recreate from IaC | databricks_group_member
Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Store notebooks in Git | databricks_notebook
Repos (Git Integration) | Source code integration | Workspace metadata | Control Plane | Keep Git remote as source of truth | databricks_repo
Job Scheduler | Pipeline scheduling | Databricks control services | Control Plane | Define jobs as code | databricks_job
Cluster Configuration | Compute definition | Databricks control services | Control Plane | Recreate clusters via IaC | databricks_cluster
SQL Warehouse | Serverless SQL endpoint | Databricks control services | Control Plane | Recreate warehouse in DR region | databricks_sql_endpoint
Unity Catalog Metastore | Metadata store | Databricks metadata service | Control Plane | Replicate configuration | databricks_metastore
Unity Catalog Catalog | Top-level data container | Databricks governance service | Control Plane | Recreate catalogs | databricks_catalog
Unity Catalog Schema | Database layer | Databricks governance service | Control Plane | Recreate schema structure | databricks_schema
Permissions | Access control policies | Databricks governance service | Control Plane | Store as code | databricks_grants
Model Registry | ML model version tracking | Databricks metadata services | Control Plane | Replicate model metadata | databricks_mlflow_model
Feature Store Metadata | ML feature definitions | Databricks metadata services | Control Plane | Store definitions in Git | databricks_feature_table

Data Plane Components (AWS)

Component | Purpose | Where It Runs | Plane | DR Best Practice | Terraform Resource
S3 Data Lake | Primary storage | AWS S3 | Data Plane | Enable cross-region replication | aws_s3_bucket
Delta Tables | Structured data storage | S3 | Data Plane | Replicate bucket | aws_s3_bucket
DBFS Root Storage | Databricks filesystem | S3 | Data Plane | Enable bucket versioning | aws_s3_bucket
MLflow Artifact Storage | Stores ML models | S3 | Data Plane | Replicate artifact bucket | aws_s3_bucket
Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint folders | aws_s3_bucket
Feature Store Data | ML training features | S3 | Data Plane | Enable replication | aws_s3_bucket
Execution Logs | Spark logs | S3 | Data Plane | Central logging system | aws_s3_bucket
Serverless Spark Compute | Job execution | AWS compute | Data Plane | Use multi-region workspace | N/A (managed by Databricks)
Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required | N/A

Architecture Flow

Databricks Control Plane
  Workspace · Unity Catalog Metastore · Jobs · Clusters · SQL Warehouses
        |
        |  Secure API
        v
AWS Data Plane
  Serverless Spark Compute · Delta Tables · ML Models · Streaming Checkpoints
        |
        v
Amazon S3 Data Lake

Databricks Architecture Matrix (Serverless on AWS) with DR Best Practices

This document explains where major Databricks components reside when running Databricks Serverless on AWS and the recommended Disaster Recovery (DR) strategy for each component.

Control Plane Components

Component | Purpose | Where It Runs | Plane | DR Best Practice
Workspace UI | User interface for notebooks and jobs | Databricks SaaS | Control Plane | Create secondary workspace in another region
Workspace APIs | Automation APIs | Databricks SaaS | Control Plane | Automate infrastructure using Terraform
Users & Groups | User identity management | Databricks account services | Control Plane | Use centralized IdP like Okta/Azure AD
Authentication / SSO | Login via external identity provider | Databricks account services | Control Plane | Configure SSO redundancy at IdP level
Permissions / RBAC | Access control policies | Databricks control services | Control Plane | Store policies as code using Terraform
Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Sync notebooks with Git repositories
Notebook Outputs | Charts and query results | Workspace storage | Control Plane | Do not rely on outputs; regenerate from data
Workspace Files | Files uploaded to workspace | Workspace storage | Control Plane | Store important files in S3 or Git
Repos (Git Integration) | Git source control integration | Workspace metadata | Control Plane | Maintain source code in GitHub/GitLab
Job Scheduler | Schedules workflows | Databricks orchestration service | Control Plane | Define jobs using Infrastructure-as-Code
Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane | Export workflows via API and Terraform
SQL Query Planner | SQL optimization engine | Databricks query services | Control Plane | No DR needed (managed by Databricks)
SQL Warehouse Management | Serverless SQL management | Databricks control services | Control Plane | Recreate warehouses in secondary region
Unity Catalog | Central governance system | Databricks governance service | Control Plane | Replicate catalog configuration using scripts
Metastore | Metadata storage | Databricks metadata services | Control Plane | Export metadata periodically
Data Lineage | Tracks data relationships | Databricks governance services | Control Plane | Export lineage metadata via APIs
Audit Logs | Security logs | Databricks governance services | Control Plane | Send logs to centralized SIEM storage
Cluster Management | Compute lifecycle management | Databricks control services | Control Plane | Recreate clusters via automation
Feature Store Metadata | Feature definitions | Databricks metadata services | Control Plane | Back up definitions in Git
Model Registry Metadata | ML model tracking | Databricks metadata services | Control Plane | Replicate registry configuration
Lakehouse Monitoring Metadata | Dataset monitoring metrics | Databricks monitoring services | Control Plane | Export monitoring metrics
Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane | Recreate vector indexes from embeddings

Data Plane Components (Customer AWS)

Component | Purpose | Where It Runs | Plane | DR Best Practice
Serverless Spark Compute | Executes jobs | AWS compute | Data Plane | Deploy in multi-region workspace
SQL Warehouse Compute | SQL query execution | AWS compute | Data Plane | Provision warehouses in secondary region
Delta Table Data | Table storage | S3 | Data Plane | Enable S3 cross-region replication
Managed Tables | Managed table storage | S3 | Data Plane | Use versioned S3 buckets
External Tables | External dataset storage | S3 | Data Plane | Replicate underlying S3 storage
DBFS Root | Databricks filesystem | S3 | Data Plane | Enable bucket replication
Unity Catalog Managed Storage | Catalog table storage | S3 | Data Plane | Cross-region replication
Unity Catalog Volumes | Governed file storage | S3 | Data Plane | Replicate S3 buckets
MLflow Model Artifacts | ML models | S3 | Data Plane | Replicate artifact bucket
Feature Store Data | ML feature datasets | S3 | Data Plane | S3 replication and versioning
Vector Search Index Data | Embedding storage | S3 | Data Plane | Rebuild indexes from replicated embeddings
Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint directories
Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required (recomputed)
Job Execution Logs | Spark logs | S3 | Data Plane | Send logs to centralized logging system
ML Training Data | Training datasets | S3 | Data Plane | Multi-region S3 replication
Delta Transaction Logs | Table version metadata | S3 | Data Plane | Protect using S3 versioning
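Most of the data plane DR recommendations above reduce to one AWS operation: S3 cross-region replication. The sketch below builds the `ReplicationConfiguration` document used by the S3 `put_bucket_replication` API; the role ARN and bucket names are placeholders, and the boto3 call is shown commented out.

```python
def replication_config(role_arn, dest_bucket_arn, prefix=""):
    """Build an S3 ReplicationConfiguration that copies objects under
    `prefix` (empty means the whole bucket) to a DR-region bucket."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "dr-replication",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": prefix},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }

# Hypothetical buckets: replicate the primary data lake to a DR region.
cfg = replication_config("arn:aws:iam::123456789012:role/s3-crr",
                         "arn:aws:s3:::dr-data-lake")
# boto3.client("s3").put_bucket_replication(
#     Bucket="prod-data-lake", ReplicationConfiguration=cfg)
```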

Architecture Flow

Databricks Control Plane (Managed by Databricks)
  Workspace UI · Authentication · Unity Catalog · Metastore Metadata · Query Planner · Job Scheduler
        |
        |  Secure API
        v
AWS Data Plane (Customer Account)
  Serverless Spark Compute · SQL Warehouses · Delta Tables · ML Models · Spark Temp Storage
        |
        v
Amazon S3 Data Lake

Databricks Architecture Matrix (Serverless on AWS)

This document explains where major Databricks components reside when running Databricks Serverless on AWS. Databricks architecture is divided into two planes:

  • Control Plane – Managed by Databricks
  • Data Plane – Runs in the customer AWS account

Control Plane Components

Component | Purpose / What It Does | Where It Runs or Is Stored | Plane
Workspace UI | Web interface to access notebooks, jobs, dashboards | Databricks SaaS infrastructure | Control Plane
Workspace APIs | REST APIs for automation, Terraform, CLI | Databricks SaaS | Control Plane
Users & Groups | Identity and user management | Databricks account services | Control Plane
Authentication / SSO | Integrates with IdP such as Okta or Azure AD | Databricks account services | Control Plane
Permissions / RBAC | Access control policies | Databricks control services | Control Plane
Notebook Source Code | Python / SQL / Scala notebooks | Workspace storage | Control Plane
Notebook Outputs | Charts and result previews | Workspace storage | Control Plane
Workspace Files | Files uploaded to workspace | Workspace storage | Control Plane
Repos (Git Integration) | GitHub / Git integration | Workspace metadata | Control Plane
Job Scheduler | Schedules pipelines and jobs | Databricks orchestration service | Control Plane
Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane
SQL Query Planner | Optimizes SQL queries | Databricks query services | Control Plane
SQL Warehouse Management | Manages serverless SQL endpoints | Databricks control services | Control Plane
Unity Catalog | Central governance system | Databricks governance service | Control Plane
Metastore | Stores catalog and table metadata | Databricks metadata services | Control Plane
Data Lineage | Tracks data dependencies | Databricks governance services | Control Plane
Audit Logs | Security and governance logs | Databricks governance services | Control Plane
Cluster Management | Manages compute lifecycle | Databricks control services | Control Plane
Feature Store Metadata | ML feature definitions | Databricks metadata services | Control Plane
Model Registry Metadata | Tracks ML model versions | Databricks metadata services | Control Plane
Lakehouse Monitoring Metadata | Tracks dataset quality | Databricks monitoring services | Control Plane
Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane

Data Plane Components (Customer AWS Account)

Component | Purpose / What It Does | Where It Runs or Is Stored | Plane
Serverless Spark Compute | Runs notebooks and jobs | AWS compute instances | Data Plane
Databricks SQL Warehouse Compute | Executes SQL queries | AWS compute instances | Data Plane
Delta Table Data | Actual table data | S3 | Data Plane
Managed Tables | Databricks-managed tables | S3 | Data Plane
External Tables | Tables referencing external datasets | S3 | Data Plane
DBFS Root | Databricks File System root | S3 bucket | Data Plane
Unity Catalog Managed Storage | Table storage governed by Unity Catalog | S3 | Data Plane
Unity Catalog Volumes | Governed file storage | S3 | Data Plane
MLflow Model Artifacts | ML models and artifacts | S3 | Data Plane
Feature Store Data | ML feature datasets | S3 | Data Plane
Vector Search Index Data | Vector embeddings | S3 | Data Plane
Streaming Checkpoints | Streaming job progress | S3 | Data Plane
Temporary Spark Shuffle Data | Intermediate processing data | Local disk / S3 | Data Plane
Job Execution Logs | Spark execution logs | S3 | Data Plane
ML Training Data | Training datasets | S3 | Data Plane
Delta Transaction Logs | Table versioning metadata | S3 | Data Plane

Architecture Flow

Databricks Control Plane (Managed by Databricks)
  Workspace UI · Authentication · Unity Catalog · Metastore Metadata · Query Planner · Job Scheduler · Notebook Code
        |
        |  Secure API
        v
AWS Data Plane (Customer AWS Account)
  Serverless Spark Compute · SQL Warehouses · Delta Tables · DBFS Storage · ML Models · Spark Temp Storage
        |
        v
Amazon S3 (Customer Data Lake)

Key Architecture Rule

Type | Location
Metadata | Databricks Control Plane
Data | Customer AWS S3
Compute | Customer AWS
Governance Policies | Unity Catalog (Control Plane)

Monday, 26 January 2026

Databricks APIs – Architecture, Types, and Python Examples

Databricks provides a comprehensive set of REST APIs to automate platform setup, workspace administration, data governance, compute management, and analytics workflows. These APIs are commonly used for infrastructure automation, CI/CD pipelines, and application onboarding.


Common Python Setup


import requests
import json

DATABRICKS_HOST = "https://<databricks-instance>"
TOKEN = "<DATABRICKS_TOKEN>"

HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}
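The examples below each build URLs and check responses by hand. A small helper like the following can reduce that repetition; this is a sketch, not part of any official Databricks SDK, and it simply joins URLs and fails fast on HTTP errors.

```python
def api_url(host, path):
    """Join the Databricks host and an API path without duplicate slashes."""
    return host.rstrip("/") + "/" + path.lstrip("/")

def api_call(method, host, token, path, **kwargs):
    """Send one REST request and raise on non-2xx responses instead of
    silently printing error bodies."""
    import requests  # imported lazily so api_url works without requests installed
    headers = {"Authorization": f"Bearer {token}",
               "Content-Type": "application/json"}
    resp = requests.request(method, api_url(host, path), headers=headers, **kwargs)
    resp.raise_for_status()
    return resp.json()

# Example usage (placeholders):
# api_call("GET", DATABRICKS_HOST, TOKEN, "/api/2.0/clusters/list")
```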

1. Account API

Purpose: Manage Databricks accounts and workspaces.

Documentation: Databricks Account API

Create a Workspace


url = f"{DATABRICKS_HOST}/api/2.0/accounts/<ACCOUNT_ID>/workspaces"

payload = {
    "workspace_name": "dev-workspace",
    "aws_region": "us-east-1",
    "credentials_id": "cred-id",
    "storage_configuration_id": "storage-id",
    "network_id": "network-id"
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())

2. SCIM API

Purpose: Manage users, groups, and service principals.

Documentation: Databricks SCIM API

Create a Service Principal


url = f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/ServicePrincipals"

payload = {
    "displayName": "my-app-sp"
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())

3. Unity Catalog API

Purpose: Centralized data governance for catalogs, schemas, and tables.

Documentation: Unity Catalog API

Create a Catalog


url = f"{DATABRICKS_HOST}/api/2.1/unity-catalog/catalogs"

payload = {
    "name": "sales_catalog",
    "comment": "Catalog for sales domain"
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())

Grant Catalog Permission


url = f"{DATABRICKS_HOST}/api/2.1/unity-catalog/permissions/catalogs/sales_catalog"

payload = {
    "changes": [
        {
            "principal": "data_analysts",
            "add": ["USE_CATALOG"]
        }
    ]
}

response = requests.patch(url, headers=HEADERS, json=payload)
print(response.json())

4. Workspace API

Purpose: Manage clusters, jobs, notebooks, and workspace objects.

Documentation: Workspace API

Create a Cluster


url = f"{DATABRICKS_HOST}/api/2.0/clusters/create"

payload = {
    "cluster_name": "demo-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1,
    "autotermination_minutes": 30
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())

5. Jobs API

Purpose: Orchestrate batch and streaming workloads.

Documentation: Jobs API

Create a Job


url = f"{DATABRICKS_HOST}/api/2.1/jobs/create"

payload = {
    "name": "sample-job",
    "tasks": [
        {
            "task_key": "run_notebook",
            "notebook_task": {
                "notebook_path": "/Shared/sample_notebook"
            },
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1
            }
        }
    ]
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())

6. Repos API

Purpose: Integrate Git repositories.

Documentation: Repos API

Create a Repo
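The code for this example is missing from the original post; below is a hedged sketch against the Repos API (`POST /api/2.0/repos`). The Git URL and workspace path are placeholder values, and the host/header setup from the "Common Python Setup" section is repeated so the snippet stands alone.

```python
DATABRICKS_HOST = "https://<databricks-instance>"   # as in the common setup
HEADERS = {"Authorization": "Bearer <DATABRICKS_TOKEN>",
           "Content-Type": "application/json"}

url = f"{DATABRICKS_HOST}/api/2.0/repos"

payload = {
    "url": "https://github.com/my-org/analytics-pipelines.git",  # placeholder repo
    "provider": "gitHub",
    "path": "/Repos/ci-bot/analytics-pipelines",                 # placeholder path
}

# Uncomment with real host and token:
# import requests
# response = requests.post(url, headers=HEADERS, json=payload)
# print(response.json())
```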

Databricks APIs – Types & Summary

Types of Databricks APIs

Databricks provides a rich set of APIs to manage both the platform and workspace workloads. These APIs are categorized based on their scope and functionality, and they are critical for automation, CI/CD, governance, and onboarding of applications.

1. Account-Level APIs (Control Plane)

These APIs manage the Databricks account itself. They allow platform engineers to create and configure workspaces, set up Unity Catalog (metastores), manage networking, storage credentials, and service principals.

Official Databricks Account API Docs

2. Workspace-Level APIs (Data Plane)

These APIs operate inside a single workspace to manage data workloads such as:

  • Clusters, Jobs, and Libraries
  • DBFS file storage
  • Secrets & instance pools
  • SQL Warehouses

Official Workspace REST API Docs

3. Unity Catalog / Metastore APIs

These APIs manage metadata, governance, and data access across multiple workspaces:

  • Create catalogs, schemas, tables, and external locations
  • Grant permissions at table, column, or catalog level
  • Attach or detach workspaces to a metastore

Unity Catalog API Reference

4. Repos API

Used to manage Git repositories integrated with Databricks (GitHub, GitLab, Azure DevOps). Enables CI/CD automation for notebooks.

Repos API Docs

5. Tokens & Authentication APIs

Used to manage personal access tokens (PATs) and service principal tokens for automation pipelines.

Token API Docs

6. SCIM API

Manages users, groups, and service principals for identity management and enterprise compliance. Databricks implements the SCIM 2.0 standard.

SCIM API Docs

7. SQL API

Enables programmatic execution of SQL queries and management of SQL endpoints / warehouses.

SQL API Docs

8. MLflow API

Manages the machine learning lifecycle including experiments, runs, and model registry.

MLflow API Docs

Summary Table of Databricks APIs

API Type | Scope | Purpose | Official Documentation
Account API | Account | Platform setup & governance: workspaces, metastore, network, credentials, service principals | Docs
Workspace REST API | Workspace | Data plane workloads: clusters, jobs, DBFS, libraries, secrets | Docs
SCIM API | Workspace / Account | Identity management: users, groups, service principals | Docs
Unity Catalog / Metastore API | Account + Workspaces | Data governance: catalogs, schemas, tables, permissions, external locations | Docs
Repos API | Workspace | Git repository integration for CI/CD | Docs
Tokens / Authentication API | Account / Workspace | Manage PATs & service principal tokens | Docs
SQL API | Workspace | Programmatic SQL execution & SQL endpoint management | Docs
MLflow API | Workspace | Machine learning lifecycle: experiments, runs, model registry | Docs

For the full index of all Databricks APIs and SDKs: Databricks API Reference

Friday, 16 January 2026

Enterprise Databricks on AWS: Zero-Trust, Unity Catalog & Audit-Ready Architecture

This document explains how to design and implement Databricks on AWS using Zero-Trust principles, Unity Catalog enforced security, cross-account data sharing, and an audit-ready architecture.


1. Zero-Trust Databricks Deployment (AWS)

What Zero-Trust Means for Databricks

  • No public IPs
  • No inbound internet access
  • Explicit identity-based access only
  • All access is authenticated, authorized, and logged

Core AWS Components

  • Dedicated VPC per Databricks workspace
  • Private subnets only
  • VPC Endpoints (PrivateLink)
  • IAM roles with least privilege
  • Security Groups with deny-by-default

VPC Design

VPC (10.0.0.0/16)
├── Private Subnet A (10.0.1.0/24) - Databricks Compute
├── Private Subnet B (10.0.2.0/24) - Databricks Compute
├── VPC Endpoint Subnet
└── No Internet Gateway

Required VPC Endpoints

  • com.amazonaws.<region>.s3
  • com.amazonaws.<region>.sts
  • com.amazonaws.<region>.logs
  • com.amazonaws.<region>.monitoring
  • Databricks Control Plane PrivateLink endpoints

Why: Databricks clusters must communicate with AWS services without touching the public internet.
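The AWS service names above follow a predictable pattern per region, so they can be generated for IaC pipelines. A minimal sketch (the Databricks PrivateLink endpoint names are workspace-specific and not generated here):

```python
def required_endpoint_services(region):
    """Return the AWS VPC endpoint service names the cluster subnets need,
    per the endpoint list above."""
    return [f"com.amazonaws.{region}.{svc}"
            for svc in ("s3", "sts", "logs", "monitoring")]

# Feed these into your Terraform/boto3 endpoint creation loop.
print(required_endpoint_services("us-east-1"))
```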

Security Groups

  • No inbound rules
  • Outbound only to:
    • VPC endpoints
    • Databricks control plane CIDRs

2. Unity Catalog Enforced Security

Why Unity Catalog Is Mandatory for Enterprises

  • Centralized governance
  • Fine-grained RBAC (catalog, schema, table, column, row)
  • Cross-workspace data sharing
  • Built-in auditing

Unity Catalog Core Objects

Metastore
 ├── Catalog (prod_sales)
 │    ├── Schema (orders)
 │    │    └── Table (transactions)

Metastore Setup (AWS)

  • Create S3 bucket for UC storage
  • Enable versioning & encryption (SSE-KMS)
  • Attach IAM role to Databricks

S3 Bucket Policy:
- Allow Databricks IAM Role
- Deny public access
- Enforce TLS

RBAC Example

Group: analytics_team
Permissions:
- USE CATALOG prod_sales
- USE SCHEMA prod_sales.orders
- SELECT ON TABLE prod_sales.orders.transactions
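The RBAC spec above maps directly to Unity Catalog SQL GRANT statements. The sketch below renders them as strings; in practice you would run each statement in a SQL editor or via `spark.sql`.

```python
def grant_statements(group, grants):
    """Render (securable, privilege) pairs as Unity Catalog GRANT SQL.
    `grants` is a list like [("CATALOG prod_sales", "USE CATALOG"), ...]."""
    return [f"GRANT {priv} ON {securable} TO `{group}`;"
            for securable, priv in grants]

# The analytics_team permissions listed above:
stmts = grant_statements("analytics_team", [
    ("CATALOG prod_sales", "USE CATALOG"),
    ("SCHEMA prod_sales.orders", "USE SCHEMA"),
    ("TABLE prod_sales.orders.transactions", "SELECT"),
])
for s in stmts:
    print(s)
```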

Row-Level Security (Dynamic Views)

CREATE VIEW prod_sales.orders.secure_transactions AS
SELECT *
FROM prod_sales.orders.transactions
WHERE region = current_user();

3. Cross-Account Data Sharing (Unity Catalog)

Use Case

  • Producer account owns raw data
  • Consumer account reads curated data
  • No data copy

Architecture

Account A (Producer)
 └── Unity Catalog Metastore
      └── Shared Catalog

Account B (Consumer)
 └── Databricks Workspace
      └── Read-only access

How Sharing Works

  • Delta Sharing protocol
  • IAM role trust between accounts
  • Read-only permissions

Security Guarantees

  • No write access
  • All queries logged
  • Column and row filters enforced

4. Audit-Ready Architecture

Audit Requirements Covered

  • Who accessed what data
  • When queries were run
  • From which workspace
  • Using which identity

Audit Logs

  • Databricks audit logs → S3
  • CloudTrail for IAM & API calls
  • S3 access logs

Audit Log Flow

Databricks → S3 (Audit Logs)
AWS CloudTrail → S3
S3 → SIEM / Athena / OpenSearch
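Once delivered to S3, audit logs can be queried directly. The sketch below filters events by service to answer "who accessed what"; the field names (`serviceName`, `actionName`, `userIdentity.email`) follow the Databricks audit log schema but should be verified against your log delivery, and the sample record is hypothetical.

```python
def who_accessed(events, service="unityCatalog"):
    """Return (email, action) pairs for audit events from one service,
    answering the auditors' first question: who touched governed data."""
    return [(e.get("userIdentity", {}).get("email"), e.get("actionName"))
            for e in events if e.get("serviceName") == service]

# Hypothetical sample record, shaped like one delivered audit-log line.
sample = [{"serviceName": "unityCatalog",
           "actionName": "getTable",
           "userIdentity": {"email": "alice@company.com"}}]
print(who_accessed(sample))
```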

What Auditors Love

  • No shared credentials
  • Identity-based access
  • Immutable logs
  • Separation of duties

5. End-to-End Control Summary

Layer | Control
Network | Private VPC, PrivateLink, no internet
Identity | IAM + Databricks SCIM groups
Compute | Cluster policies & group binding
Data | Unity Catalog RBAC + RLS
Audit | Centralized logs in S3

Final Outcome

  • Zero-trust Databricks deployment
  • Centralized governance via Unity Catalog
  • Secure cross-account data sharing
  • Fully audit-ready enterprise platform

This architecture scales cleanly across Dev / Test / Prod, supports regulated workloads, and aligns with financial-grade security standards.

Databricks on AWS – Networking, Security & PrivateLink Architecture (Deep Dive)

Databricks on AWS – Complete Networking & Security Architecture Guide

This document explains how Databricks is deployed securely on AWS, focusing on:

  • VPC & subnet design
  • Control plane vs data plane
  • IAM roles & instance profiles
  • Security groups & traffic flow
  • PrivateLink (frontend & backend)

1️⃣ Databricks Architecture Overview


Control Plane vs Data Plane

Plane | Owned By | What Runs Here
Control Plane | Databricks | UI, REST APIs, jobs scheduler, notebook metadata
Data Plane | Customer AWS Account | Clusters, Spark executors, DBFS root, data access

Key rule: Your data never leaves your AWS account.

2️⃣ VPC Design (Customer-Managed)

Why Customer-Managed VPC?

  • Network isolation
  • PrivateLink support
  • Compliance (SOC2, PCI, HIPAA)

Recommended VPC Layout

VPC (10.0.0.0/16)
│
├── Private Subnet A (10.0.1.0/24)
│   └── Databricks Workers
│
├── Private Subnet B (10.0.2.0/24)
│   └── Databricks Workers
│
├── Public Subnet (optional)
│   └── NAT Gateway
│
└── VPC Endpoints
    ├── S3
    ├── STS
    ├── Kinesis (optional)
    └── Databricks PrivateLink

Databricks clusters should never be in public subnets.

3️⃣ Subnets & Routing

Private Subnets

  • No public IPs
  • Route to NAT Gateway (only if needed)
  • Preferred: VPC endpoints instead of NAT

Route Table (Private Subnet)

0.0.0.0/0 → NAT Gateway (optional)
pl-xxxxxx → Databricks PrivateLink
s3 → Gateway Endpoint

4️⃣ Security Groups (CRITICAL)

Databricks Cluster Security Group

Direction | Port | Source | Purpose
Inbound | All | Self | Worker ↔ worker communication
Outbound | 443 | 0.0.0.0/0 or VPC endpoints | Control plane, S3, APIs

Databricks requires full intra-cluster communication.

5️⃣ IAM Roles & Instance Profiles

Why IAM Roles?

  • No access keys on clusters
  • Least privilege data access
  • Auditable via CloudTrail

Databricks EC2 Role

Trust Policy:
Service: ec2.amazonaws.com

Permissions Policy

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::prod-data",
    "arn:aws:s3:::prod-data/*"
  ]
}

Instance Profile

  • IAM Role → Instance Profile
  • Attached to Databricks clusters
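The role-to-instance-profile chain above can be scripted. The sketch below builds the EC2 trust policy shown earlier; the boto3 calls are commented out and the role/profile names are placeholders.

```python
def ec2_trust_policy():
    """Trust policy letting EC2 instances assume the role, which is then
    wrapped in an instance profile and attached to Databricks clusters."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

# With boto3 (calls commented; names are placeholders):
# import json, boto3
# iam = boto3.client("iam")
# iam.create_role(RoleName="databricks-ec2",
#                 AssumeRolePolicyDocument=json.dumps(ec2_trust_policy()))
# iam.create_instance_profile(InstanceProfileName="databricks-ec2")
# iam.add_role_to_instance_profile(InstanceProfileName="databricks-ec2",
#                                  RoleName="databricks-ec2")
```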

6️⃣ PrivateLink Architecture


Frontend PrivateLink

  • Users access Databricks UI privately
  • No public internet exposure

Backend PrivateLink

  • Clusters talk to control plane privately
  • No NAT gateway required

Required VPC Endpoints

Endpoint | Type
Databricks Control Plane | Interface
S3 | Gateway
STS | Interface
CloudWatch | Interface

7️⃣ Traffic Flow (End-to-End)

User Browser
  ↓ (PrivateLink)
Databricks Control Plane
  ↓ (PrivateLink)
Cluster Driver (Private Subnet)
  ↓
S3 via VPC Endpoint

At no point does traffic traverse the public internet.

8️⃣ Common Enterprise Decisions

Decision | Recommendation
Public vs private workspace | Private (PrivateLink)
NAT Gateway | Avoid if VPC endpoints are available
IAM users | Never
Data access | IAM roles + Unity Catalog

9️⃣ What This Enables Next

  • Zero-trust Databricks deployment
  • Unity Catalog enforced security
  • Cross-account data sharing
  • Audit-ready architecture

10️⃣ Typical Enterprise Follow-Up Topics

  • Terraform modules for networking
  • Private DNS for Databricks
  • Multi-account AWS architecture
  • Cost & network optimization

This architecture is used by banks, healthcare providers, and regulated enterprises.

Thursday, 15 January 2026

Databricks REST API – Complete Enterprise Automation Guide (Python + AWS)


Databricks REST API – Complete Enterprise Automation Guide

This guide documents the most commonly used Databricks REST API endpoints, with working Python examples for enterprise automation on AWS.


0️⃣ Authentication & Base Configuration

Account-Level APIs

Base URL: https://accounts.cloud.databricks.com
Auth: Account PAT

Workspace-Level APIs

Base URL: https://dbc-xxxx.cloud.databricks.com
Auth: Workspace PAT
import requests

ACCOUNT_ID = "xxxx"
ACCOUNT_HOST = "https://accounts.cloud.databricks.com"
WORKSPACE_HOST = "https://dbc-xxxx.cloud.databricks.com"

ACCOUNT_HEADERS = {
    "Authorization": "Bearer ACCOUNT_TOKEN",
    "Content-Type": "application/json"
}

WORKSPACE_HEADERS = {
    "Authorization": "Bearer WORKSPACE_TOKEN",
    "Content-Type": "application/json"
}
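The only difference between the two API surfaces is the URL prefix. A small helper (illustrative, not part of any Databricks SDK) keeps endpoint paths consistent across the examples that follow:

```python
def account_url(host: str, account_id: str, path: str) -> str:
    # Account-level endpoints live under /api/2.0/accounts/<account_id>/
    return f"{host.rstrip('/')}/api/2.0/accounts/{account_id}/{path.lstrip('/')}"

def workspace_url(host: str, path: str, version: str = "2.0") -> str:
    # Workspace-level endpoints live under /api/<version>/
    return f"{host.rstrip('/')}/api/{version}/{path.lstrip('/')}"
```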

1️⃣ Identity & SCIM APIs

Endpoint                      Purpose
POST /scim/v2/Users           Create user
GET /scim/v2/Users            List users
POST /scim/v2/Groups          Create group
PATCH /scim/v2/Groups/{id}    Add/remove members

Create User

url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Users"
payload = {
  "userName": "alice@company.com",
  "displayName": "Alice",
  "active": True
}
requests.post(url, headers=ACCOUNT_HEADERS, json=payload).raise_for_status()
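For bulk onboarding it helps to build SCIM payloads in one place. A sketch, assuming the standard SCIM 2.0 `schemas` field (RFC 7643); the helper name is illustrative:

```python
def scim_user_payload(email: str, display_name: str, active: bool = True) -> dict:
    # SCIM 2.0 user resources carry a "schemas" field identifying the resource type
    return {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": email,
        "displayName": display_name,
        "active": active,
    }
```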

Create Group

url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Groups"
payload = {"displayName": "data-engineers"}
group = requests.post(url, headers=ACCOUNT_HEADERS, json=payload).json()

2️⃣ Workspace (Account-Level) APIs

Endpoint                       Description
POST /workspaces               Create workspace
GET /workspaces                List workspaces
POST /permissionassignments    Assign groups to workspace

Create Workspace

url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/workspaces"
payload = {
  "workspace_name": "prod",
  "aws_region": "us-east-1",
  "credentials_id": "cred-123",
  "storage_configuration_id": "storage-123",
  "network_id": "network-123"
}
workspace = requests.post(url, headers=ACCOUNT_HEADERS, json=payload).json()

3️⃣ Cluster APIs

Endpoint                 Description
POST /clusters/create    Create cluster
GET /clusters/list       List clusters
POST /clusters/start     Start cluster
POST /clusters/delete    Delete cluster

Create Cluster

url = f"{WORKSPACE_HOST}/api/2.0/clusters/create"
payload = {
  "cluster_name": "engineering",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "m5.xlarge",
  "num_workers": 2
}
cluster = requests.post(url, headers=WORKSPACE_HEADERS, json=payload).json()

Set Cluster Permissions

url = f"{WORKSPACE_HOST}/api/2.0/permissions/clusters/{cluster['cluster_id']}"
payload = {
  "access_control_list": [
    {
      "group_name": "data-engineers",
      "permission_level": "CAN_ATTACH_TO"
    }
  ]
}
requests.patch(url, headers=WORKSPACE_HEADERS, json=payload)

4️⃣ Jobs API

Endpoint              Purpose
POST /jobs/create     Create job
POST /jobs/run-now    Run job
GET /jobs/list        List jobs

Create Job

url = f"{WORKSPACE_HOST}/api/2.0/jobs/create"
payload = {
  "name": "etl-job",
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "m5.large",
    "num_workers": 2
  },
  "notebook_task": {
    "notebook_path": "/Shared/etl"
  }
}
job = requests.post(url, headers=WORKSPACE_HEADERS, json=payload).json()

5️⃣ SQL & Warehouses API

Endpoint                Description
POST /sql/warehouses    Create SQL warehouse
POST /sql/statements    Execute SQL

Execute SQL

url = f"{WORKSPACE_HOST}/api/2.0/sql/statements"
payload = {
  "statement": "SELECT current_user(), current_date()",
  "warehouse_id": "wh-123"
}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)
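Statement execution is asynchronous: the POST returns a statement ID whose status can be polled via GET /api/2.0/sql/statements/{statement_id}. A minimal polling sketch; the fetch callable is injected so the loop stays testable:

```python
import time

def wait_for_statement(fetch_status, timeout_s=60.0, interval_s=2.0):
    """Poll a SQL statement until it reaches a terminal state.

    fetch_status is any zero-argument callable returning the statement JSON,
    e.g. lambda: requests.get(
        f"{WORKSPACE_HOST}/api/2.0/sql/statements/{statement_id}",
        headers=WORKSPACE_HEADERS).json()
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_status()["status"]["state"]
        if state in ("SUCCEEDED", "FAILED", "CANCELED"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"statement still {state} after {timeout_s}s")
        time.sleep(interval_s)
```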

6️⃣ DBFS & Workspace APIs

Endpoint                  Description
POST /dbfs/put            Upload file
GET /workspace/list       List notebooks
POST /workspace/import    Import notebook

Upload File to DBFS

url = f"{WORKSPACE_HOST}/api/2.0/dbfs/put"
payload = {
  "path": "/tmp/data.txt",
  "contents": "SGVsbG8="
}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)
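The `contents` field must be base64-encoded (`"SGVsbG8="` above decodes to `Hello`). A small helper for uploading arbitrary bytes:

```python
import base64

def dbfs_put_payload(dbfs_path: str, data: bytes) -> dict:
    # /api/2.0/dbfs/put expects the file contents base64-encoded
    return {"path": dbfs_path, "contents": base64.b64encode(data).decode("ascii")}
```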

7️⃣ Unity Catalog APIs (Most Used)

Endpoint                            Description
POST /unity-catalog/catalogs        Create catalog
POST /unity-catalog/schemas         Create schema
POST /unity-catalog/tables          Create table
PATCH /unity-catalog/permissions    Grant access

Create Catalog

url = f"{WORKSPACE_HOST}/api/2.1/unity-catalog/catalogs"
payload = {"name": "finance"}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)

Grant Table Access

url = f"{WORKSPACE_HOST}/api/2.1/unity-catalog/permissions/table/finance.payments.txns"
payload = {
  "changes": [{
    "principal": "data-scientists",
    "add": ["SELECT"]
  }]
}
requests.patch(url, headers=WORKSPACE_HEADERS, json=payload)
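The PATCH body is a list of per-principal changes. A tiny builder (hypothetical helper, not an SDK function) avoids hand-writing the structure:

```python
def uc_permission_changes(principal: str, add=None, remove=None) -> dict:
    # Body shape for PATCH /api/2.1/unity-catalog/permissions/{type}/{name}
    change = {"principal": principal}
    if add:
        change["add"] = list(add)
    if remove:
        change["remove"] = list(remove)
    return {"changes": [change]}
```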

8️⃣ Tokens, Secrets, Repos

Endpoint                       Use
POST /token/create             Create PAT
POST /secrets/scopes/create    Create secret scope
POST /repos                    Create repo

9️⃣ Enterprise Best Practices

  • Terraform for bootstrap & security
  • Python APIs for day-2 operations
  • Unity Catalog for ALL data access
  • No IAM-based data access
This API-first approach is used by regulated banks, fintech, and large enterprises.

Next Topics You Can Publish

  • Databricks CI/CD pipelines
  • API error handling & retries
  • Zero-trust data architecture
  • Cross-account Unity Catalog sharing

Enterprise Databricks on AWS – Identity, Workspace Isolation, Unity Catalog & RBAC (Terraform-Only)


Enterprise Databricks on AWS – Terraform-First Architecture

This article explains how to build a fully automated, enterprise-grade Databricks platform on AWS using Terraform only, covering:

  • SCIM & Identity automation
  • Workspace creation and isolation
  • Unity Catalog metastore & data isolation
  • Catalog, schema, table-level RBAC
  • Row-level security using dynamic views
  • Cross-account AWS data sharing

High-Level Enterprise Architecture

AWS Account (Databricks Account)
│
├── Account Console
│   ├── SCIM Users & Groups (Terraform)
│   ├── Unity Catalog Metastore (Terraform)
│   └── Workspaces (Dev / QA / Prod)
│
├── AWS Account A (Prod Data)
│   ├── S3 UC Managed Location
│   └── IAM Role (External Location)
│
├── AWS Account B (Analytics)
│   └── Read-only access via UC Sharing
│
└── Azure AD / Okta
    └── Identity Source (SSO + SCIM)
Design Principle: Identity, access, and data governance are controlled centrally at the Databricks Account level.

1. SCIM Group Automation with Terraform

Why SCIM Matters

SCIM ensures that Databricks users and groups are never created manually. Azure AD (or Okta) remains the source of truth.

Terraform – Databricks Account Provider

provider "databricks" {
  alias      = "account"
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
}

Create Groups (Mirrors Azure AD)

resource "databricks_group" "data_engineers" {
  provider     = databricks.account
  display_name = "data-engineers"
}

resource "databricks_group" "data_scientists" {
  provider     = databricks.account
  display_name = "data-scientists"
}

Assign Users (SCIM)

resource "databricks_user" "alice" {
  provider  = databricks.account
  user_name = "alice@company.com"
}

resource "databricks_group_member" "alice_engineers" {
  provider  = databricks.account
  group_id = databricks_group.data_engineers.id
  member_id = databricks_user.alice.id
}
Result: Azure AD → SCIM → Databricks is now fully automated.

2. Workspace Creation & Environment Isolation

Enterprise Workspace Strategy

  • One workspace per environment
  • Dev cannot modify Prod
  • Shared metastore across workspaces

Create Workspace (AWS)

resource "databricks_mws_workspaces" "prod" {
  provider      = databricks.account
  workspace_name = "prod-workspace"
  aws_region     = "us-east-1"

  credentials_id = databricks_mws_credentials.this.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
}

Attach Groups to Workspace

resource "databricks_mws_permission_assignment" "prod_admins" {
  provider     = databricks.account
  workspace_id = databricks_mws_workspaces.prod.workspace_id
  principal_id = databricks_group.data_engineers.id
  permissions  = ["ADMIN"]
}

3. Unity Catalog Metastore – Terraform-Only

Create Metastore

resource "databricks_metastore" "main" {
  provider     = databricks.account
  name         = "enterprise-metastore"
  region       = "us-east-1"
  storage_root = "s3://uc-metastore-root/"
}

Attach Metastore to Workspace

resource "databricks_metastore_assignment" "prod" {
  provider     = databricks.account
  workspace_id = databricks_mws_workspaces.prod.workspace_id
  metastore_id = databricks_metastore.main.id
}

4. Unity Catalog RBAC as Code (grants.tf)

Create Catalogs per Domain

resource "databricks_catalog" "finance" {
  name = "finance"
}

Create Schemas

resource "databricks_schema" "payments" {
  name       = "payments"
  catalog_name = databricks_catalog.finance.name
}

Grant Permissions

resource "databricks_grants" "finance_read" {
  catalog = databricks_catalog.finance.name

  grant {
    principal  = "data-scientists"
    privileges = ["USE_CATALOG"]
  }
}
All permissions are version-controlled and auditable.

5. Row-Level Security (Dynamic Views)

Use Case

  • US team sees US data
  • EU team sees EU data

Dynamic View

CREATE OR REPLACE VIEW finance.payments.secure_payments AS
SELECT *
FROM finance.payments.raw
WHERE (is_member('us_team') AND region = 'US')
   OR (is_member('eu_team') AND region = 'EU');
No data duplication. No application-side filtering.

6. Cross-Account AWS Sharing with Unity Catalog

Producer Account (Prod)

CREATE SHARE finance_share;
ALTER SHARE finance_share ADD TABLE finance.payments.raw;

Consumer Account

-- <provider> is the provider name registered in the consumer metastore
CREATE CATALOG finance_shared
USING SHARE <provider>.finance_share;
S3 access is mediated by UC – not IAM users.

7. Decision Diagrams for Architects

Identity Decision

Azure AD
 ├── Manual Users ❌
 └── SCIM + SSO ✅

Data Access Decision

IAM Policies ❌
Unity Catalog Grants ✅

Security Model

Workspace ACLs → Compute
Unity Catalog → Data

What This Enables Next

  • Prod data read-only from Dev
  • Cluster RBAC enforced
  • Auditor-friendly access logs
  • Multi-account AWS sharing
This is the reference architecture used by regulated enterprises.

Suggested Multi-Post Series

  1. Identity & SCIM Automation
  2. Workspace Isolation Strategy
  3. Unity Catalog Deep Dive
  4. RBAC & Data Security Patterns
  5. Cross-Account Data Sharing

Unity Catalog Metastore & Data Isolation – Enterprise Deep Dive


Unity Catalog Metastore & Data Isolation

Enterprise-Level Technical Deep Dive with Real Examples (AWS Databricks)


1. What a Unity Catalog Metastore Really Is

A Unity Catalog metastore is the central security and governance control plane for Databricks. It owns:

  • All metadata (catalogs, schemas, tables, views, functions)
  • All permissions (RBAC, RLS, CLS)
  • Access to physical storage through credentials and locations
The workspace is NOT the security boundary for data. The metastore is.

2. Metastore Scope & Design Decision

Enterprise Best Practice

One Metastore per:
- Cloud
- Region
- Compliance Boundary

Why This Matters

  • Enables cross-workspace data sharing
  • Centralizes governance and audit
  • Prevents duplicated security logic
Anti-pattern: one metastore per workspace. This breaks data sharing and multiplies governance overhead.

3. Real Enterprise Architecture (AWS)

AWS Account
│
├── Unity Catalog Metastore (us-east-1)
│   ├── Storage Root
│   ├── Storage Credentials
│   ├── External Locations
│   ├── Catalog: prod
│   └── Catalog: dev
│
├── Databricks Workspace: dev
└── Databricks Workspace: prod

Both workspaces attach to the same metastore.


4. Metastore Storage Root

The storage root is the default storage for managed tables. Users never access this directly.

Example


s3://company-uc-root/

IAM Role Permissions

  • s3:GetObject
  • s3:PutObject
  • s3:ListBucket
Users and clusters do NOT get these permissions directly.

5. Storage Credentials

A storage credential is a Unity Catalog object that wraps an IAM role.

Example


CREATE STORAGE CREDENTIAL prod_storage_cred
WITH IAM_ROLE 'arn:aws:iam::123456789:role/dbx-prod-uc-role';

This decouples cloud IAM from users completely.


6. External Locations (Actual Data Isolation)

External locations bind:

  • S3 path
  • Storage credential

Example


CREATE EXTERNAL LOCATION prod_sales_loc
URL 's3://prod-sales-data/'
WITH STORAGE CREDENTIAL prod_storage_cred;
Without an external location, Unity Catalog blocks access — even if S3 exists.

7. Catalog-Level Isolation

Catalogs are the first logical isolation layer.

Example


CREATE CATALOG prod;
CREATE CATALOG dev;

Access Control


GRANT USE CATALOG ON CATALOG prod TO `group_prod_users`;

8. Schema-Level Isolation

Schemas isolate teams or business domains.

Example


CREATE SCHEMA prod.sales;
CREATE SCHEMA prod.finance;

GRANT SELECT ON SCHEMA prod.sales
TO `group_sales_analytics`;

9. Table-Level Isolation

Tables are where most security risk exists.

Example


GRANT SELECT, MODIFY
ON TABLE prod.sales.customers
TO `group_sales_engineers`;
Never grant access to PUBLIC.

10. Cross-Workspace Data Sharing

Scenario

  • Dev workspace needs read-only access to Prod data

Solution


GRANT SELECT
ON TABLE prod.sales.customers
TO `group_dev_engineers`;

No S3 access required. Unity Catalog enforces this.


11. Row-Level Security (Dynamic Views)

Business Rule

Group                Country Access
group_us_analysts    USA
group_eu_analysts    EU

Dynamic View


CREATE VIEW prod.sales.customers_secure AS
SELECT *
FROM prod.sales.customers
WHERE
  (is_member('group_us_analysts') AND country = 'US')
  OR
  (is_member('group_eu_analysts') AND country = 'EU');
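The same predicate can be sketched in plain Python to reason about which rows each group sees (`is_member` is modeled here as set membership; this is an illustration of the logic, not how Unity Catalog evaluates views):

```python
def row_visible(country: str, groups: set) -> bool:
    # Mirrors the WHERE clause of the dynamic view above
    return (("group_us_analysts" in groups) and country == "US") or \
           (("group_eu_analysts" in groups) and country == "EU")
```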

12. Column-Level Security

Example


CREATE VIEW prod.sales.customers_masked AS
SELECT
  id,
  name,
  CASE
    WHEN is_member('group_pii_admins') THEN ssn
    ELSE 'XXX-XX-XXXX'
  END AS ssn
FROM prod.sales.customers;
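The masking rule reduces to a one-line function, sketched here to make the behavior explicit (illustrative only; the view itself enforces this at query time):

```python
def masked_ssn(ssn: str, groups: set) -> str:
    # Mirrors the CASE expression in the masked view above
    return ssn if "group_pii_admins" in groups else "XXX-XX-XXXX"
```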

13. Managed vs External Tables

Type        Storage              Use Case
Managed     UC Root              Dev, sandbox
External    External Location    Prod, regulated data

14. How Security Is Actually Enforced

  • At query planning
  • At query execution

Even if a user knows the S3 path, Unity Catalog blocks access.


15. Auditing & Lineage

Unity Catalog automatically captures:

  • Who accessed what
  • Which queries touched which tables
  • Downstream dependencies

Example Query


SELECT * FROM system.access.audit;

16. Common Enterprise Mistakes

  • Multiple metastores per environment
  • Granting S3 access to users
  • Relying on workspace ACLs for data
  • No catalog separation

17. Enterprise Golden Rules

  1. One metastore per region
  2. Always use groups
  3. Never grant to PUBLIC
  4. Use views for sensitive data
  5. Treat UC as a security firewall

18. End-to-End Access Example

User      Group                   Read       Write
User A    group_prod_engineers    All        Yes
User B    group_dev_engineers     All        No
User C    group_us_analysts       US only    No

Final Summary

Unity Catalog is not just metadata. It is your data firewall, governance engine, and compliance backbone.

If the metastore is designed correctly, everything else becomes simple.

Enterprise Databricks Automation on AWS – Identity, RBAC & Security as Code


Enterprise Databricks Automation on AWS

SCIM, Unity Catalog RBAC, Row-Level Security & Security-as-Code


Where This Fits in the Enterprise Series

Post      Topic
Step 0    Identity setup (SSO + SCIM)
Step 1    Workspace strategy & environment isolation
Step 2    Unity Catalog metastore & data isolation
Step 3    Identity, RBAC & data security as code (this post)
Step 4    CI/CD & promotion pipelines

1️⃣ SCIM Group Automation with Terraform

Why SCIM Automation Is Mandatory

  • No manual user or group creation
  • Identity source of truth = IdP
  • Permissions change automatically with group membership

Provider Configuration (Account Level)


provider "databricks" {
  host  = var.databricks_account_host
  token = var.databricks_account_token
}

Create Groups via Terraform


resource "databricks_group" "prod_engineers" {
  display_name = "group_prod_engineers"
}

resource "databricks_group" "dev_engineers" {
  display_name = "group_dev_engineers"
}

Add Users to Groups


resource "databricks_group_member" "prod_user" {
  group_id  = databricks_group.prod_engineers.id
  member_id = databricks_user.user_a.id
}
In real enterprises, users are synced automatically from IdP via SCIM. Terraform manages only group-level logic.

2️⃣ Unity Catalog RBAC as Code (grants.tf)

Why RBAC as Code Matters

  • Auditable permissions
  • No UI drift
  • Consistent across environments

grants.tf – Catalog-Level Access


resource "databricks_grants" "catalog_usage" {
  catalog = "prod_catalog"

  grant {
    principal  = "group_prod_engineers"
    privileges = ["USE_CATALOG"]
  }

  grant {
    principal  = "group_dev_engineers"
    privileges = ["USE_CATALOG"]
  }
}

Schema-Level RBAC


resource "databricks_grants" "sales_schema" {
  schema = "prod_catalog.sales"

  grant {
    principal  = "group_prod_engineers"
    privileges = ["USE_SCHEMA", "CREATE_TABLE", "SELECT"]
  }

  grant {
    principal  = "group_dev_engineers"
    privileges = ["USE_SCHEMA", "SELECT"]
  }
}

Table-Level RBAC


resource "databricks_grants" "customers_table" {
  table = "prod_catalog.sales.customers"

  grant {
    principal  = "group_prod_engineers"
    privileges = ["SELECT", "MODIFY"]
  }

  grant {
    principal  = "group_dev_engineers"
    privileges = ["SELECT"]
  }
}
Permissions are enforced by Unity Catalog at query execution time.

3️⃣ Row-Level Security (Dynamic Views)

Use Case

User Group           Allowed Country
group_us_analysts    USA
group_eu_analysts    EU

Base Table (Restricted)


REVOKE ALL PRIVILEGES ON TABLE prod_catalog.sales.customers FROM PUBLIC;

Dynamic View with RLS


CREATE VIEW prod_catalog.sales.customers_secure AS
SELECT *
FROM prod_catalog.sales.customers
WHERE
  (is_member('group_us_analysts') AND country = 'USA')
  OR
  (is_member('group_eu_analysts') AND country = 'EU');

Grant Access Only to the View


GRANT SELECT ON VIEW prod_catalog.sales.customers_secure
TO `group_us_analysts`, `group_eu_analysts`;
Row-level security is enforced automatically based on group membership. No application changes required.

4️⃣ End-to-End Access Example

User      Group                   Result
User A    group_prod_engineers    Read + Write all rows
User B    group_dev_engineers     Read-only
User C    group_us_analysts       USA rows only

5️⃣ CI/CD Flow (Security Included)

Git Commit
  ↓
Terraform Apply
  ↓
SCIM Groups + Workspaces + UC Grants
  ↓
Users Login
  ↓
Access Automatically Enforced

6️⃣ Common Enterprise Anti-Patterns

  • Granting permissions to users instead of groups
  • Direct access to base tables (no views)
  • Mixing Dev and Prod users in same workspace
  • Manual permission changes via UI

7️⃣ Why Auditors Love This Setup

  • All access is code-reviewed
  • Clear separation of duties
  • Full traceability in Git
  • Zero manual overrides

8️⃣ Enterprise Databricks Blog Series Roadmap

Post      Description
Part 1    Identity, SSO & SCIM architecture
Part 2    Workspace isolation & networking
Part 3    Unity Catalog & RBAC as code
Part 4    Row-level & column-level security
Part 5    CI/CD promotion Dev → Prod
Part 6    Operating Databricks at scale

Final Takeaway

This approach gives you:

  • Enterprise-grade security by design
  • Zero-touch onboarding
  • Strong compliance posture
  • Infrastructure and data security as code

This is how Databricks is run in regulated enterprises.

AWS Databricks Enterprise Automation – Workspaces, Isolation & RBAC


AWS Databricks Enterprise Automation

Workspaces, Environment Isolation, Unity Catalog & RBAC – Fully Automated


Why Enterprise Automation Is Mandatory

In enterprise environments, Databricks must be deployed with:

  • Strict environment isolation (Dev / QA / Prod)
  • Centralized identity and access management
  • Fine-grained data access controls
  • Auditable and repeatable infrastructure

Manual workspace creation or UI-based permission management does not scale and introduces security risk. This blog shows how to automate everything on AWS.


High-Level Architecture

AWS Account
│
├── Databricks Account (Control Plane)
│   ├── Unity Catalog Metastore (Single, Central)
│   ├── SCIM Groups (Synced from IdP)
│   │
│   ├── Workspace: Dev
│   │   ├── VPC + Subnets
│   │   ├── S3 Bucket (Dev Only)
│   │   └── Cluster Policies (Small / Auto-Terminate)
│   │
│   └── Workspace: Prod
│       ├── VPC + Subnets
│       ├── S3 Bucket (Prod Only)
│       └── Cluster Policies (Restricted / Large)

Technology Stack Used

Component                    Purpose
Terraform                    Workspace, network, storage, cluster policy automation
Databricks REST API / SDK    Unity Catalog, RBAC, grants
AWS S3                       Managed storage for Unity Catalog
AWS IAM                      Secure access to data storage
SCIM Groups                  User → group → permission mapping

Step 0 – Prerequisites (One-Time Setup)

AWS Side

  • Create dedicated S3 buckets per environment
  • Create IAM roles with least privilege access
  • Enable VPC endpoints for S3 (no public internet)

Databricks Account

  • Databricks Enterprise (Premium) account
  • Account-level admin access
  • Unity Catalog enabled

Step 1 – Automated Workspace Creation (Terraform)

Provider Configuration


provider "databricks" {
  host  = var.databricks_account_host
  token = var.databricks_account_token
}

Storage Configuration


resource "databricks_mws_storage_configurations" "dev_storage" {
  account_id                 = var.account_id
  storage_configuration_name = "dev-storage"
  bucket_name                = "dbx-dev-bucket"
}

Workspace Creation


resource "databricks_mws_workspaces" "dev" {
  account_id     = var.account_id
  workspace_name = "dbx-dev"
  aws_region     = "us-east-1"
  # the dev IAM role is supplied via a databricks_mws_credentials resource
  credentials_id           = var.dev_credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.dev_storage.storage_configuration_id
  sku                      = "premium"
}
Each workspace is fully isolated at the network, storage, and compute layer.

Step 2 – Unity Catalog Metastore Automation

Create Metastore


resource "databricks_metastore" "main" {
  name          = "enterprise-metastore"
  storage_root  = "s3://databricks-uc-root/"
  region        = "us-east-1"
}

Attach Metastore to Workspaces


resource "databricks_metastore_assignment" "dev" {
  workspace_id = databricks_mws_workspaces.dev.workspace_id
  metastore_id = databricks_metastore.main.id
}

Step 3 – Catalog, Schema & Table Creation (Python)


from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

w.catalogs.create(
    name="prod_catalog",
    comment="Production data"
)

w.schemas.create(
    name="sales",
    catalog_name="prod_catalog"
)

Table Creation


spark.sql("""
CREATE TABLE prod_catalog.sales.customers (
  id STRING,
  name STRING,
  country STRING
) USING DELTA
""")

Step 4 – RBAC (User A vs User B Example)

Groups

  • group_prod_engineers
  • group_dev_engineers

Grant Permissions


w.grants.update(
  securable_type="table",
  securable_name="prod_catalog.sales.customers",
  changes=[
    {"principal": "group_prod_engineers", "privileges": ["SELECT", "MODIFY"]},
    {"principal": "group_dev_engineers", "privileges": ["SELECT"]}
  ]
)

Result

  • User A (Prod group): Read + Write
  • User B (Dev group): Read-only
RBAC is enforced at query time, not at notebook level.

Step 5 – Cluster Isolation with Policies


resource "databricks_cluster_policy" "prod_policy" {
  name = "prod-policy"
  definition = jsonencode({
    node_type_id = {
      type  = "fixed"
      value = "i3.2xlarge"
    }
    autotermination_minutes = {
      type  = "fixed"
      value = 60
    }
  })
}

Attach this policy only to group_prod_engineers.


Step 6 – CI/CD Automation Flow

Git Commit
  ↓
Terraform Apply
  ↓
Workspace + Storage + Policies
  ↓
Python SDK
  ↓
Catalogs + Schemas + RBAC

What This Enables Next

  • Safe cross-workspace data sharing
  • Read-only Prod access from Dev
  • Strong audit and compliance posture
  • Zero-touch onboarding for new teams

Enterprise Outcome

This setup gives you:

  • Environment isolation at every layer
  • Identity-driven access control
  • Full automation and repeatability
  • Security that auditors trust

Next Blog

Step 3 – Advanced Unity Catalog Patterns:
External Locations, Row-Level Security, Dynamic Views, and Cross-Account Sharing.

Step 1 – Workspace Strategy & Environment Isolation in Databricks


After completing Step 0: Identity Setup, the next critical task in enterprise onboarding is designing a robust workspace strategy and ensuring environment isolation. Workspaces in Databricks are execution boundaries that control compute, job execution, clusters, secrets, and repos. A well-designed strategy ensures safe deployment, governance, and compliance.


Why Workspace Strategy Matters

Poor workspace design can lead to:

  • Accidental production data access
  • Shared clusters across teams
  • Inconsistent governance and auditing

This step defines:

  • How many workspaces are needed
  • How environments (Dev/QA/Prod) are isolated
  • How identities and access flow across workspaces
---

Workspace vs Environment Responsibilities

Responsibility           Workspace    Unity Catalog
User login               ✅           -
Cluster configuration    ✅           -
Job execution            ✅           -
Data access              -            ✅
Table-level security     -            ✅
---

Step 1.1 – Decide Workspace Topology

Databricks Account
├── Dev Workspace
├── QA Workspace
└── Prod Workspace

Start simple — one workspace per environment. Avoid creating per-team or per-user workspaces initially. Workspace isolation ensures safe Dev → QA → Prod promotion.

---

Step 1.2 – Create Workspaces

Steps:

  1. Log in to Databricks Account Console
  2. Create workspace with required region, VNET, private endpoints, and storage account
  3. Repeat for Dev, QA, Prod

Example naming convention:

Dev  → dbx-dev-us-east
QA   → dbx-qa-us-east
Prod → dbx-prod-us-east
---

Step 1.3 – Define Environment-Specific Azure AD Groups

Combine role + environment in group names:

dbx-dev-admins
dbx-dev-engineers
dbx-dev-analysts

dbx-prod-admins
dbx-prod-engineers
dbx-prod-users

This enables:

  • Same person has Dev access but limited/no Prod access
  • Clear audit trail and separation of duties
---

Step 1.4 – Assign Groups to Workspaces

Workspace access is granted via the Databricks Account Console:

Dev Workspace Example

Group                 Permission
dbx-dev-admins        Workspace Admin
dbx-dev-engineers     Workspace User
dbx-prod-engineers    ❌ No Access

Prod Workspace Example

Group                 Permission
dbx-prod-admins       Workspace Admin
dbx-prod-engineers    Workspace User
dbx-dev-engineers     ❌ No Access
---

Step 1.5 – Authentication & Access Flow

User logs in
   |
   v
Azure AD SSO
   |
   v
Databricks Account checks:
    - Is user in a group assigned to this workspace?
        |
        +-- YES → Access granted
        +-- NO  → Workspace invisible
---

Step 1.6 – Workspace Admin vs Account Admin

Role               Scope
Account Admin      All workspaces, identity, global settings (2–3 people max)
Workspace Admin    Single workspace (clusters, jobs, repos)
---

Step 1.7 – Cluster & Job Isolation

Cluster policies per workspace:

  • Dev: small nodes, auto-termination, permissive libraries
  • Prod: fixed nodes, restricted libraries, no interactive clusters

Jobs are workspace-bound:

Git → Dev Workspace Job
        ↓
     QA Workspace Job
        ↓
     Prod Workspace Job

Secrets are workspace-scoped to ensure Dev/Prod isolation:

dev-kv/snowflake-password
prod-kv/snowflake-password
---

Step 1.8 – What Workspaces Do NOT Control

  • Table-level access
  • Row-level security
  • Column masking

These are handled by Unity Catalog in Step 2.

---

Step 1.9 – Common Mistakes to Avoid

  • Giving Dev engineers access to Prod workspace
  • Making everyone Workspace Admin
  • Using one workspace + folders for envs
  • Relying on notebook naming for isolation
---

Step 1.10 – Validation Checklist

  • Dev user logs in → sees Dev workspace only
  • Dev user tries Prod URL → access denied
  • Prod user logs in → sees Prod workspace only
  • Removing user from Azure AD group → access disappears automatically
  • No manual Databricks changes required
---

What Step 1 Enables Next

Because workspaces are properly isolated:
  • Unity Catalog can safely share data across workspaces
  • Prod data can be read-only from Dev
  • Cluster RBAC becomes enforceable
  • Auditors can validate separation of duties and compliance
---

Next Step: Step 2 – Unity Catalog Metastore & Data Isolation (Catalogs, schemas, table-level RBAC, cross-workspace sharing)

Enterprise Databricks Onboarding – Identity Setup (Azure AD)


Enterprise Databricks Onboarding – Step 0: Identity Setup (Azure AD → Databricks)

Identity is the foundation of every enterprise Databricks deployment. Before you talk about clusters, Unity Catalog, or RBAC, you must first answer one question:

Who is the user, and how is their access controlled?

In this step, we integrate Azure Active Directory (Azure AD) with Databricks Enterprise using SSO and SCIM provisioning.

---

Objective of Step 0

  • Azure AD becomes the single source of truth
  • No manual users or groups in Databricks
  • All access is group-based and auditable
  • Identity lifecycle is fully automated
---

High-Level Identity Architecture

+--------------------+
|     Azure AD       |
|--------------------|
| Users              |
| Groups             |
| MFA / CA Policies  |
+---------+----------+
          |
          | 1) SSO (SAML)
          |
          v
+--------------------+
| Databricks Account |
|--------------------|
| Authentication     |
| (Login)            |
+---------+----------+
          |
          | 2) SCIM Provisioning
          |
          v
+-----------------------------+
| Databricks Identity Store   |
|-----------------------------|
| Users (Read-only)           |
| Groups (SCIM-managed)       |
| Memberships                 |
+-----------------------------+
Key Principle:
Azure AD authenticates users (SSO). SCIM provisions users and groups. Databricks never owns identity.
---

First-Time Setup (Greenfield Environment)

Step 1: Define Identity Model

Databricks must consume identities — not create them.

  • ❌ No local Databricks users
  • ❌ No Databricks-only groups
  • ✅ Azure AD is authoritative
---

Step 2: Create Azure AD Groups (RBAC-Oriented)

Create role-based groups, not user-specific ones.

dbx-admins
dbx-platform
dbx-data-engineers
dbx-data-analysts
dbx-ml-engineers
dbx-prod-users
Never assign permissions directly to users later. All permissions must flow from groups.
---

Step 3: Create Azure Databricks Enterprise Application

  1. Azure Portal → Azure Active Directory
  2. Enterprise Applications → New Application
  3. Search for Azure Databricks
  4. Create the application

This application handles both SSO and SCIM provisioning.

---

Step 4: Configure SSO (Authentication)

SSO answers the question: Who are you?

SAML Configuration

Entity ID (Identifier):
https://accounts.azuredatabricks.net

Reply URL (ACS):
https://accounts.azuredatabricks.net/login/saml

User attributes:

email  → user.mail
name   → user.userprincipalname
---

SSO Login Flow

User Browser
     |
     v
Azure AD Login (MFA, CA)
     |
     v
SAML Assertion
     |
     v
Databricks Account Console

After this, users authenticate using corporate credentials only.

---

Step 5: Configure SCIM Provisioning

SCIM answers the question: What access does the user have?

Generate SCIM Token

  1. Databricks Account Console
  2. User Management → Generate SCIM token

Azure AD Provisioning Settings

Tenant URL:
https://accounts.azuredatabricks.net/api/2.0/accounts/<ACCOUNT_ID>/scim/v2

Authentication:
Bearer Token (SCIM Token)
---

SCIM Provisioning Flow

Azure AD
  |
  | Users + Groups + Memberships
  |
  v
SCIM API
  |
  v
Databricks Account
  |
  v
Workspaces / Unity Catalog / Clusters
---

Step 6: Assign Groups to the Application

Only assigned groups are synced.

Assigned Groups:
- dbx-admins
- dbx-data-engineers
- dbx-data-analysts
---

Day-2 Operations (After Go-Live)

Adding a New User

1. Create user in Azure AD
2. Add to dbx-data-engineers
3. SCIM sync runs
4. User appears in Databricks automatically

Removing a User

1. Disable user in Azure AD
2. SCIM removes user from Databricks
3. Access revoked everywhere

Changing User Role

Remove: dbx-data-analysts
Add:    dbx-data-engineers

All permissions update automatically without Databricks admin intervention.

---

Security & Compliance Benefits

  • Centralized identity management
  • Audit-friendly access controls
  • MFA and Conditional Access enforced
  • Zero-trust compatible
  • SOC2 / ISO aligned
---

Final Outcome of Step 0

Authentication → Azure AD
Authorization  → Groups
Provisioning   → SCIM
Databricks     → Identity Consumer

This identity foundation enables:

  • Unity Catalog RBAC
  • Cluster isolation
  • Workspace governance
  • Secure production onboarding
---

Next Blog: Step 1 – Workspace Strategy & Environment Isolation