Monday, 26 January 2026

Databricks APIs – Overview and Python Examples

Databricks APIs – Architecture, Types, and Python Examples

Databricks provides a comprehensive set of REST APIs to automate platform setup, workspace administration, data governance, compute management, and analytics workflows. These APIs are commonly used for infrastructure automation, CI/CD pipelines, and application onboarding.


Common Python Setup


import requests
import json

DATABRICKS_HOST = "https://<databricks-instance>"
TOKEN = "<DATABRICKS_TOKEN>"

HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

1. Account API

Purpose: Manage Databricks accounts and workspaces.

Documentation: Databricks Account API

Create a Workspace


url = f"{DATABRICKS_HOST}/api/2.0/accounts/<ACCOUNT_ID>/workspaces"

payload = {
    "workspace_name": "dev-workspace",
    "aws_region": "us-east-1",
    "credentials_id": "cred-id",
    "storage_configuration_id": "storage-id",
    "network_id": "network-id"
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())
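Workspace provisioning is asynchronous, so a common follow-up is to poll the workspace status. A minimal sketch, assuming the Account API's workspace_status field (verify field names against the current reference):

import time

workspace_id = response.json()["workspace_id"]
status_url = f"{DATABRICKS_HOST}/api/2.0/accounts/<ACCOUNT_ID>/workspaces/{workspace_id}"

while True:
    ws = requests.get(status_url, headers=HEADERS).json()
    # workspace_status moves from PROVISIONING to RUNNING (or FAILED)
    if ws.get("workspace_status") in ("RUNNING", "FAILED"):
        print(ws.get("workspace_status"), ws.get("workspace_status_message"))
        break
    time.sleep(30)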

2. SCIM API

Purpose: Manage users, groups, and service principals.

Documentation: Databricks SCIM API

Create a Service Principal


url = f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/ServicePrincipals"

payload = {
    "displayName": "my-app-sp"
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())

3. Unity Catalog API

Purpose: Centralized data governance for catalogs, schemas, and tables.

Documentation: Unity Catalog API

Create a Catalog


url = f"{DATABRICKS_HOST}/api/2.1/unity-catalog/catalogs"

payload = {
    "name": "sales_catalog",
    "comment": "Catalog for sales domain"
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())
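A catalog is typically followed by at least one schema. A minimal sketch against the same Unity Catalog endpoint family (the schema name and comment below are illustrative):

url = f"{DATABRICKS_HOST}/api/2.1/unity-catalog/schemas"

payload = {
    "name": "transactions",              # illustrative schema name
    "catalog_name": "sales_catalog",
    "comment": "Schema for sales transactions"
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())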

Grant Catalog Permission


url = f"{DATABRICKS_HOST}/api/2.1/unity-catalog/permissions/catalogs/sales_catalog"

payload = {
    "changes": [
        {
            "principal": "data_analysts",
            "add": ["USE_CATALOG"]
        }
    ]
}

response = requests.patch(url, headers=HEADERS, json=payload)
print(response.json())

4. Workspace API

Purpose: Manage clusters, jobs, notebooks, and workspace objects.

Documentation: Workspace API

Create a Cluster


url = f"{DATABRICKS_HOST}/api/2.0/clusters/create"

payload = {
    "cluster_name": "demo-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1,
    "autotermination_minutes": 30
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())
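Cluster creation returns immediately with a cluster_id while provisioning continues in the background. A small sketch that checks the cluster state using the response above:

url = f"{DATABRICKS_HOST}/api/2.0/clusters/get"

params = {"cluster_id": response.json()["cluster_id"]}

# state is typically PENDING while provisioning, then RUNNING
state = requests.get(url, headers=HEADERS, params=params).json()
print(state.get("state"))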

5. Jobs API

Purpose: Orchestrate batch and streaming workloads.

Documentation: Jobs API

Create a Job


url = f"{DATABRICKS_HOST}/api/2.1/jobs/create"

payload = {
    "name": "sample-job",
    "tasks": [
        {
            "task_key": "run_notebook",
            "notebook_task": {
                "notebook_path": "/Shared/sample_notebook"
            },
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1
            }
        }
    ]
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())

6. Repos API

Purpose: Integrate Git repositories.

Documentation: Repos API

Create a Repo
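A minimal sketch against the Repos API; the repository URL and target path below are placeholders:

url = f"{DATABRICKS_HOST}/api/2.0/repos"

payload = {
    "url": "https://github.com/my-org/my-repo.git",   # placeholder repository
    "provider": "gitHub",
    "path": "/Repos/ci-bot@company.com/my-repo"       # placeholder workspace path
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())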

Databricks APIs – Types & Summary

Types of Databricks APIs

Databricks provides a rich set of APIs to manage both the platform and workspace workloads. These APIs are categorized based on their scope and functionality, and they are critical for automation, CI/CD, governance, and onboarding of applications.

1. Account-Level APIs (Control Plane)

These APIs manage the Databricks account itself. They allow platform engineers to create and configure workspaces, set up Unity Catalog (metastores), manage networking, storage credentials, and service principals.

Official Databricks Account API Docs

2. Workspace-Level APIs (Data Plane)

These APIs operate inside a single workspace to manage data workloads such as:

  • Clusters, Jobs, and Libraries
  • DBFS file storage
  • Secrets & instance pools
  • SQL Warehouses

Official Workspace REST API Docs

3. Unity Catalog / Metastore APIs

These APIs manage metadata, governance, and data access across multiple workspaces:

  • Create catalogs, schemas, tables, and external locations
  • Grant permissions at table, column, or catalog level
  • Attach or detach workspaces to a metastore

Unity Catalog API Reference

4. Repos API

Used to manage Git repositories integrated with Databricks (GitHub, GitLab, Azure DevOps). Enables CI/CD automation for notebooks.

Repos API Docs

5. Tokens & Authentication APIs

Used to manage personal access tokens (PATs) and service principal tokens for automation pipelines.

Token API Docs
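A minimal sketch that creates a short-lived PAT, reusing the common Python setup from the previous section (the lifetime and comment are illustrative):

url = f"{DATABRICKS_HOST}/api/2.0/token/create"

payload = {
    "lifetime_seconds": 3600,                      # illustrative lifetime
    "comment": "short-lived token for CI pipeline"
}

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json()["token_info"]["token_id"])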

6. SCIM API

Manages users, groups, and service principals for identity management and enterprise compliance. Databricks implements the SCIM 2.0 standard.

SCIM API Docs

7. SQL API

Enables programmatic execution of SQL queries and management of SQL endpoints / warehouses.

SQL API Docs

8. MLflow API

Manages the machine learning lifecycle including experiments, runs, and model registry.

MLflow API Docs
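A minimal sketch that creates an experiment through the MLflow REST API, reusing the common Python setup above (the experiment path is illustrative):

url = f"{DATABRICKS_HOST}/api/2.0/mlflow/experiments/create"

payload = {"name": "/Shared/experiments/churn-model"}  # illustrative experiment path

response = requests.post(url, headers=HEADERS, json=payload)
print(response.json())  # returns an experiment_id on success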

Summary Table of Databricks APIs

API Type | Scope | Purpose | Official Documentation
Account API | Account | Platform setup & governance: workspaces, metastore, network, credentials, service principals | Docs
Workspace REST API | Workspace | Data plane workloads: clusters, jobs, DBFS, libraries, secrets | Docs
SCIM API | Workspace / Account | Identity management: users, groups, service principals | Docs
Unity Catalog / Metastore API | Account + Workspaces | Data governance, catalogs, schemas, tables, permissions, external locations | Docs
Repos API | Workspace | Git repository integration for CI/CD | Docs
Tokens / Authentication API | Account / Workspace | Manage PATs & service principal tokens | Docs
SQL API | Workspace | Programmatic SQL execution & SQL endpoint management | Docs
MLflow API | Workspace | Machine learning lifecycle: experiments, runs, model registry | Docs

For the full index of all Databricks APIs and SDKs: Databricks API Reference

Friday, 16 January 2026

Enterprise Databricks on AWS – Zero Trust, Unity Catalog & Audit-Ready Architecture

Enterprise Databricks on AWS: Zero-Trust, Unity Catalog & Audit-Ready Architecture

This document explains how to design and implement Databricks on AWS using Zero-Trust principles, Unity Catalog enforced security, cross-account data sharing, and an audit-ready architecture.


1. Zero-Trust Databricks Deployment (AWS)

What Zero-Trust Means for Databricks

  • No public IPs
  • No inbound internet access
  • Explicit identity-based access only
  • All access is authenticated, authorized, and logged

Core AWS Components

  • Dedicated VPC per Databricks workspace
  • Private subnets only
  • VPC Endpoints (PrivateLink)
  • IAM roles with least privilege
  • Security Groups with deny-by-default

VPC Design

VPC (10.0.0.0/16)
├── Private Subnet A (10.0.1.0/24) - Databricks Compute
├── Private Subnet B (10.0.2.0/24) - Databricks Compute
├── VPC Endpoint Subnet
└── No Internet Gateway

Required VPC Endpoints

  • com.amazonaws.<region>.s3
  • com.amazonaws.<region>.sts
  • com.amazonaws.<region>.logs
  • com.amazonaws.<region>.monitoring
  • Databricks Control Plane PrivateLink endpoints
Why: Databricks clusters must communicate with AWS services without touching the public internet.

Security Groups

  • No inbound rules
  • Outbound only to:
    • VPC endpoints
    • Databricks control plane CIDRs

2. Unity Catalog Enforced Security

Why Unity Catalog Is Mandatory for Enterprises

  • Centralized governance
  • Fine-grained RBAC (catalog, schema, table, column, row)
  • Cross-workspace data sharing
  • Built-in auditing

Unity Catalog Core Objects

Metastore
 ├── Catalog (prod_sales)
 │    ├── Schema (orders)
 │    │    └── Table (transactions)

Metastore Setup (AWS)

  • Create S3 bucket for UC storage
  • Enable versioning & encryption (SSE-KMS)
  • Attach IAM role to Databricks
S3 Bucket Policy:
- Allow Databricks IAM Role
- Deny public access
- Enforce TLS

RBAC Example

Group: analytics_team
Permissions:
- USE CATALOG prod_sales
- USE SCHEMA prod_sales.orders
- SELECT ON TABLE prod_sales.orders.transactions

Row-Level Security (Dynamic Views)

CREATE VIEW prod_sales.orders.secure_transactions AS
SELECT *
FROM prod_sales.orders.transactions
WHERE region = current_user();

3. Cross-Account Data Sharing (Unity Catalog)

Use Case

  • Producer account owns raw data
  • Consumer account reads curated data
  • No data copy

Architecture

Account A (Producer)
 └── Unity Catalog Metastore
      └── Shared Catalog

Account B (Consumer)
 └── Databricks Workspace
      └── Read-only access

How Sharing Works

  • Delta Sharing protocol
  • IAM role trust between accounts
  • Read-only permissions

Security Guarantees

  • No write access
  • All queries logged
  • Column and row filters enforced

4. Audit-Ready Architecture

Audit Requirements Covered

  • Who accessed what data
  • When queries were run
  • From which workspace
  • Using which identity

Audit Logs

  • Databricks audit logs → S3
  • CloudTrail for IAM & API calls
  • S3 access logs

Audit Log Flow

Databricks → S3 (Audit Logs)
AWS CloudTrail → S3
S3 → SIEM / Athena / OpenSearch

What Auditors Love

  • No shared credentials
  • Identity-based access
  • Immutable logs
  • Separation of duties

5. End-to-End Control Summary

Layer | Control
Network | Private VPC, PrivateLink, no internet
Identity | IAM + Databricks SCIM groups
Compute | Cluster policies & group binding
Data | Unity Catalog RBAC + RLS
Audit | Centralized logs in S3

Final Outcome

  • Zero-trust Databricks deployment
  • Centralized governance via Unity Catalog
  • Secure cross-account data sharing
  • Fully audit-ready enterprise platform

This architecture scales cleanly across Dev / Test / Prod, supports regulated workloads, and aligns with financial-grade security standards.

Databricks on AWS – Networking, Security & PrivateLink Architecture (Deep Dive)

Databricks on AWS – Complete Networking & Security Architecture Guide

This document explains how Databricks is deployed securely on AWS, focusing on:

  • VPC & subnet design
  • Control plane vs data plane
  • IAM roles & instance profiles
  • Security groups & traffic flow
  • PrivateLink (frontend & backend)

1️⃣ Databricks Architecture Overview

Control Plane vs Data Plane

Plane | Owned By | What Runs Here
Control Plane | Databricks | UI, REST APIs, jobs scheduler, notebook metadata
Data Plane | Customer AWS Account | Clusters, Spark executors, DBFS root, data access
Key rule: Your data never leaves your AWS account.

2️⃣ VPC Design (Customer-Managed)

Why Customer-Managed VPC?

  • Network isolation
  • PrivateLink support
  • Compliance (SOC2, PCI, HIPAA)

Recommended VPC Layout

VPC (10.0.0.0/16)
│
├── Private Subnet A (10.0.1.0/24)
│   └── Databricks Workers
│
├── Private Subnet B (10.0.2.0/24)
│   └── Databricks Workers
│
├── Public Subnet (optional)
│   └── NAT Gateway
│
└── VPC Endpoints
    ├── S3
    ├── STS
    ├── Kinesis (optional)
    └── Databricks PrivateLink
Databricks clusters should never be in public subnets.

3️⃣ Subnets & Routing

Private Subnets

  • No public IPs
  • Route to NAT Gateway (only if needed)
  • Preferred: VPC endpoints instead of NAT

Route Table (Private Subnet)

0.0.0.0/0 → NAT Gateway (optional)
pl-xxxxxx → Databricks PrivateLink
s3 → Gateway Endpoint

4️⃣ Security Groups (CRITICAL)

Databricks Cluster Security Group

Direction | Port | Source | Purpose
Inbound | All | Self | Worker ↔ Worker communication
Outbound | 443 | 0.0.0.0/0 or VPC endpoints | Control plane, S3, APIs
Databricks requires full intra-cluster communication.

5️⃣ IAM Roles & Instance Profiles

Why IAM Roles?

  • No access keys on clusters
  • Least privilege data access
  • Auditable via CloudTrail

Databricks EC2 Role

Trust Policy:
Service: ec2.amazonaws.com

Permissions Policy

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::prod-data",
    "arn:aws:s3:::prod-data/*"
  ]
}

Instance Profile

  • IAM Role → Instance Profile
  • Attached to Databricks clusters

6️⃣ PrivateLink Architecture

Frontend PrivateLink

  • Users access Databricks UI privately
  • No public internet exposure

Backend PrivateLink

  • Clusters talk to control plane privately
  • No NAT gateway required

Required VPC Endpoints

Endpoint | Type
Databricks Control Plane | Interface
S3 | Gateway
STS | Interface
CloudWatch | Interface

7️⃣ Traffic Flow (End-to-End)

User Browser
  ↓ (PrivateLink)
Databricks Control Plane
  ↓ (PrivateLink)
Cluster Driver (Private Subnet)
  ↓
S3 via VPC Endpoint
At no point does traffic traverse the public internet.

8️⃣ Common Enterprise Decisions

Decision | Recommendation
Public vs Private workspace | Private (PrivateLink)
NAT Gateway | Avoid if endpoints available
IAM Users | Never
Data access | IAM Roles + Unity Catalog

9️⃣ What This Enables Next

  • Zero-trust Databricks deployment
  • Unity Catalog enforced security
  • Cross-account data sharing
  • Audit-ready architecture

10️⃣ Typical Enterprise Follow-Up Topics

  • Terraform modules for networking
  • Private DNS for Databricks
  • Multi-account AWS architecture
  • Cost & network optimization
This architecture is used by banks, healthcare, and regulated enterprises.

Thursday, 15 January 2026

Databricks REST API – Complete Enterprise Automation Guide (Python + AWS)

Databricks REST API – Complete Enterprise Automation Guide

This guide documents almost all commonly used Databricks REST API endpoints with working Python examples for enterprise automation on AWS.


0️⃣ Authentication & Base Configuration

Account-Level APIs

Base URL: https://accounts.cloud.databricks.com
Auth: Account PAT

Workspace-Level APIs

Base URL: https://dbc-xxxx.region.databricks.com
Auth: Workspace PAT

import requests

ACCOUNT_ID = "xxxx"
ACCOUNT_HOST = "https://accounts.cloud.databricks.com"
WORKSPACE_HOST = "https://dbc-xxxx.us-east-1.databricks.com"

ACCOUNT_HEADERS = {
    "Authorization": "Bearer ACCOUNT_TOKEN",
    "Content-Type": "application/json"
}

WORKSPACE_HEADERS = {
    "Authorization": "Bearer WORKSPACE_TOKEN",
    "Content-Type": "application/json"
}

1️⃣ Identity & SCIM APIs

Endpoint | Purpose
POST /scim/v2/Users | Create user
GET /scim/v2/Users | List users
POST /scim/v2/Groups | Create group
PATCH /scim/v2/Groups/{id} | Add/remove members

Create User

url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Users"
payload = {
  "userName": "alice@company.com",
  "displayName": "Alice",
  "active": true
}
requests.post(url, headers=ACCOUNT_HEADERS, json=payload).raise_for_status()

Create Group

url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Groups"
payload = {"displayName": "data-engineers"}
group = requests.post(url, headers=ACCOUNT_HEADERS, json=payload).json()
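The table above also lists PATCH /scim/v2/Groups/{id} for membership changes; a sketch using the standard SCIM PatchOp body (the user id is a placeholder):

url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Groups/{group['id']}"
payload = {
  "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
  "Operations": [
    {"op": "add", "path": "members", "value": [{"value": "<USER_ID>"}]}
  ]
}
requests.patch(url, headers=ACCOUNT_HEADERS, json=payload).raise_for_status()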

2️⃣ Workspace (Account-Level) APIs

Endpoint | Description
POST /workspaces | Create workspace
GET /workspaces | List workspaces
POST /permissionassignments | Assign groups to workspace

Create Workspace

url = f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/workspaces"
payload = {
  "workspace_name": "prod",
  "aws_region": "us-east-1",
  "credentials_id": "cred-123",
  "storage_configuration_id": "storage-123",
  "network_id": "network-123"
}
workspace = requests.post(url, headers=ACCOUNT_HEADERS, json=payload).json()
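The table above also lists a permission-assignment endpoint; the exact path and verb should be confirmed against the current Workspace Assignment API docs, but the call is roughly:

url = (
    f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}"
    f"/workspaces/{workspace['workspace_id']}"
    f"/permissionassignments/principals/<GROUP_ID>"   # placeholder principal id
)
payload = {"permissions": ["USER"]}
requests.put(url, headers=ACCOUNT_HEADERS, json=payload).raise_for_status()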

3️⃣ Cluster APIs

Endpoint | Description
POST /clusters/create | Create cluster
GET /clusters/list | List clusters
POST /clusters/start | Start cluster
POST /clusters/delete | Delete cluster

Create Cluster

url = f"{WORKSPACE_HOST}/api/2.0/clusters/create"
payload = {
  "cluster_name": "engineering",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "m5.xlarge",
  "num_workers": 2
}
cluster = requests.post(url, headers=WORKSPACE_HEADERS, json=payload).json()

Set Cluster Permissions

url = f"{WORKSPACE_HOST}/api/2.0/permissions/clusters/{cluster['cluster_id']}"
payload = {
  "access_control_list": [
    {
      "group_name": "data-engineers",
      "permission_level": "CAN_ATTACH_TO"
    }
  ]
}
requests.patch(url, headers=WORKSPACE_HEADERS, json=payload)

4️⃣ Jobs API

Endpoint | Purpose
POST /jobs/create | Create job
POST /jobs/run-now | Run job
GET /jobs/list | List jobs

Create Job

url = f"{WORKSPACE_HOST}/api/2.0/jobs/create"
payload = {
  "name": "etl-job",
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "m5.large",
    "num_workers": 2
  },
  "notebook_task": {
    "notebook_path": "/Shared/etl"
  }
}
job = requests.post(url, headers=WORKSPACE_HEADERS, json=payload).json()
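Triggering the job just created (a small sketch; run-now is also available under /api/2.0):

url = f"{WORKSPACE_HOST}/api/2.1/jobs/run-now"
payload = {"job_id": job["job_id"]}
run = requests.post(url, headers=WORKSPACE_HEADERS, json=payload).json()
print(run)  # contains run_id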

5️⃣ SQL & Warehouses API

Endpoint | Description
POST /sql/warehouses | Create SQL warehouse
POST /sql/statements | Execute SQL

Execute SQL

url = f"{WORKSPACE_HOST}/api/2.0/sql/statements"
payload = {
  "statement": "SELECT current_user(), current_date()",
  "warehouse_id": "wh-123"
}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)
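Statement execution is asynchronous; capturing the response from the call above lets you poll for completion (a sketch using the Statement Execution API's statement_id and status fields):

stmt = requests.post(url, headers=WORKSPACE_HEADERS, json=payload).json()

status_url = f"{WORKSPACE_HOST}/api/2.0/sql/statements/{stmt['statement_id']}"
status = requests.get(status_url, headers=WORKSPACE_HEADERS).json()
print(status["status"]["state"])  # e.g. PENDING, RUNNING, SUCCEEDED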

6️⃣ DBFS & Workspace APIs

Endpoint | Description
POST /dbfs/put | Upload file
GET /workspace/list | List notebooks
POST /workspace/import | Import notebook

Upload File to DBFS

url = f"{WORKSPACE_HOST}/api/2.0/dbfs/put"
payload = {
  "path": "/tmp/data.txt",
  "contents": "SGVsbG8="
}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)
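The table above also lists POST /workspace/import; a minimal sketch that imports an inline Python notebook (path and source are illustrative):

import base64

url = f"{WORKSPACE_HOST}/api/2.0/workspace/import"

source = "print('hello from an imported notebook')"

payload = {
    "path": "/Shared/imported_notebook",
    "format": "SOURCE",
    "language": "PYTHON",
    "content": base64.b64encode(source.encode()).decode(),
    "overwrite": True
}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)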

7️⃣ Unity Catalog APIs (Most Used)

Endpoint | Description
POST /unity-catalog/catalogs | Create catalog
POST /unity-catalog/schemas | Create schema
POST /unity-catalog/tables | Create table
PATCH /unity-catalog/permissions | Grant access

Create Catalog

url = f"{WORKSPACE_HOST}/api/2.1/unity-catalog/catalogs"
payload = {"name": "finance"}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)

Grant Table Access

url = f"{WORKSPACE_HOST}/api/2.1/unity-catalog/permissions/table/finance.payments.txns"
payload = {
  "changes": [{
    "principal": "data-scientists",
    "add": ["SELECT"]
  }]
}
requests.patch(url, headers=WORKSPACE_HEADERS, json=payload)

8️⃣ Tokens, Secrets, Repos

Endpoint | Use
POST /token/create | Create PAT
POST /secrets/scopes/create | Create secret scope
POST /repos | Create repo
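A minimal sketch that creates a secret scope (the scope name is illustrative):

url = f"{WORKSPACE_HOST}/api/2.0/secrets/scopes/create"

payload = {
    "scope": "etl-secrets",                 # illustrative scope name
    "initial_manage_principal": "users"
}
requests.post(url, headers=WORKSPACE_HEADERS, json=payload)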

9️⃣ Enterprise Best Practices

  • Terraform for bootstrap & security
  • Python APIs for day-2 operations
  • Unity Catalog for ALL data access
  • No IAM-based data access
This API-first approach is used by regulated banks, fintech, and large enterprises.

Next Topics You Can Publish

  • Databricks CI/CD pipelines
  • API error handling & retries
  • Zero-trust data architecture
  • Cross-account Unity Catalog sharing

Enterprise Databricks on AWS – Identity, Workspace Isolation, Unity Catalog & RBAC (Terraform-Only)

Enterprise Databricks on AWS – Terraform-First Architecture

This article explains how to build a fully automated, enterprise-grade Databricks platform on AWS using Terraform only, covering:

  • SCIM & Identity automation
  • Workspace creation and isolation
  • Unity Catalog metastore & data isolation
  • Catalog, schema, table-level RBAC
  • Row-level security using dynamic views
  • Cross-account AWS data sharing

High-Level Enterprise Architecture

AWS Account (Databricks Account)
│
├── Account Console
│   ├── SCIM Users & Groups (Terraform)
│   ├── Unity Catalog Metastore (Terraform)
│   └── Workspaces (Dev / QA / Prod)
│
├── AWS Account A (Prod Data)
│   ├── S3 UC Managed Location
│   └── IAM Role (External Location)
│
├── AWS Account B (Analytics)
│   └── Read-only access via UC Sharing
│
└── Azure AD / Okta
    └── Identity Source (SSO + SCIM)
Design Principle: Identity, access, and data governance are controlled centrally at the Databricks Account level.

1. SCIM Group Automation with Terraform

Why SCIM Matters

SCIM ensures that Databricks users and groups are never created manually. Azure AD (or Okta) remains the source of truth.

Terraform – Databricks Account Provider

provider "databricks" {
  alias      = "account"
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
}

Create Groups (Mirrors Azure AD)

resource "databricks_group" "data_engineers" {
  provider     = databricks.account
  display_name = "data-engineers"
}

resource "databricks_group" "data_scientists" {
  provider     = databricks.account
  display_name = "data-scientists"
}

Assign Users (SCIM)

resource "databricks_user" "alice" {
  provider  = databricks.account
  user_name = "alice@company.com"
}

resource "databricks_group_member" "alice_engineers" {
  provider  = databricks.account
  group_id = databricks_group.data_engineers.id
  member_id = databricks_user.alice.id
}
Result: Azure AD → SCIM → Databricks is now fully automated.

2. Workspace Creation & Environment Isolation

Enterprise Workspace Strategy

  • One workspace per environment
  • Dev cannot modify Prod
  • Shared metastore across workspaces

Create Workspace (AWS)

resource "databricks_mws_workspaces" "prod" {
  provider      = databricks.account
  workspace_name = "prod-workspace"
  aws_region     = "us-east-1"

  credentials_id = databricks_mws_credentials.this.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
}

Attach Groups to Workspace

resource "databricks_mws_permission_assignment" "prod_admins" {
  provider     = databricks.account
  workspace_id = databricks_mws_workspaces.prod.workspace_id
  principal_id = databricks_group.data_engineers.id
  permissions  = ["ADMIN"]
}

3. Unity Catalog Metastore – Terraform-Only

Create Metastore

resource "databricks_metastore" "main" {
  provider     = databricks.account
  name         = "enterprise-metastore"
  region       = "us-east-1"
  storage_root = "s3://uc-metastore-root/"
}

Attach Metastore to Workspace

resource "databricks_metastore_assignment" "prod" {
  provider     = databricks.account
  workspace_id = databricks_mws_workspaces.prod.workspace_id
  metastore_id = databricks_metastore.main.id
}

4. Unity Catalog RBAC as Code (grants.tf)

Create Catalogs per Domain

resource "databricks_catalog" "finance" {
  name = "finance"
}

Create Schemas

resource "databricks_schema" "payments" {
  name       = "payments"
  catalog_name = databricks_catalog.finance.name
}

Grant Permissions

resource "databricks_grants" "finance_read" {
  catalog = databricks_catalog.finance.name

  grant {
    principal  = "data-scientists"
    privileges = ["USE_CATALOG"]
  }
}
All permissions are version-controlled and auditable.

5. Row-Level Security (Dynamic Views)

Use Case

  • US team sees US data
  • EU team sees EU data

Dynamic View

CREATE OR REPLACE VIEW finance.payments.secure_payments AS
SELECT *
FROM finance.payments.raw
WHERE region = current_user();
No data duplication. No application-side filtering.

6. Cross-Account AWS Sharing with Unity Catalog

Producer Account (Prod)

CREATE SHARE finance_share;
ALTER SHARE finance_share ADD TABLE finance.payments.raw;

Consumer Account

CREATE CATALOG finance_shared
USING SHARE databricks.finance_share;
S3 access is mediated by UC – not IAM users.

7. Decision Diagrams for Architects

Identity Decision

Azure AD
 ├── Manual Users ❌
 └── SCIM + SSO ✅

Data Access Decision

IAM Policies ❌
Unity Catalog Grants ✅

Security Model

Workspace ACLs → Compute
Unity Catalog → Data

What This Enables Next

  • Prod data read-only from Dev
  • Cluster RBAC enforced
  • Auditor-friendly access logs
  • Multi-account AWS sharing
This is the reference architecture used by regulated enterprises.

Suggested Multi-Post Series

  1. Identity & SCIM Automation
  2. Workspace Isolation Strategy
  3. Unity Catalog Deep Dive
  4. RBAC & Data Security Patterns
  5. Cross-Account Data Sharing

Unity Catalog Metastore & Data Isolation – Enterprise Deep Dive

Unity Catalog Metastore & Data Isolation – Enterprise Deep Dive

Unity Catalog Metastore & Data Isolation

Enterprise-Level Technical Deep Dive with Real Examples (AWS Databricks)


1. What a Unity Catalog Metastore Really Is

A Unity Catalog metastore is the central security and governance control plane for Databricks. It owns:

  • All metadata (catalogs, schemas, tables, views, functions)
  • All permissions (RBAC, RLS, CLS)
  • Access to physical storage through credentials and locations
The workspace is NOT the security boundary for data. The metastore is.

2. Metastore Scope & Design Decision

Enterprise Best Practice

One Metastore per:
- Cloud
- Region
- Compliance Boundary

Why This Matters

  • Enables cross-workspace data sharing
  • Centralizes governance and audit
  • Prevents duplicated security logic
Anti-pattern: one metastore per workspace. This breaks data sharing and multiplies governance overhead.

3. Real Enterprise Architecture (AWS)

AWS Account
│
├── Unity Catalog Metastore (us-east-1)
│   ├── Storage Root
│   ├── Storage Credentials
│   ├── External Locations
│   ├── Catalog: prod
│   └── Catalog: dev
│
├── Databricks Workspace: dev
└── Databricks Workspace: prod

Both workspaces attach to the same metastore.


4. Metastore Storage Root

The storage root is the default storage for managed tables. Users never access this directly.

Example


s3://company-uc-root/

IAM Role Permissions

  • s3:GetObject
  • s3:PutObject
  • s3:ListBucket
Users and clusters do NOT get these permissions directly.

5. Storage Credentials

A storage credential is a Unity Catalog object that wraps an IAM role.

Example


CREATE STORAGE CREDENTIAL prod_storage_cred
WITH IAM_ROLE 'arn:aws:iam::123456789:role/dbx-prod-uc-role';

This decouples cloud IAM from users completely.


6. External Locations (Actual Data Isolation)

External locations bind:

  • S3 path
  • Storage credential

Example


CREATE EXTERNAL LOCATION prod_sales_loc
URL 's3://prod-sales-data/'
WITH STORAGE CREDENTIAL prod_storage_cred;
Without an external location, Unity Catalog blocks access — even if S3 exists.

7. Catalog-Level Isolation

Catalogs are the first logical isolation layer.

Example


CREATE CATALOG prod;
CREATE CATALOG dev;

Access Control


GRANT USE CATALOG ON CATALOG prod TO `group_prod_users`;

8. Schema-Level Isolation

Schemas isolate teams or business domains.

Example


CREATE SCHEMA prod.sales;
CREATE SCHEMA prod.finance;

GRANT SELECT ON SCHEMA prod.sales
TO `group_sales_analytics`;

9. Table-Level Isolation

Tables are where most security risk exists.

Example


GRANT SELECT, MODIFY
ON TABLE prod.sales.customers
TO `group_sales_engineers`;
Never grant access to PUBLIC.

10. Cross-Workspace Data Sharing

Scenario

  • Dev workspace needs read-only access to Prod data

Solution


GRANT SELECT
ON TABLE prod.sales.customers
TO `group_dev_engineers`;

No S3 access required. Unity Catalog enforces this.


11. Row-Level Security (Dynamic Views)

Business Rule

Group | Country Access
group_us_analysts | USA
group_eu_analysts | EU

Dynamic View


CREATE VIEW prod.sales.customers_secure AS
SELECT *
FROM prod.sales.customers
WHERE
  (is_member('group_us_analysts') AND country = 'US')
  OR
  (is_member('group_eu_analysts') AND country = 'EU');

12. Column-Level Security

Example


CREATE VIEW prod.sales.customers_masked AS
SELECT
  id,
  name,
  CASE
    WHEN is_member('group_pii_admins') THEN ssn
    ELSE 'XXX-XX-XXXX'
  END AS ssn
FROM prod.sales.customers;

13. Managed vs External Tables

Type | Storage | Use Case
Managed | UC Root | Dev, sandbox
External | External Location | Prod, regulated data

14. How Security Is Actually Enforced

  • At query planning
  • At query execution

Even if a user knows the S3 path, Unity Catalog blocks access.


15. Auditing & Lineage

Unity Catalog automatically captures:

  • Who accessed what
  • Which queries touched which tables
  • Downstream dependencies

Example Query


SELECT * FROM system.access.audit;

16. Common Enterprise Mistakes

  • Multiple metastores per environment
  • Granting S3 access to users
  • Relying on workspace ACLs for data
  • No catalog separation

17. Enterprise Golden Rules

  1. One metastore per region
  2. Always use groups
  3. Never grant to PUBLIC
  4. Use views for sensitive data
  5. Treat UC as a security firewall

18. End-to-End Access Example

User | Group | Read | Write
User A | group_prod_engineers | All | Yes
User B | group_dev_engineers | All | No
User C | group_us_analysts | US only | No

Final Summary

Unity Catalog is not just metadata. It is your data firewall, governance engine, and compliance backbone.

If the metastore is designed correctly, everything else becomes simple.

Enterprise Databricks Automation on AWS – Identity, RBAC & Security as Code

Enterprise Databricks Automation on AWS

SCIM, Unity Catalog RBAC, Row-Level Security & Security-as-Code


Where This Fits in the Enterprise Series

Post | Topic
Step 0 | Identity setup (SSO + SCIM)
Step 1 | Workspace strategy & environment isolation
Step 2 | Unity Catalog metastore & data isolation
Step 3 | Identity, RBAC & data security as code (this post)
Step 4 | CI/CD & promotion pipelines

1️⃣ SCIM Group Automation with Terraform

Why SCIM Automation Is Mandatory

  • No manual user or group creation
  • Identity source of truth = IdP
  • Permissions change automatically with group membership

Provider Configuration (Account Level)


provider "databricks" {
  host  = var.databricks_account_host
  token = var.databricks_account_token
}

Create Groups via Terraform


resource "databricks_group" "prod_engineers" {
  display_name = "group_prod_engineers"
}

resource "databricks_group" "dev_engineers" {
  display_name = "group_dev_engineers"
}

Add Users to Groups


resource "databricks_group_member" "prod_user" {
  group_id  = databricks_group.prod_engineers.id
  member_id = databricks_user.user_a.id
}
In real enterprises, users are synced automatically from IdP via SCIM. Terraform manages only group-level logic.

2️⃣ Unity Catalog RBAC as Code (grants.tf)

Why RBAC as Code Matters

  • Auditable permissions
  • No UI drift
  • Consistent across environments

grants.tf – Catalog-Level Access


resource "databricks_grants" "catalog_usage" {
  catalog = "prod_catalog"

  grant {
    principal  = "group_prod_engineers"
    privileges = ["USAGE"]
  }

  grant {
    principal  = "group_dev_engineers"
    privileges = ["USAGE"]
  }
}

Schema-Level RBAC


resource "databricks_grants" "sales_schema" {
  schema = "prod_catalog.sales"

  grant {
    principal  = "group_prod_engineers"
    privileges = ["CREATE", "SELECT"]
  }

  grant {
    principal  = "group_dev_engineers"
    privileges = ["SELECT"]
  }
}

Table-Level RBAC


resource "databricks_grants" "customers_table" {
  table = "prod_catalog.sales.customers"

  grant {
    principal  = "group_prod_engineers"
    privileges = ["SELECT", "MODIFY"]
  }

  grant {
    principal  = "group_dev_engineers"
    privileges = ["SELECT"]
  }
}
Permissions are enforced by Unity Catalog at query execution time.

3️⃣ Row-Level Security (Dynamic Views)

Use Case

User Group | Allowed Country
group_us_analysts | USA
group_eu_analysts | EU

Base Table (Restricted)


REVOKE ALL PRIVILEGES ON TABLE prod_catalog.sales.customers FROM PUBLIC;

Dynamic View with RLS


CREATE VIEW prod_catalog.sales.customers_secure AS
SELECT *
FROM prod_catalog.sales.customers
WHERE
  (is_member('group_us_analysts') AND country = 'USA')
  OR
  (is_member('group_eu_analysts') AND country = 'EU');

Grant Access Only to the View


GRANT SELECT ON VIEW prod_catalog.sales.customers_secure
TO `group_us_analysts`, `group_eu_analysts`;
Row-level security is enforced automatically based on group membership. No application changes required.

4️⃣ End-to-End Access Example

User | Group | Result
User A | group_prod_engineers | Read + Write all rows
User B | group_dev_engineers | Read-only
User C | group_us_analysts | USA rows only

5️⃣ CI/CD Flow (Security Included)

Git Commit
  ↓
Terraform Apply
  ↓
SCIM Groups + Workspaces + UC Grants
  ↓
Users Login
  ↓
Access Automatically Enforced

6️⃣ Common Enterprise Anti-Patterns

  • Granting permissions to users instead of groups
  • Direct access to base tables (no views)
  • Mixing Dev and Prod users in same workspace
  • Manual permission changes via UI

7️⃣ Why Auditors Love This Setup

  • All access is code-reviewed
  • Clear separation of duties
  • Full traceability in Git
  • Zero manual overrides

8️⃣ Enterprise Databricks Blog Series Roadmap

Post | Description
Part 1 | Identity, SSO & SCIM architecture
Part 2 | Workspace isolation & networking
Part 3 | Unity Catalog & RBAC as code
Part 4 | Row-level & column-level security
Part 5 | CI/CD promotion Dev → Prod
Part 6 | Operating Databricks at scale

Final Takeaway

This approach gives you:

  • Enterprise-grade security by design
  • Zero-touch onboarding
  • Strong compliance posture
  • Infrastructure and data security as code

This is how Databricks is run in regulated enterprises.

AWS Databricks Enterprise Automation – Workspaces, Isolation & RBAC

AWS Databricks Enterprise Automation

Workspaces, Environment Isolation, Unity Catalog & RBAC – Fully Automated


Why Enterprise Automation Is Mandatory

In enterprise environments, Databricks must be deployed with:

  • Strict environment isolation (Dev / QA / Prod)
  • Centralized identity and access management
  • Fine-grained data access controls
  • Auditable and repeatable infrastructure

Manual workspace creation or UI-based permission management does not scale and introduces security risk. This blog shows how to automate everything on AWS.


High-Level Architecture

AWS Account
│
├── Databricks Account (Control Plane)
│   ├── Unity Catalog Metastore (Single, Central)
│   ├── SCIM Groups (Synced from IdP)
│   │
│   ├── Workspace: Dev
│   │   ├── VPC + Subnets
│   │   ├── S3 Bucket (Dev Only)
│   │   └── Cluster Policies (Small / Auto-Terminate)
│   │
│   └── Workspace: Prod
│       ├── VPC + Subnets
│       ├── S3 Bucket (Prod Only)
│       └── Cluster Policies (Restricted / Large)

Technology Stack Used

Component | Purpose
Terraform | Workspace, network, storage, cluster policy automation
Databricks REST API / SDK | Unity Catalog, RBAC, grants
AWS S3 | Managed storage for Unity Catalog
AWS IAM | Secure access to data storage
SCIM Groups | User → group → permission mapping

Step 0 – Prerequisites (One-Time Setup)

AWS Side

  • Create dedicated S3 buckets per environment
  • Create IAM roles with least privilege access
  • Enable VPC endpoints for S3 (no public internet)

Databricks Account

  • Databricks Enterprise (Premium) account
  • Account-level admin access
  • Unity Catalog enabled

Step 1 – Automated Workspace Creation (Terraform)

Provider Configuration


provider "databricks" {
  host  = var.databricks_account_host
  token = var.databricks_account_token
}

Credentials & Storage Configuration


resource "databricks_mws_credentials" "dev_credentials" {
  account_id       = var.account_id
  credentials_name = "dev-credentials"
  role_arn         = var.dev_iam_role
}

resource "databricks_mws_storage_configurations" "dev_storage" {
  account_id                 = var.account_id
  storage_configuration_name = "dev-storage"
  bucket_name                = "dbx-dev-bucket"
}

Workspace Creation


resource "databricks_mws_workspaces" "dev" {
  account_id  = var.account_id
  workspace_name = "dbx-dev"
  region      = "us-east-1"
  storage_configuration_id =
    databricks_mws_storage_configs.dev_storage.id
  sku = "premium"
}
Each workspace is fully isolated at the network, storage, and compute layer.

Step 2 – Unity Catalog Metastore Automation

Create Metastore


resource "databricks_metastore" "main" {
  name          = "enterprise-metastore"
  storage_root  = "s3://databricks-uc-root/"
  region        = "us-east-1"
}

Attach Metastore to Workspaces


resource "databricks_metastore_assignment" "dev" {
  workspace_id = databricks_mws_workspaces.dev.workspace_id
  metastore_id = databricks_metastore.main.id
}

Step 3 – Catalog, Schema & Table Creation (Python)


from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

w.catalogs.create(
    name="prod_catalog",
    comment="Production data"
)

w.schemas.create(
    name="sales",
    catalog_name="prod_catalog"
)

Table Creation


spark.sql("""
CREATE TABLE prod_catalog.sales.customers (
  id STRING,
  name STRING,
  country STRING
) USING DELTA
""")

Step 4 – RBAC (User A vs User B Example)

Groups

  • group_prod_engineers
  • group_dev_engineers

Grant Permissions


w.grants.update(
  securable_type="table",
  securable_name="prod_catalog.sales.customers",
  changes=[
    {"principal": "group_prod_engineers", "privileges": ["SELECT", "MODIFY"]},
    {"principal": "group_dev_engineers", "privileges": ["SELECT"]}
  ]
)

Result

  • User A (Prod group): Read + Write
  • User B (Dev group): Read-only
RBAC is enforced at query time, not at notebook level.

Step 5 – Cluster Isolation with Policies


resource "databricks_cluster_policy" "prod_policy" {
  name = "prod-policy"
  definition = jsonencode({
    node_type_id = {
      type  = "fixed"
      value = "i3.2xlarge"
    }
    autotermination_minutes = {
      type  = "fixed"
      value = 60
    }
  })
}

Attach this policy only to group_prod_engineers.


Step 6 – CI/CD Automation Flow

Git Commit
  ↓
Terraform Apply
  ↓
Workspace + Storage + Policies
  ↓
Python SDK
  ↓
Catalogs + Schemas + RBAC

What This Enables Next

  • Safe cross-workspace data sharing
  • Read-only Prod access from Dev
  • Strong audit and compliance posture
  • Zero-touch onboarding for new teams

Enterprise Outcome

This setup gives you:

  • Environment isolation at every layer
  • Identity-driven access control
  • Full automation and repeatability
  • Security that auditors trust

Next Blog

Step 3 – Advanced Unity Catalog Patterns:
External Locations, Row-Level Security, Dynamic Views, and Cross-Account Sharing.

Step 1 – Workspace Strategy & Environment Isolation in Databricks

After completing Step 0: Identity Setup, the next critical task in enterprise onboarding is designing a robust workspace strategy and ensuring environmental isolation. Workspaces in Databricks are execution boundaries that control compute, job execution, clusters, secrets, and repos. Proper strategy ensures safe deployment, governance, and compliance.


Why Workspace Strategy Matters

Poor workspace design can lead to:

  • Accidental production data access
  • Shared clusters across teams
  • Inconsistent governance and auditing

This step defines:

  • How many workspaces are needed
  • How environments (Dev/QA/Prod) are isolated
  • How identities and access flow across workspaces
---

Workspace vs Environment Responsibilities

Responsibility | Handled By
User login | Workspace
Cluster configuration | Workspace
Job execution | Workspace
Data access | Unity Catalog
Table-level security | Unity Catalog
---

Step 1.1 – Decide Workspace Topology

Databricks Account
├── Dev Workspace
├── QA Workspace
└── Prod Workspace

Start simple — one workspace per environment. Avoid creating per-team or per-user workspaces initially. Workspace isolation ensures safe Dev → QA → Prod promotion.

---

Step 1.2 – Create Workspaces

Steps:

  1. Log in to Databricks Account Console
  2. Create workspace with required region, VNET, private endpoints, and storage account
  3. Repeat for Dev, QA, Prod

Example naming convention:

Dev  → dbx-dev-us-east
QA   → dbx-qa-us-east
Prod → dbx-prod-us-east
---

Step 1.3 – Define Environment-Specific Azure AD Groups

Combine role + environment in group names:

dbx-dev-admins
dbx-dev-engineers
dbx-dev-analysts

dbx-prod-admins
dbx-prod-engineers
dbx-prod-users

This enables:

  • Same person has Dev access but limited/no Prod access
  • Clear audit trail and separation of duties
---

Step 1.4 – Assign Groups to Workspaces

Workspace access is granted via the Databricks Account Console:

Dev Workspace Example

Group | Permission
dbx-dev-admins | Workspace Admin
dbx-dev-engineers | Workspace User
dbx-prod-engineers | ❌ No Access

Prod Workspace Example

Group | Permission
dbx-prod-admins | Workspace Admin
dbx-prod-engineers | Workspace User
dbx-dev-engineers | ❌ No Access
---

Step 1.5 – Authentication & Access Flow

User logs in
   |
   v
Azure AD SSO
   |
   v
Databricks Account checks:
    - Is user in a group assigned to this workspace?
        |
        +-- YES → Access granted
        +-- NO  → Workspace invisible
---

Step 1.6 – Workspace Admin vs Account Admin

Role | Scope
Account Admin | All workspaces, identity, global settings (2–3 people max)
Workspace Admin | Single workspace (clusters, jobs, repos)
---

Step 1.7 – Cluster & Job Isolation

Cluster policies per workspace:

  • Dev: small nodes, auto-termination, permissive libraries
  • Prod: fixed nodes, restricted libraries, no interactive clusters

Jobs are workspace-bound:

Git → Dev Workspace Job
        ↓
     QA Workspace Job
        ↓
     Prod Workspace Job

Secrets are workspace-scoped to ensure Dev/Prod isolation:

dev-kv/snowflake-password
prod-kv/snowflake-password
---

Step 1.8 – What Workspaces Do NOT Control

  • Table-level access
  • Row-level security
  • Column masking

These are handled by Unity Catalog in Step 2.

---

Step 1.9 – Common Mistakes to Avoid

  • Giving Dev engineers access to Prod workspace
  • Making everyone Workspace Admin
  • Using one workspace + folders for envs
  • Relying on notebook naming for isolation
---

Step 1.10 – Validation Checklist

  • Dev user logs in → sees Dev workspace only
  • Dev user tries Prod URL → access denied
  • Prod user logs in → sees Prod workspace only
  • Removing user from Azure AD group → access disappears automatically
  • No manual Databricks changes required
---

What Step 1 Enables Next

Because workspaces are properly isolated:
  • Unity Catalog can safely share data across workspaces
  • Prod data can be read-only from Dev
  • Cluster RBAC becomes enforceable
  • Auditors can validate separation of duties and compliance
---

Next Step: Step 2 – Unity Catalog Metastore & Data Isolation (Catalogs, schemas, table-level RBAC, cross-workspace sharing)

Enterprise Databricks Onboarding – Identity Setup (Azure AD)

Enterprise Databricks Onboarding – Step 0: Identity Setup (Azure AD → Databricks)

Identity is the foundation of every enterprise Databricks deployment. Before you talk about clusters, Unity Catalog, or RBAC, you must first answer one question:

Who is the user, and how is their access controlled?

In this step, we integrate Azure Active Directory (Azure AD) with Databricks Enterprise using SSO and SCIM provisioning.

---

Objective of Step 0

  • Azure AD becomes the single source of truth
  • No manual users or groups in Databricks
  • All access is group-based and auditable
  • Identity lifecycle is fully automated
---

High-Level Identity Architecture

+--------------------+
|     Azure AD       |
|--------------------|
| Users              |
| Groups             |
| MFA / CA Policies  |
+---------+----------+
          |
          | 1) SSO (SAML)
          |
          v
+--------------------+
| Databricks Account |
|--------------------|
| Authentication     |
| (Login)            |
+---------+----------+
          |
          | 2) SCIM Provisioning
          |
          v
+-----------------------------+
| Databricks Identity Store   |
|-----------------------------|
| Users (Read-only)           |
| Groups (SCIM-managed)       |
| Memberships                 |
+-----------------------------+
Key Principle:
Azure AD authenticates users (SSO). SCIM provisions users and groups. Databricks never owns identity.
---

First-Time Setup (Greenfield Environment)

Step 1: Define Identity Model

Databricks must consume identities — not create them.

  • ❌ No local Databricks users
  • ❌ No Databricks-only groups
  • ✅ Azure AD is authoritative
---

Step 2: Create Azure AD Groups (RBAC-Oriented)

Create role-based groups, not user-specific ones.

dbx-admins
dbx-platform
dbx-data-engineers
dbx-data-analysts
dbx-ml-engineers
dbx-prod-users
Never assign permissions directly to users later. All permissions must flow from groups.
---

Step 3: Create Azure Databricks Enterprise Application

  1. Azure Portal → Azure Active Directory
  2. Enterprise Applications → New Application
  3. Search for Azure Databricks
  4. Create the application

This application handles both SSO and SCIM provisioning.

---

Step 4: Configure SSO (Authentication)

SSO answers the question: Who are you?

SAML Configuration

Entity ID (Identifier):
https://accounts.azuredatabricks.net

Reply URL (ACS):
https://accounts.azuredatabricks.net/login/saml

User attributes:

email  → user.mail
name   → user.userprincipalname
---

SSO Login Flow

User Browser
     |
     v
Azure AD Login (MFA, CA)
     |
     v
SAML Assertion
     |
     v
Databricks Account Console

After this, users authenticate using corporate credentials only.

---

Step 5: Configure SCIM Provisioning

SCIM answers the question: What access does the user have?

Generate SCIM Token

  1. Databricks Account Console
  2. User Management → Generate SCIM token

Azure AD Provisioning Settings

Tenant URL:
https://accounts.azuredatabricks.net/api/2.0/accounts/<ACCOUNT_ID>/scim/v2

Authentication:
Bearer Token (SCIM Token)
---

SCIM Provisioning Flow

Azure AD
  |
  | Users + Groups + Memberships
  |
  v
SCIM API
  |
  v
Databricks Account
  |
  v
Workspaces / Unity Catalog / Clusters
---

Step 6: Assign Groups to the Application

Only assigned groups are synced.

Assigned Groups:
- dbx-admins
- dbx-data-engineers
- dbx-data-analysts
---

Day-2 Operations (After Go-Live)

Adding a New User

1. Create user in Azure AD
2. Add to dbx-data-engineers
3. SCIM sync runs
4. User appears in Databricks automatically

Removing a User

1. Disable user in Azure AD
2. SCIM removes user from Databricks
3. Access revoked everywhere

Changing User Role

Remove: dbx-data-analysts
Add:    dbx-data-engineers

All permissions update automatically without Databricks admin intervention.

---

Security & Compliance Benefits

  • Centralized identity management
  • Audit-friendly access controls
  • MFA and Conditional Access enforced
  • Zero-trust compatible
  • SOC2 / ISO aligned
---

Final Outcome of Step 0

Authentication → Azure AD
Authorization  → Groups
Provisioning   → SCIM
Databricks     → Identity Consumer

This identity foundation enables:

  • Unity Catalog RBAC
  • Cluster isolation
  • Workspace governance
  • Secure production onboarding
---

Next Blog: Step 1 – Workspace Strategy & Environment Isolation

Databricks RBAC Explained with Real Examples (User A vs User B)

Databricks RBAC Explained with Real Example (User A vs User B)

Role-Based Access Control (RBAC) in Databricks is one of the most important concepts for enterprise security. In this blog, we will explain RBAC step-by-step using a real-world example where User A can access a table and cluster but User B cannot.


What is RBAC in Databricks?

RBAC (Role-Based Access Control) means:

  • Users do NOT get permissions directly
  • Users are added to groups (roles)
  • Permissions are granted to groups
  • Users inherit permissions via group membership

Databricks enforces RBAC at multiple layers:

Identity → Workspace → Compute → Data (Unity Catalog)

Scenario Setup

Users

User | Email
User A | alice@company.com
User B | bob@company.com

Groups

Group | Description
dbx-finance-team | Finance users
dbx-ml-team | ML users

Resources

Type | Name
Workspace | finance-prod
Cluster | finance-cluster
Catalog | finance
Schema | finance.gold
Table | finance.gold.transactions

Step 0: Identity Setup (Azure AD → Databricks)

Azure Active Directory is the source of truth.

alice → member of dbx-finance-team
bob   → member of dbx-ml-team

Using SCIM provisioning, users and groups are automatically created in Databricks.

Important: At this point, no permissions are granted yet.

Step 1: Workspace Access (First Gate)

Users must be assigned to a workspace to even enter it.

Workspace: finance-prod

Group | Role
dbx-finance-team | User
dbx-platform-admins | Admin

dbx-ml-team is NOT assigned

Result

User | Workspace Access
Alice | ✅ Allowed
Bob | ❌ Blocked

Step 2: Cluster Access (Second Gate)

Cluster: finance-cluster

Group | Permission
dbx-finance-team | CAN_ATTACH_TO
dbx-platform-admins | CAN_MANAGE

The default users group is removed.

Result

User | Cluster Visibility
Alice | ✅ Can see and use
Bob | ❌ Cannot see

Step 3: Data Access Using Unity Catalog

Unity Catalog enforces fine-grained RBAC for data.

Permissions Granted

GRANT USE CATALOG ON CATALOG finance TO `dbx-finance-team`;
GRANT USE SCHEMA ON SCHEMA finance.gold TO `dbx-finance-team`;
GRANT SELECT ON TABLE finance.gold.transactions
TO `dbx-finance-team`;

No permissions are granted to dbx-ml-team.


Step 4: Execution Trace (What Actually Happens)

User A (Alice)

SELECT * FROM finance.gold.transactions;

Permission evaluation:

Workspace access → YES
Cluster access → YES
USE CATALOG → YES
USE SCHEMA → YES
SELECT TABLE → YES

Result: ✅ Query succeeds


User B (Bob)

SELECT * FROM finance.gold.transactions;

Permission evaluation:

Workspace access → NO ❌

Result: ❌ Query fails immediately


Error Messages Bob Might See

PERMISSION_DENIED: User does not have USE CATALOG privilege
or
Cluster not found
or
User is not authorized to access workspace

Key Mental Model

USER
 ↓
GROUP
 ↓
WORKSPACE
 ↓
CLUSTER
 ↓
UNITY CATALOG
 ↓
TABLE

Access is denied at the first missing permission.


Why This RBAC Model Is Secure

  • Strong isolation between teams
  • No accidental data exposure
  • Centralized governance
  • Easy onboarding and offboarding
  • Fully auditable

Wednesday, 7 January 2026

Unauthorized Pub/Sub Publish Detection

Python: Unauthorized Pub/Sub Publish Detection


import json

def get_principal(proto: dict) -> str:
    """
    Extracts the principal from protoPayload safely.
    Checks for principalEmail first, then principalSubject.
    Returns 'UNKNOWN' if not found.
    """
    auth = proto.get("authenticationInfo", {})
    return (
        auth.get("principalEmail") or
        auth.get("principalSubject") or
        "UNKNOWN"
    ).lower()


def is_unauthorized_publish(event: dict) -> bool:
    """
    Returns True if the event matches:
    - pubsub publish to DLT topic
    - principal NOT Dataflow service accounts
    - request successful
    """
    try:
        proto = event.get("protoPayload", {})

        method = proto.get("methodName", "")
        resource = proto.get("resourceName", "")
        principal = get_principal(proto)
        status_code = proto.get("status", {}).get("code")

        # 1️⃣ Check method
        if method != "google.pubsub.v1.Publisher.Publish":
            return False

        # 2️⃣ Check topic pattern: pst-<env>-siem-logs-security-logs-dlt
        if not resource.endswith("siem-logs-security-logs-dlt"):
            return False

        if not resource.split("/")[-1].startswith("pst-"):
            return False

        # 3️⃣ Check request was successful (Cloud Audit Logs omit the status field on success)
        if status_code not in (None, 0):
            return False

        # 4️⃣ Exclude Dataflow service accounts (any principal containing "dataflow")
        if "dataflow" in principal:
            return False

        return True

    except Exception:
        return False


# ------------------ TEST PAYLOADS ------------------

# ✅ Should return True
test_payload_1 = {
    "protoPayload": {
        "methodName": "google.pubsub.v1.Publisher.Publish",
        "resourceName": "projects/my-project/topics/pst-prod-siem-logs-security-logs-dlt",
        "authenticationInfo": {
            "principalEmail": "user@example.com"
        },
        "status": {
            "code": 0
        }
    }
}

# ❌ Should return False (Dataflow SA)
test_payload_2 = {
    "protoPayload": {
        "methodName": "google.pubsub.v1.Publisher.Publish",
        "resourceName": "projects/my-project/topics/pst-prod-siem-logs-security-logs-dlt",
        "authenticationInfo": {
            "principalEmail": "service-1234567890@dataflow-service-producer-prod.iam.gserviceaccount.com"
        },
        "status": {
            "code": 0
        }
    }
}

# ❌ Should return False (Wrong method)
test_payload_3 = {
    "protoPayload": {
        "methodName": "google.pubsub.v1.Publisher.CreateTopic",
        "resourceName": "projects/my-project/topics/pst-prod-siem-logs-security-logs-dlt",
        "authenticationInfo": {
            "principalEmail": "user@example.com"
        },
        "status": {
            "code": 0
        }
    }
}

# ------------------ TEST EXECUTION ------------------
for i, payload in enumerate([test_payload_1, test_payload_2, test_payload_3], 1):
    result = is_unauthorized_publish(payload)
    print(f"Test Payload {i}: {result}")