Enterprise Databricks on AWS – Terraform-First Architecture
This article explains how to build a fully automated, enterprise-grade Databricks platform on AWS using Terraform only, covering:
- SCIM & Identity automation
- Workspace creation and isolation
- Unity Catalog metastore & data isolation
- Catalog, schema, table-level RBAC
- Row-level security using dynamic views
- Cross-account AWS data sharing
High-Level Enterprise Architecture
AWS Account (Databricks Account)
│
├── Account Console
│   ├── SCIM Users & Groups (Terraform)
│   ├── Unity Catalog Metastore (Terraform)
│   └── Workspaces (Dev / QA / Prod)
│
├── AWS Account A (Prod Data)
│   ├── S3 UC Managed Location
│   └── IAM Role (External Location)
│
├── AWS Account B (Analytics)
│   └── Read-only access via UC Sharing
│
└── Azure AD / Okta
    └── Identity Source (SSO + SCIM)
Design Principle: Identity, access, and data governance are controlled centrally at the Databricks Account level.
1. SCIM Group Automation with Terraform
Why SCIM Matters
SCIM ensures that Databricks users and groups are never created manually. Azure AD (or Okta) remains the source of truth.
Terraform – Databricks Account Provider
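# Account-level provider: manages identities, workspaces, and the metastore.
# Credentials are not set here; supply them outside the configuration,
# for example through the provider's environment variables.
provider "databricks" {
  alias      = "account"
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
}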
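The provider block relies on a provider source and a var.databricks_account_id input that the article does not show. A minimal sketch of those supporting declarations might look like this; the description text and file layout are assumptions.

```hcl
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

# Input referenced by the account-level provider.
variable "databricks_account_id" {
  type        = string
  description = "Databricks account ID, as shown in the account console"
}
```

Keeping authentication out of the configuration means no secret ends up in version control. The provider can pick up service principal credentials from environment variables such as DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET.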
Create Groups (Mirrors Azure AD)
# Account-level groups; display names match the Azure AD group names.
resource "databricks_group" "data_engineers" {
  provider     = databricks.account
  display_name = "data-engineers"
}

resource "databricks_group" "data_scientists" {
  provider     = databricks.account
  display_name = "data-scientists"
}
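When the list of groups grows, declaring one resource per group becomes repetitive. One possible alternative is to drive group creation from a single list with for_each. The variable name and the group names below are hypothetical and are not part of the original configuration.

```hcl
# Hypothetical input: group names exported from the identity provider.
variable "idp_group_names" {
  type    = set(string)
  default = ["data-analysts", "platform-admins"]
}

# One account-level group per name in the set.
resource "databricks_group" "from_idp" {
  for_each     = var.idp_group_names
  provider     = databricks.account
  display_name = each.value
}
```

A group created this way is referenced by key, for example databricks_group.from_idp["data-analysts"].id.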
Assign Users (SCIM)
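# Account-level user, identified by the same email address used in Azure AD.
resource "databricks_user" "alice" {
  provider  = databricks.account
  user_name = "alice@company.com"
}

# Membership links the user to the group by their Databricks IDs.
resource "databricks_group_member" "alice_engineers" {
  provider  = databricks.account
  group_id  = databricks_group.data_engineers.id
  member_id = databricks_user.alice.id
}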
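Result: users, groups, and memberships are declared in code and applied through the account-level SCIM API. For Azure AD to remain the source of truth, these definitions must track the directory, either by generating them from it or by letting the identity provider's SCIM connector provision identities directly.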
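If the identity provider's SCIM connector already provisions groups into the account, Terraform does not need to create them. It can look them up instead. The sketch below assumes a group named data-analysts has been synced; that name is hypothetical.

```hcl
# Look up a group that the identity provider has already provisioned.
data "databricks_group" "analysts" {
  provider     = databricks.account
  display_name = "data-analysts" # hypothetical group name
}

# The resolved ID can be passed to workspace assignments or memberships.
output "analysts_group_id" {
  value = data.databricks_group.analysts.id
}
```

With this split, the directory owns identity lifecycle and Terraform owns only what those identities are allowed to do.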
2. Workspace Creation & Environment Isolation
Enterprise Workspace Strategy
- One workspace per environment
- Dev cannot modify Prod
- Shared metastore across workspaces
Create Workspace (AWS)
# Production workspace, created through the account-level provider.
resource "databricks_mws_workspaces" "prod" {
  provider                 = databricks.account
  account_id               = var.databricks_account_id
  workspace_name           = "prod-workspace"
  aws_region               = "us-east-1"
  credentials_id           = databricks_mws_credentials.this.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
}
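The workspace refers to databricks_mws_credentials.this and databricks_mws_storage_configurations.this, which the article does not define. A minimal sketch of those two resources could look like the following. The variable names, the resource display names, and the assumption that the IAM role and S3 bucket already exist are all illustrative.

```hcl
# Hypothetical inputs: a cross-account IAM role and a root bucket created elsewhere.
variable "cross_account_role_arn" {
  type = string
}

variable "root_bucket_name" {
  type = string
}

# Registers the IAM role Databricks assumes to launch compute.
resource "databricks_mws_credentials" "this" {
  provider         = databricks.account
  credentials_name = "prod-credentials"
  role_arn         = var.cross_account_role_arn
}

# Registers the S3 bucket used as the workspace root storage.
resource "databricks_mws_storage_configurations" "this" {
  provider                   = databricks.account
  account_id                 = var.databricks_account_id
  storage_configuration_name = "prod-root-storage"
  bucket_name                = var.root_bucket_name
}
```

Because no network configuration is supplied, a workspace built from these pieces would run in a Databricks-managed VPC.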
Attach Groups to Workspace
# Workspace-level role for an account group: "ADMIN" or "USER".
resource "databricks_mws_permission_assignment" "prod_admins" {
  provider     = databricks.account
  workspace_id = databricks_mws_workspaces.prod.workspace_id
  principal_id = databricks_group.data_engineers.id
  permissions  = ["ADMIN"]
}
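Most groups should join a workspace as regular users rather than admins. A hedged sketch for the data-scientists group follows. The explicit depends_on is an assumption that the assignment should wait for the metastore attachment defined in the next section, since account-level groups can only be assigned to workspaces that have a Unity Catalog metastore attached.

```hcl
# Give the data-scientists group regular (non-admin) access to prod.
resource "databricks_mws_permission_assignment" "prod_scientists" {
  provider     = databricks.account
  workspace_id = databricks_mws_workspaces.prod.workspace_id
  principal_id = databricks_group.data_scientists.id
  permissions  = ["USER"]

  # Wait until the workspace is attached to the metastore.
  depends_on = [databricks_metastore_assignment.prod]
}
```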
3. Unity Catalog Metastore – Terraform-Only
Create Metastore
# One metastore per region, shared by every workspace in that region.
resource "databricks_metastore" "main" {
  provider     = databricks.account
  name         = "enterprise-metastore"
  region       = "us-east-1"
  storage_root = "s3://uc-metastore-root/"
}
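A metastore with a storage_root also needs a credential that lets Unity Catalog read and write that bucket. One way to express this is a default data access configuration backed by an IAM role. The variable and the configuration name are hypothetical, and the role is assumed to exist with a trust policy for Unity Catalog.

```hcl
# Hypothetical input: IAM role that grants access to the metastore root bucket.
variable "uc_metastore_role_arn" {
  type = string
}

# Default credential Unity Catalog uses for the metastore root storage.
resource "databricks_metastore_data_access" "main" {
  provider     = databricks.account
  metastore_id = databricks_metastore.main.id
  name         = "uc-metastore-access"
  is_default   = true

  aws_iam_role {
    role_arn = var.uc_metastore_role_arn
  }
}
```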
Attach Metastore to Workspace
resource "databricks_metastore_assignment" "prod" {
  provider     = databricks.account
  workspace_id = databricks_mws_workspaces.prod.workspace_id
  metastore_id = databricks_metastore.main.id
}
4. Unity Catalog RBAC as Code (grants.tf)
Create Catalogs per Domain
# Catalogs are Unity Catalog objects, created through a workspace-level provider.
resource "databricks_catalog" "finance" {
  name = "finance"
}
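Catalogs, schemas, and grants are applied through a workspace, not through the account console endpoint, so this section needs a second provider configuration. A minimal sketch follows, assuming the prod workspace defined earlier is used for governance changes. The alias prod and the sales catalog are hypothetical.

```hcl
# Workspace-level provider, pointed at the prod workspace URL.
provider "databricks" {
  alias = "prod"
  host  = databricks_mws_workspaces.prod.workspace_url
}

# Hypothetical second domain catalog using the workspace-level provider.
resource "databricks_catalog" "sales" {
  provider = databricks.prod
  name     = "sales"
  comment  = "Sales domain data"

  # The workspace must be attached to the metastore first.
  depends_on = [databricks_metastore_assignment.prod]
}
```

Resources that omit the provider argument use the default, unaliased databricks provider, which would then need to be configured for a workspace as well.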
Create Schemas
resource "databricks_schema" "payments" {
  name         = "payments"
  catalog_name = databricks_catalog.finance.name
}
Grant Permissions
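resource "databricks_grants" "finance_read" {
  catalog = databricks_catalog.finance.name
  grant {
    principal = "data-scientists"
    # USE_CATALOG alone does not permit reads; USE_SCHEMA and SELECT
    # granted on the catalog are inherited by its schemas and tables.
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }
}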
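Grants can also be scoped to a single schema when a group needs more than read access in one place. The sketch below gives the data-engineers group write access to the payments schema only. The choice of privileges is an assumption about what engineers need.

```hcl
# Schema-scoped write access for engineers.
resource "databricks_grants" "payments_schema" {
  schema = databricks_schema.payments.id # "<catalog>.<schema>"

  grant {
    principal  = "data-engineers"
    privileges = ["USE_SCHEMA", "SELECT", "MODIFY", "CREATE_TABLE"]
  }
}
```

databricks_grants is authoritative for the object it targets, so every principal that needs a privilege on the finance catalog itself, including USE_CATALOG for data-engineers, belongs as another grant block inside the single catalog-level resource rather than in a second resource for the same catalog.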
All permissions are version-controlled and auditable.
5. Row-Level Security (Dynamic Views)
Use Case
- US team sees US data
- EU team sees EU data
Dynamic View
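CREATE OR REPLACE VIEW finance.payments.secure_payments AS
SELECT * FROM finance.payments.raw
WHERE (region = 'US' AND is_account_group_member('us-team'))
   OR (region = 'EU' AND is_account_group_member('eu-team'));
The group names us-team and eu-team and the region values 'US' and 'EU' are placeholders. is_account_group_member() checks the account-level group membership of the user running the query, so each team sees only its own region.
No data duplication. No application-side filtering.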
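To keep the view under the same Terraform workflow as the rest of the platform, it could be declared with databricks_sql_table. This is a sketch under several assumptions: the variable sql_warehouse_id, the group names us-team and eu-team, and the region values are hypothetical, and the filter uses group membership to decide which rows a caller may see.

```hcl
# Hypothetical input: SQL warehouse used to run the CREATE VIEW statement.
variable "sql_warehouse_id" {
  type = string
}

# Dynamic view that filters rows by the caller's account group membership.
resource "databricks_sql_table" "secure_payments" {
  catalog_name = "finance"
  schema_name  = "payments"
  name         = "secure_payments"
  table_type   = "VIEW"
  warehouse_id = var.sql_warehouse_id

  view_definition = <<-SQL
    SELECT * FROM finance.payments.raw
    WHERE (region = 'US' AND is_account_group_member('us-team'))
       OR (region = 'EU' AND is_account_group_member('eu-team'))
  SQL
}
```

Consumers then receive SELECT on the view only, not on finance.payments.raw, so the filter cannot be bypassed by querying the base table.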
6. Cross-Account AWS Sharing with Unity Catalog
Producer Account (Prod)
CREATE SHARE finance_share;
ALTER SHARE finance_share ADD TABLE finance.payments.raw;
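The same share can be managed in Terraform, together with the recipient and the grant that the SQL above leaves out. The following is a sketch for Databricks-to-Databricks sharing. The recipient name and the variable holding the consumer metastore's sharing identifier are hypothetical.

```hcl
# Hypothetical input: sharing identifier of the consumer metastore,
# in the form <cloud>:<region>:<metastore-uuid>.
variable "consumer_metastore_global_id" {
  type = string
}

# Share that exposes the raw payments table.
resource "databricks_share" "finance" {
  name = "finance_share"

  object {
    name             = "finance.payments.raw"
    data_object_type = "TABLE"
  }
}

# Recipient that represents the consumer metastore.
resource "databricks_recipient" "analytics" {
  name                               = "analytics-account"
  authentication_type                = "DATABRICKS"
  data_recipient_global_metastore_id = var.consumer_metastore_global_id
}

# Allow the recipient to read everything in the share.
resource "databricks_grants" "finance_share" {
  share = databricks_share.finance.name

  grant {
    principal  = databricks_recipient.analytics.name
    privileges = ["SELECT"]
  }
}
```

Without the recipient and the SELECT grant, the share exists but no consumer can see it.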
Consumer Account
-- prod_provider is a placeholder for the provider name visible in the consumer metastore.
CREATE CATALOG finance_shared USING SHARE prod_provider.finance_share;
S3 access is mediated by UC – not IAM users.
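On the consumer side, the shared catalog can likewise be declared in Terraform by mounting the share as a catalog. In this sketch, share_provider_name is a hypothetical variable holding the provider name as it appears in the consumer metastore, and the resource is assumed to be applied through a provider configured for a consumer workspace.

```hcl
# Hypothetical input: provider name as listed in the consumer metastore.
variable "share_provider_name" {
  type = string
}

# Read-only catalog backed by the producer's share.
resource "databricks_catalog" "finance_shared" {
  name          = "finance_shared"
  provider_name = var.share_provider_name
  share_name    = "finance_share"
}
```

Access inside the consumer account is then governed by ordinary Unity Catalog grants on finance_shared.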
7. Decision Diagrams for Architects
Identity Decision
Azure AD
├── Manual Users ❌
└── SCIM + SSO ✅
Data Access Decision
IAM Policies ❌
Unity Catalog Grants ✅
Security Model
Workspace ACLs → Compute
Unity Catalog → Data
What This Enables Next
- Prod data read-only from Dev
- Cluster RBAC enforced
- Auditor-friendly access logs
- Multi-account AWS sharing
This pattern fits regulated enterprises that need centrally managed, auditable access control.
Suggested Multi-Post Series
- Identity & SCIM Automation
- Workspace Isolation Strategy
- Unity Catalog Deep Dive
- RBAC & Data Security Patterns
- Cross-Account Data Sharing