Enterprise Databricks on AWS: Zero-Trust, Unity Catalog & Audit-Ready Architecture
This document explains how to design and implement Databricks on AWS using Zero-Trust principles, Unity Catalog-enforced security, cross-account data sharing, and an audit-ready architecture.
1. Zero-Trust Databricks Deployment (AWS)
What Zero-Trust Means for Databricks
- No public IPs
- No inbound internet access
- Explicit identity-based access only
- All access is authenticated, authorized, and logged
Core AWS Components
- Dedicated VPC per Databricks workspace
- Private subnets only
- VPC Endpoints (PrivateLink)
- IAM roles with least privilege
- Security Groups with deny-by-default
VPC Design
```
VPC (10.0.0.0/16)
├── Private Subnet A (10.0.1.0/24) - Databricks Compute
├── Private Subnet B (10.0.2.0/24) - Databricks Compute
├── VPC Endpoint Subnet
└── No Internet Gateway
```
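As a minimal sketch, the skeleton above can be provisioned with boto3; the region and availability zones are example values, and the CIDRs match the layout shown:

```python
import boto3

# Region, AZs, and CIDRs are example values matching the layout above.
ec2 = boto3.client("ec2", region_name="us-east-1")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

for az, cidr in [("us-east-1a", "10.0.1.0/24"),
                 ("us-east-1b", "10.0.2.0/24")]:
    ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone=az)

# Deliberately no create_internet_gateway / attach_internet_gateway call:
# nothing inside the VPC can reach, or be reached from, the public internet.
```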
Required VPC Endpoints
- com.amazonaws.<region>.s3
- com.amazonaws.<region>.sts
- com.amazonaws.<region>.logs
- com.amazonaws.<region>.monitoring
- Databricks Control Plane PrivateLink endpoints
Why: Databricks clusters must communicate with AWS services without touching the public internet.
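A boto3 sketch for these endpoints follows; the VPC, route table, subnet, and security group IDs are hypothetical placeholders:

```python
import boto3

region = "us-east-1"                       # example region
ec2 = boto3.client("ec2", region_name=region)

VPC_ID = "vpc-0123456789abcdef0"           # hypothetical IDs throughout
ROUTE_TABLES = ["rtb-0123456789abcdef0"]
ENDPOINT_SUBNETS = ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
ENDPOINT_SG = ["sg-0123456789abcdef0"]

# S3 is a Gateway endpoint attached to the private route tables.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=VPC_ID,
    ServiceName=f"com.amazonaws.{region}.s3",
    RouteTableIds=ROUTE_TABLES,
)

# STS, CloudWatch Logs, and CloudWatch metrics are Interface endpoints.
for svc in ("sts", "logs", "monitoring"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.{region}.{svc}",
        SubnetIds=ENDPOINT_SUBNETS,
        SecurityGroupIds=ENDPOINT_SG,
        PrivateDnsEnabled=True,
    )
```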
Security Groups
- No inbound rules
- Outbound only to:
- VPC endpoints
- Databricks control plane CIDRs
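A deny-by-default security group in the same style (IDs are hypothetical; note that in practice Databricks also requires intra-cluster traffic to be allowed within the group itself):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical IDs; ENDPOINT_SG is the group attached to the VPC endpoints.
VPC_ID = "vpc-0123456789abcdef0"
ENDPOINT_SG = "sg-0123456789abcdef1"

sg_id = ec2.create_security_group(
    GroupName="databricks-compute-sg",
    Description="Deny-by-default SG for Databricks compute",
    VpcId=VPC_ID,
)["GroupId"]

# Drop the default allow-all egress rule, then allow only HTTPS to the
# VPC endpoints. (A self-referencing rule for node-to-node cluster
# traffic would be added alongside this in a real deployment.)
ec2.revoke_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "-1",
                    "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
)
ec2.authorize_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": ENDPOINT_SG}],
    }],
)
# No authorize_security_group_ingress call: inbound stays empty.
```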
2. Unity Catalog Enforced Security
Why Unity Catalog Is Mandatory for Enterprises
- Centralized governance
- Fine-grained RBAC (catalog, schema, table, column, row)
- Cross-workspace data sharing
- Built-in auditing
Unity Catalog Core Objects
```
Metastore
├── Catalog (prod_sales)
│   ├── Schema (orders)
│   │   └── Table (transactions)
```
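Created from a notebook, the hierarchy above looks like this; `spark` is the session the Databricks runtime provides, and the column definitions are illustrative:

```python
# Run on a Unity Catalog-enabled cluster. Column names are examples only.
spark.sql("CREATE CATALOG IF NOT EXISTS prod_sales")
spark.sql("CREATE SCHEMA IF NOT EXISTS prod_sales.orders")
spark.sql("""
    CREATE TABLE IF NOT EXISTS prod_sales.orders.transactions (
        txn_id  BIGINT,
        region  STRING,
        amount  DECIMAL(18, 2),
        txn_ts  TIMESTAMP
    )
""")
```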
Metastore Setup (AWS)
- Create S3 bucket for UC storage
- Enable versioning & encryption (SSE-KMS)
- Attach IAM role to Databricks
S3 bucket policy requirements:
- Allow the Databricks IAM role
- Deny public access
- Enforce TLS
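A boto3 sketch of such a policy; the bucket name and role ARN are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "uc-metastore-prod"                                   # hypothetical
DATABRICKS_ROLE = "arn:aws:iam::111122223333:role/databricks-uc-access"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Allow only the Databricks UC role to read/write objects
            "Sid": "AllowDatabricksRole",
            "Effect": "Allow",
            "Principal": {"AWS": DATABRICKS_ROLE},
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}",
                         f"arn:aws:s3:::{BUCKET}/*"],
        },
        {   # Reject any request that is not sent over TLS
            "Sid": "EnforceTLS",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}",
                         f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))

# Block public access at the bucket level as well.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True, "IgnorePublicAcls": True,
        "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
    },
)
```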
RBAC Example
```
Group: analytics_team
Permissions:
  - USE CATALOG prod_sales
  - USE SCHEMA prod_sales.orders
  - SELECT ON TABLE prod_sales.orders.transactions
```
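The same permissions as runnable grants, executed from a notebook with the group and object names from the example above:

```python
# Read-only access for the analytics_team group, top to bottom.
spark.sql("GRANT USE CATALOG ON CATALOG prod_sales TO `analytics_team`")
spark.sql("GRANT USE SCHEMA ON SCHEMA prod_sales.orders TO `analytics_team`")
spark.sql(
    "GRANT SELECT ON TABLE prod_sales.orders.transactions TO `analytics_team`"
)
```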
Row-Level Security (Dynamic Views)
```sql
-- Each caller sees only rows for regions whose account group they belong to
-- (e.g. members of `region_emea` see rows where region = 'emea').
CREATE VIEW prod_sales.orders.secure_transactions AS
SELECT *
FROM prod_sales.orders.transactions
WHERE is_account_group_member(CONCAT('region_', region));
```
3. Cross-Account Data Sharing (Unity Catalog)
Use Case
- Producer account owns raw data
- Consumer account reads curated data
- No data copy
Architecture
Account A (Producer)
└── Unity Catalog Metastore
└── Shared Catalog
Account B (Consumer)
└── Databricks Workspace
└── Read-only access
How Sharing Works
- Delta Sharing protocol
- IAM role trust between accounts
- Read-only permissions
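A producer-side sketch for Databricks-to-Databricks Delta Sharing; the share name, recipient name, and sharing identifier are hypothetical:

```python
# Producer (Account A): publish the curated table through Delta Sharing.
spark.sql("CREATE SHARE IF NOT EXISTS sales_share")
spark.sql("ALTER SHARE sales_share ADD TABLE prod_sales.orders.transactions")

# The recipient is identified by the consumer metastore's sharing ID,
# in the form aws:<region>:<metastore-uuid>; the value below is a placeholder.
spark.sql("""
    CREATE RECIPIENT IF NOT EXISTS account_b
    USING ID 'aws:us-east-1:11111111-2222-3333-4444-555555555555'
""")
spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT account_b")

# Consumer (Account B) then mounts the share as a read-only catalog:
# spark.sql("CREATE CATALOG shared_sales USING SHARE `provider_a`.sales_share")
```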
Security Guarantees
- No write access
- All queries logged
- Column and row filters enforced
4. Audit-Ready Architecture
Audit Requirements Covered
- Who accessed what data
- When queries were run
- From which workspace
- Using which identity
Audit Logs
- Databricks audit logs → S3
- CloudTrail for IAM & API calls
- S3 access logs
Audit Log Flow
```
Databricks → S3 (audit logs)
AWS CloudTrail → S3
S3 → SIEM / Athena / OpenSearch
```
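Inside Databricks, the same audit events can also be queried directly from the `system.access.audit` system table; a sketch, with an illustrative time window and filter:

```python
# Interactive audit review: who touched Unity Catalog in the last 7 days?
events = spark.sql("""
    SELECT event_time,
           user_identity.email AS principal,
           service_name,
           action_name,
           request_params
    FROM system.access.audit
    WHERE service_name = 'unityCatalog'
      AND event_time >= current_timestamp() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
""")
events.show(truncate=False)
```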
What Auditors Love
- No shared credentials
- Identity-based access
- Immutable logs
- Separation of duties
5. End-to-End Control Summary
| Layer | Control |
|---|---|
| Network | Private VPC, PrivateLink, no internet |
| Identity | IAM + Databricks SCIM groups |
| Compute | Cluster policies & group binding |
| Data | Unity Catalog RBAC + RLS |
| Audit | Centralized logs in S3 |
Final Outcome
- Zero-trust Databricks deployment
- Centralized governance via Unity Catalog
- Secure cross-account data sharing
- Fully audit-ready enterprise platform
This architecture scales cleanly across Dev / Test / Prod, supports regulated workloads, and aligns with financial-grade security standards.