Databricks Unity Catalog on AWS – Complete Architecture, Setup, Security and Best Practices
Unity Catalog is the centralized governance solution for Databricks that provides unified access control, auditing, lineage, and metadata management for all data assets.
It governs:
- Tables
- Files
- Machine learning models
- Dashboards
- Notebooks
Unity Catalog works across multiple workspaces and environments.
1. Unity Catalog Architecture
Unity Catalog uses a centralized governance model.
+----------------------------+
| Databricks Account |
| (Account Console Layer) |
+-------------+--------------+
|
|
+---------v-----------+
| Unity Catalog |
| Metastore |
+---------+-----------+
|
-------------------------------------------------
| | |
+-------v-------+ +-------v-------+ +-------v-------+
| Dev Workspace | | UAT Workspace | | Prod Workspace|
+---------------+ +---------------+ +---------------+
| | |
Access Access Access
| | |
S3 Storage S3 Storage S3 Storage
Key concept: A single Unity Catalog metastore can be shared across multiple workspaces.
2. Core Components of Unity Catalog
Metastore
The metastore is the top-level container that stores metadata for all data assets.
It includes:
- Catalog definitions
- Table schemas
- Permissions
- Audit logs
- Data lineage
Each region should have one metastore.
Example
Metastore: us-east-1-metastore
Region: us-east-1
Storage: s3://databricks-meta-storage
Catalog
A catalog is the first level of logical organization within a metastore.
Catalog -> Schema -> Tables
Example catalogs:
- finance
- marketing
- risk
- ml_models
Example
CREATE CATALOG finance;
Schema
Schemas group tables within a catalog.
CREATE SCHEMA finance.transactions;

Structure:

finance (catalog)
|
|--- transactions (schema)
     |
     |--- payments_table
     |--- refund_table
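Putting the three levels together, every table is addressed by a fully qualified three-part name; as a sketch (the `payments` table and its columns come from the examples below):

```sql
-- Query a table by its fully qualified catalog.schema.table name
SELECT id, amount
FROM finance.transactions.payments
WHERE created_date >= '2024-01-01';

-- Or set a default catalog and schema for the session
USE CATALOG finance;
USE SCHEMA transactions;
SELECT id, amount FROM payments;
```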
Tables
Unity Catalog supports:
- Managed tables
- External tables
Managed Table
Data is stored in Databricks-managed storage under the metastore's storage root.
CREATE TABLE finance.transactions.payments (
  id INT,
  amount DOUBLE,
  created_date DATE
) USING DELTA;
External Table
Data is stored in an external S3 location that you manage.
CREATE TABLE finance.transactions.external_payments
USING DELTA
LOCATION 's3://finance-data/payments/';
Volumes
Volumes allow secure, governed access to files in object storage. Like tables, a volume lives inside a schema:

CREATE VOLUME finance.transactions.raw_data;

Use cases:
- ML training datasets
- Raw ingestion files
- Image datasets
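As a sketch (the bucket prefix and volume name are illustrative), an external volume maps an S3 prefix into the catalog hierarchy, and its files become addressable under the `/Volumes` path:

```sql
-- External volume backed by an S3 prefix
-- (assumes an external location already covers this path)
CREATE EXTERNAL VOLUME finance.transactions.raw_files
LOCATION 's3://finance-data/raw/';

-- List files through the governed /Volumes path
LIST '/Volumes/finance/transactions/raw_files/';
```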
3. Unity Catalog Security Model
Unity Catalog uses role-based access control (RBAC).

Hierarchy:

Account
|
Metastore
|
Catalog
|
Schema
|
Table

Permissions granted at a higher level propagate downward.
Example Permission Model
| Role | Permissions |
|---|---|
| Data Engineer | CREATE TABLE, MODIFY |
| Data Analyst | SELECT |
| Data Scientist | SELECT, CREATE MODEL |
| Admin | ALL PRIVILEGES |
Grant Example
GRANT SELECT ON TABLE finance.transactions.payments TO `analyst-group`;

Grant catalog and schema usage (Unity Catalog uses USE CATALOG and USE SCHEMA in place of the legacy USAGE privilege):

GRANT USE CATALOG ON CATALOG finance TO `analyst-group`;
GRANT USE SCHEMA ON SCHEMA finance.transactions TO `analyst-group`;
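Grants can be inspected and revoked the same way; a quick sketch:

```sql
-- Review the current grants on the table
SHOW GRANTS ON TABLE finance.transactions.payments;

-- Remove access when it is no longer needed
REVOKE SELECT ON TABLE finance.transactions.payments FROM `analyst-group`;
```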
4. Identity Integration
Unity Catalog integrates with:
- AWS IAM
- Azure AD
- Okta
- SCIM
Example account-level groups:
- data_engineers
- data_analysts
- data_scientists
- platform_admins
5. Storage Credentials
Unity Catalog connects to S3 using storage credentials.

Example IAM Role
resource "aws_iam_role" "databricks_uc_role" {
  name = "databricks-uc-access-role"

  # Simplified trust policy: in production, trust the principal and
  # external ID that Databricks supplies when you create the storage
  # credential, not an account root.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        AWS = "arn:aws:iam::123456789012:root"
      }
      Action = "sts:AssumeRole"
    }]
  })
}
Create Storage Credential
CREATE STORAGE CREDENTIAL finance_credential
WITH IAM_ROLE 'arn:aws:iam::123456789012:role/databricks-uc-access-role';
Create External Location
CREATE EXTERNAL LOCATION finance_s3
URL 's3://finance-data'
WITH (STORAGE CREDENTIAL finance_credential);
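Access to the external location itself is also governed; a sketch using Unity Catalog's external-location privileges (group names are the examples from section 4):

```sql
-- Allow engineers to read and write files directly at this location
GRANT READ FILES ON EXTERNAL LOCATION finance_s3 TO `data_engineers`;
GRANT WRITE FILES ON EXTERNAL LOCATION finance_s3 TO `data_engineers`;

-- Creating external tables under the location needs its own privilege
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION finance_s3 TO `data_engineers`;
```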
6. Unity Catalog Setup Steps
Step 1 – Create S3 Bucket
aws s3 mb s3://databricks-metastore-storage --region us-east-1
Step 2 – Create IAM Role
This role allows Databricks to access S3.

Step 3 – Create Metastore
Using Terraform:
resource "databricks_metastore" "this" {
name = "company-metastore"
storage_root = "s3://databricks-metastore-storage"
region = "us-east-1"
}
Step 4 – Assign Metastore to Workspace
resource "databricks_metastore_assignment" "this" {
workspace_id = var.workspace_id
metastore_id = databricks_metastore.this.id
}
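After the assignment, you can sanity-check the attachment from any cluster or SQL warehouse in that workspace:

```sql
-- Confirm the workspace is attached to the expected metastore
SELECT current_metastore();

-- List the catalogs visible through Unity Catalog
SHOW CATALOGS;
```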
7. Terraform Example for Unity Catalog
Provider
provider "databricks" {
host = var.databricks_host
token = var.databricks_token
}
Create Catalog
resource "databricks_catalog" "finance" {
name = "finance"
comment = "Finance data catalog"
}
Create Schema
resource "databricks_schema" "transactions" {
name = "transactions"
catalog_name = databricks_catalog.finance.name
}
Create Table
resource "databricks_sql_table" "payments" {
  name               = "payments"
  catalog_name       = databricks_catalog.finance.name
  schema_name        = databricks_schema.transactions.name
  table_type         = "MANAGED"
  data_source_format = "DELTA"

  # Managed tables need at least one column definition
  column {
    name = "id"
    type = "int"
  }
}
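Once applied, the table's metadata can be inspected from SQL to confirm it was created as a managed Delta table:

```sql
-- Shows table type (MANAGED), provider (delta), location, and owner
DESCRIBE TABLE EXTENDED finance.transactions.payments;
```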
8. Data Lineage
Unity Catalog automatically tracks lineage.

Example:

Raw Table -> Bronze Table -> Silver Table -> Gold Table

Example pipeline:
bronze_orders
|
silver_orders
|
gold_revenue_dashboard
Benefits:
- Impact analysis
- Governance
- Compliance
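Lineage is captured automatically whenever one table is derived from another, for example through SQL transformations (table and column names here are illustrative):

```sql
-- Each statement records upstream/downstream lineage between the tables
CREATE TABLE finance.silver.orders AS
SELECT order_id, customer_id, CAST(amount AS DOUBLE) AS amount
FROM finance.bronze.orders
WHERE amount IS NOT NULL;

CREATE TABLE finance.gold.revenue AS
SELECT customer_id, SUM(amount) AS total_revenue
FROM finance.silver.orders
GROUP BY customer_id;
```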
9. Auditing
Unity Catalog provides audit logs for:
- Table access
- Permission changes
- Query execution
Logs are delivered to either:
- AWS CloudTrail
- a Databricks audit logs bucket
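If system tables are enabled on the account, audit events can also be queried directly with SQL; a sketch against the `system.access.audit` table (the `action_name` values shown are illustrative):

```sql
-- Recent audit events, assuming system tables are enabled
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
ORDER BY event_time DESC
LIMIT 100;
```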
10. Best Practices
Use Separate Catalogs per Domain
- finance
- marketing
- risk
- customer
Use Schema for Data Layers
- bronze
- silver
- gold

Example:

finance.bronze.transactions
finance.silver.transactions
finance.gold.revenue
Follow Least Privilege Principle
Example:
- Analysts get SELECT only
- Engineers get CREATE TABLE
- Admins get ALL
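These roles map directly onto grants against the groups from section 4; a sketch:

```sql
-- Analysts: read-only on the curated layer
GRANT USE CATALOG ON CATALOG finance TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA finance.gold TO `data_analysts`;
GRANT SELECT ON SCHEMA finance.gold TO `data_analysts`;

-- Engineers: can create tables in the ingestion layer
GRANT CREATE TABLE ON SCHEMA finance.bronze TO `data_engineers`;

-- Admins: everything in the catalog
GRANT ALL PRIVILEGES ON CATALOG finance TO `platform_admins`;
```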
Centralize Metastore
One metastore per region.

Use Terraform for Governance

Never create catalogs manually in production. Use Infrastructure as Code.

11. Recommended Enterprise Structure
AWS Organization
|
+---- Dev Account
| |
| +---- Databricks Workspace
|
+---- UAT Account
| |
| +---- Databricks Workspace
|
+---- Prod Account
|
+---- Databricks Workspace
Unity Catalog Metastore shared across workspaces.
12. Production Security Model
| Layer | Security |
|---|---|
| Network | Private VPC, Private Link |
| Identity | SSO with Okta or Azure AD |
| Storage | S3 IAM Role access |
| Data | Unity Catalog RBAC |
| Audit | CloudTrail logs |
Conclusion
Unity Catalog is the foundation of enterprise data governance in Databricks.
It enables:
- Centralized governance
- Fine grained access control
- Cross workspace data sharing
- Audit and lineage
- Secure access to S3
For enterprise deployments, Unity Catalog should always be deployed with:
- Separate AWS accounts for Dev, UAT, and Prod
- Terraform infrastructure
- IAM roles for storage access
- RBAC using groups
- Private networking