Tuesday, 6 January 2026

Databricks Unity Catalog on AWS – Complete Deep Dive

Databricks Unity Catalog on AWS – Complete Architecture, Setup, Security and Best Practices

Unity Catalog is the centralized governance solution for Databricks that provides unified access control, auditing, lineage, and metadata management for all data assets.

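It governs:

  • Tables and views
  • Files (through volumes)
  • Machine learning models
  • Functions

Notebooks and dashboards remain workspace objects with their own access controls, although the data they read is governed by Unity Catalog.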

Unity Catalog works across multiple workspaces and environments.


1. Unity Catalog Architecture

Unity Catalog uses a centralized governance model.

                 +----------------------------+
                 |   Databricks Account       |
                 | (Account Console Layer)    |
                 +-------------+--------------+
                               |
                               |
                     +---------v-----------+
                     |  Unity Catalog      |
                     |  Metastore          |
                     +---------+-----------+
                               |
        -------------------------------------------------
        |                        |                       |
+-------v-------+       +-------v-------+       +-------v-------+
| Dev Workspace |       | UAT Workspace |       | Prod Workspace|
+---------------+       +---------------+       +---------------+

         |                       |                      |
      Access                 Access                 Access
         |                       |                      |

      S3 Storage             S3 Storage              S3 Storage

Key concept: A single Unity Catalog metastore can be shared across multiple workspaces, provided they belong to the same Databricks account and region.
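
One way to confirm that several workspaces share a metastore is to run the same check in each of them. A minimal sketch, assuming a SQL warehouse or cluster with Unity Catalog enabled:

```sql
-- Returns the ID of the metastore attached to the current workspace.
-- Workspaces that share a metastore should return the same value.
SELECT current_metastore();

-- Lists the catalogs the current user can see in that metastore.
SHOW CATALOGS;
```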


2. Core Components of Unity Catalog

Metastore

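The metastore is the top-level container that stores metadata for all data assets.

It includes:

  • Catalog definitions
  • Table schemas
  • Permissions
  • Data lineage

Audit events for these objects are written to the account-level audit logs rather than stored in the metastore itself.

Create one metastore per region and attach every workspace in that region to it.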

Example

Metastore: us-east-1-metastore
Region: us-east-1
Storage: s3://databricks-meta-storage

Catalog

A catalog is the first level of the three-level namespace used to organize data.

Catalog -> Schema -> Tables

Example catalogs:

  • finance
  • marketing
  • risk
  • ml_models

Example

CREATE CATALOG finance;

Schema

Schemas group tables within a catalog.

CREATE SCHEMA finance.transactions;
Structure:
finance (catalog)
   |
   |--- transactions (schema)
   |        |
   |        |--- payments_table
   |        |--- refund_table
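
The structure above maps directly onto the three-level names used in queries. A short sketch, assuming the tables shown in the tree already exist:

```sql
-- Fully qualified reference: catalog.schema.table
SELECT * FROM finance.transactions.payments_table LIMIT 10;

-- Alternatively, set session defaults and use the short name
USE CATALOG finance;
USE SCHEMA transactions;
SELECT * FROM payments_table LIMIT 10;
```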

Tables

Unity Catalog supports:
  • Managed tables
  • External tables

Managed Table

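Data is stored in the managed storage location configured for the metastore, catalog, or schema. Unity Catalog owns the data lifecycle, so dropping the table also removes the underlying files.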

CREATE TABLE finance.transactions.payments
(
 id INT,
 amount DOUBLE,
 created_date DATE
)
USING DELTA;

External Table

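Data is stored at an S3 path that you manage, which must fall under a registered external location. Dropping the table removes only the metadata; the files stay in S3.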

CREATE TABLE finance.transactions.external_payments
USING DELTA
LOCATION 's3://finance-data/payments/';
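
To see how the two table types differ once created, the table metadata can be inspected. This sketch assumes both tables from the examples above exist:

```sql
-- In the detailed output, compare the Type row (MANAGED vs EXTERNAL)
-- and the Location row (managed storage path vs your own S3 path).
DESCRIBE TABLE EXTENDED finance.transactions.payments;
DESCRIBE TABLE EXTENDED finance.transactions.external_payments;
```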

Volumes

Volumes allow secure access to files in object storage. A volume lives inside a schema, so its name has three levels.
CREATE VOLUME finance.transactions.raw_data;
Use cases:
  • ML training datasets
  • Raw ingestion files
  • Image datasets
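
A volume can also point at an existing S3 prefix. The following is a sketch only: the volume name `landing` and the S3 prefix are hypothetical, and the prefix is assumed to be covered by an external location you have access to.

```sql
-- Hypothetical external volume over an existing S3 prefix
CREATE EXTERNAL VOLUME IF NOT EXISTS finance.transactions.landing
LOCATION 's3://finance-data/landing/';

-- Files in a volume are addressed as /Volumes/<catalog>/<schema>/<volume>/<path>
LIST '/Volumes/finance/transactions/landing/';

-- Allow a group to read files in the volume
GRANT READ VOLUME ON VOLUME finance.transactions.landing TO `data_scientists`;
```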

3. Unity Catalog Security Model

Unity Catalog uses role-based access control (RBAC) over a hierarchy of securable objects:
Account
   |
Metastore
   |
Catalog
   |
Schema
   |
Table
Privileges granted on a parent object are inherited by its children.
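
Inheritance means a single grant high in the hierarchy can cover many objects. A minimal sketch, where the `auditors` group is a hypothetical name used for illustration:

```sql
-- A grant on the catalog applies to every current and future
-- schema and table beneath it.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG finance TO `auditors`;

-- Review what has been granted at the catalog level
SHOW GRANTS ON CATALOG finance;
```

Because the grant is this broad, it suits read-only oversight roles better than day-to-day users.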

Example Permission Model

Role            Permissions
--------------  --------------------
Data Engineer   CREATE TABLE, MODIFY
Data Analyst    SELECT
Data Scientist  SELECT, CREATE MODEL
Admin           ALL PRIVILEGES
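
The permission model above could be expressed as grants roughly as follows. This is a sketch under assumptions: the group names are taken from the example group structure in section 4, and all grants are scoped to the finance.transactions schema.

```sql
-- Data engineers: create and modify tables in the schema
GRANT USE CATALOG ON CATALOG finance TO `data_engineers`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY ON SCHEMA finance.transactions TO `data_engineers`;

-- Data analysts: read-only access
GRANT USE CATALOG ON CATALOG finance TO `data_analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA finance.transactions TO `data_analysts`;

-- Data scientists: read access plus model registration
GRANT USE CATALOG ON CATALOG finance TO `data_scientists`;
GRANT USE SCHEMA, SELECT, CREATE MODEL ON SCHEMA finance.transactions TO `data_scientists`;

-- Admins: everything in the catalog
GRANT ALL PRIVILEGES ON CATALOG finance TO `platform_admins`;
```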

Grant Example

GRANT SELECT
ON TABLE finance.transactions.payments
TO `analyst-group`;
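Grant access to the parent catalog and schema (Unity Catalog uses USE CATALOG and USE SCHEMA in place of the legacy USAGE privilege):
GRANT USE CATALOG
ON CATALOG finance
TO `analyst-group`;
GRANT USE SCHEMA
ON SCHEMA finance.transactions
TO `analyst-group`;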
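
After granting, it helps to verify the result and to know how to undo it. A short sketch using the same group and table:

```sql
-- Show what this group has been granted on the table
SHOW GRANTS `analyst-group` ON TABLE finance.transactions.payments;

-- Remove the privilege again if it is no longer needed
REVOKE SELECT ON TABLE finance.transactions.payments FROM `analyst-group`;
```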

4. Identity Integration

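Unity Catalog relies on account-level users, groups, and service principals, and integrates with:
  • AWS IAM (roles for storage access)
  • Azure AD (now Microsoft Entra ID)
  • Okta
  • SCIM (user and group provisioning)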
Example group structure:
data_engineers
data_analysts
data_scientists
platform_admins
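
Once groups have been provisioned, membership can be checked from SQL. A minimal sketch, assuming the group `data_engineers` exists at the account level:

```sql
-- List groups visible to the current user
SHOW GROUPS;

-- Check who is running the query and whether they belong to a given account group
SELECT current_user() AS user_name,
       is_account_group_member('data_engineers') AS is_engineer;
```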

5. Storage Credentials

Unity Catalog connects to S3 using storage credentials.

Example IAM Role

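resource "aws_iam_role" "databricks_uc_role" {
  name = "databricks-uc-access-role"

  # Trust policy: defines who may assume this role.
  # 123456789012 is a placeholder; use the Unity Catalog principal
  # that Databricks documents for your deployment.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        AWS = "arn:aws:iam::123456789012:root"
      }
      Action = "sts:AssumeRole"
      # Restrict the trust with an external ID. The variable name is an
      # assumption; it should hold your Databricks account ID.
      Condition = {
        StringEquals = {
          "sts:ExternalId" = var.databricks_account_id
        }
      }
    }]
  })
}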

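Create Storage Credential

Storage credentials are registered through Catalog Explorer, the Databricks API or CLI, or Terraform. With Terraform:

resource "databricks_storage_credential" "finance_credential" {
  name = "finance_credential"

  aws_iam_role {
    role_arn = "arn:aws:iam::123456789012:role/databricks-uc-access-role"
  }
}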

Create External Location

CREATE EXTERNAL LOCATION finance_s3
URL 's3://finance-data'
WITH (STORAGE CREDENTIAL finance_credential);
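
An external location is itself a securable object, so access to it is granted like any other. A sketch, assuming the `data_engineers` group from section 4:

```sql
-- Inspect the location and the credential it uses
DESCRIBE EXTERNAL LOCATION finance_s3;

-- Allow engineers to read files directly and to register external tables there
GRANT READ FILES ON EXTERNAL LOCATION finance_s3 TO `data_engineers`;
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION finance_s3 TO `data_engineers`;
```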

6. Unity Catalog Setup Steps

Step 1 – Create S3 Bucket

aws s3 mb s3://databricks-metastore-storage

Step 2 – Create IAM Role

This role allows Databricks to access the S3 bucket. See the example IAM role in section 5.

Step 3 – Create Metastore

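Using Terraform (the metastore is an account-level resource, so this assumes a provider configured against the Databricks account):
resource "databricks_metastore" "this" {
  name         = "company-metastore"
  storage_root = "s3://databricks-metastore-storage"
  region       = "us-east-1"
}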

Step 4 – Assign Metastore to Workspace

resource "databricks_metastore_assignment" "this" {

 workspace_id = var.workspace_id

 metastore_id = databricks_metastore.this.id
}

7. Terraform Example for Unity Catalog

Provider

provider "databricks" {
 host  = var.databricks_host
 token = var.databricks_token
}

Create Catalog

resource "databricks_catalog" "finance" {

 name = "finance"

 comment = "Finance data catalog"
}

Create Schema

resource "databricks_schema" "transactions" {

 name = "transactions"

 catalog_name = databricks_catalog.finance.name
}

Create Table

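resource "databricks_sql_table" "payments" {
  name               = "payments"
  catalog_name       = databricks_catalog.finance.name
  schema_name        = databricks_schema.transactions.name
  table_type         = "MANAGED"
  data_source_format = "DELTA"

  # Columns mirror the SQL definition of finance.transactions.payments.
  column {
    name = "id"
    type = "int"
  }
  column {
    name = "amount"
    type = "double"
  }
  column {
    name = "created_date"
    type = "date"
  }
}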

8. Data Lineage

Unity Catalog automatically captures table-level and column-level lineage for queries run on Databricks compute. Example:
Raw Table -> Bronze Table -> Silver Table -> Gold Table
Example pipeline:
bronze_orders
     |
silver_orders
     |
gold_revenue_dashboard
Benefits:
  • Impact analysis
  • Governance
  • Compliance
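
Lineage can also be queried for impact analysis. The sketch below assumes that system tables are enabled for the account and that a table named finance.bronze.orders exists; both are assumptions, not something the setup above guarantees.

```sql
-- Which downstream tables were written from finance.bronze.orders,
-- and when was each dependency last observed?
SELECT source_table_full_name,
       target_table_full_name,
       entity_type,
       MAX(event_time) AS last_seen
FROM system.access.table_lineage
WHERE source_table_full_name = 'finance.bronze.orders'
GROUP BY source_table_full_name, target_table_full_name, entity_type
ORDER BY last_seen DESC;
```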

9. Auditing

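Unity Catalog actions are recorded in the Databricks audit logs, including:
  • Table access
  • Permission changes
  • Query execution
Logs are available in:
An S3 bucket configured for Databricks audit log delivery
or
The system.access.audit system table
AWS CloudTrail separately records API calls made against the underlying S3 buckets.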
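
Where the audit system table is available, recent permission changes can be reviewed with a query along these lines. This is a sketch that assumes system tables are enabled and that the current user can read system.access.audit:

```sql
-- Permission changes on Unity Catalog objects over the last 7 days
SELECT event_time,
       user_identity.email AS user_email,
       action_name,
       request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'updatePermissions'
  AND event_date >= current_date() - INTERVAL 7 DAYS
ORDER BY event_time DESC
LIMIT 100;
```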

10. Best Practices

Use Separate Catalogs per Domain

finance
marketing
risk
customer

Use Schemas for Data Layers

bronze
silver
gold
Example:
finance.bronze.transactions
finance.silver.transactions
finance.gold.revenue
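
The layered layout could be created and used as follows. A minimal sketch: the column names are borrowed from the payments example earlier and are assumptions about what finance.bronze.transactions contains.

```sql
-- One schema per data layer
CREATE SCHEMA IF NOT EXISTS finance.bronze;
CREATE SCHEMA IF NOT EXISTS finance.silver;
CREATE SCHEMA IF NOT EXISTS finance.gold;

-- Promote de-duplicated, non-null rows from bronze to silver
CREATE OR REPLACE TABLE finance.silver.transactions AS
SELECT DISTINCT id, amount, created_date
FROM finance.bronze.transactions
WHERE amount IS NOT NULL;
```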

Follow Least Privilege Principle

Example:
  • Analysts get SELECT only
  • Engineers get CREATE TABLE
  • Admins get ALL PRIVILEGES
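
To check that least privilege holds in practice, grants can be reviewed through the catalog's information schema. A sketch, assuming the finance catalog exists and the current user can see the relevant grants:

```sql
-- Table-level privileges in the finance catalog other than plain SELECT,
-- which are the ones most worth reviewing
SELECT grantee, table_schema, table_name, privilege_type
FROM finance.information_schema.table_privileges
WHERE privilege_type <> 'SELECT'
ORDER BY grantee, table_schema, table_name;
```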

Centralize Metastore

Use one metastore per region and attach every workspace in that region to it.

Use Terraform for Governance

Never create catalogs manually in production. Use Infrastructure as Code.

11. Recommended Enterprise Structure

AWS Organization
    |
    +---- Dev Account
    |         |
    |         +---- Databricks Workspace
    |
    +---- UAT Account
    |         |
    |         +---- Databricks Workspace
    |
    +---- Prod Account
              |
              +---- Databricks Workspace
A single Unity Catalog metastore is shared across these workspaces, provided they are in the same region.

12. Production Security Model

Layer     Security
--------  ---------------------------------
Network   Private VPC, PrivateLink
Identity  SSO with Okta or Azure AD
Storage   S3 access through IAM roles
Data      Unity Catalog RBAC
Audit     Databricks audit logs, CloudTrail

Conclusion

Unity Catalog is the foundation of enterprise data governance in Databricks.

It enables:

  • Centralized governance
  • Fine-grained access control
  • Cross-workspace data sharing
  • Audit and lineage
  • Secure access to S3

For enterprise deployments, Unity Catalog should always be deployed with:

  • Separate AWS accounts for Dev, UAT, and Prod
  • Terraform infrastructure
  • IAM roles for storage access
  • RBAC using groups
  • Private networking
