Unity Catalog Metastore & Data Isolation
Enterprise-Level Technical Deep Dive with Real Examples (AWS Databricks)
1. What a Unity Catalog Metastore Really Is
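A Unity Catalog metastore is the top-level container for Unity Catalog metadata and the central security and governance layer for every Databricks workspace attached to it. It owns:
- All metadata (catalogs, schemas, tables, views, functions)
- All permissions (role-based grants, row-level security, column-level security)
- Access to physical storage through storage credentials and external locations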
2. Metastore Scope & Design Decision
Enterprise Best Practice
One metastore per:
- Cloud
- Region
- Compliance boundary
Why This Matters
- Enables cross-workspace data sharing
- Centralizes governance and audit
- Prevents duplicated security logic
3. Real Enterprise Architecture (AWS)
AWS Account
│
├── Unity Catalog Metastore (us-east-1)
│   ├── Storage Root
│   ├── Storage Credentials
│   ├── External Locations
│   ├── Catalog: prod
│   └── Catalog: dev
│
├── Databricks Workspace: dev
└── Databricks Workspace: prod
Both workspaces attach to the same metastore.
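One quick way to confirm that two workspaces really share a metastore is to run the same check in each of them. A minimal sketch, assuming a SQL warehouse or Unity Catalog-enabled cluster is available in both workspaces:

```sql
-- Run in both the dev and the prod workspace.
-- If both are attached to the same metastore, the returned ID should match.
SELECT current_metastore();

-- Lists the catalogs the current user can see from this workspace.
SHOW CATALOGS;
```

If the two IDs differ, the workspaces are attached to different metastores and grants made in one will not be visible in the other.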
4. Metastore Storage Root
The storage root is the default storage location for managed tables. Users never access it directly; all reads and writes go through Unity Catalog.
Example
s3://company-uc-root/
IAM Role Permissions
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
- s3:ListBucket
- s3:GetBucketLocation
5. Storage Credentials
A storage credential is a Unity Catalog object that wraps an IAM role.
Example
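CREATE STORAGE CREDENTIAL prod_storage_cred
WITH IAM_ROLE 'arn:aws:iam::123456789012:role/dbx-prod-uc-role';
This decouples cloud IAM from users: they are granted privileges on the storage credential object and never hold permissions on the IAM role itself. Treat the statement above as illustrative; on AWS, storage credentials are typically created through Catalog Explorer, the Databricks CLI or REST API, or Terraform.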
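Once the credential exists, access to it is managed like any other Unity Catalog object. The sketch below assumes the credential `prod_storage_cred` from the example and a hypothetical platform group named `group_platform_admins`:

```sql
-- Show metadata about the credential object.
DESCRIBE STORAGE CREDENTIAL prod_storage_cred;

-- Allow the platform team to define external locations on top of it.
-- `group_platform_admins` is a hypothetical group name, not from the original setup.
GRANT CREATE EXTERNAL LOCATION
ON STORAGE CREDENTIAL prod_storage_cred
TO `group_platform_admins`;
```

Keeping this privilege with a small platform group means data teams work with external locations and tables, not with the credential itself.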
6. External Locations (Actual Data Isolation)
External locations bind:
- S3 path
- Storage credential
Example
CREATE EXTERNAL LOCATION prod_sales_loc
URL 's3://prod-sales-data/'
WITH (STORAGE CREDENTIAL prod_storage_cred);
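The isolation comes from who is granted what on the external location. A minimal sketch, assuming the location `prod_sales_loc` above and the `group_sales_engineers` group used later in this post:

```sql
-- Read-only access to files under the location (for path-based reads).
GRANT READ FILES
ON EXTERNAL LOCATION prod_sales_loc
TO `group_sales_engineers`;

-- Permission to register external tables whose data lives under this path.
GRANT CREATE EXTERNAL TABLE
ON EXTERNAL LOCATION prod_sales_loc
TO `group_sales_engineers`;

-- Review what has been granted on the location.
SHOW GRANTS ON EXTERNAL LOCATION prod_sales_loc;
```

Groups without any grant on the location cannot read or write under `s3://prod-sales-data/` through Unity Catalog.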
7. Catalog-Level Isolation
Catalogs are the first logical isolation layer.
Example
CREATE CATALOG prod;
CREATE CATALOG dev;
Access Control
GRANT USE CATALOG ON CATALOG prod TO `group_prod_users`;
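A catalog-level traversal privilege exposes no data by itself; it only lets the group reach objects inside the catalog that are granted separately. The following sketch shows how the result might be checked:

```sql
-- List every principal and privilege granted on the catalog.
SHOW GRANTS ON CATALOG prod;

-- Make prod the session default so unqualified names resolve inside it.
USE CATALOG prod;
SELECT current_catalog();
```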
8. Schema-Level Isolation
Schemas isolate teams or business domains.
Example
CREATE SCHEMA prod.sales;
CREATE SCHEMA prod.finance;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.sales
TO `group_sales_analytics`;
9. Table-Level Isolation
Tables hold the actual data, so table-level grants are where most of the security risk sits.
Example
GRANT SELECT, MODIFY
ON TABLE prod.sales.customers
TO `group_sales_engineers`;
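Table grants are easy to hand out and easy to forget, so it helps to review and tighten them regularly. A short sketch using the same table and group:

```sql
-- See every privilege currently granted on the table.
SHOW GRANTS ON TABLE prod.sales.customers;

-- Remove write access later without touching read access.
REVOKE MODIFY
ON TABLE prod.sales.customers
FROM `group_sales_engineers`;

-- Confirm what the group still holds.
SHOW GRANTS `group_sales_engineers` ON TABLE prod.sales.customers;
```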
10. Cross-Workspace Data Sharing
Scenario
- Dev workspace needs read-only access to Prod data
Solution
GRANT SELECT
ON TABLE prod.sales.customers
TO `group_dev_engineers`;
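No S3 access is required: Unity Catalog enforces the grant, and cloud credentials are never exposed to the user. The group also needs USE CATALOG on prod and USE SCHEMA on prod.sales for the table grant to take effect.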
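A table grant only works when the group can also traverse the parent catalog and schema. The sketch below shows the supporting grants and a simple read-only check; the `DELETE` statement is only a hypothetical probe and matches no real rows:

```sql
-- Run once by an admin or the object owners.
GRANT USE CATALOG ON CATALOG prod TO `group_dev_engineers`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `group_dev_engineers`;

-- Run from the dev workspace as a member of group_dev_engineers.
SELECT count(*) FROM prod.sales.customers;

-- Without MODIFY on the table, a write such as this is expected to be rejected.
DELETE FROM prod.sales.customers WHERE id = -1;
```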
11. Row-Level Security (Dynamic Views)
Business Rule
| Group | Country Access |
|---|---|
| group_us_analysts | US |
| group_eu_analysts | EU |
Dynamic View
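CREATE VIEW prod.sales.customers_secure AS
SELECT *
FROM prod.sales.customers
WHERE
  -- is_account_group_member() checks account-level groups, which is what Unity Catalog grants use
  (is_account_group_member('group_us_analysts') AND country = 'US')
  OR
  (is_account_group_member('group_eu_analysts') AND country = 'EU');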
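Analysts should be granted SELECT on the view only, never on the base table, or the filter can be bypassed. To check what a given user actually sees, something like the following can be run while signed in as that user:

```sql
-- is_member() checks workspace-level groups;
-- is_account_group_member() checks account-level groups.
SELECT
  is_account_group_member('group_us_analysts') AS in_us_group,
  is_account_group_member('group_eu_analysts') AS in_eu_group;

-- Row counts per country as seen by the current user through the view.
SELECT country, count(*) AS visible_rows
FROM prod.sales.customers_secure
GROUP BY country;
```

A user who belongs to neither group matches no branch of the filter and gets an empty result.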
12. Column-Level Security
Example
CREATE VIEW prod.sales.customers_masked AS
SELECT
  id,
  name,
  CASE
    -- Only members of the PII admin group see the real value
    WHEN is_account_group_member('group_pii_admins') THEN ssn
    ELSE 'XXX-XX-XXXX'
  END AS ssn
FROM prod.sales.customers;
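Full masking is not the only option. A common variant keeps the last four digits visible so support staff can still confirm an identity. The sketch below assumes `ssn` is stored as a string in the form `123-45-6789`; the view name `customers_masked_partial` is hypothetical:

```sql
CREATE VIEW prod.sales.customers_masked_partial AS
SELECT
  id,
  name,
  CASE
    WHEN is_account_group_member('group_pii_admins') THEN ssn
    -- Everyone else sees only the last four characters.
    ELSE concat('XXX-XX-', right(ssn, 4))
  END AS ssn
FROM prod.sales.customers;
```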
13. Managed vs External Tables
| Type | Storage | Use Case |
|---|---|---|
| Managed | Unity Catalog managed storage (metastore root by default) | Dev, sandbox |
| External | External Location | Prod, regulated data |
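In SQL the difference comes down to the `LOCATION` clause. The sketch below is illustrative: the `dev.sandbox` schema, both table names, and the column definitions are assumptions, and the external table requires the CREATE EXTERNAL TABLE privilege on an external location covering the path.

```sql
CREATE SCHEMA IF NOT EXISTS dev.sandbox;

-- Managed table: no LOCATION clause, so Unity Catalog decides where files live.
CREATE TABLE dev.sandbox.events (
  event_id BIGINT,
  payload  STRING
);

-- External table: data stays at the given path under an external location.
CREATE TABLE prod.sales.orders (
  order_id BIGINT,
  amount   DECIMAL(18, 2)
)
LOCATION 's3://prod-sales-data/orders/';

-- The detailed output includes the table type (MANAGED or EXTERNAL).
DESCRIBE TABLE EXTENDED prod.sales.orders;
```

Dropping a managed table removes its data as well, while dropping an external table leaves the files in S3.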
14. How Security Is Actually Enforced
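Unity Catalog checks privileges at two points:
- At query planning
- At query execution
Even if a user knows the S3 path, they cannot read the data directly: they hold no IAM permissions on the bucket, and Unity Catalog only grants access to objects they have privileges on.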
15. Auditing & Lineage
Unity Catalog automatically captures:
- Who accessed what
- Which queries touched which tables
- Downstream dependencies
Example Query
SELECT * FROM system.access.audit;
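In practice the audit table is large, so queries usually filter by service, time window, and action. A hedged sketch, assuming system tables are enabled for the account; the column names and the `unityCatalog` service name should be verified against the system table schema in your environment:

```sql
-- Recent Unity Catalog events, newest first.
SELECT
  event_time,
  user_identity.email AS user_email,
  action_name,
  request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_date >= date_sub(current_date(), 7)
ORDER BY event_time DESC
LIMIT 100;

-- Downstream tables that read from the customers table.
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_full_name = 'prod.sales.customers';
```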
16. Common Enterprise Mistakes
- Creating a separate metastore per environment instead of sharing one
- Granting S3 access directly to users
- Relying on workspace ACLs to protect data
- No catalog separation between environments
17. Enterprise Golden Rules
- One metastore per region
- Always use groups
- Never grant to all users (the `account users` group)
- Use views for sensitive data
- Treat UC as a security firewall
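Some of these rules can be checked with a query. As one example, the sketch below looks for table privileges granted to every user in the account; it assumes the all-users group is named `account users` and that `system.information_schema` is readable by the person running it:

```sql
-- Tables and views where a privilege was granted to every account user.
SELECT
  table_catalog,
  table_schema,
  table_name,
  privilege_type
FROM system.information_schema.table_privileges
WHERE grantee = 'account users'
ORDER BY table_catalog, table_schema, table_name;
```

An empty result is the goal; any row returned is worth reviewing.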
18. End-to-End Access Example
| User | Group | Read | Write |
|---|---|---|---|
| User A | group_prod_engineers | All | Yes |
| User B | group_dev_engineers | All | No |
| User C | group_us_analysts | US only | No |
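The matrix above could be implemented with grants along these lines. This is a sketch under assumptions: "All" is read as the whole `prod` catalog, privileges granted on the catalog are inherited by its schemas and tables, and the `ON VIEW` form follows the GRANT securable-object grammar (some examples use `ON TABLE` for views, so check the reference for your runtime).

```sql
-- User A (group_prod_engineers): read and write everything in prod.
GRANT USE CATALOG, USE SCHEMA, SELECT, MODIFY
ON CATALOG prod
TO `group_prod_engineers`;

-- User B (group_dev_engineers): read everything in prod, no writes.
GRANT USE CATALOG, USE SCHEMA, SELECT
ON CATALOG prod
TO `group_dev_engineers`;

-- User C (group_us_analysts): traverse to the schema, then read only the secure view.
GRANT USE CATALOG ON CATALOG prod TO `group_us_analysts`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `group_us_analysts`;
GRANT SELECT ON VIEW prod.sales.customers_secure TO `group_us_analysts`;
```

User C receives no grant on `prod.sales.customers` itself, so the country filter in the dynamic view is the only path to the data.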
Final Summary
Unity Catalog is not just metadata. It is your data firewall, governance engine, and compliance backbone.
If the metastore is designed correctly, everything else becomes simple.