Databricks Unity Catalog Governance Example
5. Unity Catalog Object Hierarchy
Unity Catalog organizes data assets in a hierarchical structure:
Metastore
└── Catalog
└── Schema
└── Tables / Views / Functions
Example Enterprise Structure:
Metastore: enterprise_metastore
Catalogs
├── raw_data
│ └── bronze
│ └── customer_raw
│
├── curated_data
│ └── silver
│ └── customer_clean
│
└── analytics
└── gold
└── customer_revenue_summary
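The three-level namespace means every object is addressed as catalog.schema.object. A minimal sketch in plain Python of how the example hierarchy maps to fully qualified names (`full_names` and the `hierarchy` dict are illustrative helpers, not part of any Databricks API):

```python
# Example hierarchy from above: catalog -> schema -> tables
hierarchy = {
    "raw_data": {"bronze": ["customer_raw"]},
    "curated_data": {"silver": ["customer_clean"]},
    "analytics": {"gold": ["customer_revenue_summary"]},
}

def full_names(tree):
    """Flatten a catalog -> schema -> tables mapping into three-level names."""
    return [
        f"{catalog}.{schema}.{table}"
        for catalog, schemas in tree.items()
        for schema, tables in schemas.items()
        for table in tables
    ]
```

Every table in the steps below is referenced by such a three-level name, e.g. `raw_data.bronze.customer_raw`.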
6. Example Governance Implementation
We have three teams and example users:
| Team | Role |
|---|---|
| Data Engineers | Build ETL pipelines |
| Data Scientists | Build ML models |
| Business Analysts | Query reporting data |
Users:
- john.engineer@company.com
- sara.scientist@company.com
- mike.analyst@company.com
Groups:
- data_engineers
- data_scientists
- business_analysts
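Steps 1 and 2 below follow a mechanical pattern, so the statements can be generated from a membership mapping. A sketch in plain Python (`group_ddl` is a hypothetical helper, not a Databricks API):

```python
# Group -> member mapping from the example above
memberships = {
    "data_engineers": ["john.engineer@company.com"],
    "data_scientists": ["sara.scientist@company.com"],
    "business_analysts": ["mike.analyst@company.com"],
}

def group_ddl(groups):
    """Emit CREATE GROUP and ALTER GROUP ... ADD USER statements."""
    stmts = [f"CREATE GROUP {g};" for g in groups]
    for g, users in groups.items():
        stmts += [f"ALTER GROUP {g} ADD USER `{u}`;" for u in users]
    return stmts
```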
Step 1: Create Groups (SQL)
CREATE GROUP data_engineers;
CREATE GROUP data_scientists;
CREATE GROUP business_analysts;
Note: SQL CREATE GROUP creates workspace-local groups. For Unity Catalog governance, account-level groups (managed in the account console or via SCIM) are recommended.
Step 2: Add Users to Groups (SQL)
ALTER GROUP data_engineers ADD USER `john.engineer@company.com`;
ALTER GROUP data_scientists ADD USER `sara.scientist@company.com`;
ALTER GROUP business_analysts ADD USER `mike.analyst@company.com`;
Step 3: Create Catalogs (SQL)
CREATE CATALOG raw_data COMMENT 'Raw ingestion data catalog';
CREATE CATALOG curated_data COMMENT 'Processed datasets catalog';
CREATE CATALOG analytics COMMENT 'Business reporting catalog';
Step 4: Assign Catalog Ownership (SQL)
ALTER CATALOG raw_data OWNER TO data_engineers;
ALTER CATALOG curated_data OWNER TO data_engineers;
ALTER CATALOG analytics OWNER TO data_scientists;
Step 5: Create Schemas (SQL)
USE CATALOG raw_data;
CREATE SCHEMA bronze COMMENT 'Raw ingestion layer';

USE CATALOG curated_data;
CREATE SCHEMA silver COMMENT 'Cleaned data layer';

USE CATALOG analytics;
CREATE SCHEMA gold COMMENT 'Business reporting layer';
Step 6: Create Tables (SQL)
-- Bronze table
CREATE TABLE raw_data.bronze.customer_raw (
  customer_id STRING,
  name STRING,
  email STRING,
  created_date TIMESTAMP
) USING DELTA;

-- Silver table
CREATE TABLE curated_data.silver.customer_clean (
  customer_id STRING,
  name STRING,
  email STRING,
  created_date TIMESTAMP
) USING DELTA;

-- Gold table
CREATE TABLE analytics.gold.customer_revenue_summary (
  customer_id STRING,
  total_revenue DOUBLE,
  last_purchase DATE
) USING DELTA;
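All three tables follow the same CREATE TABLE ... USING DELTA pattern, which a small generator makes explicit. A sketch in plain Python (`create_table_ddl` is a hypothetical helper; columns are passed as (name, type) pairs):

```python
def create_table_ddl(full_name, columns):
    """Render a Delta CREATE TABLE statement for a three-level table name."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE {full_name} (\n  {cols}\n) USING DELTA;"

# Example: the bronze table from Step 6
bronze_ddl = create_table_ddl(
    "raw_data.bronze.customer_raw",
    [
        ("customer_id", "STRING"),
        ("name", "STRING"),
        ("email", "STRING"),
        ("created_date", "TIMESTAMP"),
    ],
)
```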
Step 7: Assign Permissions (SQL)
Data Engineers:
GRANT USE CATALOG ON CATALOG raw_data TO data_engineers;
GRANT USE SCHEMA ON SCHEMA raw_data.bronze TO data_engineers;
GRANT CREATE TABLE ON SCHEMA raw_data.bronze TO data_engineers;
GRANT MODIFY ON SCHEMA raw_data.bronze TO data_engineers;
Data Scientists:
GRANT USE CATALOG ON CATALOG curated_data TO data_scientists;
GRANT USE SCHEMA ON SCHEMA curated_data.silver TO data_scientists;
-- SELECT on the schema covers all current and future tables in it
GRANT SELECT ON SCHEMA curated_data.silver TO data_scientists;
Business Analysts:
GRANT USE CATALOG ON CATALOG analytics TO business_analysts;
GRANT USE SCHEMA ON SCHEMA analytics.gold TO business_analysts;
GRANT SELECT ON SCHEMA analytics.gold TO business_analysts;
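Each team's grant block repeats the same shape: catalog access plus schema-level privileges. A sketch that generates the statements in plain Python (`grant_statements` is a hypothetical helper, not a Databricks API):

```python
def grant_statements(principal, catalog, schema, schema_privileges):
    """Mirror the pattern above: USE CATALOG, then per-schema privileges."""
    stmts = [f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal};"]
    stmts += [
        f"GRANT {priv} ON SCHEMA {catalog}.{schema} TO {principal};"
        for priv in schema_privileges
    ]
    return stmts

# Example: the Business Analysts block
analyst_grants = grant_statements(
    "business_analysts", "analytics", "gold", ["USE SCHEMA", "SELECT"]
)
```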
Python API Examples
Using the databricks-sdk for managing Unity Catalog programmatically:
# Install the SDK
pip install databricks-sdk
Create Catalog
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
w.catalogs.create(
name="raw_data",
comment="Raw ingestion data"
)
Create Schema
w.schemas.create(
name="bronze",
catalog_name="raw_data",
comment="Raw ingestion layer"
)
Grant Permissions
from databricks.sdk.service import catalog

w.grants.update(
    securable_type=catalog.SecurableType.SCHEMA,
    full_name="raw_data.bronze",
    changes=[
        catalog.PermissionsChange(
            principal="data_engineers",
            add=[catalog.Privilege.CREATE_TABLE, catalog.Privilege.USE_SCHEMA],
        )
    ],
)
Create Table
spark.sql("""
CREATE TABLE raw_data.bronze.customer_raw (
customer_id STRING,
name STRING,
email STRING
) USING DELTA
""")
Enterprise Governance Summary
| Layer | Access |
|---|---|
| Bronze | Data Engineers (read/write) |
| Silver | Data Engineers (catalog owners) + Data Scientists (read) |
| Gold | Data Scientists (catalog owners) + Business Analysts (read) |
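The layer-to-team mapping above can be captured as a simple lookup for quick access checks. A toy sketch in plain Python (`can_read` is hypothetical; ownership-based write access is not modeled):

```python
# Read access per medallion layer, following the summary table above
layer_access = {
    "bronze": {"data_engineers"},
    "silver": {"data_engineers", "data_scientists"},
    "gold": {"business_analysts"},
}

def can_read(group, layer):
    """Return True if the group has read access to the given layer."""
    return group in layer_access.get(layer, set())
```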
Roles:
- Metastore Admin: Manage governance
- Data Engineer: ETL pipelines
- Data Scientist: ML & modeling
- Analyst: Reporting
Best Practices:
- Use groups instead of individual users
- Restrict raw data access
- Enable audit logging
- Use external locations for S3
- Enforce least privilege access