Unity Catalog Governance – Databricks on AWS
Unity Catalog is the centralized governance solution for Databricks that provides unified data governance, fine-grained access control, auditing, lineage tracking, and data discovery across all workspaces.
Unity Catalog allows organizations to control access to data assets such as catalogs, schemas, tables, views, volumes, and machine learning models while enforcing enterprise security policies.
1. Authentication
Authentication determines who the user is. In enterprise environments, authentication is typically integrated with the organization's Identity Provider (IdP).
Typical Authentication Flow
- The user attempts to access the Databricks workspace
- The user is redirected to the corporate Identity Provider
- The Identity Provider validates the user's credentials
- An authentication token is issued
- The user is granted access to Databricks
Supported Authentication Methods
- Single Sign-On (SSO)
- SAML 2.0
- SCIM User Provisioning (synchronizes users and groups; it complements SSO rather than authenticating users itself)
- OAuth Tokens
Example Enterprise Setup
Users authenticate through a corporate identity provider such as Okta or Azure Active Directory (now Microsoft Entra ID). Separately from sign-in, the identity provider synchronizes users and groups to Databricks through SCIM provisioning. Unity Catalog then uses these groups to manage data permissions.
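To make the group synchronization step more concrete, the sketch below builds the kind of SCIM 2.0 Group resource an identity provider might send when provisioning a group. The schema URN comes from the SCIM core schema (RFC 7643); the member IDs and the helper name `build_scim_group` are hypothetical, and the actual endpoint and authentication details depend on your Databricks account setup, so they are left out.

```python
import json

# SCIM 2.0 core Group schema URN, as defined in RFC 7643.
SCIM_GROUP_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:Group"


def build_scim_group(display_name, member_ids):
    """Return a SCIM 2.0 Group resource as a plain dict (hypothetical helper)."""
    return {
        "schemas": [SCIM_GROUP_SCHEMA],
        "displayName": display_name,
        # Each member is referenced by the ID the target system assigned to it.
        "members": [{"value": member_id} for member_id in member_ids],
    }


# Hypothetical member IDs; real values are assigned by the target system.
payload = build_scim_group("DataEngineers", ["1001", "1002"])
print(json.dumps(payload, indent=2))
```

In practice the identity provider's SCIM connector generates and sends these resources, so a script like this is mainly useful for understanding or testing the payload shape.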
2. Authorization
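Authorization determines what the authenticated user can access. Unity Catalog manages permissions by granting privileges on securable objects to users, groups, and service principals. Granting privileges to groups rather than individuals gives a Role-Based Access Control (RBAC) model.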
Access control is applied to the following securable objects:
- Catalogs
- Schemas
- Tables
- Views
- Volumes
- Functions
- Models
Example Permission Model
| Role | Access Level |
|---|---|
| Data Engineer | Create tables and manage pipelines |
| Data Scientist | Read curated datasets |
| Business Analyst | Query Gold layer datasets |
3. Unity Catalog Role Types
Unity Catalog uses administrative roles and permission-based roles to control governance.
Account Administrator
- Highest-level administrative role
- Manages Databricks account settings
- Creates the Unity Catalog metastore
- Assigns metastore administrators
Metastore Administrator
- Manages catalogs and storage locations
- Controls overall data governance policies
- Grants permissions to catalogs
Catalog Owner
- Full control of a catalog
- Can create schemas
- Can grant permissions within the catalog
Schema Owner
- Manages objects inside the schema
- Creates tables and views
- Manages schema permissions
Table Owner
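- Full control of the table
- Can alter the table definition (for example, its columns)
- Can grant SELECT and MODIFY privileges (Unity Catalog has no separate INSERT or UPDATE privileges; MODIFY covers inserts, updates, and deletes)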
4. Identity Types in Unity Catalog
Unity Catalog supports multiple identity types for managing access control.
Users
Individual human identities authenticated through the enterprise identity provider.
Example Users:
- data.engineer@company.com
- data.scientist@company.com
- analyst@company.com
Groups
Groups are collections of users synchronized from the Identity Provider. Permissions are assigned to groups instead of individual users to simplify governance.
Example Groups:
| Group Name | Purpose |
|---|---|
| DataEngineers | Develop ETL pipelines |
| DataScientists | Access curated datasets |
| BusinessAnalysts | Query aggregated data |
Service Principals
Service principals represent non-human identities used by applications, automation scripts, or CI/CD pipelines.
Examples:
- ETL pipeline service principal
- Airflow automation user
- CI/CD deployment identity
5. Unity Catalog Object Hierarchy
Unity Catalog organizes data assets in a hierarchical structure.
Metastore
└── Catalog
    └── Schema
        └── Tables / Views / Functions
Example Hierarchy
Metastore: enterprise_metastore
└── Catalog: finance
    └── Schema: transactions
        └── Tables: daily_transactions, monthly_revenue
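In queries, objects in this hierarchy are addressed with a three-level name of the form catalog.schema.table; the metastore itself is not part of the name. A minimal sketch of that naming rule, using the example objects above (the `TableRef` class is a hypothetical helper, not part of any Databricks library):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """Hypothetical helper holding the three levels of a Unity Catalog table name."""

    catalog: str
    schema: str
    table: str

    def full_name(self) -> str:
        # Join the three levels with dots: catalog.schema.table
        return ".".join((self.catalog, self.schema, self.table))

    @classmethod
    def parse(cls, name: str) -> "TableRef":
        # Split a fully qualified name back into its three levels.
        parts = name.split(".")
        if len(parts) != 3:
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
        return cls(*parts)


daily = TableRef("finance", "transactions", "daily_transactions")
print(daily.full_name())
```

This simple version assumes identifiers contain no dots or special characters; names that do would need backtick quoting in SQL.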
6. Example Governance Implementation
Below is an example of implementing governance using Unity Catalog.
Catalog Level
| Catalog | Owner | Purpose |
|---|---|---|
| raw_data | DataEngineeringTeam | Raw ingested data |
| curated_data | DataEngineeringTeam | Clean and processed datasets |
| analytics | DataAnalyticsTeam | Business reporting data |
Schema Example
| Schema | Purpose |
|---|---|
| bronze | Raw ingestion layer |
| silver | Cleaned and standardized data |
| gold | Business-ready datasets |
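The layout above could be created with CREATE CATALOG and CREATE SCHEMA statements. The sketch below only generates the SQL text so that it can be reviewed before being run in a Databricks SQL editor or notebook. The source does not say which catalog holds the bronze, silver, and gold schemas, so placing all three in curated_data is an assumption made purely for illustration, as is the helper name `build_ddl`.

```python
# Catalog names and purposes are taken from the catalog table above.
CATALOGS = {
    "raw_data": "Raw ingested data",
    "curated_data": "Clean and processed datasets",
    "analytics": "Business reporting data",
}

# Schema names and purposes are taken from the schema table above.
SCHEMAS = {
    "bronze": "Raw ingestion layer",
    "silver": "Cleaned and standardized data",
    "gold": "Business-ready datasets",
}


def build_ddl(catalogs, schemas, schema_catalog):
    """Return CREATE statements for every catalog, plus the schemas in one catalog."""
    statements = [
        f"CREATE CATALOG IF NOT EXISTS {name} COMMENT '{purpose}'"
        for name, purpose in catalogs.items()
    ]
    statements += [
        f"CREATE SCHEMA IF NOT EXISTS {schema_catalog}.{name} COMMENT '{purpose}'"
        for name, purpose in schemas.items()
    ]
    return statements


# Assumption: the three layer schemas are placed in curated_data for illustration.
ddl = build_ddl(CATALOGS, SCHEMAS, schema_catalog="curated_data")
for statement in ddl:
    print(statement + ";")
```

Generating the statements from a single mapping keeps the catalog and schema definitions in one reviewable place, which can be handy when the same layout has to be repeated across environments.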
7. Example Permission Assignments
| Group | Object | Permission |
|---|---|---|
| DataEngineers | bronze schema | CREATE TABLE |
| DataScientists | silver schema | SELECT |
| BusinessAnalysts | gold schema | SELECT |
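These assignments map onto Unity Catalog GRANT statements. Note that a group also needs USE CATALOG on the parent catalog and USE SCHEMA on the schema before a privilege such as SELECT takes effect, and that a privilege granted on a schema is inherited by the objects inside it. The sketch below generates the statements as text. The catalog name curated_data is an assumption, since the table above does not say which catalog the schemas belong to, and `build_grants` is a hypothetical helper.

```python
# (group, schema, privilege) rows taken from the permission table above.
ASSIGNMENTS = [
    ("DataEngineers", "bronze", "CREATE TABLE"),
    ("DataScientists", "silver", "SELECT"),
    ("BusinessAnalysts", "gold", "SELECT"),
]


def build_grants(catalog, assignments):
    """Return GRANT statements, including the USE privileges each group needs."""
    statements = []
    for group, schema, privilege in assignments:
        principal = f"`{group}`"  # backticks quote the group name
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}")
        statements.append(f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {principal}")
        statements.append(f"GRANT {privilege} ON SCHEMA {catalog}.{schema} TO {principal}")
    return statements


# Assumption: all three schemas live in a catalog named curated_data.
grants = build_grants("curated_data", ASSIGNMENTS)
for statement in grants:
    print(statement + ";")
```

Keeping the assignments as data makes it straightforward to review them against the least privilege principle before anything is executed.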
8. Data Lineage and Auditing
Unity Catalog automatically captures data lineage for workloads run on Databricks and records access activity in audit logs.
Capabilities
- Column-level lineage
- End-to-end pipeline visibility
- Query history tracking
- Audit logging
Example
If a Gold table is created from a Silver table using a transformation job, Unity Catalog automatically records the lineage between these datasets.
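To illustrate what recorded lineage makes possible, the toy model below stores source-to-target edges and walks them backwards to find every upstream table of a Gold table. This is a conceptual sketch only: it is not the Unity Catalog lineage API, and the table names and the `LineageGraph` class are hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Toy in-memory lineage store (illustration only, not a Databricks API)."""

    def __init__(self):
        # Maps each target table to the set of tables it was derived from.
        self._sources = defaultdict(set)

    def record(self, source, target):
        """Record that `target` was produced from `source`."""
        self._sources[target].add(source)

    def upstream(self, table):
        """Return every table that `table` depends on, directly or indirectly."""
        seen = set()
        stack = [table]
        while stack:
            current = stack.pop()
            for source in self._sources.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Hypothetical table names following the Bronze/Silver/Gold layers.
graph = LineageGraph()
graph.record("curated_data.bronze.orders", "curated_data.silver.orders")
graph.record("curated_data.silver.orders", "curated_data.gold.daily_revenue")
print(sorted(graph.upstream("curated_data.gold.daily_revenue")))
```

The same backward walk is what an impact or root-cause analysis does conceptually: starting from a report table, it finds every dataset that could have influenced it.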
9. Security Best Practices
- Use groups instead of assigning permissions to individual users
- Apply the principle of least privilege
- Separate environments (Dev, Test, Production)
- Enable audit logging
- Use service principals for automation
- Restrict raw data access
10. Governance Architecture Summary
| Component | Purpose |
|---|---|
| Identity Provider | User authentication |
| Unity Catalog | Centralized governance |
| Groups | Access management |
| Catalogs and Schemas | Logical data organization |
| Permissions | Fine-grained access control |
This governance model helps ensure that enterprise data assets are securely managed, access is properly controlled, and compliance requirements are met while enabling scalable analytics workloads in Databricks on AWS.