Data Classification in Databricks using Unity Catalog
Data classification is the process of labeling data based on sensitivity, such as PII (Personally Identifiable Information), Confidential, Financial, or Public data.
In Databricks, data classification is implemented using Unity Catalog through:
- Column Tags
- Column Masking Policies
- Row-Level Security
- Role-Based Access Control (RBAC)
- Audit Logging
1. Example Scenario
Suppose we have a customers table in the Gold layer. Some columns contain sensitive information.
sales_data.gold.customers
Example Table Structure
| customer_id | name | email | ssn | region | signup_date |
|---|---|---|---|---|---|
| 101 | John Smith | john@email.com | 123-45-6789 | US | 2025-01-01 |
| 102 | Sarah Lee | sarah@email.com | 987-65-4321 | US | 2025-02-01 |
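For readers who want to follow along, the example table could be created roughly as follows. The column types are assumptions (the post does not state them), and the statements presume that the sales_data catalog and gold schema already exist.

```sql
-- Sketch only: column types are assumed, not taken from the original table.
CREATE TABLE IF NOT EXISTS sales_data.gold.customers (
  customer_id BIGINT,
  name        STRING,
  email       STRING,
  ssn         STRING,
  region      STRING,
  signup_date DATE
);

-- The two sample rows from the table above.
INSERT INTO sales_data.gold.customers VALUES
  (101, 'John Smith', 'john@email.com',  '123-45-6789', 'US', DATE'2025-01-01'),
  (102, 'Sarah Lee',  'sarah@email.com', '987-65-4321', 'US', DATE'2025-02-01');
```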
2. Column Classification Using Tags
Unity Catalog allows tagging columns with classification labels.
Example: Tag Sensitive Columns
ALTER TABLE sales_data.gold.customers
ALTER COLUMN email
SET TAGS ('classification'='PII');
ALTER TABLE sales_data.gold.customers
ALTER COLUMN ssn
SET TAGS ('classification'='Sensitive');
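Tags do not restrict access by themselves. They label columns so that sensitive data can be discovered, reported on for compliance, and matched with the access controls described in the following sections.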
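Once columns are tagged, the tags can be queried for discovery. A minimal sketch, assuming the workspace exposes the `system.information_schema.column_tags` view and that you have permission to read it:

```sql
-- List every tagged column in the gold schema with its classification.
SELECT table_name, column_name, tag_value AS classification
FROM system.information_schema.column_tags
WHERE catalog_name = 'sales_data'
  AND schema_name = 'gold'
  AND tag_name = 'classification'
ORDER BY table_name, column_name;

-- Remove a tag when a column is reclassified.
ALTER TABLE sales_data.gold.customers
  ALTER COLUMN email
  UNSET TAGS ('classification');
```

A query like this is a convenient starting point for a periodic report of which columns are classified as PII or Sensitive.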
3. Column Masking Policies
Unity Catalog implements masking policies as column masks: SQL functions that are evaluated at query time and return either the real value or a masked value, depending on the caller's group membership.
Create Masking Function
CREATE FUNCTION mask_pii(val STRING)
RETURN
CASE
WHEN is_account_group_member('data_engineers') THEN val
ELSE '***MASKED***'
END;
Apply Masking to Column
ALTER TABLE sales_data.gold.customers ALTER COLUMN ssn SET MASK mask_pii;
Behavior:
- Data Engineers → See actual SSN
- Data Analysts → See masked value
- Other users → See masked value
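A full replacement is not always necessary. As a variant, the sketch below keeps the last four digits of the SSN visible to non-privileged users. The function name `mask_ssn_last4` is hypothetical, and the example assumes SSNs are stored as strings in the `123-45-6789` format used in the sample table.

```sql
-- Hypothetical partial mask: members of data_engineers see the full value,
-- everyone else sees only the last four digits.
CREATE OR REPLACE FUNCTION sales_data.gold.mask_ssn_last4(val STRING)
RETURN
  CASE
    WHEN is_account_group_member('data_engineers') THEN val
    ELSE concat('***-**-', right(val, 4))
  END;

-- Attach the mask to the column.
ALTER TABLE sales_data.gold.customers
  ALTER COLUMN ssn
  SET MASK sales_data.gold.mask_ssn_last4;

-- Remove the mask again.
ALTER TABLE sales_data.gold.customers
  ALTER COLUMN ssn
  DROP MASK;
```

Partial masks of this kind are useful when analysts need to match or verify records without seeing the full identifier.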
4. Role-Based Access Control (RBAC)
Unity Catalog controls access at catalog, schema, and table levels.
Example Enterprise Role Strategy
| Role | Bronze | Silver | Gold |
|---|---|---|---|
| Data Engineer | Full Access | Full Access | Full Access |
| Data Analyst | No Access | No Access | Read Only |
| BI Team | No Access | No Access | Read Only |
Example Gold-Only Access
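GRANT USE CATALOG ON CATALOG sales_data TO `analysts`;
GRANT USE SCHEMA ON SCHEMA sales_data.gold TO `analysts`;
GRANT SELECT ON SCHEMA sales_data.gold TO `analysts`;
Granting SELECT on the schema covers all current and future tables in it, because Unity Catalog privileges are inherited by child objects.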
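The rest of the role table could be expressed along the following lines. The `bronze` and `silver` schema names and the `data_engineers` and `bi_team` group names are assumptions; only `sales_data.gold` and `analysts` appear in the post.

```sql
-- Data engineers: full access to every layer (schema names assumed).
GRANT USE CATALOG ON CATALOG sales_data TO `data_engineers`;
GRANT ALL PRIVILEGES ON SCHEMA sales_data.bronze TO `data_engineers`;
GRANT ALL PRIVILEGES ON SCHEMA sales_data.silver TO `data_engineers`;
GRANT ALL PRIVILEGES ON SCHEMA sales_data.gold TO `data_engineers`;

-- BI team: read-only on Gold, nothing on Bronze or Silver.
GRANT USE CATALOG ON CATALOG sales_data TO `bi_team`;
GRANT USE SCHEMA ON SCHEMA sales_data.gold TO `bi_team`;
GRANT SELECT ON SCHEMA sales_data.gold TO `bi_team`;

-- Check what a group can actually do on the schema.
SHOW GRANTS `bi_team` ON SCHEMA sales_data.gold;

-- Revoke a grant that was made by mistake.
REVOKE SELECT ON SCHEMA sales_data.gold FROM `bi_team`;
```

"No Access" needs no statement of its own: a group that has not been granted a privilege on Bronze or Silver simply cannot read those schemas.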
5. Row-Level Security (Optional)
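Row-level security restricts which rows a user can see. In Unity Catalog, a row filter is a SQL function that returns a boolean and is attached to a table; rows for which it returns false are hidden from the caller.
Example: Region-Based Access
-- current_user_region() is a placeholder for a user-defined function that returns the caller's region; it is not a Databricks built-in.
CREATE FUNCTION region_filter(region STRING)
RETURN region = current_user_region();
ALTER TABLE sales_data.gold.orders SET ROW FILTER region_filter ON (region);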
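One common way to make a region filter concrete is a mapping table that records which region each user may see. The sketch below is only an illustration: the `user_regions` table, its columns, and the function name `region_filter_by_user` are hypothetical, and `sales_data.gold.orders` is assumed to have a `region` column.

```sql
-- Hypothetical mapping of user names to the region they may see.
CREATE TABLE IF NOT EXISTS sales_data.gold.user_regions (
  username STRING,
  region   STRING
);

-- Returns TRUE for rows whose region is mapped to the current user.
-- Members of data_engineers bypass the filter.
CREATE OR REPLACE FUNCTION sales_data.gold.region_filter_by_user(order_region STRING)
RETURN
  is_account_group_member('data_engineers')
  OR EXISTS (
    SELECT 1
    FROM sales_data.gold.user_regions u
    WHERE u.username = current_user()
      AND u.region = order_region
  );

-- Attach the filter, passing the table's region column to the function.
ALTER TABLE sales_data.gold.orders
  SET ROW FILTER sales_data.gold.region_filter_by_user ON (region);

-- Remove the filter.
ALTER TABLE sales_data.gold.orders DROP ROW FILTER;
```

Keeping the user-to-region mapping in a table means access can be changed with an INSERT or DELETE instead of redefining the function.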
6. Complete Governance Model
Enterprise data classification typically includes:
- Column tags for classification
- Masking policies for PII
- Role-based access control
- Row-level filtering
- Audit logging
- Dev/UAT/Prod separation
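For the audit logging item, a query along these lines can show who accessed a classified table. It is a sketch under assumptions: it presumes audit log system tables are enabled and readable in your account, and the `getTable` action and `full_name_arg` request parameter are used here as illustrative filters that you should check against the events actually recorded in your workspace.

```sql
-- Recent Unity Catalog access events for the customers table (last 7 days).
SELECT
  event_time,
  user_identity.email AS user_email,
  action_name,
  request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_date >= date_sub(current_date(), 7)
  AND action_name = 'getTable'
  AND request_params['full_name_arg'] = 'sales_data.gold.customers'
ORDER BY event_time DESC
LIMIT 100;
```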
7. How Big Companies Use This
Organizations use data classification in Databricks for:
- Fraud detection
- Trading analytics
- Customer 360
- AI feature stores
- Regulatory compliance
Typical enterprise scale:
- 50TB+ data daily
- 1000+ tables
- 100+ pipelines
Conclusion
Data classification in Databricks using Unity Catalog provides enterprise-grade governance for sensitive data.
By combining column tags, masking policies, and role-based access control, organizations can securely manage PII and other sensitive information while enabling analytics and AI workloads.