Tuesday, 6 January 2026

Data Classification in Databricks using Unity Catalog

Data classification is the process of labeling data based on sensitivity, such as PII (Personally Identifiable Information), Confidential, Financial, or Public data.

In Databricks, Unity Catalog is used both to classify data and to enforce that classification, through:

  • Column Tags
  • Column Masking Policies
  • Row-Level Security
  • Role-Based Access Control (RBAC)
  • Audit Logging

1. Example Scenario

Suppose we have a customers table in the Gold layer. Some columns contain sensitive information.

sales_data.gold.customers

Example Table Structure

customer_id  name        email            ssn          region  signup_date
101          John Smith  john@email.com   123-45-6789  US      2025-01-01
102          Sarah Lee   sarah@email.com  987-65-4321  US      2025-02-01

2. Column Classification Using Tags

Unity Catalog allows tagging columns with classification labels.

Example: Mark Columns as PII

ALTER TABLE sales_data.gold.customers
ALTER COLUMN email
SET TAGS ('classification'='PII');

ALTER TABLE sales_data.gold.customers
ALTER COLUMN ssn
SET TAGS ('classification'='Sensitive');

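Tags do not restrict access by themselves. They make sensitive columns discoverable and give governance and compliance processes something concrete to search and report on.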
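When many columns need labels, the tagging statements can be generated from a single mapping rather than written by hand. The following is a minimal sketch in plain Python, assuming a hypothetical `CLASSIFICATIONS` mapping and an `ALLOWED_LABELS` set built from the labels mentioned in this post; the helper `tag_statements` is illustrative and not part of any Databricks API.

```python
# Labels mentioned in this post; treating them as the allowed set is an assumption.
ALLOWED_LABELS = {"PII", "Sensitive", "Confidential", "Financial", "Public"}

# Hypothetical mapping of column name -> classification label.
CLASSIFICATIONS = {"email": "PII", "ssn": "Sensitive"}


def tag_statements(table: str, classifications: dict) -> list:
    """Build one ALTER TABLE ... SET TAGS statement per classified column."""
    statements = []
    for column, label in classifications.items():
        # Reject labels outside the agreed vocabulary so tags stay consistent.
        if label not in ALLOWED_LABELS:
            raise ValueError(f"Unknown classification label: {label!r}")
        statements.append(
            f"ALTER TABLE {table} ALTER COLUMN {column} "
            f"SET TAGS ('classification' = '{label}')"
        )
    return statements


statements = tag_statements("sales_data.gold.customers", CLASSIFICATIONS)
for statement in statements:
    print(statement)
```

In a Databricks notebook, each generated string could then be executed with `spark.sql(statement)`. Keeping the labels in one validated set helps avoid near-duplicate tag values such as 'PII' and 'pii'.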


3. Column Masking Policies

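Masking policies allow dynamic data protection based on user roles. In Unity Catalog, a column mask is a SQL function that receives the column value and returns either the original value or a masked one, depending on who runs the query.

Create Masking Function

-- A column mask is an ordinary SQL UDF; the group check runs at query time
CREATE FUNCTION mask_pii(val STRING)
RETURNS STRING
RETURN
  CASE
    WHEN is_account_group_member('data_engineers') THEN val
    ELSE '***MASKED***'
  END;

Apply Masking to Column

ALTER TABLE sales_data.gold.customers
ALTER COLUMN ssn
SET MASK mask_pii;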

Behavior:

  • Data Engineers → See actual SSN
  • Data Analysts → See masked value
  • Other users → See masked value
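The behavior listed above can be modeled outside Databricks to make the rule easy to reason about and test. This is a plain-Python sketch of the same decision, not the Databricks implementation; the user names and group memberships are hypothetical.

```python
MASKED = "***MASKED***"
UNMASKED_GROUP = "data_engineers"


def mask_pii(val: str, user_groups: set) -> str:
    """Return the real value only for members of the unmasked group."""
    return val if UNMASKED_GROUP in user_groups else MASKED


# Hypothetical users and their group memberships.
users = {
    "engineer": {"data_engineers"},
    "analyst": {"analysts"},
    "other": set(),
}

for name, groups in users.items():
    print(name, mask_pii("123-45-6789", groups))
```

The rule is an allow-list: anyone not explicitly in the unmasked group, including users with no groups at all, gets the masked value.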

4. Role-Based Access Control (RBAC)

Unity Catalog controls access at catalog, schema, and table levels.

Example Enterprise Role Strategy

Role           Bronze       Silver       Gold
Data Engineer  Full Access  Full Access  Full Access
Data Analyst   No Access    No Access    Read Only
BI Team        No Access    No Access    Read Only

Example Gold-Only Access

GRANT USE CATALOG ON CATALOG sales_data TO `analysts`;

GRANT USE SCHEMA ON SCHEMA sales_data.gold TO `analysts`;

-- SELECT granted on a schema is inherited by all current and future tables in it
GRANT SELECT ON SCHEMA sales_data.gold TO `analysts`;
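The role strategy table can also be kept as data and turned into grant statements, so that the matrix and the actual privileges do not drift apart. The sketch below is plain Python and uses the Unity Catalog privileges `USE CATALOG`, `USE SCHEMA`, `SELECT`, and `ALL PRIVILEGES`. The group names (`data_engineers`, `analysts`, `bi_team`) and the `bronze` and `silver` schema names are assumptions; only `sales_data.gold` appears in this post.

```python
# Role matrix from the table above; group and schema names are hypothetical.
ACCESS = {
    "data_engineers": {"bronze": "full", "silver": "full", "gold": "full"},
    "analysts": {"gold": "read"},
    "bi_team": {"gold": "read"},
}

# Privileges granted on a schema for each access level.
PRIVILEGES = {
    "full": ["ALL PRIVILEGES"],
    "read": ["USE SCHEMA", "SELECT"],
}


def grant_statements(catalog: str, access: dict) -> list:
    """Translate the role matrix into GRANT statements; 'No Access' means no grant."""
    statements = []
    for group, layers in access.items():
        # Every group needs USE CATALOG before any schema-level grant is usable.
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`")
        for layer, level in layers.items():
            for privilege in PRIVILEGES[level]:
                statements.append(
                    f"GRANT {privilege} ON SCHEMA {catalog}.{layer} TO `{group}`"
                )
    return statements


grants = grant_statements("sales_data", ACCESS)
for grant in grants:
    print(grant)
```

Layers a group should not see are simply absent from its entry, so no statement is generated for them.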

5. Row-Level Security (Optional)

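Row-level security restricts which rows a user can see, based on a condition evaluated at query time. In Unity Catalog, a row filter is a SQL function that returns a BOOLEAN and is attached to a table.

Example: Region-Based Access

-- current_user_region() is not a built-in function; it stands for a
-- user-defined lookup that returns the region assigned to the current user
CREATE FUNCTION region_filter(region STRING)
RETURNS BOOLEAN
RETURN region = current_user_region();

ALTER TABLE sales_data.gold.orders
SET ROW FILTER region_filter ON (region);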
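The effect of a region-based filter is easier to see with a few rows. This plain-Python sketch models the predicate only; the `USER_REGION` lookup, the user emails, and the sample orders are all hypothetical and do not come from a Databricks API.

```python
# Hypothetical lookup of the region assigned to each user.
USER_REGION = {"alice@example.com": "US", "bob@example.com": "EU"}

# Hypothetical sample rows from an orders table.
ORDERS = [
    {"order_id": 1, "region": "US"},
    {"order_id": 2, "region": "EU"},
    {"order_id": 3, "region": "US"},
]


def region_filter(row_region: str, user: str) -> bool:
    """Keep a row only when its region matches the user's assigned region."""
    user_region = USER_REGION.get(user)
    # Users without an assigned region see no rows at all.
    return user_region is not None and user_region == row_region


def visible_orders(user: str) -> list:
    """Return the rows the given user would see with the filter applied."""
    return [row for row in ORDERS if region_filter(row["region"], user)]


for user in ["alice@example.com", "bob@example.com", "carol@example.com"]:
    print(user, [row["order_id"] for row in visible_orders(user)])
```

The filter is applied per row for whoever runs the query, so the same table yields different result sets for different users without maintaining separate views.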

6. Complete Governance Model

Enterprise data classification typically includes:

  • Column tags for classification
  • Masking policies for PII
  • Role-based access control
  • Row-level filtering
  • Audit logging
  • Dev/UAT/Prod separation
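Because tags and masks are applied separately, a column can be classified as sensitive without ever being masked. A simple coverage check over column metadata can catch that gap. The sketch below is plain Python over hand-written metadata; the `Column` class, the `MUST_MASK` set, and the helper name are assumptions, and in practice the metadata would come from the catalog rather than a literal list.

```python
from dataclasses import dataclass
from typing import Optional

# Labels treated as requiring a mask; this policy is an assumption of the sketch.
MUST_MASK = {"PII", "Sensitive"}


@dataclass
class Column:
    name: str
    classification: Optional[str] = None  # value of the 'classification' tag, if any
    mask: Optional[str] = None  # name of the mask function applied, if any


def unmasked_sensitive(columns: list) -> list:
    """Return names of columns that are classified as sensitive but have no mask."""
    return [
        c.name
        for c in columns
        if c.classification in MUST_MASK and c.mask is None
    ]


# The customers table as configured in the earlier sections of this post:
# email is tagged PII, ssn is tagged Sensitive, and only ssn has a mask.
customers = [
    Column("customer_id"),
    Column("name"),
    Column("email", classification="PII"),
    Column("ssn", classification="Sensitive", mask="mask_pii"),
    Column("region"),
    Column("signup_date"),
]

print(unmasked_sensitive(customers))  # → ['email']
```

Run against the example from this post, the check reports `email`: it was tagged as PII in section 2, but only `ssn` received a mask in section 3.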

7. How Big Companies Use This

Organizations apply data classification in Databricks to govern workloads such as:

  • Fraud detection
  • Trading analytics
  • Customer 360
  • AI feature stores
  • Regulatory compliance

Illustrative enterprise scale:

  • 50TB+ data daily
  • 1000+ tables
  • 100+ pipelines

Conclusion

Data classification in Databricks using Unity Catalog provides enterprise-grade governance for sensitive data.

By combining column tags, masking policies, and role-based access control, organizations can securely manage PII and other sensitive information while enabling analytics and AI workloads.
