Complete Medallion Architecture Guide (Bronze, Silver, Gold)
This guide explains the full Medallion Architecture pattern used in modern data platforms such as Databricks Lakehouse.
1. Bronze Layer (Raw Data)
Bronze stores exact raw data from source systems without transformation.
Typical Sources:
- API
- Kafka / Streaming systems
- CDC databases
- JSON files
- Application logs
Example Bronze Table
bronze.orders_raw
Example Bronze Records
| order_id | payload | ingestion_time |
|---|---|---|
| 1001 | {"customer":"C101","amount":250,"status":"NEW"} | 2026-01-01 |
| 1002 | {"customer":"C102","amount":300,"status":"NEW"} | 2026-01-01 |
| 1003 | {"customer":"C101","amount":150,"status":"CANCELLED"} | 2026-01-02 |
Create Bronze Table
CREATE TABLE bronze.orders_raw ( order_id STRING, payload STRING, ingestion_time TIMESTAMP ) USING DELTA;
2. Silver Layer (Cleaned Data)
Silver layer contains structured and validated data. JSON is parsed and converted into columns.
Example Silver Table
silver.orders_clean
Silver Data Example
| order_id | customer_id | amount | status | order_date |
|---|---|---|---|---|
| 1001 | C101 | 250 | NEW | 2026-01-01 |
| 1002 | C102 | 300 | NEW | 2026-01-01 |
| 1003 | C101 | 150 | CANCELLED | 2026-01-02 |
Bronze → Silver Transformation
CREATE OR REPLACE TABLE silver.orders_clean AS SELECT order_id, get_json_object(payload,'$.customer') AS customer_id, CAST(get_json_object(payload,'$.amount') AS DOUBLE) AS amount, get_json_object(payload,'$.status') AS status, DATE(ingestion_time) AS order_date FROM bronze.orders_raw;
3. Gold Layer (Business Aggregates)
Gold layer contains analytics-ready data for reporting and dashboards.
Example Gold Table
gold.daily_sales
Gold Data Example
| order_date | customer_id | total_orders | total_sales |
|---|---|---|---|
| 2026-01-01 | C101 | 1 | 250 |
| 2026-01-01 | C102 | 1 | 300 |
| 2026-01-02 | C101 | 1 | 150 |
Silver → Gold Aggregation
CREATE OR REPLACE TABLE gold.daily_sales AS SELECT order_date, customer_id, COUNT(order_id) AS total_orders, SUM(amount) AS total_sales FROM silver.orders_clean GROUP BY order_date, customer_id;
4. Final Data Flow
Raw Source Data
│
▼
Bronze Table
orders_raw
(JSON payload)
│
▼
Silver Table
orders_clean
(structured columns)
│
▼
Gold Table
daily_sales
(aggregated analytics)
5. How This Appears in Unity Catalog
Catalog: sales_data
Schemas
│
├── bronze
│ └ orders_raw
│
├── silver
│ └ orders_clean
│
└── gold
└ daily_sales
Full table names:
sales_data.bronze.orders_raw sales_data.silver.orders_clean sales_data.gold.daily_sales
6. How Unity Catalog Controls Access
Unity Catalog provides centralized Role-Based Access Control (RBAC). Permissions can be granted at:
- Catalog level
- Schema level
- Table level
- Column level
This enables fine-grained governance across Bronze, Silver, and Gold layers.
Example Enterprise Role Strategy
| Role | Bronze Access | Silver Access | Gold Access |
|---|---|---|---|
| Data Engineer | Full Access | Full Access | Full Access |
| Data Analyst | No Access | No Access | Read Only |
| Data Scientist | Read Access | Read Access | Read Access |
| BI Team | No Access | No Access | Read Access |
7. Example Unity Catalog Grants
Data Engineer Access
GRANT ALL PRIVILEGES ON CATALOG sales_data TO `data_engineers`;
Data Analyst Access (Gold Only)
GRANT USAGE ON CATALOG sales_data TO `data_analysts`; GRANT USAGE ON SCHEMA sales_data.gold TO `data_analysts`; GRANT SELECT ON ALL TABLES IN SCHEMA sales_data.gold TO `data_analysts`;This ensures analysts can only access Gold layer tables.
6. Real Enterprise Pattern
Large organizations structure data by domain.
catalog: ecommerce
schemas
│
├ bronze
│ ├ orders_raw
│ ├ payments_raw
│ └ customers_raw
│
├ silver
│ ├ orders_clean
│ ├ payments_clean
│ └ customers_clean
│
└ gold
├ daily_sales
├ customer_lifetime_value
└ product_revenue
7. Storage Layout in AWS
When running on Amazon Web Services, tables are stored in Amazon S3 using Delta Lake format.
s3://company-datalake/ bronze/ orders_raw/ silver/ orders_clean/ gold/ daily_sales/
Each table uses Delta Lake which provides:
- ACID transactions
- Time travel
- Schema enforcement
- High performance analytics
- Efficient storage optimization
8. How Big Companies Use This
Companies use Medallion architecture for:
- Fraud detection
- Trading analytics
- Customer 360
- AI feature stores
Typical pipeline size:
- 50TB+ data daily
- 1000+ tables
- 100+ pipelines
Conclusion
Medallion Architecture provides a scalable, governed, and high-performance data design pattern. It is widely used in modern lakehouse platforms such as Databricks.
No comments:
Post a Comment