Databricks Architecture Matrix (Serverless on AWS) with DR Best Practices
This document explains where major Databricks components reside when running Databricks Serverless on AWS and the recommended Disaster Recovery (DR) strategy for each component.
Control Plane Components
| Component | Purpose | Where It Runs | Plane | DR Best Practice |
|---|---|---|---|---|
| Workspace UI | User interface for notebooks and jobs | Databricks SaaS | Control Plane | Create secondary workspace in another region |
| Workspace APIs | Automation APIs | Databricks SaaS | Control Plane | Automate infrastructure using Terraform |
| Users & Groups | User identity management | Databricks account services | Control Plane | Use centralized IdP like Okta/Azure AD |
| Authentication / SSO | Login via external identity provider | Databricks account services | Control Plane | Configure SSO redundancy at IdP level |
| Permissions / RBAC | Access control policies | Databricks control services | Control Plane | Store policies as code using Terraform |
| Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Sync notebooks with Git repositories |
| Notebook Outputs | Charts and query results | Workspace storage | Control Plane | Do not rely on outputs; regenerate from data |
| Workspace Files | Files uploaded to workspace | Workspace storage | Control Plane | Store important files in S3 or Git |
| Repos (Git Integration) | Git source control integration | Workspace metadata | Control Plane | Maintain source code in GitHub/GitLab |
| Job Scheduler | Schedules workflows | Databricks orchestration service | Control Plane | Define jobs using Infrastructure-as-Code |
| Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane | Export workflows via API and Terraform |
| SQL Query Planner | SQL optimization engine | Databricks query services | Control Plane | No DR needed (managed by Databricks) |
| SQL Warehouse Management | Serverless SQL management | Databricks control services | Control Plane | Recreate warehouses in secondary region |
| Unity Catalog | Central governance system | Databricks governance service | Control Plane | Replicate catalog configuration using scripts |
| Metastore | Metadata storage | Databricks metadata services | Control Plane | Export metadata periodically |
| Data Lineage | Tracks data relationships | Databricks governance services | Control Plane | Export lineage metadata via APIs |
| Audit Logs | Security logs | Databricks governance services | Control Plane | Send logs to centralized SIEM storage |
| Cluster Management | Compute lifecycle management | Databricks control services | Control Plane | Recreate clusters via automation |
| Feature Store Metadata | Feature definitions | Databricks metadata services | Control Plane | Back up definitions in Git |
| Model Registry Metadata | ML model tracking | Databricks metadata services | Control Plane | Replicate registry configuration |
| Lakehouse Monitoring Metadata | Dataset monitoring metrics | Databricks monitoring services | Control Plane | Export monitoring metrics |
| Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane | Recreate vector indexes from embeddings |
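Several of the control-plane DR practices above boil down to "export the definition so it can be recreated elsewhere." The sketch below shows one way to do that for job definitions via the Databricks REST API (Jobs API 2.1), writing the result to a local JSON file that can be committed to Git. The host and token come from hypothetical `DATABRICKS_HOST`/`DATABRICKS_TOKEN` environment variables; adapt to your secret management.

```python
# Sketch: periodic DR export of Databricks job definitions via the REST API.
# DATABRICKS_HOST / DATABRICKS_TOKEN are assumed environment variables.
import json
import os
import urllib.request


def build_export_request(host: str, token: str, endpoint: str) -> urllib.request.Request:
    """Build an authenticated GET request against the Databricks REST API."""
    return urllib.request.Request(
        url=f"{host}{endpoint}",
        headers={"Authorization": f"Bearer {token}"},
    )


def save_backup(payload: dict, path: str) -> str:
    """Write an exported definition to a local file (commit this to Git)."""
    with open(path, "w") as f:
        json.dump(payload, f, indent=2, sort_keys=True)
    return path


# Only runs when credentials are configured; otherwise the sketch is inert.
if os.environ.get("DATABRICKS_HOST") and os.environ.get("DATABRICKS_TOKEN"):
    req = build_export_request(
        os.environ["DATABRICKS_HOST"],   # e.g. https://<workspace>.cloud.databricks.com
        os.environ["DATABRICKS_TOKEN"],
        "/api/2.1/jobs/list",
    )
    with urllib.request.urlopen(req) as resp:
        save_backup(json.load(resp), "jobs_backup.json")
```

The same pattern works for notebooks (`/api/2.0/workspace/export`) and other exportable definitions; scheduling this as a job in the secondary region keeps backups current.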
Data Plane Components (AWS)
Note: with Serverless, the compute below runs in Databricks-managed AWS accounts (the serverless compute plane), while the data itself remains in S3 buckets in the customer's AWS account.
| Component | Purpose | Where It Runs | Plane | DR Best Practice |
|---|---|---|---|---|
| Serverless Spark Compute | Executes jobs | Databricks-managed AWS (serverless compute plane) | Data Plane | Deploy in multi-region workspace |
| SQL Warehouse Compute | SQL query execution | Databricks-managed AWS (serverless compute plane) | Data Plane | Provision warehouses in secondary region |
| Delta Table Data | Table storage | S3 | Data Plane | Enable S3 cross-region replication |
| Managed Tables | Managed table storage | S3 | Data Plane | Use versioned S3 buckets |
| External Tables | External dataset storage | S3 | Data Plane | Replicate underlying S3 storage |
| DBFS Root | Databricks filesystem | S3 | Data Plane | Enable bucket replication |
| Unity Catalog Managed Storage | Catalog table storage | S3 | Data Plane | Cross-region replication |
| Unity Catalog Volumes | Governed file storage | S3 | Data Plane | Replicate S3 buckets |
| MLflow Model Artifacts | ML models | S3 | Data Plane | Replicate artifact bucket |
| Feature Store Data | ML feature datasets | S3 | Data Plane | S3 replication and versioning |
| Vector Search Index Data | Embedding storage | S3 | Data Plane | Rebuild indexes from replicated embeddings |
| Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint directories |
| Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required (recomputed) |
| Job Execution Logs | Spark logs | S3 | Data Plane | Send logs to centralized logging system |
| ML Training Data | Training datasets | S3 | Data Plane | Multi-region S3 replication |
| Delta Transaction Logs | Table version metadata | S3 | Data Plane | Protect using S3 versioning |
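Most of the S3-backed rows above share one prerequisite and one mechanism: versioning enabled on both buckets, plus a cross-region replication rule. The sketch below builds the `ReplicationConfiguration` payload that `put-bucket-replication` expects; the bucket names and IAM role ARN are hypothetical placeholders.

```python
# Sketch: S3 cross-region replication config for a Delta/DBFS/artifact bucket.
# Prerequisite: versioning must already be enabled on source and destination.
import json


def replication_config(dest_bucket_arn: str, role_arn: str) -> dict:
    """Build the ReplicationConfiguration payload for put-bucket-replication."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "dr-replication",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


def apply_with_boto3(src_bucket: str, cfg: dict) -> None:
    """Apply the configuration (requires AWS credentials and boto3)."""
    import boto3  # imported lazily so the sketch runs without AWS installed

    boto3.client("s3").put_bucket_replication(
        Bucket=src_bucket, ReplicationConfiguration=cfg
    )


if __name__ == "__main__":
    cfg = replication_config(
        "arn:aws:s3:::my-delta-bucket-dr",  # hypothetical DR bucket
        "arn:aws:iam::123456789012:role/s3-replication-role",  # hypothetical role
    )
    print(json.dumps(cfg, indent=2))
```

The same configuration can equally be expressed in Terraform (`aws_s3_bucket_replication_configuration`), which fits the Infrastructure-as-Code theme of the control-plane table.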
Architecture Flow
```
Databricks Control Plane (managed by Databricks)
  Workspace UI · Authentication · Unity Catalog
  Metastore Metadata · Query Planner · Job Scheduler
          |
          |  Secure API
          v
AWS Data Plane (serverless compute + customer storage)
  Serverless Spark Compute · SQL Warehouses
  Delta Tables · ML Models · Spark Temp Storage
          |
          v
  Amazon S3 Data Lake (customer AWS account)
```
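Beyond bucket-level replication, Delta tables can also be copied at the table level with `DEEP CLONE`, which captures data and transaction log together. The sketch below builds such a statement; the catalog/schema/table names are hypothetical, and the statement would be run via `spark.sql(...)` from a job in the secondary workspace.

```python
# Sketch: table-level DR for Delta tables using DEEP CLONE (hypothetical names).
def deep_clone_stmt(src_table: str, dr_table: str) -> str:
    """Build a CREATE OR REPLACE ... DEEP CLONE statement for a DR copy."""
    return f"CREATE OR REPLACE TABLE {dr_table} DEEP CLONE {src_table}"


# In a Databricks notebook or job (not runnable locally):
# spark.sql(deep_clone_stmt("prod.sales.orders", "dr.sales.orders"))
```

Re-running the clone is incremental, so scheduling it as a periodic job keeps the DR copy close to the primary without full rewrites.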