Databricks Architecture Matrix (Serverless on AWS) with DR Best Practices
This document explains where major Databricks components reside when running Databricks Serverless on AWS and the recommended Disaster Recovery (DR) strategy for each component.
Control Plane Components
| Component | Purpose | Where It Runs | Plane | DR Best Practice |
|---|---|---|---|---|
| Workspace UI | User interface for notebooks and jobs | Databricks SaaS | Control Plane | Create secondary workspace in another region |
| Workspace APIs | Automation APIs | Databricks SaaS | Control Plane | Automate infrastructure using Terraform |
| Users & Groups | User identity management | Databricks account services | Control Plane | Use centralized IdP like Okta/Azure AD |
| Authentication / SSO | Login via external identity provider | Databricks account services | Control Plane | Configure SSO redundancy at IdP level |
| Permissions / RBAC | Access control policies | Databricks control services | Control Plane | Store policies as code using Terraform |
| Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Sync notebooks with Git repositories |
| Notebook Outputs | Charts and query results | Workspace storage | Control Plane | Do not rely on outputs; regenerate from data |
| Workspace Files | Files uploaded to workspace | Workspace storage | Control Plane | Store important files in S3 or Git |
| Repos (Git Integration) | Git source control integration | Workspace metadata | Control Plane | Maintain source code in GitHub/GitLab |
| Job Scheduler | Schedules workflows | Databricks orchestration service | Control Plane | Define jobs using Infrastructure-as-Code |
| Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane | Export workflows via API and Terraform |
| SQL Query Planner | SQL optimization engine | Databricks query services | Control Plane | No DR needed (managed by Databricks) |
| SQL Warehouse Management | Serverless SQL management | Databricks control services | Control Plane | Recreate warehouses in secondary region |
| Unity Catalog | Central governance system | Databricks governance service | Control Plane | Replicate catalog configuration using scripts |
| Metastore | Metadata storage | Databricks metadata services | Control Plane | Export metadata periodically |
| Data Lineage | Tracks data relationships | Databricks governance services | Control Plane | Export lineage metadata via APIs |
| Audit Logs | Security logs | Databricks governance services | Control Plane | Send logs to centralized SIEM storage |
| Cluster Management | Compute lifecycle management | Databricks control services | Control Plane | Recreate clusters via automation |
| Feature Store Metadata | Feature definitions | Databricks metadata services | Control Plane | Back up definitions in Git |
| Model Registry Metadata | ML model tracking | Databricks metadata services | Control Plane | Replicate registry configuration |
| Lakehouse Monitoring Metadata | Dataset monitoring metrics | Databricks monitoring services | Control Plane | Export monitoring metrics |
| Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane | Recreate vector indexes from embeddings |
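Several of the control-plane DR practices above boil down to "export the definition so it can be recreated elsewhere." The sketch below shows one way to do that for job definitions via the Databricks REST API (Jobs API 2.1), writing the result to a local JSON file that can be committed to Git. The host and token come from hypothetical `DATABRICKS_HOST`/`DATABRICKS_TOKEN` environment variables; adapt to your secret management.

```python
# Sketch: periodic DR export of Databricks job definitions via the REST API.
# DATABRICKS_HOST / DATABRICKS_TOKEN are assumed environment variables.
import json
import os
import urllib.request


def build_export_request(host: str, token: str, endpoint: str) -> urllib.request.Request:
    """Build an authenticated GET request against the Databricks REST API."""
    return urllib.request.Request(
        url=f"{host}{endpoint}",
        headers={"Authorization": f"Bearer {token}"},
    )


def save_backup(payload: dict, path: str) -> str:
    """Write an exported definition to a local file (commit this to Git)."""
    with open(path, "w") as f:
        json.dump(payload, f, indent=2, sort_keys=True)
    return path


# Only runs when credentials are configured; otherwise the sketch is inert.
if os.environ.get("DATABRICKS_HOST") and os.environ.get("DATABRICKS_TOKEN"):
    req = build_export_request(
        os.environ["DATABRICKS_HOST"],   # e.g. https://<workspace>.cloud.databricks.com
        os.environ["DATABRICKS_TOKEN"],
        "/api/2.1/jobs/list",
    )
    with urllib.request.urlopen(req) as resp:
        save_backup(json.load(resp), "jobs_backup.json")
```

The same pattern works for notebooks (`/api/2.0/workspace/export`) and other exportable definitions; scheduling this as a job in the secondary region keeps backups current.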
Data Plane Components (AWS)
Note: with Serverless, the compute below runs in Databricks-managed AWS accounts (the serverless compute plane), while the data itself remains in S3 buckets in the customer's AWS account.
| Component | Purpose | Where It Runs | Plane | DR Best Practice |
|---|---|---|---|---|
| Serverless Spark Compute | Executes jobs | Databricks-managed AWS (serverless compute plane) | Data Plane | Deploy in multi-region workspace |
| SQL Warehouse Compute | SQL query execution | Databricks-managed AWS (serverless compute plane) | Data Plane | Provision warehouses in secondary region |
| Delta Table Data | Table storage | S3 | Data Plane | Enable S3 cross-region replication |
| Managed Tables | Managed table storage | S3 | Data Plane | Use versioned S3 buckets |
| External Tables | External dataset storage | S3 | Data Plane | Replicate underlying S3 storage |
| DBFS Root | Databricks filesystem | S3 | Data Plane | Enable bucket replication |
| Unity Catalog Managed Storage | Catalog table storage | S3 | Data Plane | Cross-region replication |
| Unity Catalog Volumes | Governed file storage | S3 | Data Plane | Replicate S3 buckets |
| MLflow Model Artifacts | ML models | S3 | Data Plane | Replicate artifact bucket |
| Feature Store Data | ML feature datasets | S3 | Data Plane | S3 replication and versioning |
| Vector Search Index Data | Embedding storage | S3 | Data Plane | Rebuild indexes from replicated embeddings |
| Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint directories |
| Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required (recomputed) |
| Job Execution Logs | Spark logs | S3 | Data Plane | Send logs to centralized logging system |
| ML Training Data | Training datasets | S3 | Data Plane | Multi-region S3 replication |
| Delta Transaction Logs | Table version metadata | S3 | Data Plane | Protect using S3 versioning |
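Most of the S3-backed rows above share one prerequisite and one mechanism: versioning enabled on both buckets, plus a cross-region replication rule. The sketch below builds the `ReplicationConfiguration` payload that `put-bucket-replication` expects; the bucket names and IAM role ARN are hypothetical placeholders.

```python
# Sketch: S3 cross-region replication config for a Delta/DBFS/artifact bucket.
# Prerequisite: versioning must already be enabled on source and destination.
import json


def replication_config(dest_bucket_arn: str, role_arn: str) -> dict:
    """Build the ReplicationConfiguration payload for put-bucket-replication."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "dr-replication",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


def apply_with_boto3(src_bucket: str, cfg: dict) -> None:
    """Apply the configuration (requires AWS credentials and boto3)."""
    import boto3  # imported lazily so the sketch runs without AWS installed

    boto3.client("s3").put_bucket_replication(
        Bucket=src_bucket, ReplicationConfiguration=cfg
    )


if __name__ == "__main__":
    cfg = replication_config(
        "arn:aws:s3:::my-delta-bucket-dr",  # hypothetical DR bucket
        "arn:aws:iam::123456789012:role/s3-replication-role",  # hypothetical role
    )
    print(json.dumps(cfg, indent=2))
```

The same configuration can equally be expressed in Terraform (`aws_s3_bucket_replication_configuration`), which fits the Infrastructure-as-Code theme of the control-plane table.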
Architecture Flow
```
Databricks Control Plane (managed by Databricks)
  Workspace UI · Authentication · Unity Catalog
  Metastore Metadata · Query Planner · Job Scheduler
          |
          |  Secure API
          v
AWS Data Plane (serverless compute + customer storage)
  Serverless Spark Compute · SQL Warehouses
  Delta Tables · ML Models · Spark Temp Storage
          |
          v
  Amazon S3 Data Lake (customer AWS account)
```
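Beyond bucket-level replication, Delta tables can also be copied at the table level with `DEEP CLONE`, which captures data and transaction log together. The sketch below builds such a statement; the catalog/schema/table names are hypothetical, and the statement would be run via `spark.sql(...)` from a job in the secondary workspace.

```python
# Sketch: table-level DR for Delta tables using DEEP CLONE (hypothetical names).
def deep_clone_stmt(src_table: str, dr_table: str) -> str:
    """Build a CREATE OR REPLACE ... DEEP CLONE statement for a DR copy."""
    return f"CREATE OR REPLACE TABLE {dr_table} DEEP CLONE {src_table}"


# In a Databricks notebook or job (not runnable locally):
# spark.sql(deep_clone_stmt("prod.sales.orders", "dr.sales.orders"))
```

Re-running the clone is incremental, so scheduling it as a periodic job keeps the DR copy close to the primary without full rewrites.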