Friday, 13 March 2026

Databricks Architecture Matrix (Serverless on AWS) with DR Best Practices

This document maps where each major Databricks component resides when running Databricks Serverless on AWS, and gives the recommended Disaster Recovery (DR) strategy for each component.

Control Plane Components

| Component | Purpose | Where It Runs | Plane | DR Best Practice |
|---|---|---|---|---|
| Workspace UI | User interface for notebooks and jobs | Databricks SaaS | Control Plane | Create a secondary workspace in another region |
| Workspace APIs | Automation APIs | Databricks SaaS | Control Plane | Automate infrastructure using Terraform |
| Users & Groups | User identity management | Databricks account services | Control Plane | Use a centralized IdP such as Okta or Azure AD |
| Authentication / SSO | Login via external identity provider | Databricks account services | Control Plane | Configure SSO redundancy at the IdP level |
| Permissions / RBAC | Access control policies | Databricks control services | Control Plane | Store policies as code using Terraform |
| Notebook Source Code | Notebook scripts | Workspace storage | Control Plane | Sync notebooks with Git repositories |
| Notebook Outputs | Charts and query results | Workspace storage | Control Plane | Do not rely on outputs; regenerate from data |
| Workspace Files | Files uploaded to the workspace | Workspace storage | Control Plane | Store important files in S3 or Git |
| Repos (Git Integration) | Git source control integration | Workspace metadata | Control Plane | Maintain source code in GitHub/GitLab |
| Job Scheduler | Schedules workflows | Databricks orchestration service | Control Plane | Define jobs using Infrastructure-as-Code |
| Workflows | Pipeline orchestration | Databricks orchestration service | Control Plane | Export workflows via API and Terraform |
| SQL Query Planner | SQL optimization engine | Databricks query services | Control Plane | No DR needed (managed by Databricks) |
| SQL Warehouse Management | Serverless SQL management | Databricks control services | Control Plane | Recreate warehouses in the secondary region |
| Unity Catalog | Central governance system | Databricks governance service | Control Plane | Replicate catalog configuration using scripts |
| Metastore | Metadata storage | Databricks metadata services | Control Plane | Export metadata periodically |
| Data Lineage | Tracks data relationships | Databricks governance services | Control Plane | Export lineage metadata via APIs |
| Audit Logs | Security logs | Databricks governance services | Control Plane | Send logs to centralized SIEM storage |
| Cluster Management | Compute lifecycle management | Databricks control services | Control Plane | Recreate clusters via automation |
| Feature Store Metadata | Feature definitions | Databricks metadata services | Control Plane | Back up definitions in Git |
| Model Registry Metadata | ML model tracking | Databricks metadata services | Control Plane | Replicate registry configuration |
| Lakehouse Monitoring Metadata | Dataset monitoring metrics | Databricks monitoring services | Control Plane | Export monitoring metrics |
| Vector Search Metadata | Vector index configuration | Databricks control services | Control Plane | Recreate vector indexes from embeddings |
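Several of the control-plane practices above ("Sync notebooks with Git repositories", "Export metadata periodically") amount to pulling content out through the REST API on a schedule. Below is a minimal sketch, assuming a workspace host and personal access token in environment variables and the Workspace API 2.0 list/export endpoints; it is an illustration, not a complete backup tool.

```python
import json
import os
import urllib.parse
import urllib.request


def notebook_paths(objects):
    """Filter a Workspace API list response down to notebook paths."""
    return [o["path"] for o in objects if o.get("object_type") == "NOTEBOOK"]


def export_notebook(host, token, path):
    """Fetch one notebook's source (base64-encoded) via /api/2.0/workspace/export."""
    query = urllib.parse.urlencode({"path": path, "format": "SOURCE"})
    req = urllib.request.Request(
        f"{host}/api/2.0/workspace/export?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]


if __name__ == "__main__" and "DATABRICKS_HOST" in os.environ:
    host, token = os.environ["DATABRICKS_HOST"], os.environ["DATABRICKS_TOKEN"]
    # List the workspace root, then export each notebook for offline backup.
    query = urllib.parse.urlencode({"path": "/"})
    req = urllib.request.Request(
        f"{host}/api/2.0/workspace/list?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        listing = json.load(resp).get("objects", [])
    for path in notebook_paths(listing):
        print(path, len(export_notebook(host, token, path)), "bytes (base64)")
```

Committing these exports to Git, per the table, keeps notebook source recoverable even if the workspace itself is lost.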

Data Plane Components (Customer AWS)

| Component | Purpose | Where It Runs | Plane | DR Best Practice |
|---|---|---|---|---|
| Serverless Spark Compute | Executes jobs | AWS compute | Data Plane | Deploy in a multi-region workspace |
| SQL Warehouse Compute | SQL query execution | AWS compute | Data Plane | Provision warehouses in the secondary region |
| Delta Table Data | Table storage | S3 | Data Plane | Enable S3 cross-region replication |
| Managed Tables | Managed table storage | S3 | Data Plane | Use versioned S3 buckets |
| External Tables | External dataset storage | S3 | Data Plane | Replicate the underlying S3 storage |
| DBFS Root | Databricks filesystem | S3 | Data Plane | Enable bucket replication |
| Unity Catalog Managed Storage | Catalog table storage | S3 | Data Plane | Cross-region replication |
| Unity Catalog Volumes | Governed file storage | S3 | Data Plane | Replicate S3 buckets |
| MLflow Model Artifacts | ML models | S3 | Data Plane | Replicate the artifact bucket |
| Feature Store Data | ML feature datasets | S3 | Data Plane | S3 replication and versioning |
| Vector Search Index Data | Embedding storage | S3 | Data Plane | Rebuild indexes from replicated embeddings |
| Streaming Checkpoints | Streaming progress tracking | S3 | Data Plane | Replicate checkpoint directories |
| Temporary Spark Shuffle Data | Intermediate processing | Compute disk | Data Plane | No DR required (recomputed) |
| Job Execution Logs | Spark logs | S3 | Data Plane | Send logs to a centralized logging system |
| ML Training Data | Training datasets | S3 | Data Plane | Multi-region S3 replication |
| Delta Transaction Logs | Table version metadata | S3 | Data Plane | Protect using S3 versioning |
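Most of the S3-backed rows above reduce to one AWS control: cross-region replication on versioned buckets. A minimal boto3 sketch follows, assuming an existing IAM replication role and a versioned destination bucket; every ARN and bucket name is hypothetical.

```python
import os


def replication_config(role_arn, dest_bucket_arn):
    """Build an S3 replication configuration that copies every new object
    version to the destination bucket (both buckets need versioning enabled)."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "dr-replication",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


if __name__ == "__main__" and os.environ.get("APPLY_DR_REPLICATION"):
    import boto3  # imported lazily so the sketch loads without the AWS SDK

    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket="primary-delta-bucket",  # hypothetical source bucket
        ReplicationConfiguration=replication_config(
            "arn:aws:iam::123456789012:role/dr-replication-role",  # hypothetical
            "arn:aws:s3:::secondary-delta-bucket",  # hypothetical
        ),
    )
```

One such rule on each bucket in the table (Delta data, DBFS root, checkpoints, artifacts) covers the "enable replication" rows; versioning additionally protects Delta transaction logs against accidental overwrite.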

Architecture Flow

Databricks Control Plane (Managed by Databricks)
    Workspace UI | Authentication | Unity Catalog Metastore Metadata
    Query Planner | Job Scheduler
            |
            | Secure API
            v
AWS Data Plane (Customer Account)
    Serverless Spark Compute | SQL Warehouses | Delta Tables
    ML Models | Spark Temp Storage
            |
            v
Amazon S3 Data Lake
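Since the flow above terminates in the S3 data lake, failover readiness can be spot-checked by reading the replication status that S3 stamps on each source object. A minimal sketch; the bucket and key names are hypothetical.

```python
import os


def is_replicated(head_object_response):
    """True once S3 reports the object copied to the destination bucket.
    head_object on a source object includes ReplicationStatus
    (PENDING, COMPLETED, or FAILED) when a replication rule applies."""
    return head_object_response.get("ReplicationStatus") == "COMPLETED"


if __name__ == "__main__" and os.environ.get("CHECK_REPLICATION"):
    import boto3  # imported lazily so the sketch loads without the AWS SDK

    s3 = boto3.client("s3")
    resp = s3.head_object(
        Bucket="primary-delta-bucket",  # hypothetical
        Key="tables/events/_delta_log/00000000000000000000.json",  # hypothetical
    )
    print("replicated:", is_replicated(resp))
```

Sampling a few recent Delta transaction log objects this way gives a cheap signal that the secondary region actually holds a consistent copy before declaring a failover target healthy.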
