Tuesday, 9 December 2025

Databricks on AWS - Technical Guide

1. What is Databricks?

Databricks is a unified data and AI platform built on Apache Spark. It is designed for:

  • Big Data Processing (ETL, analytics)
  • Machine Learning (ML, AI)
  • Data Science and Analytics Exploration
  • Data Lakehouse architecture combining data lakes and warehouses
---

2. Databricks Architecture on AWS

Databricks separates the control plane (managed by Databricks) from the data plane (which runs in your AWS account).

Control Plane

  • Hosted by Databricks in Databricks-managed AWS accounts
  • Manages authentication, notebooks, job scheduling, and cluster orchestration
  • Serverless from the user's perspective: no EC2 instances to manage

Data Plane

  • Runs in your AWS account
  • Compute: EC2 instances in your VPC
  • Storage: S3 (Delta Lake), EBS (temporary storage for nodes)
  • Networking: Private subnets, VPC Peering, PrivateLink
[Image: Databricks AWS Architecture]
---

3. Key AWS Integrations

AWS Service | Use Case in Databricks
S3 | Delta Lake storage, raw data, ML datasets
EC2 | Compute for clusters
EBS | Temporary cluster storage (shuffle, logs)
Kinesis / Kafka | Real-time streaming ingestion
Redshift / RDS / Aurora | Data warehousing and relational DB queries
Glue Data Catalog | Metadata management for Delta tables
KMS | Encrypt S3 buckets and cluster storage
IAM | Secure access to S3, KMS, and other services
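As a minimal illustration of the S3 and Glue integrations working together, the sketch below reads raw JSON from S3 and saves it as a Delta table so its metadata lands in the Glue Data Catalog. It assumes the workspace metastore is backed by Glue; the bucket, database, and table names are placeholders.

# Read raw JSON from S3 (bucket name is a placeholder)
raw_df = spark.read.json("s3a://my-bucket/raw_events/")

# Save as a Delta table; with Glue as the workspace metastore, the table's
# metadata is registered in the Glue Data Catalog
# (the "analytics" database is assumed to exist)
raw_df.write.format("delta").mode("overwrite").saveAsTable("analytics.raw_events")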
---

4. Cluster Types & Runtimes

Cluster Type | Use Case | Notes
Standard | ETL, analytics | Optional auto-scaling
High Concurrency | Multiple SQL users | Optimized for concurrent queries
Job Cluster | Batch job execution | Terminates after job completes
ML / GPU Cluster | Machine Learning / Deep Learning | GPU-enabled EC2 instances

Databricks Runtimes

  • Standard Runtime: Spark workloads
  • ML Runtime: MLlib, TensorFlow, PyTorch, scikit-learn
  • GPU Runtime: Deep learning / model training
  • Photon Engine: Optimized Delta queries
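To make the cluster and runtime options above concrete, here is a minimal sketch that creates an autoscaling Standard cluster through the Databricks Clusters REST API (POST /api/2.0/clusters/create). The workspace URL, token, runtime version, and node type are placeholder assumptions; valid values come from your own workspace.

import requests

# Placeholder workspace URL and personal access token
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Minimal cluster spec: an autoscaling Standard cluster on a Databricks runtime.
# spark_version and node_type_id are illustrative; list valid values with the
# /api/2.0/clusters/spark-versions and /api/2.0/clusters/list-node-types endpoints.
cluster_spec = {
    "cluster_name": "etl-standard-cluster",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())  # returns the new cluster_id on success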
---

5. Storage & Delta Lake

Delta Lake provides:

  • ACID Transactions
  • Time Travel Queries
  • Schema Enforcement
  • Batch + Streaming support
# Mount an S3 bucket to DBFS
# (keys are shown inline for illustration; prefer an instance profile or Databricks secrets)
dbutils.fs.mount(
  source="s3a://my-bucket",
  mount_point="/mnt/mydata",
  extra_configs={"fs.s3a.awsAccessKeyId": "YOUR_KEY", "fs.s3a.awsSecretAccessKey": "YOUR_SECRET"}
)
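The Delta features listed above can be tried directly from a notebook. Below is a minimal sketch, assuming a throwaway table path under the mount created above.

from delta.tables import DeltaTable

path = "/mnt/mydata/delta/demo_events"  # placeholder path

# Version 0: initial write (ACID transaction)
spark.range(100).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("overwrite").save(path)

# Version 1: append more rows
spark.range(100, 200).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("append").save(path)

# Time travel: read the table as it looked at version 0
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 100

# Inspect the commit history that backs time travel
DeltaTable.forPath(spark, path).history().show()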
---

6. Data Ingestion & Streaming

# Read a Kafka topic as a streaming DataFrame
df = spark.readStream.format("kafka") \
  .option("kafka.bootstrap.servers", "broker:9092") \
  .option("subscribe", "topic_name") \
  .load()
  • Supports Kinesis Data Streams for AWS-native ingestion
  • Checkpointing in S3 or DBFS to track streaming state
  • Watermarking for late-arriving data (see the sketch below)
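A minimal sketch of checkpointing and watermarking applied to the Kafka stream above (the payload schema, column names, and paths are illustrative assumptions): the watermark bounds how long late events are kept for windowed aggregation, and the checkpoint directory records the stream's progress so it can resume after a restart.

from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Illustrative schema for the JSON payload carried in the Kafka 'value' column
schema = StructType([
    StructField("userId", StringType()),
    StructField("eventType", StringType()),
    StructField("eventTime", TimestampType()),
])

events = (df.select(from_json(col("value").cast("string"), schema).alias("e"))
            .select("e.*"))

# Keep events arriving up to 10 minutes late, count per 5-minute window
counts = (events
          .withWatermark("eventTime", "10 minutes")
          .groupBy(window("eventTime", "5 minutes"), "eventType")
          .count())

# The checkpoint location (placeholder path) lets the query resume where it left off
(counts.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/mnt/delta/checkpoints/event_counts")
       .start("/mnt/delta/event_counts"))
---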

7. ML / AI Workloads

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
import mlflow
import mlflow.spark

# Load data
data = spark.read.format("delta").load("/mnt/delta/customer_data")

# Prepare features
assembler = VectorAssembler(inputCols=["age","account_age","num_transactions"], outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="churn")
pipeline = Pipeline(stages=[assembler, rf])

# Train model
model = pipeline.fit(data)

# Log the trained model to MLflow (the context manager closes the run even on failure)
with mlflow.start_run():
    mlflow.spark.log_model(model, "churn_model")
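Once logged, the model can be loaded back for batch scoring. A short sketch follows; the run ID is a placeholder you would take from the MLflow UI or the run object.

# Load the logged model by its run URI and score a DataFrame
loaded = mlflow.spark.load_model("runs:/<run_id>/churn_model")
predictions = loaded.transform(data)
predictions.select("churn", "prediction").show(5)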
---

8. Security

  • IAM Roles & Policies: attach an instance profile (IAM role) to clusters for S3 and Kinesis access
  • KMS: Encrypt S3 buckets and cluster storage (see the sketch below)
  • Private Networking: VPC Peering, PrivateLink
  • Audit Logging: CloudTrail, S3 access logs
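As one concrete pattern for the IAM and KMS points above, the sketch below reads credentials from a Databricks secret scope rather than hardcoding them and mounts a bucket whose default encryption is a KMS customer-managed key. The scope, key names, and bucket are placeholder assumptions; an instance profile attached to the cluster avoids access keys entirely.

# Pull AWS credentials from a Databricks secret scope instead of hardcoding them
# (the scope and key names are placeholders created beforehand with the Secrets API/CLI)
access_key = dbutils.secrets.get(scope="aws", key="access-key-id")
secret_key = dbutils.secrets.get(scope="aws", key="secret-access-key")

# Mount a bucket whose default encryption is an SSE-KMS customer-managed key;
# objects written through the mount are then encrypted with that key by S3
dbutils.fs.mount(
    source="s3a://my-encrypted-bucket",
    mount_point="/mnt/secure",
    extra_configs={
        "fs.s3a.awsAccessKeyId": access_key,
        "fs.s3a.awsSecretAccessKey": secret_key,
    },
)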
---

9. Autoscaling & Performance

  • Auto-scaling: min/max worker nodes per cluster
  • Photon Engine: faster Delta table queries
  • Caching: df.cache() for repeated computations
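A small illustration of the caching point (the table path is a placeholder): the first action materializes the cached data in cluster memory, and later actions reuse it instead of re-reading from S3.

# Cache a frequently reused DataFrame in cluster memory
events = spark.read.format("delta").load("/mnt/delta/clean_logs").cache()

events.count()                               # first action populates the cache
events.groupBy("eventType").count().show()   # reuses cached data, no S3 re-read

events.unpersist()                           # release memory when done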
---

10. Sample Data Flow (AWS + Databricks)

Scenario: Real-time e-commerce analytics

  • Ingest customer clickstreams → Kinesis Data Stream
  • Databricks Streaming Cluster processes & cleans data → Delta Lake (S3)
  • Analyze via Databricks SQL / Redshift Spectrum dashboards
  • Train recommendation ML models on Delta Lake
  • Secure via IAM roles, KMS encryption, private VPC
[Image: AWS Databricks Data Flow]
---

11. Quick Hands-On Use Cases in Databricks on AWS

ETL Pipeline

# Read raw JSON logs from S3, keep the relevant columns, drop ignorable events
df = spark.read.json("s3://my-bucket/raw_logs/")
df_clean = df.select("userId", "eventType", "timestamp").filter(df.eventType != "ignore")

# Persist the cleaned data as a Delta table
df_clean.write.format("delta").mode("overwrite").save("/mnt/delta/clean_logs")

Streaming Analytics

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Read from a Kinesis stream; the connector delivers each record's payload as binary in the 'data' column
df_stream = spark.readStream.format("kinesis") \
  .option("streamName", "clickstream") \
  .option("region", "us-east-1").load()

# Parse the JSON payload, then drop ignorable events
schema = StructType([StructField("userId", StringType()), StructField("eventType", StringType())])
events = df_stream.select(from_json(col("data").cast("string"), schema).alias("e")).select("e.*")
df_clean = events.filter(events.eventType != "ignore")

# Write continuously to Delta; the checkpoint lets the stream recover after restarts
df_clean.writeStream.format("delta").option("checkpointLocation", "/mnt/delta/checkpoint").start("/mnt/delta/streaming_data")

ML Model Training

# Reuses the imports and the `data` DataFrame from the ML / AI Workloads section above
assembler = VectorAssembler(inputCols=["age", "account_age"], outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="churn")
pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(data)
mlflow.spark.log_model(model, "churn_model")
---
