Databricks on AWS - Technical Guide
1. What is Databricks?
Databricks is a unified data and AI platform built on Apache Spark. It is designed for:
- Big Data Processing (ETL, analytics)
- Machine Learning (ML, AI)
- Data Science and Analytics Exploration
- Data Lakehouse architecture combining data lakes and warehouses
2. Databricks Architecture on AWS
Databricks separates control plane (managed by Databricks) from data plane (your AWS account).
Control Plane
- Managed by Databricks in Databricks-owned AWS accounts
- Manages authentication, notebooks, job scheduling, cluster orchestration
- Serverless from the user's perspective: no control-plane EC2 instances to manage
Data Plane
- Runs in your AWS account
- Compute: EC2 instances in your VPC
- Storage: S3 (Delta Lake), EBS (temporary storage for nodes)
- Networking: Private subnets, VPC Peering, PrivateLink
---
3. Key AWS Integrations
| AWS Service | Use Case in Databricks |
|---|---|
| S3 | Delta Lake storage, raw data, ML datasets |
| EC2 | Compute for clusters |
| EBS | Temporary cluster storage (shuffle, logs) |
| Kinesis / Kafka | Real-time streaming ingestion |
| Redshift / RDS / Aurora | Data warehousing and relational DB queries |
| Glue Data Catalog | Metadata management for Delta tables |
| KMS | Encrypt S3 buckets and cluster storage |
| IAM | Secure access to S3, KMS, and other services |
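For example, once an IAM instance profile is attached to a cluster, S3 can be read and written directly without embedding credentials, and the result stored as Delta. A minimal sketch (bucket and prefix names are placeholders):

# Read raw JSON from S3; credentials come from the cluster's instance profile, not from code
raw = spark.read.json("s3://my-bucket/raw_events/")

# Write the data back to S3 as a Delta table
raw.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/raw_events")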
4. Cluster Types & Runtimes
| Cluster Type | Use Case | Notes |
|---|---|---|
| Standard | ETL, analytics | Optional auto-scaling |
| High Concurrency | Multiple SQL users | Optimized for concurrent queries |
| Job Cluster | Batch job execution | Terminates after the job completes |
| ML / GPU Cluster | Machine Learning / Deep Learning | GPU-enabled EC2 instances |
Databricks Runtimes
- Standard Runtime: Spark workloads
- ML Runtime: MLlib, TensorFlow, PyTorch, scikit-learn
- GPU Runtime: Deep learning / model training
- Photon Engine: Optimized Delta queries
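As a sketch of how a job cluster with autoscaling and the Photon engine might be created programmatically via the Clusters REST API (the workspace URL, token environment variable, runtime version, and instance type below are placeholder assumptions):

import os
import requests

# Placeholder workspace URL; the token is assumed to be a personal access token in an env var
host = "https://<your-workspace>.cloud.databricks.com"
token = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "etl-job-cluster",
    "spark_version": "14.3.x-scala2.12",             # example Databricks Runtime version
    "node_type_id": "i3.xlarge",                     # example EC2 instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "runtime_engine": "PHOTON"
}

resp = requests.post(host + "/api/2.0/clusters/create",
                     headers={"Authorization": "Bearer " + token},
                     json=cluster_spec)
print(resp.json())   # returns the new cluster_id on success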
5. Storage & Delta Lake
Delta Lake provides:
- ACID Transactions
- Time Travel Queries
- Schema Enforcement
- Batch + Streaming support
# Mount an S3 bucket to DBFS (prefer an instance profile or secret scope over hard-coded keys)
dbutils.fs.mount(
    source="s3a://my-bucket",
    mount_point="/mnt/mydata",
    extra_configs={
        "fs.s3a.awsAccessKeyId": "YOUR_KEY",          # placeholder
        "fs.s3a.awsSecretAccessKey": "YOUR_SECRET"    # placeholder
    }
)
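With the data on a Delta path, the time travel feature listed above is straightforward to use; `versionAsOf` and `timestampAsOf` are standard Delta reader options (the path and timestamp below are illustrative):

# Read the current version of a Delta table under the mount
current = spark.read.format("delta").load("/mnt/mydata/events")

# Time travel: read the same table as of an earlier version or timestamp
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/mydata/events")
asof = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/mnt/mydata/events")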
---
6. Data Ingestion & Streaming
# Read a Kafka topic as a streaming DataFrame
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "topic_name")
      .load())
- Supports **Kinesis Data Streams** for AWS-native ingestion
- **Checkpointing** in S3 or DBFS to track streaming state
- **Watermarking** for late-arriving data (see the sketch below)
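Putting these together, the Kafka stream above can be parsed, watermarked, and written to Delta with a checkpoint. The schema, 10-minute lateness threshold, and paths below are illustrative assumptions:

from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, TimestampType

schema = StructType().add("userId", StringType()).add("eventType", StringType()).add("eventTime", TimestampType())

# Kafka delivers the payload in the binary `value` column; parse it into typed columns
events = (df.selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), schema).alias("e"))
            .select("e.*"))

# Tolerate events up to 10 minutes late, then count events per user in 5-minute windows
counts = (events.withWatermark("eventTime", "10 minutes")
                .groupBy(window(col("eventTime"), "5 minutes"), col("userId"))
                .count())

(counts.writeStream.format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/mnt/delta/checkpoints/clickstream")
       .start("/mnt/delta/clickstream_counts"))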
---
7. ML / AI Workloads
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
import mlflow
import mlflow.spark
# Load data
data = spark.read.format("delta").load("/mnt/delta/customer_data")
# Prepare features
assembler = VectorAssembler(inputCols=["age","account_age","num_transactions"], outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="churn")
pipeline = Pipeline(stages=[assembler, rf])
# Train model
model = pipeline.fit(data)
# Log the trained pipeline to MLflow
with mlflow.start_run():
    mlflow.spark.log_model(model, "churn_model")
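A quick sanity check of the trained pipeline is to score a DataFrame and compute AUC (the training data is reused here for brevity; in practice use a held-out split):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Score with the trained pipeline and inspect a few predictions
predictions = model.transform(data)
predictions.select("churn", "prediction", "probability").show(5)

# Area under the ROC curve as a simple quality metric
evaluator = BinaryClassificationEvaluator(labelCol="churn")
print("AUC:", evaluator.evaluate(predictions))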
---
8. Security
- IAM Roles & Policies: attach an instance profile (IAM role) to clusters for S3 and Kinesis access
- KMS: Encrypt S3 buckets and cluster storage
- Private Networking: VPC Peering, PrivateLink
- Audit Logging: CloudTrail, S3 access logs
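Credentials themselves are best kept out of notebooks (unlike the hard-coded keys in the mount example above); Databricks secret scopes handle this. A minimal sketch, assuming a secret scope named `aws` with these keys already exists:

# Fetch credentials from a Databricks secret scope instead of hard-coding them
access_key = dbutils.secrets.get(scope="aws", key="s3_access_key")
secret_key = dbutils.secrets.get(scope="aws", key="s3_secret_key")

# Hand them to the S3A filesystem for this Spark session
spark.conf.set("fs.s3a.access.key", access_key)
spark.conf.set("fs.s3a.secret.key", secret_key)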
9. Autoscaling & Performance
- Auto-scaling: min/max worker nodes per cluster
- Photon Engine: faster Delta table queries
- Caching: df.cache() for repeated computations
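For example, caching a Delta DataFrame avoids re-reading S3 when the same data feeds several aggregations (the path is a placeholder):

# Cache once, reuse across multiple aggregations
logs = spark.read.format("delta").load("/mnt/delta/clean_logs").cache()

logs.groupBy("eventType").count().show()
logs.groupBy("userId").count().orderBy("count", ascending=False).show(10)

logs.unpersist()   # release the cached data when finished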
10. Sample Data Flow (AWS + Databricks)
Scenario: Real-time e-commerce analytics
- Ingest customer clickstreams → Kinesis Data Stream
- Databricks Streaming Cluster processes & cleans data → Delta Lake (S3)
- Analyze via Databricks SQL / Redshift Spectrum dashboards
- Train recommendation ML models on Delta Lake
- Secure via IAM roles, KMS encryption, private VPC
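The analysis step in this flow can be as simple as registering the streamed Delta path as a table and querying it; the table name and location below are illustrative and match the streaming example in the next section:

# Expose the streamed Delta data to Databricks SQL and notebooks
spark.sql("CREATE TABLE IF NOT EXISTS clickstream_events USING DELTA LOCATION '/mnt/delta/streaming_data'")

spark.sql("""
  SELECT eventType, COUNT(*) AS events
  FROM clickstream_events
  GROUP BY eventType
  ORDER BY events DESC
""").show()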
---
11. Quick Hands-On Use Cases in Databricks on AWS
ETL Pipeline
df = spark.read.json("s3://my-bucket/raw_logs/")
df_clean = df.select("userId","eventType","timestamp").filter(df.eventType!="ignore")
df_clean.write.format("delta").mode("overwrite").save("/mnt/delta/clean_logs")
Streaming Analytics
# Read from Kinesis (the Databricks Kinesis source delivers each payload in a binary `data` column)
df_stream = (spark.readStream.format("kinesis")
    .option("streamName", "clickstream")
    .option("region", "us-east-1")
    .load())
# Decode the JSON payload and drop ignored events
df_clean = (df_stream
    .selectExpr("CAST(data AS STRING) AS payload", "get_json_object(CAST(data AS STRING), '$.eventType') AS eventType")
    .filter("eventType != 'ignore'"))
(df_clean.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/checkpoint")
    .start("/mnt/delta/streaming_data"))
ML Model Training
assembler = VectorAssembler(inputCols=["age", "account_age"], outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="churn")
pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(data)
mlflow.spark.log_model(model, "churn_model")