Strategy: S3 + Glue + Athena + Lake Formation – Multi-Region Setup
🔹 1. S3: Set Up Cross-Region Replication (CRR)
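| Action | Detail |
|---|---|
| Primary S3 Bucket | Create in Region-A |
| Secondary S3 Bucket | Create in Region-B |
| Enable Versioning | On both buckets (required for replication) |
| Set Up CRR | Replicate to the secondary bucket; for SSE-KMS objects, encrypt replicas with a KMS key in Region-B |
| Optional | Enable S3 Replication Time Control (RTC) for an SLA-backed target of replicating 99.99% of objects within 15 minutes |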
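
The CRR settings in this table can also be expressed as a Boto3 payload. The following is a minimal sketch, not a definitive implementation: the function names, the rule ID, and the example ARNs are hypothetical, and it assumes the IAM replication role and both versioned buckets already exist.

```python
from typing import Optional


def build_replication_config(
    role_arn: str,
    dest_bucket_arn: str,
    replica_kms_key_arn: Optional[str] = None,
    enable_rtc: bool = False,
) -> dict:
    """Build the ReplicationConfiguration payload for put_bucket_replication."""
    destination = {"Bucket": dest_bucket_arn, "StorageClass": "STANDARD"}
    rule = {
        "ID": "replicate-all",  # hypothetical rule name
        "Priority": 1,
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix selects the whole bucket
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": destination,
    }
    if replica_kms_key_arn:
        # SSE-KMS objects are replicated only when explicitly selected, and
        # the replicas are encrypted with a key that lives in the DR region.
        rule["SourceSelectionCriteria"] = {
            "SseKmsEncryptedObjects": {"Status": "Enabled"}
        }
        destination["EncryptionConfiguration"] = {
            "ReplicaKmsKeyID": replica_kms_key_arn
        }
    if enable_rtc:
        # RTC has to be enabled together with replication metrics.
        destination["ReplicationTime"] = {
            "Status": "Enabled",
            "Time": {"Minutes": 15},
        }
        destination["Metrics"] = {
            "Status": "Enabled",
            "EventThreshold": {"Minutes": 15},
        }
    return {"Role": role_arn, "Rules": [rule]}


def apply_replication(source_bucket: str, config: dict, region: str) -> None:
    """Attach the replication configuration to the source bucket."""
    import boto3  # imported lazily so the builder works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.put_bucket_replication(
        Bucket=source_bucket, ReplicationConfiguration=config
    )
```

A call such as `apply_replication("my-lakehouse-primary", config, "us-east-1")` would then attach the rule to the primary bucket. Keeping the payload builder separate from the API call makes it easy to review the configuration before anything is changed.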
🔹 2. Glue Catalog, Jobs, Crawlers – Create in Both Regions
| Resource | Recommendation |
|---|---|
| Glue Tables | Define via Terraform/CDK and deploy to both regions, with each table's location pointing at that region's bucket |
| Crawlers | Same config in both regions (can be turned off in DR until needed) |
| Glue Jobs | Store scripts in replicated S3; deploy job definitions to both regions |
| Glue Workflows | Same logic in both regions (disabled in DR until failover) |
✅ Use Infrastructure-as-Code (IaC) (Terraform or CDK) to deploy in Region-A and Region-B
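
If some Region-A tables are created outside IaC (for example by a crawler), their definitions could be mirrored into Region-B with a short script. This is a rough sketch under stated assumptions: the helper names are hypothetical, only a conservative subset of `TableInput` fields is copied, and the bucket names must be supplied by the caller.

```python
import copy

# get_table also returns read-only fields (CreateTime, DatabaseName, ...)
# that create_table/update_table reject, so only these keys are copied.
TABLE_INPUT_KEYS = (
    "Name",
    "Description",
    "Owner",
    "Retention",
    "StorageDescriptor",
    "PartitionKeys",
    "TableType",
    "Parameters",
)


def to_dr_table_input(table: dict, primary_bucket: str, dr_bucket: str) -> dict:
    """Convert a get_table result into a TableInput that targets the DR bucket."""
    table_input = {
        key: copy.deepcopy(table[key]) for key in TABLE_INPUT_KEYS if key in table
    }
    descriptor = table_input.get("StorageDescriptor", {})
    location = descriptor.get("Location", "")
    prefix = f"s3://{primary_bucket}/"
    if location.startswith(prefix):
        # Keep the object key path, swap only the bucket name.
        descriptor["Location"] = f"s3://{dr_bucket}/" + location[len(prefix):]
    return table_input


def mirror_table(
    database: str,
    name: str,
    primary_bucket: str,
    dr_bucket: str,
    primary_region: str,
    dr_region: str,
) -> None:
    """Copy one table definition from the primary catalog to the DR catalog."""
    import boto3  # imported lazily so the converter works without boto3

    source = boto3.client("glue", region_name=primary_region)
    target = boto3.client("glue", region_name=dr_region)
    table = source.get_table(DatabaseName=database, Name=name)["Table"]
    table_input = to_dr_table_input(table, primary_bucket, dr_bucket)
    try:
        target.update_table(DatabaseName=database, TableInput=table_input)
    except target.exceptions.EntityNotFoundException:
        target.create_table(DatabaseName=database, TableInput=table_input)
```

The sketch copies table metadata only. Partitions would need a similar pass, and tables already managed by Terraform or CDK are better left to the IaC pipeline.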
🔹 3. Lake Formation: Pre-Create Policies in Both Regions
| What to Do | Details |
|---|---|
| LF-Tags | Create in both regions via IaC or script |
| Grant Permissions | Run `aws lakeformation grant-permissions` in Region-A and Region-B |
| Principal Mapping | Keep IAM roles/groups consistent across both regions |
🔁 Export from Region-A using list-permissions, and apply to Region-B regularly (daily or via CI/CD)
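
The export-and-apply loop could look roughly like the following Boto3 sketch. The function names are hypothetical, and it makes simplifying assumptions: grants are only added (nothing is revoked in Region-B), and data-location grants are skipped because they reference the Region-A bucket ARN and would need remapping first.

```python
def to_grant_request(entry: dict) -> dict:
    """Turn one list_permissions entry into keyword arguments for grant_permissions."""
    request = {
        "Principal": entry["Principal"],
        "Resource": entry["Resource"],
        "Permissions": list(entry.get("Permissions", [])),
    }
    grantable = entry.get("PermissionsWithGrantOption") or []
    if grantable:
        request["PermissionsWithGrantOption"] = list(grantable)
    return request


def export_permissions(region: str) -> list:
    """Collect every permission entry from one region, following NextToken."""
    import boto3  # imported lazily so the converter works without boto3

    client = boto3.client("lakeformation", region_name=region)
    entries, token = [], None
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = client.list_permissions(**kwargs)
        entries.extend(page.get("PrincipalResourcePermissions", []))
        token = page.get("NextToken")
        if not token:
            return entries


def sync_permissions(source_region: str, target_region: str) -> int:
    """Re-apply Region-A grants in Region-B and return how many were applied."""
    import boto3

    target = boto3.client("lakeformation", region_name=target_region)
    applied = 0
    for entry in export_permissions(source_region):
        if "DataLocation" in entry["Resource"]:
            continue  # points at the primary bucket ARN; remap before granting
        target.grant_permissions(**to_grant_request(entry))
        applied += 1
    return applied
```

Running `sync_permissions("us-east-1", "us-west-2")` from a daily job or a CI/CD stage matches the cadence suggested above. Some exported entries may need extra normalisation before Lake Formation accepts them as grants, so a dry run that prints the requests first is a sensible precaution.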
🔁 Automation Option
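Use Terraform workspaces (one per region) or, on CloudFormation, StackSets.

```bash
# Assumes the configuration declares an `aws_region` input variable.
# Each workspace keeps its own state, so the regions are applied independently.
terraform workspace new region-a
terraform apply -var="aws_region=us-east-1"

terraform workspace new region-b
terraform apply -var="aws_region=us-west-2"
```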
✅ Summary Diagram
```text
+-------------------------+
|     Primary Region      |
|       (us-east-1)       |
+-------------------------+
  | S3 (Lake House)
  | Glue Catalog, Jobs
  | Athena Queries
  | Lake Formation Policy
  ▼
+-----------------------------------+
|  S3 CRR Replication to Region B   |
+-----------------------------------+
  ▼
+-------------------------+
|        DR Region        |
|       (us-west-2)       |
+-------------------------+
  | S3 + Catalog Pre-Built
  | Glue Crawlers & Jobs (disabled)
  | Lake Formation policies in sync
  ▼
Manual or automatic failover
```
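
The failover step at the bottom of the diagram mostly consists of switching on what was kept idle in the DR region. A minimal sketch, assuming the DR crawlers have a schedule defined and that the trigger and crawler names (hypothetical here) are known to the caller:

```python
def activate_dr(glue, trigger_names, crawler_names) -> list:
    """Start the Glue triggers and crawler schedules that were idle in DR.

    `glue` is a Glue client for the DR region, for example
    boto3.client("glue", region_name="us-west-2").
    """
    started = []
    for name in trigger_names:
        glue.start_trigger(Name=name)
        started.append(("trigger", name))
    for name in crawler_names:
        # Resumes the crawler's existing schedule; it does not run it at once.
        glue.start_crawler_schedule(CrawlerName=name)
        started.append(("crawler_schedule", name))
    return started
```

Passing the client in as an argument keeps the function easy to dry-run against a stub before a real failover. Athena workgroups and any consumers would still need to be pointed at the DR region separately.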
Terraform Template: Lake House DR Setup (S3 + Glue + Athena + Lake Formation)

```hcl
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

############################
# 1. S3 Buckets + CRR
############################

resource "aws_s3_bucket" "lakehouse_primary" {
  provider = aws.primary
  bucket   = "my-lakehouse-primary"
}

resource "aws_s3_bucket" "lakehouse_dr" {
  provider = aws.dr
  bucket   = "my-lakehouse-dr"
}

# Versioning is managed with its own resource; the inline `versioning`
# block on aws_s3_bucket is deprecated in AWS provider v4 and later.
resource "aws_s3_bucket_versioning" "lakehouse_primary" {
  provider = aws.primary
  bucket   = aws_s3_bucket.lakehouse_primary.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_versioning" "lakehouse_dr" {
  provider = aws.dr
  bucket   = aws_s3_bucket.lakehouse_dr.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_replication_configuration" "replication" {
  provider = aws.primary

  # Both buckets must be versioned before replication can be configured.
  depends_on = [
    aws_s3_bucket_versioning.lakehouse_primary,
    aws_s3_bucket_versioning.lakehouse_dr,
  ]

  bucket = aws_s3_bucket.lakehouse_primary.id
  role   = aws_iam_role.replication_role.arn

  # The block name is `rule` (singular).
  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.lakehouse_dr.arn
      storage_class = "STANDARD"
    }
  }
}

resource "aws_iam_role" "replication_role" {
  provider = aws.primary
  name     = "s3-replication-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "s3.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "replication_policy" {
  provider = aws.primary
  role     = aws_iam_role.replication_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetReplicationConfiguration", "s3:ListBucket"]
        Resource = [aws_s3_bucket.lakehouse_primary.arn]
      },
      {
        Effect = "Allow"
        # GetObjectVersionTagging is needed for tags to be replicated.
        Action = [
          "s3:GetObjectVersionForReplication",
          "s3:GetObjectVersionAcl",
          "s3:GetObjectVersionTagging",
        ]
        Resource = ["${aws_s3_bucket.lakehouse_primary.arn}/*"]
      },
      {
        Effect   = "Allow"
        Action   = ["s3:ReplicateObject", "s3:ReplicateDelete", "s3:ReplicateTags"]
        Resource = ["${aws_s3_bucket.lakehouse_dr.arn}/*"]
      }
    ]
  })
}

############################
# 2. Glue Catalog Table
############################

resource "aws_glue_catalog_database" "catalog_db" {
  provider = aws.primary
  name     = "lakehouse_db"
}

resource "aws_glue_catalog_database" "catalog_db_dr" {
  provider = aws.dr
  name     = "lakehouse_db"
}

resource "aws_glue_catalog_table" "sample_table" {
  provider      = aws.primary
  name          = "sample_data"
  database_name = aws_glue_catalog_database.catalog_db.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "parquet"
    EXTERNAL       = "TRUE"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.lakehouse_primary.bucket}/data/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    serde_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  }
}

# Same table in the DR catalog, pointing at the DR bucket.
resource "aws_glue_catalog_table" "sample_table_dr" {
  provider      = aws.dr
  name          = "sample_data"
  database_name = aws_glue_catalog_database.catalog_db_dr.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "parquet"
    EXTERNAL       = "TRUE"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.lakehouse_dr.bucket}/data/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    serde_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  }
}

############################
# 3. Lake Formation (Optional)
############################
# NOTE: Lake Formation grants are not defined in this template.
# The AWS provider does offer aws_lakeformation_permissions and
# aws_lakeformation_lf_tag; alternatively, script the grants with
# `aws lakeformation grant-permissions` or Boto3, or wrap the CLI call
# in a Terraform `null_resource` with local-exec.
```