Thursday, 26 June 2025

DR - S3 + Glue + Athena + Lake Formation – Multi-Region Setup

 

Strategy: S3 + Glue + Athena + Lake Formation – Multi-Region Setup 

 

🔹 1. S3: Setup Cross-Region Replication (CRR) 

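Action               | Detail
Primary S3 Bucket    | Create in Region-A
Secondary S3 Bucket  | Create in Region-B
Enable Versioning    | On both buckets (replication requires it)
Setup CRR            | Replicate to the secondary bucket; for SSE-KMS objects, specify a KMS key in Region-B for the replicas
Optional             | Use S3 Replication Time Control (RTC) for an SLA-backed 15-minute replication target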
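
As a rough illustration of the CRR settings above, here is a minimal Boto3-oriented sketch that builds the replication configuration as a plain dictionary. The function names, the rule ID and the choice to replicate delete markers are assumptions made for this example, not something the setup above prescribes; the role and key ARNs are placeholders you would supply.

```python
from typing import Any, Dict, Optional


def build_replication_config(
    role_arn: str,
    dest_bucket_arn: str,
    replica_kms_key_arn: Optional[str] = None,
    enable_rtc: bool = False,
) -> Dict[str, Any]:
    """Build the ReplicationConfiguration dict for s3.put_bucket_replication."""
    destination: Dict[str, Any] = {"Bucket": dest_bucket_arn}
    rule: Dict[str, Any] = {
        "ID": "replicate-all",  # hypothetical rule name
        "Priority": 1,
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix matches every object
        # Assumption: deletes should propagate to the DR bucket as well.
        "DeleteMarkerReplication": {"Status": "Enabled"},
        "Destination": destination,
    }
    if replica_kms_key_arn:
        # Replicate SSE-KMS objects and re-encrypt them with a Region-B key.
        rule["SourceSelectionCriteria"] = {
            "SseKmsEncryptedObjects": {"Status": "Enabled"}
        }
        destination["EncryptionConfiguration"] = {
            "ReplicaKmsKeyID": replica_kms_key_arn
        }
    if enable_rtc:
        # RTC needs replication metrics switched on alongside it.
        destination["ReplicationTime"] = {
            "Status": "Enabled",
            "Time": {"Minutes": 15},
        }
        destination["Metrics"] = {
            "Status": "Enabled",
            "EventThreshold": {"Minutes": 15},
        }
    return {"Role": role_arn, "Rules": [rule]}


def apply_replication(source_bucket: str, config: Dict[str, Any], region: str) -> None:
    """Apply the configuration. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.put_bucket_replication(Bucket=source_bucket, ReplicationConfiguration=config)
```

Keeping the builder separate from the API call makes it easy to review or diff the configuration before anything is applied to the primary bucket.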

 

🔹 2. Glue Catalog, Jobs, Crawlers – Create in Both Regions 

Resource        | Recommendation
Glue Tables     | Define via Terraform/CDK and deploy to both regions
Crawlers        | Same config in both regions (can be turned off in DR until needed)
Glue Jobs       | Store scripts in replicated S3; deploy job definitions to both regions
Glue Workflows  | Replicated logic (disabled in DR until failover)

✅ Use Infrastructure-as-Code (IaC) (Terraform or CDK) to deploy in Region-A and Region-B 
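
Since the DR crawlers and workflows stay switched off until failover, something has to switch them on. The sketch below shows one possible way to do that with Boto3's Glue client, by resuming crawler schedules and starting triggers. The function name and the idea of passing explicit name lists are assumptions for this example; it also assumes the crawlers have schedules and that the workflows are driven by triggers.

```python
from typing import Any, Iterable, List


def activate_dr_glue(
    glue: Any,
    crawler_names: Iterable[str],
    trigger_names: Iterable[str],
) -> List[str]:
    """Resume crawler schedules and start triggers in the DR region.

    `glue` is expected to behave like
    boto3.client("glue", region_name=<DR region>).
    Returns a simple log of the actions taken.
    """
    actions: List[str] = []
    for name in crawler_names:
        # Moves a paused crawler schedule back to SCHEDULED.
        glue.start_crawler_schedule(CrawlerName=name)
        actions.append(f"crawler-schedule:{name}")
    for name in trigger_names:
        # Activates an existing trigger (for example a workflow start trigger).
        glue.start_trigger(Name=name)
        actions.append(f"trigger:{name}")
    return actions
```

In a real runbook you would likely wrap each call in error handling, since a crawler without a schedule or an already-running trigger will raise an exception.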

 

🔹 3. Lake Formation: Pre-Create Policies in Both Regions 

What to Do         | Details
LF-Tags            | Create in both regions via IaC or script
Grant Permissions  | Run grant-permissions in Region-A and Region-B
Principal Mapping  | Use the same IAM principals (roles/users) in both regions; IAM is global, so the ARNs match

🔁 Export from Region-A using list-permissions, and apply to Region-B regularly (daily or via CI/CD) 
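
The export-and-apply loop could look roughly like the following Boto3 sketch. The helper names are made up for this example, and the two client arguments are assumed to be Lake Formation clients for Region-A and Region-B. It copies entries as-is, so treat it as a starting point: some entries may need adjusting before Region-B accepts them (for example, data location ARNs that name the Region-A bucket).

```python
from typing import Any, Dict, Iterator


def export_permissions(lf_source: Any) -> Iterator[Dict[str, Any]]:
    """Yield every permission entry from the source region, following NextToken."""
    kwargs: Dict[str, Any] = {}
    while True:
        page = lf_source.list_permissions(**kwargs)
        yield from page.get("PrincipalResourcePermissions", [])
        token = page.get("NextToken")
        if not token:
            return
        kwargs["NextToken"] = token


def to_grant_request(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields that grant_permissions accepts."""
    request: Dict[str, Any] = {
        "Principal": entry["Principal"],
        "Resource": entry["Resource"],
        "Permissions": entry["Permissions"],
    }
    if entry.get("PermissionsWithGrantOption"):
        request["PermissionsWithGrantOption"] = entry["PermissionsWithGrantOption"]
    return request


def sync_permissions(lf_source: Any, lf_target: Any) -> int:
    """Re-grant every exported permission in the target region; return the count."""
    count = 0
    for entry in export_permissions(lf_source):
        lf_target.grant_permissions(**to_grant_request(entry))
        count += 1
    return count
```

With real clients this would be called as `sync_permissions(boto3.client("lakeformation", region_name="us-east-1"), boto3.client("lakeformation", region_name="us-west-2"))`, for instance from a daily scheduled job or a CI/CD step.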

 

🔁 Automation Option 

Use Terraform workspaces (one per region), or StackSets if you deploy with CloudFormation:

terraform workspace new region-a 
terraform apply -var="aws_region=us-east-1" 
 
terraform workspace new region-b 
terraform apply -var="aws_region=us-west-2" 

✅ Summary Diagram 


              +-------------------------+
              |     Primary Region      | 
              |      (us-east-1)        | 
              +-------------------------+ 
                 |  S3 (Lake House) 
                 |  Glue Catalog, Jobs 
                 |  Athena Queries 
                 |  Lake Formation Policy 
                 ▼ 
       +-----------------------------------+ 
       |   S3 CRR Replication to Region B  | 
       +-----------------------------------+ 
                 ▼ 
              +-------------------------+ 
              |     DR Region           | 
              |      (us-west-2)        | 
              +-------------------------+ 
                 | S3 + Catalog Pre-Built 
                 | Glue Crawlers & Jobs (disabled) 
                 | Lake Formation policies in sync 
                 ▼ 
          Manual or automatic failover 
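
Before or right after a failover, it can help to confirm that Athena in the DR region can actually query the pre-built catalog. The following is a minimal smoke-test sketch using Boto3's Athena client. The function name, the default `SELECT 1` query and the `TIMED_OUT` return value are assumptions made for this example (`TIMED_OUT` is this sketch's own label, not an Athena state), and the output location is a placeholder you would supply.

```python
import time
from typing import Any


def run_smoke_query(
    athena: Any,
    database: str,
    output_location: str,
    sql: str = "SELECT 1",
    poll_seconds: float = 1.0,
    max_polls: int = 60,
) -> str:
    """Run one query and return its final state.

    `athena` is expected to behave like
    boto3.client("athena", region_name=<DR region>).
    Returns SUCCEEDED, FAILED, CANCELLED, or TIMED_OUT if polling gives up.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
    return "TIMED_OUT"
```

A more meaningful check would replace the default query with a small `SELECT` against a replicated table, which exercises the S3 data, the Glue catalog and the Lake Formation grants in one step.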
 

 


# Terraform Template: Lake House DR Setup (S3 + Glue + Athena + Lake Formation)

provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

############################
# 1. S3 Buckets + CRR
############################

resource "aws_s3_bucket" "lakehouse_primary" {
  provider = aws.primary
  bucket   = "my-lakehouse-primary"
}

resource "aws_s3_bucket" "lakehouse_dr" {
  provider = aws.dr
  bucket   = "my-lakehouse-dr"
}

# Versioning is a separate resource in AWS provider v4+; the inline
# `versioning` block on aws_s3_bucket is deprecated.
resource "aws_s3_bucket_versioning" "lakehouse_primary" {
  provider = aws.primary
  bucket   = aws_s3_bucket.lakehouse_primary.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_versioning" "lakehouse_dr" {
  provider = aws.dr
  bucket   = aws_s3_bucket.lakehouse_dr.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_replication_configuration" "replication" {
  provider = aws.primary

  # Both buckets must be versioned before replication can be configured.
  depends_on = [
    aws_s3_bucket_versioning.lakehouse_primary,
    aws_s3_bucket_versioning.lakehouse_dr,
  ]

  bucket = aws_s3_bucket.lakehouse_primary.id
  role   = aws_iam_role.replication_role.arn

  # The block name is `rule` (singular), one block per replication rule.
  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.lakehouse_dr.arn
      storage_class = "STANDARD"
    }
  }
}

# IAM is global, but with only aliased providers configured the resource
# still needs an explicit provider.
resource "aws_iam_role" "replication_role" {
  provider = aws.primary
  name     = "s3-replication-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect    = "Allow",
      Principal = { Service = "s3.amazonaws.com" },
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "replication_policy" {
  provider = aws.primary
  role     = aws_iam_role.replication_role.id

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect   = "Allow",
        Action   = ["s3:GetReplicationConfiguration", "s3:ListBucket"],
        Resource = [aws_s3_bucket.lakehouse_primary.arn]
      },
      {
        Effect = "Allow",
        # GetObjectVersionTagging is needed for ReplicateTags below.
        Action   = ["s3:GetObjectVersion", "s3:GetObjectVersionAcl", "s3:GetObjectVersionTagging"],
        Resource = ["${aws_s3_bucket.lakehouse_primary.arn}/*"]
      },
      {
        Effect   = "Allow",
        Action   = ["s3:ReplicateObject", "s3:ReplicateDelete", "s3:ReplicateTags"],
        Resource = ["${aws_s3_bucket.lakehouse_dr.arn}/*"]
      }
    ]
  })
}

############################
# 2. Glue Catalog Table
############################

resource "aws_glue_catalog_database" "catalog_db" {
  provider = aws.primary
  name     = "lakehouse_db"
}

resource "aws_glue_catalog_database" "catalog_db_dr" {
  provider = aws.dr
  name     = "lakehouse_db"
}

# Column definitions are not part of this template; add `columns` blocks
# to each storage_descriptor for your schema.
resource "aws_glue_catalog_table" "sample_table" {
  provider      = aws.primary
  name          = "sample_data"
  database_name = aws_glue_catalog_database.catalog_db.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "parquet"
    EXTERNAL       = "TRUE"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.lakehouse_primary.bucket}/data/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    serde_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  }
}

# Same table in the DR region, pointing at the replica bucket.
resource "aws_glue_catalog_table" "sample_table_dr" {
  provider      = aws.dr
  name          = "sample_data"
  database_name = aws_glue_catalog_database.catalog_db_dr.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "parquet"
    EXTERNAL       = "TRUE"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.lakehouse_dr.bucket}/data/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    serde_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  }
}

############################
# 3. Lake Formation (Optional)
############################

# NOTE: Lake Formation grants are not defined in this template. Script them
# with `aws lakeformation grant-permissions` or Boto3 (for example from a
# Terraform `null_resource` with local-exec), or manage them with the AWS
# provider's `aws_lakeformation_permissions` and `aws_lakeformation_lf_tag`
# resources.
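
For the scripted route, LF-Tags are a natural first thing to keep in sync, since tag-based grants in Region-B depend on the tags existing there. Below is a small sketch that compares the tags of both regions and applies only what is missing. The function names are invented for this example; the inputs are assumed to be the `LFTags` lists returned by the Lake Formation `list_lf_tags` call in each region, and the sketch deliberately never deletes anything in the DR region.

```python
from typing import Any, Dict, List, Tuple

TagList = List[Dict[str, Any]]


def plan_lf_tag_sync(
    primary_tags: TagList, dr_tags: TagList
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return (tags_to_create, values_to_add) needed to bring DR to parity."""
    dr_index = {t["TagKey"]: set(t["TagValues"]) for t in dr_tags}
    to_create: Dict[str, List[str]] = {}
    to_add: Dict[str, List[str]] = {}
    for tag in primary_tags:
        key, values = tag["TagKey"], set(tag["TagValues"])
        if key not in dr_index:
            to_create[key] = sorted(values)
        else:
            missing = values - dr_index[key]
            if missing:
                to_add[key] = sorted(missing)
    return to_create, to_add


def apply_lf_tag_plan(
    lf_dr: Any,
    to_create: Dict[str, List[str]],
    to_add: Dict[str, List[str]],
) -> None:
    """Apply the plan with a DR-region Lake Formation client."""
    for key, values in to_create.items():
        lf_dr.create_lf_tag(TagKey=key, TagValues=values)
    for key, values in to_add.items():
        lf_dr.update_lf_tag(TagKey=key, TagValuesToAdd=values)
```

Splitting planning from applying lets a CI/CD job print the plan for review first, much like `terraform plan` before `terraform apply`.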


 

 
