Wednesday, 25 June 2025

Glue - part 1

🔷 What is AWS Glue? 

AWS Glue is a serverless data integration service that lets you discover, prepare, transform, and load data for analytics, ML, and app development. It supports structured, semi-structured, and unstructured data. 

 

📦 Key Components of AWS Glue 

  • Glue Data Catalog: Central metadata repository to register schemas and tables.

  • Crawlers: Scan data in S3 or JDBC sources and create tables in the Glue Data Catalog.

  • ETL Jobs: Code-based (Python/Scala) jobs for data transformation.

  • Triggers: Automate job execution based on schedules/events.

  • Workflows: Manage ETL pipelines with multiple dependent jobs and triggers.

  • Glue Studio: Visual editor to create jobs without writing code.

  • Glue DataBrew: Visual data preparation tool for business analysts.

  • Glue Streaming: Real-time ETL support for streaming data (e.g., Kinesis/Kafka).

  • Glue Connectors: Integrate with external sources such as RDS, MongoDB, and Snowflake.

  • Glue Marketplace: Third-party connectors and job blueprints.
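
These components are also reachable programmatically. As a quick, hedged illustration, here is a minimal boto3 sketch that browses the Data Catalog (the database name "my-db" is a placeholder):

python

# Minimal sketch: browse the Glue Data Catalog with boto3 (names are placeholders)
import boto3

glue = boto3.client("glue")

# List databases registered in the Data Catalog (pagination omitted for brevity)
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])

# List the tables and their storage locations inside one database
for table in glue.get_tables(DatabaseName="my-db")["TableList"]:
    print(table["Name"], "->", table.get("StorageDescriptor", {}).get("Location", ""))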

 

🔄 Glue Functionalities with Examples 

🔹 1. Glue Crawlers 

  • Detect schema from source data (e.g., S3, RDS, Redshift). 

  • Create/Update tables in Glue Catalog. 

bash 

# Create crawler with AWS CLI 
aws glue create-crawler \ 
 --name "my-crawler" \ 
 --role "GlueServiceRole" \ 
 --database-name "my-db" \ 
 --targets '{"S3Targets": [{"Path": "s3://my-bucket/data"}]}' 
 

 

🔹 2. Glue Jobs 

  • Run ETL logic (transform, clean, enrich). 

  • PySpark- or Scala-based code, or built visually through Glue Studio.

  • Can read/write from:

      • S3 (Parquet, CSV, JSON)

      • RDS, JDBC

      • Kafka/Kinesis

      • Redshift

python 

# PySpark Glue job: filter completed orders and write them to S3 as Parquet
from awsglue.transforms import Filter
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
# Read the "orders" table from the Glue Data Catalog and keep completed orders only
datasource = glueContext.create_dynamic_frame.from_catalog(database="sales", table_name="orders")
filtered = Filter.apply(frame=datasource, f=lambda row: row["status"] == "completed")
# Write the result to S3 in Parquet format
glueContext.write_dynamic_frame.from_options(
    frame=filtered, connection_type="s3",
    connection_options={"path": "s3://output-bucket/"},
    format="parquet"
)
 

 

🔹 3. Glue Workflows 

  • Orchestrate multiple jobs, triggers, crawlers. 

  • Track each execution as a workflow run. 

  • Visual interface available. 

 

🔹 4. Glue Studio 

  • Low-code visual ETL builder. 

  • Choose sources, apply transforms, and run jobs. 

 

🔹 5. Glue Streaming Jobs 

  • Connect to real-time streams like Kafka, Kinesis. 

  • Apply ETL on streaming data. 
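
A streaming job has much the same shape as the batch job shown earlier; below is a rough PySpark sketch, assuming a Kinesis-backed table named kinesis_events is already registered in the Data Catalog and that the output bucket and "status" column are placeholders:

python

# Rough sketch of a Glue streaming job (database, table, bucket, and column names are assumptions)
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read micro-batches from the Kinesis-backed catalog table
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="kinesis_events",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    # Apply batch-style transforms to each micro-batch, then persist to S3
    batch_df.filter("status = 'completed'") \
            .write.mode("append").parquet("s3://output-bucket/streaming/")

# Checkpointing lets the job resume from where it left off after a restart
glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={"windowSize": "100 seconds", "checkpointLocation": "s3://output-bucket/checkpoints/"},
)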

 

🔹 6. Glue DataBrew (No-code Tool) 

  • Drag-and-drop based data profiling and transformation. 

  • Auto-detect anomalies and clean data. 

 

🔐 IAM Permissions for AWS Glue 

IAM controls who can do what in Glue. 

🧾 Common IAM Actions: 

  • glue:CreateJob: Create a new Glue job

  • glue:StartJobRun: Execute a Glue job

  • glue:GetTable: View table metadata

  • glue:CreateCrawler: Create crawlers

  • glue:GetDatabase: View database metadata

  • glue:UpdateJob: Modify an existing job
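
Each of these actions gates a corresponding Glue API call. Here is a small boto3 sketch of the calls that glue:StartJobRun and glue:GetTable would authorize (the job, database, and table names are placeholders):

python

# Illustration of the API calls behind glue:StartJobRun and glue:GetTable
# (job, database, and table names are placeholders)
import boto3

glue = boto3.client("glue")

# Requires glue:StartJobRun on the job
run = glue.start_job_run(JobName="transform-customer-txns")
print("Started run:", run["JobRunId"])

# Requires glue:GetTable on the catalog entry
table = glue.get_table(DatabaseName="sales", Name="orders")["Table"]
print("Columns:", [c["Name"] for c in table["StorageDescriptor"]["Columns"]])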

📄 Example IAM Policy for a Glue Developer 

json 

{ 
 "Version": "2012-10-17", 
 "Statement": [ 
   { 
     "Effect": "Allow", 
     "Action": [ 
       "glue:*", 
       "s3:ListBucket", 
       "s3:GetObject", 
       "s3:PutObject" 
     ], 
     "Resource": "*" 
   } 
 ] 
} 
 

🎯 Best Practice: Don’t use "Resource": "*" in production. Define exact S3 buckets, Glue databases, etc. 

 

🔐 Governance with Lake Formation

🔸 What is Lake Formation? 

Lake Formation builds governance on top of the Glue Data Catalog. It controls fine-grained access to data and metadata.

🔸 Key Features: 

  • Fine-Grained Permissions: Table-, column-, and row-level access control.

  • Tag-Based Access: Use Lake Formation tags (LF-Tags) to classify and govern data.

  • Auditing: Tracks data access for governance and compliance.

  • Cross-Account Sharing: Secure data access across AWS accounts.
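
As a hedged sketch of tag-based access (the tag key, tag values, and role name below are invented for illustration), an LF-Tag can be created, attached to a table, and then granted on:

python

# Sketch of LF-Tag based access control (tag key/values and role name are placeholders)
import boto3

lf = boto3.client("lakeformation")

# 1. Define the tag and its allowed values
lf.create_lf_tag(TagKey="data_sensitivity", TagValues=["public", "confidential"])

# 2. Attach the tag to a catalog table
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales", "Name": "transactions"}},
    LFTags=[{"TagKey": "data_sensitivity", "TagValues": ["confidential"]}],
)

# 3. Grant SELECT on everything tagged "confidential" to an analyst role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "data_sensitivity", "TagValues": ["confidential"]}],
        }
    },
    Permissions=["SELECT"],
)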

 

🛡️ How Glue + Lake Formation Security Works 

  1. Register the data location (e.g., an S3 bucket).

  2. Grant data access via Lake Formation instead of only IAM.

  3. Restrict table/column access by:

      • Table name

      • Column filters

      • Row-level filters

  4. IAM still controls who can run jobs/crawlers, but Lake Formation governs access to the data.

 

✅ Lake Formation Permissions Model 

  • IAM: Controls Glue job execution (e.g., glue:StartJobRun).

  • Lake Formation: Controls data access (e.g., SELECT on table sales restricted to the amount column).

 

🎯 Example: Enforcing Access Control with Lake Formation 

Step 1: Register location 

bash 

aws lakeformation register-resource \ 
 --resource-arn arn:aws:s3:::my-bucket \ 
 --use-service-linked-role 
 

Step 2: Grant Table Permissions 

bash 

aws lakeformation grant-permissions \ 
 --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/athena-user \ 
 --permissions "SELECT" "DESCRIBE" \ 
 --resource '{ "Table": {"DatabaseName":"sales", "Name":"transactions"} }' 
 

Step 3: Grant Column-Level Access 

bash 

aws lakeformation grant-permissions \ 
 --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/analyst \ 
 --permissions "SELECT" \ 
 --resource '{ "TableWithColumns": { "DatabaseName": "sales", "Name": "transactions", "ColumnNames": ["customer_id", "amount"] } }' 
 

 

💡 Best Practices 

  • IAM: Use least privilege. Assign only the Glue and S3 permissions that are needed.

  • Lake Formation: Use LF permissions for data governance, not IAM bucket policies.

  • Catalog: Enable encryption and versioning for the Glue Data Catalog.

  • Logging: Enable CloudTrail for Glue and Lake Formation actions.
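
For the catalog encryption point, a minimal boto3 sketch (the KMS key ARN is a placeholder):

python

# Sketch: encrypt the Glue Data Catalog at rest with a KMS key (key ARN is a placeholder)
import boto3

glue = boto3.client("glue")

glue.put_data_catalog_encryption_settings(
    DataCatalogEncryptionSettings={
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        },
        # Optionally also encrypt connection passwords stored in the catalog
        "ConnectionPasswordEncryption": {
            "ReturnConnectionPasswordEncrypted": True,
            "AwsKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        },
    }
)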

 

🔚 Summary 

  • Glue: ETL, schema catalog, job execution

  • Lake Formation: Security and governance over the Glue Catalog and S3

  • IAM: Controls access to jobs, crawlers, and other Glue resources

  • Athena: Queries S3 data using Glue Catalog metadata

 

🔄 AWS Glue Workflows – In Depth 

✅ What are Glue Workflows? 

AWS Glue Workflows let you orchestrate complex ETL pipelines consisting of Jobs, Crawlers, and Triggers. You can define dependencies between steps and monitor their execution in a directed acyclic graph (DAG). 

 

📌 Components of a Glue Workflow 

  • Workflow: Logical container that manages the overall ETL pipeline.

  • Start Trigger: Automatically starts the workflow on a schedule or on demand.

  • Actions: Crawler runs, job runs, or custom trigger invocations.

  • Conditions: Success/failure conditions that control whether a step executes.

 

🧱 Glue Workflow Architecture Example 

Use case: 

  1. Crawl raw data

  2. Run the transformation job

  3. Crawl processed data

  4. Notify a downstream system (optional)

text

Workflow: customer-txn-etl-workflow 
 ├── Trigger: start (Scheduled or OnDemand) 
 │    └── Crawler: raw-data-crawler 
 │         └── Job: transform-customer-txns 
 │              └── Crawler: processed-data-crawler 
 

 

🛠️ Step-by-Step: Create a Glue Workflow 

Step 1: Create Workflow 

Via Console: 

  • Go to Glue → Workflows → “Add Workflow” 

  • Name: customer-txn-etl-workflow 

Or CLI: 

bash 

aws glue create-workflow --name customer-txn-etl-workflow 
 

 

Step 2: Add Start Trigger 

bash 

aws glue create-trigger \ 
 --name "start-trigger" \ 
 --workflow-name "customer-txn-etl-workflow" \ 
 --type SCHEDULED \ 
 --schedule "cron(0 2 * * ? *)" \ 
 --actions '[{"CrawlerName": "raw-data-crawler"}]' \ 
 --start-on-creation 
 

This runs the crawler daily at 2 AM UTC. 

 

Step 3: Add a Job Action (Transformation) 

bash 

aws glue create-trigger \ 
 --name "transform-job-trigger" \ 
 --workflow-name "customer-txn-etl-workflow" \ 
 --type CONDITIONAL \ 
 --predicate '{"Logical": "ANY", "Conditions": [{"LogicalOperator": "EQUALS", "CrawlerName": "raw-data-crawler", "CrawlState": "SUCCEEDED"}]}' \ 
 --actions '[{"JobName": "transform-customer-txns"}]' \ 
 --start-on-creation 
 

 

Step 4: Add a Crawler Action (Processed Data) 

bash 

aws glue create-trigger \ 
 --name "processed-data-crawler-trigger" \ 
 --workflow-name "customer-txn-etl-workflow" \ 
 --type CONDITIONAL \ 
 --predicate '{"Logical": "ANY", "Conditions": [{"LogicalOperator": "EQUALS", "JobName": "transform-customer-txns", "State": "SUCCEEDED"}]}' \ 
 --actions '[{"CrawlerName": "processed-data-crawler"}]' \ 
 --start-on-creation 
 

 

🔍 Monitor Workflow Execution 

  1. Go to Glue → Workflows → click customer-txn-etl-workflow.

  2. View the DAG and the status of each step (SUCCEEDED, FAILED, RUNNING).

  3. Retry failed nodes or trigger the workflow manually.
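
The same run status is available programmatically; here is a small boto3 sketch using the workflow name from the example above:

python

# Sketch: inspect workflow runs programmatically
import boto3

glue = boto3.client("glue")

for run in glue.get_workflow_runs(Name="customer-txn-etl-workflow")["Runs"]:
    stats = run.get("Statistics", {})
    print(run["WorkflowRunId"], run["Status"],
          "succeeded:", stats.get("SucceededActions"),
          "failed:", stats.get("FailedActions"))

# Failed nodes of a run can be retried with:
# glue.resume_workflow_run(Name="customer-txn-etl-workflow", RunId="...", NodeIds=["..."])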

 

🎯 Why Use Workflows? 

  • ✅ Dependency Management: Enforce a sequence such as crawl → transform → catalog output

  • ✅ Unified Monitoring: Single view of all jobs/triggers in the pipeline

  • ✅ Failure Handling: Configure retries, alerting, and error branching

  • ✅ Scheduled or Event-Driven: Run on a schedule, manually, or in response to events

 

🧠 Advanced Workflow Tips 

  • Dynamic Job Arguments: Pass input/output paths and partition dates

  • Error Triggering: Use on-failure triggers to run alerts or fallback actions

  • Run Properties: Pass custom metadata to jobs

  • EventBridge Triggering: Start workflows from S3 uploads, API calls, etc.
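
Two of these tips in boto3 form, as a hedged sketch (the argument names, values, and dates are placeholders):

python

# Sketch: dynamic job arguments and workflow run properties (names/values are placeholders)
import boto3

glue = boto3.client("glue")

# Dynamic job arguments, exposed to the job script as --input_path / --run_date
glue.start_job_run(
    JobName="transform-customer-txns",
    Arguments={"--input_path": "s3://my-bucket/raw/2025-06-25/", "--run_date": "2025-06-25"},
)

# Run properties: attach custom metadata to a specific workflow run
run_id = glue.start_workflow_run(Name="customer-txn-etl-workflow")["RunId"]
glue.put_workflow_run_properties(
    Name="customer-txn-etl-workflow",
    RunId=run_id,
    RunProperties={"source_system": "pos", "partition_date": "2025-06-25"},
)

Inside the job script, such arguments are typically read with awsglue.utils.getResolvedOptions.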

 

📝 Sample Workflow with Visual Editor (Glue Studio) 

  1. Go to Glue Studio → Workflows.

  2. Use the visual flow designer to drag in:

      • Crawler (raw)

      • Transform Job

      • Crawler (processed)

  3. Link the steps with “On Success” edges.

  4. Save and run the workflow, or schedule it.

 

✅ Example: Create a Complete ETL Workflow with CLI 

bash 

# Create the workflow 
aws glue create-workflow --name "customer-etl-pipeline" 
 
# Create start trigger 
aws glue create-trigger \ 
 --name "start" \ 
 --workflow-name "customer-etl-pipeline" \ 
 --type ON_DEMAND \ 
 --actions '[{"CrawlerName":"raw-data-crawler"}]' \ 
 --start-on-creation 
 
# Add job step after crawler 
aws glue create-trigger \ 
 --name "transform-step" \ 
 --workflow-name "customer-etl-pipeline" \ 
 --type CONDITIONAL \ 
 --predicate '{"Logical":"ANY","Conditions":[{"CrawlerName":"raw-data-crawler","CrawlState":"SUCCEEDED"}]}' \ 
 --actions '[{"JobName":"clean-transform-job"}]' \ 
 --start-on-creation 
 
# Add final crawler 
aws glue create-trigger \ 
 --name "final-catalog-update" \ 
 --workflow-name "customer-etl-pipeline" \ 
 --type CONDITIONAL \ 
 --predicate '{"Logical":"ANY","Conditions":[{"JobName":"clean-transform-job","State":"SUCCEEDED"}]}' \ 
 --actions '[{"CrawlerName":"processed-data-crawler"}]' \ 
 --start-on-creation 
 

 

🔚 Summary of Glue Workflow Use 

  • Create Workflow: Acts as the container

  • Create Start Trigger: Defines how the pipeline starts (manual/scheduled)

  • Add Crawler Actions: Catalog raw/processed data

  • Add Job Actions: Transform data

  • Add Conditional Triggers: Build dependencies between steps

  • Monitor DAG: Review the end-to-end execution flow

 

 
