Wednesday, 25 June 2025

Glue - part 1

🔷 What is AWS Glue? 

AWS Glue is a serverless data integration service that lets you discover, prepare, transform, and load data for analytics, ML, and app development. It supports structured, semi-structured, and unstructured data. 

 

📦 Key Components of AWS Glue 

  • Glue Data Catalog: Central metadata repository to register schemas and tables.

  • Crawlers: Scan data in S3 or JDBC sources and create tables in the Glue Data Catalog.

  • ETL Jobs: Code-based (Python/Scala) jobs for data transformation.

  • Triggers: Automate job execution based on schedules/events.

  • Workflows: Manage ETL pipelines with multiple dependent jobs and triggers.

  • Glue Studio: Visual editor to create jobs without writing code.

  • Glue DataBrew: Visual data preparation tool for business analysts.

  • Glue Streaming: Real-time ETL support for streaming data (e.g., Kinesis/Kafka).

  • Glue Connectors: Integrate with external sources such as RDS, MongoDB, and Snowflake.

  • Glue Marketplace: Third-party connectors and job blueprints.
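
These components are also reachable programmatically. As a quick, hedged illustration, here is a minimal boto3 sketch that browses the Data Catalog (the database name "my-db" is a placeholder):

python

# Minimal sketch: browse the Glue Data Catalog with boto3 (names are placeholders)
import boto3

glue = boto3.client("glue")

# List databases registered in the Data Catalog (pagination omitted for brevity)
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])

# List the tables and their storage locations inside one database
for table in glue.get_tables(DatabaseName="my-db")["TableList"]:
    print(table["Name"], "->", table.get("StorageDescriptor", {}).get("Location", ""))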

 

🔄 Glue Functionalities with Examples 

🔹 1. Glue Crawlers 

  • Detect schema from source data (e.g., S3, RDS, Redshift). 

  • Create/Update tables in Glue Catalog. 

bash 

# Create crawler with AWS CLI 
aws glue create-crawler \ 
 --name "my-crawler" \ 
 --role "GlueServiceRole" \ 
 --database-name "my-db" \ 
 --targets '{"S3Targets": [{"Path": "s3://my-bucket/data"}]}' 
 

 

🔹 2. Glue Jobs 

  • Run ETL logic (transform, clean, enrich). 

  • PySpark- or Scala-based code, or built visually through Glue Studio.

  • Can read/write from:

      • S3 (Parquet, CSV, JSON)

      • RDS, JDBC

      • Kafka/Kinesis

      • Redshift

python 

# PySpark Glue job: filter completed orders and write them to S3 as Parquet
from awsglue.transforms import Filter
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
# Read the "orders" table from the Glue Data Catalog and keep completed orders only
datasource = glueContext.create_dynamic_frame.from_catalog(database="sales", table_name="orders")
filtered = Filter.apply(frame=datasource, f=lambda row: row["status"] == "completed")
# Write the result to S3 in Parquet format
glueContext.write_dynamic_frame.from_options(
    frame=filtered, connection_type="s3",
    connection_options={"path": "s3://output-bucket/"},
    format="parquet"
)
 

 

🔹 3. Glue Workflows 

  • Orchestrate multiple jobs, triggers, crawlers. 

  • Track each execution as a workflow run. 

  • Visual interface available. 

 

🔹 4. Glue Studio 

  • Low-code visual ETL builder. 

  • Choose sources, apply transforms, and run jobs. 

 

🔹 5. Glue Streaming Jobs 

  • Connect to real-time streams like Kafka, Kinesis. 

  • Apply ETL on streaming data. 
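
A streaming job has much the same shape as the batch job shown earlier; below is a rough PySpark sketch, assuming a Kinesis-backed table named kinesis_events is already registered in the Data Catalog and that the output bucket and "status" column are placeholders:

python

# Rough sketch of a Glue streaming job (database, table, bucket, and column names are assumptions)
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read micro-batches from the Kinesis-backed catalog table
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="kinesis_events",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    # Apply batch-style transforms to each micro-batch, then persist to S3
    batch_df.filter("status = 'completed'") \
            .write.mode("append").parquet("s3://output-bucket/streaming/")

# Checkpointing lets the job resume from where it left off after a restart
glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={"windowSize": "100 seconds", "checkpointLocation": "s3://output-bucket/checkpoints/"},
)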

 

🔹 6. Glue DataBrew (No-code Tool) 

  • Drag-and-drop based data profiling and transformation. 

  • Auto-detect anomalies and clean data. 

 

🔐 IAM Permissions for AWS Glue 

IAM controls who can do what in Glue. 

🧾 Common IAM Actions: 

  • glue:CreateJob: Create a new Glue job

  • glue:StartJobRun: Execute a Glue job

  • glue:GetTable: View table metadata

  • glue:CreateCrawler: Create crawlers

  • glue:GetDatabase: View database metadata

  • glue:UpdateJob: Modify an existing job
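
Each of these actions gates a corresponding Glue API call. Here is a small boto3 sketch of the calls that glue:StartJobRun and glue:GetTable would authorize (the job, database, and table names are placeholders):

python

# Illustration of the API calls behind glue:StartJobRun and glue:GetTable
# (job, database, and table names are placeholders)
import boto3

glue = boto3.client("glue")

# Requires glue:StartJobRun on the job
run = glue.start_job_run(JobName="transform-customer-txns")
print("Started run:", run["JobRunId"])

# Requires glue:GetTable on the catalog entry
table = glue.get_table(DatabaseName="sales", Name="orders")["Table"]
print("Columns:", [c["Name"] for c in table["StorageDescriptor"]["Columns"]])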

📄 Example IAM Policy for a Glue Developer 

json 

{ 
 "Version": "2012-10-17", 
 "Statement": [ 
   { 
     "Effect": "Allow", 
     "Action": [ 
       "glue:*", 
       "s3:ListBucket", 
       "s3:GetObject", 
       "s3:PutObject" 
     ], 
     "Resource": "*" 
   } 
 ] 
} 
 

🎯 Best Practice: Don’t use "Resource": "*" in production. Define exact S3 buckets, Glue databases, etc. 

 

🔐 Governance with Lake Formation

🔸 What is Lake Formation? 

Lake Formation builds governance on top of the Glue Data Catalog. It controls fine-grained access to data and metadata.

🔸 Key Features: 

  • Fine-Grained Permissions: Table-, column-, and row-level access control.

  • Tag-Based Access: Use Lake Formation tags (LF-Tags) to classify and govern data.

  • Auditing: Tracks data access for governance and compliance.

  • Cross-Account Sharing: Secure data access across AWS accounts.
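
As a hedged sketch of tag-based access (the tag key, tag values, and role name below are invented for illustration), an LF-Tag can be created, attached to a table, and then granted on:

python

# Sketch of LF-Tag based access control (tag key/values and role name are placeholders)
import boto3

lf = boto3.client("lakeformation")

# 1. Define the tag and its allowed values
lf.create_lf_tag(TagKey="data_sensitivity", TagValues=["public", "confidential"])

# 2. Attach the tag to a catalog table
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales", "Name": "transactions"}},
    LFTags=[{"TagKey": "data_sensitivity", "TagValues": ["confidential"]}],
)

# 3. Grant SELECT on everything tagged "confidential" to an analyst role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "data_sensitivity", "TagValues": ["confidential"]}],
        }
    },
    Permissions=["SELECT"],
)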

 

🛡️ How Glue + Lake Formation Security Works 

  1. Register the data location (e.g., an S3 bucket).

  2. Grant data access via Lake Formation instead of only IAM.

  3. Restrict table/column access by:

      • Table name

      • Column filters

      • Row-level filters

  4. IAM still controls who can run jobs/crawlers, but Lake Formation governs access to the data.

 

✅ Lake Formation Permissions Model 

  • IAM: Controls Glue job execution (e.g., glue:StartJobRun).

  • Lake Formation: Controls data access (e.g., SELECT on table sales restricted to the amount column).

 

🎯 Example: Enforcing Access Control with Lake Formation 

Step 1: Register location 

bash 

aws lakeformation register-resource \ 
 --resource-arn arn:aws:s3:::my-bucket \ 
 --use-service-linked-role 
 

Step 2: Grant Table Permissions 

bash 

aws lakeformation grant-permissions \ 
 --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/athena-user \ 
 --permissions "SELECT" "DESCRIBE" \ 
 --resource '{ "Table": {"DatabaseName":"sales", "Name":"transactions"} }' 
 

Step 3: Grant Column-Level Access 

bash 

aws lakeformation grant-permissions \ 
 --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/analyst \ 
 --permissions "SELECT" \ 
 --resource '{ "TableWithColumns": { "DatabaseName": "sales", "Name": "transactions", "ColumnNames": ["customer_id", "amount"] } }' 
 

 

💡 Best Practices 

  • IAM: Use least privilege. Assign only the Glue and S3 permissions that are needed.

  • Lake Formation: Use LF permissions for data governance, not IAM bucket policies.

  • Catalog: Enable encryption and versioning for the Glue Data Catalog.

  • Logging: Enable CloudTrail for Glue and Lake Formation actions.
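
For the catalog encryption point, a minimal boto3 sketch (the KMS key ARN is a placeholder):

python

# Sketch: encrypt the Glue Data Catalog at rest with a KMS key (key ARN is a placeholder)
import boto3

glue = boto3.client("glue")

glue.put_data_catalog_encryption_settings(
    DataCatalogEncryptionSettings={
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        },
        # Optionally also encrypt connection passwords stored in the catalog
        "ConnectionPasswordEncryption": {
            "ReturnConnectionPasswordEncrypted": True,
            "AwsKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        },
    }
)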

 

🔚 Summary 

  • Glue: ETL, schema catalog, job execution

  • Lake Formation: Security and governance over the Glue Catalog and S3

  • IAM: Controls access to jobs, crawlers, and other Glue resources

  • Athena: Queries S3 data using Glue Catalog metadata

 

🔄 AWS Glue Workflows – In Depth 

✅ What are Glue Workflows? 

AWS Glue Workflows let you orchestrate complex ETL pipelines consisting of Jobs, Crawlers, and Triggers. You can define dependencies between steps and monitor their execution in a directed acyclic graph (DAG). 

 

📌 Components of a Glue Workflow 

  • Workflow: Logical container that manages the overall ETL pipeline.

  • Start Trigger: Automatically starts the workflow on a schedule or on demand.

  • Actions: Crawler runs, job runs, or custom trigger invocations.

  • Conditions: Success/failure conditions that control whether a step executes.

 

🧱 Glue Workflow Architecture Example 

Use case: 

  1. Crawl raw data

  2. Run the transformation job

  3. Crawl processed data

  4. Notify a downstream system (optional)

text

Workflow: customer-txn-etl-workflow 
 ├── Trigger: start (Scheduled or OnDemand) 
 │    └── Crawler: raw-data-crawler 
 │         └── Job: transform-customer-txns 
 │              └── Crawler: processed-data-crawler 
 

 

🛠️ Step-by-Step: Create a Glue Workflow 

Step 1: Create Workflow 

Via Console: 

  • Go to Glue → Workflows → “Add Workflow” 

  • Name: customer-txn-etl-workflow 

Or CLI: 

bash 

aws glue create-workflow --name customer-txn-etl-workflow 
 

 

Step 2: Add Start Trigger 

bash 

aws glue create-trigger \ 
 --name "start-trigger" \ 
 --workflow-name "customer-txn-etl-workflow" \ 
 --type SCHEDULED \ 
 --schedule "cron(0 2 * * ? *)" \ 
 --actions '[{"CrawlerName": "raw-data-crawler"}]' \ 
 --start-on-creation 
 

This runs the crawler daily at 2 AM UTC. 

 

Step 3: Add a Job Action (Transformation) 

bash 

aws glue create-trigger \ 
 --name "transform-job-trigger" \ 
 --workflow-name "customer-txn-etl-workflow" \ 
 --type CONDITIONAL \ 
 --predicate '{"Logical": "ANY", "Conditions": [{"LogicalOperator": "EQUALS", "CrawlerName": "raw-data-crawler", "CrawlState": "SUCCEEDED"}]}' \ 
 --actions '[{"JobName": "transform-customer-txns"}]' \ 
 --start-on-creation 
 

 

Step 4: Add a Crawler Action (Processed Data) 

bash 

aws glue create-trigger \ 
 --name "processed-data-crawler-trigger" \ 
 --workflow-name "customer-txn-etl-workflow" \ 
 --type CONDITIONAL \ 
 --predicate '{"Logical": "ANY", "Conditions": [{"LogicalOperator": "EQUALS", "JobName": "transform-customer-txns", "State": "SUCCEEDED"}]}' \ 
 --actions '[{"CrawlerName": "processed-data-crawler"}]' \ 
 --start-on-creation 
 

 

🔍 Monitor Workflow Execution 

  1. Go to Glue → Workflows → click customer-txn-etl-workflow.

  2. View the DAG and the status of each step (SUCCEEDED, FAILED, RUNNING).

  3. Retry failed nodes or trigger the workflow manually.
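
The same run status is available programmatically; here is a small boto3 sketch using the workflow name from the example above:

python

# Sketch: inspect workflow runs programmatically
import boto3

glue = boto3.client("glue")

for run in glue.get_workflow_runs(Name="customer-txn-etl-workflow")["Runs"]:
    stats = run.get("Statistics", {})
    print(run["WorkflowRunId"], run["Status"],
          "succeeded:", stats.get("SucceededActions"),
          "failed:", stats.get("FailedActions"))

# Failed nodes of a run can be retried with:
# glue.resume_workflow_run(Name="customer-txn-etl-workflow", RunId="...", NodeIds=["..."])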

 

🎯 Why Use Workflows? 

  • ✅ Dependency Management: Enforce a sequence such as crawl → transform → catalog output

  • ✅ Unified Monitoring: Single view of all jobs/triggers in the pipeline

  • ✅ Failure Handling: Configure retries, alerting, and error branching

  • ✅ Scheduled or Event-Driven: Run on a schedule, manually, or in response to events

 

🧠 Advanced Workflow Tips 

  • Dynamic Job Arguments: Pass input/output paths and partition dates

  • Error Triggering: Use on-failure triggers to run alerts or fallback actions

  • Run Properties: Pass custom metadata to jobs

  • EventBridge Triggering: Start workflows from S3 uploads, API calls, etc.
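
Two of these tips in boto3 form, as a hedged sketch (the argument names, values, and dates are placeholders):

python

# Sketch: dynamic job arguments and workflow run properties (names/values are placeholders)
import boto3

glue = boto3.client("glue")

# Dynamic job arguments, exposed to the job script as --input_path / --run_date
glue.start_job_run(
    JobName="transform-customer-txns",
    Arguments={"--input_path": "s3://my-bucket/raw/2025-06-25/", "--run_date": "2025-06-25"},
)

# Run properties: attach custom metadata to a specific workflow run
run_id = glue.start_workflow_run(Name="customer-txn-etl-workflow")["RunId"]
glue.put_workflow_run_properties(
    Name="customer-txn-etl-workflow",
    RunId=run_id,
    RunProperties={"source_system": "pos", "partition_date": "2025-06-25"},
)

Inside the job script, such arguments are typically read with awsglue.utils.getResolvedOptions.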

 

📝 Sample Workflow with Visual Editor (Glue Studio) 

  1. Go to Glue Studio → Workflows.

  2. Use the visual flow designer to drag in:

      • Crawler (raw)

      • Transform Job

      • Crawler (processed)

  3. Link the steps with “On Success” edges.

  4. Save and run the workflow, or schedule it.

 

✅ Example: Create a Complete ETL Workflow with CLI 

bash 

# Create the workflow 
aws glue create-workflow --name "customer-etl-pipeline" 
 
# Create start trigger 
aws glue create-trigger \ 
 --name "start" \ 
 --workflow-name "customer-etl-pipeline" \ 
 --type ON_DEMAND \ 
 --actions '[{"CrawlerName":"raw-data-crawler"}]' \ 
 --start-on-creation 
 
# Add job step after crawler 
aws glue create-trigger \ 
 --name "transform-step" \ 
 --workflow-name "customer-etl-pipeline" \ 
 --type CONDITIONAL \ 
 --predicate '{"Logical":"ANY","Conditions":[{"CrawlerName":"raw-data-crawler","CrawlState":"SUCCEEDED"}]}' \ 
 --actions '[{"JobName":"clean-transform-job"}]' \ 
 --start-on-creation 
 
# Add final crawler 
aws glue create-trigger \ 
 --name "final-catalog-update" \ 
 --workflow-name "customer-etl-pipeline" \ 
 --type CONDITIONAL \ 
 --predicate '{"Logical":"ANY","Conditions":[{"JobName":"clean-transform-job","State":"SUCCEEDED"}]}' \ 
 --actions '[{"CrawlerName":"processed-data-crawler"}]' \ 
 --start-on-creation 
 

 

🔚 Summary of Glue Workflow Use 

  • Create Workflow: Acts as the container

  • Create Start Trigger: Defines how the pipeline starts (manual/scheduled)

  • Add Crawler Actions: Catalog raw/processed data

  • Add Job Actions: Transform data

  • Add Conditional Triggers: Build dependencies between steps

  • Monitor DAG: Review the end-to-end execution flow

 

 
