Wednesday, 25 June 2025

AWS Glue - Interview Questions

 

🔰 Basic Level (1–15)

  1. What is AWS Glue?

    • Serverless data integration service to discover, catalog, clean, transform, and load data.

  2. What is the Glue Data Catalog?

    • Central metadata repository used by Glue, Athena, Redshift Spectrum, etc.

  3. What is a Glue Crawler?

    • Service that scans data stores (like S3) and automatically infers schemas to create/update tables in the Data Catalog.

  4. What are the programming languages supported by AWS Glue jobs?

    • Python (PySpark) and Scala for Spark ETL jobs, plus Python shell jobs for lightweight scripts.

  5. What types of data formats does Glue support?

    • CSV, JSON, Parquet, Avro, ORC, XML (via custom classifier).

  6. What is a DynamicFrame in Glue?

    • AWS Glue abstraction over Spark DataFrames optimized for schema handling and transformations.

  7. What is the difference between DynamicFrame and DataFrame?

    • DynamicFrames handle inconsistent or evolving schemas per record and ship with Glue-specific transforms; DataFrames require a fixed schema but get full Spark Catalyst optimization, so they are typically faster. Converting between the two is cheap, as sketched below.
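
A minimal PySpark sketch of moving between the two abstractions (database and table names are hypothetical):

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a catalog table as a DynamicFrame (schema conflicts tolerated per record).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)

# Drop to a Spark DataFrame for SQL-style work, then wrap back into a
# DynamicFrame to use Glue transforms and writers.
df = dyf.toDF().filter("amount > 100")
dyf_filtered = DynamicFrame.fromDF(df, glue_context, "dyf_filtered")
```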

  8. How do you schedule a Glue job?

    • Using Glue triggers (on a schedule or on job/crawler completion) or Amazon EventBridge.
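
For example, a scheduled trigger can be created with boto3 (job and trigger names are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Run a hypothetical job daily at 02:00 UTC using Glue's cron syntax.
glue.create_trigger(
    Name="daily-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "example-etl-job"}],
    StartOnCreation=True,
)
```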

  9. What is the use of Glue Dev Endpoint?

    • Provides a long-running development environment for interactively authoring and testing ETL scripts (e.g., from Jupyter or Zeppelin notebooks); now largely superseded by Glue interactive sessions.

  10. What is Glue Studio?

    • A visual ETL development interface to create and monitor jobs without deep coding.

  11. What is a Glue trigger?

    • A way to automatically start jobs or workflows based on time, events, or job/crawler completion.

  12. How do you register a new data source in Glue?

    • Use Glue Crawlers or manually add it to the Data Catalog.

  13. What is a partition in Glue?

    • Logical sub-division of a table (like year, month) to optimize query performance.

  14. What is the default retry policy for a Glue job?

    • By default a Glue job is not retried (MaxRetries = 0); you can configure the number of automatic retries per job, and a failed run is simply re-executed, with no exponential backoff.

  15. What is the difference between s3:// and s3a:// in Glue?

    • s3:// in Glue uses the AWS-managed connector (EMRFS-style) that Glue optimizes for; s3a:// is the open-source Hadoop S3A filesystem connector. In Glue jobs, s3:// is the recommended scheme.


🔁 Intermediate Level (16–30)

  16. How do you perform schema evolution in Glue?

    • Enable schema updates in Crawlers, or manage changes programmatically in jobs.

  17. How does Glue interact with Lake Formation?

    • Glue jobs use the Data Catalog governed by Lake Formation for fine-grained access control.

  18. How do you create and manage Glue Workflows?

    • Use the console or CLI to orchestrate multiple crawlers, jobs, and triggers.

  19. Explain Glue job bookmarking.

    • Lets Glue persist state about previously processed data so reruns pick up only new input; enabled per job (--job-bookmark-enable) and tracked through each source's transformation_ctx, as sketched below.
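
A minimal sketch, assuming bookmarks are enabled on the job and hypothetical catalog names:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Bookmark state is keyed on transformation_ctx, so each bookmarked
# source needs a stable, unique value.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="example_events",
    transformation_ctx="read_events",
)

# ... transforms and writes ...

job.commit()  # persists bookmark state so the next run skips processed data
```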

  20. What are Glue Connections?

    • Configuration to connect to JDBC databases, data warehouses, or data lakes.

  21. How can you run parallel jobs in Glue?

    • Create multiple triggers in a workflow or run jobs asynchronously with StartJobRun.
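
Since StartJobRun is asynchronous, several runs can be launched in a loop, subject to the job's Max concurrency setting (job name and the --process_date argument are illustrative):

```python
import boto3

glue = boto3.client("glue")

# Kick off one run per date partition; each call returns immediately.
for process_date in ["2025-06-23", "2025-06-24", "2025-06-25"]:
    run = glue.start_job_run(
        JobName="example-etl-job",
        Arguments={"--process_date": process_date},
    )
    print(process_date, run["JobRunId"])
```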

  22. Can you use external tables in Glue Data Catalog?

    • Yes, Glue can reference external data sources like S3, RDS, Redshift.

  23. How do you handle semi-structured data like JSON in Glue?

    • Use Relationalize transformation to flatten nested data structures.
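
A sketch, assuming nested JSON under a hypothetical S3 prefix:

```python
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read nested JSON from S3 as a DynamicFrame.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/raw/orders/"]},
    format="json",
)

# Relationalize flattens structs and pivots arrays into child tables,
# returning a DynamicFrameCollection keyed by table name.
tables = Relationalize.apply(
    frame=dyf, staging_path="s3://example-bucket/tmp/", name="root"
)
orders_flat = tables.select("root")  # the flattened top-level table
```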

  24. What is the max number of concurrent Glue jobs per account?

    • Each job defaults to 1 concurrent run (raised via the job's Max concurrency setting); the account-wide cap on concurrent job runs is a soft quota that AWS Support can raise.

  25. What are the key differences between Spark and Python Shell jobs in Glue?

    • Spark jobs support distributed processing; Python shell jobs are used for lightweight scripting tasks (e.g., data validation).

  26. How do you optimize Glue job performance?

    • Use partitioning, bucketing, filtering early, broadcast joins, and job bookmarks.
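
As one example of filtering early, a push_down_predicate prunes partitions at read time so only matching S3 prefixes are listed and loaded (catalog names are hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only the year=2025/month=06 partitions are scanned.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="example_events",
    push_down_predicate="year == '2025' AND month == '06'",
)
```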

  27. What is the default memory and DPUs (Data Processing Units) in Glue?

    • 1 DPU = 4 vCPUs and 16 GB of memory; Spark jobs require a minimum of 2 DPUs and are allocated 10 DPUs by default.

  28. How do you manage job dependencies in Glue?

    • Using workflows with triggers and conditions.

  29. What is Glue Streaming?

    • Real-time ETL from streaming sources such as Kinesis Data Streams and Apache Kafka (including Amazon MSK), built on Spark Structured Streaming.

  30. How do you debug a Glue job failure?

    • Check job logs in CloudWatch, enable job metrics, review stack traces and bookmark errors.
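
For instance, continuous CloudWatch logging and job metrics can be switched on through the job's special parameters; a boto3 sketch with placeholder role and script values:

```python
import boto3

glue = boto3.client("glue")

# Enable continuous CloudWatch logs and job metrics on an existing job.
glue.update_job(
    JobName="example-etl-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/example-glue-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/etl_job.py",
        },
        "DefaultArguments": {
            "--enable-continuous-cloudwatch-log": "true",
            "--enable-metrics": "true",
        },
    },
)
```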


🚀 Advanced Level (31–40)

  31. Explain how to secure Glue data access with Lake Formation and IAM.

    • IAM controls resource access; Lake Formation governs data-level access (row, column, table).

  32. How do you implement row-level security in Glue with Lake Formation?

    • Create data filters with row-filter expressions in Lake Formation and grant them to principals through data lake permissions.

  33. How would you manage schema evolution in a production Glue pipeline?

    • Track schema versions in the Glue Schema Registry or through Data Catalog table versions, and keep job scripts under version control alongside the schemas.

  34. How can you use version control with Glue Jobs?

    • Store PySpark scripts in Git, and use CI/CD pipelines to update Glue jobs via SDK/CLI.
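
A deployment step such a pipeline might run after tests pass (bucket, keys, and names are assumptions):

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Publish the versioned script to S3, then point the Glue job at it.
s3.upload_file("jobs/etl_job.py", "example-artifact-bucket", "glue/etl_job.py")
glue.update_job(
    JobName="example-etl-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/example-glue-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-artifact-bucket/glue/etl_job.py",
        },
    },
)
```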

  35. Can Glue Jobs read from multiple sources and write to multiple targets?

    • Yes, Spark-based jobs can support multi-source multi-target ETL flows.

  36. What are Glue Blueprints?

    • Predefined ETL job templates that automate tasks like data ingestion from RDS to S3.

  37. How can you monitor Glue job performance over time?

    • Use CloudWatch metrics, Glue job run history, and Athena to analyze job logs.

  38. How do you enforce data lineage and auditing in Glue?

    • Use Glue job history, CloudTrail, Lake Formation audit logs, and add metadata tracking to ETL scripts.

  39. What are Glue custom classifiers?

    • User-defined classifiers to infer schema for data not supported natively (e.g., XML, log formats).

  40. Explain how to build a CDC (Change Data Capture) pipeline using Glue.

    • Extract delta/changed data using bookmarks or source timestamps → transform → load into versioned/partitioned target → catalog updates.
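
A minimal sketch of the extract-and-load half, assuming an updated_at watermark column and hypothetical names (in practice the watermark would be persisted externally, e.g. in DynamoDB or SSM):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())

last_watermark = "2025-06-24 00:00:00"  # loaded from external state in practice

# Pull only rows changed since the last run.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db", table_name="public_orders"
)
delta = dyf.toDF().filter(F.col("updated_at") > F.lit(last_watermark))

# Land the delta into a date-partitioned target for downstream cataloging.
(delta.withColumn("ingest_date", F.current_date())
      .write.mode("append")
      .partitionBy("ingest_date")
      .parquet("s3://example-bucket/cdc/orders/"))
```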


🛠️ AWS Glue + RDS/Aurora Integration - Interview Questions (41–50)


41. How do you connect AWS Glue to an RDS or Aurora database?

Answer:
You create a Glue Connection of type JDBC using the database endpoint, port, database name, and credentials. You also need to ensure:

  • The Glue job is in the same VPC/subnet/security group as the RDS instance.

  • The database allows inbound access from Glue via the selected security group.

  • If using Aurora Serverless, the connection's subnets must be able to route to the cluster endpoint and the job role needs the required IAM permissions (a boto3 sketch follows this list).
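
A boto3 sketch of such a connection; every identifier below (endpoint, subnet, security group, secret name) is a placeholder:

```python
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "aurora-pg-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": (
                "jdbc:postgresql://example.cluster-abc123.us-east-1"
                ".rds.amazonaws.com:5432/appdb"
            ),
            "SECRET_ID": "example/aurora/credentials",  # Secrets Manager secret
        },
        # Networking so Glue can reach the database inside its VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```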


42. What types of JDBC connections are supported by AWS Glue for RDS and Aurora?

Answer:

  • PostgreSQL

  • MySQL

  • MariaDB

  • Oracle

  • SQL Server

Aurora is MySQL- and PostgreSQL-compatible, so it works with the corresponding JDBC drivers.


43. What permissions are required to access RDS from Glue?

Answer:
The IAM role used by Glue needs permissions along these lines:

```json
{
  "Effect": "Allow",
  "Action": [
    "glue:*",
    "rds:DescribeDBInstances",
    "secretsmanager:GetSecretValue",
    "ec2:DescribeSubnets",
    "ec2:DescribeSecurityGroups"
  ],
  "Resource": "*"
}
```

Also, if credentials are stored in Secrets Manager, Glue must be granted access to that secret.


44. How do you create a Glue connection to Aurora using the console?

Answer:

  1. Go to AWS Glue → Connections → Add connection

  2. Choose JDBC

  3. JDBC URL example for Aurora PostgreSQL:

    ```
    jdbc:postgresql://<aurora-endpoint>:5432/<dbname>
    ```
  4. Enter database username and password (or use Secrets Manager)

  5. Select VPC, subnet, and security group that can reach Aurora


45. How can you catalog RDS/Aurora tables using Glue?

Answer:

  • After creating the connection, configure a Glue Crawler:

    • Data source: JDBC

    • Connection: select the one pointing to Aurora

    • Include/exclude specific schemas/tables

    • Target database in Glue Catalog

  • When run, the crawler introspects the JDBC source and creates catalog tables (see the boto3 sketch below).
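
The same setup expressed with boto3 (role, names, and the path filter are assumptions):

```python
import boto3

glue = boto3.client("glue")

# Crawl the Aurora connection and land table definitions in the catalog.
glue.create_crawler(
    Name="aurora-pg-crawler",
    Role="arn:aws:iam::123456789012:role/example-glue-role",
    DatabaseName="rds_catalog_db",
    Targets={
        "JdbcTargets": [{
            "ConnectionName": "aurora-pg-connection",
            "Path": "appdb/public/%",  # database/schema/table filter
        }]
    },
)
glue.start_crawler(Name="aurora-pg-crawler")
```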


46. Can you read from RDS and write to S3 in the same Glue job?

Answer:
Yes. You can extract data from RDS (via JDBC) and write to S3 in Parquet/CSV/JSON format.

```python
# Example snippet: read the cataloged RDS table, write Parquet to S3.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="rds_catalog_db",
    table_name="public_customers"
)

glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/customers/"},
    format="parquet"
)
```

47. What are the best practices when connecting Glue to RDS?

Answer:

  • Enable encryption and restrict Glue IAM role permissions.

  • Use Secrets Manager for credentials rather than hardcoded values (see the sketch after this list).

  • Enable SSL connections (e.g., append ?ssl=true for PostgreSQL or ?useSSL=true for MySQL to the JDBC URL).

  • Use narrow security groups and VPC endpoints for network security.

  • Use Glue job bookmarks if doing incremental fetch.
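
A minimal sketch of the Secrets Manager lookup (secret name and JSON keys are assumptions):

```python
import json

import boto3

# Resolve database credentials at runtime instead of hardcoding them.
secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="example/aurora/credentials")["SecretString"]
)
db_user, db_password = secret["username"], secret["password"]
```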


48. How can you handle schema drift when reading from RDS using Glue?

Answer:

  • Use Crawlers to regularly update Glue Catalog schema.

  • In jobs, use DynamicFrames, which tolerate schema changes (e.g., via resolveChoice, sketched below).

  • Consider schema versioning or backups of previous schema definitions.
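
For example, resolveChoice can pin down a column whose type drifted between runs (names are hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db", table_name="public_customers"
)

# A customer_id column seen as both int and string in different runs
# is cast to a single type so downstream writes stay stable.
resolved = dyf.resolveChoice(specs=[("customer_id", "cast:string")])
```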


49. Can you connect Glue to RDS in a different AWS account?

Answer:
Yes, but requires:

  • VPC Peering or Transit Gateway to route traffic.

  • Security group rules allowing cross-account access.

  • The Glue job's IAM role in Account A must be granted access through a resource policy on the secret in Account B when using cross-account Secrets Manager.

  • Catalog sharing via Lake Formation Resource Links (optional).


50. What are the limitations of using Glue with Aurora?

Answer:

  • High-concurrency Aurora queries from Glue may cause load spikes.

  • JDBC query timeouts or throttling can occur on large datasets.

  • No automatic schema change notification from Aurora to Glue.

  • Aurora Serverless has scaling delay, which may impact latency-sensitive jobs.
