Wednesday, 25 June 2025

AWS Glue - Interview Questions

 

🔰 Basic Level (1–15)

  1. What is AWS Glue?

    • Serverless data integration service to discover, catalog, clean, transform, and load data.

  2. What is the Glue Data Catalog?

    • Central metadata repository used by Glue, Athena, Redshift Spectrum, etc.

  3. What is a Glue Crawler?

    • Service that scans data stores (like S3) and automatically infers schemas to create/update tables in the Data Catalog.

  4. What are the programming languages supported by AWS Glue jobs?

    • Python (PySpark) and Scala for Spark ETL jobs, plus Python shell jobs for lightweight scripts.

  5. What types of data formats does Glue support?

    • CSV, JSON, Parquet, Avro, ORC, XML (via custom classifier).

  6. What is a DynamicFrame in Glue?

    • AWS Glue abstraction over Spark DataFrames optimized for schema handling and transformations.

  7. What is the difference between DynamicFrame and DataFrame?

    • DynamicFrames handle inconsistent or evolving schemas per record and ship with Glue-specific transforms; DataFrames require a fixed schema but get full Spark Catalyst optimization, so they are typically faster. Converting between the two is cheap, as sketched below.
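
A minimal PySpark sketch of moving between the two abstractions (database and table names are hypothetical):

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a catalog table as a DynamicFrame (schema conflicts tolerated per record).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)

# Drop to a Spark DataFrame for SQL-style work, then wrap back into a
# DynamicFrame to use Glue transforms and writers.
df = dyf.toDF().filter("amount > 100")
dyf_filtered = DynamicFrame.fromDF(df, glue_context, "dyf_filtered")
```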

  8. How do you schedule a Glue job?

    • Using Glue triggers (on a schedule or on job/crawler completion) or Amazon EventBridge.
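
For example, a scheduled trigger can be created with boto3 (job and trigger names are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Run a hypothetical job daily at 02:00 UTC using Glue's cron syntax.
glue.create_trigger(
    Name="daily-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "example-etl-job"}],
    StartOnCreation=True,
)
```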

  9. What is the use of Glue Dev Endpoint?

    • Provides a long-running development environment for interactively authoring and testing ETL scripts (e.g., from Jupyter or Zeppelin notebooks); now largely superseded by Glue interactive sessions.

  10. What is Glue Studio?

    • A visual ETL development interface to create and monitor jobs without deep coding.

  11. What is a Glue trigger?

    • A way to automatically start jobs or workflows based on time, events, or job/crawler completion.

  12. How do you register a new data source in Glue?

    • Use Glue Crawlers or manually add it to the Data Catalog.

  13. What is a partition in Glue?

    • Logical sub-division of a table (like year, month) to optimize query performance.

  14. What is the default retry policy for a Glue job?

    • By default a Glue job is not retried (MaxRetries = 0); you can configure the number of automatic retries per job, and a failed run is simply re-executed, with no exponential backoff.

  15. What is the difference between s3:// and s3a:// in Glue?

    • s3:// in Glue uses the AWS-managed connector (EMRFS-style) that Glue optimizes for; s3a:// is the open-source Hadoop S3A filesystem connector. In Glue jobs, s3:// is the recommended scheme.


🔁 Intermediate Level (16–30)

  16. How do you perform schema evolution in Glue?

    • Enable schema updates in Crawlers, or manage changes programmatically in jobs.

  17. How does Glue interact with Lake Formation?

    • Glue jobs use the Data Catalog governed by Lake Formation for fine-grained access control.

  18. How do you create and manage Glue Workflows?

    • Use the console or CLI to orchestrate multiple crawlers, jobs, and triggers.

  19. Explain Glue job bookmarking.

    • Lets Glue persist state about previously processed data so reruns pick up only new input; enabled per job (--job-bookmark-enable) and tracked through each source's transformation_ctx, as sketched below.
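
A minimal sketch, assuming bookmarks are enabled on the job and hypothetical catalog names:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Bookmark state is keyed on transformation_ctx, so each bookmarked
# source needs a stable, unique value.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="example_events",
    transformation_ctx="read_events",
)

# ... transforms and writes ...

job.commit()  # persists bookmark state so the next run skips processed data
```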

  20. What are Glue Connections?

    • Configuration to connect to JDBC databases, data warehouses, or data lakes.

  21. How can you run parallel jobs in Glue?

    • Create multiple triggers in a workflow or run jobs asynchronously with StartJobRun.
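
Since StartJobRun is asynchronous, several runs can be launched in a loop, subject to the job's Max concurrency setting (job name and the --process_date argument are illustrative):

```python
import boto3

glue = boto3.client("glue")

# Kick off one run per date partition; each call returns immediately.
for process_date in ["2025-06-23", "2025-06-24", "2025-06-25"]:
    run = glue.start_job_run(
        JobName="example-etl-job",
        Arguments={"--process_date": process_date},
    )
    print(process_date, run["JobRunId"])
```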

  22. Can you use external tables in Glue Data Catalog?

    • Yes, Glue can reference external data sources like S3, RDS, Redshift.

  23. How do you handle semi-structured data like JSON in Glue?

    • Use Relationalize transformation to flatten nested data structures.
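
A sketch, assuming nested JSON under a hypothetical S3 prefix:

```python
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read nested JSON from S3 as a DynamicFrame.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/raw/orders/"]},
    format="json",
)

# Relationalize flattens structs and pivots arrays into child tables,
# returning a DynamicFrameCollection keyed by table name.
tables = Relationalize.apply(
    frame=dyf, staging_path="s3://example-bucket/tmp/", name="root"
)
orders_flat = tables.select("root")  # the flattened top-level table
```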

  24. What is the max number of concurrent Glue jobs per account?

    • Each job defaults to 1 concurrent run (raised via the job's Max concurrency setting); the account-wide cap on concurrent job runs is a soft quota that AWS Support can raise.

  25. What are the key differences between Spark and Python Shell jobs in Glue?

    • Spark jobs support distributed processing; Python shell jobs are used for lightweight scripting tasks (e.g., data validation).

  26. How do you optimize Glue job performance?

    • Use partitioning, bucketing, filtering early, broadcast joins, and job bookmarks.
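
As one example of filtering early, a push_down_predicate prunes partitions at read time so only matching S3 prefixes are listed and loaded (catalog names are hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only the year=2025/month=06 partitions are scanned.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="example_events",
    push_down_predicate="year == '2025' AND month == '06'",
)
```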

  27. What is the default memory and DPUs (Data Processing Units) in Glue?

    • 1 DPU = 4 vCPUs and 16 GB of memory; Spark jobs require a minimum of 2 DPUs and are allocated 10 DPUs by default.

  28. How do you manage job dependencies in Glue?

    • Using workflows with triggers and conditions.

  29. What is Glue Streaming?

    • Real-time ETL from streaming sources such as Kinesis Data Streams and Apache Kafka (including Amazon MSK), built on Spark Structured Streaming.

  30. How do you debug a Glue job failure?

    • Check job logs in CloudWatch, enable job metrics, review stack traces and bookmark errors.
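
For instance, continuous CloudWatch logging and job metrics can be switched on through the job's special parameters; a boto3 sketch with placeholder role and script values:

```python
import boto3

glue = boto3.client("glue")

# Enable continuous CloudWatch logs and job metrics on an existing job.
glue.update_job(
    JobName="example-etl-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/example-glue-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/etl_job.py",
        },
        "DefaultArguments": {
            "--enable-continuous-cloudwatch-log": "true",
            "--enable-metrics": "true",
        },
    },
)
```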


🚀 Advanced Level (31–40)

  31. Explain how to secure Glue data access with Lake Formation and IAM.

    • IAM controls resource access; Lake Formation governs data-level access (row, column, table).

  32. How do you implement row-level security in Glue with Lake Formation?

    • Create data filters with row-filter expressions in Lake Formation and grant them to principals through data lake permissions.

  33. How would you manage schema evolution in a production Glue pipeline?

    • Track schema versions in the Glue Schema Registry or through Data Catalog table versions, and keep job scripts under version control alongside the schemas.

  34. How can you use version control with Glue Jobs?

    • Store PySpark scripts in Git, and use CI/CD pipelines to update Glue jobs via SDK/CLI.
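
A deployment step such a pipeline might run after tests pass (bucket, keys, and names are assumptions):

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Publish the versioned script to S3, then point the Glue job at it.
s3.upload_file("jobs/etl_job.py", "example-artifact-bucket", "glue/etl_job.py")
glue.update_job(
    JobName="example-etl-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/example-glue-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-artifact-bucket/glue/etl_job.py",
        },
    },
)
```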

  35. Can Glue Jobs read from multiple sources and write to multiple targets?

    • Yes, Spark-based jobs can support multi-source multi-target ETL flows.

  36. What are Glue Blueprints?

    • Predefined ETL job templates that automate tasks like data ingestion from RDS to S3.

  37. How can you monitor Glue job performance over time?

    • Use CloudWatch metrics, Glue job run history, and Athena to analyze job logs.

  38. How do you enforce data lineage and auditing in Glue?

    • Use Glue job history, CloudTrail, Lake Formation audit logs, and add metadata tracking to ETL scripts.

  39. What are Glue custom classifiers?

    • User-defined classifiers to infer schema for data not supported natively (e.g., XML, log formats).

  40. Explain how to build a CDC (Change Data Capture) pipeline using Glue.

    • Extract delta/changed data using bookmarks or source timestamps → transform → load into versioned/partitioned target → catalog updates.
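
A minimal sketch of the extract-and-load half, assuming an updated_at watermark column and hypothetical names (in practice the watermark would be persisted externally, e.g. in DynamoDB or SSM):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())

last_watermark = "2025-06-24 00:00:00"  # loaded from external state in practice

# Pull only rows changed since the last run.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db", table_name="public_orders"
)
delta = dyf.toDF().filter(F.col("updated_at") > F.lit(last_watermark))

# Land the delta into a date-partitioned target for downstream cataloging.
(delta.withColumn("ingest_date", F.current_date())
      .write.mode("append")
      .partitionBy("ingest_date")
      .parquet("s3://example-bucket/cdc/orders/"))
```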


🛠️ AWS Glue + RDS/Aurora Integration - Interview Questions (41–50)


41. How do you connect AWS Glue to an RDS or Aurora database?

Answer:
You create a Glue Connection of type JDBC using the database endpoint, port, database name, and credentials. You also need to ensure:

  • The Glue job is in the same VPC/subnet/security group as the RDS instance.

  • The database allows inbound access from Glue via the selected security group.

  • If using Aurora Serverless, the connection's subnets must be able to route to the cluster endpoint and the job role needs the required IAM permissions (a boto3 sketch follows this list).
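
A boto3 sketch of such a connection; every identifier below (endpoint, subnet, security group, secret name) is a placeholder:

```python
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "aurora-pg-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": (
                "jdbc:postgresql://example.cluster-abc123.us-east-1"
                ".rds.amazonaws.com:5432/appdb"
            ),
            "SECRET_ID": "example/aurora/credentials",  # Secrets Manager secret
        },
        # Networking so Glue can reach the database inside its VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```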


42. What types of JDBC connections are supported by AWS Glue for RDS and Aurora?

Answer:

  • PostgreSQL

  • MySQL

  • MariaDB

  • Oracle

  • SQL Server

Aurora is MySQL- and PostgreSQL-compatible, so it works with the corresponding JDBC drivers.


43. What permissions are required to access RDS from Glue?

Answer:
The IAM role used by Glue needs permissions along these lines:

```json
{
  "Effect": "Allow",
  "Action": [
    "glue:*",
    "rds:DescribeDBInstances",
    "secretsmanager:GetSecretValue",
    "ec2:DescribeSubnets",
    "ec2:DescribeSecurityGroups"
  ],
  "Resource": "*"
}
```

Also, if credentials are stored in Secrets Manager, Glue must be granted access to that secret.


44. How do you create a Glue connection to Aurora using the console?

Answer:

  1. Go to AWS Glue → Connections → Add connection

  2. Choose JDBC

  3. JDBC URL example for Aurora PostgreSQL:

    ```
    jdbc:postgresql://<aurora-endpoint>:5432/<dbname>
    ```
  4. Enter database username and password (or use Secrets Manager)

  5. Select VPC, subnet, and security group that can reach Aurora


45. How can you catalog RDS/Aurora tables using Glue?

Answer:

  • After creating the connection, configure a Glue Crawler:

    • Data source: JDBC

    • Connection: select the one pointing to Aurora

    • Include/exclude specific schemas/tables

    • Target database in Glue Catalog

  • When run, the crawler introspects the JDBC source and creates catalog tables (see the boto3 sketch below).
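
The same setup expressed with boto3 (role, names, and the path filter are assumptions):

```python
import boto3

glue = boto3.client("glue")

# Crawl the Aurora connection and land table definitions in the catalog.
glue.create_crawler(
    Name="aurora-pg-crawler",
    Role="arn:aws:iam::123456789012:role/example-glue-role",
    DatabaseName="rds_catalog_db",
    Targets={
        "JdbcTargets": [{
            "ConnectionName": "aurora-pg-connection",
            "Path": "appdb/public/%",  # database/schema/table filter
        }]
    },
)
glue.start_crawler(Name="aurora-pg-crawler")
```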


46. Can you read from RDS and write to S3 in the same Glue job?

Answer:
Yes. You can extract data from RDS (via JDBC) and write to S3 in Parquet/CSV/JSON format.

```python
# Example snippet: read the cataloged RDS table, write Parquet to S3.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="rds_catalog_db",
    table_name="public_customers"
)

glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/customers/"},
    format="parquet"
)
```

47. What are the best practices when connecting Glue to RDS?

Answer:

  • Enable encryption and restrict Glue IAM role permissions.

  • Use Secrets Manager for credentials rather than hardcoded values (see the sketch after this list).

  • Enable SSL connections (e.g., append ?ssl=true for PostgreSQL or ?useSSL=true for MySQL to the JDBC URL).

  • Use narrow security groups and VPC endpoints for network security.

  • Use Glue job bookmarks if doing incremental fetch.
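
A minimal sketch of the Secrets Manager lookup (secret name and JSON keys are assumptions):

```python
import json

import boto3

# Resolve database credentials at runtime instead of hardcoding them.
secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="example/aurora/credentials")["SecretString"]
)
db_user, db_password = secret["username"], secret["password"]
```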


48. How can you handle schema drift when reading from RDS using Glue?

Answer:

  • Use Crawlers to regularly update Glue Catalog schema.

  • In jobs, use DynamicFrames, which tolerate schema changes (e.g., via resolveChoice, sketched below).

  • Consider schema versioning or backups of previous schema definitions.
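
For example, resolveChoice can pin down a column whose type drifted between runs (names are hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db", table_name="public_customers"
)

# A customer_id column seen as both int and string in different runs
# is cast to a single type so downstream writes stay stable.
resolved = dyf.resolveChoice(specs=[("customer_id", "cast:string")])
```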


49. Can you connect Glue to RDS in a different AWS account?

Answer:
Yes, but requires:

  • VPC Peering or Transit Gateway to route traffic.

  • Security group rules allowing cross-account access.

  • The Glue job's IAM role in Account A must be granted access through a resource policy on the secret in Account B when using cross-account Secrets Manager.

  • Catalog sharing via Lake Formation Resource Links (optional).


50. What are the limitations of using Glue with Aurora?

Answer:

  • High-concurrency Aurora queries from Glue may cause load spikes.

  • JDBC query timeouts or throttling can occur on large datasets.

  • No automatic schema change notification from Aurora to Glue.

  • Aurora Serverless has scaling delay, which may impact latency-sensitive jobs.
