🔰 Basic Level (1–15)
1. What is AWS Glue?
✅ Answer:
Serverless data integration service to discover, catalog, clean, transform, and load data.
2. What is the Glue Data Catalog?
✅ Answer:
Central metadata repository used by Glue, Athena, Redshift Spectrum, etc.
3. What is a Glue Crawler?
✅ Answer:
Service that scans data stores (such as S3) and automatically infers schemas to create or update tables in the Data Catalog.
4. What programming languages do AWS Glue jobs support?
✅ Answer:
PySpark (Python) and Scala for Spark jobs, plus Python shell jobs for lightweight scripts.
5. What data formats does Glue support?
✅ Answer:
CSV, JSON, Parquet, Avro, ORC, and XML (via a custom classifier).
6. What is a DynamicFrame in Glue?
✅ Answer:
An AWS Glue abstraction over Spark DataFrames, optimized for schema handling and ETL transformations.
7. What is the difference between a DynamicFrame and a DataFrame?
✅ Answer:
DynamicFrames support schema inference, changing schemas, and built-in transformations; DataFrames are faster but less flexible.
8. How do you schedule a Glue job?
✅ Answer:
Using Glue triggers or Amazon EventBridge.
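A scheduled trigger can also be created through the Glue API. The sketch below assumes a nightly schedule and a made-up job name; the client is passed in as a parameter so the call can be exercised against a stub before pointing it at a real `boto3.client("glue")`:

```python
def create_nightly_trigger(glue_client, job_name):
    """Create a SCHEDULED trigger that starts job_name every night at 02:00 UTC."""
    resp = glue_client.create_trigger(
        Name=f"{job_name}-nightly",        # hypothetical naming convention
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",      # Glue uses the EventBridge cron syntax
        Actions=[{"JobName": job_name}],
        StartOnCreation=True,
    )
    return resp["Name"]

# In a real environment: create_nightly_trigger(boto3.client("glue"), "my-etl-job")
```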
9. What is the use of a Glue Dev Endpoint?
✅ Answer:
Provides an interactive (e.g., Jupyter notebook) environment for developing and testing ETL scripts.
10. What is Glue Studio?
✅ Answer:
A visual ETL development interface to create and monitor jobs without deep coding.
11. What is a Glue trigger?
✅ Answer:
A way to automatically start jobs or workflows based on a schedule, events, or job/crawler completion.
12. How do you register a new data source in Glue?
✅ Answer:
Run a Glue Crawler against it, or add the table manually to the Data Catalog.
13. What is a partition in Glue?
✅ Answer:
A logical subdivision of a table (e.g., by year and month) that lets queries scan less data and run faster.
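Partitioned datasets use one directory per key value (Hive-style layout). A minimal sketch of how such paths are composed; the bucket and table names are made up:

```python
def partition_path(base, **keys):
    """Build a Hive-style partitioned S3 path, e.g. .../year=2024/month=05/."""
    parts = "/".join(f"{k}={v}" for k, v in keys.items())
    return f"{base.rstrip('/')}/{parts}/"

print(partition_path("s3://my-bucket/sales", year=2024, month="05"))
# → s3://my-bucket/sales/year=2024/month=05/
```

Engines like Athena and Spark can then prune whole directories when a query filters on `year` or `month`.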
14. What is the default retry policy for a Glue job?
✅ Answer:
By default a failed job run is not retried (the MaxRetries property defaults to 0); you can configure MaxRetries per job to retry failed runs automatically.
15. What is the difference between s3:// and s3a:// in Glue?
✅ Answer:
s3:// uses the AWS-native S3 client and is the scheme Glue (and EMR's EMRFS) expect; s3a:// is the open-source Hadoop S3A connector used by vanilla Hadoop/Spark clusters. Glue jobs should normally use s3:// paths.
🔁 Intermediate Level (16–30)
16. How do you perform schema evolution in Glue?
✅ Answer:
Enable schema update behavior in Crawlers, or manage schema changes programmatically inside jobs.
17. How does Glue interact with Lake Formation?
✅ Answer:
Glue jobs use the Data Catalog governed by Lake Formation, which enforces fine-grained access control.
18. How do you create and manage Glue Workflows?
✅ Answer:
Use the console, CLI, or API to orchestrate multiple crawlers, jobs, and triggers as a single workflow.
19. Explain Glue job bookmarking.
✅ Answer:
Bookmarks let Glue track previously processed data (a form of checkpointing) so incremental runs avoid reprocessing it.
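Bookmarks are switched on per run with the `--job-bookmark-option` job argument (inside the script, each source also needs a `transformation_ctx`). A sketch of starting a bookmarked run; the client is injected so the call can be stubbed, and the job name is hypothetical:

```python
def start_bookmarked_run(glue_client, job_name):
    """Start a job run with bookmarks enabled so already-processed data is skipped."""
    resp = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )
    return resp["JobRunId"]
```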
20. What are Glue Connections?
✅ Answer:
Stored configuration (endpoint, credentials, VPC settings) for connecting to JDBC databases, data warehouses, or data lakes.
21. How can you run parallel jobs in Glue?
✅ Answer:
Create multiple triggers in a workflow, or start runs asynchronously with StartJobRun (raising the job's MaxConcurrentRuns if needed).
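StartJobRun returns as soon as the run is queued, so calling it in a loop fans out concurrent runs. A sketch that starts one run per partition; `--partition` is a hypothetical job parameter and the client is injected for stubbing:

```python
def start_partition_runs(glue_client, job_name, partitions):
    """Fire one asynchronous run per partition; the runs execute concurrently
    (up to the job's MaxConcurrentRuns setting)."""
    run_ids = []
    for part in partitions:
        resp = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--partition": part},  # hypothetical job parameter
        )
        run_ids.append(resp["JobRunId"])
    return run_ids
```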
22. Can you use external tables in the Glue Data Catalog?
✅ Answer:
Yes; catalog tables can reference external data sources such as S3, RDS, and Redshift.
23. How do you handle semi-structured data like JSON in Glue?
✅ Answer:
Use the Relationalize transformation to flatten nested data structures into relational tables.
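Relationalize flattens nested fields into dotted top-level columns (and pivots arrays out into separate tables). The flattening step behaves conceptually like this plain-Python sketch, which is an illustration, not the Glue implementation:

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

print(flatten({"id": 1, "address": {"city": "Pune", "zip": "411001"}}))
# → {'id': 1, 'address.city': 'Pune', 'address.zip': '411001'}
```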
24. What is the maximum number of concurrent Glue job runs?
✅ Answer:
Per job, MaxConcurrentRuns defaults to 1; account-wide concurrency is a soft service quota that can be raised via AWS Support.
25. What are the key differences between Spark and Python shell jobs in Glue?
✅ Answer:
Spark jobs run distributed processing across a cluster; Python shell jobs run single-node, lightweight scripting tasks (e.g., data validation).
26. How do you optimize Glue job performance?
✅ Answer:
Use partitioning, pushdown predicates and early filtering, broadcast joins for small tables, and job bookmarks for incremental loads.
27. What are the default memory and DPU (Data Processing Unit) settings in Glue?
✅ Answer:
1 DPU provides 4 vCPUs and 16 GB of memory; Glue 2.0 Spark jobs require a minimum of 2 DPUs.
28. How do you manage job dependencies in Glue?
✅ Answer:
Use workflows with triggers and conditions (e.g., start job B only after job A succeeds).
29. What is Glue Streaming?
✅ Answer:
Continuously running ETL jobs that consume from streaming sources such as Kinesis Data Streams or Kafka.
30. How do you debug a Glue job failure?
✅ Answer:
Check the job's CloudWatch logs, enable job metrics, and review stack traces and bookmark state.
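The run state and error message are also available from the GetJobRun API, which is handy for automated triage. A sketch with the client injected so it can be stubbed; job and run IDs are placeholders:

```python
def last_run_error(glue_client, job_name, run_id):
    """Fetch a run's state and error message for triage; the full stack trace
    still lives in the run's CloudWatch log streams."""
    run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
    return run["JobRunState"], run.get("ErrorMessage", "")
```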
🚀 Advanced Level (31–40)
31. Explain how to secure Glue data access with Lake Formation and IAM.
✅ Answer:
IAM controls access to AWS resources and API actions; Lake Formation governs data-level access (table, column, and row permissions).
32. How do you implement row-level security in Glue with Lake Formation?
✅ Answer:
Create row filters (data filters) in Lake Formation and grant them to principals via data lake permissions.
33. How would you manage schema evolution in a production Glue pipeline?
✅ Answer:
Track schema versions in the Data Catalog or the Glue Schema Registry, and version job scripts alongside the schemas they expect.
34. How can you use version control with Glue jobs?
✅ Answer:
Store ETL scripts in Git, and use a CI/CD pipeline to deploy them to Glue via the SDK/CLI (e.g., sync the script to S3 and call UpdateJob).
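A deployment step like that can be sketched with the Glue API. Note that UpdateJob replaces the whole job definition, so the sketch fetches the current definition and merges the new script location; the field filtering is a simplification, and the client is injected for stubbing:

```python
def deploy_script(glue_client, job_name, script_s3_uri):
    """Point an existing Glue job at a freshly published script version
    (e.g., called from CI/CD after the script is synced to S3)."""
    job = glue_client.get_job(JobName=job_name)["Job"]
    job["Command"]["ScriptLocation"] = script_s3_uri
    # Drop read-only/deprecated fields that UpdateJob does not accept
    # (simplified; real definitions may carry more such fields).
    update = {k: v for k, v in job.items()
              if k not in ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity")}
    return glue_client.update_job(JobName=job_name, JobUpdate=update)["JobName"]
```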
35. Can Glue jobs read from multiple sources and write to multiple targets?
✅ Answer:
Yes; Spark-based jobs support multi-source, multi-target ETL flows within a single script.
36. What are Glue Blueprints?
✅ Answer:
Parameterized ETL workflow templates that automate common patterns such as ingesting data from RDS to S3.
37. How can you monitor Glue job performance over time?
✅ Answer:
Use CloudWatch metrics, the Glue job run history, and Athena queries over job logs to analyze trends.
38. How do you enforce data lineage and auditing in Glue?
✅ Answer:
Combine Glue job run history, CloudTrail, and Lake Formation audit logs, and add metadata tracking to ETL scripts.
39. What are Glue custom classifiers?
✅ Answer:
User-defined classifiers that let crawlers infer schemas for formats not recognized natively (e.g., XML or custom log formats).
40. Explain how to build a CDC (Change Data Capture) pipeline using Glue.
✅ Answer:
Extract only changed data (via job bookmarks or source timestamps) → transform → load into a versioned/partitioned target → update the catalog.
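The timestamp-based extraction step above can be sketched in plain Python: keep rows newer than the previous run's high-water mark and persist the new mark for the next run. Field names and timestamp format are assumptions for illustration:

```python
def extract_delta(rows, high_water_mark):
    """Keep only rows changed after the previous run's high-water mark
    (an ISO-8601 timestamp string, so lexical comparison matches time order),
    and return the new mark to persist for the next run."""
    delta = [r for r in rows if r["updated_at"] > high_water_mark]
    new_mark = max((r["updated_at"] for r in delta), default=high_water_mark)
    return delta, new_mark
```

In a real Glue job the mark would be stored durably (e.g., in DynamoDB or S3) between runs, or replaced entirely by job bookmarks.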
🛠️ 🔗 AWS Glue + RDS/Aurora Integration — Interview Questions (41–50)
41. How do you connect AWS Glue to an RDS or Aurora database?
✅ Answer:
You create a Glue Connection of type JDBC using the database endpoint, port, database name, and credentials. You also need to ensure:
- The Glue job is in the same VPC/subnet/security group as the RDS instance.
- The database allows inbound access from Glue via the selected security group.
- If using Aurora Serverless v2, proper routing and IAM permissions are required.
42. What types of JDBC connections are supported by AWS Glue for RDS and Aurora?
✅ Answer:
- PostgreSQL
- MySQL
- MariaDB
- Oracle
- SQL Server

Aurora is compatible with MySQL or PostgreSQL and works with the respective JDBC drivers.
43. What permissions are required to access RDS from Glue?
✅ Answer:
The IAM role used by Glue must include a policy such as:

```json
{
  "Effect": "Allow",
  "Action": [
    "glue:*",
    "rds:DescribeDBInstances",
    "secretsmanager:GetSecretValue",
    "ec2:DescribeSubnets",
    "ec2:DescribeSecurityGroups"
  ],
  "Resource": "*"
}
```
Also, if credentials are stored in Secrets Manager, Glue must be granted access to that secret.
44. How do you create a Glue connection to Aurora using the console?
✅ Answer:
- Go to AWS Glue → Connections → Add connection
- Choose JDBC
- JDBC URL example for Aurora PostgreSQL: jdbc:postgresql://<aurora-endpoint>:5432/<dbname>
- Enter database username and password (or use Secrets Manager)
- Select VPC, subnet, and security group that can reach Aurora
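The same connection can be created through the CreateConnection API. A sketch with placeholder network IDs and a Secrets Manager reference instead of inline credentials (the SECRET_ID property and the exact values here are assumptions to verify against your setup); the client is injected so the call can be stubbed:

```python
def create_aurora_connection(glue_client, name, jdbc_url, secret_ref):
    """Create a JDBC connection for Aurora; credentials come from a
    Secrets Manager reference rather than being hardcoded."""
    glue_client.create_connection(
        ConnectionInput={
            "Name": name,
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": jdbc_url,
                "SECRET_ID": secret_ref,  # assumes secret-based auth is configured
            },
            "PhysicalConnectionRequirements": {
                "SubnetId": "subnet-0abc",           # placeholder
                "SecurityGroupIdList": ["sg-0abc"],  # placeholder
            },
        }
    )
    return name
```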
45. How can you catalog RDS/Aurora tables using Glue?
✅ Answer:
- After creating the connection, configure a Glue Crawler:
  - Data source: JDBC
  - Connection: select the one pointing to Aurora
  - Include/exclude specific schemas/tables
  - Target database in the Glue Catalog
- When run, the crawler introspects the JDBC source and creates catalog tables.
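The crawler setup above can also be scripted via the CreateCrawler API. The database name and include path below are hypothetical (JDBC include paths take the form database/schema/%), and the client is injected for stubbing:

```python
def create_jdbc_crawler(glue_client, name, role_arn, connection_name):
    """Crawl an Aurora schema through the JDBC connection and write the
    discovered tables into a catalog database (names are placeholders)."""
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName="rds_catalog_db",      # target catalog database
        Targets={"JdbcTargets": [{
            "ConnectionName": connection_name,
            "Path": "mydb/public/%",        # hypothetical schema include path
        }]},
    )
```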
46. Can you read from RDS and write to S3 in the same Glue job?
✅ Answer:
Yes. You can extract data from RDS (via JDBC) and write to S3 in Parquet/CSV/JSON format.
```python
# Example snippet: read the cataloged RDS table and write Parquet to S3
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="rds_catalog_db",
    table_name="public_customers"
)
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/customers/"},
    format="parquet"
)
```
47. What are the best practices when connecting Glue to RDS?
✅ Answer:
- Enable encryption and restrict Glue IAM role permissions.
- Use Secrets Manager for credentials, not hardcoded values.
- Enable SSL connections (by appending ?ssl=true to the JDBC URL).
- Use narrow security groups and VPC endpoints for network security.
- Use Glue job bookmarks if doing incremental fetch.
48. How can you handle schema drift when reading from RDS using Glue?
✅ Answer:
- Use Crawlers to regularly update the Glue Catalog schema.
- In jobs, use DynamicFrame (more forgiving of schema changes).
- Consider schema versioning or backups of previous schema definitions.
49. Can you connect Glue to RDS in a different AWS account?
✅ Answer:
Yes, but requires:
- VPC Peering or a Transit Gateway to route traffic.
- Security group rules allowing cross-account access.
- The Glue job IAM role in Account A must be trusted by Account B if using cross-account Secrets Manager.
- Catalog sharing via Lake Formation resource links (optional).
50. What are the limitations of using Glue with Aurora?
✅ Answer:
- High-concurrency Aurora queries from Glue may cause load spikes.
- JDBC query timeouts or throttling can occur on large datasets.
- No automatic schema change notification from Aurora to Glue.
- Aurora Serverless scaling delays may impact latency-sensitive jobs.