Prerequisites
1. Python + PySpark Environment
Install the necessary packages:
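A minimal install, assuming a recent Python; the pyspark package bundles Spark itself, so no separate Spark download is needed:

```bash
pip install pyspark
```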
You’ll also need the Iceberg Spark runtime JAR, which you can download from Maven Central or pull in at launch with Spark’s --packages option.
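For example (the exact coordinates here are an assumption; match the runtime artifact to your Spark and Scala versions):

```bash
# iceberg-spark-runtime provides the Spark integration;
# iceberg-aws-bundle provides the AWS SDK classes the Glue catalog needs.
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2 \
  create_iceberg_spark.py
```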
2. AWS Setup
- ✅ S3 bucket created (e.g., s3://my-data-lake/iceberg/)
- ✅ IAM role with access to S3, Glue, and Lake Formation
- ✅ AWS Glue database created: iceberg_demo (a CLI one-liner for this follows the list)
- ✅ Lake Formation permissions granted, if enabled
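If the iceberg_demo database doesn’t exist yet, one way to create it (a sketch using the standard AWS CLI; the Glue console works just as well):

```bash
aws glue create-database --database-input '{"Name": "iceberg_demo"}'
```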
🧊 Step-by-Step Iceberg Creation Using PySpark
📁 File: create_iceberg_spark.py
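The script itself didn’t survive into this copy, so here is a minimal sketch of what it likely contains, assuming the bucket, catalog name, database, and sample rows used throughout this post:

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog named "glue_catalog" backed by AWS Glue,
# with table data and metadata stored under the S3 warehouse path.
spark = (
    SparkSession.builder
    .appName("create_iceberg_spark")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-data-lake/iceberg/")
    .getOrCreate()
)

# Create the Iceberg table in the Glue database.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.iceberg_demo.customer (
        customer_id INT,
        name        STRING,
        city        STRING
    )
    USING iceberg
""")

# Insert the sample rows queried later via Athena.
spark.sql("""
    INSERT INTO glue_catalog.iceberg_demo.customer VALUES
        (1, 'Alice',   'NY'),
        (2, 'Bob',     'LA'),
        (3, 'Charlie', 'Chicago')
""")

spark.stop()
```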
📂 What Happens in S3?
After this runs, S3 will contain:
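Roughly the following layout (the hash-like file names are generated by Iceberg and will differ on every run):

```
s3://my-data-lake/iceberg/iceberg_demo.db/customer/
├── data/
│   └── 00000-...-00001.parquet    # Parquet data files
└── metadata/
    ├── 00000-....metadata.json    # table metadata: schema, partition spec, snapshots
    ├── snap-....avro              # manifest list, one per snapshot
    └── ....avro                   # manifest files tracking individual data files
```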
These files are crucial to Iceberg's table versioning and query planning.
🔗 Glue Catalog Entry
A new table called customer will appear under:
- Database: iceberg_demo
- Table Type: EXTERNAL
- Format: ICEBERG
This happens because we registered glue_catalog with Iceberg’s GlueCatalog implementation, which writes the table definition into the Glue Data Catalog automatically.
🔍 Query via Athena
- Go to Athena
- Choose Data source: AwsDataCatalog
- Choose Database: iceberg_demo
- Run the query below:
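The query itself was elided above; a straightforward one that produces the result shown next is:

```sql
SELECT * FROM iceberg_demo.customer;
```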
You will see:
| customer_id | name | city |
|---|---|---|
| 1 | Alice | NY |
| 2 | Bob | LA |
| 3 | Charlie | Chicago |
🛡️ Lake Formation Permissions (Recap)
- In Lake Formation → Permissions (equivalent CLI commands are sketched after this list):
  - Grant user2: SELECT, INSERT, DELETE
  - Grant user1: SELECT only
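For reference, a sketch of the same grants via the AWS CLI, assuming hypothetical IAM users user1 and user2 in a placeholder account:

```bash
# Placeholder ARNs; substitute your own account ID and principals.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/user2 \
  --permissions SELECT INSERT DELETE \
  --resource '{"Table": {"DatabaseName": "iceberg_demo", "Name": "customer"}}'

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/user1 \
  --permissions SELECT \
  --resource '{"Table": {"DatabaseName": "iceberg_demo", "Name": "customer"}}'
```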
🔄 Summary
| Component | Value |
|---|---|
| Engine | Apache Spark + Iceberg |
| Catalog | AWS Glue Catalog |
| Table Format | Apache Iceberg (Parquet) |
| Storage | S3 (s3://my-data-lake/iceberg/) |
| Query Engine | Athena |