Prerequisites
1. Python + PySpark Environment
Install the necessary packages:
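A typical setup, assuming a pip-managed local Spark (pin the version to match your cluster and Iceberg runtime):

```bash
# PySpark ships with an embedded Spark; choose a version your Iceberg runtime supports
pip install pyspark
```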
You’ll also need the Iceberg Spark runtime JARs. You can get them from Maven Central.
Or, use the Spark `--packages` option when running the job.
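For example, an illustrative `spark-submit` invocation that pulls the Iceberg runtime and AWS bundle from Maven Central (the coordinates and versions shown are assumptions; match them to your Spark and Scala build):

```bash
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2 \
  create_iceberg_spark.py
```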
2. AWS Setup
- ✅ S3 bucket created (e.g., `s3://my-data-lake/iceberg/`)
- ✅ IAM role with access to S3, Glue, and Lake Formation
- ✅ AWS Glue database created: `iceberg_demo`
- ✅ Lake Formation permissions granted, if enabled
🧊 Step-by-Step Iceberg Creation Using PySpark
📁 File: create_iceberg_spark.py
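A minimal sketch of such a script, assuming the bucket and Glue database from the prerequisites and the sample rows queried later via Athena; the catalog settings are Iceberg's standard Glue-catalog configuration:

```python
# create_iceberg_spark.py -- sketch; Iceberg JARs are supplied via --packages
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("create_iceberg_spark")
    # Register a Glue-backed Iceberg catalog named "glue_catalog"
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-data-lake/iceberg/")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    # Enable Iceberg SQL extensions (MERGE INTO, time travel, etc.)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Create the Iceberg table inside the Glue database iceberg_demo
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.iceberg_demo.customer (
        customer_id INT,
        name        STRING,
        city        STRING
    ) USING iceberg
""")

# Insert the sample rows we will read back through Athena
spark.sql("""
    INSERT INTO glue_catalog.iceberg_demo.customer VALUES
        (1, 'Alice',   'NY'),
        (2, 'Bob',     'LA'),
        (3, 'Charlie', 'Chicago')
""")

spark.stop()
```

Naming the catalog `glue_catalog` is what routes table metadata into AWS Glue instead of a local Hive metastore.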
📂 What Happens in S3?
After this runs, S3 will contain:
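A typical layout under the warehouse path (exact file names vary per run and snapshot):

```
s3://my-data-lake/iceberg/iceberg_demo.db/customer/
├── data/
│   └── 00000-0-<uuid>.parquet       # Parquet data files
└── metadata/
    ├── 00000-<uuid>.metadata.json   # table metadata: schema, partition spec, snapshots
    ├── snap-<id>-<uuid>.avro        # manifest list for a snapshot
    └── <uuid>-m0.avro               # manifest tracking individual data files
```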
These files are crucial to Iceberg's table versioning and query planning.
🔗 Glue Catalog Entry
A new table called `customer` will appear under:

- Database: `iceberg_demo`
- Table Type: EXTERNAL
- Format: ICEBERG

This happens because we used the `glue_catalog` catalog with Iceberg's AWS support.
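You can verify the entry without opening the console; for example, with the AWS CLI:

```bash
# Fetch the Glue table definition; its Parameters should include table_type = ICEBERG
aws glue get-table --database-name iceberg_demo --name customer
```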
🔍 Query via Athena
- Go to Athena
- Choose Data source: `AwsDataCatalog`
- Database: `iceberg_demo`
- Run the query shown below
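For instance, a plain scan of the new table:

```sql
SELECT * FROM iceberg_demo.customer;
```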
You will see:
| customer_id | name | city |
|---|---|---|
| 1 | Alice | NY |
| 2 | Bob | LA |
| 3 | Charlie | Chicago |
🛡️ Lake Formation Permissions (Recap)
In Lake Formation → Permissions:

- Grant `user2`: SELECT, INSERT, DELETE
- Grant `user1`: SELECT only
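The same grants can be scripted; a sketch using the AWS CLI, with placeholder account ID and principal ARNs:

```bash
# Grant user2 read/write on the customer table
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/user2 \
  --permissions SELECT INSERT DELETE \
  --resource '{"Table": {"DatabaseName": "iceberg_demo", "Name": "customer"}}'

# Grant user1 read-only access
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/user1 \
  --permissions SELECT \
  --resource '{"Table": {"DatabaseName": "iceberg_demo", "Name": "customer"}}'
```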
🔄 Summary
| Component | Value |
|---|---|
| Engine | Apache Spark + Iceberg |
| Catalog | AWS Glue Catalog |
| Table Format | Apache Iceberg (Parquet) |
| Storage | S3 (`s3://my-data-lake/iceberg/`) |
| Query Engine | Athena |