Wednesday, 25 June 2025

Iceberg with S3

 

Prerequisites

1. Python + PySpark Environment

Install the necessary packages:

bash

pip install pyspark

You’ll also need the Iceberg Spark runtime JAR (iceberg-spark-runtime) on the classpath. You can download it from Maven Central, or pull it in with the Spark --packages option when submitting the job.
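As a sketch of the --packages route, assuming Spark 3.5 with Scala 2.12 and Iceberg 1.5.2 (adjust the versions to match your environment):

```shell
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2 \
  create_iceberg_spark.py
```

The iceberg-aws-bundle artifact bundles the AWS SDK classes that the Glue catalog and S3FileIO need.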


2. AWS Setup

  • ✅ S3 bucket created (e.g., s3://my-data-lake/iceberg/)

  • ✅ IAM role with access to S3, Glue, and Lake Formation

  • ✅ AWS Glue database created: iceberg_demo

  • ✅ Lake Formation permissions granted if enabled
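The Glue database from the checklist can be created with the AWS CLI; a minimal sketch, assuming your credentials and region are already configured:

```shell
aws glue create-database --database-input '{"Name": "iceberg_demo"}'
```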


🧊 Step-by-Step Iceberg Creation Using PySpark

📁 File: create_iceberg_spark.py

python

from pyspark.sql import SparkSession

# === Step 1: Create Spark session with Iceberg and the Glue catalog ===
spark = SparkSession.builder \
    .appName("IcebergTableWriter") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-data-lake/iceberg/") \
    .config("spark.sql.defaultCatalog", "glue_catalog") \
    .getOrCreate()

# === Step 2: Create a DataFrame ===
data = [
    (1, "Alice", "NY"),
    (2, "Bob", "LA"),
    (3, "Charlie", "Chicago"),
]
columns = ["customer_id", "name", "city"]
df = spark.createDataFrame(data, columns)

# === Step 3: Write the DataFrame as an Iceberg table ===
df.writeTo("glue_catalog.iceberg_demo.customer") \
    .using("iceberg") \
    .createOrReplace()

print("✅ Iceberg table created successfully")

📂 What Happens in S3?

After this runs, S3 will contain:

text

s3://my-data-lake/iceberg/iceberg_demo/customer/
├── data/
│   └── *.parquet
└── metadata/
    ├── v1.metadata.json
    ├── snap-*.avro   (snapshot manifest lists)
    └── *-m0.avro     (manifest files)

These files are crucial to Iceberg's table versioning and query planning.
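As a rough illustration of what lives in metadata/, here is a sketch that parses a heavily trimmed vN.metadata.json. The field names (format-version, current-snapshot-id, snapshots) follow the Iceberg table spec, but the values are made up for the example:

```python
import json

# A heavily trimmed example of an Iceberg table-metadata file.
# Real files carry many more fields (schemas, partition specs,
# sort orders, and references to manifest lists).
metadata_json = """
{
  "format-version": 2,
  "table-uuid": "11111111-2222-3333-4444-555555555555",
  "location": "s3://my-data-lake/iceberg/iceberg_demo/customer",
  "current-snapshot-id": 1001,
  "snapshots": [
    {"snapshot-id": 1001, "timestamp-ms": 1750838400000}
  ]
}
"""

metadata = json.loads(metadata_json)

# Query planning starts from the current snapshot recorded here;
# time travel works by picking an older entry from "snapshots".
current = metadata["current-snapshot-id"]
snapshot_ids = [s["snapshot-id"] for s in metadata["snapshots"]]

print(current)        # 1001
print(snapshot_ids)   # [1001]
```

Each commit writes a new vN.metadata.json and appends a snapshot, which is what makes versioned reads possible.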


🔗 Glue Catalog Entry

A new table called customer will appear under:

  • Database: iceberg_demo

  • Table Type: EXTERNAL

  • Format: ICEBERG

This happens because we used glue_catalog with Iceberg’s AWS support.
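You can confirm the catalog entry from the CLI as well; a sketch, assuming your credentials have Glue read permissions:

```shell
aws glue get-table --database-name iceberg_demo --name customer
```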


🔍 Query via Athena

  1. Go to Athena

  2. Choose Data source: AwsDataCatalog

  3. Database: iceberg_demo

  4. Run query:

sql

SELECT * FROM customer;

You will see:

customer_id  name     city
1            Alice    NY
2            Bob      LA
3            Charlie  Chicago
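Because Iceberg keeps snapshots, Athena can also time-travel over this table. A sketch (the timestamp is illustrative):

```sql
-- Read the table as of a past point in time (Athena Iceberg time travel)
SELECT * FROM customer FOR TIMESTAMP AS OF TIMESTAMP '2025-06-25 10:00:00';
```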

🛡️ Lake Formation Permissions (Recap)

  • In Lake Formation → Permissions:

    • Grant user2: SELECT, INSERT, DELETE

    • Grant user1: SELECT only
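The grants above map to Lake Formation CLI calls like this sketch (the account ID and IAM user ARN are placeholders for your own principals):

```shell
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/user2 \
  --permissions SELECT INSERT DELETE \
  --resource '{"Table": {"DatabaseName": "iceberg_demo", "Name": "customer"}}'
```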


🔄 Summary

Component     Value
Engine        Apache Spark + Iceberg
Catalog       AWS Glue Catalog
Table Format  Apache Iceberg (Parquet)
Storage       S3 (s3://my-data-lake/iceberg/)
Query Engine  Athena

