Wednesday, 25 June 2025

Iceberg with S3

 

Prerequisites

1. Python + PySpark Environment

Install the necessary packages:

bash

pip install pyspark

You’ll also need the Iceberg Spark runtime JAR (iceberg-spark-runtime) on the classpath. You can download it from Maven Central, or pull it in with the Spark --packages option when submitting the job.
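As a sketch of the --packages route, assuming Spark 3.5 with Scala 2.12 and Iceberg 1.5.2 (adjust the versions to match your environment):

```shell
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2 \
  create_iceberg_spark.py
```

The iceberg-aws-bundle artifact bundles the AWS SDK classes that the Glue catalog and S3FileIO need.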


2. AWS Setup

  • ✅ S3 bucket created (e.g., s3://my-data-lake/iceberg/)

  • ✅ IAM role with access to S3, Glue, and Lake Formation

  • ✅ AWS Glue database created: iceberg_demo

  • ✅ Lake Formation permissions granted if enabled
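The Glue database from the checklist can be created with the AWS CLI; a minimal sketch, assuming your credentials and region are already configured:

```shell
aws glue create-database --database-input '{"Name": "iceberg_demo"}'
```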


🧊 Step-by-Step Iceberg Creation Using PySpark

📁 File: create_iceberg_spark.py

python

from pyspark.sql import SparkSession

# === Step 1: Create Spark session with Iceberg and the Glue catalog ===
spark = SparkSession.builder \
    .appName("IcebergTableWriter") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-data-lake/iceberg/") \
    .config("spark.sql.defaultCatalog", "glue_catalog") \
    .getOrCreate()

# === Step 2: Create a DataFrame ===
data = [
    (1, "Alice", "NY"),
    (2, "Bob", "LA"),
    (3, "Charlie", "Chicago"),
]
columns = ["customer_id", "name", "city"]
df = spark.createDataFrame(data, columns)

# === Step 3: Write the DataFrame as an Iceberg table ===
df.writeTo("glue_catalog.iceberg_demo.customer") \
    .using("iceberg") \
    .createOrReplace()

print("✅ Iceberg table created successfully")

📂 What Happens in S3?

After this runs, S3 will contain:

text

s3://my-data-lake/iceberg/iceberg_demo/customer/
├── data/
│   └── *.parquet
└── metadata/
    ├── v1.metadata.json
    ├── snap-*.avro   (snapshot manifest lists)
    └── *-m0.avro     (manifest files)

These files are crucial to Iceberg's table versioning and query planning.
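As a rough illustration of what lives in metadata/, here is a sketch that parses a heavily trimmed vN.metadata.json. The field names (format-version, current-snapshot-id, snapshots) follow the Iceberg table spec, but the values are made up for the example:

```python
import json

# A heavily trimmed example of an Iceberg table-metadata file.
# Real files carry many more fields (schemas, partition specs,
# sort orders, and references to manifest lists).
metadata_json = """
{
  "format-version": 2,
  "table-uuid": "11111111-2222-3333-4444-555555555555",
  "location": "s3://my-data-lake/iceberg/iceberg_demo/customer",
  "current-snapshot-id": 1001,
  "snapshots": [
    {"snapshot-id": 1001, "timestamp-ms": 1750838400000}
  ]
}
"""

metadata = json.loads(metadata_json)

# Query planning starts from the current snapshot recorded here;
# time travel works by picking an older entry from "snapshots".
current = metadata["current-snapshot-id"]
snapshot_ids = [s["snapshot-id"] for s in metadata["snapshots"]]

print(current)        # 1001
print(snapshot_ids)   # [1001]
```

Each commit writes a new vN.metadata.json and appends a snapshot, which is what makes versioned reads possible.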


🔗 Glue Catalog Entry

A new table called customer will appear under:

  • Database: iceberg_demo

  • Table Type: EXTERNAL

  • Format: ICEBERG

This happens because we used glue_catalog with Iceberg’s AWS support.
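You can confirm the catalog entry from the CLI as well; a sketch, assuming your credentials have Glue read permissions:

```shell
aws glue get-table --database-name iceberg_demo --name customer
```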


🔍 Query via Athena

  1. Go to Athena

  2. Choose Data source: AwsDataCatalog

  3. Database: iceberg_demo

  4. Run query:

sql

SELECT * FROM customer;

You will see:

customer_id  name     city
1            Alice    NY
2            Bob      LA
3            Charlie  Chicago
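Because Iceberg keeps snapshots, Athena can also time-travel over this table. A sketch (the timestamp is illustrative):

```sql
-- Read the table as of a past point in time (Athena Iceberg time travel)
SELECT * FROM customer FOR TIMESTAMP AS OF TIMESTAMP '2025-06-25 10:00:00';
```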

🛡️ Lake Formation Permissions (Recap)

  • In Lake Formation → Permissions:

    • Grant user2: SELECT, INSERT, DELETE

    • Grant user1: SELECT only
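The grants above map to Lake Formation CLI calls like this sketch (the account ID and IAM user ARN are placeholders for your own principals):

```shell
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/user2 \
  --permissions SELECT INSERT DELETE \
  --resource '{"Table": {"DatabaseName": "iceberg_demo", "Name": "customer"}}'
```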


🔄 Summary

Component     Value
Engine        Apache Spark + Iceberg
Catalog       AWS Glue Catalog
Table Format  Apache Iceberg (Parquet)
Storage       S3 (s3://my-data-lake/iceberg/)
Query Engine  Athena

