Wednesday, 25 June 2025

Lake Formation

 AWS Lake Formation – Complete Overview 

 

✅ What is AWS Lake Formation? 

AWS Lake Formation is a data lake governance service that helps you: 

  • Build, secure, and manage data lakes on Amazon S3. 

  • Control access at database, table, column, and even row level. 

  • Centrally manage permissions across Athena, Redshift Spectrum, EMR, SageMaker, and QuickSight. 

  • Share data securely across accounts and audit access via CloudTrail. 

🧠 It works on top of the AWS Glue Data Catalog, and extends its capabilities by providing fine-grained security, tagging, filtering, and access control. 

 

🔩 Key Offerings of AWS Lake Formation 

| Feature | Description |
| --- | --- |
| Data Lake Location Registration | Register S3 paths as secure data lake storage. |
| Centralized Access Control | Replace coarse-grained S3 bucket policies and IAM policies with fine-grained access to tables, columns, and rows. |
| Column-Level and Row-Level Security | Grant specific users access to only certain columns or rows. |
| LF-Tags (Lake Formation Tags) | Tag resources (databases, tables, columns) and apply tag-based access control. |
| Cross-Account Sharing | Share tables or views with other AWS accounts securely. |
| Audit Logging | View who accessed which table and when. |
| LakeView | (New) SQL-based virtual views for data filtering and masking (AWS documentation refers to these as AWS Glue Data Catalog views). |
| Federated Access | Integrate with Active Directory, Okta, IAM Identity Center, etc. |
| Transactional Data Ingestion | Support for ACID transactions via governed tables. |
| Resource Links | Share data across accounts without duplication using links. |

 

🧱 Lake Formation Architecture 

plaintext

[S3 Data] 
  ↓ 
[Glue Data Catalog] 
  ↓ 
[Lake Formation Policies / LF-Tags / LakeViews] 
  ↓ 
[Consumers: Athena, EMR, SageMaker, Redshift Spectrum, QuickSight] 
 

  • Lake Formation wraps Glue Catalog and S3 with governance controls. 

  • It decouples storage from security using metadata and policies. 

 

🔒 Types of Permissions in Lake Formation 

| Permission Type | Description | Example |
| --- | --- | --- |
| Table-level | SELECT, INSERT, DELETE on full table | Allow access to all columns in sales_data |
| Column-level | Restrict access to certain columns | Grant name and email, leave out ssn |
| Row-level | Filter rows per user/group | Only allow region = 'US' for marketing |
| LF-Tag based | Apply policies based on tags | Grant access to tables tagged PII=False |
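
As a rough sketch of the LF-Tag row above, the following Python builds the arguments that boto3's Lake Formation `grant_permissions` call accepts for a tag-based grant. The helper name `lf_tag_grant_request` and the principal ARN are placeholders invented for illustration, and the sketch assumes the `PII` tag already exists and is attached to the tables you want to match.

```python
import json


def lf_tag_grant_request(principal_arn, tag_key, tag_values, permissions=("SELECT",)):
    """Build keyword arguments for the Lake Formation GrantPermissions API.

    The grant targets every table whose LF-Tags match the expression,
    rather than one named table.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": list(permissions),
    }


request = lf_tag_grant_request(
    "arn:aws:iam::111122223333:user/analyst",  # placeholder principal
    tag_key="PII",
    tag_values=["False"],
)
print(json.dumps(request, indent=2))

# Assuming boto3 is installed and credentials are configured, the dict could be sent as:
#   boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the request as plain data makes it easy to review or log before anything is sent to AWS.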

 

🛠️ Example Setup: Lake Formation with S3 and Glue 

🎯 Goal: 

You have a CSV in S3 with customer data. You want to: 

  • Catalog it using Glue 

  • Hide the ssn column from the analyst 

  • Return only rows where region = 'US' 

  • Query data using Athena 

 

🧩 Step 1: Upload Data to S3 

plaintext 

s3://my-data-lake/customer_data/customers.csv 
 

 

🧩 Step 2: Register S3 Location with Lake Formation 

bash 

# Register the bucket so Lake Formation manages access to it
aws lakeformation register-resource \
 --resource-arn arn:aws:s3:::my-data-lake \
 --use-service-linked-role
 

 

🧩 Step 3: Create a Glue Crawler 

  • Source: s3://my-data-lake/customer_data/ 

  • Output: Database: customer_db, Table: customers 
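
The crawler can also be created programmatically. The sketch below builds the arguments for the Glue `create_crawler` call and then starts the crawler; the crawler name and the IAM role ARN are hypothetical, and the role is assumed to already have read access to the S3 path.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start(glue_client, request):
    """Create the crawler, then start it. `glue_client` is a boto3 Glue client."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


request = crawler_request(
    name="customer-data-crawler",  # hypothetical crawler name
    role_arn="arn:aws:iam::111122223333:role/GlueCrawlerRole",  # hypothetical role
    database="customer_db",
    s3_path="s3://my-data-lake/customer_data/",
)
print(request)

# Usage, assuming boto3 and credentials are available:
#   import boto3
#   create_and_start(boto3.client("glue"), request)
```

A crawler usually names the table after the S3 folder, so the table may show up as customer_data rather than customers; check the catalog after the run and adjust the later commands if needed.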

 

🧩 Step 4: Set Table Permissions in Lake Formation 

bash 

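# Table-wide SELECT exposes every column and row to the analyst.
# Grant only DESCRIBE here if you plan to apply the narrower grants in Steps 5 and 6.
aws lakeformation grant-permissions \
 --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/analyst \
 --permissions "SELECT" "DESCRIBE" \
 --resource '{"Table": {"DatabaseName":"customer_db", "Name":"customers"}}'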
 

 

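🧩 Step 5: Apply Column-Level Security (Optional) 

Lake Formation does not mask values here; it hides whole columns. To hide the ssn column, grant SELECT on only the other columns: 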

bash 

# Grant SELECT on the listed columns only; ssn is left out
aws lakeformation grant-permissions \
 --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/analyst \
 --permissions "SELECT" \
 --resource '{
   "TableWithColumns": {
     "DatabaseName": "customer_db",
     "Name": "customers",
     "ColumnNames": ["name", "region", "email"]
   }
 }'
 

Now the ssn column is hidden from this user, as long as they hold no broader SELECT grant on the table (Lake Formation grants are additive). 
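
An alternative to listing every allowed column is to grant all columns except the sensitive ones. The sketch below builds that request for boto3's `grant_permissions` using a column wildcard with an exclusion list; the helper name `select_all_except` is made up for this example, and the principal is the same placeholder analyst used above.

```python
def select_all_except(principal_arn, database, table, excluded_columns):
    """Build GrantPermissions arguments: SELECT on every column except the excluded ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


request = select_all_except(
    "arn:aws:iam::111122223333:user/analyst",  # placeholder principal
    database="customer_db",
    table="customers",
    excluded_columns=["ssn"],
)
print(request)

# Assuming boto3 and credentials are available:
#   boto3.client("lakeformation").grant_permissions(**request)
```

With an exclusion list, columns added to the table later become visible automatically, whereas an explicit inclusion list keeps new columns hidden until they are granted. Which behavior is safer depends on how the table is expected to evolve.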

 

🧩 Step 6: Row-Level Filter (Optional) 

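To restrict the analyst to US-only rows, create a data cells filter on the table and grant SELECT on that filter (us_only is an example filter name): 

bash 

# Optional: make sure a data lake administrator is set.
# Note that this call replaces the existing data lake settings.
aws lakeformation put-data-lake-settings \
 --data-lake-settings '{
   "DataLakeAdmins": [{"DataLakePrincipalIdentifier":"arn:aws:iam::111122223333:user/admin"}]
 }'

# Define the filter: rows where region = 'US', limited to the non-sensitive columns
aws lakeformation create-data-cells-filter \
 --table-data '{
   "TableCatalogId": "111122223333",
   "DatabaseName": "customer_db",
   "TableName": "customers",
   "Name": "us_only",
   "RowFilter": {"FilterExpression": "region = '\''US'\''"},
   "ColumnNames": ["name", "region", "email"]
 }'

# Grant SELECT on the filter rather than on the whole table
aws lakeformation grant-permissions \
 --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/analyst \
 --permissions "SELECT" \
 --resource '{
   "DataCellsFilter": {
     "TableCatalogId": "111122223333",
     "DatabaseName": "customer_db",
     "TableName": "customers",
     "Name": "us_only"
   }
 }'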
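
Row-level security in Lake Formation is expressed as a data cells filter that is created once and then granted to principals. The Python sketch below builds the two requests (for boto3's `create_data_cells_filter` and `grant_permissions`) and avoids the shell quoting needed around 'US'. The filter name `us_only` and both helper names are hypothetical, and the account ID is the placeholder used throughout this post.

```python
def us_only_filter(catalog_id, database, table, filter_name="us_only"):
    """TableData for CreateDataCellsFilter: US rows, every column except ssn."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": filter_name,
        "RowFilter": {"FilterExpression": "region = 'US'"},
        "ColumnWildcard": {"ExcludedColumnNames": ["ssn"]},
    }


def filter_grant(principal_arn, table_data):
    """GrantPermissions arguments that give SELECT on the filter, not the whole table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


table_data = us_only_filter("111122223333", "customer_db", "customers")
grant = filter_grant("arn:aws:iam::111122223333:user/analyst", table_data)
print(table_data)
print(grant)

# Assuming boto3 and credentials are available:
#   lf = boto3.client("lakeformation")
#   lf.create_data_cells_filter(TableData=table_data)
#   lf.grant_permissions(**grant)
```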
 

 

🧩 Step 7: Query via Athena 

In Athena: 

sql 

SELECT * FROM customer_db.customers; 
 

  • The analyst will only see columns they have access to. 

  • Only rows where region = 'US' will be returned. 
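
To make the expected result concrete, here is a purely local simulation of what the analyst should see. It makes no AWS calls, the sample rows are invented, and the function only imitates the combined effect of the column grant and the row filter; it is not how Lake Formation enforces them.

```python
ALLOWED_COLUMNS = ["name", "region", "email"]  # columns granted in Step 5


def visible_rows(rows, allowed_columns, region):
    """Keep rows for the given region and drop columns that were not granted."""
    return [
        {column: row[column] for column in allowed_columns}
        for row in rows
        if row["region"] == region
    ]


# Invented sample data standing in for customers.csv
sample = [
    {"name": "Ana", "region": "US", "email": "ana@example.com", "ssn": "000-00-0001"},
    {"name": "Ben", "region": "EU", "email": "ben@example.com", "ssn": "000-00-0002"},
]

result = visible_rows(sample, ALLOWED_COLUMNS, "US")
print(result)
```

If the real Athena query returns an ssn column or non-US rows, the analyst most likely still holds a broader grant on the table.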

 

🔁 Cross-Account Sharing 

You can share the table/view with another AWS account: 

  • In the producer account, use grant-permissions to share metadata + data access with the consumer account 

  • In the consumer account, create a resource link that points to the shared table 
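
A minimal sketch of the two halves of cross-account sharing follows. The first request is for Lake Formation's `grant_permissions` in the producer account, with the consumer's account ID as the principal; the second is for Glue's `create_table` in the consumer account, where a `TargetTable` entry makes the new table a resource link. The consumer account ID, the local database `shared_db`, and the link name `customers_link` are all hypothetical.

```python
PRODUCER = "111122223333"  # account that owns customer_db
CONSUMER = "444455556666"  # hypothetical consumer account


def share_table_request(consumer_account_id, producer_account_id, database, table):
    """GrantPermissions arguments, to be run in the producer account."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "Table": {
                "CatalogId": producer_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def resource_link_request(local_database, link_name, producer_account_id, database, table):
    """Glue CreateTable arguments, to be run in the consumer account."""
    return {
        "DatabaseName": local_database,
        "TableInput": {
            "Name": link_name,
            "TargetTable": {
                "CatalogId": producer_account_id,
                "DatabaseName": database,
                "Name": table,
            },
        },
    }


grant = share_table_request(CONSUMER, PRODUCER, "customer_db", "customers")
link = resource_link_request("shared_db", "customers_link", PRODUCER, "customer_db", "customers")
print(grant)
print(link)

# Assuming boto3 and credentials for the respective accounts:
#   boto3.client("lakeformation").grant_permissions(**grant)   # producer side
#   boto3.client("glue").create_table(**link)                  # consumer side
```

The resource link holds no data of its own; it only points at the producer's table, so the consumer queries the original S3 objects.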

 

📊 Monitoring & Auditing 

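  • Lake Formation API activity is logged via AWS CloudTrail: 

      • Who accessed which table/view, and when 

      • What queries were run (recorded in the query engine's own events, such as Athena's StartQueryExecution) 

  • CloudTrail records the requests, not the rows a query returned 

  • Use Amazon CloudWatch + Athena to analyze Lake Formation logs 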
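
As a starting point for auditing, the sketch below shows the arguments one might pass to CloudTrail's `lookup_events` call to pull recent Lake Formation events, plus a small helper that condenses the response into (user, event name, time) tuples. The helper name `summarize` and the sample event are invented for illustration; real responses carry many more fields.

```python
import json

# Arguments for the CloudTrail LookupEvents API, filtered to Lake Formation
LOOKUP_ARGS = {
    "LookupAttributes": [
        {"AttributeKey": "EventSource", "AttributeValue": "lakeformation.amazonaws.com"}
    ],
    "MaxResults": 50,
}


def summarize(events):
    """Reduce LookupEvents results to (username, event name, event time) tuples."""
    summary = []
    for event in events:
        # CloudTrailEvent is a JSON string containing the full event record
        detail = json.loads(event["CloudTrailEvent"])
        summary.append((event.get("Username"), event["EventName"], detail.get("eventTime")))
    return summary


# Invented sample shaped like one entry of the "Events" list in a response
sample_events = [
    {
        "EventName": "GrantPermissions",
        "Username": "admin",
        "CloudTrailEvent": json.dumps({"eventTime": "2025-06-25T10:00:00Z"}),
    }
]

summary = summarize(sample_events)
print(summary)

# Assuming boto3 and credentials are available:
#   response = boto3.client("cloudtrail").lookup_events(**LOOKUP_ARGS)
#   print(summarize(response["Events"]))
```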

 

🧠 Advanced Features 

| Feature | Use Case |
| --- | --- |
| LakeView | SQL views with filtering/masking over Glue tables |
| LF-Tags | Automate permissions using tags like PII=True, Region=US |
| Federated Access | Use AD, Okta, IAM Identity Center for authentication |
| SAML Authentication | Allow external identities to query the data |
| Transactions (ACID) | Ingest/modify governed data with ACID properties |
| Hybrid Data Sources | Secure JDBC sources (RDS, Redshift, etc.) as part of the lake |

 

Summary 

| Capability | Supported by Lake Formation |
| --- | --- |
| Table access control | ✅ |
| Column-level security | ✅ |
| Row-level filtering | ✅ |
| Cross-account sharing | ✅ |
| Tag-based policies | ✅ |
| SQL views with masking/filtering | ✅ (LakeView) |
| Integration with AWS analytics services | ✅ |
| Auditing & monitoring | ✅ |

 
