Lakehouse Architecture Overview

AI-Ready Lakehouse Pipeline with AWS Glue + Athena

This multi-layer ETL pipeline uses AWS Glue, S3, and Athena to process event data from raw JSON to ML-ready Gold features. It is designed for fast, scalable analytics and real-world AI workflows, with data quality enforced at each layer.

Project Purpose

This project demonstrates how to build a scalable, AI-optimized data lakehouse using AWS Glue (PySpark), Amazon S3, and Athena. The goal is to take raw JSON events, transform and clean them through Bronze and Silver layers, and generate analytical Gold features for machine learning and BI.
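
As a concrete starting point, a minimal Bronze-layer Glue job could look like the sketch below. The bucket names, paths, and column handling are illustrative assumptions, not the project's exact configuration.

```python
# Sketch of a Bronze-layer Glue job: raw JSON in, normalized Parquet out.
# Bucket names and prefixes are hypothetical placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw JSON events from the landing prefix.
raw = spark.read.json("s3://example-lakehouse/raw/events/")

# Normalize column names and write columnar Parquet to the Bronze prefix.
bronze = raw.toDF(*[c.lower().replace(" ", "_") for c in raw.columns])
bronze.write.mode("overwrite").parquet("s3://example-lakehouse/bronze/events/")

job.commit()
```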

Technologies Used

- AWS Glue (Spark 3.5) for scalable ETL jobs
- Amazon S3 with Parquet for cost-efficient storage
- AWS Glue Crawlers and Data Catalog
- Amazon Athena (Trino-based SQL engine; see the crawler-and-query sketch after this list)
- PySpark for transformations and aggregations
- GitHub for reproducibility and documentation
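
The crawler and query workflow can be driven from boto3, as in this minimal sketch. The crawler, database, table, and bucket names are hypothetical placeholders.

```python
# Sketch: register Bronze metadata with a Glue crawler, then query it from Athena.
import time

import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Run the crawler that populates the Data Catalog for the Bronze layer.
glue.start_crawler(Name="bronze-events-crawler")

# Issue a SQL query against the crawled table; Athena writes results to S3.
query = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM bronze_events",
    QueryExecutionContext={"Database": "lakehouse_db"},
    ResultConfiguration={"OutputLocation": "s3://example-lakehouse/athena-results/"},
)

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    status = state["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if status == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query["QueryExecutionId"])
    print(rows["ResultSet"]["Rows"])
```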

Pipeline Overview

The pipeline follows the modern medallion architecture:

- 🥉 Bronze: Raw JSON → normalized Parquet
- 🥈 Silver: Deduplicated, cleaned, partitioned data
- 🥇 Gold: User-level feature store with aggregations such as click count, last interaction, and event ratios (see the sketch below)

Glue jobs run on demand, and metadata is registered via crawlers. All layers are queryable through Athena.
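
To make the Silver and Gold steps concrete, here is a PySpark sketch of the deduplication and the user-level feature aggregation. Paths and column names (user_id, event_id, event_type, event_ts, event_date) are illustrative assumptions, not the project's exact schema.

```python
# Sketch of the Silver and Gold transformations; paths and columns are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver-gold-sketch").getOrCreate()

# Silver: deduplicate Bronze records, drop unusable rows, partition by date.
bronze = spark.read.parquet("s3://example-lakehouse/bronze/events/")
silver = bronze.dropDuplicates(["event_id"]).filter(F.col("user_id").isNotNull())
silver.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-lakehouse/silver/events/"
)

# Gold: user-level features -- click count, last interaction, event ratios.
gold = silver.groupBy("user_id").agg(
    F.count(F.when(F.col("event_type") == "click", 1)).alias("click_count"),
    F.max("event_ts").alias("last_event_ts"),
    F.count(F.lit(1)).alias("event_count"),
)
gold = gold.withColumn("click_ratio", F.col("click_count") / F.col("event_count"))
gold.write.mode("overwrite").parquet("s3://example-lakehouse/gold/user_features/")
```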