This multi-layer ETL pipeline uses AWS Glue, S3, and Athena to process event data from raw JSON into ML-ready Gold features. It is designed for fast, scalable analytics, real-world AI workflows, and enterprise-grade data quality and insights.
This project demonstrates how to build a scalable, AI-optimized data lakehouse using AWS Glue (PySpark), Amazon S3, and Athena. The goal is to take raw JSON events, transform and clean them through the Bronze and Silver layers, and generate analytical Gold features for machine learning and BI.
- AWS Glue (Spark 3.5) for scalable ETL jobs
- Amazon S3 with Parquet for cost-efficient storage
- AWS Glue Crawlers and Data Catalog
- Amazon Athena (Trino SQL engine)
- PySpark for transformations and aggregations
- GitHub for reproducibility and documentation
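To make the raw JSON to Bronze to Silver to Gold flow concrete, here is a minimal plain-Python sketch of the three layers. It is not the project's Glue job: it uses only the standard library rather than PySpark, and the field names (`event_id`, `user_id`, `event_type`, `ts`), the sample events, and the cleaning rules are assumptions chosen for illustration.

```python
import json
from collections import defaultdict

# Hypothetical raw events; the field names are illustrative assumptions.
RAW_LINES = [
    '{"event_id": "e1", "user_id": "u1", "event_type": "click", "ts": "2024-01-01T10:00:00"}',
    '{"event_id": "e1", "user_id": "u1", "event_type": "click", "ts": "2024-01-01T10:00:00"}',
    '{"event_id": "e2", "user_id": "u1", "event_type": "purchase", "ts": "2024-01-01T10:05:00"}',
    '{"event_id": "e3", "user_id": null, "event_type": "click", "ts": "2024-01-01T10:06:00"}',
    'not valid json',
    '{"event_id": "e4", "user_id": "u2", "event_type": "click", "ts": "2024-01-02T09:00:00"}',
]


def to_bronze(lines):
    """Bronze: keep every parseable JSON object unchanged; skip unparseable lines."""
    records = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(rec, dict):
            records.append(rec)
    return records


def to_silver(bronze):
    """Silver: drop records missing required keys and deduplicate on event_id."""
    seen = set()
    clean = []
    for rec in bronze:
        if not rec.get("event_id") or not rec.get("user_id"):
            continue
        if rec["event_id"] in seen:
            continue
        seen.add(rec["event_id"])
        clean.append(rec)
    return clean


def to_gold(silver):
    """Gold: aggregate one feature row per user (event and purchase counts)."""
    feats = defaultdict(lambda: {"event_count": 0, "purchase_count": 0})
    for rec in silver:
        row = feats[rec["user_id"]]
        row["event_count"] += 1
        if rec.get("event_type") == "purchase":
            row["purchase_count"] += 1
    return dict(feats)


bronze = to_bronze(RAW_LINES)
silver = to_silver(bronze)
gold = to_gold(silver)
```

In a Spark job each function would typically become a DataFrame transformation that writes Parquet to its own S3 prefix. The separation of concerns stays the same: Bronze preserves what arrived, Silver enforces quality rules, and Gold holds the aggregates consumed by ML and BI.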
The pipeline follows the modern medallion architecture: