A breakdown of the medallion-layered ETL pipeline using AWS Glue, S3, and Athena.
The pipeline follows the Bronze → Silver → Gold design pattern to clean, enrich, and structure event data for AI workloads:
Diagram: JSON events flow through Bronze → Silver → Gold, becoming AI-ready at each stage.
Each layer is handled by a dedicated AWS Glue job:
Jobs run on demand on Glue 5.0 (Spark 3.5) with 2 DPUs; each completes in under 2 minutes.
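The three layer transforms can be sketched locally. This is a hypothetical minimal version using plain Python dicts in place of Spark DataFrames; the field names (user_id, event_type, event_timestamp) are assumptions, not the jobs' real schema.

```python
from datetime import datetime

def bronze(raw_events):
    """Bronze: keep every record, just normalize types from raw JSON."""
    return [{
        "user_id": str(e["user_id"]),
        "event_type": e["event_type"],
        "event_timestamp": datetime.fromisoformat(e["event_timestamp"]),
    } for e in raw_events]

def silver(bronze_events):
    """Silver: drop null user_ids, deduplicate on (user_id, event_timestamp)."""
    seen, out = set(), []
    for e in bronze_events:
        key = (e["user_id"], e["event_timestamp"])
        if e["user_id"] and key not in seen:
            seen.add(key)
            out.append(e)
    return out

def gold(silver_events):
    """Gold: per-user feature aggregates (click/purchase counts, last event)."""
    features = {}
    for e in sorted(silver_events, key=lambda e: e["event_timestamp"]):
        f = features.setdefault(e["user_id"], {
            "user_id": e["user_id"], "click_count": 0, "purchase_count": 0,
            "last_event_type": None, "last_event_timestamp": None,
        })
        if e["event_type"] == "click":
            f["click_count"] += 1
        elif e["event_type"] == "purchase":
            f["purchase_count"] += 1
        f["last_event_type"] = e["event_type"]
        f["last_event_timestamp"] = e["event_timestamp"]
    return list(features.values())
```

In the actual pipeline each function corresponds to one Glue job reading from and writing to the S3 prefixes below.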
All transformed data is stored in Amazon S3 in Parquet format:
s3://ai-lakehouse-project/raw/ — raw JSON
s3://ai-lakehouse-project/bronze/user_events_parquet/
s3://ai-lakehouse-project/silver/user_events/ (partitioned)
s3://ai-lakehouse-project/gold/user_features/
Partitioning is applied on Silver and Gold layers for optimized Athena querying.
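Partition pruning in Athena relies on Hive-style key=value prefixes in S3. A small sketch of how such a path can be built; the event_date partition column is an assumption, the real jobs may partition on a different key:

```python
from datetime import date

def partition_path(bucket_prefix: str, table: str, event_date: date) -> str:
    """Build a Hive-style partition path so Athena can prune by event_date."""
    return f"{bucket_prefix}/{table}/event_date={event_date.isoformat()}/"

# partition_path("s3://ai-lakehouse-project/silver", "user_events", date(2024, 1, 15))
# -> "s3://ai-lakehouse-project/silver/user_events/event_date=2024-01-15/"
```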
Each layer is registered via AWS Glue Crawler and made queryable in Athena:
bronze_user_events_parquet
silver_user_events
gold_user_features
Sample Athena queries show sub-second response times with partition pruning enabled for Silver and Gold layers.
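Pruning only kicks in when the query filters on the partition column, so queries are typically generated with an explicit date predicate. A hypothetical helper (the event_date column name is an assumption):

```python
def pruned_query(table: str, start: str, end: str) -> str:
    """Return an Athena query whose WHERE clause filters on the partition
    column, so only matching S3 prefixes are scanned."""
    return (
        f"SELECT user_id, event_type, event_timestamp\n"
        f"FROM ai_lakehouse_db.{table}\n"
        f"WHERE event_date BETWEEN DATE '{start}' AND DATE '{end}'"
    )
```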
A dedicated AWS Glue job runs validations on the Silver layer before ML or analytics. It checks for nulls, duplicates, and schema mismatches, writing results both to S3 and to a Markdown report.
Duplicate key: (user_id + event_timestamp)
Reports: s3://ai-lakehouse-project/reports/
Glue job verifying Silver layer integrity before downstream use.
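The checks above can be sketched in plain Python. This is a hypothetical local version; the expected field set is an assumption standing in for the real Silver schema:

```python
EXPECTED_FIELDS = {"user_id", "event_type", "event_timestamp"}  # assumed schema

def validate_silver(records):
    """Count nulls, duplicate (user_id, event_timestamp) keys, and
    schema mismatches, mirroring the validations the Glue job performs."""
    report = {"null_user_ids": 0, "duplicates": 0, "schema_mismatches": 0}
    seen = set()
    for r in records:
        if set(r) != EXPECTED_FIELDS:
            report["schema_mismatches"] += 1
        if not r.get("user_id"):
            report["null_user_ids"] += 1
        key = (r.get("user_id"), r.get("event_timestamp"))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    report["passed"] = not any(
        report[k] for k in ("null_user_ids", "duplicates", "schema_mismatches")
    )
    return report
```

In the pipeline, a failing report would block downstream Gold refreshes instead of silently propagating bad rows.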
user_id STRING
last_event_type STRING
last_event_timestamp TIMESTAMP
click_count INT
purchase_count INT
last_feature_hash STRING
Schema inferred and cataloged by AWS Glue from partitioned Parquet data.
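One plausible way to derive last_feature_hash is a stable digest over the feature values, useful for detecting which users changed between Gold refreshes. This is a hypothetical sketch; the real job may compute the hash differently:

```python
import hashlib
import json

def feature_hash(row: dict) -> str:
    """Stable SHA-256 over feature values: key order does not affect
    the digest because the JSON payload is serialized with sorted keys."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()
```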
SELECT user_id, click_count, purchase_count,
last_event_type, last_event_timestamp
FROM ai_lakehouse_db.gold_user_features
ORDER BY last_event_timestamp DESC
LIMIT 10;
This query fetches recent user activity summaries directly from the Gold layer using partition-aware scanning.