Lakehouse Pipeline Architecture

A breakdown of the medallion-style ETL pipeline built on AWS Glue, Amazon S3, and Amazon Athena.

Medallion Architecture Overview

The pipeline follows the Bronze → Silver → Gold design pattern: raw JSON events land in Bronze, are cleaned and enriched in Silver, and are aggregated into structured, AI-ready feature tables in Gold:

Lakehouse Architecture Diagram

Diagram: JSON events flow through Bronze → Silver → Gold, becoming AI-ready at each stage.

ETL Jobs with AWS Glue (PySpark)

Each layer is produced by a dedicated AWS Glue job: one ingests raw JSON events into Bronze, one cleans and enriches them into Silver, and one aggregates user-level features into Gold.

Jobs run on demand on Glue 5.0 (Spark 3.5) with 2 DPUs, each completing in under 2 minutes.
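
A minimal sketch of what the Silver-layer job could look like. The bucket name, prefixes, and column names (user_id, event_type, timestamp) are illustrative assumptions, not the actual job code:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job boilerplate: resolve arguments, build contexts, init the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw JSON events from the Bronze prefix (path and columns are assumptions).
bronze_df = spark.read.json("s3://ai-lakehouse/bronze/events/")

# Clean and normalize: drop events without a user_id, parse the timestamp,
# and remove exact duplicates.
silver_df = (
    bronze_df
    .filter(F.col("user_id").isNotNull())
    .withColumn("event_timestamp", F.to_timestamp("timestamp"))
    .dropDuplicates(["user_id", "event_type", "event_timestamp"])
)

# Write the cleaned events to the Silver prefix as Parquet.
silver_df.write.mode("overwrite").parquet("s3://ai-lakehouse/silver/events/")

job.commit()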

S3 Storage Layering

All transformed data is stored in Amazon S3 in Parquet format, with each layer written to its own prefix.

The Silver and Gold layers are partitioned so that Athena can prune partitions and scan less data at query time.
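
A sketch of the partitioned write, assuming event_date is the partition key and ai-lakehouse is the bucket (both illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver-partitioned-write").getOrCreate()

# Cleaned Silver events with an event_timestamp column (path is an assumption).
silver_df = spark.read.parquet("s3://ai-lakehouse/silver/events/")

# Derive the partition key and write Hive-style partitioned Parquet, producing keys like
#   s3://ai-lakehouse/silver/events_partitioned/event_date=2024-06-01/part-00000.snappy.parquet
(silver_df
    .withColumn("event_date", F.to_date("event_timestamp"))
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://ai-lakehouse/silver/events_partitioned/"))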

Metadata & Query Layer

Each layer is registered in the AWS Glue Data Catalog by a Glue Crawler and made queryable in Athena.

Sample Athena queries return in under a second, with partition pruning enabled on the Silver and Gold layers.
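
A sketch of registering the Gold layer with a crawler through boto3. The crawler name, IAM role, and S3 path are placeholders; ai_lakehouse_db is the database used in the Athena examples below:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that catalogs the Gold prefix into the ai_lakehouse_db database.
glue.create_crawler(
    Name="gold-layer-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="ai_lakehouse_db",
    Targets={"S3Targets": [{"Path": "s3://ai-lakehouse/gold/"}]},
    TablePrefix="gold_",
)

# Run the crawler; once it finishes, the discovered tables are queryable in Athena.
glue.start_crawler(Name="gold-layer-crawler")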

Data Quality Layer

A dedicated AWS Glue job runs validations on the Silver layer before it is used for ML or analytics. It checks for nulls, duplicates, and schema mismatches, and writes its results both to S3 and to a Markdown report.

DQ Glue Job Success

Glue job verifying Silver layer integrity before downstream use.
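
A condensed sketch of these checks in PySpark. The Silver path, expected column set, and report locations are assumptions:

import json

import boto3
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver-dq-checks").getOrCreate()

# Assumed Silver path and expected schema contract.
silver_df = spark.read.parquet("s3://ai-lakehouse/silver/events/")
expected_columns = {"user_id", "event_type", "event_timestamp", "event_date"}

results = {}

# Null check: rows missing the critical identifier.
results["null_user_id"] = silver_df.filter(F.col("user_id").isNull()).count()

# Duplicate check: rows sharing the same user/event/timestamp combination.
total_rows = silver_df.count()
distinct_rows = silver_df.dropDuplicates(["user_id", "event_type", "event_timestamp"]).count()
results["duplicate_rows"] = total_rows - distinct_rows

# Schema check: columns missing from, or unexpected in, the Silver table.
actual_columns = set(silver_df.columns)
results["missing_columns"] = sorted(expected_columns - actual_columns)
results["unexpected_columns"] = sorted(actual_columns - expected_columns)

# Persist results: a JSON report and a Markdown summary, both written to S3.
markdown = "\n".join(["# Silver Layer DQ Report", ""] +
                     [f"- **{check}**: {value}" for check, value in results.items()])
s3 = boto3.client("s3")
s3.put_object(Bucket="ai-lakehouse", Key="dq/silver_report.json",
              Body=json.dumps(results).encode("utf-8"))
s3.put_object(Bucket="ai-lakehouse", Key="dq/silver_report.md",
              Body=markdown.encode("utf-8"))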

Example Gold Table Schema

user_id              STRING
last_event_timestamp TIMESTAMP
last_event_type      STRING
click_count          INT
purchase_count       INT
last_feature_hash    STRING

Schema inferred and cataloged by AWS Glue from partitioned Parquet data.
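
A sketch of how this table could be built from the Silver layer in PySpark. The Silver and Gold paths and the last_feature_hash derivation are assumptions; column names follow the schema above:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("gold-user-features").getOrCreate()

# Cleaned events from the Silver layer (path is an assumption).
events = spark.read.parquet("s3://ai-lakehouse/silver/events/")

# Per-user aggregates matching the Gold schema above.
aggregates = events.groupBy("user_id").agg(
    F.max("event_timestamp").alias("last_event_timestamp"),
    F.sum(F.when(F.col("event_type") == "click", 1).otherwise(0)).alias("click_count"),
    F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchase_count"),
)

# last_event_type: the event_type of each user's most recent event.
latest = Window.partitionBy("user_id").orderBy(F.col("event_timestamp").desc())
last_events = (
    events.withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    .select("user_id", F.col("event_type").alias("last_event_type"))
)

# last_feature_hash: an illustrative fingerprint over the feature values.
gold_df = aggregates.join(last_events, "user_id").withColumn(
    "last_feature_hash",
    F.sha2(F.concat_ws("|", F.col("user_id"),
                       F.col("click_count").cast("string"),
                       F.col("purchase_count").cast("string")), 256),
)

gold_df.write.mode("overwrite").parquet("s3://ai-lakehouse/gold/user_features/")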

Sample Athena Query

SELECT user_id, click_count, purchase_count,
       last_event_type, last_event_timestamp
FROM ai_lakehouse_db.gold_user_features
ORDER BY last_event_timestamp DESC
LIMIT 10;

This query fetches the most recent user activity summaries directly from the Gold layer; adding a filter on the table's partition column lets Athena prune partitions and scan even less data.
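
For programmatic access, for example when feeding these features to a training job, the same query can be run through the Athena API with boto3. The result output location below is a placeholder:

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

QUERY = """
SELECT user_id, click_count, purchase_count,
       last_event_type, last_event_timestamp
FROM ai_lakehouse_db.gold_user_features
ORDER BY last_event_timestamp DESC
LIMIT 10
"""

# Start the query; Athena writes result files to the given S3 location (placeholder).
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ai_lakehouse_db"},
    ResultConfiguration={"OutputLocation": "s3://ai-lakehouse/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state, then fetch the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=query_id)
    for row in result["ResultSet"]["Rows"][1:]:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])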