Explore how the Bronze, Silver, and Gold layers evolve across the Lakehouse pipeline to support clean analytics and machine learning.
Derived from Silver, this layer aggregates user interactions into ML-ready features for personalization and recommendation systems.
days_since_last_event metric
training_date partition for time-based versioning
Gold job completed in under 2 minutes on AWS Glue 5.0 using 2 DPUs.
Data written to s3://ai-lakehouse-project/gold/user_features/
user_id STRING
last_event_timestamp TIMESTAMP
last_event_type STRING
click_count INT
purchase_count INT
last_feature_hash STRING
days_since_last_event INT
training_date DATE
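The aggregation from Silver events into the Gold schema above can be sketched in plain Python. This is a minimal illustration, not the Glue job itself: the function name `build_user_features`, the event type values `"click"` and `"purchase"`, and the definition of `days_since_last_event` as whole days between the last event and the training date are all assumptions.

```python
from collections import defaultdict
from datetime import date, datetime


def build_user_features(events, training_date):
    """Aggregate Silver-style event dicts into one Gold-style row per user."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event["user_id"]].append(event)

    rows = []
    for user_id, user_events in by_user.items():
        # The most recent event supplies the "last_*" columns.
        last = max(user_events, key=lambda e: e["event_timestamp"])
        rows.append({
            "user_id": user_id,
            "last_event_timestamp": last["event_timestamp"],
            "last_event_type": last["event_type"],
            # Assumed event type labels: "click" and "purchase".
            "click_count": sum(e["event_type"] == "click" for e in user_events),
            "purchase_count": sum(e["event_type"] == "purchase" for e in user_events),
            "last_feature_hash": last["feature_hash"],
            # Assumption: whole days between the last event and the training date.
            "days_since_last_event": (training_date - last["event_timestamp"].date()).days,
            "training_date": training_date,
        })
    return rows


# Hypothetical sample data for illustration only.
silver_events = [
    {"user_id": "u1", "event_type": "click",
     "event_timestamp": datetime(2025, 7, 8, 9, 0), "feature_hash": "h1"},
    {"user_id": "u1", "event_type": "purchase",
     "event_timestamp": datetime(2025, 7, 9, 12, 30), "feature_hash": "h2"},
    {"user_id": "u2", "event_type": "click",
     "event_timestamp": datetime(2025, 7, 1, 8, 15), "feature_hash": "h3"},
]
features = build_user_features(silver_events, training_date=date(2025, 7, 10))
```

Each run stamps its rows with a single `training_date`, which is what would let a date-partitioned table keep one feature snapshot per run.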
SELECT user_id, click_count, purchase_count
FROM ai_lakehouse_db.gold_user_features
WHERE training_date = CURRENT_DATE
ORDER BY click_count DESC
LIMIT 10;
Validated with Athena for fast, partitioned queries.
Athena query on gold_user_features with successful result preview and performance metrics.
The pipeline leverages AWS Glue Workflow to orchestrate Bronze → Silver → Gold transformations. This screenshot shows the full run completed successfully, with DAG-style visual dependencies.
✔️ Workflow execution status: Completed (July 10, 2025)
Silver builds on Bronze, eliminating duplicates, filtering out nulls, and enriching with partitioning for efficient query access.
Deduplication on (user_id, event_type, timestamp)
Partitioning by event_type and event_date
Null filtering on user_id + event_timestamp
Handling of event_timestamp and extraction of event_date
Silver job applied partitioning and quality filters.
Partitioned data at s3://ai-lakehouse-project/silver/user_events/
user_id STRING
event_type STRING
event_timestamp TIMESTAMP
event_date DATE
feature_hash STRING
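The Silver cleaning steps can be sketched in plain Python as follows. This is a sketch under stated assumptions rather than the actual Glue job: it assumes duplicates are keyed on (user_id, event_type, event_timestamp), that rows with a null user_id or event_timestamp are dropped, and that event_type and event_date are the partition columns. The names `to_silver` and `partition_path` are hypothetical.

```python
from datetime import datetime


def to_silver(bronze_rows):
    """Drop rows with null keys, de-duplicate, and derive event_date."""
    seen = set()
    silver = []
    for row in bronze_rows:
        # Quality filter: both key columns must be present.
        if row.get("user_id") is None or row.get("event_timestamp") is None:
            continue
        # Assumed de-duplication key.
        key = (row["user_id"], row["event_type"], row["event_timestamp"])
        if key in seen:
            continue  # keep only the first occurrence of each key
        seen.add(key)
        silver.append({
            "user_id": row["user_id"],
            "event_type": row["event_type"],
            "event_timestamp": row["event_timestamp"],
            "event_date": row["event_timestamp"].date(),
            "feature_hash": row["feature_hash"],
        })
    return silver


def partition_path(row):
    """Hive-style partition suffix, assuming event_type/event_date partitioning."""
    return f"event_type={row['event_type']}/event_date={row['event_date'].isoformat()}/"


# Hypothetical sample data for illustration only.
ts = datetime(2025, 7, 10, 8, 0)
bronze_rows = [
    {"user_id": "u1", "event_type": "click", "event_timestamp": ts, "feature_hash": "h1"},
    {"user_id": "u1", "event_type": "click", "event_timestamp": ts, "feature_hash": "h1"},
    {"user_id": None, "event_type": "click", "event_timestamp": ts, "feature_hash": "h2"},
    {"user_id": "u2", "event_type": "purchase", "event_timestamp": None, "feature_hash": "h3"},
    {"user_id": "u2", "event_type": "purchase",
     "event_timestamp": datetime(2025, 7, 9, 23, 59), "feature_hash": "h4"},
]
silver_rows = to_silver(bronze_rows)
```

Under these assumptions, the suffix returned by `partition_path` would be appended to the Silver prefix, so queries that filter on event_type or event_date only need to read the matching folders.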
The Bronze layer is the raw landing zone. It transforms JSON into schema-aware Parquet, enriched with ingestion metadata and AI-friendly fields.
feature_hash for AI compatibility
event_timestamp to support incremental ingestion
Bronze ETL job executed on AWS Glue 5.0 with 2 DPUs. Parsed and enriched raw JSON from S3.
Raw Parquet output written to s3://ai-lakehouse-project/bronze/user_events_parquet/
user_id STRING
session_id STRING
event_type STRING
event_timestamp TIMESTAMP
raw_payload STRING
ingestion_ts TIMESTAMP
model_input_flag BOOLEAN
feature_hash STRING
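To make the enrichment step concrete, here is a minimal plain-Python sketch that turns one raw JSON line into a row with the Bronze columns listed above. Several details are assumptions, since the source does not define them: the raw JSON field names (including `timestamp`), the `feature_hash` definition (SHA-256 over user_id and event_type), and the rule behind `model_input_flag`. The function name `to_bronze` is hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def to_bronze(raw_line, ingestion_ts=None):
    """Parse one raw JSON event and add ingestion metadata."""
    payload = json.loads(raw_line)
    user_id = payload.get("user_id")
    event_type = payload.get("event_type")
    ts = payload.get("timestamp")  # assumed raw field name

    # Hypothetical definition: SHA-256 over user_id and event_type.
    feature_hash = hashlib.sha256(
        f"{user_id}|{event_type}".encode("utf-8")
    ).hexdigest()

    return {
        "user_id": user_id,
        "session_id": payload.get("session_id"),
        "event_type": event_type,
        "event_timestamp": datetime.fromisoformat(ts) if ts else None,
        "raw_payload": raw_line,  # keep the original record for replay
        "ingestion_ts": ingestion_ts or datetime.now(timezone.utc),
        # Hypothetical rule: flag rows that carry both key fields.
        "model_input_flag": user_id is not None and event_type is not None,
        "feature_hash": feature_hash,
    }


# Hypothetical raw event for illustration only.
raw_event = (
    '{"user_id": "u1", "session_id": "s1", '
    '"event_type": "click", "timestamp": "2025-07-10T08:00:00"}'
)
bronze_row = to_bronze(raw_event, ingestion_ts=datetime(2025, 7, 10, 9, 0))
```

Keeping `raw_payload` next to the parsed columns means downstream layers can be rebuilt from Bronze without going back to the original JSON files.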
This project was upgraded to support:
event_timestamp watermarks (Bronze)
Each job is optimized for fast execution (under 2 minutes) and minimal reprocessing.
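The idea behind an event_timestamp watermark can be illustrated with a short plain-Python sketch: each run processes only events newer than the highest timestamp seen so far, then advances the watermark. The function name `select_new_events` is hypothetical, and how the watermark is persisted between runs is not described in the source, so it is simply passed in and returned here.

```python
from datetime import datetime


def select_new_events(events, watermark):
    """Return events newer than the watermark, plus the advanced watermark."""
    fresh = [
        e for e in events
        if watermark is None or e["event_timestamp"] > watermark
    ]
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = max((e["event_timestamp"] for e in fresh), default=watermark)
    return fresh, new_watermark


# Hypothetical sample data for illustration only.
batch = [
    {"user_id": "u1", "event_timestamp": datetime(2025, 7, 10, 8, 0)},
    {"user_id": "u2", "event_timestamp": datetime(2025, 7, 10, 9, 0)},
]

# First run: no watermark yet, so every event is processed.
first_run, wm = select_new_events(batch, watermark=None)

# Second run: the same batch plus one later event; only the new one is selected.
batch.append({"user_id": "u3", "event_timestamp": datetime(2025, 7, 10, 10, 0)})
second_run, wm2 = select_new_events(batch, watermark=wm)
```

The strict `>` comparison skips events whose timestamp equals the watermark. This avoids reprocessing, but a late event sharing that exact timestamp would be missed, so a real job might need a small overlap window plus de-duplication downstream.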