Kola Bamgbose

AI & Systems Engineer

Projects

GPT-2 From Scratch: Transformer Failure Modes

Implemented GPT-2 (124M) decoder-only transformer from scratch in PyTorch with manual causal self-attention, weight tying, and fused AdamW. Trained on FineWeb-Edu 10B with DDP across 8x A100 80GB. Profiled FlashAttention vs naive attention; measured 23.7x latency speedup and 25x memory reduction at T=4096.

Writing

Six Ways a Transformer Fails: What the Silent Ones Tell You

Three of six transformer failure modes pass standard training checks while computing the wrong function. The bugs that crash immediately are the lowest-risk ones.