GPT-2 From Scratch: Transformer Failure Modes
Implemented GPT-2 (124M) decoder-only transformer from scratch in PyTorch with manual causal self-attention, weight tying, and fused AdamW. Trained on FineWeb-Edu 10B with DDP across 8x A100 80GB. Profiled FlashAttention vs naive attention; measured 23.7x latency speedup and 25x memory reduction at T=4096.