Attention Residuals: A Drop-In Fix for How Every LLM Stacks Its Layers

Moonshot AI’s Kimi team replaces fixed residual connections in Transformers with learned attention over depth, improving GPQA-Diamond by 7.5 points on a 48B model.
Category: artificial-intelligence

Author: Kabui, Charles

Published: 2026-03-18

Keywords: attention-residuals, transformer-architecture, residual-connections, prenorm-dilution, moonshot-ai-kimi
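
Before diving in, here is the core idea in code. A minimal sketch, assuming a softmax over depth and a per-layer table of learnable mixing logits; every name here (`DepthAttentionResidual`, `depth_logits`, the toy blocks) is illustrative, not Moonshot's implementation. Instead of the fixed pre-norm update x_{l+1} = x_l + F(x_l), each layer reads a learned, attention-weighted combination of all earlier hidden states:

```python
import torch
import torch.nn as nn


class DepthAttentionResidual(nn.Module):
    """Minimal sketch of attention over depth (illustrative, not Moonshot's code).

    A plain pre-norm Transformer fixes the residual update to
    x_{l+1} = x_l + F(x_l). Here each layer instead reads a learned,
    softmax-weighted mix of *all* earlier hidden states.
    """

    def __init__(self, num_layers: int, d_model: int):
        super().__init__()
        # Stand-in for real Transformer blocks, kept trivial for the sketch.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
            for _ in range(num_layers)
        )
        # One learnable logit per (layer, earlier-depth) pair; row l mixes
        # depths 0..l. Zero init means a uniform mix at the start of training.
        self.depth_logits = nn.Parameter(torch.zeros(num_layers, num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        states = [x]  # depth-0 state: the token embeddings
        for l, block in enumerate(self.blocks):
            weights = torch.softmax(self.depth_logits[l, : l + 1], dim=0)
            # Learned residual: an attention-weighted sum over depths 0..l
            # replaces the fixed identity shortcut.
            mixed = sum(w * s for w, s in zip(weights, states))
            states.append(mixed + block(mixed))
        return states[-1]


model = DepthAttentionResidual(num_layers=4, d_model=64)
out = model(torch.randn(2, 16, 64))  # (batch, seq, d_model) -> same shape
```

The zero init is just one reasonable choice; biasing row l toward depth l instead would recover the standard residual connection at initialization.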