Attention Residuals: A Drop-In Fix for How Every LLM Stacks Its Layers

Moonshot AI’s Kimi team replaces fixed residual connections in Transformers with learned attention over depth, improving GPQA-Diamond by 7.5 points on a 48B model.
Category: artificial-intelligence

Author: Kabui, Charles

Published: 2026-03-18

Keywords: attention-residuals, transformer-architecture, residual-connections, prenorm-dilution, moonshot-ai-kimi
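
Before diving in, here is the core idea in code. A minimal sketch, assuming a softmax over depth and a per-layer table of learnable mixing logits; every name here (`DepthAttentionResidual`, `depth_logits`, the toy blocks) is illustrative, not Moonshot's implementation. Instead of the fixed pre-norm update x_{l+1} = x_l + F(x_l), each layer reads a learned, attention-weighted combination of all earlier hidden states:

```python
import torch
import torch.nn as nn


class DepthAttentionResidual(nn.Module):
    """Minimal sketch of attention over depth (illustrative, not Moonshot's code).

    A plain pre-norm Transformer fixes the residual update to
    x_{l+1} = x_l + F(x_l). Here each layer instead reads a learned,
    softmax-weighted mix of *all* earlier hidden states.
    """

    def __init__(self, num_layers: int, d_model: int):
        super().__init__()
        # Stand-in for real Transformer blocks, kept trivial for the sketch.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
            for _ in range(num_layers)
        )
        # One learnable logit per (layer, earlier-depth) pair; row l mixes
        # depths 0..l. Zero init means a uniform mix at the start of training.
        self.depth_logits = nn.Parameter(torch.zeros(num_layers, num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        states = [x]  # depth-0 state: the token embeddings
        for l, block in enumerate(self.blocks):
            weights = torch.softmax(self.depth_logits[l, : l + 1], dim=0)
            # Learned residual: an attention-weighted sum over depths 0..l
            # replaces the fixed identity shortcut.
            mixed = sum(w * s for w, s in zip(weights, states))
            states.append(mixed + block(mixed))
        return states[-1]


model = DepthAttentionResidual(num_layers=4, d_model=64)
out = model(torch.randn(2, 16, 64))  # (batch, seq, d_model) -> same shape
```

The zero init is just one reasonable choice; biasing row l toward depth l instead would recover the standard residual connection at initialization.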