MEDS: Teaching RL to Remember Its Mistakes Instead of Repeating Them

Fudan’s MEDS framework clusters recurring error patterns during RL training and penalizes repeated failures more heavily. It improves pass@1 by up to 4.13% across five benchmarks, using only logits already computed in the forward pass.
artificial-intelligence
Author

Kabui, Charles

Published

2026-04-26

Keywords

reinforcement-learning, reward-shaping, llm-reasoning, error-clustering, training-diversity