RLSD: Combining Verifiable Rewards With Self-Distillation for Stable Reasoning Training

RLSD is a new method that addresses the information-leakage problem in on-policy self-distillation (OPSD) by letting the environment pick the update direction while the teacher sets only the magnitude. It reaches a higher convergence ceiling than RLVR (reinforcement learning with verifiable rewards) or OPSD alone.
artificial-intelligence
Author

Kabui, Charles

Published

2026-04-19

Keywords

reinforcement-learning, self-distillation, rlvr, llm-reasoning, on-policy-distillation
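To make the headline idea concrete before the article proper, here is a minimal toy sketch of the split the abstract describes: the environment's verifiable reward supplies only the sign of a token-level update, while the teacher's confidence supplies only its magnitude. Everything here (the function name `rlsd_update`, the single-logit update, the use of teacher probability as the scale) is an illustrative assumption, not the paper's actual algorithm.

```python
import numpy as np

def rlsd_update(student_logits, chosen, reward, teacher_prob, lr=0.1):
    """Toy sketch (not the published RLSD update rule).

    Direction comes from the environment: the sign of the verifiable
    reward. Magnitude comes from the teacher: its probability on the
    chosen token, as a self-distillation weight.
    """
    grad = np.zeros_like(student_logits)
    direction = np.sign(reward)   # environment picks the direction only
    magnitude = teacher_prob      # teacher sets the magnitude only
    grad[chosen] = direction * magnitude
    return student_logits + lr * grad

# A correct answer (reward = +1) nudges the chosen logit up by
# lr * teacher_prob; a wrong answer (reward = -1) would push it down
# by the same teacher-scaled amount.
logits = np.zeros(4)
updated = rlsd_update(logits, chosen=2, reward=1.0, teacher_prob=0.8)
```

Because the teacher never chooses the direction, it cannot leak the answer into the update; it can only modulate how far a reward-chosen step moves.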