Near-Future Policy Optimization: RL That Learns from the Model’s Own Future Self

A new RL training method replaces the stale, fixed reference policy with a near-future checkpoint from the same run, raising Qwen3-VL-8B’s average benchmark score by 5.27% with no architecture changes.
artificial-intelligence
Author

Kabui, Charles

Published

2026-04-30

Keywords

reinforcement-learning, post-training, grpo, near-future-policy, mixed-policy-rl, qwen, vision-language-models, self-taught-rlvr