FIPO: Qwen’s Token-Level Credit Fix That Breaks the 4K Reasoning Ceiling

Alibaba’s FIPO algorithm uses future-KL divergence to weight every token by its downstream impact, extending chain-of-thought length from 4,000 to over 10,000 tokens and reaching 58% on AIME 2024.
artificial-intelligence
Author

Kabui, Charles

Published

2026-04-04

Keywords

fipo, reinforcement-learning, chain-of-thought, credit-assignment, qwen