Alibaba Tongyi Laboratory's intelligent computing team launches new algorithm FIPO
On April 7, the intelligent computing team at Alibaba's Tongyi Laboratory announced FIPO, a new algorithm that introduces a Future-KL mechanism to reward key tokens and is intended to address the "stagnation of reasoning length" problem in pure reinforcement learning training. According to the team, in a 32B pure-RL setting FIPO outperforms o1-mini and the same-scale DeepSeek-Zero-MATH.

