Latest News
On April 7, the intelligent computing team at Alibaba's Tongyi Lab announced FIPO (Future-KL Influenced Policy Optimization), a new algorithm that uses a Future-KL mechanism to reward key tokens, with the aim of solving the "reasoning length stagnation" problem in pure reinforcement learning (Pure RL) training. According to the team, it is the first to surpass both o1-mini and the same-scale DeepSeek-Zero-MATH in a pure RL setting at the 32B scale.

