Jatin Prakash* (NYU), Anirudh Buvanesh* (MILA) (*order decided through np.random.randint(2))
September 13, 2025
If your base model has a zero success rate on a task, performing RL with outcome rewards won’t do anything. What can you do then? TL;DR: simply adding easy samples to your training dataset can unlock RL training.

Figure 1: Approaches such as naïve RL (Dr. GRPO), reward densification (Progress-Reward), credit assignment (VinePPO), and diversity incentives (BoN Aware Finetuning) fail to solve a hard task. A simple intervention of mixing easier samples helps unlock RL training.
<aside>
📌
Takeaways
- When the base model can’t solve a task at all (i.e., outcome rewards are always zero during RL training), we show that the simple data-centric intervention of adding easier instances of the same task to the training set works surprisingly well!
- The choice of easy instances you add matters! Adding only very easy examples doesn’t help. However, you don’t need to hunt for the “perfect difficulty”: mixing in all the easier instances you have works!
- We benchmark methods that incorporate components designed to tackle zero outcome rewards, such as dense rewards, diversity incentives, and improved credit assignment, and find none of them effective in our setting. Since there was no official code for these baselines, we’re releasing (single-file, hackable) implementations: https://github.com/rl4reasoning/rl-baselines. Hopefully you’ll find them useful in your own experiments 🙂
- We conclude with a simple and practical recipe for RL practitioners: add all the easier instances of the task you can get your hands on! We also connect our findings to ideas in skill learning and related prior work. (A minimal sketch of the data-mixing step follows these takeaways.)
</aside>
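To make the recipe concrete, here is a minimal sketch of the data-mixing step. It assumes your task comes with a generator exposing a difficulty knob (for graph search, e.g., graph size or path length); the helper names below are hypothetical and are not taken from our released code.

```python
# Minimal sketch of the data-mixing recipe (hypothetical helper names, not the
# code from the linked repo). Assumes a task generator with a difficulty knob.
import random
from typing import Callable, Dict, List

def build_training_set(
    make_instance: Callable[[int], Dict],  # your task generator (assumed to exist)
    hard_difficulty: int,                  # target difficulty the base model can't solve
    easier_difficulties: List[int],        # every easier level you have access to
    n_hard: int,
    n_easy_per_level: int,
    seed: int = 0,
) -> List[Dict]:
    """Mix all available easier instances with the hard target instances."""
    rng = random.Random(seed)
    data = [make_instance(hard_difficulty) for _ in range(n_hard)]
    for d in easier_difficulties:  # no need to hunt for one "perfect" difficulty
        data += [make_instance(d) for _ in range(n_easy_per_level)]
    rng.shuffle(data)              # interleave easy and hard prompts during RL
    return data

# Example usage with a dummy generator (a real one would emit graph-search prompts):
dummy_gen = lambda d: {"difficulty": d, "prompt": f"solve a difficulty-{d} instance"}
train_set = build_training_set(dummy_gen, hard_difficulty=10,
                               easier_difficulties=[2, 4, 6, 8],
                               n_hard=512, n_easy_per_level=128)
```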
<aside>
⚠️
Disclaimer
We focus our experiments on the graph search problem introduced in Bachmann et al. (2024). While we haven’t yet explored other task types, we believe the insights here are interesting and worth discussing!
</aside>
What can you do when you have zero rewards?
The community has spent an enormous amount of compute training LLMs with RL on tasks with verifiable rewards, such as math problems and code generation. Surprisingly, this works quite well, even though the reward signal is extremely sparse (outcome-based only). Much of this success stems from the strong capabilities that base LLMs already possess thanks to large-scale pre-training.
However, if a model is unable to solve a task even after thousands of attempts, performing RL will have no effect, since the gradients will be zero (there is nothing to reinforce).
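To see concretely why training stalls, consider a GRPO-style update, where each rollout’s reward is compared against the mean reward of its group: if every rollout fails, every advantage is exactly zero. A minimal illustration (not our training code):

```python
# Minimal illustration: with all-zero outcome rewards, GRPO-style group-relative
# advantages are all zero, so the policy gradient carries no signal.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Center (and scale) rewards within a group of rollouts for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_relative_advantages([0.0] * 8))          # all zeros -> nothing to reinforce
print(group_relative_advantages([0.0] * 7 + [1.0]))  # one success -> non-zero signal
```

A single successful rollout is enough to produce a non-zero advantage, which is the signal that mixing in easier instances can provide.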
This highlights an inherent assumption when applying RL to language models: the model must have a reasonable probability of solving the task.
This raises the question:
What can one do if there are zero rewards due to no successful rollouts being sampled by the base model?
To address the above problem, one could:
- Option 1: Supervised Finetuning (SFT) on successful traces: One approach is to perform SFT on successful trajectories from a stronger model, giving the base model non-zero success rates before applying RL. While this works well in practice (we tried it on the graph search problem, where it proved helpful), it can be restrictive, since the RL phase is then biased toward the SFT dataset distribution. Still, due to its simplicity and effectiveness, it is widely used! For now, let’s assume we can’t do this.
- Option 2: Densifying Rewards: Alternatively, one could apply reward shaping (Setlur et al. 2025) to obtain dense rewards that provide a learning signal based on the quality of intermediate steps, even when outcome rewards are zero. One could also explore approaches that improve credit assignment (Kazemnejad et al. 2024) for intermediate reasoning steps. (A toy sketch of what a densified reward could look like appears after this list.)
- Option 3: Encouraging Diversity: One could also modify the objective to incentivize the model to sample diverse responses during RL finetuning (Chow et al. 2025), with the hope that at least one of these responses gets a non-zero reward and thereby kick-starts RL training.
However, both Options 2 and 3 fail on this simple graph task in the zero-outcome-reward scenario, as we discuss below. ☹️
- Option 4: Use a stronger base model: Of course, you could just start with a bigger or better model that already generates some successful traces, which RL can then build on. But that’s boring 😅, so we’ll assume we can’t do that either.
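For intuition on Option 2, here is a toy example of what a densified reward could look like for a graph-search-style task. This is not the reward from Setlur et al. (2025); it is a hedged sketch that assumes the model emits a node path we can check edge by edge against the ground-truth graph.

```python
# Toy illustration of "densifying rewards" (Option 2) for a graph-search-style task.
# NOT the reward from Setlur et al. (2025): just a sketch assuming the model emits a
# node path that we can verify edge by edge against the ground-truth graph.
from typing import List, Set, Tuple

def outcome_reward(path: List[int], goal: int, edges: Set[Tuple[int, int]]) -> float:
    """Sparse reward: 1 only if every step is a valid edge and the path ends at the goal."""
    steps = list(zip(path, path[1:]))
    valid = bool(steps) and all((u, v) in edges for u, v in steps)
    return float(valid and path[-1] == goal)

def dense_reward(path: List[int], goal: int, edges: Set[Tuple[int, int]]) -> float:
    """Partial credit: fraction of the path forming a valid prefix, plus a goal bonus."""
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    valid_prefix = 0
    for u, v in steps:
        if (u, v) not in edges:
            break
        valid_prefix += 1
    reached_goal = valid_prefix == len(steps) and path[-1] == goal
    return 0.5 * valid_prefix / len(steps) + 0.5 * float(reached_goal)

edges = {(0, 1), (1, 2), (2, 3)}
print(outcome_reward([0, 1, 5], goal=3, edges=edges))  # 0.0  -> no learning signal
print(dense_reward([0, 1, 5], goal=3, edges=edges))    # 0.25 -> partial credit for the valid prefix
```

As Figure 1 shows, this kind of intermediate signal was not enough to unlock training on the hard task in our experiments, which is why we benchmark these approaches explicitly.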