Jatin Prakash* (NYU), Anirudh Buvanesh* (MILA) (*order decided through np.random.randint(2))

September 13, 2025

If your base model has a zero success rate, performing RL with outcome rewards won’t do anything. What can you do then? TL;DR: simply adding easy samples to your training dataset can unlock RL training.
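
To make this concrete, here is a minimal sketch of what "mixing in easy samples" means; the prompt lists, mixing ratio, and seed below are hypothetical placeholders rather than the exact setup from our experiments:

```python
import random

random.seed(0)

# Hypothetical stand-ins for the two difficulty buckets; in our setting these would be
# graph-search problems that are harder vs. easier for the base model to solve.
hard_prompts = [f"hard graph-search problem {i}" for i in range(1000)]  # ~0% base success rate
easy_prompts = [f"easy graph-search problem {i}" for i in range(1000)]  # non-zero base success rate

easy_fraction = 0.25  # illustrative mixing ratio, not a tuned value
n_easy = int(easy_fraction * len(hard_prompts))

# The entire intervention: the RL prompt pool is the hard problems plus a slice of easy ones,
# so some rollouts earn non-zero outcome rewards and the gradients stop being all zero.
train_prompts = hard_prompts + random.sample(easy_prompts, n_easy)
random.shuffle(train_prompts)
```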

Figure 1: Approaches such as naïve RL (Dr. GRPO), reward densification (Progress-Reward), credit assignment (VinePPO), and diversity incentives (BoN Aware Finetuning) fail to solve a hard task. A simple intervention of mixing easier samples helps unlock RL training.

<aside> ⚠️

Disclaimer

We focus our experiments on the graph search problem introduced in Bachmann et al. (2024). While we haven’t yet explored other task types, we believe the insights here are interesting and worth discussing!

</aside>

What can you do when you have zero rewards?

The community has spent enormous amounts of compute training LLMs with RL on tasks with verifiable rewards, such as math problems and code generation. Surprisingly, this works quite well even though the reward signal is extremely sparse (outcome-based only). Much of this success stems from the strong capabilities that base LLMs already possess thanks to large-scale pre-training.

However, if a model is unable to solve a task even after thousands of attempts, performing RL will have no effect, since the gradients will be zero (there is nothing to reinforce).
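
To see why, note that group-based methods such as GRPO (and Dr. GRPO) compute each rollout’s advantage relative to the other rollouts sampled for the same prompt, so if every rollout receives the same reward (here, all zeros), every advantage is zero and the policy-gradient update vanishes. A toy numeric sketch, with made-up rewards:

```python
import numpy as np

def group_advantages(rewards):
    # GRPO-style group baseline: each rollout's advantage is its reward minus the
    # group mean (GRPO additionally divides by the group std; Dr. GRPO drops that
    # division, but the conclusion below is the same either way).
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# 8 rollouts on a hard prompt, none of which reach the goal node.
print(group_advantages([0, 0, 0, 0, 0, 0, 0, 0]))  # -> all zeros: nothing to reinforce

# If even one rollout succeeds (e.g., because the prompt is easier),
# the advantages become non-zero and RL finally has something to push on.
print(group_advantages([0, 0, 0, 0, 0, 0, 0, 1]))  # -> [-0.125, ..., 0.875]
```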

This highlights an inherent assumption when applying RL to language models: the model must have a reasonable probability of solving the task.

This raises the question:

What can one do if there are zero rewards due to no successful rollouts being sampled by the base model?

To address the above problem, one could:

  1. Option 1: Supervised Finetuning (SFT) on successful traces: One approach is to perform SFT on successful trajectories from a stronger model, giving the base model a non-zero success rate before applying RL (a minimal sketch follows this list). While this works well in practice (we tried it on the graph search problem, where it proved helpful), it can be restrictive, since the subsequent RL phase is biased toward the SFT dataset’s distribution. Still, due to its simplicity and effectiveness, it is widely used! For now, let’s assume we can’t do this.
  2. Option 2: Densifying Rewards: Alternatively, one could apply reward shaping (Setlur et al. 2025) to obtain dense rewards that provide a learning signal based on the quality of intermediate steps, even when outcome rewards are zero. One could also explore approaches that improve credit assignment (Kazemnejad et al. 2024) for intermediate steps in reasoning.
  3. Option 3: Encouraging Diversity: One could also modify the objective to incentivize the model to sample diverse responses during RL finetuning (Chow et al. 2025), with the hope that at least one of these responses gets a non-zero reward and thereby kick-starts RL training.
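
As promised above, here is a minimal sketch of Option 1 (SFT on successful traces). The toy prompts, teacher_generate, and is_successful below are hypothetical stand-ins for a stronger model’s sampler and the task’s outcome verifier, not code from our experiments:

```python
import random

random.seed(0)

# Hypothetical stand-ins: in the graph-search setting, `prompts` would be search problems,
# `teacher_generate` would sample traces from a stronger model, and `is_successful`
# would check whether a trace actually reaches the goal node.
prompts = [f"graph-search problem {i}" for i in range(100)]

def teacher_generate(prompt, n_samples=16):
    return [f"candidate trace {j} for {prompt}" for j in range(n_samples)]

def is_successful(prompt, trace):
    return random.random() < 0.2  # pretend the teacher solves ~20% of attempts

# Keep only verified-successful traces; these become the SFT targets that give the
# base model a non-zero success rate before outcome-reward RL starts.
sft_pairs = []
for prompt in prompts:
    for trace in teacher_generate(prompt):
        if is_successful(prompt, trace):
            sft_pairs.append({"prompt": prompt, "completion": trace})
            break  # one successful trace per prompt is enough for this sketch

print(f"collected {len(sft_pairs)} SFT examples")
```

Standard next-token fine-tuning on these pairs is all that happens before RL; the downside, as noted above, is that the RL phase then inherits the SFT dataset’s distribution.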

However, both Options 2 and 3 fail on this simple graph task in the zero-outcome-reward scenario, as we discuss below. ☹️

  4. Option 4: Use a stronger base model: Of course, you could just start with a bigger or better model that already generates some successful traces, which RL can then build on. But that’s boring 😅, so we’ll assume we can’t do that either.