Some Perspectives on Reinforcement Learning Research
Author: Aviral Kumar
A couple of months back, I moved to CMU to start my lab. We are interested in very diverse areas, from foundation models to “old-fashioned” RL algorithms to robotic learning. This mix has led to some research at the intersection of these topics, with some unexpected connections between seemingly unrelated fields. In this post, I’ll share some of the themes we have explored this year and these connections--both within the lab and (more so) with external collaborators (at Google and elsewhere)--and offer a glimpse into the exciting challenges that remain. While I’ll skip technical details, I’ll provide links to papers for interested readers.
Research methodology. Over the past year, our research methodology has shifted towards first building empirical understanding, and then solving the core bottlenecks we observe in a scalable yet simple way. Going forward, I believe this methodology is important: unlike traditional ML / RL research, where we were asked to innovate solely on methods while using the same training dataset or environment, the modern research paradigm offers flexibility in modifying the environment or dataset. While this flexibility enables innovative methods (e.g., our RL for reasoning work), modifying data collection often proves to be a simpler way to solve a challenge, to the point that nuanced algorithmic ideas can be rendered moot. To give an example, a distribution shift problem that one would typically address by designing regularizers can, in some problem domains, be solved much more simply by collecting on-policy data to train on. This is in contrast to much RL work on distribution shift, which assumes nothing about the data at all and strives to build fully general techniques. We are recognizing that the data or the environment can drastically change the ordering among techniques. As a result, we are increasingly focused on building general principles and insights for solutions, which can then be instantiated by systematically modifying the data while keeping the algorithm unchanged, modifying the algorithm while keeping the data the same, or both. Our research reflects this philosophy: rather than building new algorithmic ideas for their own sake, we strive to answer “why” questions, understand the bottlenecks that exist, and develop simple, scalable solutions that target them precisely. I will give an overview of some of the themes we’ve worked on next.
Deep RL Algorithms Research
As we shift towards larger scales in deep RL, addressing computational efficiency, optimization, and scalability is becoming as critical as sample efficiency. One could take this perspective to the extreme and argue that, at least in some domains, as models of the world get better, the cost of sample efficiency—typically tied to expensive environment interaction—may become less significant, and compute efficiency may be the only factor that matters. That said, this paradigm doesn't apply universally, especially in real-world domains like robotics, where physical constraints remain unavoidable.
Nonetheless, it is undeniable that running RL at scale requires algorithms to be scalable and computationally efficient. This in turn means that the “deep learning mechanics” of deep RL cannot be ignored. A key property we want is “predictability”: performance benefits observed in small-scale experiments—with limited data, small model sizes, and few training steps—should reliably extrapolate to large scales. In our past research, we have found it hard to imagine new methods that satisfy all of these desiderata, since it is not fundamentally clear how deep RL algorithm design should be modified to balance them. As a result, our work along these lines has gravitated towards empirical understanding, followed by simple techniques that build on what we learn. Concretely, we have been working on developing deep RL methods that strike a good balance between compute efficiency, sample efficiency, and performance. I will give examples below:
Scaling value function learning objectives
Building on our work from ICLR 2023, which demonstrated how to scale value-based RL to ~100M parameters, we’ve made some progress towards scaling value function learning. Our ICML work, led by Jesse Farebrother, shows that a simple yet effective way to scale up value-based RL is to use a classification cross-entropy loss for training value functions. Since then, we've observed excellent performance with this approach in several follow-up studies and have found it to be an important component for enabling predictable behavior from value-based RL. The classification loss scales well, simplifies tuning, and interacts well with large transformer models that have been co-designed with negative log-likelihood losses. That said, it is also pretty clear that this approach is not the final piece in scaling value-based RL, and there’s a lot more to be done, perhaps starting with the specific form of the cross-entropy loss itself.
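To make the idea concrete, here is a minimal sketch (my own simplification, not the exact recipe from the paper) of a classification-style value loss: the scalar TD target is projected onto a fixed set of value bins as a "two-hot" distribution, and the critic is trained with cross-entropy rather than mean-squared error. The value range, bin count, and omission of terminal handling are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

V_MIN, V_MAX, NUM_BINS = -10.0, 10.0, 51            # assumed value support
bin_centers = torch.linspace(V_MIN, V_MAX, NUM_BINS)

def two_hot(target: torch.Tensor) -> torch.Tensor:
    """Project scalar targets (batch,) onto the discrete support as (batch, NUM_BINS) probabilities."""
    target = target.clamp(V_MIN, V_MAX)
    idx = (target - V_MIN) / (V_MAX - V_MIN) * (NUM_BINS - 1)   # fractional bin index
    lower, upper = idx.floor().long(), idx.ceil().long()
    upper_w = (idx - lower.float()).unsqueeze(1)                # weight on the upper bin
    probs = torch.zeros(target.shape[0], NUM_BINS)
    probs.scatter_(1, lower.unsqueeze(1), 1.0 - upper_w)
    probs.scatter_add_(1, upper.unsqueeze(1), upper_w)
    return probs

def classification_td_loss(logits, reward, next_logits, discount=0.99):
    """Cross-entropy TD loss; `logits`/`next_logits` are critic outputs over the value bins."""
    with torch.no_grad():
        next_v = (F.softmax(next_logits, dim=-1) * bin_centers).sum(-1)
        td_target = reward + discount * next_v      # terminal masking omitted for brevity
    return F.cross_entropy(logits, two_hot(td_target))
```

Smoother projections (e.g., Gaussian histogram targets) can be used in place of the two-hot targets; the key ingredient is training the critic with a cross-entropy objective.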
Data scaling and bottlenecks
As mentioned above, we should be designing RL algorithms that extrapolate effectively to large scales along many axes. Some of our work last year specifically aims to understand offline RL algorithms along the axis of data scaling. We wanted to ensure that the utility of a new offline RL algorithm isn’t overshadowed by simply collecting N× more expert data. To do so, in work led by Seohong Park, we studied where current offline RL algorithms become bottlenecked as data scales. This work uncovered a surprising insight: policy training, not value function training, becomes the bottleneck as data scales with current algorithms. Even though most work (including my PhD work) attempts to fix value function training in RL, we observed that with current offline RL algorithms, policy training slows down performance improvements much more, suggesting that the direction of “steepest” improvement in RL may now lie in rethinking policy training. This insight directly inspired some of the ideas on policy learning below.
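To illustrate the kind of controlled comparison this finding points to (a hypothetical diagnostic of mine, not the paper's code), one can freeze a learned value function and swap only the policy-extraction objective as the dataset grows, for instance comparing advantage-weighted regression against a behavior-regularized policy gradient:

```python
import torch

# Assumption: `policy(obs)` returns a torch.distributions.Distribution over the
# full action vector (so log_prob has shape (batch,)); q_fn / v_fn are frozen critics.

def awr_extraction_loss(policy, q_fn, v_fn, obs, act, temperature=1.0):
    # Advantage-weighted regression: weighted behavioral cloning of dataset actions.
    with torch.no_grad():
        weights = torch.exp((q_fn(obs, act) - v_fn(obs)) / temperature).clamp(max=100.0)
    return -(weights * policy(obs).log_prob(act)).mean()

def bc_regularized_pg_loss(policy, q_fn, obs, act, alpha=1.0):
    # Behavior-regularized policy gradient: maximize Q on the policy's own actions,
    # with a behavioral-cloning term that keeps the policy close to the data.
    dist = policy(obs)
    new_act = dist.rsample()                        # reparameterized action sample
    return (-q_fn(obs, new_act) - alpha * dist.log_prob(act)).mean()
```

If performance under one extraction objective keeps improving with more data while the other saturates, that points to policy extraction, rather than the value function, as the limiting factor.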
Scaling policy learning objectives
Given a good value function, the next question is how to extract a policy from it. In work led by Max Sobol Mark, we show that a strong value function (from Cal-QL in this case) can be transformed into a good policy with a new supervised learning update. While the method bears similarities to prior approaches (including some of our earlier work), this supervised learning update is notably more scalable and stable. With it, we obtained one of the first results autonomously fine-tuning OpenVLA (a 7B generalist robot policy) on a real robot, which underscores the method’s potential for bridging large-scale learning systems with real-world applications. Looking ahead, I believe rethinking policy learning from value functions with scalability as a first-class priority holds significant promise. Approaches ranging from simple test-time reranking (inspired by LLM verification) to more advanced fine-tuning methods all seem promising, and this rethinking could open new doors for more robust and scalable policy learning frameworks.
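As a rough illustration of what such a supervised extraction step could look like (a sketch under my own assumptions, not the exact procedure from the paper): sample candidate actions from the current policy, score them with a frozen Q-function, and regress the policy onto the best-scoring candidate.

```python
import torch

def supervised_policy_extraction_step(policy, q_fn, obs, num_candidates=16):
    # Assumption: `policy(obs)` returns a torch.distributions.Distribution over the
    # full action vector; `q_fn(obs, act)` is a frozen critic returning (batch,) values.
    batch = obs.shape[0]
    with torch.no_grad():
        candidates = policy(obs).sample((num_candidates,))            # (K, batch, act_dim)
        q_values = torch.stack([q_fn(obs, a) for a in candidates])    # (K, batch)
        best = q_values.argmax(dim=0)                                 # (batch,)
        best_actions = candidates[best, torch.arange(batch)]          # (batch, act_dim)
    # Supervised update: maximize the likelihood of the value-reranked actions.
    return -policy(obs).log_prob(best_actions).mean()
```

Because the policy itself is only ever trained with a likelihood objective, a recipe of this flavor can in principle be applied to policy classes that are awkward to differentiate through a critic, such as diffusion or large autoregressive policies.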
RL for Foundation Models
We have also explored how RL can serve as a critical tool for advancing foundation models. Unlike most research in this area, our research has focused on understanding why RL is well-suited for certain challenges (as opposed to simply using supervised fine-tuning), where it breaks, and identifying key areas of innovation for designing RL formulations and tooling. We specifically focus on problems that require learning new capabilities, which are not attained by scaling supervised fine-tuning alone. In this process, we arrived at the following takeaways: (a) as foundation models evolve and we run out of high-quality and diverse text data, training on on-policy (“synthetic”) data becomes crucial, and it is important to use RL methods to train on such data effectively, since imitating suboptimal or synthetic data is not enough; and (b) RL especially excels at teaching models procedures for “how” to accomplish tasks rather than spoon-feeding them with “what” is correct, but doing so needs perspectives from multi-turn RL and new formulations of exploration and meta RL. The promise is still about RL converting data into dynamic, adaptive behaviors — a capability that supervised training alone cannot enable. This analysis has also led us to intriguing new research questions, intricately tied to concepts in offline RL, exploration, and meta RL, but viewed through the unique lens of foundation models and their challenges. I will discuss some of our work below.
Why is RL useful for foundation models?
In a paper led by Charlie Snell, we demonstrated that fine-tuning models to explicitly use test-time compute can be better than scaling up pre-training compute, a result that was later replicated by HuggingFace. From a broad perspective, our analysis reveals the importance of fine-tuning models to use additional tokens and compute at test time effectively: given the same total compute budget, this improves over scaling pre-training with negative log-likelihood on Internet tokens. This result serves as evidence that algorithmic research on fine-tuning can be useful. It also hints that alternative approaches to training on raw internet token data—beyond next-token prediction—could unlock even greater potential.
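For intuition about why this trade-off can favor test-time compute, here is a back-of-the-envelope sketch using the standard ~6ND training and ~2N-per-token inference FLOP approximations; every number below (model sizes, token counts, query volume) is hypothetical and not from the paper.

```python
def train_flops(params, train_tokens):
    return 6 * params * train_tokens                       # common ~6*N*D approximation

def infer_flops(params, tokens_per_query, num_queries):
    return 2 * params * tokens_per_query * num_queries     # ~2*N FLOPs per generated token

small, large = 7e9, 70e9           # hypothetical parameter counts
pretrain_tokens = 2e12             # hypothetical pretraining corpus size
queries = 1e8                      # hypothetical lifetime query volume
base_toks, extra_toks = 512, 8 * 512                       # 8x extra test-time tokens for the small model

small_total = train_flops(small, pretrain_tokens) + infer_flops(small, base_toks + extra_toks, queries)
large_total = train_flops(large, pretrain_tokens) + infer_flops(large, base_toks, queries)
print(f"small model + extra test-time compute: {small_total:.2e} FLOPs")
print(f"large model, single pass:              {large_total:.2e} FLOPs")
```

Under these made-up numbers, the small model spends roughly an order of magnitude less total compute even while generating 9x the tokens per query, which is exactly the regime where spending compute well at test time can beat simply training a bigger model.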
Given that fine-tuning is useful, why RL specifically? We have consistently noticed the superiority of RL over imitation learning (behavioral cloning, or BC) on reasoning problems. Although generating a reasoning trace for a given question could essentially be posed as a “one-step” bandit problem with no environment to interact with, we see that treating it as a sequential Markov decision process (MDP) over steps with deterministic dynamics, and optimizing each step using tools from sequential RL (value functions, suboptimal data, etc.), proves significantly more effective. This is because a solution trace exhibits combinatorial structure over “atomic” steps and is not simply monolithic. In addition, the induced Markov process admits many diverse paths and its dynamics are fully reversible, which tilts the balance in favor of RL. I believe this general principle will hold in multiple domains beyond reasoning or LLMs.
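A minimal sketch of this step-level MDP view (the names and conventions here are illustrative): the state is the prompt plus the steps produced so far, an action appends one step, the transition is deterministic concatenation, and the reward is sparse and terminal.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class ReasoningState:
    prompt: str
    steps: Tuple[str, ...]      # reasoning steps generated so far

def transition(state: ReasoningState, step: str) -> ReasoningState:
    # Deterministic dynamics: appending a step fully determines the next state.
    return ReasoningState(state.prompt, state.steps + (step,))

def reward(state: ReasoningState, is_correct: Callable[[ReasoningState], bool]) -> float:
    # Sparse terminal reward: 1 if the finished trace produces the right answer.
    # (Using "Final answer:" as the terminal marker is an illustrative convention.)
    is_terminal = bool(state.steps) and state.steps[-1].startswith("Final answer:")
    return float(is_terminal and is_correct(state))
```

Viewing generation this way is what lets value functions assign credit to individual steps and lets suboptimal traces still contribute useful sub-trajectories, rather than treating the entire trace as a single bandit action.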
Concretely, in work led by Amrith Setlur, we found that offline RL is 8x more data-efficient than supervised fine-tuning (SFT) in terms of the number of prompts required to attain a similar test error rate. Our follow-up work then extended this idea to use process reward models (PRMs) or value functions with online RL, attaining 3-4x compute efficiency and 6-7x sample efficiency over offline RL (which itself is 8x more data-efficient than BC). While how to make RL outperform BC by an even larger margin, and under what structural conditions it does so, remains an open question, RL now consistently outperforms imitation on many LLM problems (which did not seem obvious previously).
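As a hedged sketch of how a process signal of this flavor can be constructed (an illustration of mine, not the exact construction from these papers): estimate how much a candidate step changes the chance that a prover policy eventually solves the problem from the current prefix, and use that change as a dense per-step bonus on top of the outcome reward. `rollout_success_rate` below is an assumed helper, not a real API.

```python
def step_progress(prompt, steps_so_far, new_step, rollout_success_rate, n_rollouts=8):
    # Monte Carlo estimate of "progress": P(success | prefix + step) - P(success | prefix),
    # where success probabilities come from completions sampled by a prover policy.
    before = rollout_success_rate(prompt, steps_so_far, n_rollouts)
    after = rollout_success_rate(prompt, steps_so_far + [new_step], n_rollouts)
    return after - before          # > 0 means the step made eventual success more likely

def shaped_step_reward(outcome_reward, progress, beta=0.5):
    # Dense per-step signal: sparse outcome reward plus a weighted progress bonus.
    return outcome_reward + beta * progress
```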
Learning adaptive algorithms or “how” to tackle queries via RL
We also built RL methods for teaching models how to learn and adapt on the fly, which is especially important in open-ended deployment scenarios where we want models to give a problem their best shot and, if they don't succeed, to express their uncertainty gracefully. More recently, this capability has also been described as “thinking more” rather than resorting to guesswork. In two works, one led by Vincent Zhuang, Rishabh Agarwal, and Yi Su at Google DeepMind (SCoRe), and one led by Yuxiao Qu at CMU (RISE), we show that one can use multi-turn RL and DAgger, respectively, to train models to implement such algorithmic behaviors. In the context of hard reasoning problems, this translates to finding ways to use more tokens (i.e., test-time compute) wisely to improve the solution.
While our specific focus in this line of work has been on self-correction, there are some broad principles that I think will apply to general algorithm-learning settings: (1) the need for techniques that incentivize the desired behavior indirectly rather than through direct reward maximization alone (we used reward shaping and cloning of suboptimal data, but this could be done better; see the sketch below), (2) learning in a multi-stage fashion with DAgger-style imitation learning or with on-policy RL, and (3) most importantly, the need to learn from dynamically-changing data as opposed to static data. There are still many open questions we need to study, including conceptual ones: when is learning algorithms useful as opposed to simply learning to guess, and what is a scalable learning paradigm for algorithm learning?
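As a toy illustration of principle (1) (a simplification of mine, not the exact objective from either paper), consider a two-attempt self-correction setup where the reward for the second attempt includes a bonus for improving over the first; this discourages the degenerate strategy of producing a decent first attempt and never actually revising it.

```python
def self_correction_reward(first_correct: bool, second_correct: bool, alpha: float = 1.0) -> float:
    # Reward the final answer, plus a shaping bonus for turning a wrong first
    # attempt into a correct second one (and a penalty for breaking a correct one).
    outcome = float(second_correct)
    improvement = float(second_correct) - float(first_correct)
    return outcome + alpha * improvement

# A trajectory that fixes a wrong first attempt earns more than one that merely
# repeats an already-correct answer, so "no-op" edits are not the best strategy.
print(self_correction_reward(False, True))   # 2.0
print(self_correction_reward(True, True))    # 1.0
```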
Building RL machinery for foundation models
So far, I have focused on new RL formulations that can enable foundation models to do cool stuff. We have also been building RL machinery to optimize these problems at scale. In work led by Yifei Zhou (ArCHer), we took a first-principles approach to adapting ideas from off-policy RL with value functions to language models. This resulted in a hierarchical framework for designing RL tools to fine-tune language models—an approach that extends on-policy RLHF methods while retaining their scalability and tuning benefits. Our method is about 100x more sample-efficient than PPO on LLM agent tasks, and is better than pure token-level RL or simple critic-based re-ranking alone.
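Here is a hedged sketch of this hierarchical decomposition (illustrative interfaces, not the exact ArCHer implementation): a high-level critic is trained off-policy with TD at the granularity of whole utterances (turns), and its value estimates then supply the advantage for a token-level policy-gradient update within each utterance.

```python
import torch
import torch.nn.functional as F

def utterance_critic_loss(q_fn, target_q_fn, obs, utterance, reward, next_obs, next_utterance, gamma=0.99):
    # High level: off-policy TD learning over utterances, not individual tokens.
    # q_fn / target_q_fn are assumed callables returning (batch,) value estimates.
    with torch.no_grad():
        td_target = reward + gamma * target_q_fn(next_obs, next_utterance)
    return F.mse_loss(q_fn(obs, utterance), td_target)

def token_policy_loss(token_log_probs, utterance_advantage):
    # Low level: every token in an utterance shares that utterance's advantage
    # and is updated with a policy-gradient step.
    # token_log_probs: (batch, num_tokens); utterance_advantage: (batch,)
    return -(utterance_advantage.detach().unsqueeze(-1) * token_log_probs).mean()
```

The intent of the split is that off-policy value learning stays tractable at the utterance level, while the token-level actor update keeps the familiar mechanics of on-policy RLHF-style fine-tuning.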
However, testing the efficacy of multi-turn RL tooling for fine-tuning foundation models is constrained by the lack of meaningful, real-world benchmarks. Many existing benchmarks are either overly simulated and toy-like, or designed primarily for evaluation rather than realistic application. Therefore, in this line of work, we chose to stress-test and scale our approaches on just one task, but ensured that this task operates at real user scale and is interesting enough to guide future algorithmic research. In DigiRL, a project led by Yifei Zhou and Hao Bai, we developed RL algorithms for transforming VLMs into effective policies for Android device control, a real-world domain characterized by non-stationarity and stochasticity. For instance, our DigiRL agent can complete various tasks specified through natural language instructions, seamlessly interacting with mobile devices and the web. Building on these efforts, we’ve extended our work to fully offline RL approaches, with more on this front coming soon.
That is it! Thank you for reading thus far! I hope you found this post interesting and insightful. If you have any feedback, please feel free to reach out! I would like to thank Yuxiao Qu, Amrith Setlur, Yifei Zhou, Paul Zhou, Seohong Park, Vincent Zhuang, and Yi Su for their feedback on this post.