Training Reasoning Agents in Interactive, Complex Environments
Chatbots can make quick work of routine e-commerce customer service tasks and information retrieval.
Sephora’s Smart Skin Scan, for example, provides personalized product recommendations, while Lowe’s Mylow answers home improvement questions. The Law Center for Better Housing’s Rentervention service helps Illinois renters diagnose their legal housing issues, understand their rights, and explore solutions via the virtual assistant Renny.
Language agents, an emerging class of AI systems built on large language models (LLMs), can think out loud, ask clarifying questions, and revise plans over time. Unlike static, reactive chatbots, these systems are designed for dynamic, goal-directed action, such as coding, planning, teaching, or conducting research.
“While reinforcement learning in large language models has enabled major breakthroughs in static tasks like math reasoning, training them as interactive agents that can reason, plan, and adapt to unpredictable feedback during complex, multi-turn scenarios remains a core challenge in AI,” said Zihan Wang, a PhD student in computer science at Northwestern Engineering.

To address that challenge, Wang and his collaborators first developed a general framework called State-Thinking-Actions-Reward Policy Optimization (StarPO), which enables stable, scalable training of LLMs through trajectory-level reinforcement learning (RL). StarPO regulates the learning objective, defining how language agents optimize their reasoning process over multiple turns rather than one response at a time.
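To make the idea concrete, here is a minimal sketch of a trajectory-level policy-gradient update in the spirit of StarPO. It is not the team's implementation: the toy policy, the placeholder environment feedback, and all hyperparameters are illustrative assumptions. The key point is that the optimization signal is the return of the whole multi-turn rollout, not a single turn.

```python
# Illustrative REINFORCE-style update at the trajectory level: the full
# multi-turn rollout is scored as one unit. ToyPolicy stands in for an LLM;
# rewards and state transitions are random placeholders, not a real task.
import torch
import torch.nn as nn

class ToyPolicy(nn.Module):
    """Stand-in for an LLM: maps a state vector to action logits."""
    def __init__(self, state_dim: int = 4, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                                 nn.Linear(32, n_actions))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def rollout(policy: ToyPolicy, horizon: int = 5):
    """Collect one multi-turn trajectory of log-probs and rewards."""
    log_probs, rewards = [], []
    state = torch.zeros(4)
    for _ in range(horizon):
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        rewards.append(torch.randn(()))   # placeholder environment feedback
        state = torch.randn(4)            # placeholder next state
    return torch.stack(log_probs), torch.stack(rewards)

policy = ToyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for step in range(100):
    log_probs, rewards = rollout(policy)
    ret = rewards.sum()                          # trajectory-level return
    loss = -(log_probs.sum() * ret.detach())     # credit the whole rollout
    opt.zero_grad()
    loss.backward()
    opt.step()
```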
Then they built RAGEN, a benchmark system for evaluating StarPO in controlled, stochastic environments. RAGEN serves both as the execution backend for StarPO and as a platform for studying stability, generalization, and learning dynamics when training reasoning agents.
“For the broader field, this work opens the door to more interactive, autonomous agents that self-evolve — not just be prompted — to reason,” Wang said. “In the long term, this could power more helpful assistants, collaborative AI tools, and adaptive decision-makers in dynamic environments.”
Through RAGEN, the team evaluated LLMs in three compact, symbolic gaming environments (Bandit, Sokoban, and Frozen Lake) that pose distinct reasoning challenges: reward exploration, long-horizon planning with memory, and acting under uncertainty and stochasticity.
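As a rough illustration of the kind of compact, stochastic task involved, here is a minimal multi-armed Bandit environment in the conventional gym-style reset/step pattern. The class, its arm payout probabilities, and the interface are assumptions for illustration; this is not RAGEN's actual environment API.

```python
# A toy Bandit environment: rewards are stochastic, so an agent must
# explore to discover which arm pays off most often.
import random

class BanditEnv:
    def __init__(self, arm_means=(0.2, 0.5, 0.8), seed=0):
        self.arm_means = arm_means        # hidden payout probability per arm
        self.rng = random.Random(seed)

    def reset(self) -> str:
        return f"Choose an arm in [0, {len(self.arm_means) - 1}]"

    def step(self, action: int):
        # Stochastic reward drawn from the chosen arm's hidden probability.
        reward = 1.0 if self.rng.random() < self.arm_means[action] else 0.0
        done = True                       # single-turn episode
        return "episode over", reward, done

env = BanditEnv()
obs = env.reset()
_, reward, _ = env.step(2)                # arm 2 pays off most often
```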
Their findings reveal core challenges and design principles for stable agent RL training.
“We found that naïvely applying RL to LLM agents often leads to what we call ‘reasoning collapse,’ where the model stops exploring or thinking deeply,” Wang said.
The team discovered a training failure mode they call the “Echo Trap,” in which the agent learns to simply repeat earlier thoughts or mimic successful past behaviors, leading to plateaued performance. After identifying this issue, the researchers designed a stabilized variant, StarPO-S, with modular updates that improve learning robustness: rollout filtering, gradient shaping, and critic baselining.
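One of those stabilizers, rollout filtering, can be sketched briefly: train only on prompts whose sampled rollouts disagree with each other, since groups that uniformly succeed or uniformly fail carry little learning signal and tend to reinforce repetition. The function name, data shapes, and keep fraction below are illustrative assumptions, not the paper's exact procedure.

```python
# Keep only the most "uncertain" rollout groups, ranked by reward variance.
import statistics

def filter_rollout_groups(groups, keep_fraction=0.25):
    """groups: list of (prompt, [reward per sampled rollout]) pairs.
    Returns the groups with the highest reward standard deviation."""
    ranked = sorted(groups,
                    key=lambda g: statistics.pstdev(g[1]),
                    reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

groups = [
    ("puzzle-a", [1.0, 1.0, 1.0, 1.0]),   # saturated: filtered out
    ("puzzle-b", [0.0, 1.0, 0.0, 1.0]),   # uncertain: kept for the update
    ("puzzle-c", [0.0, 0.0, 0.0, 0.0]),   # uniformly failing: filtered out
]
kept = filter_rollout_groups(groups)       # -> [("puzzle-b", ...)]
```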
“We demonstrated that StarPO-S helps agents stay curious and achieve better reasoning quality and higher task success rates,” Wang said.
Moving forward, the team plans to scale StarPO to larger models and longer interaction horizons. They will also expand RAGEN to support more complex tasks, such as web navigation or scientific reasoning. RAGEN is available as an open-source benchmark so other researchers can systematically study multi-turn reasoning.
“This project grew out of a shared interest in reasoning and reinforcement learning across multiple institutions,” Wang said. “We connected with collaborators with unique perspectives — from systems and language modeling to cognitive science. It’s been a highly interdisciplinary and rewarding experience.”
Within three months of its code release, RAGEN attracted 1,700 followers on GitHub, reflecting broad interest in stable language agent training. Li noted that RAGEN has also inspired other RL training frameworks, such as Search-R1 and Agent-R1.
The McCormick School of Engineering research team also includes Yiping Lu, assistant professor of industrial engineering and management sciences; CS PhD students Kangrui Wang, Qineng Wang, and Pingyue Zhang; and CS undergraduate students Eli Gottlieb and Kefan Yu. Project partners include Licheng Liu (Imperial College London); Lijuan Wang and Zhengyuan Yang (Microsoft); Kyunghyun Cho (New York University); Minh Nhat Nguyen (Singapore Management University); Yejin Choi, Li Fei-Fei, Monica Lam, and Jiajun Wu (Stanford University); and Linjie Li (University of Washington).