This post examines the generalization challenge for LLM agents, highlighting the gap between benchmark performance and real-world usability. It introduces the MiniMax M2 approach: aligning with benchmarks to build skills, while aligning with users to ensure those skills remain effective in practical applications.
Aligning to What? Rethinking Agent Generalization in MiniMax M2
The Real Agent Alignment Problem: Benchmarks or Reality?
If you've worked with LLM Agents, you've felt this pain: the same model can feel brilliant in one framework and useless in another. An agent might crush a tool-use leaderboard but fail spectacularly at a simple, real-world task. This gap between benchmark performance and practical usability is one of the biggest challenges in the field.
When we designed M2, we knew we had to tackle this problem head-on. This led us to two core, and sometimes conflicting, objectives:
1. Align with benchmarks, to build strong, measurable agent skills.
2. Align with users, to make sure those skills hold up in real-world workflows.
So, who do we align with? The answer is both. We align with benchmarks to build skill, but we must ultimately align with the user by ensuring those skills work everywhere.
While the methods for acing benchmarks are a deep topic for another day, I want to focus on that second, trickier objective: How do we train an agent for the wild?
The Need for Interleaved Thinking
Early in the project, we hit a frustrating wall. Agent performance was inconsistent, and we struggled to diagnose why. After many discussions, especially with Professor @Junxian He and @Wenhu Chen, we arrived at our first major conclusion: Agents require Interleaved Thinking.
This means that an agent's internal monologue—its "thinking"—can and should happen at any point during a task, not just once at the beginning like a standard reasoning model. This design matters for two reasons: the agent can revise its plan after every tool result, and its reasoning trail becomes part of its working memory for later steps.
This principle became a cornerstone of M2's effectiveness.
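To make the idea concrete, here is a minimal sketch of an interleaved-thinking agent loop. All names here (`run_agent`, `call_model`, `execute_tool`) and the message schema are illustrative stand-ins, not part of any real M2 API; the stub model just demonstrates that thinking steps can occur before *and* after tool calls, and that every step stays in the history.

```python
# Minimal sketch of an interleaved-thinking agent loop (illustrative only).

def call_model(history):
    # Stub standing in for an LLM call. A real model would decide, at
    # every step, whether to think, call a tool, or give a final answer.
    step = len([m for m in history if m["role"] == "assistant"])
    if step == 0:
        return {"role": "assistant", "type": "think", "content": "Plan: search first."}
    if step == 1:
        return {"role": "assistant", "type": "tool_call", "tool": "search", "args": "M2"}
    if step == 2:
        return {"role": "assistant", "type": "think", "content": "Results look relevant."}
    return {"role": "assistant", "type": "answer", "content": "Done."}

def execute_tool(name, args):
    # Stub tool execution; a real scaffold would dispatch to actual tools.
    return {"role": "tool", "content": f"{name}({args}) -> 3 results"}

def run_agent(task):
    history = [{"role": "user", "content": task}]
    while True:
        msg = call_model(history)
        history.append(msg)  # thinking steps are kept in the history too
        if msg["type"] == "tool_call":
            history.append(execute_tool(msg["tool"], msg["args"]))
        elif msg["type"] == "answer":
            return history

transcript = run_agent("Summarize recent M2 feedback")
```

The key property is that "think" messages interleave freely with tool calls in the transcript, rather than being confined to a single reasoning block at the start.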
Pro Tip for M2 Users: Because M2 relies on Interleaved Thinking, its context is its memory. For best performance, you must retain the full session history, including the thinking steps. We've noticed that much of the community feedback about performance gaps stems from accidentally discarding this vital context, which is a common practice with simpler reasoning models.
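The pitfall the tip describes can be sketched in a few lines. The message schema below is again illustrative, not MiniMax's actual API; the point is simply that a common history-pruning pattern (stripping reasoning before the next turn, which is harmless for simpler models) silently deletes M2's working memory.

```python
# Sketch: why discarding thinking steps hurts a model that relies on
# interleaved thinking. Message schema is illustrative only.

def prune_thinking(history):
    # Common (and, for M2, harmful) pattern: drop reasoning before the next turn.
    return [m for m in history if m.get("type") != "think"]

def next_turn_context(history, keep_thinking=True):
    # For M2-style models, pass the *full* session history back on every
    # turn, including the interleaved thinking steps.
    return history if keep_thinking else prune_thinking(history)

history = [
    {"role": "user", "content": "What changed in config.yaml?"},
    {"role": "assistant", "type": "think", "content": "Need to diff the file first."},
    {"role": "assistant", "type": "tool_call", "content": "read_file('config.yaml')"},
    {"role": "tool", "content": "timeout: 30 -> 60"},
    {"role": "assistant", "type": "answer", "content": "The timeout doubled to 60s."},
]

full = next_turn_context(history)                          # memory intact
pruned = next_turn_context(history, keep_thinking=False)   # reasoning trail lost
```

On the next turn, the pruned variant no longer contains the rationale behind the earlier answer, which is exactly the degradation pattern described above.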
True Generalization is About Perturbation
Our initial theory was simple: tool scaling is agent generalization.
We started with a minimal set of tools (a Python interpreter, a search engine, and a browser) to build a baseline of tool-calling capability. The roadmap was clear: scale up the number and variety of tools, and the agent's ability to generalize to unseen tools would naturally follow.
At first, this worked. Our benchmark scores climbed to respectable levels. But as we dug deeper, we realized we were solving the wrong problem. The model aced the tests, but if we changed the environment even slightly—like swapping to a different scaffolding framework—its performance would plummet. We were still far from our goal of a "practically useful" model.
This led to our second, more profound realization: Agent generalization is not just about adapting to new tools; it's about adapting to perturbations across the model's entire operational space.

This sounds abstract, so let's break it down. Think about everything that can change in a single agent task: the scaffolding framework driving the loop, the set and naming of available tools, the system prompt, the format of tool outputs, and the environment's responses along the way.
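One way to operationalize this view is a perturbation harness: render the same underlying task through many surface variations and check that agent behavior is stable across them. The sketch below is a toy version; the templates and tool names are made up for illustration and do not reflect our actual training or evaluation setup.

```python
from itertools import product

# Toy perturbation harness: one task, many "scaffolds". Templates and
# tool namings are invented for illustration.

TEMPLATES = [
    "You are a helpful agent. Task: {task}",
    "TASK:\n{task}\nRespond with tool calls when needed.",
    "<system>agent-mode</system>\n{task}",
]

TOOL_NAMINGS = [
    {"search": "search"},      # canonical name
    {"search": "web.find"},    # renamed tool, same semantics
    {"search": "SearchTool"},  # different naming convention
]

def variants_for(task):
    """Enumerate every template x tool-naming combination for one task."""
    return [
        {"prompt": template.format(task=task), "tools": naming}
        for template, naming in product(TEMPLATES, TOOL_NAMINGS)
    ]

variants = variants_for("Find the latest M2 release notes")
# 3 templates x 3 tool namings = 9 surface variants of one underlying task
```

An agent that only memorized one scaffold will solve some variants and fail others; an agent that generalizes across perturbations solves the task regardless of how it is dressed up.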
What's Next?
Our work on M2 taught us an immense amount about agents, generalization, and data, but it has opened up more questions than it answered. Many of our ideas are still on the whiteboard. In the coming months, we will be exploring these frontiers even more deeply, and we can't wait to bring you the next generation of powerful and genuinely useful models.
Community
"Pro Tip for M2 Users: Because M2 relies on Interleaved Thinking, its context is its memory. For best performance, you must retain the full session history, including the thinking steps. We've noticed that much of the community feedback about performance gaps stems from accidentally discarding this vital context, which is a common practice with simpler reasoning models. "
Is this the reason I observe output degradation in subsequent rounds in Chatbox+M2 API? Questions answered correctly in the first turn sometimes become incorrect.