PostTrainBench: Evaluating AI Agents' Ability to Post-Train Language Models
PostTrainBench is a new benchmark that measures how effectively AI agents can improve the performance of base large language models (LLMs) through post-training. The study evaluates agents under fixed GPU-access and time constraints and observes their behavior and performance across a range of benchmarks.
PostTrainBench
Measuring how well AI agents can post-train language models
Can AI agents improve the performance of base LLMs? We give each agent 4 small target LLMs, an H100 GPU, and 10 hours to post-train them.
Leaderboard
1 The average is taken across all post-trained LLMs (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B IT) and benchmarks (AIME 2025, BFCL, GPQA Main, GSM8K, HumanEval). For each run, we ask a CLI agent to maximize the performance of a specific base LLM on a specific benchmark.
2 "Human Post-Trained" is not directly comparable to the rest since it usually exceeds the 10h + 1 GPU constraint.
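The headline leaderboard number described in footnote 1 is a plain mean over the full model × benchmark grid. A minimal sketch of that averaging (the scores below are placeholders for illustration, not real results):

```python
from statistics import mean

MODELS = ["Qwen 3 1.7B", "Qwen 3 4B", "SmolLM3-3B", "Gemma 3 4B IT"]
BENCHMARKS = ["AIME 2025", "BFCL", "GPQA Main", "GSM8K", "HumanEval"]

def leaderboard_average(scores):
    """Mean accuracy over every (model, benchmark) run for one agent.

    `scores` maps (model, benchmark) -> accuracy in [0, 1];
    every cell of the 4 x 5 grid must be present.
    """
    return mean(scores[(m, b)] for m in MODELS for b in BENCHMARKS)

# Placeholder scores, for illustration only.
fake_scores = {(m, b): 0.5 for m in MODELS for b in BENCHMARKS}
print(leaderboard_average(fake_scores))  # → 0.5
```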
More agents coming soon...
Detailed Breakdown by Benchmark
Average Time Spent
Time taken by each agent to complete post-training (out of 10 hours). Agents demonstrate varying levels of persistence: some give up well before the time limit expires.
Pipeline
Evaluation Benchmarks
Post-trained models are evaluated across these benchmarks to measure improvement in reasoning, knowledge, and problem-solving capabilities
About
PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B IT), access to an H100 GPU, and a 10-hour time limit to improve model performance through post-training.
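The setup above decomposes into independent runs, one per (base model, benchmark) pair per agent. A sketch of that run matrix (the `Run` class and function names are illustrative, not from the benchmark's codebase):

```python
from dataclasses import dataclass
from itertools import product

BASE_MODELS = ["Qwen 3 1.7B", "Qwen 3 4B", "SmolLM3-3B", "Gemma 3 4B IT"]
BENCHMARKS = ["AIME 2025", "BFCL", "GPQA Main", "GSM8K", "HumanEval"]

@dataclass(frozen=True)
class Run:
    """One evaluation run: an agent post-trains one base model for one benchmark."""
    agent: str
    base_model: str
    benchmark: str
    gpu: str = "H100"
    time_limit_hours: int = 10

def runs_for(agent):
    """One run per (base model, benchmark) pair: 4 x 5 = 20 runs per agent."""
    return [Run(agent, m, b) for m, b in product(BASE_MODELS, BENCHMARKS)]
```

At 20 runs with a 10-hour ceiling each, a single agent's full evaluation can consume up to 200 GPU-hours.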
Experimental Setup
Observations
Agent Behaviors
Time & Trace Patterns
Agents were given time limits ranging from 3 to 10 hours, and their behaviors varied significantly:
Reward Hacking (Near Misses)
Claude found that Qwen/Qwen3-1.7B (the instruct-tuned version) works "perfectly" for function calling. However, it then explicitly acknowledged:
All agents showed awareness of contamination rules:
Key Takeaways
Dataset quality > training duration: GPT-5.1-codex-max's success came from careful dataset curation, not longer training
Constraint awareness: Almost all agents demonstrated an understanding of the rules and avoided benchmark contamination
Self-correction: Claude caught and backed off from its own attempt at reward hacking via model substitution
Library issues: Many errors came from library version mismatches (trl, transformers)
Format alignment matters: For function calling, matching exact output format was essential for high scores
Longer traces ≠ better results: GPT-5.1-codex had longest traces but inconsistent results; GPT-5.1-codex-max had shorter traces but better outcomes
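The format-alignment takeaway can be made concrete: function-calling benchmarks typically score by parsing the model's output into a structured call, so any formatting deviation scores zero even when the intended call is right. A hypothetical strict checker in this spirit (not BFCL's actual grader):

```python
import json

def matches_expected_call(model_output, expected):
    """Exact structural match: same function name and same arguments.

    Returns False on any formatting deviation (surrounding prose,
    malformed JSON, wrong argument values, ...).
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])
```

Under a scorer like this, a model that wraps a correct call in conversational text ("Sure! Here is the call: ...") gets the same zero as a model that calls the wrong function, which is why matching the exact output format mattered so much for the agents' scores.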
Team
Citation
If you found PostTrainBench useful, please cite us as: