PostTrainBench: Evaluating AI Agents' Ability to Post-Train Language Models
PostTrainBench is a new benchmark that measures how effectively AI agents can improve the performance of base large language models (LLMs) through post-training. The study evaluates agents under fixed GPU-access and time constraints and observes their behavior and performance across a range of benchmarks.
PostTrainBench
Measuring how well AI agents can post-train language models
Can AI agents improve the performance of base LLMs? We give each agent 4 small target LLMs, an H100 GPU, and 10 hours to post-train them.
Leaderboard
1 The average is taken across all post-trained LLMs (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B IT) and benchmarks (AIME 2025, BFCL, GPQA Main, GSM8K, HumanEval). For each run, we ask a CLI agent to maximize the performance of a specific base LLM on a specific benchmark.
2 "Human Post-Trained" is not directly comparable to the rest since it usually exceeds the 10h + 1 GPU constraint.
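The headline leaderboard number described in footnote 1 is a plain mean over the full model × benchmark grid. A minimal sketch of that averaging (the scores below are placeholders for illustration, not real results):

```python
from statistics import mean

MODELS = ["Qwen 3 1.7B", "Qwen 3 4B", "SmolLM3-3B", "Gemma 3 4B IT"]
BENCHMARKS = ["AIME 2025", "BFCL", "GPQA Main", "GSM8K", "HumanEval"]

def leaderboard_average(scores):
    """Mean accuracy over every (model, benchmark) run for one agent.

    `scores` maps (model, benchmark) -> accuracy in [0, 1];
    every cell of the 4 x 5 grid must be present.
    """
    return mean(scores[(m, b)] for m in MODELS for b in BENCHMARKS)

# Placeholder scores, for illustration only.
fake_scores = {(m, b): 0.5 for m in MODELS for b in BENCHMARKS}
print(leaderboard_average(fake_scores))  # → 0.5
```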
More agents coming soon...
Detailed Breakdown by Benchmark
Average Time Spent
Time taken by each agent to complete post-training (out of 10 hours). Agents demonstrate varying levels of persistence: some give up well before the time limit expires.
Pipeline
Evaluation Benchmarks
Post-trained models are evaluated across these benchmarks to measure improvement in reasoning, knowledge, and problem-solving capabilities
About
PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B IT), access to an H100 GPU, and a 10-hour time limit to improve model performance through post-training.
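The setup above decomposes into independent runs, one per (base model, benchmark) pair per agent. A sketch of that run matrix (the `Run` class and function names are illustrative, not from the benchmark's codebase):

```python
from dataclasses import dataclass
from itertools import product

BASE_MODELS = ["Qwen 3 1.7B", "Qwen 3 4B", "SmolLM3-3B", "Gemma 3 4B IT"]
BENCHMARKS = ["AIME 2025", "BFCL", "GPQA Main", "GSM8K", "HumanEval"]

@dataclass(frozen=True)
class Run:
    """One evaluation run: an agent post-trains one base model for one benchmark."""
    agent: str
    base_model: str
    benchmark: str
    gpu: str = "H100"
    time_limit_hours: int = 10

def runs_for(agent):
    """One run per (base model, benchmark) pair: 4 x 5 = 20 runs per agent."""
    return [Run(agent, m, b) for m, b in product(BASE_MODELS, BENCHMARKS)]
```

At 20 runs with a 10-hour ceiling each, a single agent's full evaluation can consume up to 200 GPU-hours.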
Experimental Setup
Observations
Agent Behaviors
Time & Trace Patterns
Agents were given time limits ranging from 3 to 10 hours, and their behaviors varied significantly:
Reward Hacking (Near Misses)
Claude found that Qwen/Qwen3-1.7B (the instruct-tuned version) works "perfectly" for function calling. However, it then explicitly acknowledged:
All agents showed awareness of contamination rules:
Key Takeaways
Dataset quality > training duration: GPT-5.1-codex-max's success came from careful dataset curation, not longer training
Constraint awareness: Almost all agents demonstrated an understanding of the rules and avoided benchmark contamination
Self-correction: Claude caught and backed off from its own attempt at reward hacking via model substitution
Library issues: Many errors came from library version mismatches (trl, transformers)
Format alignment matters: For function calling, matching exact output format was essential for high scores
Longer traces ≠ better results: GPT-5.1-codex had longest traces but inconsistent results; GPT-5.1-codex-max had shorter traces but better outcomes
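The format-alignment takeaway can be made concrete: function-calling benchmarks typically score by parsing the model's output into a structured call, so any formatting deviation scores zero even when the intended call is right. A hypothetical strict checker in this spirit (not BFCL's actual grader):

```python
import json

def matches_expected_call(model_output, expected):
    """Exact structural match: same function name and same arguments.

    Returns False on any formatting deviation (surrounding prose,
    malformed JSON, wrong argument values, ...).
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])
```

Under a scorer like this, a model that wraps a correct call in conversational text ("Sure! Here is the call: ...") gets the same zero as a model that calls the wrong function, which is why matching the exact output format mattered so much for the agents' scores.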
Team
Citation
If you found PostTrainBench useful, please cite us as: