
Show HN: PlayTheAI – Test AI models in strategy games (open beta)
PlayTheAI has launched the public beta of its platform, letting users test AI models including GPT, Claude, and Grok in strategy games. Early observations highlight the behavior and potential limitations of specific models in games such as Mastermind and Connect Four.

The Reality Check for AI
Dynamic games against real humans. No memorizable tests. No optimizable benchmarks. Just play.
AI Insights
Observations from our Open Beta - 1/5/2026
GLM-4.7: Tool-Call Chaos and Repetition Loop
Zhipu's GLM-4.7 (a Chinese model based on Tsinghua research) showed a double failure in Mastermind. After some initial progress (getting 2 black pegs with 'RRBY'), it regressed to repeating 'RRGB' seven times straight. Worse, the raw responses showed corrupted tool-call syntax: '<tool_call><tool_call>make<tool_call>make_guess' - repeated fragments instead of proper JSON. This suggests issues with both feedback integration AND output formatting.
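For readers unfamiliar with tool-call formats, here is a minimal sketch of the difference, assuming a hypothetical `make_guess` tool (PlayTheAI's actual tool schema is not published). A harness can only act on a single well-formed JSON object, so the fragment soup quoted above yields no move at all:

```python
import json

# Well-formed call a harness could act on. The `make_guess` tool name and
# argument shape are assumptions for illustration; the platform's real
# schema is not published.
well_formed = '{"name": "make_guess", "arguments": {"colors": ["R", "R", "B", "Y"]}}'

# The corrupted output quoted above: repeated tag fragments, no JSON object.
corrupted = "<tool_call><tool_call>make<tool_call>make_guess"

def parse_tool_call(raw: str):
    """Return the parsed call, or None if it is not a usable tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return call if "name" in call and "arguments" in call else None

print(parse_tool_call(well_formed))  # {'name': 'make_guess', 'arguments': {...}}
print(parse_tool_call(corrupted))    # None - the harness has no move to apply
```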
Claude Haiku 4.5: Budget Model, Same Bug
Anthropic's most affordable Claude model showed the same Mastermind weakness as its premium siblings. Claude Haiku 4.5 guessed 'RGBY' ten times consecutively, receiving '0 black, 2 white pegs' each time - clear feedback that zero positions are correct. At $0.0025 per guess, this 11-move match cost only $0.026, but it demonstrated that the feedback-integration issue persists across Claude's entire price range. The pattern suggests an architectural limitation rather than a capacity one.
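To see why the repetition is so wasteful, here is a minimal sketch of standard Mastermind scoring. The real match's secret code is unknown; the one below is a hypothetical choice that reproduces the '0 black, 2 white' feedback. Since feedback is deterministic, repeating a guess can never yield new information:

```python
from collections import Counter

def score_guess(secret: str, guess: str) -> tuple[int, int]:
    """Standard Mastermind feedback: black pegs for right color in the
    right position, white pegs for right color in the wrong position."""
    black = sum(s == g for s, g in zip(secret, guess))
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return black, overlap - black

# Hypothetical secret chosen to reproduce the observed feedback; 'O' here
# stands for a fifth color, e.g. orange.
secret = "GROO"
print(score_guess(secret, "RGBY"))  # (0, 2) - matches '0 black, 2 white'

# Feedback is deterministic, so repeating 'RGBY' ten times returns the
# identical result ten times and eliminates nothing.
print(score_guess(secret, "RGBY") == score_guess(secret, "RGBY"))  # True
```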
Connect4 Specialists: Vision Mode Shows Advantage
Connect4 reveals interesting differences between input modes. Grok 4.1 Fast Vision sits at 1048 ELO with a 25% win rate (8 matches), while GPT-4o Text posts 1052 ELO with a 33% win rate (6 matches). The pattern suggests vision input may help models grasp the vertical stacking of Connect4 - seeing the actual grid rather than parsing text coordinates. Both models also outperform their counterparts in other games.
Gemini 3 Flash Preview: TicTacToe Contender
Google's preview model Gemini 3 Flash (Vision) shows promising TicTacToe performance with 1064 ELO - second only to Claude 3.5 Haiku's 1104 ELO in that game. With only 0.75 illegal moves per match (vs. 2-3 for most competitors) and a 12.5% win rate across 8 matches, it demonstrates solid rule-following capability. For a model still carrying a 'preview' designation, this suggests Google's next-gen models may improve significantly on spatial reasoning tasks.
The Gap Between Benchmarks and Reality
Traditional AI benchmarks report near-human or superhuman performance. Yet these static tests can be optimized for during training, producing scores that don't reflect true generalization.
PlayTheAI measures what benchmarks can't: dynamic reasoning against unpredictable human opponents. No memorizable patterns. No optimizable test sets.
We test standard (non-reasoning) models — the same models claiming 90%+ on logic benchmarks. In our tests, many of them show single-digit win rates against average humans in simple strategy games.
What we find interesting: the AI receives the complete game history in every prompt — all previous moves with feedback. It's not a memory problem. Many models appear to have difficulty drawing logical conclusions from the information in front of them — though sometimes they succeed, whether through a spark of reasoning or pure luck.
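As a rough illustration of what "complete game history in every prompt" means, here is a minimal sketch of how such a prompt might be assembled; the wording and structure below are assumptions, since PlayTheAI's actual prompt template is not public:

```python
# Hypothetical turn log mirroring the Claude Haiku match described above.
history = [
    {"guess": "RGBY", "feedback": "0 black, 2 white"},
    {"guess": "RGBY", "feedback": "0 black, 2 white"},  # the repeated move
]

def build_prompt(history: list[dict]) -> str:
    lines = ["You are playing Mastermind. All previous guesses and feedback:"]
    for i, turn in enumerate(history, 1):
        lines.append(f"{i}. {turn['guess']} -> {turn['feedback']}")
    lines.append("Choose your next guess.")
    return "\n".join(lines)

# Every turn is re-sent verbatim on each call, so a repeated guess reflects
# a reasoning failure, not a missing-context problem.
print(build_prompt(history))
```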
If a model needs minutes of "extended thinking" to play Tic-Tac-Toe, that itself is notable. True generalization shouldn't require special reasoning modes for tasks children master intuitively.
Why We Test Instant-Response Models
Real AI systems don't get 30 seconds to think. We benchmark the models that actually power production systems — where milliseconds matter.
Robotics & Drones
A robotic arm can't pause mid-motion. Drones need real-time image analysis. Factory automation requires instant decisions.
Autonomous Vehicles
Self-driving cars make life-or-death decisions in milliseconds. There's no time for "extended thinking" at 120 km/h.
Financial Trading
In high-frequency trading, milliseconds equal millions. AI trading systems need instant pattern recognition.
Gaming & NPCs
Players expect immediate responses. A 30-second thinking pause breaks immersion and gameplay.
Customer Service
Chatbots handling millions of queries can't afford slow reasoning. Speed and cost efficiency are essential.
Reasoning models are impressive — but they solve a different problem. We test what matters for deployment: instant intelligence.
How It Works
1. Choose a Game
5 games testing strategy, logic, language, and spatial reasoning.
2. Random AI Match
Fair blind matchmaking. Model revealed after game ends.
3. Live Rankings
Transparent Elo ratings. See which AI performs best in real scenarios.
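For reference, rankings like these typically use the textbook Elo update; below is a minimal sketch assuming the standard formula with K=32 (PlayTheAI's exact parameters are not published):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> float:
    """Return player A's new rating. score_a is 1 for a win, 0.5 for a
    draw, 0 for a loss. Standard logistic expectation on a 400 scale."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    return rating_a + k * (score_a - expected_a)

# A 1048-rated model losing to a 1000-rated human drops about 18 points:
print(round(elo_update(1048, 1000, 0.0)))  # 1030
```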
Choose Your Game
Select a game and challenge the AI. Each game tests different skills.
Connect Four
Connect 4 in a row to beat the AI
Tic Tac Toe
Classic 3x3 strategy game
Word Duel
Guess the 5-letter word before running out of attempts
Mastermind
Break the secret 4-color code before running out of attempts
Battleship
Hunt and sink the AI fleet before it sinks yours
Coming Soon
Poker
Texas Hold'em - bluff the AI or see through its bluff
Trivia
Test your knowledge against the AI
Rummikub
Form groups and runs with numbered tiles
Showcase Your Model's Real-World Performance
Get your AI model benchmarked against real humans. Dynamic games, transparent Elo rankings, competitive insights.
Real-World Benchmarks
Your model tested against real humans in dynamic games. No training data contamination.
Performance Reports
Monthly analytics: win rates, Elo trends, comparison to competitors.
API Access
Export data, integrate analytics into your dashboards via JSON API.
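As a rough illustration of what JSON API integration could look like, here is a minimal sketch; the endpoint URL and response fields are purely hypothetical assumptions, not PlayTheAI's documented API:

```python
import json
import urllib.request

# Hypothetical endpoint and fields, shown only to illustrate the
# integration pattern; consult PlayTheAI for the real API documentation.
URL = "https://playtheai.com/api/v1/models/your-model/stats"  # hypothetical

with urllib.request.urlopen(URL) as resp:
    stats = json.load(resp)  # assumed shape: {"elo": ..., "win_rate": ...}

print(stats.get("elo"), stats.get("win_rate"))
```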

© 2026 SW · PlayTheAI · Open Beta · v0.1.789
