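Pushing Frontier AI to Its Limits

Hacker News · 3 months ago

The author reflects on the rapid evolution of AI, particularly large language models (LLMs) and AI agents, noting the shift from traditional machine learning to API integration and the emergence of new techniques such as prompt engineering and tool calling. The author emphasizes the growing ability of AI agents to carry out complex tasks, which signals a new normal for software development.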

Pushing Frontier AI to Its Limits

My last post was more than 14 months ago, right around the time the LLM hype exploded: AI workflows, AI agents, and so on. I stayed silent for a while, busy watching everyone get their minds blown by how LLMs could solve LeetCode problems, up to 80% of the hard ones, or how RAG could change how traditional chatbots work. People with ML backgrounds didn't quite accept that building AI is now just OpenAI API integration - something any developer can do. The beauty of data science used to lie in playing with data, feature engineering, model tuning, etc.

But then so many new AI applications became useful. New techniques and new "tasks" emerged around them: prompt engineering, token optimization, creating MCPs for existing apps, tool calling, etc. The models just got so much better - we gave them tools to push their capacity beyond pure reasoning. People used to complain about LLMs hallucinating on outdated data. Now LLMs without web search or reasoning or MCPs are just ... weird.

You and I can't ignore that anymore. I started with small stuff: creating a UDF that calls OpenAI to process a pandas dataset, building an MCP on top of ClickHouse, then using AI agents and building things more seriously. There are thousands of models out there now, from large to small, closed to open weight. The coding agents are now really good, I could say. I built LLM workflows, played with MCP, deployed vector databases, RAG, etc.

Coding agents control the terminal. I'm not writing code or even reading it - I'm watching them work instead. I test their results, tell them what I expect tests to look like to keep them focused, and build skills to teach them specific tasks. This is the new normal, I guess.

I've tried over a hundred models and tools in the past year: GitHub Copilot from the early days, Tabnine, v0.dev, Codex, Claude Code, Cursor, Windsurf, opencode, n8n + AI Agent node, and code review tools like CodeRabbit, Greptile, Sourcery, etc., plus dozens of models from gpt-4o, Claude, Gemini, Grok, Mistral, DeepSeek, Qwen, MiniMax, GLM, etc., both free and paid.
I can't tell you which one is "the best" because they'll be legacy by next week. When choosing a framework for AI applications, there are tons of options: LangChain, LangGraph, OpenAI Agents SDK (then Claude Agent SDK came along and was better), Cloudflare Agents, Vercel AI SDK. The competition never ends. Maybe 90% of AI projects are just wrapping LLM APIs - most don't ship anything real. A few stand out, some become worth millions and turn into the next big thing, but most of them are just demos or POCs. I have no idea.

While people are still scared of vibe coding, I ship it to production. For me, AI agents are no longer just tools for learning or asking questions about your codebase - they're fully capable of producing production-grade code if you plug them into the right tools and give them good instructions. My top language on WakaTime is now markdown, damn. Things change fast. Your model gets stuck today, but tomorrow someone releases something better. You have an idea, someone builds a product around it, and it gets killed or goes legacy some random morning.

I didn't stop writing. There are tons of drafts in my Obsidian, none published because they became outdated before I could finish them. I want to kick off this first 2026 post as my digital garden - a place to reflect on what I'm thinking and doing in this LLM era. This post will be updated from time to time.

Top on my list

Claude Code

Claude Code is still the king among all the coding agents I've tried. I've used Cursor, Codex, Antigravity, Gemini CLI, Droid, Roo Code, Kilo Code, Kiro, etc. None of them can beat Claude Code in my opinion. But I suggest you try all of them if you can - use a different one for each side project.

It just works - not only for coding, but for understanding complex systems, refactoring, writing docs, doing homework, planning travel, summarizing news, fixing your system, etc. "90% of code in Claude Code is written by itself" - How Claude Code is built. It's a general-purpose AI agent. Interestingly, it wasn't originally designed for coding. It started as Boris's side project.

The idea for Claude Code came from a command-line tool that used Claude to display what music an engineer was listening to at work. It spread like wildfire at Anthropic after being given access to the filesystem. Today, Claude Code has its own fully-fledged team.

The shift from Copilot or Cursor (back in early 2025) to coding agents is like going from autocomplete to having other developers on your team. It's more like having teammates who do their own work, not a pair programmer grabbing your keyboard. They work on their own - I just review results, give feedback when asked, and honestly still can't believe this works. Your mindset changes from "I need to write good code" to "I need to write good prompts and build good skills". Most code in my GitHub repos is now generated without me writing a single line. I just prompt, watch, and test.

duyet.net gets updated automatically by Claude Code overnight with a custom Claude wrapper - my experiment to see how far Claude Code can go. Sometimes it researches new designs, sometimes it breaks the website, but it's fun to see. The script looks something like this:

The prompt.md file contains the task list and instructions. Claude reads it, executes, and updates the state for each loop. For more advanced use cases, check out Claude Code + Ralph Loop - it runs non-stop sessions that consume tasks while you can prompt it to read state or a TODO.md file on the fly.
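
A loop of this kind can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the actual wrapper used for duyet.net: it assumes the `claude` CLI is on the PATH and that `claude -p` runs one non-interactive turn and exits. The names `build_command` and `run_loop`, the iteration count, and the pause are hypothetical. The `runner` parameter exists only so the loop can be exercised without calling the real CLI.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List


def build_command(prompt: str) -> List[str]:
    # Assumption: `claude -p <prompt>` runs a single headless turn.
    # --dangerously-skip-permissions is the flag mentioned in this post.
    return ["claude", "-p", prompt, "--dangerously-skip-permissions"]


def run_loop(
    prompt_file: Path,
    iterations: int,
    pause_seconds: float = 0.0,
    runner: Callable[[List[str]], int] = subprocess.call,
) -> List[int]:
    """Run the agent `iterations` times and collect the exit codes."""
    exit_codes = []
    for _ in range(iterations):
        # Re-read the file on every pass so edits to the task list
        # (by you or by the agent) are picked up by the next run.
        prompt = prompt_file.read_text(encoding="utf-8")
        exit_codes.append(runner(build_command(prompt)))
        time.sleep(pause_seconds)
    return exit_codes


# Example (hypothetical values): ten passes, one minute apart.
# run_loop(Path("prompt.md"), iterations=10, pause_seconds=60)
```

Re-reading prompt.md on each pass is the design choice that matters here: any state has to live in files the agent can update, because each run starts with a fresh context.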

There's no one correct way to use Claude Code. The following sections are for anyone curious about how I use it - skip this if you're already familiar with Claude Code.

Claude Code Setup

I prefer disabling Auto-compact - it's slow, wastes 45.0k tokens (22.5% of the context window) on the buffer, and usually loses context. I use sub-agents when possible since they have their own context. Otherwise I run /export to the clipboard, then /clear and paste the previous content back. The export won't include thinking tokens or tool calls, so you save a lot and the model still tracks well.

I always work with --dangerously-skip-permissions - it's not as dangerous as you'd think.

My default MCPs are context7, sequential-thinking, and zread, though the list depends on the project I'm working on.

Parallel agents

Don't just generate code. Start leading a team of parallel agents and use background tasks for them.

I built a team-agents plugin: a coordinated agent team for parallel task execution, where a leader delegates to senior and junior agents. I keep the number of roles minimal, but you can add more for specific tasks. The high-level idea is to parallelize work while maintaining quality on the complex parts.
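
The delegation idea can be illustrated with a small Python sketch. It is an illustration of the pattern only, not how the team-agents plugin is implemented: `Task`, `delegate`, and the `senior` / `junior` callables are hypothetical names, and plain functions stand in for real agents.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    name: str
    is_complex: bool


Worker = Callable[[Task], str]


def delegate(
    tasks: List[Task],
    senior: Worker,
    junior: Worker,
    max_workers: int = 4,
) -> Dict[str, str]:
    """Leader step: route each task by complexity and run them in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            # Complex tasks go to the senior worker, the rest to the junior.
            task.name: pool.submit(senior if task.is_complex else junior, task)
            for task in tasks
        }
        # Collect results by task name once every worker has finished.
        return {name: future.result() for name, future in futures.items()}
```

For example, `delegate([Task("schema migration", True), Task("fix typo", False)], senior, junior)` sends the migration to `senior` and the typo to `junior`, running both at once.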

duyet/claude-plugins

https://github.com/duyet/claude-plugins: A collection of plugins I use for Claude Code, including skills, MCPs, commands, and hooks across all my machines and Claude Agent SDK apps. You might find something useful here. The sub-agents and skills in this repo keep results consistent across codebases - I use Claude Code to learn patterns and update them over time.

I started seeing AI engineers on X sharing their commands. I have a list of my own to make the workflow faster. This saves me from repeated prompting - some of the commands I use most:

Plan Mode

Plan mode performs significantly better than just prompting directly. When you give Claude time to think and plan first, the results are way more accurate. Less back-and-forth, fewer mistakes.

Hit shift+tab twice to enter Plan mode. I do this for most tasks and start a new session for each one.
Claude writes a plan file for you to review - keep adjusting until you're happy with it.

Once the plan is solid, Claude usually finishes the whole thing in one shot without asking questions.

Tip: If you're not clear about something, trigger the deep research agent first:

This helps Claude gather context before planning.

With a good plan, I usually don't do much here - just let it run.
You can open another Claude Code session to work on something else while waiting.

If things go off track, inject a prompt mid-way. Claude will catch up and keep going.

You can kick off background agents for specific tasks (research, small changes, refactoring) while working.

The Explanatory output style shows you why Claude made certain choices - useful for learning.

I use agents for review: @code-simplifier cleans up the code, @refactor or @testing for specific checks.

Claude Hooks save time here - auto-format, run linters, or custom verification.

CLAUDE.md, AGENTS.md

The first thing Claude does when starting a session is read your CLAUDE.md file. Most people ignore it, but it's actually really important. It keeps things consistent across sessions and saves time - Claude doesn't need to re-investigate your project setup every time.

A few tips:

AGENTS.md serves a similar purpose. If you use both Claude Code and other coding agents (like Codex, Cursor), create a symlink so they share the same instructions:

or put instructions in AGENTS.md (an open standard) and reference it from CLAUDE.md:

Claude Code reads CLAUDE.md, Codex reads AGENTS.md - you only maintain one.
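
The symlink option can be sketched in Python as below. This is a minimal sketch: the helper name `share_instructions` is hypothetical, and treating CLAUDE.md as the real file with AGENTS.md as the link is an assumption (swap the two arguments for the opposite layout).

```python
import os
from pathlib import Path


def share_instructions(
    project_dir: Path,
    source: str = "CLAUDE.md",   # assumption: this is the file you maintain
    alias: str = "AGENTS.md",    # assumption: this becomes the symlink
) -> Path:
    """Create `alias` as a symlink to `source` inside `project_dir`."""
    source_path = project_dir / source
    alias_path = project_dir / alias
    if not source_path.is_file():
        raise FileNotFoundError(f"{source_path} does not exist")
    if alias_path.exists() or alias_path.is_symlink():
        raise FileExistsError(f"{alias_path} already exists")
    # A relative target keeps the link valid if the repo is moved or cloned.
    os.symlink(source, alias_path)
    return alias_path
```

Calling `share_instructions(Path("."))` in a repo root leaves one file to edit, with both names resolving to the same content.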

Here's a snippet from my global ~/.claude/CLAUDE.md that applies to every project:

Interview Mode

For complex tasks, try my /interview plugin - it asks clarifying questions before you start planning. It helps catch missing requirements early.

Claude Code + Ralph Loop

The ralph-wiggum plugin is my favorite for long-running tasks or vibe coding on fun projects while I'm asleep. You define a goal condition and let the agent loop until it verifiably reaches that goal. With cheap Z.AI GLM 4.7 tokens, I can let it run 24/7. Run it with --permission-mode=dontAsk or --dangerously-skip-permissions.
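
The loop-until-goal idea can be sketched in a few lines of Python. This illustrates the pattern only, not the ralph-wiggum plugin's implementation. `loop_until_goal`, `step`, and `goal_reached` are hypothetical names: in practice `step` would start one agent run and `goal_reached` would be a check you trust, such as a passing test suite.

```python
from typing import Callable


def loop_until_goal(
    step: Callable[[int], None],
    goal_reached: Callable[[], bool],
    max_iterations: int,
) -> int:
    """Repeat `step` until `goal_reached()` is true; return the iteration count."""
    for iteration in range(1, max_iterations + 1):
        step(iteration)
        # The goal is verified after every pass instead of trusting the agent's
        # own claim that it is done.
        if goal_reached():
            return iteration
    # A hard cap keeps an unattended loop from running (and billing) forever.
    raise RuntimeError(f"goal not reached after {max_iterations} iterations")
```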

z_claude, mi_claude & or_claude

The good thing about Claude Code is that you can use it with alternative providers that offer the same Anthropic API interface. I've created some wrapper scripts for this:

You can start working with claude using Opus, then exit and continue the same session with z_claude --continue. Use mi_claude or or_claude the same way.
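
A wrapper of this kind can be sketched in Python as below. It is a minimal sketch, not the wrapper scripts referred to above. It assumes Claude Code picks up the provider endpoint and credential from the `ANTHROPIC_BASE_URL` and `ANTHROPIC_AUTH_TOKEN` environment variables. The function names and the `Z_CLAUDE_BASE_URL` / `Z_CLAUDE_TOKEN` variables are hypothetical, and the endpoint itself has to come from your provider's documentation.

```python
import os
from typing import Dict, List, Mapping


def provider_env(base_env: Mapping[str, str], base_url: str, token: str) -> Dict[str, str]:
    """Return a copy of `base_env` that points Claude Code at another provider."""
    env = dict(base_env)
    # Assumption: Claude Code reads these two variables at startup.
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_AUTH_TOKEN"] = token
    return env


def z_claude(args: List[str]) -> None:
    """Replace the current process with `claude`, using the alternative provider."""
    env = provider_env(
        os.environ,
        base_url=os.environ["Z_CLAUDE_BASE_URL"],  # hypothetical variable name
        token=os.environ["Z_CLAUDE_TOKEN"],        # hypothetical variable name
    )
    # execvpe hands the terminal over to claude, so the session stays interactive.
    os.execvpe("claude", ["claude", *args], env)
```

With this in place, `z_claude(["--continue"])` corresponds to `z_claude --continue`: the overrides apply only to that one process and leave your shell's normal Claude setup untouched.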

Claude Code (+ OpenRouter) on GitHub Actions

The best part is I'm running Claude GitHub Actions with OpenRouter at no cost by using free models. I have an OpenRouter preset that can switch between SOTA free models automatically.

I put together some reusable workflows at duyet/github-actions that other repos can call:

Check out the official documentation: Claude Code GitHub Actions.
Some use cases:

This way you can have Claude Code + OpenRouter free or cheap models running 24/7 for you. A lot of automation becomes possible: smart cronjobs, automated refactoring, documentation sync, etc. The AI does the boring stuff while you sleep.

opencode

If you want to try a good coding agent with nice UI/UX - opencode is really solid right now. Fast, simple, and it reads all your Claude config and plugins out of the box.

It connects to a lot of providers: Z.AI, OpenRouter, Codex, Claude, plus some free Zen models from their own provider. When I hit rate limits on one, I just switch to another. When Opus is overkill, I drop down to something cheaper.

You can save and share sessions - handy when you want to show someone how you solved something. They also have a native web UI now.

I suggest trying oh-my-opencode - it adds some powerful workflows on top of opencode:

opencode can also run headless on a remote machine (VM/CI runner/container) and your local CLI connects as a client. Handy for offloading heavy workloads to a beefy VM while you work from a laptop.

Series: Pushing Frontier AI to Its Limits

Reflect on what I'm thinking and doing in this LLM era