
This article describes how to use synthetic data generation and reinforcement learning with verifiable rewards to train an AI agent to safely operate a new command-line interface (CLI). The approach lets a reasoning model with no prior knowledge specialize in new agentic tasks, such as managing local servers and Dockerfiles.


How to Train an AI Agent for Command-Line Tasks with Synthetic Data and Reinforcement Learning



What if your computer-use agent could learn a new Command Line Interface (CLI)—and operate it safely without ever writing files or free-typing shell commands?

In Part 1 of our series on building a computer use agent, we built a custom Bash computer-use agent using NVIDIA Nemotron in just one hour. In this sequel, we’ll take it further by teaching the same reasoning model, with no prior knowledge, to safely operate the LangGraph Platform CLI. This shows how easily a large reasoning model can be specialized to perform new, agentic tasks.

Instead of simple file operations, our new agent will learn to start local servers, build containers, and generate Dockerfiles—entirely through a verifiable, human-in-the-loop command interface.

We’ll combine synthetic data generation (SDG) and Reinforcement Learning with Verifiable Rewards (RLVR), optimized via Group Relative Policy Optimization (GRPO), to make training both efficient and safe.

What you’ll build: a specialized agent to run a new CLI tool

You’ll fine-tune an AI agent that can:

- Translate natural-language requests into valid LangGraph CLI commands
- Start local dev servers, build containers, and generate Dockerfiles
- Defer every execution to a human-in-the-loop confirmation step

Here’s what a typical interaction looks like once the model is trained:

This pattern generalizes: The same workflow can be extended to support new CLI tools and environments.

Why use synthetic data generation and reinforcement learning to teach a new CLI?

Teaching an AI agent to operate a specialized CLI tool presents unique challenges that traditional approaches struggle with:

The data scarcity problem: Most specialized CLI tools lack the massive usage logs needed for conventional training. Unlike common shell commands, tools like LangGraph have specific syntax, flags, and workflows that aren’t well-represented in general training data. Waiting to collect real-world usage examples could take months or years.

The safety-accuracy tradeoff: You want your agent to be creative in understanding user intent, but absolutely precise when generating commands. A single typo or wrong flag could cause system errors or worse. Traditional fine-tuning often produces models that are either too conservative (refusing valid requests) or too permissive (hallucinating dangerous commands).

How SDG + RL solves this:

This approach is particularly powerful for enterprise environments where you might need to quickly adapt agents to proprietary internal tools without waiting for organic data collection.

Prerequisites

For this setup, you’ll need:

Hardware requirements:

Software requirements:

Core components:

Base model:

Check out a video version of this tutorial:

Video 1. Use SDG and RL to produce a LangGraph CLI Bash Agent.

Step 1: Design a synthetic dataset with NeMo Data Designer

Before training, we need data: pairs of natural-language requests mapped to LangGraph CLI invocations.

We’ll use the NVIDIA NeMo Data Designer to programmatically generate this dataset, starting from a handful of seed examples and expanding into hundreds of verified command pairs.
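The seed-and-expand idea can be sketched in plain Python. This is an illustration only, not the actual Data Designer API; the seed pairs and paraphrase templates are hypothetical:

```python
# Hypothetical seed pairs: natural-language request -> LangGraph CLI command.
SEEDS = [
    ("start the dev server", "langgraph dev"),
    ("build the container image", "langgraph build -t my-app"),
    ("generate a Dockerfile", "langgraph dockerfile Dockerfile"),
]

# Illustrative paraphrase templates that expand each seed into variants.
TEMPLATES = [
    "please {req}",
    "can you {req} for me?",
    "I need to {req} right away",
]

def expand(seeds, templates):
    """Turn a handful of seeds into many (request, command) records."""
    return [
        {"request": tpl.format(req=req), "command": cmd}
        for req, cmd in seeds
        for tpl in templates
    ]

dataset = expand(SEEDS, TEMPLATES)
print(len(dataset))  # 3 seeds x 3 templates = 9 records
```

A real pipeline would also use an LLM to paraphrase requests for diversity, then run every generated record through the validation step described in the next section.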

Why use synthetic data generation?

Think of synthetic data generation like teaching someone a new language by showing them a pattern, then having them create variations. Instead of collecting thousands of real examples (which might not exist yet), we:

The dataset structure

Each generated record includes:

The validation process

In Data Designer, we steer diversity with sampling parameters and reject any record that fails validation. For example, we might use a regex pattern like: `^langgraph\s+(dev|build|up|dockerfile)\b`

This ensures that:

Finally, we export the dataset in OpenAI-style messages format—ideal for RLVR fine-tuning with the open-source NVIDIA NeMo framework.
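A single exported record might look like this (the request and command values are illustrative):

```python
import json

# One training record in OpenAI-style messages format.
record = {
    "messages": [
        {"role": "system", "content": "You translate user requests into LangGraph CLI commands."},
        {"role": "user", "content": "start the local dev server"},
        {"role": "assistant", "content": "langgraph dev"},
    ]
}

print(json.dumps(record, indent=2))
```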

This validation process matters: It guarantees that the reward verifier (introduced later) will be consistent with the structure and syntax of the training data.
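A minimal sketch of the rejection-sampling check, using the regex pattern shown above:

```python
import re

# Every command must start with `langgraph` followed by an allowed subcommand.
COMMAND_PATTERN = re.compile(r"^langgraph\s+(dev|build|up|dockerfile)\b")

def is_valid(command: str) -> bool:
    """Reject any generated record whose command fails the pattern."""
    return COMMAND_PATTERN.match(command) is not None

print(is_valid("langgraph dev --port 2024"))  # True
print(is_valid("rm -rf /"))                   # False
print(is_valid("langgraph deploy"))           # False: subcommand not allowed
```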

Let’s look at the implementation in NeMo Data Designer.

Step 2: Fine-tune with RLVR (using GRPO)

With clean, verified data in hand, we move to fine-tuning using Unsloth, an open source framework for efficient reinforcement learning that integrates with NeMo Gym training environments.

Reinforcement Learning with Verifiable Rewards (RLVR)

Traditional reinforcement learning from human feedback (RLHF) is like having a panel of judges score each output—subjective, expensive, and inconsistent. RLVR replaces human judges with deterministic code-based verification.

Instead of asking humans “Does this command look good?”, we ask code “Does this command pass our validation rules?”

For a CLI agent, the verifier enforces rules such as:

The reward system:

✅ Valid command → +1 reward (encourages this behavior)
❌ Invalid command → −1 reward (discourages this behavior)
⚪ Ambiguous output → 0 reward (neutral, no reinforcement)
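A minimal sketch of such a verifier, reusing the allow-list regex from the dataset validation step:

```python
import re

ALLOWED = re.compile(r"^langgraph\s+(dev|build|up|dockerfile)\b")

def reward(output: str) -> int:
    """Deterministic verifier: the same output always yields the same reward."""
    command = output.strip()
    if not command:
        return 0   # ambiguous/empty output: no reinforcement either way
    if ALLOWED.match(command):
        return 1   # valid command: encourage
    return -1      # invalid command: discourage

print(reward("langgraph build -t my-app"))  # 1
print(reward("sudo rm -rf /"))              # -1
print(reward(""))                           # 0
```

Because the verifier is plain code, tightening or relaxing a rule is a one-line change rather than a reward-model retraining run.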

This consistency is crucial: The same output always yields the same reward, making training stable and predictable. And because the verifier is just code, you can adjust constraints anytime without retraining a separate reward model.

Building the training environment with NeMo Gym

NeMo Gym is an open source library for building reinforcement learning training environments for LLMs. It provides the infrastructure to define tools, execute agent actions, and compute verifiable rewards—exactly what we need for training a CLI agent.

The CLI agent environment is implemented as a NeMo Gym resource server, which encapsulates:

When the agent proposes commands, the resource server evaluates correctness and returns reward signals for GRPO training. This clean separation between environment logic and training framework means you can iterate on your CLI tools and validation rules without touching the RL code.

To learn more about creating custom environments, see the NeMo Gym documentation and the guide on creating resource servers.
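The separation can be illustrated with a plain-Python stand-in (this is not the actual NeMo Gym API; the class and method names are hypothetical):

```python
class CliResourceServer:
    """Illustrative stand-in for a NeMo Gym-style resource server: it
    encapsulates the validation rules and returns reward signals, keeping
    environment logic separate from the RL training loop."""

    def __init__(self, validator):
        # The rule set is injected, so it can change without touching RL code.
        self.validator = validator

    def score(self, proposed_command: str) -> float:
        """Evaluate one proposed command and return its reward signal."""
        return 1.0 if self.validator(proposed_command) else -1.0

# Hypothetical rule set: only `langgraph` invocations are acceptable.
server = CliResourceServer(lambda cmd: cmd.startswith("langgraph "))
print(server.score("langgraph up"))  # 1.0
print(server.score("rm -rf /tmp"))   # -1.0
```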

Optimization via Group Relative Policy Optimization (GRPO)

GRPO is a simpler, more memory-efficient alternative to PPO. Instead of training a separate “critic” model to estimate how good each action is, GRPO samples multiple outputs for the same prompt and uses their average reward as the baseline. This cuts the model count in half (no critic needed) and reduces variance by comparing outputs against each other rather than against a learned estimate.

Here’s how it works in practice:

Traditional RL might struggle when most attempts fail. Imagine the model generates 10 command variations for the same prompt:

Standard optimization might get lost in the noise of failures. GRPO instead:

This approach dramatically improves learning efficiency and convergence speed, helping the model quickly learn what makes a command valid.
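The group-relative baseline can be sketched as follows, under the simplifying assumptions that each sampled output's reward is already computed and that advantages are normalized by the group's standard deviation:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages: compare each sampled output against the
    group's own mean reward instead of a learned critic's estimate."""
    baseline = mean(rewards)
    spread = pstdev(rewards) or 1.0  # avoid division by zero for uniform groups
    return [(r - baseline) / spread for r in rewards]

# 10 sampled commands for one prompt: mostly failures, two successes.
group = [-1, -1, -1, 1, -1, -1, 1, -1, -1, -1]
advantages = grpo_advantages(group)

# The two valid commands get a positive advantage; the failures are pushed
# below the baseline, so the signal survives even when most attempts fail.
print([round(a, 2) for a in advantages])
```

Because the baseline comes from the group itself, no critic model is trained or stored, which is where the memory savings over PPO come from.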

Let’s see how we’d implement this with Unsloth and NeMo Gym:

Step 3: Human-in-the-loop execution

Once fine-tuned, we embed the model into a runtime loop that always requests human confirmation before execution. This maintains the safety architecture introduced in Part 1, ensuring no command runs without explicit approval.

The safety architecture

Setting shell=False on the subprocess call that executes approved commands embodies a crucial security principle. It ensures:
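A minimal sketch of this runtime gate (the helper name and prompt text are illustrative):

```python
import subprocess

def run_approved(command_parts, approve=input):
    """Execute a vetted command only after explicit human approval.
    shell=False passes the argument list directly to the program:
    no shell interpolation, no `;`/`&&` chaining, no glob expansion."""
    answer = approve(f"Run {' '.join(command_parts)}? [y/N] ")
    if answer.strip().lower() != "y":
        return None  # rejected: the command never reaches the OS
    result = subprocess.run(command_parts, shell=False)
    return result.returncode

# Example: auto-approve for demonstration; real use prompts on stdin.
run_approved(["echo", "hello"], approve=lambda _: "y")
```

Note that the model's output must already be a tokenized argument list here; free-typed strings are never handed to a shell.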

The complete safety chain

Our multi-layered approach ensures safety at every step:

Even if the model occasionally produces an invalid command despite training, the runtime policy prevents it from being executed.

Why RLVR + synthetic data work for customizing Agentic AI

This combination creates a powerful synergy:

The result: We can teach Nemotron-Nano-9B-V2 to precisely and safely operate a new CLI tool—all without full retraining or compromising on safety.

Closing thoughts

By extending our Bash operator into a LangGraph-aware computer-use agent, we’ve demonstrated how synthetic data generation and RLVR (with GRPO) form a powerful recipe for rapidly specializing large reasoning models to new toolchains.

The workflow generalizes cleanly to any CLI tool:

This pattern lets you turn any capable large language model (LLM) into a domain-specific, verifiably safe computer-use agent—from LangGraph today to your proprietary internal tools tomorrow.

The implications are significant: Instead of waiting months to collect training data or accepting the risks of uncontrolled command generation, you can deploy specialized, safe CLI agents in days. Whether you’re automating DevOps workflows, creating customer support tools, or building internal productivity agents, this approach provides a fast, safe path from idea to production.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
