Dream2Flow: Stanford's New AI Lets Robots “Imagine” Tasks Through Video Generation

Researchers at Stanford University have developed Dream2Flow, an AI framework that lets robots “imagine” how a task unfolds by generating conceptual videos, bridging the gap between AI's visual understanding and robots' physical manipulation capabilities.

This New AI Lets Robots “Imagine” How Objects Will Move Before Acting

The divide between artificial intelligence and physical robotics has long been defined by a frustrating paradox: an AI can generate a photorealistic video of a person folding a blanket or slicing bread, yet a physical robot arm often struggles with the same basic chores.

This difference, known as the embodiment gap, exists because video generation models understand how pixels should look, but they don’t understand the torque, friction, or specific joint movements a robot needs to function in the real world.

With this in mind, a team of researchers from Stanford University has developed a framework called Dream2Flow, which aims to close this gap by teaching robots to use these “imagined” videos as a planning guide for the real world.

Rather than trying to force a robot to mimic every pixel of a generated video—which often contains glitches where hands morph or objects flicker—Dream2Flow treats the video as a conceptual guide. When given a task like “put the bread in the bowl,” the system first uses a pre-trained video model to “dream” a visual sequence of the task being completed.
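
To make that division of labor concrete, here is a minimal sketch of the overall pipeline. Every interface in it (`video_model.generate`, `flow_extractor`, `robot.execute_object_flow`) is an illustrative stand-in, not the actual Dream2Flow API, and conditioning the generator on the robot's current camera view is an assumption:

```python
import numpy as np

def dream2flow_pipeline(camera_image: np.ndarray, task: str,
                        video_model, flow_extractor, robot) -> None:
    """Illustrative end-to-end structure; every interface is hypothetical.

    1. "Dream": a pretrained video model renders the task being completed,
       here assumed to start from the robot's current view of the scene.
    2. Extract: recover a 3D object trajectory from the dreamed frames
       (see the flow-extraction sketch below).
    3. Act: the robot computes control actions that make the real object
       follow that trajectory.
    """
    frames = video_model.generate(first_frame=camera_image,
                                  prompt=task)   # (T, H, W, 3) RGB frames
    flow_3d = flow_extractor(frames)             # (T, N, 3) object points
    robot.execute_object_flow(flow_3d)
```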

From this mental movie, the framework extracts what researchers call 3D Object Flow. This is essentially a ghostly, mathematical path that captures how the object is expected to move through three-dimensional space, independent of whatever or whoever is moving it in the video.
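The article doesn't detail the extraction step, but conceptually it amounts to tracking points seeded on the target object across the generated frames and lifting them into 3D. Below is a sketch under assumed inputs, none of which are specified in the source: off-the-shelf 2D point tracks, per-frame monocular depth estimates, and known camera intrinsics.

```python
import numpy as np

def extract_3d_object_flow(tracks_2d: np.ndarray,
                           depth: np.ndarray,
                           intrinsics: np.ndarray) -> np.ndarray:
    """Lift 2D point tracks on the object into a 3D trajectory.

    tracks_2d:  (T, N, 2) pixel coords of N points tracked across T frames,
                seeded inside the object's mask in the first frame so the
                trajectory follows the object, not the hand or gripper.
    depth:      (T, H, W) estimated per-frame depth maps.
    intrinsics: (3, 3) pinhole camera matrix.
    Returns:    (T, N, 3) per-frame 3D positions of the object points.
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    T, N, _ = tracks_2d.shape
    flow_3d = np.zeros((T, N, 3))
    for t in range(T):
        u, v = tracks_2d[t, :, 0], tracks_2d[t, :, 1]
        # Sample depth at each tracked pixel (assumes tracks stay in frame).
        z = depth[t, v.round().astype(int), u.round().astype(int)]
        # Back-project through the pinhole model to get camera-frame 3D.
        flow_3d[t, :, 0] = (u - cx) * z / fx
        flow_3d[t, :, 1] = (v - cy) * z / fy
        flow_3d[t, :, 2] = z
    return flow_3d
```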

By focusing on the movement of the object rather than the movement of the “actor” in the video, the AI creates a shared object-centric trajectory that different robots can adapt to their own bodies. It strips away the visual noise of a human hand or a glitched robotic gripper and leaves behind a pure trajectory.

A robotic arm, a four-legged robot dog like Boston Dynamics’ Spot, or even a humanoid robot can then take this 3D path and compute the control actions needed to achieve that result. This allows the robot to bypass the confusion of trying to look like the video and instead focus on the physical goal.
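The controller itself is embodiment-specific, and the source doesn't spell it out. For a simple gripper arm, one plausible scheme is to grasp the object and then servo the end effector so the held object follows the extracted waypoints; the sketch below assumes exactly that, with `robot` as a hypothetical hardware wrapper.

```python
import numpy as np

def follow_object_flow(robot, flow_3d: np.ndarray) -> None:
    """Drive an arm so a grasped object tracks the dreamed 3D flow.

    `robot` is a hypothetical wrapper exposing grasp(), ee_position(),
    and move_ee_to() (e.g., built on an inverse-kinematics controller).
    flow_3d: (T, N, 3) object point trajectory extracted from the video.
    """
    # Collapse the per-point flow into a single object-center path.
    centers = flow_3d.mean(axis=1)            # (T, 3)

    # Grasp the object at its initial position.
    robot.grasp(centers[0])

    # Assume a rigid grasp: the end effector keeps a fixed offset from
    # the object center, so tracking the center tracks the object.
    offset = robot.ee_position() - centers[0]
    for target in centers[1:]:
        robot.move_ee_to(target + offset)
```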

The versatility of this approach represents a significant step forward for open-world robotics. Because the system relies on the vast knowledge embedded in video models like Sora or Kling, the robots can reason about previously unseen task variations and object instances. According to the researchers, during testing Dream2Flow successfully guided various robots through tasks involving rigid mugs, soft loaves of bread, and even granular materials.

The team also tested the same bread-in-a-bowl task under slightly different conditions, changing the objects, the background, and the camera angle. Across these variations, the system continued to produce usable object motion trajectories, suggesting that it generalizes how objects move rather than memorizing a single scene.

The AI effectively gives the machine a form of spatial imagination, allowing it to visualize a successful outcome and then compute control actions that realize that outcome in the real world.

However, the technology is not without its surreal growing pains. Because it relies on generative AI, the dreams can sometimes turn into nightmares where a piece of bread might suddenly morph into a stack of crackers or an object might simply vanish from the frame.

In their real-world evaluation, the researchers reported 60 total trials, with most failures occurring upstream in the video and perception stages rather than during robot control itself. Out of these trials, 12 failures stemmed from video generation, split evenly between object morphing, where the model substantially altered an object’s shape, and hallucinations, where unrealistic elements were generated in the video, such as the appearance of a new bowl.

An additional four failures occurred during 3D object flow extraction, typically due to severe occlusions, challenging object rotations, or incorrect tracking masks that prevented reliable reconstruction.

Even when reasonable 3D object flow was available, four robot execution failures were observed, usually caused by missed or imperfect grasps—cases where the robot failed to pick up or hold the object as intended.

These hallucinations and perception breakdowns remain the primary challenges, but as video models become more grounded in physical reality, Dream2Flow could mature into a more dependable way to guide real robot action.

Sources: arXiv, Dream2Flow

