A curated list of academic papers and resources on Physical AI

This Hacker News AI post features keon/awesome-physical-ai, a GitHub repository that curates academic papers and resources on Physical AI, with a particular focus on Vision-Language-Action (VLA) models, world models, embodied intelligence, and robotic foundation models.


Awesome Physical AI


A curated list of academic papers and resources on Physical AI — focusing on Vision-Language-Action (VLA) models, world models, embodied intelligence, and robotic foundation models.

Physical AI refers to AI systems that interact with and manipulate the physical world through robotic embodiments, combining perception, reasoning, and action in real-world environments.


Foundations

Vision-Language Backbones

Core vision-language models that serve as pretrained backbones for Physical AI systems.

CLIP: "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021. [Paper] [Code]

SigLIP: "Sigmoid Loss for Language Image Pre-Training", ICCV 2023. [Paper]

PaLI-X: "PaLI-X: On Scaling up a Multilingual Vision and Language Model", CVPR 2024. [Paper]

LLaVA: "Visual Instruction Tuning", NeurIPS 2023. [Paper] [Project]

Prismatic VLMs: "Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models", ICML 2024. [Paper] [Code]
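As a hedged illustration of how these backbones are consumed downstream, the sketch below scores candidate instructions against an image with CLIP via Hugging Face `transformers`. The checkpoint name is the public `openai/clip-vit-base-patch32` release; the scoring loop is our own example, not code from any listed paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.png")  # hypothetical robot camera frame
instructions = ["pick up the red block", "open the drawer"]

inputs = processor(text=instructions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Higher logits mean the instruction better matches the observed scene.
scores = out.logits_per_image.softmax(dim=-1)
print(dict(zip(instructions, scores[0].tolist())))
```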

Visual Representations

Self-supervised visual encoders and perception models used in robotics.

DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, Apr 2023. [Paper] [Code]

SAM: "Segment Anything", ICCV 2023. [Paper] [Project]

R3M: "R3M: A Universal Visual Representation for Robot Manipulation", CoRL 2022. [Paper] [Code]

MVP: "Masked Visual Pre-training for Motor Control", arXiv, Mar 2022. [Paper] [Project]

Grounding DINO: "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", ECCV 2024. [Paper] [Code]
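To show what "visual encoder as a drop-in perception module" looks like in practice, here is a minimal sketch loading DINOv2 through `torch.hub` (the `facebookresearch/dinov2` hub entry is real; the preprocessing constants are the standard ImageNet values, and the downstream use is our own assumption):

```python
import torch
from PIL import Image
from torchvision import transforms

# Small ViT-S/14 variant; forward() returns a global image embedding.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("frame.png")).unsqueeze(0)  # hypothetical camera frame
with torch.no_grad():
    feat = encoder(img)  # (1, 384) embedding for a policy head to consume
print(feat.shape)
```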

VLA Architectures

End-to-End VLAs

Monolithic models that treat vision, language, and actions as unified tokens in a single architecture.

RT-1: "RT-1: Robotics Transformer for Real-World Control at Scale", RSS 2023. [Paper] [Project] [Code]

RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL 2023. [Paper] [Project]

OpenVLA: "OpenVLA: An Open-Source Vision-Language-Action Model", CoRL 2024. [Paper] [Project] [Code]

PaLM-E: "PaLM-E: An Embodied Multimodal Language Model", ICML 2023. [Paper] [Project]

VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", ICML 2023. [Paper] [Project] [Code]

LEO: "An Embodied Generalist Agent in 3D World", ICML 2024. [Paper] [Project]

3D-VLA: "3D-VLA: A 3D Vision-Language-Action Generative World Model", ICML 2024. [Paper] [Project]

Gato: "A Generalist Agent", TMLR 2022. [Paper] [Blog]

RoboFlamingo: "Vision-Language Foundation Models as Effective Robot Imitators", ICLR 2024. [Paper] [Project]

Magma: "Magma: A Foundation Model for Multimodal AI Agents", arXiv, Feb 2025. [Paper] [Code]

RoboVLMs: "Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models", arXiv, Dec 2024. [Paper] [Project]
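The unifying trick these models share is casting control as next-token prediction: the prompt is image plus language tokens, and the "answer" the transformer decodes is a run of action tokens. A schematic sketch of the sequence layout (the token ids and separator are hypothetical, not any model's real vocabulary):

```python
def build_vla_sequence(image_tokens, text_tokens, action_tokens, sep=0):
    # One flat stream for a single decoder-only transformer. At training time
    # the loss is applied only at the action-token positions; at inference the
    # model autoregressively decodes the action tokens given image + text.
    return image_tokens + [sep] + text_tokens + [sep] + action_tokens

seq = build_vla_sequence([101, 102, 103], [7, 8, 9], [201, 202])
print(seq)  # [101, 102, 103, 0, 7, 8, 9, 0, 201, 202]
```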

Modular VLAs

Models that decouple cognition (VLM-based planning) from action (specialized motor modules).

CogACT: "CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action", arXiv, Nov 2024. [Paper] [Project]

Gemini Robotics: "Gemini Robotics: Bringing AI into the Physical World", arXiv, Mar 2025. [Paper] [Blog]

Helix: "Helix: A Vision-Language-Action Model for Generalist Humanoid Control", arXiv, Apr 2025. [Paper]

SayCan: "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", CoRL 2022. [Paper] [Project]

Code as Policies: "Code as Policies: Language Model Programs for Embodied Control", arXiv, Sep 2022. [Paper] [Project]

SayPlan: "SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning", CoRL 2023. [Paper] [Project]

Inner Monologue: "Inner Monologue: Embodied Reasoning through Planning with Language Models", CoRL 2022. [Paper] [Project]

Instruct2Act: "Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions", arXiv, May 2023. [Paper] [Code]

TidyBot: "TidyBot: Personalized Robot Assistance with Large Language Models", IROS 2023. [Paper] [Project]
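The decoupling these systems share can be summarized in a few lines: a VLM/LLM scores or proposes subtasks, and separately trained skills execute them. The sketch below is a SayCan-flavored selection rule under our own hypothetical interfaces (`llm_relevance`, `affordance`, and the `skills` dict are illustrative, not the paper's API):

```python
def select_and_run_skill(llm_relevance, affordance, skills, instruction, obs):
    # SayCan-style decision rule: the LLM says what is useful for the
    # instruction, the affordance model says what is currently feasible,
    # and the product ranks the available low-level skills.
    scores = {
        name: llm_relevance(instruction, name) * affordance(name, obs)
        for name in skills
    }
    best = max(scores, key=scores.get)
    return best, skills[best](obs)  # run the winning motor policy
```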

Compact & Efficient VLAs

Lightweight VLA models optimized for fast inference and edge deployment.

TinyVLA: "TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models", arXiv, Apr 2025. [Paper] [Project]

SmolVLA: "SmolVLA: A Small Vision-Language-Action Model for Efficient Robot Learning", arXiv, Jun 2025. [Paper] [Code]

OpenVLA-OFT: "OpenVLA-OFT: Efficient Fine-Tuning for Open Vision-Language-Action Models", arXiv, Mar 2025. [Paper]

RT-H: "RT-H: Action Hierarchies Using Language", arXiv, Mar 2024. [Paper] [Project]

LAPA: "Latent Action Pretraining from Videos", arXiv, Oct 2024. [Paper] [Project]

Action Representation

Discrete Tokenization

Models that convert continuous joint movements into discrete "action tokens".

FAST: "FAST: Efficient Action Tokenization for Vision-Language-Action Models", arXiv, Jan 2025. [Paper] [Project]

GR-1: "Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", ICLR 2024. [Paper] [Project]

GR-2: "GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge", arXiv, Oct 2024. [Paper] [Project]

ACT: "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", RSS 2023. [Paper] [Project] [Code]

Behavior Transformers: "Behavior Transformers: Cloning k Modes with One Stone", NeurIPS 2022. [Paper] [Code]
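A minimal sketch of the uniform-binning scheme this family relies on (RT-2 and OpenVLA use 256 bins per action dimension; the per-dimension bounds below are hypothetical calibration values, and real models map bin ids into their own token vocabularies):

```python
import numpy as np

N_BINS = 256  # one token per action dimension, 256 possible ids each

def tokenize(action, low, high):
    # Clip to calibrated bounds, then map each dimension to a bin id.
    a = np.clip(action, low, high)
    return np.round((a - low) / (high - low) * (N_BINS - 1)).astype(int)

def detokenize(tokens, low, high):
    # Bin ids back to continuous commands; quantization error is bounded
    # by half a bin width per dimension.
    return low + tokens / (N_BINS - 1) * (high - low)

low, high = np.array([-1.0] * 7), np.array([1.0] * 7)  # hypothetical 7-DoF bounds
tok = tokenize(np.array([0.3, -0.8, 0.0, 0.5, -0.2, 0.9, 1.0]), low, high)
print(tok, detokenize(tok, low, high).round(3))
```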

Continuous & Diffusion Policies

Models that use diffusion or flow matching to generate continuous trajectories.

π₀ (pi-zero): "π₀: A Vision-Language-Action Flow Model for General Robot Control", arXiv, Oct 2024. [Paper] [Project]

π₀.5: "π₀.5: Scaling Robot Foundation Models", arXiv, Apr 2025. [Paper]

Octo: "Octo: An Open-Source Generalist Robot Policy", RSS 2024. [Paper] [Project] [Code]

Diffusion Policy: "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", RSS 2023. [Paper] [Project] [Code]

RDT-1B: "RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation", arXiv, Oct 2024. [Paper] [Project]

DexVLA: "DexVLA: Vision-Language Model with Plug-In Diffusion Expert", arXiv, Feb 2025. [Paper] [Project]

Diffusion-VLA: "Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression", arXiv, Dec 2024. [Paper] [Project]

3D Diffusion Policy: "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via 3D Representations", RSS 2024. [Paper] [Project]

Moto: "Moto: Latent Motion Token as the Bridging Language for Robot Manipulation", arXiv, Dec 2024. [Paper] [Project]

Consistency Policy: "Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation", RSS 2024. [Paper] [Project]
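The sampling loop at the core of this family is easy to state: start a whole action chunk from Gaussian noise and iteratively denoise it conditioned on the observation. A DDPM-style ancestral-sampling sketch, assuming a trained `denoiser(x, t, obs)` that predicts noise; the schedule constants are typical textbook choices, not any listed paper's exact values:

```python
import torch

@torch.no_grad()
def sample_action_chunk(denoiser, obs_emb, horizon=16, action_dim=7, steps=100):
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, horizon, action_dim)  # pure-noise trajectory
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), obs_emb)  # predicted noise
        # Standard DDPM posterior mean.
        x = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # posterior noise
    return x  # a horizon-length chunk of continuous actions
```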

World Models

JEPA & Latent Prediction

Joint-Embedding Predictive Architecture (JEPA) predicts future latent states rather than pixels.

"A Path Towards Autonomous Machine Intelligence", Meta AI, Jun 2022. [Paper]

I-JEPA: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", CVPR 2023. [Paper] [Code]

V-JEPA: "Video Joint Embedding Predictive Architecture", arXiv, Feb 2024. [Paper] [Code]

MC-JEPA: "MC-JEPA: Self-Supervised Learning of Motion and Content Features", CVPR 2023. [Paper]

LeJEPA: "LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics", arXiv, Nov 2025. [Paper]

VL-JEPA: "VL-JEPA: Vision-Language Joint Embedding Predictive Architecture", arXiv, Dec 2025. [Paper]

"Value-guided Action Planning with JEPA World Models", arXiv, Jan 2026. [Paper]

Generative World Models

World models that generate pixels, video, or interactive environments.

World Models: "World Models", NeurIPS 2018. [Paper] [Project]

DreamerV3: "Mastering Diverse Domains through World Models", arXiv, Jan 2023. [Paper] [Project]

Genie: "Genie: Generative Interactive Environments", ICML 2024. [Paper] [Project]

Genie 2: "Genie 2: A Large-Scale Foundation World Model", DeepMind, Dec 2024. [Blog]

Sora: "Video Generation Models as World Simulators", OpenAI, Feb 2024. [Blog]

GAIA-1: "GAIA-1: A Generative World Model for Autonomous Driving", arXiv, Sep 2023. [Paper]

GameNGen: "Diffusion Models Are Real-Time Game Engines", arXiv, Aug 2024. [Paper]

DIAMOND: "Diffusion for World Modeling: Visual Details Matter in Atari", NeurIPS 2024. [Paper] [Code]

3D Gaussian Splatting: "3D Gaussian Splatting for Real-Time Radiance Field Rendering", SIGGRAPH 2023. [Paper] [Project]

"From Words to Worlds: Spatial Intelligence is AI's Next Frontier", World Labs, 2025. [Blog]

Marble: "Marble: A Multimodal World Model", World Labs, Nov 2025. [Project]

RTFM: "RTFM: A Real-Time Frame Model", World Labs, Oct 2025. [Project]
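Whatever the output medium, the common loop is "imagination": roll a learned transition model forward from the current latent state without touching the real environment, as Dreamer-style agents do for policy learning. A sketch with hypothetical `dynamics` and `policy` callables:

```python
def imagine(dynamics, policy, z0, horizon=15):
    # Roll out the learned world model in latent space. Rewards and values
    # can be predicted from each latent z to train the policy "in the dream".
    z, trajectory = z0, []
    for _ in range(horizon):
        a = policy(z)       # act from the imagined state
        z = dynamics(z, a)  # learned transition, no real robot involved
        trajectory.append((z, a))
    return trajectory
```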

Embodied World Models

World models designed for robotic manipulation, navigation, and physical reasoning.

Structured World Models: "Structured World Models from Human Videos", RSS 2023. [Paper] [Project]

WHALE: "WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making", arXiv, Nov 2024. [Paper]

"A Controllable Generative World Model for Robot Manipulation", arXiv, Oct 2025. [Paper]

Code World Model: "Code World Model: Learning to Execute Code in World Simulation", Meta AI, Oct 2025. [Paper]

PhyGDPO: "PhyGDPO: Physics-Aware Text-to-Video Generation via Direct Preference Optimization", Meta AI, Jan 2026. [Paper]

"The Essential Role of Causality in Foundation World Models for Embodied AI", arXiv, Feb 2024. [Paper]

MineDreamer: "MineDreamer: Learning to Follow Instructions via Chain-of-Imagination", arXiv, Mar 2024. [Paper] [Project]

Video Language Planning: "Video Language Planning", ICLR 2024. [Paper] [Project]

"Learning Universal Policies via Text-Guided Video Generation", NeurIPS 2023. [Paper] [Project]

SIMA: "Scaling Instructable Agents Across Many Simulated Worlds", arXiv, Mar 2024. [Paper] [Blog]

UniSim: "UniSim: Learning Interactive Real-World Simulators", ICLR 2024. [Paper] [Project]

Reasoning & Planning

Chain-of-Thought & Deliberation

Models implementing "thinking before acting" with explicit reasoning or value-guided search.

Hume: "Hume: Introducing Deliberative Alignment in Embodied AI", arXiv, May 2025. [Paper]

Embodied-CoT: "Robotic Control via Embodied Chain-of-Thought Reasoning", arXiv, Jul 2024. [Paper] [Project]

ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR 2023. [Paper] [Code]

ReKep: "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints", CoRL 2024. [Paper] [Project]

TraceVLA: "TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness", arXiv, Dec 2024. [Paper] [Project]

LLM-State: "LLM-State: Open World State Representation for Long-horizon Task Planning", arXiv, Nov 2023. [Paper]

Statler: "Statler: State-Maintaining Language Models for Embodied Reasoning", ICRA 2024. [Paper] [Project]

RoboReflect: "RoboReflect: Reflective Reasoning for Robot Manipulation", arXiv, 2025. [Paper]
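The shared pattern is an interleaved transcript of reasoning, action, and feedback. A ReAct-flavored sketch, assuming a hypothetical `llm` callable that returns a (thought, action, argument) triple and a dict of tool callables:

```python
def react_episode(llm, tools, task, max_steps=8):
    # The transcript is the agent's working memory: each step appends the
    # model's thought, its chosen action, and the environment's observation.
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        thought, action, arg = llm(transcript)  # hypothetical structured call
        transcript += f"Thought: {thought}\nAction: {action}[{arg}]\n"
        if action == "finish":
            break
        observation = tools[action](arg)  # ground the action in the world
        transcript += f"Observation: {observation}\n"
    return transcript
```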

Error Detection & Recovery

Methods for detecting failures and correcting robot actions in real-time.

DoReMi: "Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment", arXiv, Jul 2023. [Paper] [Project]

CoPAL: "Corrective Planning of Robot Actions with Large Language Models", ICRA 2024. [Paper] [Project]

Code-as-Monitor: "Code-as-Monitor: Constraint-aware Visual Programming for Failure Detection", CVPR 2025. [Paper] [Project]

AHA: "AHA: A Vision-Language-Model for Detecting and Reasoning over Failures", arXiv, Oct 2024. [Paper]

PRED: "Pre-emptive Action Revision by Environmental Feedback", CoRL 2024. [Paper]

Learning Paradigms

Imitation Learning

Behavioral cloning and learning from demonstrations.

CLIPort: "CLIPort: What and Where Pathways for Robotic Manipulation", CoRL 2021. [Paper] [Project] [Code]

Play-LMP: "Learning Latent Plans from Play", CoRL 2019. [Paper] [Project]

MimicPlay: "MimicPlay: Long-Horizon Imitation Learning by Watching Human Play", CoRL 2023. [Paper] [Project]

RVT: "RVT: Robotic View Transformer for 3D Object Manipulation", CoRL 2023. [Paper] [Project] [Code]

RVT-2: "RVT-2: Learning Precise Manipulation from Few Demonstrations", RSS 2024. [Paper] [Project]

DIAL: "Robotic Skill Acquisition via Instruction Augmentation", arXiv, Nov 2022. [Paper] [Project]

Perceiver-Actor: "A Multi-Task Transformer for Robotic Manipulation", CoRL 2022. [Paper] [Project] [Code]

BOSS: "Bootstrap Your Own Skills: Learning to Solve New Tasks with LLM Guidance", CoRL 2023. [Paper] [Project]
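At bottom, these methods are supervised learning on demonstrations. A minimal behavioral-cloning step for continuous actions with an MSE loss (the `policy` network is a hypothetical observation-to-action module):

```python
import torch
import torch.nn.functional as F

def bc_step(policy, optimizer, obs, expert_actions):
    # Regress the policy's predicted action onto the demonstrator's action.
    pred = policy(obs)
    loss = F.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```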

Reinforcement Learning

RL-based methods for optimizing VLA policies.

CO-RFT: "CO-RFT: Chunked Offline Reinforcement Learning Fine-Tuning for VLAs", arXiv, 2026. [Paper]

HICRA: "HICRA: Hierarchy-Aware Credit Assignment for Reinforcement Learning in VLAs", arXiv, 2026. [Paper]

FLaRe: "FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale RL Fine-Tuning", arXiv, Sep 2024. [Paper] [Project]

Plan-Seq-Learn: "Plan-Seq-Learn: Language Model Guided RL for Long Horizon Tasks", ICLR 2024. [Paper] [Project]

GLAM: "Grounding Large Language Models in Interactive Environments with Online RL", arXiv, Feb 2023. [Paper] [Code]

ELLM: "Guiding Pretraining in Reinforcement Learning with Large Language Models", ICML 2023. [Paper]

RL4VLA: "RL4VLA: What Can RL Bring to VLA Generalization?", NeurIPS 2025. [Paper]

TPO: "TPO: Trajectory-wise Preference Optimization for VLAs", arXiv, 2025. [Paper]

ReinboT: "ReinboT: Reinforcement Learning for Robotic Manipulation", arXiv, 2025. [Paper]
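For orientation, the simplest update these methods build on is the vanilla policy gradient: reweight the log-likelihood of sampled actions by their returns. A sketch assuming a stochastic `policy` exposing a hypothetical `log_prob(obs, actions)` method:

```python
def pg_step(policy, optimizer, obs, actions, returns):
    # REINFORCE-style objective: raise the probability of actions that led
    # to high return, lower it for the rest. PPO-style and offline variants
    # add clipping, baselines, and behavior constraints on top of this.
    logp = policy.log_prob(obs, actions)
    loss = -(logp * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```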

Reward Design

Automated reward function generation using language models.

Text2Reward: "Text2Reward: Automated Dense Reward Function Generation", arXiv, Sep 2023. [Paper] [Project]

Language to Rewards: "Language to Rewards for Robotic Skill Synthesis", CoRL 2023. [Paper] [Project]

ExploRLLM: "ExploRLLM: Guiding Exploration in Reinforcement Learning with LLMs", arXiv, Mar 2024. [Paper]
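The Text2Reward-style idea reduces to: prompt a language model for reward code, then execute it. A deliberately minimal sketch, where `llm` is a hypothetical text-in/text-out callable and the generated code is trusted unsandboxed (a real system would validate and sandbox it):

```python
def build_reward(llm, task_description):
    prompt = (
        "Write a Python function `reward(state) -> float` that gives dense "
        f"reward for the task: {task_description}. Return only code."
    )
    code = llm(prompt)          # hypothetical LLM call
    namespace = {}
    exec(code, namespace)       # trust boundary: runs generated code as-is
    return namespace["reward"]  # a callable the RL loop can query
```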

Scaling & Generalization

Scaling Laws

Mathematical relationships between model/data scale and robotic performance.

"Neural Scaling Laws for Embodied AI", arXiv, May 2024. [Paper]

"Data Scaling Laws in Imitation Learning for Robotic Manipulation", arXiv, Oct 2024. [Paper] [Project]

AutoRT: "AutoRT: Embodied Foundation Models for Large Scale Orchestration", ICRA 2024. [Paper] [Project]

SARA-RT: "SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention", arXiv, Dec 2023. [Paper]

"Scaling Robot Learning with Semantically Imagined Experience", RSS 2023. [Paper]

Cross-Embodiment Transfer

Single policies controlling diverse robot types.

RT-X: "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", ICRA 2024. [Paper] [Project]

GENBOT-1K: "Towards Embodiment Scaling Laws: Training on ~1000 Robot Bodies", arXiv, 2025. [Paper]

Crossformer: "Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion", CoRL 2024. [Paper] [Project]

HPT: "Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers", NeurIPS 2024. [Paper] [Project]

MetaMorph: "MetaMorph: Learning Universal Controllers with Transformers", ICLR 2022. [Paper] [Project]

RUMs: "Robot Utility Models: General Policies for Zero-Shot Deployment", arXiv, Sep 2024. [Paper] [Project]

URMA: "Unified Robot Morphology Architecture", arXiv, 2025. [Paper]

RoboAgent: "RoboAgent: Generalization and Efficiency via Semantic Augmentations", ICRA 2024. [Paper] [Project]

Open-Vocabulary Generalization

Models that generalize to novel visual appearances and semantic concepts.

MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", CoRL 2023. [Paper] [Project]

VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation", CoRL 2023. [Paper] [Project]

RoboPoint: "RoboPoint: A Vision-Language Model for Spatial Affordance Prediction", CoRL 2024. [Paper] [Project]

CLIP-Fields: "CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory", RSS 2023. [Paper] [Project]

VLMaps: "Visual Language Maps for Robot Navigation", ICRA 2023. [Paper] [Project]

NLMap: "Open-vocabulary Queryable Scene Representations", ICRA 2023. [Paper] [Project]

LERF: "LERF: Language Embedded Radiance Fields", ICCV 2023. [Paper] [Project]

Any-point Trajectory: "Any-point Trajectory Modeling for Policy Learning", RSS 2024. [Paper] [Project]

Deployment

Quantization & Compression

Low-bit weight quantization for efficient edge deployment.

BitVLA: "BitVLA: 1-bit Vision-Language-Action Models for Robotics", arXiv, 2025. [Paper]

DeeR-VLA: "DeeR-VLA: Dynamic Inference of Multimodal LLMs for Efficient Robot Execution", arXiv, Nov 2024. [Paper] [Code]

QuaRT-VLA: "Quantized Robotics Transformers for Vision-Language-Action Models", arXiv, 2025. [Paper]

PDVLA: "PDVLA: Parallel Decoding for Vision-Language-Action Models", arXiv, 2025. [Paper]
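The core operation behind low-bit deployment is simple enough to show whole: symmetric per-tensor int8 weight quantization. Real systems use per-channel scales, calibration data, and 4- or 1-bit schemes; this sketch only illustrates the principle:

```python
import torch

def quantize_int8(w: torch.Tensor):
    # w ≈ scale * q with q in [-127, 127]; one scale for the whole tensor.
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4, 4)
q, s = quantize_int8(w)
print((w - dequantize(q, s)).abs().max())  # worst-case rounding error ≤ s/2
```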

Real-Time Control

Methods bridging high-latency AI inference and low-latency physical control.

A2C2: "A2C2: Asynchronous Action Chunk Correction for Real-Time Robot Control", arXiv, 2025. [Paper]

RTC: "Real-Time Chunking: Asynchronous Execution for Robot Control", arXiv, 2025. [Paper]
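The common recipe is asynchronous action chunking: a slow inference thread emits multi-step chunks while a fast control thread drains them at a fixed rate, so actuation never stalls on the model. A threading sketch with hypothetical `policy`, `get_obs`, and `send_action` callables:

```python
import queue
import threading
import time

chunk_q: queue.Queue = queue.Queue(maxsize=2)  # small buffer hides latency

def inference_loop(policy, get_obs):
    # Slow path: each call may take hundreds of milliseconds on a VLA.
    while True:
        chunk_q.put(policy(get_obs()))  # e.g. a list of 16 actions

def control_loop(send_action, hz=50):
    # Fast path: fixed-rate actuation, blocked only if the buffer runs dry.
    chunk = []
    while True:
        if not chunk:
            chunk = list(chunk_q.get())
        send_action(chunk.pop(0))
        time.sleep(1.0 / hz)

# threading.Thread(target=inference_loop, args=(policy, get_obs), daemon=True).start()
# control_loop(send_action)
```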

Safety & Alignment

Ethical constraints, safety frameworks, and human-robot alignment.

Robot Constitution: "Gemini Robotics: Bringing AI into the Physical World", arXiv, Mar 2025. [Paper]

ASIMOV: "ASIMOV: A Safety Benchmark for Embodied AI", arXiv, Mar 2025. [Paper]

RoboPAIR: "Jailbreaking LLM-Controlled Robots", ICRA 2025. [Paper] [Project]

RoboGuard: "Safety Guardrails for LLM-Enabled Robots", arXiv, Apr 2025. [Paper]

"Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics", arXiv, Feb 2024. [Paper]

"Robots Enact Malignant Stereotypes", FAccT 2022. [Paper] [Project]

"LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions", arXiv, Jun 2024. [Paper]

"Safe LLM-Controlled Robots with Formal Guarantees via Reachability Analysis", arXiv, Mar 2025. [Paper]

Lifelong Learning

Agents that continuously learn and adapt without forgetting prior skills.

Voyager: "VOYAGER: An Open-Ended Embodied Agent with Large Language Models", arXiv, May 2023. [Paper] [Project] [Code]

RoboGen: "RoboGen: A Generative and Self-Guided Robotic Agent", arXiv, Nov 2023. [Paper] [Project]

RoboCat: "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation", arXiv, Jun 2023. [Paper] [Blog]

LOTUS: "LOTUS: Continual Imitation Learning via Unsupervised Skill Discovery", arXiv, Dec 2024. [Paper] [Project]

DEPS: "Describe, Explain, Plan and Select: Interactive Planning with LLMs for Open-World Agents", NeurIPS 2023. [Paper] [Code]

JARVIS-1: "JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal LLMs", arXiv, Nov 2023. [Paper] [Project]

MP5: "MP5: A Multi-modal Open-ended Embodied System via Active Perception", CVPR 2024. [Paper] [Project]

SPRINT: "SPRINT: Semantic Policy Pre-training via Language Instruction Relabeling", ICRA 2024. [Paper] [Project]

Applications

Humanoid Robots

Foundation models for humanoid robot control.

GR00T N1: "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots", arXiv, Mar 2025. [Paper] [Project]

HumanPlus: "HumanPlus: Humanoid Shadowing and Imitation from Humans", arXiv, Jun 2024. [Paper] [Project]

ExBody: "Expressive Whole-Body Control for Humanoid Robots", RSS 2024. [Paper] [Project]

H2O: "Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation", IROS 2024. [Paper] [Project]

OmniH2O: "OmniH2O: Universal Human-to-Humanoid Teleoperation and Learning", CoRL 2024. [Paper] [Project]

"Learning Humanoid Locomotion with Transformers", arXiv, Mar 2024. [Paper] [Project]

Manipulation

Robot manipulation with foundation models.

Scaling Up Distilling Down: "Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition", CoRL 2023. [Paper] [Project]

LLM3: "LLM3: Large Language Model-based Task and Motion Planning with Failure Reasoning", IROS 2024. [Paper]

ManipVQA: "ManipVQA: Injecting Robotic Affordance into Multi-Modal LLMs", IROS 2024. [Paper]

UniAff: "UniAff: A Unified Representation of Affordances for Tool Usage and Articulation", arXiv, Sep 2024. [Paper]

SKT: "SKT: State-Aware Keypoint Trajectories for Robotic Garment Manipulation", arXiv, Sep 2024. [Paper]

Manipulate-Anything: "Manipulate-Anything: Automating Real-World Robots using VLMs", CoRL 2024. [Paper] [Project]

A3VLM: "A3VLM: Actionable Articulation-Aware Vision Language Model", CoRL 2024. [Paper]

LaN-Grasp: "Language-Driven Grasp Detection", CVPR 2024. [Paper]

Grasp Anything: "Pave the Way to Grasp Anything: Transferring Foundation Models", arXiv, Jun 2023. [Paper]

Navigation

Vision-language models for robot navigation.

LM-Nav: "Robotic Navigation with Large Pre-Trained Models", CoRL 2022. [Paper] [Project]

NaVILA: "NaVILA: Legged Robot Vision-Language-Action Model for Navigation", arXiv, Dec 2024. [Paper] [Project]

CoW: "CLIP on Wheels: Zero-Shot Object Navigation", ICRA 2023. [Paper]

L3MVN: "L3MVN: Leveraging Large Language Models for Visual Target Navigation", IROS 2024. [Paper]

NaVid: "NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation", RSS 2024. [Paper] [Project]

OVSG: "Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs", CoRL 2023. [Paper] [Project]

CANVAS: "CANVAS: Commonsense-Aware Navigation System", ICRA 2025. [Paper]

VLN-BERT: "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web", ECCV 2020. [Paper]

ThinkBot: "ThinkBot: Embodied Instruction Following with Thought Chain Reasoning", arXiv, Dec 2023. [Paper]

Resources

Datasets & Benchmarks

Simulation Platforms

Surveys

Citation

If you find this repository useful, please consider citing this list:

Contributing

We welcome contributions! Please submit a pull request to add relevant papers, correct errors, or improve organization.

Guidelines

