A Survey of Spatial Representation and Action Generation Methods Based on Vision-Language-Action Models

Published: 2026-02-10 13:36:47

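Northeastern University (東北大學): Wu Chengdong (吳成東), Huang Lu (黃路), Zhuang Yaoming (莊曜銘), Zhang Xin (張欣), Li Chang (李暢); Western Sydney University, Australia: Hao Wu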

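Abstract (from the Chinese): Vision-Language-Action (VLA) models are increasingly becoming a key technical path toward general-purpose embodied intelligence. This paper systematically reviews progress in VLA research along two dimensions: the evolution of spatial representation from 2D to 3D, and action generation paradigms including autoregression, diffusion, and reinforcement learning. It traces the field's shift from 2D perception to 3D spatial understanding, and analyzes the commonalities and differences among action modeling methods based on autoregression, diffusion models, and reinforcement learning in terms of temporal modeling capability, task adaptability, and generalization. It then compares the datasets, evaluation metrics, and system architectures used on simulation platforms and on real robot systems, and examines how these differences affect model generalization. Finally, it analyzes the technical challenges facing VLA models, namely spatial understanding, action planning, data efficiency, and generalization to real-world scenes, and looks ahead to future directions including structured 3D representations, physically consistent action generation, efficient data utilization, and safety control mechanisms. The aim is to provide a reference for building efficient, reliable, and scalable general-purpose embodied intelligence systems.

Keywords (from the Chinese): vision-language-action models; embodied intelligence; 3D spatial enhancement; action generation; robot manipulation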

Abstract: Vision–Language–Action (VLA) models have emerged as a promising foundation for general-purpose embodied intelligence. This survey provides a structured overview of recent advances in VLA research, focusing on two core aspects: the progression of spatial representations from 2D perception to 3D understanding, and the development of action generation paradigms, including autoregressive modeling, diffusion-based policies, and reinforcement learning. We examine how these paradigms differ in temporal modeling, task suitability, and generalization behavior across diverse embodied scenarios. Furthermore, we compare commonly used datasets, evaluation protocols, and system architectures in both simulation environments and real-world robotic platforms, and discuss how these factors influence model transfer and generalization. Finally, we summarize the key challenges faced by current VLA systems—such as spatial reasoning, long-horizon action planning, data efficiency, and real-world robustness—and outline future research directions, including structured 3D representations, physically grounded action generation, efficient data utilization, and safety-aware control. This survey aims to offer practical insights and guidance for the design of efficient, reliable, and scalable embodied intelligence systems.

Key words: Vision-Language-Action models; Embodied intelligence; 3D spatial enhancement; Action generation; Robot manipulation

Excerpted from Automation Panorama (《自動化博覽》), 2026, Issue 1, the 2026 Embodied Intelligence Special Issue (《2026具身智能專刊》).