Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

 WorldBench Team

 Introduction

The pursuit of fully autonomous driving (AD) has long been a central goal in AI and robotics. Conventional AD systems typically adopt a modular "Perception-Decision-Action" pipeline, where mapping, object detection, motion prediction, and trajectory planning are developed and optimized as separate components.

While this design has achieved strong performance in structured environments, its reliance on hand-crafted interfaces and rules limits adaptability in complex, dynamic, and long-tailed scenarios.
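For concreteness, the sketch below shows how such a modular stack is typically wired together, with each stage exchanging hand-crafted intermediate representations. The dataclasses and function names are illustrative placeholders rather than any specific system's API.

```python
# Minimal sketch of a modular "Perception-Decision-Action" pipeline.
# All types and function names are illustrative placeholders.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DetectedObject:
    position: Tuple[float, float]   # (x, y) in ego frame, meters
    velocity: Tuple[float, float]
    category: str

@dataclass
class PredictedTrajectory:
    waypoints: List[Tuple[float, float]]

def perceive(camera_frame, lidar_sweep) -> List[DetectedObject]:
    """Perception module: mapping, detection, tracking (developed in isolation)."""
    raise NotImplementedError

def predict(objects: List[DetectedObject]) -> List[PredictedTrajectory]:
    """Prediction module: forecast other agents' motion from detections."""
    raise NotImplementedError

def plan(ego_state, predictions: List[PredictedTrajectory]) -> List[Tuple[float, float]]:
    """Planning module: hand-crafted costs/rules produce the ego trajectory."""
    raise NotImplementedError

# Each stage communicates only through these fixed interfaces, which is the
# rigidity that end-to-end VA and VLA models aim to remove.
```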

This survey reviews Vision-Language-Action (VLA) models, an emerging paradigm that integrates visual perception, natural language reasoning, and executable actions for autonomous driving. Tracing the evolution from precursor Vision-Action (VA) approaches to modern VLA frameworks, we provide historical context and clarify the motivations behind this paradigm shift.



 Definition

Vision-Action (VA) Models

A vision-centric driving system that directly maps raw sensory observations to driving actions, thereby avoiding explicit modular decomposition into perception, prediction, and planning. VA models learn end-to-end policies through imitation learning or reinforcement learning.

Keywords: End-to-End Models · World Models · Imitation Learning · Reinforcement Learning · Trajectory Prediction
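As a concrete illustration of the VA formulation, the following is a minimal sketch of an end-to-end policy trained by behavior cloning (imitation learning) in PyTorch. The network, shapes, and dummy data are assumptions for illustration, not the architecture of any particular VA model.

```python
# Minimal sketch of a Vision-Action (VA) policy trained by behavior cloning.
import torch
import torch.nn as nn

class VAPolicy(nn.Module):
    """Maps a front-camera image directly to future waypoints (end-to-end)."""
    def __init__(self, num_waypoints: int = 6):
        super().__init__()
        # Lightweight CNN encoder standing in for a real perception backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Regression head predicting (x, y) offsets for each future waypoint.
        self.head = nn.Linear(64, num_waypoints * 2)
        self.num_waypoints = num_waypoints

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(image)                        # (B, 64)
        return self.head(feat).view(-1, self.num_waypoints, 2)

# Imitation learning step: regress expert waypoints with an L2 loss.
policy = VAPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
images = torch.randn(8, 3, 224, 224)                      # dummy camera batch
expert_waypoints = torch.randn(8, 6, 2)                   # dummy expert trajectory
optimizer.zero_grad()
loss = nn.functional.mse_loss(policy(images), expert_waypoints)
loss.backward()
optimizer.step()
```

A reinforcement-learning variant would keep the same observation-to-action mapping but replace the supervised loss with a reward-driven objective.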

Vision-Language-Action (VLA) Models

A multimodal reasoning system that couples visual perception with large vision-language models (VLMs) to produce executable driving actions. VLA models integrate visual understanding, linguistic reasoning, and actionable outputs within a unified framework, enabling more interpretable, generalizable, and human-aligned driving policies through natural language instructions and chain-of-thought reasoning.

Keywords: End-to-End VLA · Dual-System VLA · Chain-of-Thought · Instruction Following · Interpretable Reasoning
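The sketch below illustrates the VLA formulation at the same level of abstraction: visual patch tokens and instruction tokens are fused by a shared transformer before an action head decodes waypoints. The toy tokenization, module sizes, and fusion scheme are illustrative assumptions, not the design of any specific model.

```python
# Minimal sketch of a Vision-Language-Action (VLA) policy: image patches and
# an instruction are fused jointly, then an action head outputs waypoints.
# All module names, shapes, and the toy vocabulary are assumptions.
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 256, num_waypoints: int = 6):
        super().__init__()
        # Vision branch: patchify the image into tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Language branch: embed instruction tokens (e.g. "turn left at the light").
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Shared transformer fuses visual and linguistic tokens.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Action head decodes the fused representation into future waypoints.
        self.action_head = nn.Linear(dim, num_waypoints * 2)
        self.num_waypoints = num_waypoints

    def forward(self, image: torch.Tensor, instruction_ids: torch.Tensor) -> torch.Tensor:
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, N_patches, dim)
        txt = self.text_embed(instruction_ids)                     # (B, N_tokens, dim)
        fused = self.fusion(torch.cat([vis, txt], dim=1))          # joint vision-language tokens
        pooled = fused.mean(dim=1)                                 # simple pooling
        return self.action_head(pooled).view(-1, self.num_waypoints, 2)

policy = VLAPolicy()
image = torch.randn(2, 3, 224, 224)
instruction_ids = torch.randint(0, 1000, (2, 12))                  # tokenized command
waypoints = policy(image, instruction_ids)                          # (2, 6, 2)
```

In practice the language branch is a pretrained LLM or VLM, and the intermediate tokens can also be decoded into chain-of-thought explanations, which is what gives VLA models their interpretability.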


 Collections

Vision-Action (VA) Models


Vision-Language-Action (VLA) Models


Datasets & Benchmarks




Contributors

Tianshuai Hu
Core Contributor
Xiaolu Liu
Core Contributor
Song Wang
Core Contributor
Yiyao Zhu
Core Contributor
Ao Liang
Core Contributor
Lingdong Kong
Core Contributor, Project Lead
Guoyang Zhao
Contributor, VLA
Zeying Gong
Contributor, VLA
Jun Cen
Contributor, VLA
Zhiyu Huang
Contributor, VLA
Xiaoshuai Hao
Contributor, VLA
Linfeng Li
Contributor, End-to-End Models
Hang Song
Contributor, End-to-End Models
Xiangtai Li
Contributor, End-to-End Models
Jun Ma
Advisor
Shaojie Shen
Advisor
Jianke Zhu
Advisor
Dacheng Tao
Advisor
Ziwei Liu
Advisor
Junwei Liang
Advisor