Vision-language-action (VLA) model — Glossary

A vision-language-action (VLA) model is a robot policy that takes camera images and a natural-language instruction as input, often alongside proprioceptive state, and outputs robot actions. Most VLAs are built by taking a pretrained vision-language model and adding a way to emit actions: actions are either discretized and tokenized into the existing vocabulary (RT-2, OpenVLA style) or produced by a separate regression, diffusion, or flow-matching head (π0 style).
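
To make the tokenized option concrete, here is a minimal sketch of uniform action binning. It is not any particular model's code: the bin count, the normalized action range, the vocabulary size, and the choice to reuse the last token ids are all assumptions made for illustration, and the function names are hypothetical.

```python
# Illustrative constants only; real models pick their own values.
NUM_BINS = 256       # assumed number of discrete bins per action dimension
ACTION_LOW = -1.0    # assumed lower bound of the normalized action range
ACTION_HIGH = 1.0    # assumed upper bound of the normalized action range
VOCAB_SIZE = 32000   # assumed tokenizer vocabulary size
FIRST_ACTION_ID = VOCAB_SIZE - NUM_BINS  # assumption: reuse the last 256 ids


def action_to_tokens(action):
    """Map each continuous action dimension to one token id."""
    tokens = []
    for value in action:
        # Clip to the supported range, then rescale to [0, 1].
        clipped = min(max(value, ACTION_LOW), ACTION_HIGH)
        frac = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
        # frac == 1.0 would index one past the end, so cap at the last bin.
        bin_index = min(int(frac * NUM_BINS), NUM_BINS - 1)
        tokens.append(FIRST_ACTION_ID + bin_index)
    return tokens


def tokens_to_action(tokens):
    """Invert the mapping by returning the center of each token's bin."""
    width = (ACTION_HIGH - ACTION_LOW) / NUM_BINS
    return [ACTION_LOW + (t - FIRST_ACTION_ID + 0.5) * width for t in tokens]


if __name__ == "__main__":
    action = [-1.0, 0.0, 0.5, 1.0]  # e.g. four normalized action dimensions
    tokens = action_to_tokens(action)
    print(tokens)
    print(tokens_to_action(tokens))
```

The round trip is lossy: each recovered value is a bin center, so it can differ from the original by up to half a bin width. That quantization error is one reason some VLAs use a continuous head instead.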

The "language" part is what sets VLAs apart: you can give a robot a novel instruction in text and it will often do something reasonable.

See Vision-language-action models, explained.