Glossary
Vision-language-action (VLA) model
A robot policy that takes a camera image + natural-language instruction and emits motor actions.
A vision-language-action (VLA) model is a robot policy with two core input modalities (vision and language), often supplemented by proprioceptive state, and one output modality (action). Most VLAs are built by taking a pretrained vision-language model and adapting it to output actions: actions are either discretized into tokens in the model's existing vocabulary (RT-2, OpenVLA style) or produced by a separate continuous head, such as a regression, diffusion, or flow-matching head (π0 style).
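The token-based option can be illustrated with a minimal sketch of uniform action binning. This is not the exact scheme of any particular model: the 256-bin count, the [-1, 1] action range, the `vocab_offset` parameter, and the function names `tokenize_action` and `detokenize_action` are all assumptions made for illustration.

```python
from typing import List, Sequence

N_BINS = 256  # assumed number of bins per action dimension


def tokenize_action(action: Sequence[float], low: float = -1.0, high: float = 1.0,
                    n_bins: int = N_BINS, vocab_offset: int = 0) -> List[int]:
    """Map each continuous action dimension to one discrete token id."""
    tokens = []
    for value in action:
        # Clip to the assumed normalized action range.
        clipped = min(max(value, low), high)
        # Scale to [0, 1], then to a bin index in [0, n_bins - 1].
        frac = (clipped - low) / (high - low)
        bin_index = min(int(frac * n_bins), n_bins - 1)
        # vocab_offset (hypothetical) is where action tokens start in the vocabulary.
        tokens.append(vocab_offset + bin_index)
    return tokens


def detokenize_action(tokens: Sequence[int], low: float = -1.0, high: float = 1.0,
                      n_bins: int = N_BINS, vocab_offset: int = 0) -> List[float]:
    """Recover approximate continuous values as the centers of the bins."""
    width = (high - low) / n_bins
    return [low + (t - vocab_offset + 0.5) * width for t in tokens]


if __name__ == "__main__":
    print(tokenize_action([-1.0, 0.0, 1.0]))  # [0, 128, 255]
    print(detokenize_action(tokenize_action([0.3, -0.7, 0.05])))
```

The round trip is lossy: each recovered value is off by at most half a bin width. That quantization error is one reason some VLAs use a continuous action head instead.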
The "language" part is what sets VLAs apart from policies trained for a fixed set of tasks: you can give the robot a novel instruction in text and it will often do something reasonable.