The autonomous robot stack
A working map of the layers between a sensor packet and a motor command in a modern autonomous robot — and what's changing under each layer.
If you opened an autonomous robot in 2018 and looked at its software, you'd find a stack that looked roughly like this:
Mission planner → Task planner → Motion planner → Controller → Hardware driver
↑ ↑ ↑
Perception Localization State estimator
Each box was a separate module, usually in ROS, with hand-tuned parameters and a deterministic data flow. The whole system was engineered top-down.
Open the same robot in 2026 and the picture is messier. Some of those boxes have collapsed into a single neural network. Others have been replaced by a foundation model that talks to the rest of the system via natural language. Some labs have flattened the entire stack into "pixels in, motor commands out." Most production robots still look closer to the 2018 diagram than to the new ideas, but the boundary is moving.
This guide walks through the layers, what each one does, and what's changing.
Hardware abstraction
The bottom layer of the stack: motor drivers, encoder readers, IMU streams. ROS 2's ros2_control is the dominant interface, with a hardware-specific plugin per robot. micro-ROS extends this to microcontrollers (ESP32, Teensy) for actuator-side embedded code.
What's changing: not much. This is the part of the stack least disrupted by AI. The interfaces are good, the libraries are mature, and there's no reason to replace it with a neural network.
State estimation
Fuse IMU, encoders, and sometimes camera or lidar data to get a clean estimate of where each joint is, how fast it's moving, and (for mobile bases) where the base is. Extended Kalman Filters and optimization-based estimators (factor graphs in GTSAM, nonlinear least squares in Ceres) still dominate.
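The predict-then-correct idea behind these filters can be sketched with a linear one-dimensional Kalman filter, the simplest relative of the EKF. This is an illustration only: it assumes a single joint whose angle is predicted by integrating a gyro rate and corrected by a noisy encoder reading. The names (JointEstimate, run_filter) and the noise variances are made up for the example and do not come from GTSAM, Ceres, or any other library.

```python
from dataclasses import dataclass


@dataclass
class JointEstimate:
    angle: float     # estimated joint angle [rad]
    variance: float  # uncertainty of that estimate [rad^2]


def predict(est, gyro_rate, dt, process_var):
    """Propagate the estimate forward by integrating the gyro rate."""
    return JointEstimate(
        angle=est.angle + gyro_rate * dt,
        variance=est.variance + process_var,  # uncertainty grows between measurements
    )


def update(est, encoder_angle, encoder_var):
    """Correct the prediction with an encoder reading."""
    gain = est.variance / (est.variance + encoder_var)  # Kalman gain, between 0 and 1
    return JointEstimate(
        angle=est.angle + gain * (encoder_angle - est.angle),
        variance=(1.0 - gain) * est.variance,  # a measurement always shrinks uncertainty
    )


def run_filter(gyro_rates, encoder_angles, dt=0.01,
               process_var=1e-4, encoder_var=1e-2):
    """Fuse two sensor streams sample by sample (illustrative noise values)."""
    est = JointEstimate(angle=0.0, variance=1.0)
    for rate, measured in zip(gyro_rates, encoder_angles):
        est = predict(est, rate, dt, process_var)
        est = update(est, measured, encoder_var)
    return est
```

The gain is the whole trick: when the prediction is uncertain relative to the encoder, the estimate moves toward the measurement, and when the encoder is the noisier of the two, it mostly keeps the prediction.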
What's changing: learned state estimators are starting to appear (LIO-SAM with neural front-ends, Visual-Inertial Odometry with deep features) but most production systems still use hand-engineered filters. The risk of a learned state estimator failing silently has kept adoption slow.
Perception
The biggest AI footprint in the classical stack. Camera images → object detections, segmentation masks, depth, semantic maps. Lidar → 3D point clouds → obstacle lists. Modern perception pipelines run multiple neural networks in parallel (object detection, depth estimation, optical flow), often sharing a backbone.
What's changing fast: open-vocabulary perception. Instead of training one detector per object class, models like GroundingDINO and OWLv2 take a natural-language query at inference time and find the matching objects, often paired with a promptable segmenter like SAM to turn the resulting boxes into masks. This collapses a brittle taxonomy problem into a prompt.
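One way to picture what "a prompt instead of a taxonomy" means is similarity search in a shared embedding space: the query text and each candidate image region are embedded, and regions close to the query are returned. The sketch below uses invented three-dimensional vectors as stand-ins for real model outputs. TEXT_EMBEDDINGS, REGIONS, and match_regions are hypothetical names and do not correspond to the GroundingDINO, OWLv2, or SAM APIs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_regions(query_embedding, regions, threshold=0.8):
    """Return (box, score) pairs for regions similar to the query, best first."""
    scored = [(r["box"], cosine(query_embedding, r["embedding"])) for r in regions]
    hits = [(box, score) for box, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up embeddings: a real system would get these from a vision-language model.
TEXT_EMBEDDINGS = {
    "mug": [0.9, 0.1, 0.0],
    "coaster": [0.0, 0.2, 0.9],
}

# Candidate regions as (x_min, y_min, x_max, y_max) boxes with made-up embeddings.
REGIONS = [
    {"box": (40, 60, 120, 160), "embedding": [0.8, 0.2, 0.1]},
    {"box": (200, 150, 260, 170), "embedding": [0.1, 0.1, 0.95]},
]

mug_hits = match_regions(TEXT_EMBEDDINGS["mug"], REGIONS)
coaster_hits = match_regions(TEXT_EMBEDDINGS["coaster"], REGIONS)
```

Adding a new object class here means adding a new query string, not retraining a detector, which is the practical appeal of the approach.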
Localization and mapping
SLAM (Simultaneous Localization and Mapping) is still mostly classical. ORB-SLAM3, LOAM, Cartographer, and RTAB-Map are the workhorses. NeRF- and Gaussian-Splatting-based approaches are gaining ground for offline mapping, but online SLAM is largely unchanged.
For mobile bases, Nav2 (ROS 2's navigation stack) is the default. It uses occupancy grids and classical path planners and works.
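To make "occupancy grids and classical path planners" concrete, here is a minimal A* search over a small grid. It is a sketch of the general technique, not Nav2's planner code: real planners work on costmaps with inflation layers and robot footprints, which this example ignores. The grid, the function name astar, and the 4-connected unit-cost motion model are all assumptions made for illustration.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on an occupancy grid (0 = free, 1 = occupied).

    Returns a list of (row, col) cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # (estimated total, cost so far, cell)
    came_from = {start: None}
    cost_so_far = {start: 0}

    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if current == goal:
            path = []
            while current is not None:  # walk parent links back to the start
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if cost > cost_so_far[current]:
            continue  # stale heap entry; a cheaper route was found later
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < cost_so_far.get(nxt, float("inf")):
                    cost_so_far[nxt] = new_cost
                    came_from[nxt] = current
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


# A wall across the middle row forces a detour through the rightmost column.
GRID = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = astar(GRID, (0, 0), (2, 0))
```

The appeal of this family of planners is the one the paragraph above points at: given a correct grid, the result is deterministic and easy to inspect.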
Planning and decision-making
This is where the stack is changing most. The classical pipeline was:
- A task planner (sometimes PDDL-based, often a hand-written state machine) decides high-level steps.
- A motion planner (for example a sampling-based algorithm like RRT* from OMPL, often run through MoveIt) finds a collision-free trajectory from A to B.
- A trajectory optimizer smooths it.
Modern approaches collapse pieces of this:
- LLM-driven task planners. SayCan-style systems use a large language model to suggest plausible next steps, then score each step against a learned affordance model. The LLM contributes commonsense; the affordance model contributes "can this robot actually do that here."
- Learned motion policies. Diffusion Policy, ACT, and VLAs replace explicit motion planning for short-horizon manipulation. The model directly emits the trajectory.
- End-to-end agents. A single foundation model takes the high-level instruction and the camera feed and emits motor commands, with no explicit planning layer. This works for short-horizon tasks but tends to fail on long-horizon ones.
The fault line is roughly: long-horizon = LLM planner, short-horizon = learned policy, with growing overlap.
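The SayCan-style combination described above can be sketched as multiplying two scores per candidate skill: how useful the language model thinks the step is for the instruction, and how likely the robot is to succeed at it from its current state. The numbers below are invented for illustration. In a real system the first would come from language-model likelihoods and the second from a learned value function, and the function name pick_next_skill is hypothetical.

```python
def pick_next_skill(llm_scores, affordance_scores):
    """Choose the skill with the highest product of usefulness and feasibility.

    llm_scores:        skill -> how relevant the language model rates the step
    affordance_scores: skill -> estimated chance the robot can do it right now
    """
    combined = {
        skill: llm_scores[skill] * affordance_scores.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Invented scores for the instruction "wipe up the spill" with the robot
# standing far from the sponge but next to a clear path to the sink.
LLM_SCORES = {
    "pick up the sponge": 0.6,
    "go to the sink": 0.3,
    "pick up the mug": 0.1,
}
AFFORDANCE_SCORES = {
    "pick up the sponge": 0.1,  # sponge is out of reach from here
    "go to the sink": 0.9,
    "pick up the mug": 0.8,
}

best_skill, combined_scores = pick_next_skill(LLM_SCORES, AFFORDANCE_SCORES)
```

Here the language model alone would choose the sponge, but the product favors going to the sink, because the affordance score vetoes a step the robot cannot currently execute.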
Control
The control layer hasn't changed much. PID and model-predictive control still drive joints. Force and impedance control still handle contact. Some research models output actions at the trajectory level and let a low-level controller close the loop at 1 kHz.
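For readers who have not written one, a joint-level PID loop is short. The sketch below is a textbook form with output clamping, run at 1 kHz against a deliberately crude simulated joint whose angle simply integrates the commanded velocity. The gains, the output limit, and the plant model are assumptions chosen so the example converges, not values from any real robot.

```python
class PID:
    """Textbook PID controller with a symmetric output clamp."""

    def __init__(self, kp, ki, kd, output_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.output_limit = output_limit
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt
        # No derivative term on the first call, when there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(-self.output_limit, min(self.output_limit, output))


def simulate(target=1.0, steps=2000, dt=0.001):
    """Drive a simulated velocity-controlled joint toward target for steps * dt seconds."""
    controller = PID(kp=5.0, ki=0.1, kd=0.0, output_limit=10.0)
    angle = 0.0
    for _ in range(steps):
        velocity_command = controller.step(target, angle, dt)
        angle += velocity_command * dt  # crude plant: angle integrates the command
    return angle


final_angle = simulate()
```

A production controller would add anti-windup for the integral term and filtering on the derivative, both omitted here to keep the loop readable.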
What's changing: faster learned policies. RT-2 ran at a few Hz; π0 and OpenVLA can run at ~10 Hz; Helix and similar real-time VLAs aim for control-rate inference. Once a learned policy can run at 100+ Hz, the boundary between "policy" and "controller" disappears.
Skills, primitives, behaviors
In between high-level planning and low-level control sits the skill layer: parameterized behaviors like "pick", "pour", "open drawer". Classical robotics built these by hand. Modern stacks acquire them by learning. The most useful framing is:
- A library of skills the robot can do.
- A policy that picks the right skill given the situation.
- A fallback when the policy is unsure.
Most deployed manipulation robots in 2026 look like this, with a slowly growing skill library and an increasingly capable policy.
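The three-part framing maps directly onto a small dispatch structure. In the sketch below the skills are stub functions that return strings, the policy output is a hand-written tuple, and the confidence threshold is arbitrary. Every name here (SKILLS, execute, ask_operator) is hypothetical and stands in for whatever a real stack would use.

```python
def pick(obj):
    return f"picked {obj}"


def pour(source, target):
    return f"poured {source} into {target}"


def open_drawer():
    return "opened drawer"


# 1. A library of skills the robot can do.
SKILLS = {"pick": pick, "pour": pour, "open_drawer": open_drawer}


# 3. A fallback when the policy is unsure.
def ask_operator():
    return "asked operator for help"


def execute(policy_output, confidence_threshold=0.7):
    """Run the skill the policy selected, or fall back if it is unsure.

    policy_output is (skill_name, parameters, confidence), standing in for
    2. a policy that picks the right skill given the situation.
    """
    name, params, confidence = policy_output
    if name not in SKILLS or confidence < confidence_threshold:
        return ask_operator()
    return SKILLS[name](**params)


confident_result = execute(("pick", {"obj": "mug"}, 0.92))
unsure_result = execute(("pour", {"source": "kettle", "target": "mug"}, 0.41))
unknown_result = execute(("fold_laundry", {}, 0.99))
```

Keeping the library as a plain mapping is what lets it grow slowly: adding a skill is one new entry, and the policy and fallback stay unchanged.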
The two big shifts
Two things are reshaping the stack:
- Foundation models are absorbing perception + policy. Where you used to have separate object detectors and motion planners, you increasingly have one model that handles both, with classical components as fallback.
- Language is becoming the universal interface. Between layers, between humans and the robot, between the robot and itself. The fact that you can write "place the mug on the coaster" and have it Just Work some of the time is a real shift in what robot APIs look like.
Whether the stack collapses to a single neural network or keeps its modular shape with neural components inside each box is the central architectural question of the field right now.
Where to look next
- What is embodied AI? — the broader framing.
- What is a robot foundation model? — what's eating the middle layers.
- The Robot Brain Index tools tab — software at every layer of the stack.