
What is embodied AI?

A working definition of embodied AI, what makes it different from text-only AI, and the stack that's emerging beneath it.

Embodied AI is artificial intelligence that perceives, reasons about, and acts in the physical world through a body — usually a robot, but sometimes a simulated one. The "body" part is what makes embodied AI different from the chat assistants that dominate headlines: an embodied system has to deal with photons, friction, latency, and the fact that any wrong move has real consequences.

That distinction looks small. In practice it changes everything: the training data, the model architectures, the deployment surfaces, and the evaluation criteria. A model that can write a sonnet doesn't necessarily know that a glass of water tips over when you push it sideways. Embodied AI is the work of teaching machines the physics, geometry, and social conventions of the world they actually have to operate in.

A working definition

A useful definition has three parts:

  1. Perception. The system takes in raw sensor data (cameras, depth, force, audio, IMU) and turns it into a representation of the world. This used to be done with hand-written feature extractors; today it is mostly done with neural perception models.
  2. Reasoning. Given that representation and a goal, the system decides what to do next. "Next" can be a high-level plan ("pick up the mug") or a low-level control signal ("apply 0.3 Nm to joint 4").
  3. Action. The decision is translated into motor commands, the body moves, the world changes, and step 1 begins again. The loop closes.
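
The three steps above can be sketched as a closed loop. This is a minimal illustration, not a real robot interface: the one-dimensional `World`, the function names, and the step limit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Toy 1-D world (hypothetical): a gripper on a rail and a mug at a fixed spot."""
    gripper_pos: float
    mug_pos: float


def perceive(world: World) -> float:
    """Perception: reduce the world to one feature, the signed distance to the mug."""
    return world.mug_pos - world.gripper_pos


def reason(distance: float, max_step: float = 0.1) -> float:
    """Reasoning: pick a motion command that closes the gap, clipped to max_step."""
    return max(-max_step, min(max_step, distance))


def act(world: World, command: float) -> None:
    """Action: apply the command, which changes the world for the next perception."""
    world.gripper_pos += command


def run_loop(world: World, tolerance: float = 1e-3, max_iters: int = 100) -> int:
    """Repeat perceive -> reason -> act until the gripper is at the mug.

    Returns the number of actions taken (or max_iters if it never arrives).
    """
    for step in range(max_iters):
        distance = perceive(world)
        if abs(distance) < tolerance:
            return step
        act(world, reason(distance))
    return max_iters


world = World(gripper_pos=0.0, mug_pos=0.45)
steps = run_loop(world)
print(f"reached the mug after {steps} actions")
```

In a real system each function would be far heavier (a perception network, a policy or planner, a motor driver), but the shape of the loop stays the same.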

The interesting questions in embodied AI sit at the boundaries between these three steps. How do you train one neural network that does all three? How do you fall back to a conventional, hand-designed controller when the network is unsure? What happens when the perception model has never seen this kind of lighting before?
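
One simple way to picture the fallback question is a confidence gate: use the learned policy when it reports high confidence, and hand control to a conservative controller otherwise. The sketch below is only an illustration. The policy, its confidence rule, the gains, and the 0.5 threshold are all invented for the example, and getting a trustworthy confidence value out of a real network is itself an open problem.

```python
from typing import Tuple

Observation = float  # hypothetical: signed distance to the target
Action = float       # hypothetical: velocity command


def learned_policy(obs: Observation) -> Tuple[Action, float]:
    """Stand-in for a neural policy: returns (action, self-reported confidence).

    The confidence rule is made up: the "network" is treated as unsure
    whenever the observation falls outside the range it was "trained" on.
    """
    confidence = 0.9 if abs(obs) <= 1.0 else 0.2
    return 0.5 * obs, confidence


def fallback_controller(obs: Observation) -> Action:
    """Conservative proportional controller with a tight speed limit."""
    return max(-0.05, min(0.05, 0.1 * obs))


def select_action(obs: Observation, threshold: float = 0.5) -> Tuple[Action, str]:
    """Use the learned policy when it is confident enough, else fall back."""
    action, confidence = learned_policy(obs)
    if confidence >= threshold:
        return action, "learned"
    return fallback_controller(obs), "fallback"


for obs in (0.4, 3.0):
    action, source = select_action(obs)
    print(f"obs={obs}: action={action} from {source}")
```

The design choice here is that the fallback is slower but predictable, so an unsure network costs time instead of safety.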

Why now

Three things changed in the last few years.

Models got general. Large language models showed that a single architecture, trained on enough data, can do many things at once. Researchers asked the natural follow-up: what happens if you train a similar model on robot data? The answer, increasingly, is "it generalizes." A model that has seen a thousand hours of teleoperated demonstrations across dozens of tasks can often do a new task from a few examples, and sometimes from none.

Simulation got fast. Modern GPU-based simulators (Isaac Sim, MuJoCo MJX, Genesis) can simulate thousands of robots in parallel, faster than real time. The bottleneck used to be data collection in the real world; now it's how cleverly you can simulate a useful task. The gap between sim and real is shrinking; see simulation-to-real.

Hardware got cheap. A research-grade robot arm cost $200,000 in 2015. Today there are credible 7-DoF arms under $10,000, humanoids in the same ballpark, and quadrupeds for the cost of a high-end laptop. More labs and startups can afford to iterate.

What's in the stack

The emerging "robot brain" stack has six rough layers:

| Layer | What it does | Examples |
|---|---|---|
| Foundation model | General-purpose perception + action backbone | RT-2, OpenVLA, π0, Helix |
| Skill policies | Task-specific behaviors (pick, place, navigate) | Diffusion Policy, ACT |
| Planner | Translates high-level goals into skill sequences | SayCan-style LLM planners |
| Controller | Real-time motor control, force loops | ROS 2 controllers, MoveIt |
| Simulator | Training + evaluation in a virtual world | Isaac Sim, MuJoCo, Genesis |
| Data | Demonstrations, videos, simulated rollouts | Open X-Embodiment, DROID |
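
To make the hand-off between the planner, skill, and controller layers concrete, here is a rough sketch of how a goal might flow down the stack. Everything in it is hypothetical: the lookup-table planner stands in for an LLM planner, the two skills stand in for learned policies, and the command strings stand in for whatever a real controller would accept.

```python
from typing import Callable, Dict, List, Tuple


# Skill policies (hypothetical): each turns a target name into low-level commands.
def pick(target: str) -> List[str]:
    return [f"move_to({target})", "close_gripper()"]


def place(target: str) -> List[str]:
    return [f"move_to({target})", "open_gripper()"]


SKILLS: Dict[str, Callable[[str], List[str]]] = {"pick": pick, "place": place}


def plan(goal: str) -> List[Tuple[str, str]]:
    """Planner stand-in: a lookup table where a real system might query an LLM."""
    plans = {
        "put the mug on the shelf": [("pick", "mug"), ("place", "shelf")],
    }
    if goal not in plans:
        raise ValueError(f"no plan for goal: {goal!r}")
    return plans[goal]


def execute(goal: str) -> List[str]:
    """Run planner -> skills and return the command stream a controller would consume."""
    commands: List[str] = []
    for skill_name, target in plan(goal):
        commands.extend(SKILLS[skill_name](target))
    return commands


for command in execute("put the mug on the shelf"):
    print(command)
```

The point of the layering is that each piece can be swapped independently: a better planner or a new skill does not require touching the controller.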

The Robot Brain Index tracks entries in each of these layers and the relationships between them.

What embodied AI is not

A few things often get confused with embodied AI but sit just outside it:

  • Industrial automation. A welding robot that repeats the same motion millions of times isn't embodied AI — it's a deterministic controller. Embodied AI is what you reach for when the task varies, the environment is unstructured, or specifying the behavior by hand is too expensive.
  • Pure vision models. A camera-based system that identifies objects but doesn't act on them is computer vision. Embodied AI requires the perception → action loop.
  • Chatbots that "control" something. A language model that calls APIs isn't embodied unless the API actually moves a body in the world. Tool-using LLMs are exciting, but they're a different problem.

Where this is going

A common shorthand among researchers is a "GPT moment for robots." The bet is that, just as a single transformer architecture absorbed text, code, and images, a similar architecture trained on enough robot data will absorb the long tail of manipulation, locomotion, and human-robot interaction. Whether that bet pays off in three years, ten, or never, the work is being done now.

The Robot Brain Index is our way of tracking what's actually shipping, separating it from what's announced, and making the underlying papers, datasets, and code easy to find.
