
Robotics for AI developers

If you're comfortable with PyTorch and transformers but new to robots, here's the shortest viable path to running a real or simulated robot.

If you're an AI developer who can train a transformer in your sleep but has never plugged in a robot, the field looks intimidatingly hardware-flavored. The good news: most of the modern robot stack is software you already know how to use. ROS 2, the major simulators, and the learning frameworks all expose Python APIs. You can do meaningful work without ever buying a physical robot.

This guide is the shortest viable path from "I can train models" to "I can train a robot to do something useful."

The mental model swap

The biggest adjustment isn't technical. It's accepting that:

  1. Real-time matters. A 200 ms inference latency is fine for a chatbot; it's catastrophic for a robot that's trying to catch a falling cup. Plan your model size for the loop rate, not just the headline metric.
  2. Failure modes are different. A wrong token is a bad sentence. A wrong action is a broken object, an injured person, or a service call. Safety isn't an afterthought — it's a constraint you design around from the start.
  3. You can't iterate on prod. You can't deploy a half-trained model to a real robot the way you can to a chatbot. Sim is your iteration loop.
  4. Data is precious. Pretraining a language model uses trillions of tokens. Training a robot policy uses tens of thousands of episodes if you're lucky. Every data point matters.
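Point 1 can be checked with a stopwatch before any robot is involved. The sketch below times a policy call and compares the worst observed latency with the control period. It assumes a hypothetical `policy` callable and a control rate you choose yourself; `check_latency_budget` and `dummy_policy` are illustrative names, not part of any library.

```python
import statistics
import time


def check_latency_budget(policy, observation, control_hz, trials=50, warmup=5):
    """Time policy(observation) and compare it with the control period."""
    for _ in range(warmup):  # let caches and lazy initialization settle
        policy(observation)

    samples_ms = []
    for _ in range(trials):
        start = time.perf_counter()
        policy(observation)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    budget_ms = 1000.0 / control_hz  # time available per control step
    worst_ms = max(samples_ms)
    return {
        "budget_ms": budget_ms,
        "median_ms": statistics.median(samples_ms),
        "worst_ms": worst_ms,
        "fits": worst_ms < budget_ms,  # the worst case matters, not the average
    }


def dummy_policy(observation):
    """Stand-in for a model forward pass that takes roughly 5 ms."""
    time.sleep(0.005)
    return [0.0] * 7  # hypothetical 7-dimensional action


report = check_latency_budget(dummy_policy, observation=None, control_hz=50)
print(report)
```

When timing a real PyTorch model on a GPU, call `torch.cuda.synchronize()` before reading the clock, otherwise you measure only the kernel launch. The check uses the worst sample rather than the median because one late action is enough to cause trouble in a control loop.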

A 1-week starter path

Day 1-2: Get a simulator running.

Install MuJoCo (pip install mujoco) and run the MuJoCo-based environments that ship with Gymnasium (pip install "gymnasium[mujoco]"). Or install Isaac Lab if you have an NVIDIA GPU. Get a simulated arm to wave around. Don't worry about doing anything useful yet.

Day 3: Read a behavior cloning paper.

Diffusion Policy or ACT. Both have public code, both train in under an hour on a single GPU, and both produce a working arm policy in simulation. Reproduce one.
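Behavior cloning itself is supervised learning on (observation, action) pairs taken from demonstrations. The sketch below is neither Diffusion Policy nor ACT. It fits a linear policy by least squares to synthetic demonstrations from a made-up linear expert (`K_expert` is hypothetical), only to show the shape of the problem those papers solve with much larger models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a linear controller mapping a 3-D state to a 2-D action.
K_expert = np.array([[1.5, 0.0, 0.3],
                     [0.0, 2.0, -0.5]])


def expert_action(states):
    """Noise-free expert actions for a batch of states."""
    return -(states @ K_expert.T)


# "Demonstrations": states the expert visited and slightly noisy recorded actions.
demo_states = rng.normal(size=(500, 3))
demo_actions = expert_action(demo_states) + 0.05 * rng.normal(size=(500, 2))

# Behavior cloning: regress recorded actions on states (here, ordinary least squares).
W, *_ = np.linalg.lstsq(demo_states, demo_actions, rcond=None)


def policy(state):
    """Cloned policy: predicts an action from a state (or a batch of states)."""
    return state @ W


# Evaluate on states that were not in the demonstrations.
held_out = rng.normal(size=(100, 3))
mse = float(np.mean((policy(held_out) - expert_action(held_out)) ** 2))
print(f"held-out action MSE: {mse:.6f}")
```

Held-out error here is measured on states drawn from the same distribution as the demonstrations. On a robot, the policy's own mistakes move it to states the expert never visited, which is one reason the real methods are more involved than a regression.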

Day 4: Try an open-source VLA.

OpenVLA is the most accessible. The weights are on Hugging Face, the inference code is well-documented, and the training pipeline is reproducible. Run it on one of the BridgeData V2 evaluation tasks. Get a feel for what a model that "follows language instructions" actually does in practice.

Day 5: Look at real data.

Download a sub-dataset from Open X-Embodiment or DROID. Open the trajectories in a Jupyter notebook. Scrub through the camera feeds. Plot the actions. See how noisy real data is. It will recalibrate your expectations of what a model can learn from data like this.
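Alongside the plots, a few summary numbers per trajectory help. The sketch below assumes you have already loaded one trajectory's actions into a (timesteps, action_dims) array. The loading step depends on the dataset and is not shown, and the tiny array here is a hand-written stand-in, not real data. `summarize_actions` is an illustrative helper, and `idle_fraction` is only meaningful if the actions are deltas.

```python
import numpy as np


def summarize_actions(actions, idle_threshold=1e-3):
    """Per-dimension statistics for one trajectory of shape (timesteps, action_dims)."""
    actions = np.asarray(actions, dtype=float)
    deltas = np.diff(actions, axis=0)  # change between consecutive timesteps
    return {
        "mean": actions.mean(axis=0),
        "std": actions.std(axis=0),
        # Average step-to-step change: large values suggest jerky teleoperation.
        "jitter": np.abs(deltas).mean(axis=0),
        # Share of timesteps where the commanded action is (almost) zero.
        "idle_fraction": float(np.mean(np.linalg.norm(actions, axis=1) < idle_threshold)),
    }


# Hand-written stand-in for a loaded trajectory (4 timesteps, 2 action dimensions).
toy_actions = [[0.0, 0.0],
               [1.0, 0.0],
               [1.0, 2.0],
               [0.0, 0.0]]

for name, value in summarize_actions(toy_actions).items():
    print(name, value)
```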

Day 6-7: Build something tiny.

Fine-tune OpenVLA (or train a small Diffusion Policy from scratch) on a curated subset for a new task. Doesn't have to work well. The goal is to feel the end-to-end loop: collect/curate data, train, evaluate, identify failure modes, iterate.

By the end of the week you'll know whether you want to keep going.

The hardware question

If you decide to buy a robot, the cheap-and-cheerful options are:

  • SO-100 / SO-101 arms from the LeRobot project. About $200 in parts; fits on a desk; works with the LeRobot stack out of the box.
  • WidowX-250s — the arm BridgeData V2 was collected on. Around $2K; well-supported in Isaac Lab and ROS.
  • Franka Research 3, the de facto standard research arm. $30K+. Most published manipulation work uses this or its predecessor, the Franka Emika Panda.
  • Unitree Go2 for quadrupeds. About $3K and runs the Unitree SDK out of the box.

You don't need real hardware to learn. You do need it eventually if you want to deploy.

ROS, briefly

You will run into ROS (Robot Operating System), specifically ROS 2. It's not so much an OS as a publish-subscribe message bus with a set of conventions. Two things to know:

  1. It's how everything talks. Sensor drivers publish to topics. Controllers subscribe to topics. The standard interfaces are well-defined.
  2. You don't need it for learning research. Most simulator-based research uses Python directly without going through ROS. You'll touch ROS when you start thinking about deploying to real hardware.
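To make the topic idea concrete without installing ROS, here is a toy in-process message bus. It is not the ROS 2 API: `ToyBus` and the topic names are made up for illustration. In ROS 2 the same roles are played by `create_publisher` and `create_subscription` on an `rclpy` node, with messages crossing process boundaries.

```python
from collections import defaultdict


class ToyBus:
    """Minimal publish-subscribe bus: topics are strings, messages are any object."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every callback registered on this topic.
        for callback in self._subscribers[topic]:
            callback(message)


bus = ToyBus()
commands = []


def controller(joint_angle):
    """Subscribes to sensor readings and publishes a proportional correction toward 0."""
    bus.publish("/arm/command", -0.5 * joint_angle)


bus.subscribe("/arm/joint_angle", controller)    # controller listens to the sensor topic
bus.subscribe("/arm/command", commands.append)   # stand-in for a motor driver

# A fake sensor driver publishing three readings.
for reading in (0.4, 0.2, -0.1):
    bus.publish("/arm/joint_angle", reading)

print(commands)
```

The sensor loop never calls the controller directly, and the controller never calls the motor driver. Each side only knows a topic name, which is what lets drivers and controllers be swapped independently.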

If you want to learn ROS, do the official tutorials for ROS 2 Jazzy. Stop when you've understood topics, services, and actions. You can come back for the rest.

Things that look like AI problems but aren't

A few traps:

  • "Just fine-tune a VLM on robot data." Several teams have tried; the action heads and embodiment-specific quirks matter more than they appear to. Use an existing VLA architecture; don't roll your own.
  • "We'll bootstrap with synthetic data." Synthetic image data can help but it doesn't solve the embodiment gap. Mix it with real demonstrations.
  • "The model can plan." Reactive policies don't plan. If you need long-horizon behavior, add a planner — usually an LLM that decomposes the task.
  • "Sim-to-real is just domain randomization." Sometimes. For locomotion, usually. For contact-rich manipulation, you'll need more than that.
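For reference, domain randomization in its simplest form means drawing new physics parameters at the start of every training episode, so the policy cannot overfit to one simulator configuration. A minimal sketch follows; the parameter names and ranges are hypothetical and would need tuning for a real robot and simulator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodePhysics:
    friction: float
    payload_mass_kg: float
    motor_strength: float      # multiplier on nominal actuator torque
    action_delay_steps: int    # simulated control latency, in timesteps


# Hypothetical ranges, chosen only for illustration.
RANGES = {
    "friction": (0.5, 1.25),
    "payload_mass_kg": (0.0, 2.0),
    "motor_strength": (0.8, 1.2),
}
MAX_ACTION_DELAY_STEPS = 3


def sample_episode_physics(rng):
    """Draw one set of physics parameters, uniformly within each range."""
    return EpisodePhysics(
        friction=rng.uniform(*RANGES["friction"]),
        payload_mass_kg=rng.uniform(*RANGES["payload_mass_kg"]),
        motor_strength=rng.uniform(*RANGES["motor_strength"]),
        action_delay_steps=rng.randint(0, MAX_ACTION_DELAY_STEPS),  # inclusive bounds
    )


rng = random.Random(0)  # seeded so runs are repeatable
for episode in range(3):
    physics = sample_episode_physics(rng)
    # A training loop would write these values into the simulator before each reset.
    print(episode, physics)
```

Uniform sampling over a handful of scalar parameters is the part that tends to be enough for locomotion. It does not by itself model the contact geometry, deformation, or sensing errors that matter in contact-rich manipulation.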
