Robotic foundation models vs LLMs
A comparative analysis between RFMs and LLMs
With language models, scale is simple: feed them more text, bigger networks, and more compute, and they usually get better.
Robotic foundation models (RFMs) are built a bit differently. They live and operate in the physical world, with hardware, sensors, moving parts, and real human contact.
The result is a red curve that jumps and dips instead of climbing smoothly, as you can see above.
A big reason is that every dataset is tied to a particular robot, or rather to particular use cases (as we explored in the last report): a warehouse arm, a quadruped, a drone, or a humanoid.
They all move, sense, and act in their own ways.
Without a shared “language” for actions, data from one cannot easily be reused in other settings. A grip from a parallel gripper, for example, does not map cleanly to a five-fingered hand.
Then you have the inputs: cameras, depth sensors, touch pads, joint encoders, microphones, and language instructions. Each runs at its own speed and resolution. When those streams are not in sync, the model gets patchy vision, and those gaps can be worse than missing a data source entirely.
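To make the synchronization point concrete, here is a minimal sketch of resampling two timestamped streams onto one shared clock. It assumes numpy arrays of timestamps; the 50 Hz rate and the camera/touch pairing are illustrative assumptions, not any RFM's real pipeline.

```python
import numpy as np

def align_streams(cam_t, touch_t, touch_vals, hz=50):
    """Resample two timestamped streams onto one fixed-rate clock.

    cam_t, touch_t : 1-D arrays of timestamps in seconds, one per reading.
    touch_vals     : touch readings aligned with touch_t.
    Returns the shared timeline, the camera-frame index to use at each
    tick (zero-order hold), and the touch signal interpolated onto it.
    """
    t0 = max(cam_t[0], touch_t[0])        # use only the overlap window
    t1 = min(cam_t[-1], touch_t[-1])
    grid = np.arange(t0, t1, 1.0 / hz)    # the common timeline

    # latest camera frame at or before each tick (avoids "patchy vision")
    cam_idx = np.searchsorted(cam_t, grid, side="right") - 1

    # the low-dimensional touch signal can be safely interpolated
    touch = np.interp(grid, touch_t, touch_vals)
    return grid, cam_idx, touch
```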
Data diversity > quantity
There are three major tiers of data quality (a weighting sketch follows the list):
Top-tier data can be sourced from channels like:
target-hardware teleoperation
expert demos with tactile/force data
validated simulations with realistic contact
Mid-tier data comes from:
first-person human video adapted to robot actions
play data relabeled for object affordances and skills
Low-tier data mostly includes:
large motion logs without goals or contact events
older datasets from the 2010s, which add very little to modern-day generalization
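One practical consequence of these tiers: a training sampler can weight by tier rather than by raw dataset size, so the small top-tier slice still dominates the batch. A minimal sketch below; the tier names match the list above, but the weights are made-up assumptions.

```python
import random

# Illustrative tier weights: top-tier data is upweighted even though
# it is usually the smallest slice. The exact numbers are assumptions.
TIER_WEIGHTS = {"top": 0.50, "mid": 0.35, "low": 0.15}

def sample_batch(datasets, batch_size=256):
    """datasets: dict mapping tier name -> list of trajectories."""
    tiers = list(TIER_WEIGHTS)
    weights = [TIER_WEIGHTS[t] for t in tiers]
    batch = []
    for _ in range(batch_size):
        tier = random.choices(tiers, weights=weights, k=1)[0]
        batch.append(random.choice(datasets[tier]))
    return batch
```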
RFMs also have to bridge a gap that language models never face: turning a tidy plan into real, continuous motion. “Pick up the cup” is easy to say, but in practice it means dozens of micro-adjustments in force and position. Miss by a few millimetres and the whole attempt can fall apart, which gets costly fast.
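As a toy illustration of those micro-adjustments, here is a sketch of a proportional servo loop that closes the last few millimetres with feedback instead of one open-loop move. The `get_pose` and `move_delta` callbacks are hypothetical robot-API stand-ins, and the gain and tolerance are invented numbers.

```python
import numpy as np

def servo_to_grasp(get_pose, move_delta, target,
                   tol_mm=2.0, gain=0.3, max_steps=100):
    """Feedback loop for the final approach to a grasp.

    get_pose()    -> current gripper position in millimetres (3-vector)
    move_delta(d) -> commands a small relative motion d
    Both are hypothetical callbacks, not a real robot library.
    """
    for _ in range(max_steps):
        error = target - get_pose()
        if np.linalg.norm(error) < tol_mm:   # within tolerance: grasp
            return True
        move_delta(gain * error)             # small proportional step
    return False                             # a miss here is the costly case
```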
Testing is hard as well. Language models can be benchmarked in minutes.
RFMs actually need thousands of physical trials, which are slow, messy, and expensive. Simulations can definitely help sort through options, but they cannot fully model surface friction or those rare failures that might show up once in a hundred runs.
Grouping tests by challenge, like new objects, strange lighting, different tools, or unfamiliar robots, can surface weaknesses earlier.
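A sketch of what that grouping might look like; the axis names (new_object, low_light, and so on) are invented examples, not a standard benchmark schema.

```python
from collections import defaultdict

def success_by_axis(trials):
    """trials: list of dicts like
    {"axes": ["new_object", "low_light"], "success": True}.

    Grouping by challenge axis surfaces weaknesses that a single
    overall success rate would hide.
    """
    wins, counts = defaultdict(int), defaultdict(int)
    for t in trials:
        for axis in t["axes"]:
            counts[axis] += 1
            wins[axis] += int(t["success"])
    return {axis: wins[axis] / counts[axis] for axis in counts}
```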
I think it's also worth understanding when scaling helps vs when it hurts 👇
Helps when:
robots have similar controls and movements
data includes real contact and failure cases
sensors are synced
training uses multiple learning methods and a structured controller
prompts go beyond text to include drawings, object-use hints, or skill names
Hurts when:
robots differ with no mapping between them
data is mostly easy, free-space motions
touch sensors are missing or out of sync
the system is one big block without structure
data weighting is poor
language-only prompting is used for dexterity
no real feedback loop
So, what can really help move the red curve upwards and make it smoother?
context bandwidth: motion sketches, affordances, steerable skills
pre-train / post-train split: generalist → targeted finetune
hardware standardization + cost drops (Unitree, Figure, Optimus trends)
better eval systems: generalization-gap (“gen-gap”) axes, simpler ranking, production metrics
fleet ops + teleop flywheels: to keep the data loop intact
Let's unpack a few of them more deeply 👇
→ Context bandwidth
motion prompts show the robot the path to follow so it can retry without retraining.
affordance prompts highlight where or how to grip before acting, improving stability in new situations.
skill prompts name reusable skills that can be combined for different tasks, like “pick up a cup” or “turn the knob”
Richer interfaces can help reduce jaggedness by:
making supervision denser
making guidance clearer
making generalization more controllable
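Here is a hypothetical sketch of what such a richer prompt could look like as a data structure, bundling the three channels above. Every field name here is an assumption for illustration, not any model's real interface.

```python
from dataclasses import dataclass, field

@dataclass
class RobotPrompt:
    """A richer-than-text prompt: language plus motion, affordance, and skills."""
    instruction: str                                   # plain language
    motion_sketch: list[tuple[float, float]] = field(  # 2-D path waypoints
        default_factory=list)
    affordance_box: tuple[int, int, int, int] | None = None  # where to grip (pixels)
    skills: list[str] = field(default_factory=list)    # named reusable skills

prompt = RobotPrompt(
    instruction="put the cup on the shelf",
    motion_sketch=[(0.1, 0.4), (0.5, 0.6), (0.8, 0.2)],
    affordance_box=(120, 88, 40, 40),
    skills=["pick_up_cup", "place_on_shelf"],
)
```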
→ Training phases
Think of RFM training as a recurring loop across these three major stages:
pre-training broadly on diverse sources such as web images and text, actions from multiple robots, human first-person video, and realistic simulation
post-training narrowly with short demos on the target robot, remote corrections, and fine-tuning from human feedback
after deployment, logging failures, recreating tricky scenarios in simulation, and learning from moments when a human takes over
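Put together, one pass through the loop might look like the sketch below. Every function, method, and attribute here is a placeholder to show the flow, not a real API.

```python
def rfm_lifecycle(model, sources, target_robot):
    """One pass through the recurring loop. All names on `model`,
    `sources`, and `target_robot` are hypothetical placeholders."""
    # 1. pre-train broadly across heterogeneous sources
    model.pretrain(
        sources["web"] + sources["cross_robot"]
        + sources["ego_video"] + sources["sim"]
    )

    # 2. post-train narrowly on the target robot
    model.finetune(target_robot.demos, corrections=target_robot.teleop_fixes)

    # 3. after deployment: mine failures and takeovers for the next pass
    for episode in target_robot.deployment_logs():
        if episode.failed:
            sources["sim"].append(recreate_in_sim(episode))  # hypothetical helper
        if episode.human_takeover:
            model.finetune([episode.human_segment])
    return model
```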
Closing thoughts
So overall, in robotics, models alone cannot sustain performance.
Unlike LLMs, where scaling size and data usually brings steady gains, RFMs are also constrained by external physical factors. Long-term progress depends on standard components, robust supply chains, and structured operations.
Standardizing parts, dual-sourcing critical hardware, and feeding field data back into training can reduce real-world variability.
With these fundamentals in place, the red curve could become less jagged and begin to close the gap with the smooth scaling of LLMs.