Robotic foundation models vs LLMs
A comparative analysis between RFMs and LLMs
With language models, scale is simple: feed them more text, bigger networks, and more compute, and they usually get better.
Robotic foundation models (RFMs) are built a bit differently. They live and operate in the physical world, with hardware, sensors, moving parts, and real human contact.
The result is a red curve that jumps and dips instead of climbing smoothly, as you can see above.
A big reason is that every dataset is tied to a particular robot, or rather to particular use cases (as we explored in the last report): a warehouse arm, a quadruped, a drone, or a humanoid.
They all move, sense, and act in their own ways.
Without a shared “language” for actions, data from one cannot easily be reused in other settings. A grip from a parallel gripper, for example, does not map cleanly to a five-fingered hand.
Then you have the inputs: cameras, depth sensors, touch pads, joint encoders, microphones, and language instructions. Each runs at its own speed and resolution. When those streams are not in sync, the model gets patchy vision, and those gaps can be worse than missing a data source entirely.
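To make the synchronization point concrete, here is a minimal sketch of resampling two timestamped streams onto one shared clock. It assumes numpy arrays of timestamps; the 50 Hz rate and the camera/touch pairing are illustrative assumptions, not any RFM's real pipeline.

```python
import numpy as np

def align_streams(cam_t, touch_t, touch_vals, hz=50):
    """Resample two timestamped streams onto one fixed-rate clock.

    cam_t, touch_t : 1-D arrays of timestamps in seconds, one per reading.
    touch_vals     : touch readings aligned with touch_t.
    Returns the shared timeline, the camera-frame index to use at each
    tick (zero-order hold), and the touch signal interpolated onto it.
    """
    t0 = max(cam_t[0], touch_t[0])        # use only the overlap window
    t1 = min(cam_t[-1], touch_t[-1])
    grid = np.arange(t0, t1, 1.0 / hz)    # the common timeline

    # latest camera frame at or before each tick (avoids "patchy vision")
    cam_idx = np.searchsorted(cam_t, grid, side="right") - 1

    # the low-dimensional touch signal can be safely interpolated
    touch = np.interp(grid, touch_t, touch_vals)
    return grid, cam_idx, touch
```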
Data diversity > quantity
There are three major tiers of data quality (a weighting sketch follows the list):
Top-tier data can be sourced from channels like:
target-hardware teleoperation
expert demos with tactile/force data
validated simulations with realistic contact
Mid-tier data comes from:
first-person human video adapted to robot actions
play data relabeled for object affordances and skills
Low-tier data mostly includes:
large motion logs without goals or contact events
older datasets from the 2010s, which add very little to modern-day generalization
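One practical consequence of these tiers: a training sampler can weight by tier rather than by raw dataset size, so the small top-tier slice still dominates the batch. A minimal sketch below; the tier names match the list above, but the weights are made-up assumptions.

```python
import random

# Illustrative tier weights: top-tier data is upweighted even though
# it is usually the smallest slice. The exact numbers are assumptions.
TIER_WEIGHTS = {"top": 0.50, "mid": 0.35, "low": 0.15}

def sample_batch(datasets, batch_size=256):
    """datasets: dict mapping tier name -> list of trajectories."""
    tiers = list(TIER_WEIGHTS)
    weights = [TIER_WEIGHTS[t] for t in tiers]
    batch = []
    for _ in range(batch_size):
        tier = random.choices(tiers, weights=weights, k=1)[0]
        batch.append(random.choice(datasets[tier]))
    return batch
```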
RFMs also have to bridge a gap that language models never face: turning a tidy plan into real, continuous motion. “Pick up the cup” is easy to say, but in practice it means dozens of micro-adjustments in force and position. Miss by a few millimetres and the whole attempt can fall apart, which gets costly fast.
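As a toy illustration of those micro-adjustments, here is a sketch of a proportional servo loop that closes the last few millimetres with feedback instead of one open-loop move. The `get_pose` and `move_delta` callbacks are hypothetical robot-API stand-ins, and the gain and tolerance are invented numbers.

```python
import numpy as np

def servo_to_grasp(get_pose, move_delta, target,
                   tol_mm=2.0, gain=0.3, max_steps=100):
    """Feedback loop for the final approach to a grasp.

    get_pose()    -> current gripper position in millimetres (3-vector)
    move_delta(d) -> commands a small relative motion d
    Both are hypothetical callbacks, not a real robot library.
    """
    for _ in range(max_steps):
        error = target - get_pose()
        if np.linalg.norm(error) < tol_mm:   # within tolerance: grasp
            return True
        move_delta(gain * error)             # small proportional step
    return False                             # a miss here is the costly case
```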
Testing is hard as well. Language models can be benchmarked in minutes.
RFMs actually need thousands of physical trials, which are slow, messy, and expensive. Simulations can definitely help sort through options, but they cannot fully model surface friction or those rare failures that might show up once in a hundred runs.
Grouping tests by challenge, like new objects, strange lighting, different tools, or unfamiliar robots, can surface weaknesses earlier.
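A sketch of what that grouping might look like; the axis names (new_object, low_light, and so on) are invented examples, not a standard benchmark schema.

```python
from collections import defaultdict

def success_by_axis(trials):
    """trials: list of dicts like
    {"axes": ["new_object", "low_light"], "success": True}.

    Grouping by challenge axis surfaces weaknesses that a single
    overall success rate would hide.
    """
    wins, counts = defaultdict(int), defaultdict(int)
    for t in trials:
        for axis in t["axes"]:
            counts[axis] += 1
            wins[axis] += int(t["success"])
    return {axis: wins[axis] / counts[axis] for axis in counts}
```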
I think it's also worth understanding when scaling helps vs when it hurts 👇
Helps when:
robots have similar controls and movements
data includes real contact and failure cases
sensors are synced
training uses multiple learning methods and a structured controller
prompts go beyond text to include drawings, object-use hints, or skill names
Hurts when:
robots differ with no mapping between them
data is mostly easy, free-space motions
touch sensors are missing or out of sync
the system is one big block without structure
data weighting is poor
language-only prompting is used for dexterity
no real feedback loop
So, what can really help move the red curve upwards and make it smoother?
context bandwidth: motion sketches, affordances, steerable skills
pre-train / post-train split: generalist → targeted finetune
hardware standardization + cost drops (Unitree, Figure, Optimus trends)
better eval systems: generalization-gap (“gen-gap”) axes, simpler ranking, production metrics
fleet ops + teleop flywheels: to keep the data loop intact
Let's unpack a few of them more deeply 👇
→ Context bandwidth
motion prompts show the robot the path to follow so it can retry without retraining.
affordance prompts highlight where or how to grip before acting, improving stability in new situations.
skill prompts name reusable skills that can be combined for different tasks, like “pick up a cup” or “turn the knob”
Richer interfaces can help reduce jaggedness by:
making supervision denser
making guidance clearer
making generalization more controllable
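Here is a hypothetical sketch of what such a richer prompt could look like as a data structure, bundling the three channels above. Every field name here is an assumption for illustration, not any model's real interface.

```python
from dataclasses import dataclass, field

@dataclass
class RobotPrompt:
    """A richer-than-text prompt: language plus motion, affordance, and skills."""
    instruction: str                                   # plain language
    motion_sketch: list[tuple[float, float]] = field(  # 2-D path waypoints
        default_factory=list)
    affordance_box: tuple[int, int, int, int] | None = None  # where to grip (pixels)
    skills: list[str] = field(default_factory=list)    # named reusable skills

prompt = RobotPrompt(
    instruction="put the cup on the shelf",
    motion_sketch=[(0.1, 0.4), (0.5, 0.6), (0.8, 0.2)],
    affordance_box=(120, 88, 40, 40),
    skills=["pick_up_cup", "place_on_shelf"],
)
```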
→ Training phases
Think of RFM training as a recurring loop across these three major stages:
pre-training broadly on diverse sources such as web images and text, actions from multiple robots, human first-person video, and realistic simulation
post-training narrowly with short demos on the target robot, remote corrections, and fine-tuning from human feedback
after deployment, logging failures, recreating tricky scenarios in simulation, and learning from moments when a human takes over
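Put together, one pass through the loop might look like the sketch below. Every function, method, and attribute here is a placeholder to show the flow, not a real API.

```python
def rfm_lifecycle(model, sources, target_robot):
    """One pass through the recurring loop. All names on `model`,
    `sources`, and `target_robot` are hypothetical placeholders."""
    # 1. pre-train broadly across heterogeneous sources
    model.pretrain(
        sources["web"] + sources["cross_robot"]
        + sources["ego_video"] + sources["sim"]
    )

    # 2. post-train narrowly on the target robot
    model.finetune(target_robot.demos, corrections=target_robot.teleop_fixes)

    # 3. after deployment: mine failures and takeovers for the next pass
    for episode in target_robot.deployment_logs():
        if episode.failed:
            sources["sim"].append(recreate_in_sim(episode))  # hypothetical helper
        if episode.human_takeover:
            model.finetune([episode.human_segment])
    return model
```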
Closing thoughts
So overall, in robotics, models alone cannot sustain performance.
Unlike LLMs, where scaling size and data usually brings steady gains, RFMs are also constrained by external physical factors. Long-term progress depends on standard components, robust supply chains, and structured operations.
Standardizing parts, dual-sourcing critical hardware, and feeding field data back into training can reduce real-world variability.
With these fundamentals in place, the red curve could become less jagged and begin to close the gap with the smooth scaling of LLMs.