robotic foundation models vs LLMs

Darshan Gandhi

Aug 13, 2025


A comparative analysis between RFMs and LLMs

With language models, scale is simple: feed them more text, bigger networks, and more compute, and they usually get better.

Robotic foundation models (RFMs) are built a bit differently. They live and operate in the physical world, with hardware, sensors, moving parts, and real human contact.

The result is a red curve that jumps and dips instead of climbing smoothly, as you can see in the chart above.

A big reason is that every dataset is tied to a particular robot, or rather to particular use cases (like we explored in the last report): a warehouse arm, a quadruped, a drone, or a humanoid.

They all move, sense, and act in their own ways.

Without a shared “language” for actions, data from one robot cannot easily be reused in other settings. A grip from a parallel gripper, for example, does not map cleanly to a five-fingered hand.
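As a toy illustration of that mismatch, here is a minimal sketch of routing two grippers through a shared action space. The class and function names are invented for this post, not taken from any real RFM codebase:

```python
from dataclasses import dataclass

@dataclass
class SharedAction:
    """Hypothetical shared action space: end-effector position plus one grasp scalar."""
    xyz: tuple   # target position in metres, (x, y, z)
    grip: float  # 0.0 = fully open, 1.0 = fully closed

def from_parallel_gripper(xyz, width_m, max_width_m=0.08):
    # A parallel gripper is fully described by one width, so mapping it
    # into the shared grasp scalar loses nothing.
    return SharedAction(xyz=xyz, grip=1.0 - width_m / max_width_m)

def to_five_finger_hand(action, n_joints=20):
    # The reverse direction is lossy: one scalar cannot pin down 20
    # finger-joint angles, so this falls back to a uniform curl --
    # exactly the information gap described above.
    return [action.grip] * n_joints
```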

Then you have the inputs: cameras, depth sensors, touch pads, joint encoders, microphones, and language instructions. They each run at their own speed and resolution. When those streams are not in sync, the model gets patchy vision, and those gaps can be worse than missing a data source entirely.
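A rough sketch of the sync problem, assuming two streams with arbitrary rates: resampling onto a shared clock is the usual first step, and the ticks it flags as gaps are the “patchy vision” described above.

```python
import numpy as np

def align_streams(ts_a, vals_a, ts_b, vals_b, hz=10.0):
    """Resample two sensor streams onto a shared fixed-rate clock.

    Returns the clock, both resampled streams, and a mask marking ticks
    where either stream had no real reading nearby.
    """
    t0 = max(ts_a[0], ts_b[0])
    t1 = min(ts_a[-1], ts_b[-1])
    clock = np.arange(t0, t1, 1.0 / hz)

    a = np.interp(clock, ts_a, vals_a)
    b = np.interp(clock, ts_b, vals_b)

    # Flag ticks whose nearest real sample is more than one period away.
    gap_a = np.min(np.abs(clock[:, None] - np.asarray(ts_a)[None, :]), axis=1) > 1.0 / hz
    gap_b = np.min(np.abs(clock[:, None] - np.asarray(ts_b)[None, :]), axis=1) > 1.0 / hz
    return clock, a, b, gap_a | gap_b
```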

Data diversity > quantity

There are three major tiers of data quality (a sampling sketch follows the list). They are:

  • Top tier data can be sourced from channels like:

    • target-hardware teleoperation

    • expert demos with tactile/force data

    • validated simulations with realistic contact

  • Mid tier comes from:

    • first-person human video adapted to robot actions

    • play data relabeled for object affordances and skills

  • Low tier mostly includes:

    • large motion logs without goals or contact events

    • older datasets from the 2010s, which add very little to modern-day generalization
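One place this tiering bites is the training mixture. Below is a hedged sketch of tier-weighted sampling; the weights and the square-root damping are invented for illustration, not from any published recipe:

```python
import random

# Hypothetical weights: top-tier episodes are scarce but contact-rich,
# so each one is sampled far more often.
TIER_WEIGHTS = {"top": 8.0, "mid": 3.0, "low": 1.0}

def sample_batch(datasets, batch_size=32):
    """datasets: dict mapping tier name -> list of episodes."""
    tiers = [t for t in datasets if datasets[t]]
    # Damp raw volume with a square root so low-tier bulk cannot dominate.
    weights = [TIER_WEIGHTS[t] * len(datasets[t]) ** 0.5 for t in tiers]
    picks = random.choices(tiers, weights=weights, k=batch_size)
    return [random.choice(datasets[t]) for t in picks]
```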

RFMs also have to bridge a gap that language models never face: turning a tidy plan into real, continuous motion. “Pick up the cup” is easy to say, but in practice it means dozens of micro-adjustments in force and position. Miss by a few millimetres and the whole thing can fall apart, and the failure can be costly.
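To make that concrete, here is a toy closed-loop sketch of what “pick up the cup” expands into; every robot.* call is a placeholder, not a real robot API:

```python
def pick_up(robot, target_xyz, tol_m=0.002, force_target_n=2.0, max_steps=200):
    """Toy servo loop: one 'pick up' step expands into many tiny corrections."""
    for _ in range(max_steps):
        error = robot.end_effector_xyz() - target_xyz  # placeholder sensing call
        if max(abs(e) for e in error) < tol_m:         # within a couple of millimetres
            break
        robot.nudge(-0.2 * error)                      # small proportional correction
    # Close the gripper gradually until fingertip force reaches target,
    # never in one jump.
    while robot.grip_force() < force_target_n:         # placeholder tactile reading
        robot.close_gripper(step=0.001)
```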

Testing is hard as well. Language models can be benchmarked in minutes.

RFMs actually need thousands of physical trials, which are slow, messy, and expensive. Simulations can definitely help sort through options, but they cannot fully model surface friction or those rare failures that might show up once in a hundred runs.

Grouping tests by challenge (new objects, strange lighting, different tools, or unfamiliar robots) can surface weaknesses earlier.
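A small sketch of that grouping idea, with bucket names and trial records invented for illustration:

```python
from collections import defaultdict

def success_by_bucket(trials):
    """trials: iterable of dicts like {"bucket": "new_objects", "success": True}."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [successes, total]
    for t in trials:
        counts[t["bucket"]][0] += int(t["success"])
        counts[t["bucket"]][1] += 1
    # Weakest buckets first, so regressions surface early.
    return sorted(((b, s / n) for b, (s, n) in counts.items()), key=lambda x: x[1])

print(success_by_bucket([
    {"bucket": "new_objects", "success": True},
    {"bucket": "new_objects", "success": False},
    {"bucket": "strange_lighting", "success": False},
]))  # [('strange_lighting', 0.0), ('new_objects', 0.5)]
```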

I think it's also worth understanding when scaling helps vs when it hurts 👇

Helps when:

  • robots have similar controls and movements

  • data includes real contact and failure cases

  • sensors are synced

  • training uses multiple learning methods and a structured controller

  • prompts go beyond text to include drawings, object-use hints, or skill names

Hurts when:

  • robots differ with no mapping between them

  • data is mostly easy, free-space motions

  • touch sensors are missing or out of sync

  • the system is one big block without structure

  • data weighting is poor

  • language-only prompting is used for dexterity

  • no real feedback loop

So, what can really help move the red curve upwards and make it smoother?

  • context bandwidth: motion sketches, affordances, steerable skills

  • pre-train / post-train split: generalist → targeted finetune

  • hardware standardization + cost drops: unitree, figure, optimus trends

  • better eval systems: gengap axes, simpler ranking, production metrics

  • fleet ops + teleop flywheels: to keep the data loop intact

Let's unpack a few of them more deeply 👇

→ Context bandwidth

  • motion prompts show the robot the path to follow so it can retry without retraining.

  • affordance prompts highlight where or how to grip before acting, improving stability in new situations.

  • skill prompts name reusable skills that can be combined across tasks, like “pick up a cup” or “turn a knob” (sketched below)

Richer interfaces can help reduce jaggedness by:

  • making supervision denser

  • guidance clearer

  • generalization more controllable
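As a sketch of what such a richer interface could look like, here is a hypothetical prompt schema; none of these field names come from a real system:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RobotPrompt:
    """Hypothetical multi-channel prompt: plain text is only one channel of several."""
    text: str                                          # "pick up the cup"
    motion_sketch: list = field(default_factory=list)  # rough 2D waypoints to follow
    affordance_box: Optional[tuple] = None             # where to grip: (x, y, w, h)
    skills: list = field(default_factory=list)         # named reusable skills

prompt = RobotPrompt(
    text="pick up the cup",
    motion_sketch=[(0.10, 0.40), (0.30, 0.50), (0.50, 0.50)],
    affordance_box=(0.42, 0.38, 0.08, 0.10),
    skills=["approach", "grasp", "lift"],
)
```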

→ Training phases

Think of RFM training as a recurring loop across these three major stages (sketched after the list):

  • pre-training broadly on diverse sources such as web images and text, actions from multiple robots, human first-person video, and realistic simulation

  • post-training narrowly with short demos on the target robot, remote corrections, and fine-tuning from human feedback

  • after deployment, logging failures, recreating tricky scenarios in simulation, and learning from moments when a human takes over
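Put together, the loop might look like this in outline; every call here is a placeholder for an entire pipeline, not working training code:

```python
def rfm_lifecycle(model, fleet):
    # 1. pre-train broadly on heterogeneous sources (placeholder call)
    model.pretrain(sources=["web_image_text", "multi_robot_actions",
                            "egocentric_video", "contact_rich_sim"])
    # 2. post-train narrowly on the target robot (placeholder calls)
    model.finetune(demos=fleet.collect_demos(),
                   corrections=fleet.teleop_corrections())
    # 3. after deployment, keep the loop closed
    while True:
        logs = fleet.deploy(model)
        hard_cases = [e for e in logs if e.failed or e.human_took_over]
        model.finetune(demos=hard_cases + fleet.replay_in_sim(hard_cases))
```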

Closing thoughts

So overall in robotics, models alone cannot sustain performance.

Unlike LLMs, where scaling size and data usually brings steady gains, RFMs are also constrained by external physical factors. Long-term progress depends on standard components, robust supply chains, and structured operations.

Standardizing parts, dual-sourcing critical hardware, and feeding field data back into training can reduce real-world variability.

With these fundamentals in place, the red curve could become less jagged and begin to close the gap with LLM-type smooth scaling.

Polaris Fund © 2025