Comparative Analysis of Robotic Foundation Models

Darshan Gandhi

Aug 26, 2025


We have tried to put together an exhaustive list of RFMs across the years and draw parallels between them along different parameters.

Large language models improve in a predictable way. More text, more parameters, and more compute mostly lead to steady improvements in performance.

But RFMs don’t follow the same pattern. As we discussed in the last report, their progress is highly uneven and shaped not just by data but also by many external factors, such as the robot’s body, the sensors it carries, and the type of data used in training.

Overall, RFMs can be grouped based on the type of task, movement, and navigation required for the specific use-case. At a high level (this is of course a non-exhaustive list), here are a few broad categories I’ve seen:

  • Generalists: RT-1, RT-2, RT-X, PaLM-E, Gato (these are built to cover a wide range of tasks and robot types with images and language)

  • Manipulation: RFM-1, π0.5, ManipLLM, ManiFoundation, DYNA-1 (they are focused on handling objects such as picking, placing, gripping, and fine control)

  • Navigation: ViNT, NoMaD, PACT, CroCo v2 (they are built for robots that need to move through new spaces, reach targets, or explore unknown terrains)

  • Humanoid/coordination: GR00T N1.5, Helix, Skild Brain, Humanoid World Models (these are aimed at whole-body motion with many different joints working together)

  • Hybrids: Gemini Robotics (these combine planning with direct action, using simple code-based steps to guide tasks)
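To make the grouping concrete, here’s a minimal sketch of the taxonomy as a Python mapping. The category names and example models come straight from the bullets above; the dict structure and the `category_of` helper are just an illustrative way to organise them, not anything these projects ship.

```python
# Illustrative taxonomy of RFMs, mirroring the categories above.
RFM_TAXONOMY = {
    "generalist": {
        "models": ["RT-1", "RT-2", "RT-X", "PaLM-E", "Gato"],
        "focus": "wide range of tasks and robot types, images + language",
    },
    "manipulation": {
        "models": ["RFM-1", "pi0.5", "ManipLLM", "ManiFoundation", "DYNA-1"],
        "focus": "picking, placing, gripping, fine control",
    },
    "navigation": {
        "models": ["ViNT", "NoMaD", "PACT", "CroCo v2"],
        "focus": "moving through new spaces, reaching targets, exploration",
    },
    "humanoid": {
        "models": ["GR00T N1.5", "Helix", "Skild Brain", "Humanoid World Models"],
        "focus": "whole-body motion with many joints working together",
    },
    "hybrid": {
        "models": ["Gemini Robotics"],
        "focus": "planning combined with direct action via code-based steps",
    },
}

def category_of(model_name):
    """Return the category a model belongs to, or None if unknown."""
    for category, info in RFM_TAXONOMY.items():
        if model_name in info["models"]:
            return category
    return None
```

So, for example, `category_of("ViNT")` returns `"navigation"`.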

In the table below, I’ve tried to map and compare these models side-by-side. Do take a look 👇🏻

Some high level observations are:

  • Model size ranges from very small to extremely large, but bigger models do not necessarily deliver better results

  • Inputs vary across models: most are multi-modal, with many using only images and language/text while others also include signals such as action history, robot state, touch, or LiDAR

  • Outputs are handled differently in each case: some models generate step-by-step commands while others produce continuous motions or full action sequences for navigation

  • The design of each model reflects the type of robot it is built for

    • generalists are trained across many robots

    • manipulation models focus on object handling

    • navigation models specialise in movement

    • humanoid models manage whole-body control

    • hybrids mix these approaches

  • Each model is defined by a key feature, such as cross-robot training, warehouse data, reasoning traces, contact synthesis, or code-based control, and these shape its strengths and limitations

  • Data quality often proves more important than dataset size, since smaller datasets with detailed contact and failure cases can be far more valuable than large, generalised datasets

  • Evaluation remains a challenge because there is no single benchmark, meaning that most models still depend on real-world testing to show reliability and versatility
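One way to picture the comparison axes in the table is as a small record type. The field names below are my own shorthand for the parameters discussed (size, inputs, outputs, embodiment, key feature), not a schema from any of these projects, and the RT-2 example values are paraphrased from public descriptions of its 55B variant.

```python
from dataclasses import dataclass

@dataclass
class RFMProfile:
    """Hypothetical record capturing the comparison axes from the table."""
    name: str
    category: str       # generalist / manipulation / navigation / humanoid / hybrid
    params: str         # rough model size -- note bigger isn't always better
    inputs: list        # e.g. images, language, robot state, touch, LiDAR
    outputs: str        # e.g. step-by-step commands vs continuous action sequences
    key_feature: str    # the trait that shapes its strengths and limitations

# Example row (values paraphrased from public descriptions of RT-2)
rt2 = RFMProfile(
    name="RT-2",
    category="generalist",
    params="55B",
    inputs=["images", "language"],
    outputs="discrete tokenised actions",
    key_feature="co-trained on web-scale vision-language data",
)
```

A flat record like this makes the side-by-side comparison mechanical: filter by `category`, group by `inputs`, and the uneven design space becomes easier to eyeball.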

Overall, I feel robotics does not yet have that “universal foundation model” which works well across use-cases, since each task demands different data.

I think this will be the final post in the robotics series for now. Over the last few months, we’ve tried to dive deep into different aspects of robotics, from the stack, to the bill of materials, to the flywheel and how they all come together.

Thanks for following along, and I’m excited for what comes next!

Polaris Fund © 2025
