The advent of an AI brain has driven the need for a body to go with it. There is a lot of research trying to predict what form these robots will take and their use cases. My focus here is to understand the following:
- What is the current state of the technology?
- What will be the key factors that determine commercial viability?
- Who are the players and how likely are they to win?
I want to be clear: this memo focuses exclusively on humanoids, but I do think other form factors will be important for specific use cases, especially in aerial robotics, defense, and agriculture.
Historical Overview
Modern humanoids stand on the shoulders of autonomous vehicle (AV) research. AVs represented the first time that large-scale hardware operated autonomously using machine learning brains. They are the proving ground for modern robotics: a suite of sensors capturing a changing environment, onboard computers fusing the data, and hardware responding in real-time.
AVs made major strides from 2004-2010, with the DARPA Grand Challenges highlighting significant advancements in software and sensor fusion. Google entered the field in 2009 with its self-driving program (later Waymo), while Tesla shipped a consumer-grade Autopilot in 2015. By 2020, Waymo and Cruise were running fully driverless pilots that leaned on LiDAR-centric sensor stacks, whereas Tesla pursued a camera-only strategy scaled by data from millions of customer vehicles.
The impact cannot be overstated: the AV market established the foundation for modern sensor technology and demonstrated how hardware systems can process data and respond in real time at scale. Those advances, plus the massive datasets and simulation tools AV teams created, provide the technical foundation on which today's humanoid robots are being built.
Modern humanoid robotics faces a fundamental challenge known as Moravec's Paradox: while computers excel at complex reasoning tasks, they struggle with basic physical skills that humans find effortless, such as picking up objects or walking. Bill Gates captures this nicely:
“Is it harder for machines to mimic the way humans move or the way humans think? If you had asked me this question a decade ago, my answer would have been “think.” So much of how the brain works is still a mystery. And yet, in just the last year, advancements in artificial intelligence have resulted in computer programs that can create, calculate, process, understand, decide, recognize patterns, and continue learning in ways that resemble our own.
Building machines that operate like our bodies—that walk, jump, touch, hold, squeeze, grip, climb, slice, and reach like we do (or better)—would seem to be an easier feat in comparison. Surprisingly, it hasn’t been. Many robots still struggle to perform basic human tasks that require the dexterity, mobility, and cognition most of us take for granted.”
A major breakthrough came in 2017 when Google published "Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping." Researchers discovered that robots trained in simulation often failed in the real world—a problem called the "reality gap." Google came up with a solution that used AI to make simulated images look more realistic, allowing robots to learn from 50 times fewer real-world examples. After testing over 25,000 physical grasps, they proved that simulation could effectively train robots for real tasks.
OpenAI made further developments in 2019 with robots that could manipulate a Rubik's cube. By training in thousands of simulated environments with different physics settings, they created systems that worked reliably in the real world, challenging what many thought possible with simulation training.
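For intuition, here is a minimal sketch of the domain-randomization idea behind that work: each simulated episode samples different physics parameters, so a policy that succeeds across all of them is more likely to transfer to the real world. The parameter names, ranges, and the `simulator`/`policy` APIs below are illustrative placeholders, not OpenAI's actual setup.

```python
import random

def sample_randomized_physics():
    """Sample a new set of physics parameters for one simulated episode.

    Ranges are illustrative placeholders, not the values OpenAI used.
    """
    return {
        "cube_mass_kg": random.uniform(0.05, 0.2),        # vary object mass
        "surface_friction": random.uniform(0.5, 1.5),     # vary friction coefficient
        "motor_strength_scale": random.uniform(0.8, 1.2), # vary actuator strength
        "sensor_noise_std": random.uniform(0.0, 0.02),    # vary observation noise
    }

def train(policy, simulator, num_episodes=10_000):
    """Train the policy across many differently-randomized simulated worlds."""
    for _ in range(num_episodes):
        params = sample_randomized_physics()
        simulator.reset(physics=params)  # hypothetical simulator API
        policy.run_episode(simulator)    # collect experience and update weights
```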
Google's 2022 RT-1 system brought transformer technology to robotics. Trained on 130,000 real robot demonstrations, RT-1 could understand everyday language commands and complete multi-step tasks with objects it had never seen before.
The latest advance, AutoRT (2024), gives robots more independence. A robot takes in its environment and uses a large language model to decide which tasks to attempt and when to ask for help. Safety rules are layered on top, creating a flexible system while still limiting dangerous actions.
How robots actually work
The first step is understanding the environment. A robot starts by taking in sensor data, which comes in several forms:
- RGB cameras for visual data
- Depth cameras (like Intel RealSense) for 3D understanding
- LiDAR for precise distance mapping
- Force/torque sensors in joints for touch feedback
- IMUs (inertial measurement units) for balance and orientation
For vision-based tasks, convolutional neural networks extract features from camera images, identifying objects, their positions, and relevant properties like graspability. These networks are often trained on massive datasets of labeled images, teaching them to recognize everything from everyday objects to specific manufacturing components. Recent advances have incorporated transformer architectures that can process multiple modalities simultaneously, understanding both visual input and natural language commands in a unified framework.
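As a concrete (if simplified) illustration, the sketch below runs a pretrained object detector over a single RGB frame using PyTorch and torchvision. A real robot would run this continuously, on task-specific models, and fuse the result with depth and other sensor data.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Load a pretrained detector; a real system would use a model fine-tuned on
# the robot's own task-specific data rather than generic COCO classes.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_objects(rgb_frame: torch.Tensor, score_threshold: float = 0.8):
    """Return bounding boxes and class labels for one RGB frame.

    rgb_frame: float tensor of shape (3, H, W) with values in [0, 1].
    """
    with torch.no_grad():
        predictions = model([rgb_frame])[0]
    keep = predictions["scores"] > score_threshold
    return predictions["boxes"][keep], predictions["labels"][keep]
```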
Once the robot understands its environment, it must plan appropriate actions. Classical robotics relied heavily on explicit programming: engineers would write detailed instructions for every possible scenario. Modern approaches leverage learned policies, where neural networks map from perceived states directly to actions. These policies are typically trained through reinforcement learning, where the robot learns from trial and error, receiving rewards for successful task completion and penalties for failures. The training process often happens in simulation first, where thousands of virtual robots can learn in parallel without risk of damage.
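At its simplest, a learned policy is just a network that maps an observation vector to motor commands. The sketch below shows the shape of the idea; the dimensions are arbitrary, and a real system would train this in simulation with an RL algorithm such as PPO rather than the bare network shown here.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps a flattened observation (joint angles, object pose, etc.)
    to target joint commands. Dimensions here are arbitrary examples."""

    def __init__(self, obs_dim: int = 64, action_dim: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# During training, thousands of simulated robots roll this policy out in
# parallel and an RL algorithm adjusts the weights to maximize task reward;
# only the trained weights ship to the physical robot.
```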
The control layer translates high-level plans into precise motor commands. This involves solving complex equations of motion, accounting for the robot's dynamics, and ensuring smooth, stable movements. Modern robots use techniques like impedance control, which allows them to be compliant when interacting with objects or humans, and trajectory optimization, which finds efficient paths through space while avoiding obstacles.
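Impedance control, for example, makes the robot behave like a virtual spring-damper around a target pose, so it yields when something pushes back. A minimal joint-space version looks like the sketch below; the gains and the gravity-compensation term are placeholders.

```python
import numpy as np

def impedance_torques(q, q_dot, q_desired, stiffness, damping, gravity_comp):
    """Joint-space impedance control: torque = K*(q_des - q) - D*q_dot + g(q).

    q, q_dot:      current joint positions and velocities (rad, rad/s)
    q_desired:     target joint positions
    stiffness (K): how strongly each joint pulls toward its target
    damping (D):   how strongly motion is resisted (prevents oscillation)
    gravity_comp:  torques that cancel the arm's own weight
    """
    return stiffness * (q_desired - q) - damping * q_dot + gravity_comp

# Lower stiffness makes the arm compliant (safer around people, more forgiving
# when grasping); higher stiffness makes it track trajectories more precisely.
q = np.zeros(7); q_dot = np.zeros(7); q_des = np.full(7, 0.1)
tau = impedance_torques(q, q_dot, q_des, stiffness=50.0, damping=5.0,
                        gravity_comp=np.zeros(7))
```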
Integration with large language models has added a new dimension to robotic programming. These models serve as high-level planners, breaking down complex natural language instructions into sequences of primitive actions the robot can execute. For example, when told to "clean the table," the LLM might generate a plan involving identifying objects on the table, determining which are trash, grasping each item, and placing it in the appropriate receptacle. This hierarchical approach allows robots to tackle open-ended tasks that would be impossible to explicitly program.
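In code, this hierarchical pattern is roughly: the LLM produces a sequence of primitive skill calls that lower-level controllers already know how to execute, and the planner rejects anything outside that vocabulary. The `call_llm` function, the skill names, and the `robot.run_skill` interface below are hypothetical placeholders, not any specific vendor's API.

```python
import json

PRIMITIVE_SKILLS = {"find_objects", "pick", "place", "wipe"}

PROMPT_TEMPLATE = """You control a robot with these primitive skills: {skills}.
Break the instruction into a JSON list of steps, e.g.
[{{"skill": "pick", "args": {{"object": "cup"}}}}].
Instruction: {instruction}"""

def plan(instruction: str) -> list[dict]:
    """Ask the language model for a step-by-step plan, then validate it."""
    prompt = PROMPT_TEMPLATE.format(skills=sorted(PRIMITIVE_SKILLS),
                                    instruction=instruction)
    raw = call_llm(prompt)  # hypothetical LLM client call
    steps = json.loads(raw)
    # Reject any step the robot cannot actually perform.
    return [s for s in steps if s["skill"] in PRIMITIVE_SKILLS]

def execute(steps: list[dict], robot) -> None:
    for step in steps:
        robot.run_skill(step["skill"], **step["args"])  # hypothetical robot API
```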
Safety systems operate at multiple levels throughout this stack. Low-level controllers enforce joint limits and prevent collisions, while high-level planners incorporate safety constraints into their decision-making.
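At the lowest level, a safety filter can be as simple as clamping every outgoing command against the joint's position and velocity limits before it reaches the motors. A minimal sketch, with made-up limits, is below.

```python
import numpy as np

# Illustrative per-joint limits; real values come from the robot's spec sheet.
POSITION_LIMITS = (-2.5, 2.5)   # radians
VELOCITY_LIMIT = 1.0            # radians per second

def safety_filter(q_current, q_command, dt):
    """Clamp a position command so it never exceeds joint or velocity limits."""
    # Limit how far the joint may move in a single control step.
    max_step = VELOCITY_LIMIT * dt
    step = np.clip(q_command - q_current, -max_step, max_step)
    # Keep the resulting target inside the joint's allowed range.
    return np.clip(q_current + step, *POSITION_LIMITS)
```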
The competitive advantage in robotics isn't concentrated in just one area, but rather distributed across the entire stack. While data is crucial (companies like Tesla and Amazon have massive proprietary datasets), data alone isn't enough. The models themselves are increasingly open-source, making pure model architecture less of a differentiator. The real moat lies in the execution layer: the ability to reliably convert AI decisions into physical actions. This requires deep hardware-software integration, real-time control systems, and extensive testing.
Key Players
In the humanoid space, there are a few players that stand out: Tesla, Figure AI, Agility Robotics, 1X, and The Bot Company.
Tesla unveiled their humanoid robot, Optimus, in September 2022. Today Optimus stands 1.73m tall, weighs ~57kg, and carries enough battery for a day of work. It has ~40 actuators (roughly 12 across the arms, 12 in the hands, 12 in the legs, plus two each in the neck and torso), and sees the world through a ring of Autopilot-grade cameras embedded in its "face," augmented by fingertip pressure sensors, ankle force-torque cells, and a beam-forming mic array. All perception, planning, and control run on Tesla's single-chip FSD "Hardware 4" SoC. Morgan Stanley estimates the hardware bill-of-materials at roughly $50-60k per unit (software not included).
Figure AI's first production-intent humanoid, Figure 01, stands 1.68m tall, weighs ~60kg, and can carry 20kg for up to 5 hours on a charge. It uses ~40 actuators and a sensor stack built around six RGB cameras in the head, depth sensors, microphones, and an NVIDIA RTX-class GPU module for vision-language-action inference.
Agility Robotics' humanoid, Digit, is focused on warehouses; it stands 1.75m tall, can carry a ~35 lb (~16 kg) payload, and has a battery life of ~4 hours. A 360° perception ring combines LiDAR, stereo and depth cameras, IMUs, and acoustic sensors, giving it the situational awareness needed to unload totes and palletize cartons without safety cages.
1X's humanoid, Neo, is designed for the home and is lighter than its industrial cousins at 1.65m / 30kg. Tendon-driven, compliant actuators make the robot "soft" to bump into; fingertip tactile sensors and a four-microphone beam-forming array support conversational control. A single removable battery pack yields 2–4 hours of mixed chores such as picking up laundry or serving coffee.
The Bot Company is still in stealth on hardware but is founded by Kyle Vogt of Cruise. Their focus is a consumer robot aimed at household tasks. Public filings suggest a focus on large-language-model reasoning and cost-optimized hardware, but no specs (size, sensors, actuators) have been released yet.
There are also several companies focused on the foundation models that go into robots, rather than the full hardware stack. Leading the pack is Physical Intelligence (Pi), which builds generalizable vision-language-action models designed to control any robot hardware. Their view is that generalizable movement models work better than task-specific training. They've raised over $400 million with a founding team from Google and OpenAI. Skild AI and Field AI are two others focused on general intelligence for robots: Skild's angle is general at-home tasks, whereas Field AI has made inroads in defense and oil & gas.
Reality
I spent some time interacting with founders and robots at the cutting edge, and I found the reality of building to be quite different from what academia highlights. The first thing founders talk about is constantly hunting for hardware: a local PCB shop, a way to get motors faster, a specific gear. A large limiting factor is sourcing these parts quickly enough to iterate before you're at scale with a direct manufacturing relationship. For early-stage founders, hunting on Alibaba for a specialized part seemed to be the norm.
Secondly, you see how important it is to collect highly specialized and reliable data. One company I spoke with that makes a robot hand emphasized the importance of collecting data with their own robot arm rather than a sensor glove (like Manus). Manus, while I’m sure it is good for certain use cases, didn’t capture enough data, and the data it did capture would be specific to my hand, not the robot’s hand (my hand is smaller, for example).
Finally, most teams are still in the early tinkering stage rather than scaling and deployment. Flashy YouTube videos of a humanoid doing something can be catchy, but as soon as you change the environment or the task, a lot tends to fall apart. Overall, the reality felt much earlier-stage than the media hype suggests.
Viability
Humanoid robots are not an “if” question but a “when” question. After digging in, I’m convinced that I won’t have to do basic chores around my house in the future. The timing will be determined by access to raw materials, speed of execution, and tolerance for error rates.
In many constrained environments, we've already crossed important viability thresholds. For example, warehouse robots from companies like Amazon Robotics move millions of packages daily, surgical robots assist in thousands of operations, and autonomous vehicles navigate city streets in select locations.
General-purpose robots, while impressive, still lack the dexterity, efficiency, and robustness of biological systems. Hands that can match human manipulation abilities remain prohibitively expensive and fragile for most applications. This is a long way of saying we’re still far off for unconstrained, ambiguous tasks.
The generalization problem presents perhaps the greatest challenge. While recent models like RT-1 and AutoRT show impressive flexibility, they still struggle with truly novel situations outside their training distribution. A robot trained to work in kitchens may fail completely when moved to a garage or garden. Solving this requires not just more data but fundamental advances in how robots learn and reason about the world.
The key technological bottlenecks include:
Data: There is no corpus of physical-world interaction data comparable in scale to the internet text that LLMs train on. As a result, data must be generated through teleoperation: humans perform the task while the robot records the details of the movements (speed, torque, pressure, etc.). This is slow and painful. Pi’s website, for instance, shows they are constantly recruiting teleoperators.
Hardware-Software Integration: Unlike pure software AI, robotics requires seamless integration between AI models and physical components. This includes precision actuators, sensors, and control systems that can translate AI decisions into precise physical movements.
Real-Time Processing: Robots must process visual, tactile, and spatial data in real time while executing movements. This requires edge computing capabilities and optimized inference engines that can run complex AI models with minimal latency (see the sketch after this list).
Tolerance for Failure: With LLMs, errors are easy to fix and don’t have tremendous long-term consequences. With robots, real-world standards are higher and tolerance is lower: who would put up with a robot that breaks 3% of their dishes? Because of this, robots need to practice. They need to be put in more diverse environments, attempt a wider range of tasks, and expand their scope slowly over time, more like autonomous vehicles than LLMs.
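To make the real-time point concrete, here is a minimal sketch of a fixed-rate control loop that has to finish perception and control inside each cycle's budget. The 100 Hz rate and the `perceive`/`compute_action`/`act` functions are illustrative placeholders, not any particular robot's control stack.

```python
import time

CONTROL_RATE_HZ = 100                  # illustrative rate for joint-level control
CYCLE_BUDGET = 1.0 / CONTROL_RATE_HZ   # 10 ms to sense, think, and act

def control_loop(perceive, compute_action, act):
    """Run perception and control at a fixed rate, flagging missed deadlines."""
    while True:
        start = time.perf_counter()
        observation = perceive()               # read sensors
        command = compute_action(observation)  # run the (possibly learned) policy
        act(command)                           # send motor commands
        elapsed = time.perf_counter() - start
        if elapsed > CYCLE_BUDGET:
            print(f"Missed deadline: {elapsed*1000:.1f} ms > {CYCLE_BUDGET*1000:.0f} ms")
        else:
            time.sleep(CYCLE_BUDGET - elapsed)
```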
Based on current trajectories, we can expect to see significant milestones in the next 3-5 years. General-purpose home robots capable of basic tasks like folding laundry, loading dishwashers, and simple meal preparation will likely appear in research labs and limited commercial trials. These systems will leverage the rapid improvements in foundation models, combining vision-language understanding with improved physical reasoning.
The 5-10 year horizon looks more transformative. As training data scales exponentially and simulation environments become increasingly sophisticated, robots will develop more robust generalization capabilities. We'll likely see the emergence of "robotic foundation models" trained on millions of hours of diverse robotic experience, capable of adapting to new environments with minimal additional training. Hardware advances, particularly in soft robotics and efficient actuators, will make robots safer and more capable of delicate manipulation tasks.
The true viability threshold—where robots become as ubiquitous as smartphones—probably lies 10-15 years in the future. This timeline assumes continued exponential improvements in compute, breakthroughs in energy efficiency, and solutions to the safety and reliability challenges that currently limit deployment. The path will likely follow the pattern we've seen with other technologies: initial adoption in industrial and commercial settings where the value proposition is clearest, followed by gradual expansion into consumer markets as costs decrease and capabilities improve.
As always, please share links in the comments section on any research, videos, podcasts, etc that I should look at to better inform my perspective. I’m not an expert, just trying to learn. I’ll edit this post when I come across new information that I think can add clarity.