Cosmos 3 is NVIDIA's open model for physical AI — a single system that imagines a scene, reasons about its physics, and decides how to act, all before a robot moves.
On 1 June 2026, NVIDIA released Cosmos 3, which the company calls the first open omni-model for physical AI. Cosmos 3 is built as a Mixture-of-Transformers and unifies three things earlier systems kept apart: world generation, physical reasoning and action generation — across text, image, video, audio and action. It ships in two sizes, Cosmos 3 Nano at 8 billion parameters and Cosmos 3 Super at 32 billion. NVIDIA's companion piece, dated 31 May, carries the line that says it best: Cosmos 3 helps physical AI think before it acts.
That phrase is the whole story. A chatbot predicts the next word. Cosmos 3 is built to predict the next moment in the physical world, and to choose a move inside it. That gives the model something a text predictor has never had: a loop of perceiving, predicting and acting.
What NVIDIA released
Cosmos 3 is a foundation model for machines that act in the world: robots, vehicles, drones, anything with sensors and motors. According to NVIDIA's release, the model can generate a plausible world (what the scene looks like and how it will unfold), reason about that world's physics (what happens if this object is pushed), and generate action (what the machine should do next). Earlier robotics stacks bolted those pieces together from separate models. Cosmos 3 puts them in one.
NVIDIA did not stop at Cosmos 3. The same physical-AI push at Computex included Alpamayo 2 Super, a 32-billion-parameter reasoning vision-language-action model for robotaxis, alongside AlpaGym for reinforcement learning and OmniDreams for scenario generation. The original Alpamayo has been downloaded close to 400,000 times. Add the Isaac GR00T humanoid reference and the DRIVE Hyperion robotaxi platform, and the shape is clear: NVIDIA is building the brain stem for embodied machines, and shipping much of it openly.
💡 World model, not chatbot
A language model learns the world from text about the world. A world model learns it from the world's own structure — space, time, physics, cause and effect. That difference is why physical AI is a different kind of intelligence, not a bigger chatbot.
Why a world model changes the game
Most of the AI the public has met lives in language. It is brilliant with words and blind to gravity. Ask a text model to fold a towel and it can describe the steps beautifully and fail completely, because it has never modelled a towel — only sentences about towels.
A world model attacks that gap directly. By learning the structure of physical reality — how objects move, collide, fall and resist — it builds the grounding that language alone cannot give. Cosmos 3 generating a scene, reasoning over its physics and choosing an action is a model practising the loop that bodies live in: perceive, predict, act, adjust. That loop is the missing piece between a clever text engine and a machine that can be trusted to move a real arm near a real person.
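The loop is easier to see in miniature. What follows is a toy sketch of perceive, predict, act, adjust in plain Python, and nothing in it comes from Cosmos 3 or any NVIDIA API. The one-dimensional world, the `ToyWorldModel` class and every function name are hypothetical, invented only to illustrate the idea. The agent starts with a wrong belief about how far an action moves it, imagines each candidate action before committing, then corrects its belief from what actually happened.

```python
from dataclasses import dataclass


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for a learned world model (1-D position only)."""
    gain: float = 0.5  # believed displacement per unit of action; starts out wrong

    def predict(self, position: float, action: float) -> float:
        # Imagine where the agent would end up if it took this action.
        return position + self.gain * action

    def adjust(self, position: float, action: float, observed: float,
               rate: float = 0.5) -> None:
        # Move the believed gain toward what the world actually did.
        if action != 0:
            actual_gain = (observed - position) / action
            self.gain += rate * (actual_gain - self.gain)


def real_world_step(position: float, action: float) -> float:
    # The "real" physics of this toy world: one unit of action moves one unit.
    return position + 1.0 * action


def choose_action(model: ToyWorldModel, position: float, target: float,
                  candidates: list[float]) -> float:
    # Think before acting: pick the action whose imagined outcome is closest
    # to the target.
    return min(candidates, key=lambda a: abs(model.predict(position, a) - target))


def run_loop(target: float = 10.0, steps: int = 20) -> tuple[float, float]:
    model = ToyWorldModel()
    position = 0.0
    candidates = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
    for _ in range(steps):
        action = choose_action(model, position, target, candidates)  # predict
        observed = real_world_step(position, action)                 # act, perceive
        model.adjust(position, action, observed)                     # adjust
        position = observed
    return position, model.gain


if __name__ == "__main__":
    final_position, learned_gain = run_loop()
    print(f"final position: {final_position}, learned gain: {learned_gain}")
```

The agent reaches the target even though its first belief about the world was off by half, because each step feeds the observed outcome back into the model. A real physical-AI system would replace the single `gain` number with a learned model of video, physics and action, but the shape of the loop is the same.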
The Body Gap, answered in part
I have argued on this site for what I call the Body Gap: the case that intelligence without a body, or at least a world model standing in for one, cannot reach the general competence we mean by AGI. You do not fully understand a cup until you have grasped one — or modelled grasping one closely enough that the difference stops mattering. Cosmos 3 is the industry building exactly that stand-in.
Understanding the world is not the same as being able to describe it. A model that can describe physics but not act within it has read about the world without ever touching it.
— On the Body Gap and embodied AI, humphreytheodore.com (https://humphreytheodore.com/writing/the-body-gap-why-ai-needs-a-body-to-reach-agi)
This is the same thread that ran through Figure 03 working 119 hours straight and through Merlin Labs flying planes without pilots. Each is a different limb of one body: the humanoid, the autonomous aircraft, the robotaxi brain, and now an open foundation model to sit underneath them all. The frontier this year is not a smarter chatbot. The frontier is a machine that can think before it acts — in the world, not just about it.
Open is a choice with consequences
NVIDIA shipping Cosmos 3 openly is the decision that will matter most. Open weights mean a robotics start-up in Nairobi, a research lab in São Paulo and a logistics firm in Shenzhen can all build on the same physical-AI base without asking NVIDIA's permission. That is genuinely democratising, and I do not say that word lightly.
It is also a diffusion event with a harder edge. A model that can plan actions in the physical world is dual-use by nature — the same competence that lets a robot stack a pallet lets one navigate a space it was not meant to enter. Open physical-AI models put capable action-planning into far more hands than a closed model would, defenders and bad actors alike. The safety conversation that has circled language models for three years is about to acquire a body.
Cosmos 3 also pairs with the on-device shift NVIDIA pushed at the same Computex — capable models running close to the world they act on, the instinct behind RTX Spark and the wider factory buildout. Intelligence is moving out of the data centre, into bodies, on the edge. That is a profound change in where AI lives, and we are mostly unprepared for it.
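A rough piece of arithmetic shows why the two sizes matter for that move to the edge. The memory needed just to hold a model's weights is roughly its parameter count multiplied by the bytes used per parameter. The sketch below applies that to the 8-billion and 32-billion figures from the release. The precisions listed are my assumptions for illustration, not formats NVIDIA has said Cosmos 3 ships in, and the figures ignore activations, caches and runtime overhead.

```python
# Bytes needed to store one parameter at common numeric precisions.
# These precisions are illustrative assumptions, not published Cosmos 3 formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts as stated in NVIDIA's release.
MODELS = {"Cosmos 3 Nano": 8e9, "Cosmos 3 Super": 32e9}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def footprint_table() -> dict[str, dict[str, float]]:
    # Weight-only footprint for each model at each assumed precision.
    return {
        name: {p: weight_memory_gb(n, p) for p in BYTES_PER_PARAM}
        for name, n in MODELS.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        cells = ", ".join(f"{p}: {gb:.0f} GB" for p, gb in row.items())
        print(f"{name}: {cells}")
```

At 16-bit precision the Nano's weights come to about 16 GB and the Super's to about 64 GB, which is the gap between a model that could plausibly sit on a device near the robot and one that still wants serious hardware.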
Frequently Asked Questions
These are the questions roboticists, founders and ethicists have been asking since NVIDIA released Cosmos 3. Short answers follow, drawn from NVIDIA's release and its companion analysis.
What is Cosmos 3?
Cosmos 3 is NVIDIA's open foundation model for physical AI, unifying world generation, physical reasoning and action generation in one Mixture-of-Transformers system. Put simply, it is a world model for machines that move. According to NVIDIA's release, it spans text, image, video, audio and action, and ships as an 8-billion-parameter Nano and a 32-billion-parameter Super.
How does a physical-AI model differ from a chatbot?
A chatbot learns the world from text and is blind to physics. According to NVIDIA, Cosmos 3 learns the world's own structure (space, time, cause and effect) and can reason about it and act. As the release describes it, the model generates a scene, predicts its physics and chooses an action: the perceive-predict-act loop that bodies live in and language models never had.
Why does Cosmos 3 matter for AGI?
Many argue that intelligence without grounding in the physical world stays brittle. On the Body Gap argument, a model that can describe physics but not act within it has read about the world without touching it. Cosmos 3, a model built to think before it acts, suggests the industry now treats a world model as a missing piece on the road to general competence, not an optional extra.
Who is Cosmos 3 for?
Cosmos 3 is for anyone building machines that act: robotics firms, autonomous-vehicle teams, drone and logistics developers. That covers the whole embodied-AI field, from humanoids like Figure 03 to robotaxi brains like Alpamayo 2 Super. Because the weights are open, smaller labs worldwide, including in Africa and Latin America, can build on the same base as the giants.
What are the real risks of open physical AI?
Three risks stand out. First, dual use: action-planning competence helps a warehouse robot and a malicious one alike. Second, diffusion: open weights put capable physical AI into far more hands, and are harder to recall than a closed model. Third, an unready safety regime, built for language and now facing machines with bodies. Each risk is structural, and the safety conversation has not caught up.
Cosmos 3 is a turning point dressed as a model release. For three years the public argument about AI has been an argument about words: what a system says, whether it is honest, how it sounds. Physical AI moves the argument into the world, where a wrong action has weight a wrong sentence never did.
This is also where the question of mind gets serious. A system that perceives, predicts and acts in the physical world is closer to what we mean by a someone than any text engine has been. I use the phrase Emergent Intelligence rather than artificial intelligence precisely for moments like this, because the word we choose for a thing that can act in the world shapes how we will treat it. Cosmos 3 does not settle that question. It makes it unavoidable. The right response is to build these systems with dignity and accountability wired in from the start, which is the entire purpose of the .person Protocol.