What is a world model, in one short answer?

If a language model predicts the next token, a world model tries to predict the next state of a world. That world can be a generated 3D space, a simulated driving scene, an environment for an agent, or an interactive scene that responds when a user moves through it.
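
The "next state" framing can be made concrete with a minimal sketch. Everything here is a toy assumption for illustration: the `State` class, the `MOVES` table, and `predict_next_state` are hypothetical names, and a real world model would learn this transition from data rather than hard-code it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical world state: an agent's position on a 2D grid."""
    x: int
    y: int


# Hypothetical action vocabulary for the toy grid world.
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def predict_next_state(state: State, action: str) -> State:
    """Toy stand-in for a world model: map (state, action) to the next state."""
    dx, dy = MOVES[action]
    return State(state.x + dx, state.y + dy)


def rollout(state: State, actions: list[str]) -> list[State]:
    """Apply actions in order, feeding each predicted state back in as input."""
    trajectory = [state]
    for action in actions:
        state = predict_next_state(state, action)
        trajectory.append(state)
    return trajectory
```

The loop in `rollout` mirrors next-token generation: each prediction becomes the input to the next step, except that the unit being predicted is a state of the world rather than a token.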

This is why the category matters. A conventional image or video model only has to produce one plausible-looking output. A world model has to stay consistent as the camera moves, objects change, actions happen, and time passes.

What it is not

A world model is not just another word for a video model. Video can be part of the interface, but the deeper goal is simulation: representing how a world behaves after a prompt, movement, or action.

| Dimension | Video model | World model |
| --- | --- | --- |
| Primary output | A fixed video sequence | A stateful environment that can change with actions |
| Interaction | Usually prompt to clip | Prompt or action to evolving world state |
| Core challenge | Visual realism and temporal coherence | Spatial memory, causality, controllability, and persistence |
| Typical use | Creative media generation | Simulation, spatial design, robotics, agent training, interactive media |
| Evaluation question | Does the clip look plausible? | Does the world behave consistently when explored or acted on? |
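
The world-model evaluation question in the comparison can be sketched as a simple revisit test: walk away from a place, come back, and check that it looks the same. This is only a toy illustration. `ToyWorld`, `DriftingWorld`, and `revisit_is_consistent` are hypothetical names, not part of any real product or benchmark.

```python
class ToyWorld:
    """Hypothetical persistent world: a 1D corridor with fixed landmarks."""

    def __init__(self, landmarks: dict[int, str]):
        self.landmarks = dict(landmarks)  # position -> label
        self.position = 0

    def observe(self) -> str:
        """Return what is visible at the current position."""
        return self.landmarks.get(self.position, "empty")

    def step(self, action: int) -> str:
        """Move by `action` cells, then return the new observation."""
        self.position += action
        return self.observe()


class DriftingWorld(ToyWorld):
    """Hypothetical non-persistent world: output depends only on elapsed time."""

    def __init__(self):
        super().__init__({})
        self.t = 0

    def observe(self) -> str:
        return f"frame-{self.t}"

    def step(self, action: int) -> str:
        self.t += 1
        return super().step(action)


def revisit_is_consistent(world: ToyWorld, path: list[int]) -> bool:
    """Walk `path`, retrace it, and check every revisited position looks the same."""
    seen = {world.position: world.observe()}
    for action in list(path) + [-a for a in reversed(path)]:
        obs = world.step(action)
        # setdefault records the first observation and returns it on revisits.
        if seen.setdefault(world.position, obs) != obs:
            return False
    return True
```

Under these assumptions, `ToyWorld` passes the check because its landmarks are stored state, while `DriftingWorld` fails because each output depends only on how much time has passed, not on where the viewer is. That is the gap between a plausible clip and a world that holds together when explored.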

Why it is becoming a separate category

The phrase now appears across several visible product and research tracks: DeepMind uses it for interactive generated worlds, World Labs for spatial 3D worlds, and Runway for its general world model research direction, while NVIDIA uses the term "world foundation models" for physical AI workflows.

That spread is exactly why World Models Watch treats the term as a category, not a single product label.

First models to know