For two years, language models have dominated technology news, transformed everyday usage, disrupted the industry, and established the idea that a new software era has dawned. Behind this spectacular wave, however, another movement is taking shape in global research, one that may prove even more decisive.
It is based on the idea that, for artificial intelligence to truly progress, it must go beyond text and learn to understand the world. This is precisely the role of Joint Embedding Predictive Architectures (JEPA) and World Models, two approaches championed in particular by Yann LeCun, the outgoing chief AI scientist at Meta.
The limit of current models is now clearly identified. An LLM, no matter how good it may be, only learns to predict the next word: it has neither a structural memory of the world, nor an internal representation of objects, nor an understanding of elementary physical dynamics. When it describes a movement, it mobilizes no mechanical intuition, and when it answers a complex question, it relies on no causal model. It simply manipulates linguistic correlations, not laws of reality. This architecture condemns it to remain a reactive system, undoubtedly brilliant at producing text, but incapable of planning, anticipating, or reasoning in a robust manner.
JEPAs introduce a conceptual break in the field. Their objective is no longer to reconstruct an image, a sentence, or a segment of data, but to predict, in a latent space, the representation of the future state of a scene. The machine no longer seeks to imitate its input exactly but to anticipate what will happen. This apparently subtle difference profoundly modifies the nature of learning: instead of reproducing visual or linguistic details, the architecture learns to identify the stable elements, the regularities, and the implicit laws that structure a situation.
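The contrast with reconstruction-based training can be sketched in a few lines. The toy encoder, predictor, and weights below are illustrative stand-ins, not any actual JEPA implementation: the point is only that the loss is computed between embeddings, not between raw inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # toy encoder: a linear map plus tanh, standing in for a deep network
    return np.tanh(x @ W)

def predict(z, P):
    # predictor: maps the context embedding toward the future embedding
    return z @ P

d_in, d_lat = 8, 4
W_ctx = rng.normal(size=(d_in, d_lat))   # context-encoder weights (illustrative)
W_tgt = rng.normal(size=(d_in, d_lat))   # target-encoder weights (illustrative)
P = rng.normal(size=(d_lat, d_lat))      # predictor weights

x_now = rng.normal(size=d_in)    # current observation
x_next = rng.normal(size=d_in)   # future observation

z_pred = predict(encode(x_now, W_ctx), P)  # predicted future embedding
z_tgt = encode(x_next, W_tgt)              # embedding of the actual future

# JEPA-style objective: distance in latent space, not pixel space,
# so the model is free to ignore unpredictable surface details
loss = float(np.mean((z_pred - z_tgt) ** 2))
print(loss >= 0.0)
```

A generative model would instead be penalized for every pixel or token it fails to reproduce; here, only the abstract representation of the future has to be right.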
World Models, for their part, extend this logic by constructing a true "internal simulator" of reality. The AI no longer acts as a reflex system but operates as an agent with a coherent representation of the world. It can imagine several scenarios, compare their consequences, and choose the most relevant sequence of actions. This capacity for anticipation, which is at the heart of human behavior, is today one of the most visible gaps in generative AI, and it is precisely the gap World Models aim to fill.
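The "imagine, compare, choose" loop can be illustrated with a minimal random-shooting planner. The dynamics function below is a deliberately trivial stand-in for a learned world model, and the goal, horizon, and sample count are arbitrary assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def dynamics(state, action):
    # toy "world model": next state = state + action
    # (a stand-in for a learned transition model)
    return state + action

def rollout(state, actions):
    # imagine the trajectory produced by a sequence of actions
    for a in actions:
        state = dynamics(state, a)
    return state

goal = np.array([1.0, -1.0])   # where the agent wants to end up
state = np.zeros(2)            # where it currently is

# sample candidate action sequences, simulate each one internally,
# and keep the plan whose imagined outcome lands closest to the goal
candidates = rng.normal(scale=0.5, size=(64, 3, 2))  # 64 plans, 3 steps, 2-D actions
costs = [float(np.sum((rollout(state, plan) - goal) ** 2)) for plan in candidates]
best_plan = candidates[int(np.argmin(costs))]
print(best_plan.shape)
```

Everything happens inside the simulator: no action is executed in the world until the comparison of imagined consequences is done, which is exactly the anticipation a reflex system lacks.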
This transition is not merely theoretical; it responds to an empirical observation often recalled by Yann LeCun: a four-year-old child, without texts or explanations, accumulates more information about the world than an LLM trained on the entire Internet. Above all, the child learns by observing the effects of his own actions. It is this perception–action–correction loop that structures human cognitive development. Absent from current LLMs, it is at the very center of JEPA and predictive architectures.
The issue goes well beyond the academic framework, and the industrial applications are concrete. In robotics, only a system capable of predicting the consequences of its movements can reliably manipulate objects. In logistics, anticipating disruptions becomes a prerequisite for performance. In energy, modeling materials and chemical reactions requires a detailed understanding of microscopic dynamics. In enterprise software, multi-step planning will become a prerequisite for all complex tasks. Everything that LLMs do today by linguistic approximation will have to be rebuilt on prediction and simulation mechanisms.
JEPA and World Models should therefore not be considered a marginal optimization of deep learning, but a paradigm shift whose ambition is to build systems capable of reasoning, understanding the mechanisms of the physical world, and acting in open environments. They embody the transition from an AI that speaks to an AI that thinks, at least in the operational sense of the term. This is the bet undertaken by Yann LeCun, but also by Jeff Bezos with his new startup Prometheus.