The debate on the performance of artificial intelligence models is no longer limited to a simple comparison of scores on linguistic benchmarks. What now distinguishes an effective AI from another is no longer only the quality of the model, but its ability to fit into a structured, measurable sequence of actions. In other words, its ability to act, not just to predict.
This shift brings to the fore a concept that is still little discussed: scaffolding. Behind this term, borrowed from software architecture, hides the keystone of agentic AI. It designates all the structures and components that allow an LLM (Large Language Model) to perform real tasks: organize its actions, access tools (browser, terminal, APIs), persist a memory, iterate on its errors. Without scaffolding, a model generates text. With it, it becomes an autonomous agent capable of producing concrete results.
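The components listed above can be sketched as a minimal agent loop. This is a purely illustrative sketch: `call_model` stands in for any LLM API, and the tool registry and message formats are invented for the example, not taken from any real framework.

```python
def call_model(prompt: str) -> str:
    """Placeholder for an LLM call; a real agent would query a model API here."""
    return "run_tool:echo hello"

TOOLS = {
    # Tool access: the agent can act beyond text generation.
    "echo": lambda arg: arg,
}

def run_agent(task: str, max_steps: int = 5) -> list[str]:
    memory: list[str] = [f"task: {task}"]       # persistent memory across steps
    for _ in range(max_steps):                  # iteration: errors feed back in
        action = call_model("\n".join(memory))  # planning: model picks next action
        if action.startswith("run_tool:"):
            name, _, arg = action.removeprefix("run_tool:").partition(" ")
            try:
                memory.append(f"observation: {TOOLS[name](arg)}")
            except KeyError:
                memory.append(f"error: unknown tool {name}")
        else:
            memory.append(f"answer: {action}")
            break
    return memory
```

The point is structural: the loop, the tool registry, and the accumulated memory are the scaffolding; the model only fills in one step at a time.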
The PaperBench benchmark, published by OpenAI two days ago, illustrates this shift vividly. The objective: to evaluate AI agents on their ability to replicate machine learning research papers. It is no longer a question of answering questions, but of reading a paper, understanding the experiments, writing the corresponding code, executing it, validating the results… then submitting a complete reproduction. A task that usually takes several days of human work.
The results? Claude 3.5 Sonnet, with a well-orchestrated agent, reaches 21% success. GPT-4, though deemed more powerful, caps out at 4% without adapted scaffolding. It is therefore not the raw power of the model that makes the difference, but the quality of the agent architecture that surrounds it.
This observation has concrete implications for companies. Today, investing in AI no longer means choosing the “best model”. One must design a complete system in which the model is integrated into a logic of execution, control, and continuous learning. This implies thinking in terms of action flows, modularity, planning, and persistent interfaces. The agent becomes an autonomous work unit whose performance depends as much on its supervision as on its native intelligence.
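One way to picture "supervision mattering as much as native intelligence" is a control layer wrapped around the model: validate the output, feed failures back, retry. The sketch below is hypothetical; all names (`with_supervision`, `toy_model`) are invented for illustration.

```python
from typing import Callable

def with_supervision(model: Callable[[str], str],
                     validate: Callable[[str], bool],
                     retries: int = 3) -> Callable[[str], str]:
    """Wrap a model call in a validate-and-retry loop: control, not a bigger model."""
    def supervised(prompt: str) -> str:
        feedback = prompt
        for _ in range(retries):
            output = model(feedback)
            if validate(output):
                return output
            # Feed the failure back so the next attempt can correct it.
            feedback = f"{prompt}\nPrevious attempt failed validation: {output!r}"
        raise RuntimeError("no valid output after retries")
    return supervised

# Toy model that only succeeds once it sees the failure feedback in its prompt.
def toy_model(prompt: str) -> str:
    return "42" if "failed" in prompt else "not a number"

agent = with_supervision(toy_model, validate=str.isdigit)
```

The same model produces a valid result only because of the loop around it, which is the article's point in miniature.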
In this sense, PaperBench marks a break. It does not test a theoretical capacity; it measures operational competence: the ability to transform a complex instruction into reproducible results. And this approach is part of a broader movement: that of an AI leaving the experimental field to enter business uses – drafting, automation, support, analysis, production.
For companies, this means revising their evaluation criteria. It is no longer enough to compare models on their linguistic benchmarks. What must now be assessed is the ability to integrate, orchestrate, and make an agent robust and iterative. Value creation no longer lies in prediction, but in execution guided by software architecture.
The real measure of artificial intelligence, in 2025, is no longer what the model “knows”, but what it can do – and redo, autonomously. And this capacity rests on the invisible architecture it is given: the scaffolding.