OpenAI has just released a new generation of model, GPT-5.4, described as “our highest performance and most efficient frontier model for professional work”. The company claims to have brought together in a single architecture its recent advances in reasoning, programming and software agents capable of interacting with tools and computing environments.
Beyond the product announcement, the OpenAI publication rests on a dense set of benchmarks and evaluations intended to demonstrate a qualitative leap over previous generations. It remains to be seen what these figures really say: do they demonstrate a significant technological change, or are they part of a now-classic exercise in staging the performance of artificial intelligence models?
A model designed to produce work
GPT-5.4 is not presented as a simple conversational improvement but as a tool intended to produce professional deliverables: financial spreadsheets, presentations, legal analyses and structured documents.
In the official announcement, the company summarizes the goal of the model as follows:
“GPT-5.4 brings together the best of our recent advances in reasoning, programming and agentic workflows into a single frontier model.”
In other words, the promise is based on the convergence of three dimensions: reasoning ability, writing code and executing tasks via software tools.
The GDPval benchmark, used by OpenAI to evaluate the production of professional deliverables across forty-four occupations, is one of the highlighted indicators. GPT-5.4 achieves results that beat or match those of professionals in 83% of cases, against 70.9% for GPT-5.2.
For Brendan Foody, CEO of Mercor:
“GPT-5.4 is the best model we have tested so far. It now occupies first place in our APEX-Agents benchmark, which measures the performance of models on professional services tasks.”
Visible progress in the use of tools and workflows
One of the areas where the gains appear most significant concerns the use of tools and the execution of complex workflows. OpenAI indicates for example that GPT-5.4 obtains a 75% success rate on OSWorld-Verified, a benchmark measuring an agent’s ability to operate a computing environment via screenshots and keyboard-and-mouse actions, where GPT-5.2 only reached 47.3%. On BrowseComp, a test evaluating multi-step web search, GPT-5.4 achieves 82.7%, against 65.8% for GPT-5.2.
According to OpenAI:
“GPT-5.4 performs better for agentic web search. It is able to continue the search over several cycles in order to identify the most relevant sources.”
The company is also introducing a feature called tool search, which lets the model dynamically identify the relevant tool in a large catalog of APIs without loading all the definitions into the initial context. In an evaluation carried out on the MCP Atlas benchmark, OpenAI indicates that this approach would reduce the number of tokens used by 47% without loss of precision.
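OpenAI has not detailed how tool search is implemented. As a minimal sketch of the general idea, with a hypothetical catalog and a naive keyword match standing in for whatever retrieval mechanism is actually used, the principle looks like this:

```python
# Illustrative sketch of dynamic tool lookup: instead of placing every tool
# definition in the prompt, search a catalog and load only the matches.
# The catalog, scoring and token estimate are hypothetical, not OpenAI's API.

TOOL_CATALOG = [
    {"name": "get_invoice", "description": "Retrieve an invoice by id from billing"},
    {"name": "send_email", "description": "Send an email to a recipient"},
    {"name": "query_crm", "description": "Search customer records in the CRM"},
]

def search_tools(query: str, catalog: list[dict], top_k: int = 1) -> list[dict]:
    """Rank tools by naive keyword overlap between query and description."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(t["description"].lower().split())), t) for t in catalog
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored[:top_k] if score > 0]

def context_tokens(tools: list[dict]) -> int:
    """Crude token estimate: whitespace-separated words of each definition."""
    return sum(len((t["name"] + " " + t["description"]).split()) for t in tools)

selected = search_tools("send an email to the customer", TOOL_CATALOG)
print(selected[0]["name"])
# Loading only the selected definition uses fewer context tokens than loading all.
print(context_tokens(selected), "<", context_tokens(TOOL_CATALOG))
```

The token savings OpenAI reports come from exactly this asymmetry: the context carries one matching definition rather than the full catalog.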
More nuanced gains on certain technical benchmarks
While certain indicators are clearly progressing, the results appear more moderate in other areas, particularly in programming.
On SWE-Bench Pro, the reference benchmark for evaluating models on real software development problems, GPT-5.4 achieves 57.7%, against 56.8% for GPT-5.3-Codex and 55.6% for GPT-5.2. Progress exists, but it remains relatively limited compared to the gains observed in agent workflows or web navigation. On certain specialized tests, the previous generation even maintains a slight advantage: on Terminal Bench 2.0, for example, GPT-5.3-Codex remains slightly ahead of GPT-5.4.
This situation illustrates a trend observable for several generations of models: new systems seek less to dominate each isolated benchmark than to improve overall versatility.
A measured improvement in reliability
OpenAI also claims to have reduced the model’s factual error rate. According to the company, on a set of anonymized prompts where users had reported errors:
“Individual statements generated by GPT-5.4 are 33% less likely to be false, and its complete answers are 18% less likely to contain errors.”
While these figures suggest progress, the announcement specifies neither the absolute error rate nor the detailed composition of the corpus used.
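This matters because the announced figures are relative reductions: the resulting absolute error rate depends entirely on the unstated baseline. A quick calculation with hypothetical baselines (the 0.33 reduction is from the announcement; the baseline rates are invented for illustration) makes the point:

```python
# "33% less likely to be false" is a relative reduction; without the baseline,
# the absolute error rate cannot be recovered. Baselines here are hypothetical.

def absolute_rate(baseline: float, relative_reduction: float) -> float:
    """Absolute error rate after applying a relative reduction."""
    return baseline * (1 - relative_reduction)

for baseline in (0.10, 0.30):  # two hypothetical GPT-5.2 statement error rates
    print(f"baseline {baseline:.0%} -> {absolute_rate(baseline, 0.33):.1%}")
```

A 33% reduction leaves a 6.7% error rate if the baseline was 10%, but still over 20% if it was 30%; the same headline number describes very different levels of reliability.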
Who is this new generation of model for?
GPT-5.4 is explicitly aimed at the enterprise: OpenAI describes it as a model “designed for professional work”, capable of producing deliverables comparable to those of a junior analyst or consultant.
Three categories of users appear in the distribution strategy.
The first concerns professions in consulting, finance, law or strategy. The examples used in the evaluations (financial models, legal contracts or presentations) correspond precisely to these uses.
The second target is that of developers. GPT-5.4 is deployed in the API and in Codex with several features intended for building software agents capable of using external tools, automating workflows or interacting with software interfaces.
Finally, OpenAI explicitly targets organizations. GPT-5.4 is available in the Team, Enterprise and Edu offerings, with planned integrations into productivity tools such as Excel.
Access terms and pricing
GPT-5.4 is distributed in several options. In ChatGPT, it appears as GPT-5.4 Thinking, accessible to ChatGPT Plus, Team and Pro subscribers. A higher-performance version, GPT-5.4 Pro, is reserved for Pro and Enterprise subscriptions. The model gradually replaces GPT-5.2 Thinking, which will remain accessible for a few months in the “Legacy Models” section.
For developers, GPT-5.4 is available in the API under the identifiers gpt-5.4 and gpt-5.4-pro.
The announcement is also accompanied by a price increase. In the API, the input price rises from $1.75 to $2.50 per million tokens, while the output price reaches $15, against $14 for GPT-5.2. OpenAI justifies this increase by the model’s greater efficiency: it would use fewer tokens to solve a given task.
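That trade-off (higher per-token price versus fewer tokens per task) is easy to check arithmetically. The per-million-token prices below come from the announcement; the task token counts are hypothetical examples chosen only to illustrate the mechanism:

```python
# Cost comparison between GPT-5.2 and GPT-5.4 at the announced API rates
# (dollars per million tokens). Token counts per task are hypothetical.

PRICES = {
    "gpt-5.2": {"input": 1.75, "output": 14.0},
    "gpt-5.4": {"input": 2.50, "output": 15.0},
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at the given model's per-million-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Same task: suppose the old model uses 20k input / 5k output tokens, and the
# new one, being more token-efficient, uses 15k / 4k. Despite the higher
# per-token rates, the task itself can come out cheaper.
old = task_cost("gpt-5.2", 20_000, 5_000)
new = task_cost("gpt-5.4", 15_000, 4_000)
print(f"GPT-5.2: ${old:.4f}  GPT-5.4: ${new:.4f}")
```

Whether real workloads see such token savings is precisely what OpenAI's efficiency claim asserts and the announcement does not let readers verify.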
An increasingly structured competitive landscape
The launch of GPT-5.4 comes in a market where several major players are now competing for leadership in AI models.
Anthropic has established itself as one of the most credible competitors with the Claude family, renowned for its long-document analysis capabilities and its security-oriented approach.
Google, for its part, is developing Gemini, integrated into the Google Workspace ecosystem. The advantage of this approach lies in direct access to productivity tools (Gmail, Docs or Sheets) and to the group’s research infrastructure.
Microsoft is pursuing a different strategy with Copilot, integrated directly into the Office suite and into development tools like GitHub. AI no longer appears as an autonomous application but as a native feature of the software already in use.
Faced with these competitors, ChatGPT retains several advantages: a large user base, an API widely adopted by developers and a versatile model capable of covering a large number of uses. But the competition is now less about the raw performance of the models than about their integration into work environments.
Between real progress and technological narrative
The announcement of GPT-5.4 thus illustrates the recurring ambiguity of AI model launches.
The measured progress, particularly in the use of tools, web navigation and the execution of complex tasks, appears real. At the same time, the presentation of performance rests on a set of benchmarks whose interpretation remains partial without access to the complete protocols.
In this context, GPT-5.4 seems less to mark a spectacular break than a further step in the progressive integration of language models at the heart of professional uses.