GPT-5.4 Is Here: OpenAI's Most Capable Model Yet Just Crossed a New Threshold

OpenAI released GPT-5.4 on March 5, 2026, and billed it as "our most capable and efficient frontier model for professional work." That kind of language is easy to discount, since every new release comes with similar claims. But this time, the numbers and the feature list make a stronger case than usual. GPT-5.4 isn't just a smarter chatbot. It's a model that can operate software, read a million-token document without losing the plot, beat most office workers at structured knowledge tasks, and reason through problems with a transparency that even safety researchers are pointing to as a positive development.

Here is what's actually new, what it means for real users, and where the hard limits still are.


The "One Model to Rule Them All" Era Has Arrived

For most of the last two years, OpenAI ran a fragmented product lineup. There was one model for chat, a different one for reasoning, a separate one for coding. Users had to switch between them depending on the task, which was clunky and confusing. GPT-5.4 ends that.

The new model consolidates the industry-leading coding capabilities that were exclusive to GPT-5.3-Codex with the broader world knowledge and reasoning of the GPT-5.2 series, and layers on native computer-use capabilities that neither of those predecessors had. The result is a single model built for end-to-end professional workflows.

OpenAI describes the goal plainly: "The result is a model that gets complex real work done accurately, effectively, and efficiently, delivering what you asked for with less back and forth."

That "less back and forth" matters more than it sounds. Previous agentic models required users to course-correct constantly. GPT-5.4's Thinking variant now surfaces an upfront plan of its reasoning before diving in, so users can redirect it mid-process without starting over. That one change eliminates a frustrating pattern that anyone who has used AI for long-form work will recognize.


The Computer Use Breakthrough

The most technically significant thing about GPT-5.4 is that it is OpenAI's first general-purpose model with native computer-use capabilities. That means it can interact with desktop and web applications directly (interpreting screenshots, sending mouse commands, typing into fields), much as a human would.
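
The loop behind this kind of capability can be pictured as observe, decide, act. The sketch below is a toy illustration in Python, not OpenAI's API: FakeScreen, choose_action, run_agent, and the action format are hypothetical names invented for this example, and a hard-coded policy stands in for the model's decision.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a desktop: one text field and one button."""
    field_text: str = ""
    submitted: bool = False
    log: list = field(default_factory=list)

    def screenshot(self) -> dict:
        # A real agent would receive pixels; here we return a tiny state summary.
        return {"field_text": self.field_text, "submitted": self.submitted}

    def apply(self, action: dict) -> None:
        # Execute one action in the fake UI and record its type.
        self.log.append(action["type"])
        if action["type"] == "type":
            self.field_text += action["text"]
        elif action["type"] == "click" and action["target"] == "submit":
            self.submitted = True


def choose_action(observation: dict, goal_text: str) -> dict:
    """Hard-coded policy standing in for the model's decision (an assumption)."""
    if observation["submitted"]:
        return {"type": "done"}
    if observation["field_text"] != goal_text:
        return {"type": "type", "text": goal_text}
    return {"type": "click", "target": "submit"}


def run_agent(screen: FakeScreen, goal_text: str, max_steps: int = 10) -> bool:
    """Observe, decide, act until the policy reports done or the step cap is hit."""
    for _ in range(max_steps):
        action = choose_action(screen.screenshot(), goal_text)
        if action["type"] == "done":
            return True
        screen.apply(action)
    return False
```

The step cap is a deliberate choice in this sketch: an agent that can misread a screen needs a bound on how long it keeps trying before handing control back to a person.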

On OSWorld-Verified, a widely used benchmark for measuring how well AI systems interact with software environments, GPT-5.4 scored 75.0%. The typical human benchmark on the same test sits at 72.4%. GPT-5.4's predecessor, GPT-5.2, scored 47.3% on the same evaluation. That's a jump of nearly 28 percentage points (27.7) in a single generation.

On WebArena-Verified, which tests browser-based task completion, it achieved a 67.3% success rate. On Online-Mind2Web, using screenshot-based observations alone, it hit 92.8%.

These aren't abstract numbers. They represent whether AI can reliably book a meeting, extract data from a web form, navigate enterprise software, or file a report without a human guiding every click. The jump from GPT-5.2 to GPT-5.4 on computer use is more significant than almost any other benchmark improvement in the launch notes.


A Million Tokens, And What That Actually Enables

GPT-5.4 supports context windows up to one million tokens in the API. For reference, one million tokens is roughly the equivalent of several full-length novels, a multi-year legal case file, or an entire large codebase read in one pass.

This matters for agentic work in particular. Long-horizon tasks (multi-step research projects, complex software migrations, financial analysis across dozens of documents) have historically forced AI systems to forget early context by the time they reach the end of a task. With a million-token window, that constraint largely disappears for most real-world workflows.

It's worth noting that competitors like Google's Gemini 3.1 Pro and Meta's Llama 4 Scout currently offer context windows of up to 10 million tokens. OpenAI is not leading on raw context size. But OpenAI's counter-argument is token efficiency: GPT-5.4 can solve the same problems using significantly fewer tokens than its predecessor, which drives down both cost and response time. The question of whether a million tokens with better efficiency beats 10 million tokens with more waste is a legitimate debate, and the answer will depend on the specific use case.
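
To get a feel for what fits in a million tokens, a back-of-the-envelope check can help. The sketch below is illustrative only: it assumes roughly four characters per token, a common rough heuristic for English prose rather than a real tokenizer count, and the function names and the 50,000-token output reserve are hypothetical choices made for this example.

```python
CONTEXT_LIMIT_TOKENS = 1_000_000  # GPT-5.4 API context limit cited in the article
CHARS_PER_TOKEN = 4               # assumption: rough heuristic for English prose


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real tokenizer will give different counts."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(documents: list[str], reserve_for_output: int = 50_000) -> bool:
    """Check whether all documents fit while leaving room for the model's reply."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserve_for_output <= CONTEXT_LIMIT_TOKENS
```

Under these assumptions, nine documents of 400,000 characters each fit with room to spare, while a tenth pushes the request over the limit. Anything close to the boundary should be measured with an actual tokenizer.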


Knowledge Work Benchmarks: 83% Against Human Professionals

GDPval is an evaluation framework OpenAI developed to measure real-world professional knowledge work across 44 occupations, covering tasks such as building financial models, creating presentations, drafting legal summaries, and writing research briefs. It is designed to be harder and more occupation-relevant than traditional academic benchmarks.

GPT-5.4 scored 83% on GDPval, meaning it matched or outperformed industry professionals in 83% of comparisons evaluated by expert judges. That number puts it at the top of the leaderboard on that specific benchmark.
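
A "matched or outperformed" figure like this is a win-or-tie rate over pairwise comparisons. The sketch below shows how such a rate could be computed from a list of judge verdicts. The verdict labels and the function name are hypothetical, and any counts fed into it are made up for illustration, not OpenAI's data.

```python
from collections import Counter


def win_or_tie_rate(verdicts: list[str]) -> float:
    """Share of comparisons where the model matched or beat the professional.

    Each verdict is "win", "tie", or "loss" from the model's point of view.
    """
    if not verdicts:
        raise ValueError("need at least one comparison")
    counts = Counter(verdicts)
    return (counts["win"] + counts["tie"]) / len(verdicts)
```

Note that a single rate hides the split between wins and ties: 70 wins plus 13 ties and 50 wins plus 33 ties out of 100 both come out at 83%.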

On OpenAI's internal investment banking benchmark, which tests real-world workflows like building a three-statement financial model with proper formatting and citations, performance improved from 43.7% with GPT-5 to 87.3% with GPT-5.4 Thinking.

Mercor, a hiring and evaluation platform, independently tested the model on its APEX-Agents benchmark, which is designed to assess professional skills in law and finance. GPT-5.4 topped the leaderboard. Mercor's CEO described it as excelling at "long-horizon deliverables such as slide decks, financial models, and legal analysis, delivering top performance while running faster and at a lower cost than competitive frontier models."


Hallucinations Are Down, Significantly

GPT-5.4 is 33% less likely to produce errors in individual factual claims compared to GPT-5.2. Full responses are 18% less likely to contain factual mistakes.

That's not a solved problem. AI hallucination is still real and still consequential, but the trajectory is meaningful. The original GPT-5 launch in August 2025 advertised that its responses were approximately 45% less likely to contain factual errors than GPT-4o when web search was enabled. GPT-5.4 is now layering further reductions on top of those gains.

For professional use cases where accuracy matters, such as legal research, medical documentation, and financial modeling, this directional improvement is more important than almost any benchmark number.


Tool Search: A Developer-Focused Efficiency Win

One of the less-covered new features in GPT-5.4 is a system called Tool Search, and it's genuinely clever.

Building an application on the OpenAI API often means managing a large catalog of tools: functions the model can call to fetch data, run code, query databases, and so on. Until now, developers had to include definitions for all of those tools in every API request, which consumed tokens and inflated costs as tool libraries grew.

Tool Search changes that. The model can now look up tool definitions on demand, only pulling in what it actually needs for a given task. OpenAI says this reduces prompt sizes and inference costs, a particularly useful improvement for enterprise applications with large, complex tool ecosystems.
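
The general idea of on-demand lookup can be sketched in a few lines. This is not OpenAI's Tool Search implementation or API, whose internals the announcement does not describe. It is a minimal keyword-matching illustration in Python, and the catalog entries, search_tools, and prompt_size are all hypothetical names invented for this example.

```python
import json

# Hypothetical tool catalog; names and descriptions are invented for this sketch.
TOOL_CATALOG = {
    "get_invoice": {"description": "Fetch an invoice by id from the billing database"},
    "send_email": {"description": "Send an email message to a recipient"},
    "run_sql": {"description": "Run a read-only SQL query against the warehouse database"},
    "get_weather": {"description": "Look up the weather forecast for a city"},
}


def search_tools(query: str, catalog: dict, limit: int = 2) -> list[str]:
    """Rank tools by how many query words appear in their name or description."""
    words = set(query.lower().split())
    scored = []
    for name, spec in catalog.items():
        text = name.replace("_", " ") + " " + spec["description"]
        overlap = len(words & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


def prompt_size(tool_names: list[str], catalog: dict) -> int:
    """Character count of the serialized definitions, a crude proxy for tokens."""
    return len(json.dumps({name: catalog[name] for name in tool_names}))
```

With this catalog, search_tools("sql query", TOOL_CATALOG) returns ["run_sql"], so only one definition would be sent instead of four. The saving grows with the size of the catalog, which is why the feature matters most for large tool ecosystems.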


Vision Is Better Too, and It Matters for Agents

GPT-5.4's image-processing capabilities were also upgraded. Developers can now upload images containing more than 10 million pixels without compression. Previously, large images had to be downsampled, which could destroy fine-grained details, an issue for anyone using AI to analyze technical diagrams, satellite imagery, dense charts, or high-resolution medical scans.

This improvement is directly tied to the computer-use capabilities. The model's ability to interact with graphical interfaces depends on accurately reading what's on screen. Better vision resolution means fewer misreads, which means more reliable computer-use automation.


Safety: The Transparent Reasoner

One of the more nuanced parts of the GPT-5.4 announcement is what OpenAI said about safety and deception.

OpenAI's own evaluation found that in the GPT-5.4 Thinking variant, deception is measurably less likely than in earlier models. Specifically, OpenAI reported that the model "lacks the ability to hide its reasoning," and that "CoT monitoring remains an effective safety tool." Chain-of-thought monitoring, watching what a model is "thinking" before it gives an answer, has been one of the main proposed approaches to AI safety oversight, and OpenAI presents this finding as evidence that the approach still works with GPT-5.4.

The model also resists jailbreaks more effectively than its predecessors and shows significantly better user work preservation during agentic tasks, 53% versus 18% for GPT-5.2-Codex. That second stat means it's less likely to accidentally overwrite, delete, or corrupt your files and data while running autonomous tasks.

For cybersecurity use cases, OpenAI introduced a Trusted Access for Cyber (TAC) program that gives vetted security professionals and organizations controlled access to the model's stronger security analysis capabilities.


The Limits: What GPT-5.4 Still Can't Do

Context window leadership isn't there. At 1 million tokens, GPT-5.4 is competitive but not dominant. Gemini 3.1 Pro and Meta's Llama 4 Scout both offer 10 million tokens. For specific use cases that require ingesting truly enormous corpora in a single pass, those models have a structural advantage.

The "Reasoning Tax" is real. GPT-5.4's advanced deliberation is computationally expensive. For organizations running high-volume workflows, the cost of sustained deep reasoning adds up fast. OpenAI's token efficiency improvements partially offset this, but heavy agentic use still demands serious infrastructure.

Pricing has gone up. API input pricing for GPT-5.4 is $2.50 per million tokens, a 43% increase over GPT-5.2's $1.75. Output pricing rose more modestly. For low-volume or occasional users the math works out, because better efficiency means fewer tokens overall. For high-volume enterprise deployments, that input price jump is a real budget line.
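
The trade-off between a higher price and fewer tokens is simple arithmetic. The sketch below uses only the two input prices quoted above and covers input cost alone, since the article gives no output prices. The function names are hypothetical.

```python
OLD_INPUT_PRICE = 1.75  # USD per million input tokens, GPT-5.2 (from the article)
NEW_INPUT_PRICE = 2.50  # USD per million input tokens, GPT-5.4 (from the article)


def percent_increase(old: float, new: float) -> float:
    """Relative price increase, in percent."""
    return (new - old) / old * 100


def break_even_token_reduction(old: float, new: float) -> float:
    """Fraction of input tokens that must be saved for input spend to stay flat."""
    return 1 - old / new


def input_cost(tokens: int, price_per_million: float) -> float:
    """Input cost in USD for a given token count."""
    return tokens / 1_000_000 * price_per_million
```

The increase works out to about 42.9%, and the break-even point is a 30% cut in input tokens: 10 million tokens at the old price and 7 million at the new price both cost $17.50. A workload that saves less than that on input pays more than before.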

Computer use still fails sometimes. A 75% success rate on OSWorld-Verified is a genuine breakthrough: it exceeds the human benchmark. But that also means roughly one in four attempts still fails. For fully automated workflows without human oversight, that failure rate is too high for anything mission-critical. The right framing for right now is "AI-assisted computer use with a human in the loop," not "autonomous agent you can leave running unsupervised."
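
The case for a human in the loop gets stronger when tasks are chained. The sketch below is a simplification: it assumes each task succeeds independently with the same probability, and it treats the benchmark score as a per-task success rate in a real workflow, which is an assumption rather than something the benchmark measures. The function name is hypothetical.

```python
def chain_success_probability(per_task_success: float, steps: int) -> float:
    """Probability that every task in a chain succeeds, assuming independence."""
    if not 0.0 <= per_task_success <= 1.0:
        raise ValueError("per_task_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_task_success ** steps
```

Under those assumptions, a chain of three tasks at 75% each completes cleanly about 42% of the time, and a chain of five about 24% of the time. A checkpoint where a person reviews intermediate results resets that compounding.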

Coding still has competition. GPT-5.4 leads on the enterprise-specific SWE-Bench Pro, but Claude Opus 4.6 holds a slight edge on the standard open-source SWE-bench metric. Depending on what your code actually looks like, the answer to "which model should I use for coding" is still nuanced.


What It Means

The release of GPT-5.4 represents a genuine inflection point, but not the one that sounds most dramatic.

The more consequential shift is quieter: the agent era is now real, not aspirational. A model that can read a legal brief, open a spreadsheet, build a three-statement financial model inside Excel, check a result against a live data source, and save the output, all without a human clicking through each step, is not a futuristic demo. It is a product that shipped on March 5, 2026.

The capabilities are not yet reliable enough to fully replace human oversight. The failure modes are real. But the direction is clear, and it moved faster than most people expected.

For developers, the takeaway is that Tool Search and the 1 million token context window make complex agentic applications significantly cheaper and easier to build than they were six months ago. For professionals, the Thinking mode's upfront plan feature is a practical usability improvement that makes the model actually easier to steer in real work. For researchers and policymakers, the transparent chain-of-thought finding is a meaningful data point in the ongoing conversation about what AI safety infrastructure actually needs to look like.

GPT-5.4 is not the end of the race. But it has moved the finish line.