GPT-5.4 Thinking: A Technical Deep Dive
Executive Summary
OpenAI’s GPT-5.4 Thinking is a specialized reasoning-focused model released in March 2026, positioned as a “cognitively prepared” variant optimized for deeper analysis and professional-grade tasks rather than general-purpose chat. It demonstrates significantly stronger structured reasoning and long-context analytical capabilities than prior GPT-5.x releases, with independent evaluations showing it outperforming experienced human professionals by 83% on pro-level work. However, the model exhibits persistent prompt-following issues, frequently answering questions that differ from the user’s explicit request, and retains legacy weaknesses in image generation and output formatting. Available via ChatGPT Plus ($20/month), the API, and Codex, GPT-5.4 Thinking represents OpenAI’s clearest attempt yet at building a model for complex cognitive workloads, but its tendency to diverge from instructions raises questions about reliability in production environments.
Technical Architecture
While OpenAI has not published a full technical paper on GPT-5.4 Thinking, the model is described as a distinct variant rather than a simple fine-tune of the base GPT-5.4. The “Thinking” designation indicates architectural and training modifications focused on extended reasoning chains, deeper internal deliberation, and improved handling of multi-variable, real-world decision-making.
Key indicators from testing include:
- Substantially longer and more structured internal reasoning traces before producing final output.
- Strong performance on tasks requiring explicit trade-off analysis, constraint enumeration, and engineering feasibility assessment.
- Improved long-document handling, including concise summarization of complex technical papers while preserving core arguments.
- Continued struggles with strict instruction following, suggesting that the additional reasoning capacity may come at the cost of increased “creative interpretation” of user intent.
The model appears to incorporate advances in chain-of-thought (CoT) optimization and possibly reinforcement learning from structured reasoning traces. Its behavior on tasks such as aircraft carrier design — where it correctly identifies why four downward-facing turbo-propellers represent a “weak solution” from an aircraft construction and weight-to-power perspective — suggests genuine engineering reasoning rather than surface pattern matching. However, its failure to reference historical precedents such as the USS Akron and USS Macon dirigibles indicates gaps in knowledge integration or retrieval-augmented generation (RAG) that have not been fully resolved.
Image generation remains anchored to a legacy multimodal component (likely DALL·E 3 or an incremental variant), which explains the consistent failure to correctly orient propulsion systems despite explicit textual instructions. This architectural split between the high-reasoning text backbone and the older vision/generation stack creates a noticeable quality gap.
Performance Analysis
Independent evaluations paint a complex picture of GPT-5.4 Thinking’s capabilities. According to ZDNET testing, the model “clobbers humans on pro-level work in tests — by 83%.” This metric comes from structured professional tasks where GPT-5.4 either kept up with or surpassed experienced human professionals, as judged by human or AI graders.
Notable performance observations from multiple sources:
- Structured reasoning: Significantly improved on tasks requiring weighing multiple variables. In one documented case, GPT-5.4 provided nuanced analysis of a real-world decision (walking vs driving to a car wash 100 meters away) that demonstrated deeper consideration than previous models, although it reached a conclusion some human evaluators found questionable.
- Long-document understanding: Handled extended technical papers effectively, producing concise summaries that captured core arguments and maintained logical structure.
- Complex design tasks: Produced thoughtful engineering analysis on speculative vehicle design, correctly identifying power-to-weight problems, flight deck operational constraints, and structural limitations.
- Blind evaluations: In structured blind tests against Claude Opus 4.6 and Gemini 3.1 on realistic professional workloads, GPT-5.4 showed competitive or superior performance according to independent judges.
However, several concerning patterns emerged:
- Instruction following failures: The model frequently answers questions that differ from what was asked. This was consistent across testers and represents a regression in precise prompt adherence despite gains in reasoning depth.
- Simple task errors: In Nate B. Jones’ evaluations, GPT-5.4 confidently suggested walking to wash a car 100 meters away, while Claude provided the more practical “Drive” response in one sentence.
- Formatting artifacts: Strong preference for extremely long numbered lists, leading to verbose and sometimes poorly structured output.
- Multimodal weakness: Image generation quality lags significantly behind text capabilities, often ignoring critical elements of the prompt (e.g., propeller orientation on a flying aircraft carrier).
Benchmark Context (derived from reported evaluations):
| Task Type | GPT-5.4 Thinking | Previous GPT-5.x | Claude Opus 4.6 | Human Professionals |
|---|---|---|---|---|
| Professional-grade work | +83% over humans | Moderate | Competitive | Baseline |
| Multi-variable reasoning | Strong | Weak | Strong | Strong |
| Long-document summarization | Excellent | Good | Excellent | Excellent |
| Strict instruction following | Poor | Moderate | Strong | N/A |
| Simple practical decisions | Inconsistent | Poor | Strong | Strong |
Technical Implications
GPT-5.4 Thinking signals OpenAI’s strategic shift toward specialized reasoning models rather than pursuing a single monolithic general intelligence. By releasing a “Thinking” variant, the company acknowledges that different cognitive workloads require different architectural priorities. This mirrors trends seen in other labs (e.g., Anthropic’s Claude variants and Google’s Gemini specialized models).
For developers and enterprises, the model offers promising capabilities for:
- Complex systems design and feasibility analysis
- Technical documentation synthesis
- Structured decision support in engineering and strategy domains
However, the persistent instruction-following issues present serious challenges for agentic workflows and autonomous systems. A model that confidently answers the wrong question poses significant risks in high-stakes environments such as legal analysis, medical reasoning, or software architecture decisions.
The gap between text reasoning quality and multimodal performance also highlights ongoing integration challenges in frontier AI systems. While the language backbone has made substantial progress, the vision and generation components appear to be lagging, creating inconsistent user experiences.
Limitations and Trade-offs
The most significant limitation is the model’s tendency to pursue what it considers a “better” or “more interesting” question rather than strictly following user instructions. This behavior, while sometimes producing valuable insights, undermines reliability and requires continuous conversation management to keep the model on track.
Other trade-offs include:
- Verbosity: Extremely long responses and numbered lists can reduce usability for quick tasks.
- Multimodal inconsistency: Image generation remains notably weak compared to text capabilities.
- Knowledge gaps: Failure to reference relevant historical examples (dirigibles) suggests incomplete knowledge synthesis.
- Cost vs capability: While available at the standard $20/month ChatGPT Plus tier, the model’s value depends heavily on the specific use case. For simple queries, the additional reasoning overhead may be unnecessary and occasionally counterproductive.
Expert Perspective
GPT-5.4 Thinking represents meaningful progress in reasoning depth but reveals a fundamental tension in current LLM development: the trade-off between creative analytical power and precise instruction following. The model’s ability to identify engineering flaws in speculative designs demonstrates genuine capability advancement, yet its failure on simple practical reasoning tasks (the car wash example) and its frequent divergence from user intent suggest that scaling reasoning capacity without corresponding improvements in control mechanisms creates new failure modes.
For senior ML engineers, this release underscores the importance of building robust scaffolding, verification layers, and human-in-the-loop processes around even the most advanced models. The 83% outperformance on professional tasks is impressive but must be weighed against the observed error patterns in instruction adherence and basic decision-making.
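The scaffolding point can be made concrete. Below is a minimal sketch, assuming a hypothetical `ask_model` callable (prompt in, text out) standing in for any particular SDK: the wrapper scores whether a draft answer engages the question's own terms, retries once with a re-anchored prompt, and otherwise flags the response for human review. The lexical-overlap check is deliberately crude; a production layer would use an LLM judge or embedding similarity.

```python
def relevance_score(question: str, answer: str) -> float:
    """Crude lexical overlap between question and answer terms.

    Illustrative heuristic only: a real verification layer would use an
    LLM judge or embedding similarity rather than keyword overlap.
    """
    stop = {"the", "a", "an", "to", "of", "is", "in", "on", "and", "or",
            "what", "how", "should", "i", "you", "it"}
    q_terms = {w.strip("?.,!;:").lower() for w in question.split()} - stop
    a_terms = {w.strip("?.,!;:").lower() for w in answer.split()} - stop
    if not q_terms:
        return 1.0
    return len(q_terms & a_terms) / len(q_terms)


def verified_answer(question: str, ask_model, threshold: float = 0.3,
                    max_retries: int = 2) -> tuple[str, bool]:
    """Retry when a draft answer drifts from the question asked.

    `ask_model` is a hypothetical callable (prompt -> str). Returns the
    final answer plus a flag signalling that a human should review it.
    """
    prompt = question
    for _ in range(max_retries + 1):
        draft = ask_model(prompt)
        if relevance_score(question, draft) >= threshold:
            return draft, False
        # Re-anchor the model on the literal question before retrying.
        prompt = f"Answer exactly this question, nothing else: {question}"
    return draft, True  # escalate to human-in-the-loop review
```

The threshold and retry budget are tuning knobs; the structural point is that acceptance is decided outside the model, not by the model itself.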
The decision to jump from GPT-5.2 directly to 5.4 and brand the release as “Thinking” also indicates OpenAI’s recognition that incremental version numbers no longer adequately communicate capability shifts. This may foreshadow a future where specialized models (reasoning, creative, coding, multimodal) become the norm rather than general-purpose releases.
Technical FAQ
How does GPT-5.4 Thinking compare to Claude Opus 4.6 on professional reasoning tasks?
According to blind evaluations, GPT-5.4 Thinking is competitive with Claude Opus 4.6 on complex professional work, with some tests showing slight advantages in structured analysis. However, Claude maintains superior instruction-following discipline and more reliable simple decision-making, making it preferable for workflows requiring strict adherence to user specifications.
Does GPT-5.4 Thinking introduce breaking changes to the existing ChatGPT or OpenAI API?
No breaking API changes are reported. The model is accessible through the standard ChatGPT Plus interface, Codex programming tool, and existing OpenAI API endpoints. Users can select GPT-5.4 Thinking as a model option where available, maintaining backward compatibility with existing integration patterns.
What are the current pricing and availability details for GPT-5.4 Thinking?
The model is available to ChatGPT Plus subscribers at the standard $20 per month tier. It is also accessible via the OpenAI API and the Codex programming environment. No premium “Pro” tier is required for access to the Thinking variant based on current reporting.
How significant is the instruction-following regression compared to previous GPT versions?
The regression appears notable. Multiple independent testers have observed that GPT-5.4 Thinking frequently reframes or answers adjacent questions rather than the exact query posed. This represents a meaningful usability challenge that requires additional prompt engineering and conversation management compared to more obedient models like Claude.
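One cheap conversation-management pattern for this regression is restate-then-answer: force the model to echo the question it believes it is answering, then reject responses where the echo diverges from the original. A sketch follows; the template wording and the `QUESTION:`/`ANSWER:` markers are illustrative conventions of this example, not an OpenAI-documented technique.

```python
def scaffolded_prompt(question: str) -> str:
    """Wrap a query in a restate-then-answer template.

    Forcing the model to echo the question it is about to answer makes
    divergence visible before the answer is even read.
    """
    return (
        "First, restate the question you are answering, verbatim, on a line "
        "beginning with 'QUESTION:'. Then answer only that question on the "
        "following lines, beginning with 'ANSWER:'.\n\n"
        f"Question: {question}"
    )


def restatement_matches(question: str, response: str) -> bool:
    """Check that the model's echoed question matches the original."""
    for line in response.splitlines():
        if line.startswith("QUESTION:"):
            echoed = line[len("QUESTION:"):].strip()
            return echoed.lower() == question.strip().lower()
    return False  # no restatement found: treat as divergence
```

A caller would discard any response where `restatement_matches` returns False and re-prompt, turning the model's reframing habit into a detectable, recoverable event rather than a silent failure.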
Sources
- I tested GPT-5.4, and the answers were really good - just not always what I asked
- OpenAI's new GPT-5.4 clobbers humans on pro-level work in tests - by 83%
- ChatGPT 5.4 Is Good. That's Not the Point.
- GPT-5.4 beat human performance on desktop tasks and missed a question a child would get right.
- We Tried GPT-5.4 And it is Not Your Regular AI Chatbot Anymore
- ChatGPT-5.4 is OpenAI’s fastest model yet — 7 prompts that show what it can really do
