ChatGPT GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro: How does OpenAI’s latest model compare against rivals?

OpenAI launched its GPT-5.5 model earlier this week, aiming to take on Anthropic’s recently launched Claude Opus 4.7 and Google’s Gemini 3.1 Pro. The new model is claimed to bring massive leaps in coding capability, along with improved agentic abilities and scientific-research performance.

How does GPT-5.5 compare against Claude and Gemini?

OpenAI’s GPT-5.5 leads the benchmarks for agentic use and efficiency, but the new model still lags behind Claude on benchmarks that require precision coding, while Gemini 3.1 Pro maintains a lead in areas around academic reasoning.

Where ChatGPT leads

Across the various benchmarks, GPT-5.5 (including its Pro variant) took the top spot in 15 categories, while Claude Opus 4.7 led in 7 evaluations, and Gemini 3.1 Pro secured 2 wins.

On Terminal-Bench 2.0, which tests complex command-line workflows and tool coordination, GPT-5.5 achieved an accuracy of 82.7%, ahead of Opus 4.7 (69.4%) and Gemini 3.1 Pro (68.5%).

The trend continues in benchmarks that measure professional knowledge work and autonomous computer operation.

On the GDPval benchmark, which measures a model’s ability to produce well-specified work across various occupations, GPT-5.5 scored 84.9%, outpacing both Claude Opus 4.7 (80.3%) and Gemini 3.1 Pro (67.3%).

When it comes to operating a real computer independently, GPT-5.5 narrowly came ahead of the competition on OSWorld-Verified with a 78.7% score, just a fraction ahead of Claude Opus 4.7 at 78.0%.

| Benchmark (Category) | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Terminal-Bench 2.0 (Agentic Coding) | 82.7% | — | 69.4% | 68.5% |
| SWE-Bench Pro (Real-world Coding) | 58.6% | — | 64.3% | 54.2% |
| GDPval (Professional Knowledge) | 84.9% | 82.3% | 80.3% | 67.3% |
| OSWorld-Verified (Computer Use) | 78.7% | — | 78.0% | — |
| BrowseComp (Tool Use) | 84.4% | 90.1% | 79.3% | 85.9% |
| FrontierMath Tier 1–3 (Academic Math) | 51.7% | 52.4% | 43.8% | 36.9% |
| FrontierMath Tier 4 (Advanced Math) | 35.4% | 39.6% | 22.9% | 16.7% |
| GPQA Diamond (Expert Reasoning) | 93.6% | — | 94.2% | 94.3% |
| ARC-AGI-1 (Abstract Reasoning) | 95.0% | — | 93.5% | 98.0% |
| CyberGym (Cybersecurity) | 81.8% | — | 73.1% | — |

Where Claude Opus 4.7 leads

Meanwhile, Anthropic’s Claude Opus 4.7 stayed ahead of GPT-5.5 and Gemini in areas that require real-world coding and complex data retrieval.

  • Claude maintained its dominance on SWE-Bench Pro, a critical benchmark for resolving real-world GitHub issues. Opus 4.7 scored 64.3% on the benchmark, compared to GPT-5.5’s 58.6% and Gemini’s 54.2%.
  • It also outperformed GPT-5.5 on FinanceAgent v1.1 (64.4%), MCP Atlas (79.1%), and the coveted Humanity’s Last Exam (46.9%).
  • Additionally, Claude Opus 4.7 took three wins in the Graphwalks long-context evaluations, beating GPT-5.5 in the BFS 256k, parents 256k, and parents 1mil categories.

Where Gemini 3.1 Pro leads

While Google’s model lags behind GPT-5.5 and Claude in agentic tool use and coding, it still maintains a lead in benchmarks that require high-level reasoning.

  • Gemini 3.1 Pro narrowly edged out the competition on the graduate-level GPQA Diamond benchmark, scoring 94.3% to beat Claude’s 94.2% and GPT-5.5’s 93.6%.
  • It also demonstrated superior abstract reasoning on ARC-AGI-1 (Verified), securing an impressive 98.0% compared to GPT-5.5’s 95.0% and Claude’s 93.5%.

Netizens react to GPT-5.5 launch:

Social media has been largely divided on whether GPT-5.5 is finally better than Claude for coding-related tasks. Some users noted that the model felt more intuitive and expert-like than its predecessor, and that it can one-shot entire apps via Codex. Others were less impressed, saying the model felt like GPT-5.4 with minor fixes.

“I would say it somewhat trades blows with Opus 4.7 in terms of pure coding quality; however the improved speed and MUCH MUCH more generous Codex gives it the win,” wrote one user on Reddit.

“GPT-5.4 already worked well, especially for coding, but writing was the part where I still felt some weakness. With 5.5, that feels noticeably better. The responses have less of that ‘GPT smell’ and are easier to read, closer to the way Claude or Gemini tends to explain things,” wrote another.

“The main problem is still there: the model doesn’t truly reason, verify itself, and catch its own mistakes consistently. It often misses obvious errors, ignores contradictions, loses important details, and only fixes what you directly point out,” yet another user added.
