Claude Opus 4.5 Outperforms ChatGPT and Gemini in AI Benchmark

Anthropic has launched Claude Opus 4.5, positioning it as the world’s most advanced AI model for coding, autonomous agents, and computer-use tasks. The release directly challenges OpenAI’s GPT-5.1 and Google’s Gemini 3, with higher reported benchmark scores on real-world software engineering and agentic tasks.

Key Takeaways

Claude Opus 4.5 achieves 80.9% accuracy on SWE-bench Verified, making it the first model to surpass the 80% threshold
Outperforms both Google Gemini 3 Pro (76.2%) and OpenAI GPT-5.1 Codex Max (77.9%)
Demonstrates enhanced safety with improved resistance to prompt injection attacks
Available through the Claude apps, website, and APIs, with premium access starting at about $20/month

Breakthrough Performance in Software Engineering

The core of Claude Opus 4.5’s advancement lies in its performance on SWE-bench Verified, which simulates real-world software engineering challenges. With a score of 80.9%, it becomes the first model to cross the 80% threshold, ahead of Google Gemini 3 Pro at 76.2% and OpenAI GPT-5.1 Codex Max at 77.9%.

This is more than an incremental upgrade: it marks a milestone in AI’s ability to accelerate code generation and debugging. The model could automate routine tasks that previously required hours of human effort.

Outperforming Human Candidates

In a proprietary two-hour take-home exam designed for engineering hires, Claude Opus 4.5 outperformed even top human candidates on technical skill and judgment under time pressure.

“The take-home test is designed to assess technical ability and judgment under time pressure,” Anthropic noted. “It doesn’t test for other crucial skills candidates may possess, like collaboration, communication, or the instincts that develop over the years. But this result—where an AI model outperforms strong candidates on important technical skills—raises questions about how AI will change engineering as a profession.”

Advanced Agentic Capabilities

For agentic AI systems, which complete multi-step tasks independently, Opus 4.5 also performs strongly on the τ2-bench evaluation. In a simulated airline customer-service scenario, the model solved a distressed customer’s problem creatively: it first upgraded the cabin class, which then allowed it to modify the flights legitimately.

This approach resolved a request that competing models might rigidly refuse, since it involved changing a basic economy booking. It illustrates the kind of reasoning and adaptability suited to customer support, virtual assistants, and automated workflows.

Enhanced Safety Features

Safety remains central to Anthropic’s approach, with Opus 4.5 described as the company’s most robustly aligned model yet. It shows significant improvements in resisting prompt injection attacks, which are deceptive inputs designed to trick AI models into harmful actions.

“With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behaviour,” the firm stated. “Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry.”

Availability and Pricing

Claude Opus 4.5 is rolling out through the Claude app on Android and iOS, the Claude website, and directly to developers via APIs. Premium access for enterprise users starts at approximately $20 per month, consistent with previous Opus versions. Free tiers with limited usage will be available to attract individual creators and hobbyists.
