DeepSeek is back: China’s AI claims to surpass ChatGPT and Gemini in key benchmarks

Chinese AI startup DeepSeek has officially released preview versions of its highly anticipated DeepSeek-V4 models. The update comes more than a year after the company's R1 and V3 models went viral and challenged the notion of US supremacy in the AI race.

The latest model from DeepSeek comes with significant architectural upgrades, multiple reasoning modes, and a massive one-million-token context window.

DeepSeek’s new AI model:

The new DeepSeek-V4 series is split into a Pro model and a Flash model. The flagship DeepSeek-V4-Pro has 1.6 trillion total parameters, while the V4-Flash is a smaller model with 284 billion parameters.

Both models support an ultra-long context length of one million tokens (approximately 750,000 words).

The new DeepSeek-V4 models come in three reasoning modes: Non-think, Think High, and Think Max. DeepSeek says the Non-think mode is aimed at daily tasks and low-risk decisions, while Think High is for questions that require complex problem-solving and planning. Meanwhile, Think Max is for handling the hardest coding and math problems.

On a Hugging Face page for the model, DeepSeek says that the V4 Pro Max and V4 Pro “significantly advance the knowledge capabilities of open-source models, firmly establishing [them] as the best open-source model available today.” It adds that the model achieves top-tier performance in coding benchmarks and significantly bridges the gap with leading closed-source models on reasoning and agentic tasks.

DeepSeek vs ChatGPT vs Gemini vs Claude:

DeepSeek also revealed benchmark data for its new model against existing models from rivals such as OpenAI’s GPT-5.4, Anthropic’s Claude Opus 4.6, and Google’s Gemini 3.1 Pro.

DeepSeek-V4-Pro-Max leads in coding and mathematical performance, topping the Apex Shortlist, a benchmark focused on high-difficulty reasoning and problem-solving, with a score of 90.2%. It also achieves a Codeforces rating of 3206, indicating strong competitive programming ability, and scores 80.6% on SWE Verified, a benchmark that evaluates performance on practical software engineering tasks, within 0.2 percentage points of the top score listed.

However, the model lags behind its American counterparts in general knowledge and broader reasoning. Gemini 3.1 Pro leads on SimpleQA-Verified, a benchmark designed to test factual accuracy and question answering, while GPT-5.4 ranks highest on Terminal Bench 2.0, which measures how effectively models can use tools and operate in agent-like environments.

DeepSeek says the V4-Pro-Max achieves these results while being far more efficient, using nearly 10 times less memory than its V3.2 model when handling long inputs.

Benchmark (Category) | DeepSeek-V4-Pro Max | GPT-5.4 xHigh | Claude Opus 4.6 Max | Gemini 3.1 Pro High
Codeforces Rating (Coding) | 3206 | 3168 | 3052
Apex Shortlist (Math/Coding) | 90.2% | 78.1% | 85.9% | 89.1%
SWE Verified (Agentic Coding) | 80.6% | 80.8% | 80.6%
MMLU-Pro (Knowledge) | 87.5% | 87.5% | 89.1% | 91.0%
SimpleQA-Verified (Accuracy) | 57.9% | 45.3% | 46.2% | 75.6%
GPQA Diamond (Reasoning) | 90.1% | 93.0% | 91.3% | 94.3%
Terminal Bench 2.0 (Agentic) | 67.9% | 75.1% | 65.4% | 68.5%
Toolathlon (Tool Use) | 51.8% | 54.6% | 47.2% | 48.8%
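
As a rough check on the comparison above, the following minimal Python sketch picks out the top-scoring model on each benchmark. It uses only the six rows of the table that list a score for all four models; the variable and function names (MODELS, SCORES, leaders) are illustrative assumptions and do not come from DeepSeek.

```python
# Column order as given in the table header.
MODELS = [
    "DeepSeek-V4-Pro Max",
    "GPT-5.4 xHigh",
    "Claude Opus 4.6 Max",
    "Gemini 3.1 Pro High",
]

# Scores (in percent) copied from the six table rows that list all four models.
# The Codeforces and SWE Verified rows are left out because each shows only
# three values and the table does not say which model is missing.
SCORES = {
    "Apex Shortlist": [90.2, 78.1, 85.9, 89.1],
    "MMLU-Pro": [87.5, 87.5, 89.1, 91.0],
    "SimpleQA-Verified": [57.9, 45.3, 46.2, 75.6],
    "GPQA Diamond": [90.1, 93.0, 91.3, 94.3],
    "Terminal Bench 2.0": [67.9, 75.1, 65.4, 68.5],
    "Toolathlon": [51.8, 54.6, 47.2, 48.8],
}


def leaders(scores):
    """Map each benchmark to the list of models sharing its top score."""
    result = {}
    for benchmark, row in scores.items():
        best = max(row)
        result[benchmark] = [m for m, s in zip(MODELS, row) if s == best]
    return result


if __name__ == "__main__":
    for benchmark, top in leaders(SCORES).items():
        print(f"{benchmark}: {', '.join(top)}")
```

On these six rows, DeepSeek-V4-Pro Max leads only the Apex Shortlist, Gemini 3.1 Pro High leads the three knowledge and reasoning benchmarks, and GPT-5.4 xHigh leads the two agentic and tool-use benchmarks.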

Notably, DeepSeek’s new model launch comes just hours after OpenAI launched its latest GPT-5.5 model, which is seen as the company’s answer to Claude’s dominance in the coding world. DeepSeek’s surge in popularity early last year led to a trillion-dollar stock market selloff because its open-source AI model was built at a fraction of the cost of its American rivals.
