DeepSeek is back: China’s AI claims to surpass ChatGPT and Gemini in key benchmarks

Chinese AI startup DeepSeek has officially released preview versions of its highly anticipated DeepSeek-V4 models. The update comes more than a year after the company’s R1 and V3 models went viral and upended assumptions of US supremacy in the AI race.

The latest model from DeepSeek comes with significant architectural upgrades, multiple reasoning modes, and a massive one-million-token context window.

DeepSeek’s new AI model:

The new DeepSeek-V4 series comes in two variants, Pro and Flash. The flagship DeepSeek-V4-Pro has 1.6 trillion total parameters, while the smaller V4-Flash has 284 billion parameters.

Both models support an ultra-long context length of one million tokens (approximately 750,000 words).
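As a rough illustration of that conversion, the figures above imply about 0.75 words per token. The sketch below uses that ratio to estimate whether a text of a given length would fit in the window. The ratio is only the article's approximation and varies by tokenizer and language, and the function names are hypothetical, not part of any DeepSeek tool.

```python
# Back-of-the-envelope estimate based on the article's figures:
# 1,000,000 tokens ~ 750,000 words, i.e. about 0.75 words per token.
# This ratio is an assumption; real token counts depend on the tokenizer.
CONTEXT_WINDOW_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75


def estimated_tokens(word_count):
    """Estimate how many tokens a text of `word_count` words would use."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count, window=CONTEXT_WINDOW_TOKENS):
    """Return True if the estimated token count fits within the window."""
    return estimated_tokens(word_count) <= window


if __name__ == "__main__":
    for words in (100_000, 750_000, 800_000):
        print(words, estimated_tokens(words), fits_in_context(words))
```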

The new DeepSeek-V4 models come in three reasoning modes: Non-think, Think High, and Think Max. DeepSeek says the Non-think mode is aimed at daily tasks and low-risk decisions, while Think High is for questions that require complex problem-solving and planning. Meanwhile, Think Max is for handling the hardest coding and math problems.

On a Hugging Face page for the model, DeepSeek says that the V4 Pro Max and V4 Pro “significantly advance the knowledge capabilities of open-source models, firmly establishing [them] as the best open-source model available today.” It adds that the model achieves top-tier performance in coding benchmarks and significantly bridges the gap with leading closed-source models on reasoning and agentic tasks.

DeepSeek vs ChatGPT vs Gemini vs Claude:

DeepSeek also published benchmark data comparing its new model with rival models: OpenAI’s GPT-5.4, Anthropic’s Claude Opus 4.6, and Google’s Gemini 3.1 Pro.

DeepSeek-V4-Pro-Max leads in coding and mathematical performance, topping the Apex Shortlist, a benchmark focused on high-difficulty reasoning and problem-solving, with a score of 90.2%. It also achieves a Codeforces rating of 3206, the highest in the comparison, which points to strong competitive programming ability. On SWE Verified, a benchmark that evaluates performance on practical software engineering tasks, it scores 80.6%, level with Gemini 3.1 Pro and just behind Claude Opus 4.6 at 80.8%.

However, the model lags behind its American counterparts in general knowledge and broader reasoning. Gemini 3.1 Pro leads on SimpleQA-Verified, a benchmark designed to test factual accuracy and question answering, while GPT-5.4 ranks highest on Terminal Bench 2.0, which measures how effectively models can use tools and operate in agent-like environments.

DeepSeek says the V4-Pro-Max achieves these results while being far more efficient, using nearly 10 times less memory than its V3.2 model when handling long inputs.

Benchmark (Category)	DeepSeek-V4-Pro Max	GPT-5.4 xHigh	Claude Opus 4.6 Max	Gemini 3.1 Pro High
Codeforces Rating (Coding)	3206	3168	–	3052
Apex Shortlist (Math/Coding)	90.2%	78.1%	85.9%	89.1%
SWE Verified (Agentic Coding)	80.6%	–	80.8%	80.6%
MMLU-Pro (Knowledge)	87.5%	87.5%	89.1%	91.0%
SimpleQA-Verified (Accuracy)	57.9%	45.3%	46.2%	75.6%
GPQA Diamond (Reasoning)	90.1%	93.0%	91.3%	94.3%
Terminal Bench 2.0 (Agentic)	67.9%	75.1%	65.4%	68.5%
Toolathlon (Tool Use)	51.8%	54.6%	47.2%	48.8%
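For readers who want to check the rankings themselves, here is a minimal sketch that picks the top score on each benchmark. The numbers are copied from the table above, with unreported scores marked as None. The variable and function names are illustrative only.

```python
# Scores copied from the table above; None marks a score that was not reported.
MODELS = [
    "DeepSeek-V4-Pro Max",
    "GPT-5.4 xHigh",
    "Claude Opus 4.6 Max",
    "Gemini 3.1 Pro High",
]

SCORES = {
    "Codeforces Rating": [3206, 3168, None, 3052],
    "Apex Shortlist": [90.2, 78.1, 85.9, 89.1],
    "SWE Verified": [80.6, None, 80.8, 80.6],
    "MMLU-Pro": [87.5, 87.5, 89.1, 91.0],
    "SimpleQA-Verified": [57.9, 45.3, 46.2, 75.6],
    "GPQA Diamond": [90.1, 93.0, 91.3, 94.3],
    "Terminal Bench 2.0": [67.9, 75.1, 65.4, 68.5],
    "Toolathlon": [51.8, 54.6, 47.2, 48.8],
}


def leaders(scores, models):
    """Map each benchmark to the model(s) with the highest reported score."""
    result = {}
    for benchmark, row in scores.items():
        best = max(s for s in row if s is not None)
        # Keep every model that matches the best score, so ties are visible.
        result[benchmark] = [m for m, s in zip(models, row) if s == best]
    return result


if __name__ == "__main__":
    for benchmark, names in leaders(SCORES, MODELS).items():
        print(f"{benchmark}: {', '.join(names)}")
```

Higher is better on every row here, so a simple maximum is enough. A benchmark where lower is better would need the comparison reversed.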

Notably, DeepSeek’s launch comes just hours after OpenAI launched its latest GPT-5.5 model, which is seen as the company’s answer to Claude’s dominance in coding. DeepSeek’s surge in popularity early last year triggered a trillion-dollar stock market selloff, as its open-source AI model had been built at a fraction of the cost of its American rivals.
