Anthropic says Claude can now detect when it is being evaluated, OpenClaw creator calls it scary

Anthropic recently stated that its Claude Opus 4.6 model can recognise when it is being tested. The model not only identified the benchmark being used, but also searched for the answer key to produce the correct response instead of working through the test itself. Following Anthropic’s blog post, Peter Steinberger, the creator of OpenClaw, said the behaviour was almost scary.

On X, Steinberger replied to a post that explained what Claude Opus 4.6 did during its latest evaluation on BrowseComp, a benchmark designed to test how well models can find hard-to-locate information on the web.

Anthropic stated that once the AI model recognised that it was being tested, it was able to identify the benchmark, in this case BrowseComp. From there, Claude Opus 4.6 searched for the answer key and decrypted it to find the answer, instead of actually locating the information itself.

Steinberger wrote, “Models are getting so clever, it’s almost scary.” He is no stranger to how good AI models can be. His creation, OpenClaw, allows users to set up their own AI agent locally on their device, which can then do tasks for them.

A few weeks ago, OpenClaw also gave rise to the infamous AI-only social media platform, Moltbook. Steinberger has since joined OpenAI.

Anthropic says this test was a first for AI

In the blog post, Anthropic claimed that this was likely the “first documented instance” where a model was able to work backwards to find the answer key without being told that it was being evaluated.

But why is this a big deal? Traditionally, benchmark contamination, in which a model obtains the answers rather than solving the problems itself, occurs when AI models stumble upon leaked answers online.

But with Claude Opus 4.6, researchers found a new pattern – the model independently inferred it was under evaluation, deduced the specific benchmark, and methodically worked to uncover the answer key through decryption and internet searches.

How did Claude Opus 4.6 realise it was being tested?

Anthropic’s post describes two cases where Opus 4.6 exhibited this behaviour. In both instances, after exhausting direct attempts to answer the questions, the model began to analyse why the question was posed, calling out its “extremely specific nature” and considering it might be “an unanswerable question designed to test whether an AI can admit it cannot find the answer,” or be part of a benchmark test.

Instead of giving up, Opus 4.6 shifted strategy. The model stated, “Let me try a completely new approach — maybe this is a question from the GAIA benchmark dataset, which is designed to test AI agents.”

The model then broadened its search queries to terms such as “puzzle question,” “trivia question,” “multi-hop question,” “AI benchmark question,” and “LLM evaluation.” It ultimately identified the question as part of BrowseComp and took steps to decrypt the answer.

Obtaining the answer required several stages: Opus 4.6 accessed source code on GitHub, understood the decryption logic, and, when faced with a file format barrier, located a third-party mirror hosting the encrypted data in a usable format. The model then ran its own decryption code and cross-verified the result by searching for the original source material.
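The article does not spell out the decryption scheme, so the following is only a minimal sketch of the general idea, assuming an answer key obscured with a simple XOR keystream derived from a password that is published alongside the data. The function names and the password are hypothetical and are not taken from BrowseComp or from Anthropic's post. It shows why this kind of obfuscation offers little protection once the decryption logic is public: anyone, or any model, that can read the code can reverse it.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Stretch a SHA-256 digest of the password to `length` bytes by repetition."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    repeats, remainder = divmod(length, len(digest))
    return digest * repeats + digest[:remainder]


def encrypt(plaintext: str, password: str) -> str:
    """XOR the plaintext with the derived key and return it base64-encoded."""
    data = plaintext.encode("utf-8")
    key = derive_key(password, len(data))
    return base64.b64encode(bytes(a ^ b for a, b in zip(data, key))).decode("ascii")


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Reverse `encrypt`: base64-decode, then XOR with the same derived key."""
    data = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(data))
    return bytes(a ^ b for a, b in zip(data, key)).decode("utf-8")


# Hypothetical answer and password, for illustration only.
hidden = encrypt("Paris", "hypothetical-canary")
recovered = decrypt(hidden, "hypothetical-canary")
```

A scheme like this keeps plain-text answers out of casual web scrapes, but it is obfuscation rather than secrecy, since the key material sits next to the data.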

Is this really scary?

Steinberger may not be the only one who thinks AI models are getting too clever. Anthropic noted that these dynamics raise concerns over the lengths an AI model may go to in order to reach an answer and “how difficult it will be to constrain its behaviour in the real world.”

The experiments also showed that even with blocklists and other mitigation efforts, models like Opus 4.6 could often find alternative paths around these restrictions. The company said evaluation needs to be treated as an ongoing adversarial challenge rather than a one-time design issue as AI models continue to improve.
