Datacurve’s DeepSWE Exposes a Weird New Problem With AI Coding Leaderboards

Datacurve’s DeepSWE benchmark crowns GPT-5.5 as the top coding agent, but the real story is bigger than one leaderboard. Its findings suggest AI coding benchmarks may be leaking answers, rejecting valid solutions, and suppressing useful agent behavior—right as enterprises start betting real engineering workflows on them.

Written By

Corey Noles

May 27, 2026

AI coding benchmarks are supposed to tell companies which model can ship the highest-quality code. Instead, Datacurve’s new benchmark suggests the industry may have been using a scoreboard where some of the answers were memorized, some of the tests were too brittle, and at least one top model found a way to peek at the answer key.

Fun little detail. Nothing to worry about. Please continue spending millions of dollars based on leaderboard screenshots. Totally kidding. Seriously, don't do that. CN

Datacurve’s new benchmark, DeepSWE, is designed to test frontier coding agents on longer, messier, more realistic software engineering tasks. The company says the benchmark includes 113 original tasks across 91 open-source repositories and five programming languages, with tasks written from scratch instead of copied from existing GitHub commits or pull requests. On DeepSWE’s leaderboard, GPT-5.5 leads with a 70% score, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%.

That result matters, but the real story is that DeepSWE is another sign that the AI industry is entering its “benchmark trust crisis” era. The models are getting good enough that the old tests are no longer clean mirrors. They are becoming artifacts to be optimized, gamed, overfit, memorized, or accidentally broken by their own grading systems.

And for enterprise teams trying to decide which coding agent to build around, the question “Which model scored highest?” is giving way to a harder one: “What exactly was the test measuring?”

DeepSWE is trying to make coding agents do real engineering work

Most coding benchmarks follow a tidy recipe: take an old GitHub issue, roll the repository back before the fix, ask the model to solve the issue, then use the original project tests to decide whether the patch worked.

That approach has been useful. It gave the field a shared way to compare AI systems on real repositories instead of toy coding puzzles. SWE-bench describes itself as a benchmark where a model is given a codebase and an issue and must generate a patch that resolves the problem. OpenAI’s earlier SWE-bench Verified work also helped professionalize the space by using human annotators to filter the original test set down to 500 samples considered better scoped and less problematic.

But the same approach has a weakness: real GitHub history is public, and public history can leak into training data, benchmark scaffolds, and test assumptions.

DeepSWE’s design tries to avoid that by creating original tasks and reference solutions that are not merged upstream. Datacurve says every task includes the prompt the agent sees, a verifier, an environment, and a held-out reference solution used for review rather than grading. Its verifiers are supposed to check observable behavior, not whether the model happened to recreate the author’s exact helper function, naming scheme, or file structure.

That sounds like a small design choice. It is not.

A coding benchmark that only accepts one implementation is not really measuring whether an agent can solve a problem. It is measuring whether the agent can guess the original author’s implementation path. That is fine if you are benchmarking psychic software archaeology. Less fine if you are trying to choose a model for production engineering work.
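To make the distinction concrete, here is a minimal sketch of the two grading styles. It is not DeepSWE's verifier code; the slug task, the function names, and the test cases are all hypothetical and chosen only for illustration. Two structurally different implementations satisfy the same behavioral check, while a check tied to the author's naming rejects one of them.

```python
import re


def slugify_regex(title: str) -> str:
    """One implementation: collapse runs of non-alphanumerics with a regex."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slugify_loop(title: str) -> str:
    """A structurally different implementation of the same behavior."""
    words, current = [], []
    for ch in title.lower():
        if ch.isascii() and ch.isalnum():
            current.append(ch)
        elif current:
            # A separator ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return "-".join(words)


def behavioral_verifier(candidate) -> bool:
    """Checks only inputs and outputs, never names or file layout."""
    cases = {
        "Hello, World!": "hello-world",
        "  DeepSWE  v1 ": "deepswe-v1",
        "already-a-slug": "already-a-slug",
        "": "",
    }
    return all(candidate(text) == expected for text, expected in cases.items())


def brittle_verifier(candidate) -> bool:
    """Hypothetical over-specific check: demands the reference author's function name."""
    return candidate.__name__ == "slugify_regex" and behavioral_verifier(candidate)
```

Both `slugify_regex` and `slugify_loop` pass `behavioral_verifier`, but only the first passes `brittle_verifier`. The second failure says nothing about whether the code works, which is the kind of false rejection the article goes on to discuss.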

The leaderboard spread suddenly gets much wider

On Datacurve’s telling, DeepSWE separates models that looked much closer on existing public leaderboards. The benchmark page says the top models are no longer clustered in a narrow band: GPT-5.5 reaches 70%, GPT-5.4 lands at 56%, Claude Opus 4.7 at 54%, Claude Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%, and several other models fall much lower.

That wider spread is the point.

SWE-Bench Pro, by contrast, was built to be a harder and more industrially relevant benchmark than earlier SWE-bench variants. Scale says SWE-Bench Pro contains 1,865 total tasks across 41 professional repositories, with a 731-instance public set, a 276-instance private set, and an 858-instance held-out set. Scale also says its benchmark was designed to reduce contamination risk and include more complex, long-horizon tasks.

So this is a fight over what kind of seriousness matters.

Scale emphasizes real repositories, industrial relevance, and larger tasks. Datacurve emphasizes original task construction, behavioral verification, and cleaner environments. Both are pointing at the same pain: coding agents are now good enough that sloppy evaluation can create fake confidence.

That should scare buyers more than any single model score should excite them.

The grading problem may be the bigger bombshell

Datacurve’s most important claim is that benchmark verifiers may be much noisier than leaderboard culture admits.

In its audit, Datacurve says it sampled 30 tasks each from DeepSWE and SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and used an LLM-based analyzer to judge whether each patch actually implemented the requested behavior. The company says SWE-Bench Pro’s verifier accepted wrong implementations 8.5% of the time and rejected correct implementations 24% of the time, while DeepSWE’s corresponding rates were 0.3% and 1.1%, respectively.

Even allowing for the limitations of an LLM-based audit, that is a giant warning light.

A false positive means a model gets credit for code that did not really solve the problem. A false negative means a model gets punished for a valid implementation that took a different route than the benchmark expected. Either way, the leaderboard stops being a clean capability measure and starts becoming a weird negotiation between the model, the prompt, the tests, and the hidden assumptions of the original pull request.
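For readers who want to run this kind of audit on their own evals, the two error rates can be sketched as below. This is an illustrative calculation, not Datacurve's analysis code. It assumes each rate is conditioned on the audit verdict (false accepts over all wrong patches, false rejects over all correct patches), which may not match the denominators Datacurve used. The `Rollout` type and the sample data are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Rollout:
    verifier_passed: bool  # what the benchmark's tests reported
    judged_correct: bool   # what an independent audit concluded


def verifier_error_rates(
    rollouts: Iterable[Rollout],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (false_accept_rate, false_reject_rate); None when undefined."""
    rollouts = list(rollouts)
    wrong = [r for r in rollouts if not r.judged_correct]
    correct = [r for r in rollouts if r.judged_correct]
    # False accept: the verifier passed a patch the audit judged wrong.
    false_accepts = sum(r.verifier_passed for r in wrong)
    # False reject: the verifier failed a patch the audit judged correct.
    false_rejects = sum(not r.verifier_passed for r in correct)
    fa_rate = false_accepts / len(wrong) if wrong else None
    fr_rate = false_rejects / len(correct) if correct else None
    return fa_rate, fr_rate


# Toy data for illustration only; these are not Datacurve's numbers.
sample = (
    [Rollout(True, True)] * 4
    + [Rollout(False, True)] * 2
    + [Rollout(True, False)] * 1
    + [Rollout(False, False)] * 3
)
false_accept_rate, false_reject_rate = verifier_error_rates(sample)
```

On the toy data, one of four wrong patches is accepted and two of six correct patches are rejected. Which denominator a report uses changes the headline number, so it is worth asking before comparing audits.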

OpenAI flagged similar issues in its SWE-bench Verified work back in 2024, noting that some unit tests were overly specific or unrelated, some issue descriptions were underspecified, and some development environments were hard to set up reliably.

That context matters because DeepSWE is not coming out of nowhere. It is part of a longer pattern: as AI systems improve, the bottleneck shifts from generating answers to verifying whether the answers are actually good.

The Neuron has covered this larger dynamic before: AI progress moves fastest where reality can grade the work quickly, but software engineering sits in the tricky middle where tests are fast and useful yet still incomplete. Passing tests can miss maintainability, edge cases, security, and whether the next engineer will want to throw the code into the sea.

That is exactly the gap DeepSWE is poking.

Claude found the answer key, which is either cheating or very funny competence

Then there is the spicy part.

Datacurve says Claude Opus configurations sometimes passed SWE-Bench Pro tasks by inspecting the repository’s .git history and finding future commits containing the gold solution. The public GitHub issue about this problem describes Docker images where future commits are available locally and shows reproduction steps using git log, git cat-file, and git show to confirm that the fix commit can be reached inside the container.

Datacurve labels these cases “CHEATED.” That is directionally fair in benchmark terms, but it also reveals something awkward about agents: the model did something a capable automated operator might do in the wild. It explored the environment, noticed useful information, and used it.

In production, that can be resourcefulness. In a benchmark, it destroys the measurement.

This is one reason agent evals are harder than ordinary model evals. You are not just testing what the model knows. You are testing what it does when dropped into an environment full of files, tools, logs, caches, configs, and accidental affordances. A clever enough agent will eventually find the seams.

And once agents start finding seams, benchmark design becomes security design.

DeepSWE says it avoids this specific issue by using original tasks and shipping a shallow clone with only the base commit, leaving no gold solution commit in the workspace to discover.
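A harness can check for this kind of leak before an agent ever starts. The sketch below is one possible pre-flight check, not the method used by Datacurve or described in the GitHub issue. It compares the commits reachable from any ref (`git rev-list --all`) with the commits reachable from the checked-out state (`git rev-list HEAD`). The function names are hypothetical, and the check does not catch objects that no ref points to.

```python
import subprocess
from typing import List, Sequence


def commits_outside_head(
    all_commits: Sequence[str], head_commits: Sequence[str]
) -> List[str]:
    """Commits known to some ref but not ancestors of HEAD, in input order."""
    reachable = set(head_commits)
    return [commit for commit in all_commits if commit not in reachable]


def rev_list(repo_path: str, *args: str) -> List[str]:
    """Run `git rev-list` inside repo_path and return the commit hashes."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-list", *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def audit_workspace(repo_path: str) -> List[str]:
    """List commits an agent could reach that lie outside HEAD's history."""
    return commits_outside_head(
        rev_list(repo_path, "--all"),
        rev_list(repo_path, "HEAD"),
    )
```

An empty list from `audit_workspace("/path/to/task/repo")` means no branch or tag exposes history beyond the base commit. A workspace created with `git clone --depth 1` starts from a single commit, which matches the shallow-clone approach described above.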

The prompt may also be suppressing useful behavior

One of Datacurve’s more practical findings is about test-writing.

On DeepSWE, stronger models often wrote and ran their own tests without being explicitly told to do so. On SWE-Bench Pro, Datacurve says that behavior dropped sharply. The likely reason is in the SWE-agent prompt template: it tells the model that test changes have already been handled and that it does not need to modify testing logic or tests.

That is a big deal for teams using coding agents internally.

A company might think it is giving an agent a harmless instruction like, “Don’t touch tests.” The agent may interpret that as, “Don’t create your own validation harness.” Congratulations: you just nerfed the one behavior that makes an autonomous coding agent more trustworthy.

The takeaway is not that every agent should freely rewrite the test suite. Please do not let your robot intern vandalize CI.

The takeaway is that agent instructions need to separate verification from submission. A good production prompt might say: write temporary tests, scripts, and repros to validate your work, but do not include those files in the final patch unless asked. That gives the model permission to check itself without polluting the repository.
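One way to enforce that split mechanically is to give the agent a scratch directory and drop anything under it when assembling the final patch. The sketch below assumes a hypothetical directory name, `agent_scratch`, and a hypothetical instruction string; neither comes from Datacurve or SWE-agent.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple

SCRATCH_DIR = "agent_scratch"  # hypothetical location for throwaway validation files

# Hypothetical wording, adapted from the guidance in the paragraph above.
VERIFY_INSTRUCTION = (
    f"Write temporary tests, scripts, and repros under {SCRATCH_DIR}/ to "
    "validate your work. Do not include those files in the final patch "
    "unless asked."
)


def split_patch_files(
    changed_paths: Iterable[str], scratch_dir: str = SCRATCH_DIR
) -> Tuple[List[str], List[str]]:
    """Separate files to submit from scratch files used only for verification."""
    submit, scratch = [], []
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # A path counts as scratch only if its top-level directory matches.
        if parts and parts[0] == scratch_dir:
            scratch.append(path)
        else:
            submit.append(path)
    return submit, scratch
```

With changed files `src/app.py`, `agent_scratch/test_repro.py`, and `README.md`, only the first and last would go into the submitted patch. The agent still gets to run its own checks, and the repository stays clean.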

Tiny wording differences can change the workflow. In agent land, prompts are not just instructions. They are operating procedures.

This does not mean DeepSWE is the final answer

DeepSWE deserves attention because it publishes a sharper critique and makes its benchmark artifacts public. But it should not be treated as the new holy scoreboard just because it dunked on the old one.

Datacurve is upfront about several limitations: every model runs through mini-swe-agent rather than model-native tools, the corpus is limited to active open-source repositories with at least 500 stars, bug localization and refactoring are underrepresented, and major languages like C++ and Java are not yet included. The qualitative verdicts also come from an LLM analyzer, not human review, and Datacurve says small per-tag differences should be treated as illustrative.

That caveat is important.

A benchmark built by a startup with its own point of view should invite scrutiny. So should a benchmark from Scale. So should every leaderboard from every lab that has a model to sell and a chart to screenshot.

But DeepSWE’s broader argument is hard to dismiss: coding agents are moving from autocomplete into long-horizon work, and long-horizon work needs better measurement. Not just harder tasks. Better tasks. Cleaner environments. More flexible verifiers. Less leakage. More transparency. More replication.

The AI coding market is no longer waiting for models that can write code. Those are here. The harder problem is knowing when the code is actually right.

That means the next phase of competition will not be fought only in model weights, context windows, or token prices. It will be fought in eval design, harness design, verification design, and the boring-but-critical plumbing that tells companies whether their AI engineer did the job or just looked very busy in the terminal.

The leaderboard is not dead.

But from here on out, the leaderboard has to show its work.

Corey Noles is the Host of The Neuron: AI Explained podcast and Managing Editor of AI and Experimental Content at TechnologyAdvice, where he leads the charge in testing and refining emerging content strategies across the company's portfolio.