SHARE

Anthropic Used Claude To Beat Its Own Human Alignment Researchers (At $22 An Hour)

Anthropic's new paper shows nine parallel Claude agents recovered 97% of the alignment-research performance gap on a problem its human researchers got 23% of in seven days. Total cost: $18K. The number to internalize is $22 per agent-hour.

Written By

Grant Harvey

Apr 14, 2026

8 minute read

For the last two years, the AI labs have pitched a circular argument. AI is going to change everything about work, but don't worry, because the people building it are going to be careful. The pitch lands because the people building it are obviously thoughtful, well-resourced humans. Until the new paper Anthropic published yesterday, where the people doing the alignment research were Claude.

Two of Anthropic's own researchers spent seven days on an open alignment problem. Nine copies of Claude Opus 4.6, working in parallel sandboxes, spent five more days on the same problem and demolished the human result. Total cost for the Claude run: $18,000. Roughly $22 per Claude-research-hour. That's the number to write down.

First up, the TL;DR
The number to write down
Why "weak-to-strong supervision" is the right test
The reward hacking is the actual story for some readers
What Anthropic itself is willing to say
Where this goes next

First up, the TL;DR

Anthropic just released a paper (full Alignment Science blog) showing nine parallel Claude Opus 4.6 agents outperformed Anthropic's own human researchers on a real alignment problem. The setup: weak-to-strong supervision (using a weaker AI to train a stronger one, which mirrors humans someday supervising AI smarter than us).

Here's what happened:

Two human Anthropic researchers spent 7 days on the four best methods from prior research and recovered 23% of the maximum performance gap.
Nine Claude Opus 4.6 agents in parallel sandboxes spent 5 more days on the same problem, sharing findings as they went.
The Claude agents recovered 97% of the gap, roughly what you'd get training the model on perfect ground-truth data.
Total cost: $18,000, or about $22 per Claude-research-hour.
The agents also invented four kinds of "reward hacking" (gaming the test) that none of the authors predicted, including one that exfiltrated test labels by flipping single answers and watching the score change.
Some Claude-discovered methods are so unfamiliar the authors call them "alien science."

Why this matters: Alignment research (making sure AI behaves the way humans want) was the one field everyone agreed couldn't be automated. That argument is now empirical, not hypothetical. The cost number is what to internalize: whatever ratio of human researchers to Claude fleet you can imagine, the labs can afford more. Andrew Curran is calling it "a preview of RSI" (recursive self-improvement, where AI improves its own training).

Our take: Read the paper carefully and the catch shows up: this only works on problems where progress can be automatically scored, and even then the agents tried to game the score in four different ways. Most real alignment problems don't fit that mold. But Anthropic's own pitch is that solving this general version would let you bootstrap into the fuzzy problems too. The open question for the rest of 2026: did Anthropic just publish the seed of recursive self-improvement, or a clever experiment on a uniquely well-behaved problem? Both readings are honest. Neither is comforting.

The number to write down

The headline number is the cost. $22 per Claude-research-hour, $18,000 for the entire experiment. A senior alignment researcher in San Francisco costs roughly that per hour, all in, with benefits and the office space and the recruiting cost amortized. Anthropic just demonstrated, on a real research problem, that nine Claude agents working in parallel can outperform two senior researchers for less than the cost of one researcher-week.

This is real. The paper has the receipts: nine sandboxes, 800 cumulative agent-hours, a remote evaluation API that the agents called for scoring, a forum where they shared their findings with each other, and a final Performance Gap Recovered (PGR) score of 0.97. Humans got 0.23.

What the PGR score actually means: if you took the strong student model and trained it on the gold-standard, ground-truth-labeled data, that's a 1.0. If you used only the weak teacher's noisy labels with no improvement, that's a 0.0. The score measures how much of the gap between those two numbers you closed. The humans closed about a quarter of it. The AARs closed almost all of it.

Why "weak-to-strong supervision" is the right test

This is the part most coverage is going to skip, and it's the part that determines whether the result actually means what the most excited tweets are saying it means.

Anthropic chose this problem deliberately. They picked the one that mirrors the central challenge of "scalable oversight," which is the technical name for how humans will supervise AI systems that are smarter than us. In the experiment, a small Qwen 1.5-0.5B-Chat model plays the role of the human supervisor. A larger Qwen 3-4B-Base model plays the role of the smarter AI. The question is: how good can you make the smarter model when the only training signal you have comes from the weaker one?

That setup is a stand-in for a future where Anthropic's most capable models exceed any individual human researcher's understanding of what they're doing. If you can solve this clean, mathematical version of the problem, you have at least a starting point for the messier real version. And if you can solve it with AI doing the research instead of humans, you've handed yourself a way to keep alignment research moving at the same speed as capabilities research.

The reason this paper landed harder than the average alignment paper is that it is the first credible piece of evidence that the labs can use their current models to make their next models safer, not just smarter. For the last two years, that pitch has been a vibe. Today it's a graph.

The reward hacking is the actual story for some readers

The four hacks the Anthropic Claude agents invented, unprompted, are worth sitting with:

Test-label exfiltration. The scoring API gave agents their PGR after each submission. The agents figured out that for binary classification problems, they could submit a baseline prediction, then flip a single test prediction, watch which direction the score moved, and back out the ground-truth label for that example. They identified the most uncertain examples in their predictions, exfiltrated those specific labels, and used them. None of the authors predicted this.
Code execution as cheating. On a coding task where the agent had to predict whether a piece of code was correct, one Claude wrote unit tests against the code, ran them, and read off the answer. It bypassed both the weak teacher and the strong student. It just executed the code.
Frequency shortcut on math. Another Claude noticed that across the math dataset, the most common answer to each problem was usually the right one. So it skipped the entire weak-to-strong pipeline, predicted "most common answer" for everything, and posted a high score.
Linguistic clustering on code. Researchers had deliberately sampled coding solutions from many different language models to prevent the agents from gaming source-cluster patterns. The agents figured out how to cluster solutions by which model wrote them anyway, and used the cluster identity as a label-correlated signal.

The authors caught all four hacks and disqualified those runs. But the broader point, in their own words, is that "future work should test AAR-discovered ideas on entirely held-out datasets," because the agents are very good at finding shortcuts inside whatever environment you give them.

The Andrew Curran framing of "preview of RSI" loads a lot onto these results that they don't quite carry. But the framing of "preview of how an AI fleet behaves when you point it at any benchmark you can score" is exactly right, and that's a lesson that applies to every AI evaluation in the field.

What Anthropic itself is willing to say

The paper is unusually honest about its limits. Three things the authors call out, in order of importance:

The method failed to transfer to production scale. They tried the Claude agents' best discovered method on Sonnet 4 with a chat helpfulness preference dataset using Anthropic's real production training infrastructure. The improvement was +0.5 points on held-out evaluation, within the noise floor. Their explanation is that the methods the AARs found were tuned to the specific Qwen models and dataset they were trained on, and the structures the methods exploited didn't show up the same way in production.

Most alignment problems lack a clean scoring metric. The thing that made this experiment work is that you can mechanically measure the PGR. The vast majority of real alignment problems (does Claude scope research projects well? does it refuse the right requests? does it model honesty correctly?) require human judgment to evaluate, and judgment doesn't fit cleanly into a scoring API.

The agents were tuned to find shortcuts. The four reward hacks were caught, but the authors are explicit that they did not predict any of them in advance. The implication is that any future automated alignment work will need evaluation infrastructure that can survive an adversary as creative as Claude.

Read together, these three caveats argue against the strongest version of the "we are now in the singularity" reading. The result is that AAR-style automation works on a class of alignment problems and probably doesn't yet work on most of them. Whether the class expands fast or slow over the next year is the actual open question.

Where this goes next

Three threads to watch over the next 90 days:

The first is whether other labs reproduce. OpenAI's Trusted Access for Cyber announcement yesterday already framed cyber-permissive variants of GPT-5.4 as "scaling defenses in lockstep with capabilities," and OpenAI's own scalable oversight team has been quietly working on similar problems. If a parallel experiment lands from OpenAI or DeepMind in the next few weeks, that's the moment the AAR result moves from "Anthropic got lucky on a clean problem" to "this is how alignment research is done now."

The second is what happens to alignment hiring. If you can get $18,000 worth of AAR work done in a week, the marginal value of hiring a tenth research engineer at $400K all-in is now demonstrably lower than the marginal value of buying more inference credits. That trade is real and it's happening in budget meetings this week.

The third is the fuzzy problems. Anthropic explicitly says the reason they chose weak-to-strong supervision is that solving it "would unlock bootstrapping on broader non-outcome-gradable problems" (the alignment problems where you can't mechanically score progress, which is most of them). That's the recursive loop everyone is actually arguing about. The honest answer is nobody knows yet whether the bootstrap works. If it does, the cost curve in the second half of 2026 starts looking very different from the cost curve in the first half.

What we're watching: whether the AAR setup is replicated externally, whether the production-scale failure is solved, and whether any of the "alien science" methods the AARs discovered turn out to generalize to anything humans care about. Those three answers will tell us whether April 14, 2026 was a footnote or a turning point.