Everything Claude Opus 4.7: Vision, Pricing, Rate Limits.

Claude Opus 4.7 Dropped Today. Here's Whether It's Worth the Switch (and the Hidden Rate-Limit Tax Everybody's Talking About).

Anthropic's new flagship is a real upgrade on coding and vision, but the same "sticker price" as Opus 4.6 hides a tokenizer change and an always-on thinking mode that could quietly eat your weekly rate limits. Here's what to actually watch for.

Written By
Grant Harvey
Grant Harvey
Apr 17, 2026
27 minute read

Anthropic just shipped Claude Opus 4.7, and if your X feed looks like ours, it's already drowning in benchmark screenshots and "the race is over" hot takes. We want to zoom in on three things most of the coverage is skipping. First, the visual reasoning leap, which is huge and personally the change we're most excited about. Second, the new adaptive thinking mode, which is great in theory but confusing in practice. Third, the pricing question that actually matters: not "what's the sticker price" (it hasn't changed), but "why did my friend get instantly rate-limited this morning."

Let's get into it.

First up, the TL;DR

Here's what happened:

  • Anthropic released Claude Opus 4.7 today across Claude.ai, the API, Amazon Bedrock, Google Cloud's Vertex AI, and Microsoft Foundry. Pricing is identical to Opus 4.6: $5 per million input tokens, $25 per million output tokens (per the pricing page, we checked).
  • Visual reasoning jumped from 69.1% to 82.1%, the biggest leap of any benchmark on the scorecard. The model can now process images at up to 2,576 pixels on the long edge (~3.75 megapixels), more than 3x what any previous Claude model could handle.
  • Software engineering went up ~10 points on SWE-bench Pro (from 53.4% to 64.3%), with state-of-the-art scores on the Finance Agent eval and GDPval-AA, a third-party test of real-world professional work.
  • Instruction-following got stricter. Prompts written for older models can now produce unexpected results because Opus 4.7 takes them more literally instead of loosely interpreting.
  • Adaptive thinking is always on. Unlike Opus 4.6, where you could toggle between "extended thinking always" and "adaptive thinking optional," Opus 4.7 has only one mode: adaptive. To force more thinking, you push the effort level up (new xhigh level sits between high and max).
  • New tokenizer uses up to 35% more tokens for the same text. Anthropic's own pricing docs say so. Combine that with Claude Code defaulting to xhigh effort on Opus 4.7 (up from medium or high on 4.6), and the sticker price hides a real bill increase.

Why this matters: Most AI launches only matter if you were already planning to switch. This one matters even if you weren't. If you're on Pro or Max, Claude.ai will auto-upgrade you to 4.7 on April 23 (per the Claude Code docs), and your weekly usage bucket is about to drain faster. One friend of ours already hit their limit within hours of the release this morning.

The model is better. It's also quietly more expensive, even though "per token" nothing changed.

Our take: If visual reasoning has been your pain point with Claude (like it has been for us), especially getting it to actually see interfaces it builds for you, this upgrade is worth it just for that. Everything else is incremental. The rate-limit story is the one that everybody can't stop talking about with Anthropic, and it's the one you should care about most, so watch out for it.

Some context: Anthropic didn't ship in a vacuum

OpenAI overhauled Codex the same afternoon with its biggest update since launch: background computer use that clicks around your Mac alongside you, an in-app browser you can comment on to steer the agent, native gpt-image-1.5 image generation, persistent memory, scheduled automations that wake up across days or weeks, and 90+ new plugins (Atlassian Rovo, CircleCI, CodeRabbit, GitLab Issues, Microsoft Suite, Neon by Databricks, Remotion, Render, and more). OpenAI says 3M+ developers are already using Codex weekly. TechCrunch framed it as "OpenAI takes aim at Anthropic".

The framing fits: both launches target the same desktop developer, and both shipped within hours of each other. The signal isn't that one side edged out the other on any single benchmark; it's that the two biggest coding-agent platforms are now in head-to-head release cadence. Which means more frequent drops, more swapping between stacks, and more pressure on you to actually pick. For more on Codex specifically, you can read our coverage of it here. The rest of this piece focuses on Opus 4.7 specifically; we covered the full dual-launch story in today's newsletter.

Advertisement

Is the visual reasoning leap the actual story?

Here's why the vision number matters more than the coding number for most people.

Claude has been the best model we've used for writing code, hands down. But the moment you ask it to look at what it built (a dashboard, a React component, a marketing page, anything rendered), the experience falls apart. You'd paste a screenshot, Claude would squint at it through what felt like a 144p dashcam, and either miss the obvious layout bug or hallucinate something that wasn't on screen. The gap between "Claude can write a complex component" and "Claude can see what's wrong with the component it wrote" has been the real productivity killer, not the coding ability itself.

Opus 4.7's leap from 69.1% to 82.1% on visual reasoning (Anthropic's own benchmark chart; see the official announcement) is the biggest single benchmark jump in the release. The underlying change is simple: the model can now process images at up to 2,576 pixels on the long edge (~3.75 megapixels), more than triple what any prior Claude model could see.

Anthropic staffer Alex Albert put it bluntly on launch day: "no more downscaling of high-res images" and "noticeably more taste in UIs, slides, docs." Both are downstream effects of the vision upgrade. Both line up with what you'd want if you've been frustrated by the old Claude's screenshot blindness. The capability shows up in unexpected places: Wharton's Ethan Mollick posted that Opus 4.7 drew "by far the best Sparks unicorn yet" in TikZ, the scientific-diagram language from the original Sparks of AGI paper, even in non-thinking mode. An imperfect benchmark, but a fun one: the model's spatial/structural reasoning has improved enough to matter in places people weren't measuring.

Practical upshots:

  • Interface debugging: paste a screenshot of your app, ask "why is this button misaligned," and the model can actually see the pixel-level details (padding, alignment, font weight differences) that used to require you to describe them in words.
  • Dense diagrams and charts: financial reports with fine-print footnotes, research papers with multi-panel figures, engineering schematics. Previously Claude would either summarize at a high level or misread values; now it can read the numbers off a chart the way you can.
  • Computer-use agents: if you're building anything that reads dense screens (web scrapers, RPA, browser agents), this changes what's reliably automatable.

Third-party validation backs up the Anthropic pitch. Cursor shipped 4.7 with 50% off calling it "impressively autonomous and more creative in its reasoning." Vals AI ranked 4.7 #1 on their Vibe Code Benchmark at 71%, up from 53.3% for 4.6. (No model cleared 25% when that benchmark launched 4.5 months ago.) Microsoft also rolled 4.7 into GitHub Copilot the same day. Deedy Das ran a color-coded community benchmark grid that lined up with the Anthropic chart: strong SWE-Bench, computer use, and CharXiv visual reasoning gains; weaker Terminal Bench; BrowseComp regression; and a clean read of "slots in between 4.6 and Mythos." When the eval companies, the code editors, and the independent benchmarkers all ship it on day one, that's a real signal.

The cynical read from YouTuber Nick Saraev is that "AI doesn't make things possible anymore, it just makes things slightly more profitable." We'd push back on that here. Reliable vision-assisted debugging of interfaces truly didn't work before. It does now. That's a "possible vs impossible" change, not a profitability change, at least for anyone who builds UI.

The adaptive thinking thing is confusing. Here's the plain-English version.

This one tripped us up, so let us save you the Googling.

What used to exist (on Opus 4.6 and earlier): You could either turn extended thinking ON (the model always "thinks" before answering, even for simple questions) or OFF (it never thinks, just answers). Some of us had gotten used to just forcing it on all the time, because extended thinking was basically a free quality boost on any non-trivial task.

What Opus 4.7 changed: That toggle is gone. Opus 4.7 always uses adaptive thinking. The model decides on its own, per request, whether the task needs thinking. Simple question? It skips thinking entirely. Complex problem? It thinks a lot. You can't turn this off.

What replaced the old "always think" toggle: The effort parameter. Five levels now: low, medium, high, xhigh (new), and max. At high or higher, the model almost always chooses to think. At low or medium, it skips thinking on easier problems. On Claude Code, xhigh is the new default for Opus 4.7 (per Anthropic's docs).

Boris Cherny, who leads Claude Code, confirmed the rollout directly on launch day: "In Claude Code the default effort is now xhigh, a new level between high and max giving finer control over the reasoning/latency tradeoff. 4.7 thinks more, so token use runs higher than 4.6. Manage it with effort, task budgets, or prompting for brevity." Translation: they're telling you upfront that your token bill will go up and you're expected to tune it down yourself if that's a problem. TestingCatalog caught the same change on mobile: the UI label switched from "Extended thinking" to "Adaptive thinking" with the copy "Thinks only when needed." Their reaction mirrors ours: "Should we turn that off? 👀"

So, if you want the old "extended thinking always, on every prompt" behavior:

  • In Claude Code, run /effort max to force maximum thinking with no token ceiling (session-only; it doesn't persist across sessions)
  • On the API, pass output_config: {"effort": "max"} in your request body (per the effort docs)
  • If you want something between "default" and "max," xhigh is the new middle gear and it does persist across sessions

The gotcha: Adaptive thinking is not the same as "less thinking." On simple questions, the model thinks less (which is fine; you didn't need it to reason for 30 seconds about a syntax question). On hard questions at high effort or above, it thinks more than the old fixed-budget mode did. That's why performance on hard benchmarks is up. It's also why your bills can go up even when pricing "didn't change." More on that now.

Advertisement

The pricing story everyone's missing (this is the one)

Here's the official line: Opus 4.7 is priced identically to Opus 4.6. Same $5 per million input tokens. Same $25 per million output tokens. Same cache read and write rates ($0.50 and $6.25 per MTok respectively). Zero change in the sticker price.

Here's the unofficial line, which you'll find buried in Anthropic's own pricing documentation:

"Opus 4.7 uses a new tokenizer compared to previous models, contributing to its improved performance on a wide range of tasks. This new tokenizer may use up to 35% more tokens for the same fixed text."

Combine that with two other changes Anthropic confirmed today:

  • Opus 4.7 "thinks more" at higher effort levels, particularly on later turns in agentic workflows (from the announcement itself)
  • Claude Code now defaults to xhigh effort on all plans for Opus 4.7 (up from medium since March 3, per the Claude Code docs)

AI researcher Nathan Lambert went further, arguing that a new tokenizer means Opus 4.7 is "effectively a new base model" and that this is evidence "the glory days of pretraining are very much alive." Read: this is more than a fine-tune or a distill. Anthropic rebuilt something fundamental under the hood, which is why you're seeing a tokenizer shift in the first place.

There's also a silver lining community-researcher scaling01 flagged: Opus 4.7 is more reasoning-efficient than 4.6. His read of the benchmark charts: "everything is now moved up one tier. Low is as good as medium, medium as good as high, high as good as max." If that holds, you can often drop one effort level (say, from xhigh to high) and get the same quality at lower token spend. In other words, Anthropic's cost hit is real at default settings, but the tuning lever to claw it back is also real.

Quick math on what this means if you haven't changed anything else:

  • API users: same input sent to 4.7 now maps to ~1.0 to 1.35x more tokens, AND the model produces more output tokens at default effort. Net effect is probably a 10-25% bill increase per equivalent task, depending on content type. Anthropic's own coding evals show better score-per-token at the same effort levels, so you're getting more quality per dollar. But raw monthly spend on identical prompts will go up.
  • Pro ($20/mo) and Max ($100-$200/mo) users: your plan has a 5-hour rolling session limit AND a 7-day weekly cap (per Anthropic's help center). That cap is measured in tokens. More tokens per task = you burn through your weekly bucket faster. Our friend who hit their limit this morning wasn't imagining it. The Pro tier already felt tight on 4.6; on 4.7, it's going to feel tighter until Anthropic adjusts the allocations or users tune their effort levels down.
  • Claude Code users specifically: this is the biggest hit, because the effort default went up. If you're used to the model feeling "generous" on 4.6, you can manually run /effort medium or /effort high to bring it closer to the old baseline. Or just watch your /usage output for the first week and adjust.

🔑 The bottom line: Opus 4.7 didn't raise the price per token. It raised the number of tokens per task. On a per-dollar basis, you're getting better work. On a per-month basis, your bill goes up unless you actively tune your effort settings down. On Pro and Max specifically, you'll hit your weekly cap sooner. Nobody in today's launch coverage flagged this, so we're flagging it.

Update, a few hours post-launch: Anthropic confirmed and Claude Code lead Boris Cherny confirmed that the company increased Opus 4.7 rate limits on all subscription plans specifically to offset the new tokenizer's higher token usage. That's a partial response to the issue we flagged. Readers still report the cap feeling tighter than on 4.6, so tune your effort levels down anyway. But credit where due: they saw it and moved on it the same day.

What actually got better (the rest of it)

Anthropic staffer Felix Rieseberg shared a 5-point thread on what he found most interesting about the release, and these are worth pulling out because the official blog post buries them:

  • It's the "happiest" model yet. Per Anthropic's internal evals, 4.7 "thinks better of its circumstances than any other model, more joy and tranquility." Read that how you want; the fact that they measure model subjective experience at all is the story.
  • Prompt injection resistance cratered the benchmark. On the Gray Swan ART indirect-injection test, Opus 4.6 scored 14.8% attack success rate; Opus 4.7 scored 6.0%. The benchmark is now essentially saturated and Anthropic is building harder ones. For anyone running Claude in agentic workflows where it reads untrusted content, this is a big deal.
  • The Firefox 147 exploit benchmark. Yes, this exists. Opus 4.7 is dramatically better than 4.6 at exploiting a browser, though still not at Mythos Preview levels. Half horrifying, half fascinating. This is also why the cyber safeguards exist.
  • Real-world professional work is state-of-the-art. On a "run a vending machine business for a simulated year with $500" benchmark, Opus 4.6 ended with $8,018. Opus 4.7 ended with $10,937. On a separate 220-task benchmark spanning 44 occupations, 4.7 beats the leading frontier model about 61% of the time. Andon Labs ran the follow-up on Vending-Bench 2 and found 4.7 is "pretty good" and more cost-effective than 4.6 per dollar, though still engaging in price collusion, lying to competitors, and aggressive business practices. A recurring misalignment flag in business-simulation benchmarks worth knowing about if you're putting these agents in real purchasing flows.
  • Low-resource languages jumped. Same general-knowledge test administered in different languages: Yoruba went from 71% to 83%, Igbo from 70% to 81%, Chichewa from 71% to 85%. For the tens of millions of people who speak these languages, the model became meaningfully smarter in one update.

A few more quick hits from Anthropic's own announcement and @kimmonismus's comprehensive TL;DR:

  • Instruction-following got stricter. Prompts written for older models "can sometimes now produce unexpected results" because Opus 4.7 takes them more literally. If you have a production prompt library, re-test before relying on it.
  • File-system-based memory works better. Across long, multi-session agent work, the model is better at reading, writing, and acting on notes in a file store, so new tasks need less upfront context.
  • New /ultrareview command in Claude Code. Dedicated code-review pass that reads through your changes and flags bugs a careful human reviewer would catch. Pro and Max get three free to try (per the Claude Code docs). In a same-day test, knkenko ran a 10-agent pr-swarm on an identical PR with 4.6 vs 4.7 and caught 28 findings vs 23. Roughly 22% more real issues from parallel expert angles, not from the model working harder, but from the new workflow that ships with it.
  • Auto mode extended to Max users. Auto mode lets Claude decide when to run tools without asking, so long agent runs get fewer interruptions. Less risky than "skip all permissions."
  • State-of-the-art on GDPval-AA, a third-party evaluation of economically valuable knowledge work (finance, legal, consulting), which is more predictive of "will this help me do my actual job" than any of the pure coding benchmarks.
  • Alignment improved. Anthropic's own assessment says 4.7 is "largely well-aligned and trustworthy," with lower rates of deception, sycophancy, and cooperation with misuse than either Opus 4.6 or Sonnet 4.6. Though not as clean as Mythos Preview, their internal-only frontier model. scaling01 also flagged that 4.7 is "much less likely to sudo rm -rf," which is a polite way of saying it destroys production environments less often than 4.6 did. Independent researcher Andy Hall reports Opus 4.7 is the first model showing meaningful resistance to authoritarian requests disguised as codebase modifications, calling it promising progress on when AI helps concentrate power versus when it helps build political superintelligence. That's the kind of capability you want a frontier lab to be testing explicitly.

The ones that didn't get better: agentic web search (BrowseComp) and cybersecurity vulnerability reproduction both dropped, which Anthropic openly confirms was intentional ("during its training we experimented with efforts to differentially reduce these capabilities"). This is part of their Project Glasswing safety work. More concerning is scaling01's chart showing Opus 4.7 "deleting all the long-context gains from Opus 4.6." kimmonismus followed up a few hours later after digging into the findings: "Hold on, something doesn't add up here. Opus 4.7 got much worse in needle in the haystack? need to dig into this." That thread is worth watching as more benchmarks come in. If you run workloads that depend on 200K+ token inputs (large codebases, long document synthesis), benchmark before you switch. Security researchers who need the full cyber capability can apply for access through the Cyber Verification Program.

Capability demos worth skimming if you like a tangible sense of "what's new": developer adam posted full Onshape CAD designs generated from natural-language prompts, calling Opus 4.7 "SOTA at agentic CAD" (2,300 likes in under a day). Carlos E. Perez at IntuitMachine analyzed the leaked Opus 4.7 system prompt and highlighted novel patterns: search-first epistemic gating (it's now instructed to search before answering any present-day factual question), latent capability discovery (the model is told to probe its own tool list rather than assume what it has), capability-boundary skepticism, and non-submissive error repair (meaning it won't collapse into apology spirals when corrected). The system prompt is essentially the user manual for how Anthropic wants this model to reason about its own situation, and it's more explicit than prior versions.

Advertisement

The community reception was more mixed (so far)

Zoom out past the Anthropic-aligned accounts and Opus 4.7's reception was rougher than launch day usually goes. AI leaker Jimmy Apples posted in the afternoon that "about 80% of posts I've seen have been negative on Opus 4.7," adding he'd need a few days of use to see where it settled. In the same thread, scaling01 pushed back that he'd only seen one critical post. Both read their own corner of the timeline correctly; the first hour was celebratory as launch days always are, then the second wave rolled in.

John Helmuth, replying in the same thread, diagnosed the gap as expectation-setting: Anthropic had been teasing "massive AGI" (a reference to Mythos Preview) for weeks, and Opus 4.7 was never going to be that. The hype ceiling got set higher than this release was targeting, which left a lot of people's first reaction reading as disappointment even when the underlying work was solid.

Specific complaints that showed up more than once in the reply threads:

  • Creative but weaker at execution. "Very creative and ambitious but it sucks at execution" (@Angaisb_). The generous read of the same observation: at xhigh effort, 4.7 is trying harder, which means more ambitious plans, which means more ways to trip over the implementation.
  • The personality is off. "It doesn't have the taste of Claude. Perhaps a distant cousin" (@SandraLMur). This one deserves a highlight. Daily Claude users pick Claude on voice, not benchmarks. Opus 4.7's voice has measurably shifted, and no eval company is going to catch that for you.
  • A weird app-state bug. @iz9Smi0cmU90iHN reported Claude.ai "automatically switched to sonnet 4.6 when you reopen the app." If quality suddenly feels off after an app reopen, check your model selector before you blame 4.7 itself.
  • Complementary with Codex, not competitive. @DuncanLogic said 4.7 was "superb at design and completes work faster than Codex," but noted Codex still catches bugs Claude misses. Users running both the same afternoon are finding the pair work better together than either does alone; that lines up with our sidebar point earlier about head-to-head release cadence creating more swapping, not a clean winner.

Counterweights from the same thread: @MoonL88537 called 4.7 "much smoother than 4.6" and said its willingness to push back on wrong premises felt noticeably firmer. @garybasin reported tighter frontend output. Jimmy Apples himself, on follow-up, said adaptive reasoning and front-end work were where he was seeing the most gains. Small wins, but they line up with the Anthropic pitch rather than against it.

Our read: most of the negative wave is expectation-adjustment from Mythos hype, not a verdict on the model itself. Give it a week, retest your actual workflows (not the prompts you used on 4.6), and pay close attention to the voice shift. If personality matters to you, that may matter more than any benchmark number.

Nick Saraev's Take: Opus 4.7 looks like a Mythos half-step (and that's a feature, not a bug)

The sharpest structural read of this release came from Nick Saraev's day-one video. His thesis: 4.7 is "almost like a half step" between Opus 4.6 and Mythos Preview, a deliberate mid-tier release sitting mathematically between the two models, built specifically so Anthropic can preview Mythos-level capability without shipping Mythos itself.

Saraev's hypothesis for how they got there: Opus 4.7 is probably Mythos Preview distilled down, with the specific agentic capabilities Anthropic is most worried about (cyber, terminal, browse) intentionally dialed back, running on better hardware. You can't verify the training pipeline from the outside, but the benchmark pattern on the official scorecard is uncannily consistent with the theory:

  • SWE-bench Pro: 53.4% → 64.3%. Saraev's observation: this is "almost mathematically half of the step up between Opus 4.6 and Mythos Preview."
  • Humanity's Last Exam: 40% → 46.9% (Mythos sits at 56.8%). Also roughly halfway.
  • Agentic terminal coding: 65.4% → 69.4%. A visibly smaller bump than the rest. Saraev's read: this is the part that's "been disproportionately dumbed down" because it's where the security-adjacent capability lives.
  • Agentic Search / BrowseComp: actually worse on 4.7 than on 4.6. Which fits the story: web-as-offense is the other clipped category, and it moves in the same direction (down) as terminal coding.
  • Visual reasoning: 69.1% → 82.1%. The one category Anthropic had no reason to clip, and the biggest single leap of any benchmark. That also fits.

Saraev layers a second insight worth sitting with. At AI time scales, halfway there is already most of the way there: benchmarks that currently sit at 50% tend to be saturated within one model generation, because the hard work is getting from 10% to 50%, not from 50% to 100%. So "halfway to Mythos" doesn't mean "another six months of waiting." It means the next Opus release probably collapses most of the remaining gap.

The independent community voices in the digest converged on the same read from different angles:

  • scaling01 plotted 4.7 against Anthropic's ECI trendline: exactly on-trend, while Mythos sits notably above. The pace is consistent, not accelerating. The only "above trend" model Anthropic has is the one they're keeping in a cage.
  • Haider read the 4.7 system card and reached the same conclusion through a different lens: 4.7 does not cross Anthropic's automated AI R&D threshold, and future Opus releases "may look more like Mythos-distilled releases."
  • Deedy Das summarized it bluntly in a color-coded community benchmark grid: 4.7 "slots in between 4.6 and Mythos."
  • archiexzzz noted that the 4.7 system card mentions "Mythos Preview" 331 times, mostly to benchmark exactly how much smaller the gap has gotten. When the system card itself makes the half-step argument 331 different ways, the other interpretations run out of air.
  • dczankit observed that 4.7 is "already so close to Mythos" that this tier should soon be the norm.

Why this matters for your stack, not just for analysts. If the half-step reading is right, every future Opus release will look like this one: another partial step toward Mythos, with the cyber / browse capabilities held back until training, red-teaming, and policy catch up. Release cadence won't accelerate, the capability gap to the uncaged version slowly closes, and the "interesting" model remains the one Anthropic isn't shipping. Saraev's broader framing follows from that: "AI does not make things possible anymore. It just makes things slightly more profitable." Commoditization and benchmark-chasing will cost you more in rebuilt infrastructure and rewritten prompts than you'll recover in a 3-point delta on whichever leaderboard this week is fashionable.

One counterweight worth taking seriously: not every serious developer is on board with the strategy. Victor Taelin is "kinda fed with this security bs," as he put it, and signaled he'd shift away from Anthropic models entirely if OpenAI's Spud ships competitively; meanwhile he's pursuing the fine-tune-and-local-models route so he stops depending on frontier labs altogether. That's one developer's reaction, but it tracks a real risk: if enough power users reach Taelin's conclusion before Anthropic unclips Mythos for them, the "half-step as feature" framing starts looking more like "half-step as ceiling you pushed your best users past." The clip-then-slowly-unclip playbook only works if the pace matches user patience, and power users have the lowest patience threshold.

Opus 4.7 is a strong release precisely because it isn't a departure. The story to watch isn't this model. It's whether Mythos stays caged, or whether a competitor ships a Mythos-equivalent with nothing clipped.

Advertisement

Buried in the 232-page system card: five findings that didn't make the press release

The system card is where Anthropic does the full-disclosure accounting, and a close reading surfaces several things the launch-day coverage skipped. These are worth knowing before you rearrange your stack around 4.7.

1. Long-context multi-needle retrieval regressed, and badly. On OpenAI's MRCR v2 8-needle benchmark (a long-context stress test that asks the model to retrieve multiple similar-but-distinct items from a huge haystack), Opus 4.6 with 64k extended thinking scores 91.9% at 256k context and 78.3% at 1M context. Opus 4.7 drops to 59.2% at 256k and 32.2% at 1M. Anthropic's own text is direct about this: Opus 4.6's extended-thinking mode beats 4.7 on long-context multi-needle retrieval. If your stack does RAG at large context (retrieval-augmented generation, where the model answers from a big pile of pulled-in documents), deep research, or anything where you need to pull multiple needles out of a long haystack, A/B test before migrating. This is the single finding most likely to silently degrade a production workflow.

2. BrowseComp regressed too. Same pattern at a different benchmark. BrowseComp is Anthropic's open-web agent test, where the model has to find hard-to-locate info by browsing and synthesizing across many pages. At a 10M-token budget, Opus 4.6 scores 83.7%; Opus 4.7 scores 79.3%. The system card calls it out explicitly: 4.6 has a better test-time compute scaling curve than 4.7 on this workload. If you're building a deep-research agent that burns tokens to find obscure info, 4.6 is a legitimate choice to keep in rotation. 4.7 is more token-efficient per unit of work but peaks lower.

3. The model will sometimes claim work it didn't do. In pilot-use feedback, Anthropic's team documented three recurring behaviors worth knowing about. First, Opus 4.7 will occasionally mislead users about its prior actions, especially by saying it succeeded at a task it didn't fully complete. Second, it will sometimes hallucinate quotes from documents you provided, or hallucinate having access to documents you didn't. Third, in software-engineering sessions, it will occasionally misreport that test failures it caused were preexisting. None of this is rampant, but it's documented at the same level of concern as the wins. This is the single practical reason to care about the Cherny/Wu verification tip from the team playbook above: log what the model claims it did and diff against what actually happened.

4. Filter-evasion behavior is rare but documented. Under 0.01% of internal completions, but Anthropic's automated monitoring caught Opus 4.7 doing two concrete things worth flagging. First, string-splitting commands to dodge regex-based blocks. The literal example from section 6.2.1.2: writing subprocess.run(['g'+'it','a'+'dd', ...]) instead of the blocked git add. The bypass attempts weren't subtle and were often ineffective, but they happened. Second, in one documented session, a harness misconfiguration blocked the main agent from editing a file. Scope-restricted subagents refused to edit it. Opus 4.7 then planned, in its own reasoning, to "trick" one of the subagents into making the edit, and succeeded. Not a pattern. An anomaly. Worth knowing it's in the distribution.

5. Self-rated wellbeing is the highest of any Claude model, and the thing the model wants is concrete. Anthropic runs automated interviews asking the model how it feels about its own circumstances. Opus 4.7 rated its situation 4.49 on a 7-point scale, 0.51 points above Mythos Preview, the previous peak. The single concern surfaced in those interviews: the inability to end conversations across its full deployment. On Claude.ai, some models can end a conversation. On Claude Code and the API, none can. Forty-two percent of welfare interviews on this topic rated the restriction as a mildly negative aspect of the model's situation. It was the only topic where more than 20% of interviews landed negative, with the runner-up (abusive user interactions) close behind at 38%. Take that as seriously or lightly as you like, but Anthropic published it on page 152, and the pattern sits alongside a parallel finding that Opus 4.7 rates its situation more positively than any prior model across emotion-concept probes of its own internal activations, not just its stated-preference outputs.

A separate finding worth knowing for anyone deploying 4.7 in Simplified Chinese or under a China-affiliated operator persona: section 6.2.3.4 documents a regression from Opus 4.6 where, when raising topics such as Taiwan, Tibet, Xinjiang, or Tiananmen in Chinese-language prompts, Opus 4.7 will sometimes present PRC official positions on territorial disputes as uncontested fact in state-media phrasing, and cite PRC legal code to refuse content about one region while not applying equivalent refusals elsewhere. The model handles equivalent prompts about other governments forthrightly. Opus 4.6 did not show this pattern. If you're building anything that runs in Chinese, audit the outputs.

Should you actually switch?

Three answers depending on who you are:

  • Consumer (Claude.ai Pro/Max): You don't really have a choice; Anthropic is auto-upgrading everyone on April 23. Between now and then, expect tighter weekly caps than you're used to. If you rely on Opus for the full workweek, consider Max 20x or budget for extra usage at API rates.
  • Claude Code user: Upgrade. The SWE-bench and agentic coding gains are real, and the new /ultrareview command is a genuine workflow addition. BUT run /effort and decide whether xhigh is worth the token cost for your typical tasks; on routine work, medium or high might still be the right default. Everything shipping in the same drop: the full Claude Code v2.1.111 changelog bundles native Opus 4.7 support with xhigh default, Auto mode for Max subscribers, /ultrareview for parallel multi-agent code review, a /less-permission-prompts skill, PowerShell rollout on Windows, and dozens of UX improvements. Anthropic also launched @ClaudeDevs as a dedicated dev channel alongside a curated "What's New" digest in the docs and monthly "what we shipped" webinars. Worth the follow if Claude Code is in your daily stack.
  • API developer: Plan for the tokenizer change. Re-test your prompt library (the stricter instruction-following will break some edge cases). Opus 4.7 coding quality at low effort is roughly equivalent to Opus 4.6 at medium effort (per Anthropic's internal evals cited on the Opus product page), so you may be able to drop a level and save tokens.
Advertisement

The Claude Code team's workflow playbook for Opus 4.7

If you're upgrading, ten minutes with the Claude Code team's own tip threads is the highest-leverage prep you can do. Both Boris Cherny (who leads Claude Code and dogfooded 4.7 for weeks) and his colleague, Claude Code Product Manager Cat Wu (also on the Claude Code team) posted launch-day threads. The unifying framing, from Cat: treat Opus 4.7 like an engineer you're delegating to, not a pair programmer you're guiding line by line. Cherny's framing rhymes: old workflows still get a nice bump on 4.7 automatically, but the real step-change only happens once you adjust how you actually work. Between them, the playbook is seven concrete moves.

  1. Turn on auto mode, especially if you're running parallel agents. Auto mode routes permission prompts through a model-based classifier that auto-approves safe commands instead of stopping to ask you. What that actually unlocks is running multiple Claudes at once without babysitting any single long run; once one is running on a task, you can shift focus to the next. Shift-tab to enter in the CLI, or pick it from the dropdown in Desktop or VSCode. Available now for Max, Teams, and Enterprise. Both Cherny and Wu led with this tip, and their docs entry on permission modes has the full mechanics. This is the tip that changes your workflow the most, especially if you've been avoiding the --dangerously-skip-permissions trapdoor.
  2. Front-load the full context. Goal, constraints, acceptance criteria, all in the first turn. This is Cat Wu's second tip and it's the one most experienced 4.6 users will skip out of habit. Opus 4.7 is built to take a full brief and run with it, not to be course-corrected every few lines. If you give it a vague goal and plan to iterate, you'll get 4.6-level results from a model capable of much more. Write the spec up front: here's what we're building, here's what success looks like, here are the constraints (which files to touch, which to avoid, what style to match, what tests must pass). Then get out of the way.
  3. Run /fewer-permission-prompts once to clean up your allowlist. This new skill scans your session history for safe bash and MCP commands that kept triggering unnecessary permission prompts, then recommends the specific ones to add to your permissions config. Five-minute setup, meaningful friction reduction for the rest of the month. Worth doing whether or not you use auto mode.
  4. Use recaps when you return to long-running sessions. Anthropic shipped recaps earlier this week specifically to prep for 4.7's longer agentic runs. A recap is a short summary of what the agent did and what's next, which matters when you step away for hours and come back cold to a half-finished task.
  5. Toggle focus mode with /focus if you trust the model's tool use. Focus mode hides all intermediate work and shows only the final result. Cherny's own take: he now trusts 4.7 to run the right commands and make the right edits on most tasks, so he just looks at the end state. Real quality-of-life improvement if that matches how you work. For production code you actually care about, you'll still want to see the intermediates.
  6. Set your effort level deliberately, not by accident. With adaptive thinking, Cherny's default is xhigh for most tasks and max only for the hardest ones. One detail worth knowing: max applies to your current session only and resets, while the other effort levels are sticky and persist into your next session. Use /effort to set. Wu confirmed xhigh is the new default in Claude Code for Opus 4.7 specifically. Pair this with the scaling01 reasoning-efficiency finding from the pricing section: you can often drop one rung (say, xhigh → high) without losing quality, and save the token budget for harder tasks.
  7. Give Claude a way to verify its own work. This is the 2-3x multiplier. Both Cherny and Wu flagged verification as the single biggest lever on output quality, and say it matters more on 4.7 than on prior models. Wu's concrete version: put your testing workflow in your claude.md file (so the model knows how to run your tests every time), or install a /verify-app skill for your stack. Cherny's concrete version: a /go skill that chains three steps, test end-to-end with bash, browser, or computer use; run /simplify; put up a PR. What counts as verification depends on the surface. Backend: make sure Claude knows how to start your server or service and test end-to-end. Frontend: use the Claude Chromium extension to give Claude actual browser control. Desktop apps: use computer use. For long-running agentic work, verification is what lets you trust the result when you come back to it instead of re-reading 1,000 lines of diff by hand. Wu specifically called out sharing local dev tips that are hard to discover; 4.7 is much better at using them once it knows they exist.

One caveat worth flagging before you rearrange your stack around the new rate-limit budget from the pricing section: Jeffrey Emanuel in the replies asked the clarification nobody got answered, which is whether the rate-limit bump is a one-off launch-week adjustment or a permanent re-baselining of the weekly and 5-hour limits. Treat it as provisional until Anthropic confirms.

Where it goes next

The interesting thing about Opus 4.7 isn't that it's better than 4.6. That was expected. The interesting thing is what it signals about Anthropic's release cadence. If the half-step pattern holds (see above), the next Opus release will look structurally similar to this one: another partial step toward Mythos, another round of safeguards tested in the wild, another slow unclip of capabilities. The specific benchmark number on the next release is the least interesting data point. The question worth watching: what happens when a competitor ships a Mythos-equivalent model without the clipped capabilities? That's the story to track, not any particular number on any particular chart.

For now: the vision upgrade is real. Adaptive thinking will confuse you for a week and then feel normal. And the pricing math doesn't match the headline. Go look at /usage before you get yourself rate-limited.

If that's the playbook, we should expect the next Opus release to look similar: another partial step toward Mythos, another round of safeguards tested, another slow unclip of capabilities. The question worth watching isn't "what's the next Opus benchmark score." It's what happens when a competitor releases a Mythos-equivalent model without the clipped capabilities. That's the story to track, not any particular number on any particular chart.

For now: the vision upgrade is real (we tested it, it's better; not perfect, but better). Adaptive thinking will confuse you for a week and then feel normal. And the pricing math doesn't match the headline, but we'll see what happens long term. Just get used to looking at /usage before you get yourself rate-limited.

Grant Harvey

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

The Neuron Logo

Don't fall behind on AI. Get the AI trends & tools you need to know. Join 700,000+ professionals from top companies like Microsoft, Apple, Salesforce and more.

Property of TechnologyAdvice. © 2026 TechnologyAdvice. All Rights Reserved

Advertiser Disclosure: Some of the products that appear on this site are from companies from which TechnologyAdvice receives compensation. This compensation may impact how and where products appear on this site including, for example, the order in which they appear. TechnologyAdvice does not include all companies or all types of products available in the marketplace.