devtake.dev

Claude Opus 4.8 flags the bugs it writes four times more often than Opus 4.7

Anthropic's Opus 4.8 posts 69.2% on SWE-Bench Pro, lets code flaws slip 4x less often, and ships parallel subagents in Claude Code. Here's what matters.

Dieter Morelli · · 6 min read · 4 sources
Anthropic's announcement artwork for Claude Opus 4.8, a soft gradient panel with the Claude wordmark.
Image: Anthropic · Source

Anthropic shipped Claude Opus 4.8 on May 28, and the framing is unusually flat for an AI launch. The company calls it “a modest but tangible improvement on its predecessor.” No fireworks. No claim that everything just changed. For a developer trying to decide whether to bother re-testing prompts, that honesty is the most useful thing about the release.

So what actually moved? The benchmark wins are real but incremental. The interesting shift is behavioral: Opus 4.8 is better at admitting when it doesn’t know, and around four times less likely to let a bug it wrote slip past without flagging it. There’s also a new parallel-subagent mode in Claude Code, a cheaper fast tier, and an effort dial you can turn. If you ship code with Claude, two or three of those are worth your attention this week.

What changed under the hood

The headline numbers are coding and reasoning. On SWE-Bench Pro, the hardest variant where models patch real issues in live repos, Opus 4.8 scores 69.2%, up from 64.3% for Opus 4.7 and well ahead of the 58.6% The Decoder reports for GPT-5.5. On Humanity’s Last Exam, a set of genuinely hard graduate-level questions, it lands 49.8% without tools and 57.9% with them.

Those are gains, not leaps. A 5-point jump on SWE-Bench Pro is the kind of thing you feel on the margins: slightly fewer botched patches, slightly longer tasks that hold together. Anthropic also says the model does the same knowledge work with 15% fewer passes per task and 35% fewer output tokens than Opus 4.7. That efficiency matters more than the raw score for anyone paying per token.

The agentic side got the bigger workout. On Online-Mind2Web, the browser-driving benchmark that measures how well a model can click through real websites, Opus 4.8 hits 84%, per a tester quote in Anthropic’s post. The company also claims it’s the first model to clear 10% on the all-pass standard of its internal Legal Agent Benchmark, and the only one to finish every case in a “Super-Agent” test end-to-end. Take the internal benchmarks with the usual grain of salt. They’re Anthropic’s own scorecards. But the direction is consistent: longer chains of tool use that don’t fall apart halfway through.

Pricing didn’t move. Standard mode stays at $5 per million input tokens and $25 per million output, the same rate Anthropic has held since Opus 4.5. The change is in the fast tier.

The honesty fix is the headline

Here’s the part Anthropic buried and Simon Willison led with. Opus 4.8 is “around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked,” per Anthropic’s own writeup. It’s also more willing to say it isn’t sure.

Willison dug into how the model earned its low error rate. It “achieved this mainly by abstaining on questions about which it was uncertain,” he wrote, rather than by getting more answers right. That’s a real distinction. A model that says “I don’t know” instead of confidently inventing an API is less impressive on a leaderboard and far more useful in a code review.

Why does this land harder than a benchmark bump? Because the failure mode developers actually hate is the confident wrong answer. A model that hallucinates a function signature, then writes three files around it, costs you an hour. One that flags “I’m not certain this method exists” saves it. Anthropic’s alignment team frames this as gains in “prosocial traits like supporting user autonomy,” but the practical version is simpler: it lies to you less. We covered the company’s earlier deception-monitoring work when that research first landed, and this release reads like the product side of the same bet.

There’s a small API change that compounds this. Opus 4.8 now accepts a role: "system" message after a user turn in the messages array, which means you can steer the system prompt partway through a long conversation without restating the whole thing. Willison called that “really powerful,” and for long-running agents it is: you can update the rules mid-session as the task drifts, instead of resetting context.

Willison, who tends to be hard on AI marketing, called the muted framing “so refreshing to see an AI lab honestly describe a release as a minor incremental improvement.” When the harshest reviewer in the room praises your restraint, that’s the signal.

Dynamic Workflows and a cheaper fast mode

The feature with the most reach is Dynamic Workflows, a research preview inside Claude Code. It lets the model plan a large job, then run hundreds of parallel subagents in a single session and verify their output before reporting back, according to Anthropic. Think of a refactor that touches 80 files: instead of one agent grinding through them serially, it fans the work out and checks itself at the end.

That’s a structural change to how an agent uses its turn, not a prompt trick. It’s also gated. Dynamic Workflows ships only on Enterprise, Team, and Max plans at launch, so solo developers on cheaper tiers won’t see it yet. If you’ve felt Claude Code stall on big multi-file tasks, this is the lever. Just know your plan decides whether you get to pull it. The same week, Anthropic also doubled Claude Code rate limits, so the agent has more headroom to actually run these longer sessions.

Then there’s the money. Fast mode, where Opus runs at 2.5x normal speed, now costs $10 per million input and $50 per million output. Anthropic says that’s three times cheaper than fast mode on previous models. For latency-sensitive work like autocomplete or interactive agents, a 3x price cut on the fast path changes what’s economical to build. A new effort control rounds it out: you choose how much compute Claude spends, trading benchmark scores for tokens and speed. Crank it for a gnarly bug, dial it down for boilerplate.

What this means for you

If you already run Opus 4.7 in production, the upgrade math is easy. Same price, better honesty, fewer tokens per task. Re-run your evals and switch the model string to claude-opus-4-8 if your numbers hold. Don’t expect a night-and-day difference in output quality, because there isn’t one. Do expect fewer confidently-wrong moments, which is the kind of thing that doesn’t show in a benchmark but shows in a sprint.

If you’re shopping Anthropic against OpenAI, the published benchmarks favor Opus 4.8, but benchmarks are Anthropic’s home court. Run your own SWE-bench-style harness on your actual repo before you commit. And if you’re on a solo plan eyeing Dynamic Workflows, check the tier list first: the headline feature is enterprise-gated at launch. My read: this is a free lunch for existing Opus users and a “test it yourself” for everyone else. The honesty work is the part that’ll age well.

Why you’re hearing about this now

The timing is deliberate. Anthropic shipped Opus 4.8 the same week OpenAI’s GPT-5.5 has been the benchmark to beat, and the company now carries a roughly $900B valuation it has to keep justifying. A “modest” release between flagship leaps is how a lab signals it’s compounding rather than gambling. Watch the next 60 days: if Dynamic Workflows graduates from research preview to general availability, that’s the real product story. The benchmark deltas will be a footnote by then.

Share this article

Quick reference

SWE-Bench Pro
A coding benchmark that asks a model to fix real issues in actively maintained repos, harder than standard SWE-bench because there's no public ground-truth to memorize.
agentic
Describes an AI that plans multi-step work and acts on its own (running tools, editing files, calling APIs) rather than just answering a single prompt.

Sources

Frequently Asked

How much does Claude Opus 4.8 cost?
Standard mode is $5 per million input tokens and $25 per million output, unchanged since Opus 4.5. Fast mode runs $10/$50, which Anthropic says is three times cheaper than fast mode on previous models.
What is Dynamic Workflows in Claude Code?
A research preview that lets Claude plan a large task, spin up hundreds of parallel subagents in one session, then verify their output before reporting back. It's limited to Enterprise, Team, and Max plans at launch.
Is Opus 4.8 better than GPT-5.5?
On Anthropic's published benchmarks, yes. Opus 4.8 hits 69.2% on SWE-Bench Pro versus 58.6% for GPT-5.5, and wins roughly 67% of head-to-head knowledge-work comparisons. Independent testing is still early.
What does 'effort control' do?
It lets you dial how much compute Claude spends on a task. Higher effort scores better on hard benchmarks but costs more tokens and runs slower; lower effort is cheaper and faster for routine work.

Mentioned in this article