devtake.dev

#benchmarks

Anthropic's announcement artwork for Claude Fable 5 and Claude Mythos 5, a soft gradient panel with the Claude wordmark.
AI·

Claude Fable 5 is Anthropic's first public Mythos-class model. It tops SWE-Bench Pro at 80.3%.

Claude Fable 5 hits 80.3% on SWE-Bench Pro and ships on Bedrock and Copilot at $10/$50 per million tokens, free on paid plans only through June 22.

The Stanford Law School building on Stanford University's campus
AI·

Stanford tested AI against law professors. The pros picked the AI 75% of the time.

A blinded Stanford Law study had 16 professors grade AI tutoring answers against their own. Here's what the 75% win rate actually measures, and what it doesn't.

Anthropic's announcement artwork for Claude Opus 4.8, a soft gradient panel with the Claude wordmark.
AI·

Claude Opus 4.8 flags the bugs it writes four times more often than Opus 4.7

Anthropic's Opus 4.8 posts 69.2% on SWE-Bench Pro, lets code flaws slip 4x less often, and ships parallel subagents in Claude Code. Here's what matters.

The DELEGATE-52 project repository on GitHub, showing Microsoft's benchmark for testing LLM document editing fidelity
AI·

Microsoft tested 19 LLMs as document editors. Even the best ones corrupted 25% of the content.

The DELEGATE-52 benchmark tests AI editing across 52 professional domains. Frontier models corrupt a quarter of document content over long workflows.

A mathematics lecture hall with equations on blackboards
AI·

Timothy Gowers gave GPT 5.5 an open math problem. It returned a novel proof in 17 minutes.

The 1998 Fields Medal winner reports GPT 5.5 Pro produced a novel proof for an unsolved math problem in 17 minutes, and says the era of owning theorems is ending.

AI·

OpenAI just retired SWE-bench Verified. The headline coding benchmark of 2025 is officially saturated.

OpenAI says SWE-bench Verified is saturated and contaminated, and 60% of remaining problems are unsolvable. Here's what comes next, and why every coding leaderboard is suspect.

DeepSeek social card from the V4 API documentation release post.
AI·

DeepSeek V4 lands: 1.6T-param open MoE, 1M-token context, and SWE-bench within 0.2 of Opus 4.6

DeepSeek shipped V4-Pro and V4-Flash under MIT on April 24. V4-Pro hits 80.6% on SWE-bench Verified. V4-Flash is $0.14 in / $0.28 out.

Claude Opus 4.7 launch artwork from the Anthropic news post
AI·

Claude Opus 4.7 is here, and the long-context benchmarks got worse

Anthropic's Opus 4.7 is state-of-the-art on SWE-bench and CursorBench, but independent tests show regressions on long-context retrieval and thematic reasoning.