devtake.dev — #benchmarks

Articles tagged benchmarks on devtake.dev.
https://devtake.dev
Language: en-us

Claude Fable 5 is Anthropic's first public Mythos-class model. It tops SWE-Bench Pro at 80.3%.
https://devtake.dev/article/claude-fable-5-launch/
Claude Fable 5 hits 80.3% on SWE-Bench Pro and ships on Bedrock and Copilot at $10/$50 per million tokens, free on paid plans only through June 22.
Tue, 09 Jun 2026 18:55:00 GMT
Tags: ai, ai-models, anthropic, claude, claude-mythos, benchmarks, llm, agentic-coding, dieter-morelli

Stanford tested AI against law professors. The pros picked the AI 75% of the time.
https://devtake.dev/article/stanford-ai-beats-law-professors/
A blinded Stanford Law study had 16 professors grade AI tutoring answers against their own. Here's what the 75% win rate actually measures, and what it doesn't.
Wed, 03 Jun 2026 11:15:00 GMT
Tags: ai, llm, benchmarks, legal-ai, ai-models, gemini, rag, ai-eval, dieter-morelli

Claude Opus 4.8 flags the bugs it writes four times more often than Opus 4.7
https://devtake.dev/article/claude-opus-4-8-launch/
Anthropic's Opus 4.8 posts 69.2% on SWE-Bench Pro, lets code flaws slip 4x less often, and ships parallel subagents in Claude Code. Here's what matters.
Fri, 29 May 2026 07:20:00 GMT
Tags: ai, ai-models, anthropic, claude, llm, benchmarks, agentic-coding, claude-code, opus-4-7, dieter-morelli

Microsoft tested 19 LLMs as document editors. Even the best ones corrupted 25% of the content.
https://devtake.dev/article/llms-corrupt-documents-delegation-errors/
The DELEGATE-52 benchmark tests AI editing across 52 professional domains. Frontier models corrupt a quarter of document content over long workflows.
Sun, 10 May 2026 09:00:00 GMT
Tags: ai, llm, ai-models, benchmarks, microsoft, delegation, vibe-coding, dieter-morelli

Timothy Gowers gave GPT 5.5 an open math problem. It returned a novel proof in 17 minutes.
https://devtake.dev/article/fields-medal-gowers-gpt-open-problems/
The 1998 Fields Medal winner reports GPT 5.5 Pro produced a novel proof for an unsolved math problem in 17 minutes, and says the era of owning theorems is ending.
Sat, 09 May 2026 07:30:00 GMT
Tags: ai, openai, llm, ai-models, benchmarks, dieter-morelli

OpenAI just retired SWE-bench Verified. The headline coding benchmark of 2025 is officially saturated.
https://devtake.dev/article/openai-retires-swe-bench-verified/
OpenAI says SWE-bench Verified is saturated and contaminated, and 60% of remaining problems are unsolvable. Here's what comes next, and why every coding leaderboard is suspect.
Mon, 27 Apr 2026 10:00:00 GMT
Tags: ai, openai, swe-bench, benchmarks, ai-models, llm, ai-coding, evaluations, claude-opus, dieter-morelli

DeepSeek V4 lands: 1.6T-param open MoE, 1M-token context, and SWE-bench within 0.2 of Opus 4.6
https://devtake.dev/article/deepseek-v4-release/
DeepSeek shipped V4-Pro and V4-Flash under MIT on April 24. V4-Pro hits 80.6% on SWE-bench Verified. V4-Flash is $0.14 in / $0.28 out.
Fri, 24 Apr 2026 21:30:00 GMT
Tags: ai, deepseek, deepseek-v4, llm, ai-models, open-weights, moe, benchmarks, open-source, dieter-morelli

Claude Opus 4.7 is here, and the long-context benchmarks got worse
https://devtake.dev/article/anthropic-claude-opus-4-7-launch/
Anthropic's Opus 4.7 is state-of-the-art on SWE-bench and CursorBench, but independent tests show regressions on long-context retrieval and thematic reasoning.
Fri, 17 Apr 2026 09:30:00 GMT
Tags: ai, claude, anthropic, opus-4-7, llm, benchmarks, mythos, ai-models, dieter-morelli