<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>devtake.dev — #benchmarks</title><description>Articles tagged benchmarks on devtake.dev.</description><link>https://devtake.dev/</link><language>en-us</language><item><title>Claude Fable 5 is Anthropic&apos;s first public Mythos-class model. It tops SWE-Bench Pro at 80.3%.</title><link>https://devtake.dev/article/claude-fable-5-launch/</link><guid isPermaLink="true">https://devtake.dev/article/claude-fable-5-launch/</guid><description>Claude Fable 5 hits 80.3% on SWE-Bench Pro and ships on Bedrock and Copilot at $10/$50 per million tokens, free on paid plans only through June 22.</description><pubDate>Tue, 09 Jun 2026 18:55:00 GMT</pubDate><category>ai</category><category>ai-models</category><category>anthropic</category><category>claude</category><category>claude-mythos</category><category>benchmarks</category><category>llm</category><category>agentic-coding</category><author>dieter-morelli</author></item><item><title>Stanford tested AI against law professors. The pros picked the AI 75% of the time.</title><link>https://devtake.dev/article/stanford-ai-beats-law-professors/</link><guid isPermaLink="true">https://devtake.dev/article/stanford-ai-beats-law-professors/</guid><description>A blinded Stanford Law study had 16 professors grade AI tutoring answers against their own. 
Here&apos;s what the 75% win rate actually measures, and what it doesn&apos;t.</description><pubDate>Wed, 03 Jun 2026 11:15:00 GMT</pubDate><category>ai</category><category>llm</category><category>benchmarks</category><category>legal-ai</category><category>ai-models</category><category>gemini</category><category>rag</category><category>ai-eval</category><author>dieter-morelli</author></item><item><title>Claude Opus 4.8 flags the bugs it writes four times more often than Opus 4.7</title><link>https://devtake.dev/article/claude-opus-4-8-launch/</link><guid isPermaLink="true">https://devtake.dev/article/claude-opus-4-8-launch/</guid><description>Anthropic&apos;s Opus 4.8 posts 69.2% on SWE-Bench Pro, lets code flaws slip 4x less often, and ships parallel subagents in Claude Code. Here&apos;s what matters.</description><pubDate>Fri, 29 May 2026 07:20:00 GMT</pubDate><category>ai</category><category>ai-models</category><category>anthropic</category><category>claude</category><category>llm</category><category>benchmarks</category><category>agentic-coding</category><category>claude-code</category><category>opus-4-7</category><author>dieter-morelli</author></item><item><title>Microsoft tested 19 LLMs as document editors. Even the best ones corrupted 25% of the content.</title><link>https://devtake.dev/article/llms-corrupt-documents-delegation-errors/</link><guid isPermaLink="true">https://devtake.dev/article/llms-corrupt-documents-delegation-errors/</guid><description>The DELEGATE-52 benchmark tests AI editing across 52 professional domains. 
Frontier models corrupt a quarter of document content over long workflows.</description><pubDate>Sun, 10 May 2026 09:00:00 GMT</pubDate><category>ai</category><category>llm</category><category>ai-models</category><category>benchmarks</category><category>microsoft</category><category>delegation</category><category>vibe-coding</category><author>dieter-morelli</author></item><item><title>Timothy Gowers gave GPT 5.5 an open math problem. It returned a novel proof in 17 minutes.</title><link>https://devtake.dev/article/fields-medal-gowers-gpt-open-problems/</link><guid isPermaLink="true">https://devtake.dev/article/fields-medal-gowers-gpt-open-problems/</guid><description>The 1998 Fields Medal winner reports GPT 5.5 Pro produced a novel proof for an unsolved math problem in 17 minutes, and says the era of owning theorems is ending.</description><pubDate>Sat, 09 May 2026 07:30:00 GMT</pubDate><category>ai</category><category>openai</category><category>llm</category><category>ai-models</category><category>benchmarks</category><author>dieter-morelli</author></item><item><title>OpenAI just retired SWE-bench Verified. The headline coding benchmark of 2025 is officially saturated.</title><link>https://devtake.dev/article/openai-retires-swe-bench-verified/</link><guid isPermaLink="true">https://devtake.dev/article/openai-retires-swe-bench-verified/</guid><description>OpenAI says SWE-bench Verified is saturated and contaminated, and 60% of remaining problems are unsolvable. 
Here&apos;s what comes next, and why every coding leaderboard is suspect.</description><pubDate>Mon, 27 Apr 2026 10:00:00 GMT</pubDate><category>ai</category><category>openai</category><category>swe-bench</category><category>benchmarks</category><category>ai-models</category><category>llm</category><category>ai-coding</category><category>evaluations</category><category>claude-opus</category><author>dieter-morelli</author></item><item><title>DeepSeek V4 lands: 1.6T-param open MoE, 1M-token context, and SWE-bench within 0.2 of Opus 4.6</title><link>https://devtake.dev/article/deepseek-v4-release/</link><guid isPermaLink="true">https://devtake.dev/article/deepseek-v4-release/</guid><description>DeepSeek shipped V4-Pro and V4-Flash under MIT on April 24. V4-Pro hits 80.6% on SWE-bench Verified. V4-Flash is $0.14 in / $0.28 out.</description><pubDate>Fri, 24 Apr 2026 21:30:00 GMT</pubDate><category>ai</category><category>deepseek</category><category>deepseek-v4</category><category>llm</category><category>ai-models</category><category>open-weights</category><category>moe</category><category>benchmarks</category><category>open-source</category><author>dieter-morelli</author></item><item><title>Claude Opus 4.7 is here, and the long-context benchmarks got worse</title><link>https://devtake.dev/article/anthropic-claude-opus-4-7-launch/</link><guid isPermaLink="true">https://devtake.dev/article/anthropic-claude-opus-4-7-launch/</guid><description>Anthropic&apos;s Opus 4.7 is state-of-the-art on SWE-bench and CursorBench, but independent tests show regressions on long-context retrieval and thematic reasoning.</description><pubDate>Fri, 17 Apr 2026 09:30:00 GMT</pubDate><category>ai</category><category>claude</category><category>anthropic</category><category>opus-4-7</category><category>llm</category><category>benchmarks</category><category>mythos</category><category>ai-models</category><author>dieter-morelli</author></item></channel></rss>