#llm

Large language model releases, benchmarks, capability jumps, and the infrastructure that runs them.

Gemini Intelligence turns Android 17 into an agent that drives your apps

Google's Android Show pitched Gemini Intelligence and AppFunctions, an MCP-style interface that lets the assistant call functions inside your apps. Here's how it works and what to watch.

AI·2 hours ago

Running a coding agent fully on Apple Silicon, no cloud, is now an off-the-shelf stack

A popular Hacker News how-to walked through a fully local coding agent on Apple Silicon. Here's the realistic 2026 stack: runner, model, and harness.

AI·4 days ago

Claude Fable 5 is Anthropic's first public Mythos-class model. It tops SWE-Bench Pro at 80.3%.

Claude Fable 5 hits 80.3% on SWE-Bench Pro and ships on Bedrock and Copilot at $10/$50 per million tokens, free on paid plans only through June 22.

AI·5 days ago

OpenAI added a Lockdown Mode to ChatGPT to blunt prompt-injection attacks

OpenAI shipped Lockdown Mode in ChatGPT to cut off the data-exfiltration step of prompt-injection attacks. Here's what it actually restricts and who should turn it on.

AI·last week

OpenAI is putting Codex in every ChatGPT app, with six business plugins for non-coders

On June 2 OpenAI said Codex is coming to the ChatGPT app everywhere within weeks, and shipped six role-specific plugins for sales, analytics, design, and finance teams.

AI·last week

Stanford tested AI against law professors. The pros picked the AI 75% of the time.

A blinded Stanford Law study had 16 professors grade AI tutoring answers against their own. Here's what the 75% win rate actually measures, and what it doesn't.

AI·2 weeks ago

Claude Opus 4.8 flags the bugs it writes four times more often than Opus 4.7

Anthropic's Opus 4.8 posts 69.2% on SWE-Bench Pro, lets code flaws slip through four times less often, and ships parallel subagents in Claude Code. Here's what matters.

Open Source·2 weeks ago

SQLite won't accept AI-written code, but QEMU just opened the door to it

Two of the most cautious C projects split on AI contributions in the same week. The real fight is over copyright provenance and who cleans up the slop.

AI·2 weeks ago

Hacker News is obsessed with durable Postgres workflows and a game about clicking yes

Six dev-tooling and AI posts that climbed Hacker News in late May 2026: durable execution on plain Postgres, LLM code smells, a permission-fatigue game, Rust 1.96, and more.

AI·3 weeks ago

DeepSeek locked in the 75% V4-Pro cut. The API now undercuts every Western frontier model.

On May 23 DeepSeek told customers the V4-Pro discount becomes its standard price after May 31. Output drops from $3.48 to $0.87 per million tokens.

AI·3 weeks ago

Andrej Karpathy joined Anthropic. The OpenAI founding member's job: use Claude to train Claude.

Karpathy started this week at Anthropic on Nick Joseph's pre-training team. His mandate is using Claude to accelerate Claude's own training.

Security·last month

A crafted Ollama model file leaks the whole server's memory. 300,000 instances are exposed.

Cyera disclosed CVE-2026-7482 on May 1, a CVSS 9.1 unauthenticated heap read in Ollama. Three API calls dump prompts, env vars, and API keys from any open instance.

AI·last month

Microsoft tested 19 LLMs as document editors. Even the best ones corrupted 25% of the content.

The DELEGATE-52 benchmark tests AI editing across 52 professional domains. Frontier models corrupt a quarter of document content over long workflows.

AI·last month

Timothy Gowers gave GPT 5.5 an open math problem. It returned a novel proof in 17 minutes.

The 1998 Fields Medal winner reports GPT 5.5 Pro produced a novel proof for an unsolved math problem in 17 minutes, and says the era of owning theorems is ending.

AI·2 months ago

Microsoft and OpenAI just rewrote their deal. Exclusivity is dead, and so is the AGI clause.

Microsoft loses exclusive rights to OpenAI's models. The revenue share now caps at 2030 and stops depending on AGI. Here's what actually changed and who it benefits.

Open Source·2 months ago

Arcee's Trinity-Large-Thinking is a 399B open MoE that costs 96% less than Opus

Arcee released Trinity-Large-Thinking on April 1: a 399B-param sparse MoE with 13B active, Apache 2.0 weights, $0.88 per million output tokens, and PinchBench just behind Opus 4.6.

Security·2 months ago

A malicious GGUF file owns your SGLang server: CVE-2026-5760 is an unpatched 9.8

SGLang's reranker renders chat templates without a sandbox. Load a hostile GGUF, hit /v1/rerank, and the attacker has Python on your inference box. No patch yet.

AI·2 months ago

OpenAI just retired SWE-bench Verified. The headline coding benchmark of 2025 is officially saturated.

OpenAI says SWE-bench Verified is saturated and contaminated, and 60% of remaining problems are unsolvable. Here's what comes next, and why every coding leaderboard is suspect.