Thoughts · Jun 13, 2026

AI model benchmarks are a suggestion

[Illustration: a bar chart where a hand presses its thumb down to inflate the tallest bar, which is crowned with a first-place winner's rosette. A rigged, self-certified benchmark.]

We don't hire developers on their school grades. Grades are standardised, supervised, accumulated over years of repeated testing — better measurement than any model benchmark will ever be. And the industry still decided they barely predict who can do the job. We interview. We run probation. We judge people on the work.

Yet every model launch ships a table of scores and a wave of one-shot demos, and we read them like report cards. Treat both as a suggestion at best. Neither predicts whether the model survives contact with your actual work.

The demos measure party tricks

One-shotting a Minecraft clone. Rendering a macOS desktop in a single HTML file. Drawing a PlayStation controller in SVG. Strip away the novelty and they all share a shape: zero-context, greenfield generation of things with a thousand public implementations sitting in the training data. My boss has never once asked me to one-shot a Minecraft clone. Real work is the inverse — a ten-year-old codebase, conventions documented in nobody's head but Dave's, a change that has to land without breaking forty downstream consumers. The demo evaluates the model in the one environment you will never work in.

Simon Willison is at least honest that his pelican-riding-a-bicycle test is a joke people have been "inadvisably" taking seriously. The launch-day reposters treating these stunts as capability evidence are not. Once a joke benchmark becomes a launch-day screenshot format, it stops being clean evidence.

The formal benchmarks measure the test, not the model

Goodhart's law: when a measure becomes a target, it stops being a good measure. And a benchmark score is the single most valuable marketing asset a lab owns — about the most tempting target you could name. You don't need to allege fraud to see where this goes. Two ordinary pressures push a score up without anyone cheating, and it's worth keeping them apart.

The first is contamination. Train on the public internet and the test set is in there somewhere — the model has met the answers long before you pose the question.
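One rough way to look for contamination is to check how much of a test item appears word for word in a document the model may have trained on. The sketch below is a toy under stated assumptions: the function names, the 8-word window, and the sample strings are all made up for illustration, and real contamination audits are considerably more involved than exact n-gram matching.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical data: an invented test question and two invented "scraped" pages.
test_item = (
    "A farmer has 12 crates with 30 apples in each crate "
    "and sells 45 apples how many apples are left"
)
scraped_page = "homework help thread: " + test_item + " answer is 315"
unrelated_page = (
    "the quarterly report shows revenue growth across all regions "
    "this year compared with the prior period overall"
)

leaked = overlap_ratio(test_item, scraped_page)
clean = overlap_ratio(test_item, unrelated_page)
```

A high ratio only says the question text was available to train on. It does not prove the model memorised it, and a paraphrased leak would slip straight past a check like this.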

The second is overfitting. Tune a model over and over to lift its score on a single benchmark, then ship whichever version happened to score highest, and you haven't shipped the best model. You've shipped the best score.
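The selection effect is easy to simulate. In the sketch below every variant has exactly the same true accuracy, and the only thing that changes is how many variants you score before publishing the best one. All the numbers are assumptions chosen for illustration (70% true accuracy, a 500-question benchmark), and the 27 simply echoes the Llama 4 figure cited further down.

```python
import random


def benchmark_score(true_accuracy: float, n_questions: int, rng: random.Random) -> float:
    """Score one variant: each question is answered correctly with probability true_accuracy."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_questions))
    return correct / n_questions


def best_of(n_variants: int, true_accuracy: float, n_questions: int, rng: random.Random) -> float:
    """Score n_variants equally capable variants and report only the highest score."""
    return max(benchmark_score(true_accuracy, n_questions, rng) for _ in range(n_variants))


# Assumed, illustrative parameters.
TRUE_ACCURACY = 0.70
N_QUESTIONS = 500
TRIALS = 100

rng = random.Random(0)  # fixed seed so the run is repeatable

# Average published score when you test one variant vs. when you test 27 and keep the winner.
single_run = sum(best_of(1, TRUE_ACCURACY, N_QUESTIONS, rng) for _ in range(TRIALS)) / TRIALS
best_of_27 = sum(best_of(27, TRUE_ACCURACY, N_QUESTIONS, rng) for _ in range(TRIALS)) / TRIALS

print(f"true accuracy:       {TRUE_ACCURACY:.3f}")
print(f"one variant, mean:   {single_run:.3f}")
print(f"best of 27, mean:    {best_of_27:.3f}")
```

Nothing about the model improved between the two rows. The gap is pure noise, harvested by picking the maximum.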

The evidence is no longer anecdotal:

GSM1k built a fresh, difficulty-matched twin of GSM8k (the standard grade-school maths benchmark) and found accuracy drops of up to eight points, with several model families showing systematic overfitting.
The SWE-Bench Illusion showed state-of-the-art models naming the buggy file path from the issue description alone — no repository access. They got it right up to 76% of the time on SWE-bench repos, collapsing to 53% on repos outside the benchmark. That is not debugging. That is recall.
The Leaderboard Illusion documented Meta testing twenty-seven private Llama 4 variants on Chatbot Arena before publishing only the winner, on a leaderboard that lets providers retract scores they don't like.

There is no guarantee the entry topping the arena is the model behind the API you can buy.

The reporting launders the numbers

A launch post compares against the competitor's previous generation, picks the favourable subset of benchmarks, and occasionally commits outright chart crimes — GPT-5's launch drew 52.8 as a taller bar than 69.1, live on stream.

The press then reprints the lab's own table, because independently re-running a benchmark suite costs real money and the access is gated.

"Beats X on fourteen of seventeen benchmarks."

Vendor-certified

Chosen by whom? The vendor.

Run by whom? The vendor.

Reported by whom? The vendor.

A student grading its own homework.

We would laugh a database vendor out of the room for self-certified performance numbers. We accept them from model labs weekly.

There is no single axis

A benchmark table implies a single ranking: this model is simply better than that one. Put two frontier models on the same problem and the ordering dissolves. I've watched the newer Claude Fable 5 read an Opus 4.8 answer and concede the older model's was better — one anecdote, not a measurement, but a single ranking shouldn't produce anecdotes like that. Newer is not better, higher-scoring is not better; better at what is the only question with an answer.

And the axis nobody headlines is confidence. In my use, GPT errs towards under-confidence — hedging, caveating, waiting for permission — while Claude will occasionally do the wrong thing with complete conviction. Which is better? It depends entirely on how you work: confidently wrong is survivable inside a tight review loop and lethal in an unattended agent; under-confident is safe and slow. SimpleQA at least grades "not attempted" separately from "incorrect", which is a calibration score in all but name — how well a model's stated confidence matches its real accuracy. It never makes the launch chart. Knowing when you don't know doesn't sell like a coding benchmark every model already aces.
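To see why grading "not attempted" separately matters, here is a minimal sketch. The three grade labels follow the split described above; the metric names and the two invented models are my own assumptions, not SimpleQA's published scoring.

```python
from collections import Counter


def summarise(grades: list[str]) -> dict[str, float]:
    """Summarise a list of grades, each one of 'correct', 'incorrect' or 'not_attempted'."""
    counts = Counter(grades)
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to summarise")
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "overall_correct": counts["correct"] / total,
        "attempt_rate": attempted / total,
        # Accuracy on the questions the model chose to answer.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
        # Share of all questions answered wrongly rather than declined.
        "confidently_wrong": counts["incorrect"] / total,
    }


# Two hypothetical models with the same headline accuracy.
hedger = ["correct"] * 60 + ["not_attempted"] * 35 + ["incorrect"] * 5
bluffer = ["correct"] * 60 + ["incorrect"] * 40

hedger_stats = summarise(hedger)
bluffer_stats = summarise(bluffer)
```

Both models score 60% on the headline number. Only the split shows that one of them is wrong on 5 questions in 100 and the other on 40, which is exactly the difference that decides whether you can leave it unattended.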

Intelligence is no use paired with amnesia

On a long-running task, context can matter more than raw intelligence. The smartest model available turns into a liability the moment it forgets the spec it agreed an hour earlier, the file it already edited, the approach it already rejected — anyone who has watched an agent proudly reintroduce the bug it fixed that morning knows the feeling. The biggest cause is compaction: when the context fills up, the harness summarises the history away and the detail goes with it.

The thing that helped most was just more room. Moving from the couple of hundred thousand tokens that used to be standard to a million was a genuine game-changer on long-running tasks: the agent runs far longer before it has to compact, summarise, and start forgetting. The model still won't use the millionth token as well as the first. That's context rot, and past a point it can make a long context useless. But the extra room alone is worth it, and no benchmark measures that.

~200K → 1M tokens before it compacts

The fix isn't just the model, it's the harness around it. A spec written to disk is memory that survives compaction. Worktrees limit how much an agent has to keep in its head at once. That is half the point of spec-driven development — you write the decisions down so the model doesn't have to remember them.
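A toy model makes the point concrete. Everything here is hypothetical: `ToyHarness`, the four-message limit, and the deliberately lossy "summary" are stand-ins, not how any real harness compacts. The only idea it demonstrates is that a decision re-read from disk on every turn outlives a decision that was only mentioned in chat.

```python
import tempfile
from pathlib import Path


class ToyHarness:
    """Hypothetical harness: keeps a short chat history and re-reads a spec file every turn."""

    def __init__(self, spec_path: Path, max_messages: int = 4):
        self.spec_path = spec_path
        self.max_messages = max_messages
        self.history: list[str] = []

    def add(self, message: str) -> None:
        self.history.append(message)
        if len(self.history) > self.max_messages:
            self.compact()

    def compact(self) -> None:
        # Deliberately lossy: the detail is replaced by a one-line placeholder.
        self.history = [f"[summary of {len(self.history)} earlier messages]"]

    def context(self) -> str:
        # The spec is read from disk each time, so compaction cannot touch it.
        spec = self.spec_path.read_text(encoding="utf-8")
        return "\n".join([f"SPEC:\n{spec}", *self.history])


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        spec_path = Path(tmp) / "SPEC.md"
        spec_path.write_text("Decision: retry at most 3 times.\n", encoding="utf-8")
        harness = ToyHarness(spec_path, max_messages=4)
        harness.add("user: by the way, use exponential backoff")  # chat-only decision
        for step in range(4):
            harness.add(f"tool: output of step {step}")  # fifth message triggers compaction
        return harness.context()


context_after_compaction = demo()
```

After the compaction the retry limit is still in view and the backoff instruction is gone. A real summariser would do better than this one, but it makes the same trade.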

And harnesses are starting to get a real memory. Skills — procedures kept in files — now survive compaction in Claude Code: once you've invoked one, the harness re-attaches its content after the conversation is summarised away, while anything you only said in chat takes its chances with the summariser. It took a while to get there; the issue tracker is full of agents that quietly dropped their own procedures after auto-compaction and went on to make the exact mistakes their skills warned against. The split is the one you'd expect — the how-to sticks, the conversation gets lost — so what a model keeps and what it forgets is now a design choice, made differently by every harness and changing with each release.

Sub-agents come at the same problem from the other end. Hand a task to one and it burns its own fresh context on the grubby part — the fifty file reads, the dead ends, the tool noise — then hands back a few lines of conclusion. The main agent keeps the answer and never sees the mess that made it, so its own context stays clean; the worker never lives long enough to forget. Workflows take the same trick to scale: a script fans a job across dozens of sub-agents and keeps the intermediate work in its own variables, so the main agent is left holding only the final answer. That's how you take on a job no single context could hold. The fix turns out not to be just a bigger window but more, smaller ones — the harness splitting the work up the way any team does. Same model, different amnesia. No benchmark measures behaviour after the third compaction.
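The shape of that fan-out can be sketched in a few lines. This is an illustration under assumptions, not any harness's actual API: `run_subagent` is a hypothetical stub where a real model call would go, and the fifty fake "file reads" stand in for the mess a worker generates along the way.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> str:
    """Hypothetical worker: does the noisy work in its own scope, returns one short line."""
    # Stand-in for the grubby part: file reads, dead ends, tool output.
    scratch = [f"read file {i} while working on {task}" for i in range(50)]
    # Only the conclusion leaves this function; scratch is discarded on return.
    return f"{task}: done after {len(scratch)} reads"


def fan_out(tasks: list[str], max_workers: int = 4) -> list[str]:
    """Run one sub-agent per task and collect only their conclusions, in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, tasks))


conclusions = fan_out(["auth", "billing", "search"])
```

The orchestrator ends up holding three short strings, not the hundred and fifty lines of scratch work behind them. That is the whole trick: the intermediate state lives and dies in the worker.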

Benchmarks score a model; you ship a workflow

Most launch tables score the bare model — the API on its own, with nothing built around it. That keeps the comparison clean — same prompt, same single attempt for everyone — and it makes the result unrealistic for anyone whose job is not integrating the API itself. Almost nobody meets a frontier model raw. You meet it through a harness — a CLI agent, an IDE plugin, a chat app — each with its own way of gathering what the model sees, its own hidden instructions, and its own tools. The SWE-bench leaderboard quietly admits the harness is half the result: entries are model-plus-scaffold pairs, because the same model posts different numbers depending on what is wrapped around it.

Developers learned this the hard way when the same Claude model shipped inside both GitHub Copilot and Claude Code, and Claude Code was visibly better. Cue bafflement, then the conspiracy theories — throttling, secret quantisation, a worse checkpoint for Microsoft. The mundane explanation is the damning one: the wrapper differs — different context, different instructions, different tools.

Same weights, different model in practice.

And the platform layer keeps going — voice input, memory across sessions, worktree support, agent views. None of it appears in a benchmark table; all of it changes what the system in front of you can do.

Auto mode is the latest and sharpest example. Claude Code can now make permission decisions itself: a safety classifier reviews each tool call before it runs, waves through the safe ones, blocks the destructive ones and pushes Claude towards a different approach, and escalates to a human only when Claude keeps insisting on an action the classifier won't clear. That is what makes hours-long unattended runs viable without reaching for --dangerously-skip-permissions — and notice what it admits. The product ships a second model to supervise the first, because not even the lab trusts raw one-shot output with real permissions. The benchmark scores the soloist. The product ships a chaperone. Even a perfectly clean benchmark measures the wrong unit — the engine on a dyno, when what you drive is the car in traffic.
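The control flow described above (allow, block, escalate on insistence) can be sketched as follows. This is not Claude Code's implementation. The real reviewer is described as a classifier model; the stand-in below is a crude keyword rule, and every name, marker, and threshold is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field

# Assumed stand-in for a real safety classifier: a few substrings treated as destructive.
DESTRUCTIVE_MARKERS = ("rm -rf", "git push --force", "drop table")


def classify(command: str) -> str:
    """Label a command 'destructive' if it contains a known marker, else 'safe'."""
    lowered = command.lower()
    return "destructive" if any(marker in lowered for marker in DESTRUCTIVE_MARKERS) else "safe"


@dataclass
class PermissionGate:
    """Hypothetical gate: allow safe calls, block destructive ones, escalate repeat attempts."""

    max_blocked_retries: int = 2
    blocked_counts: dict[str, int] = field(default_factory=dict)

    def review(self, command: str) -> str:
        if classify(command) == "safe":
            return "allow"
        seen = self.blocked_counts.get(command, 0) + 1
        self.blocked_counts[command] = seen
        # The agent keeps insisting on the same blocked action: hand it to a human.
        return "escalate" if seen > self.max_blocked_retries else "block"


gate = PermissionGate()
decisions = [
    gate.review("ls -la"),
    gate.review("rm -rf build"),
    gate.review("rm -rf build"),
    gate.review("rm -rf build"),
]
```

Even in toy form, the unit being evaluated is the pair: the thing proposing actions and the thing deciding whether they run.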

One-shot scores overstate the model you need

A benchmark hands the model the whole problem and grades a single pass, so it rewards raw capability — and quietly teaches you that you need the most expensive model for everything.

Structured development changes the requirement — and it isn't something you buy from a lab. Spec-driven development is an open practice. Apply it and the work decomposes. Opus drafts the spec — the design decisions, the choices that compound if wrong — and Sonnet implements against it, because implementing a clear spec is a far easier problem than one-shotting the feature. A mid-tier model inside a disciplined spec-and-review loop will out-ship a frontier model driven freehand. Same task, different decomposition, different capability requirement — and the benchmark cannot see it, because the benchmark never decomposes. It answers the question "which model can do this alone, blind, in one pass". Your workflow stopped asking that a while ago.

The benchmark: one model, one pass, blind, alone.

Your workflow: Opus drafts the spec, Sonnet implements, spec + review loop, decomposed.

What a benchmark is actually for

A smoke test and a coarse ordering. It tells you the model isn't broken and roughly which tier it sits in. That is the suggestion, and it is worth precisely that much — no more. The only benchmark with predictive power for your work is your work. Take the dullest tickets from your last month — the migration, the flaky test, the refactor nobody wanted — and run the candidate model against them, inside the workflow you actually use. Twenty private tasks that never leave your machine beat every public leaderboard, because nobody can train on them.
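A private suite does not need much machinery. The sketch below is a minimal starting point under assumptions: `Task`, `run_private_suite`, and the stub candidate are hypothetical names, and in practice `candidate` would wrap whatever model and harness you actually use, with checks that run your real tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One private task: a prompt plus a check that decides whether the output passes."""

    name: str
    prompt: str
    check: Callable[[str], bool]


def run_private_suite(candidate: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run the candidate on every task; a crash in the candidate or the check counts as a fail."""
    results: dict[str, bool] = {}
    for task in tasks:
        try:
            results[task.name] = bool(task.check(candidate(task.prompt)))
        except Exception:
            results[task.name] = False
    return results


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks passed (0.0 for an empty suite)."""
    return sum(results.values()) / len(results) if results else 0.0


def stub_candidate(prompt: str) -> str:
    """Placeholder for a real model call; it just upper-cases the prompt."""
    return prompt.upper()


# Toy tasks standing in for last month's dull tickets.
tasks = [
    Task("rename", "rename foo", lambda out: out == "RENAME FOO"),
    Task("migration", "add column", lambda out: "ALTER TABLE" in out),
]

results = run_private_suite(stub_candidate, tasks)
score = pass_rate(results)
```

Keep the tasks and their checks out of any public repository. The moment they can be scraped, they start drifting towards the same fate as every other benchmark.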

When the next launch lands, skip the table. Give the model your Tuesday.