A real bug is a truer benchmark than a leaderboard
Every model arrives with a scorecard: SWE-bench, HumanEval, a contest rating. Those numbers are real, but they share a shape — a self-contained puzzle, a known answer, one shot, graded on the diff. They measure deduction under certainty. They tell you almost nothing about the thing you actually depend on in an agent: what a model does when it's stuck on a real problem, inside a running system, with no answer key and a feedback loop that lies to it.
You can't put that on a leaderboard, because the moment a task is clean enough to score automatically, it stops resembling the work. So we mostly fly blind on the property that matters most.
This post is about one time we didn't. A user's app broke in a specific, reproducible way, and through model routing we ended up running two different coders against it — same prompt, same broken file, same live app, same verifier. Not a synthetic eval. A natural experiment. One model spent 51 steps and its entire budget and never landed the fix; the other landed it cleanly in 19. The interesting part isn't the score. It's why the gap exists — because the bug turned out to be almost perfectly engineered to separate two kinds of model, and a leaderboard would have rated them nearly tied.
To see what it measured, you have to understand the bug. So we'll go through it properly — the actual graphics, the actual formulas — and then read the two traces against each other.
The problem under test: one mesh per chunk
A user was building a Minecraft-style voxel world in React Three Fiber. The first generation was good — a textured, lit, playable world. Then came the optimization every voxel engine eventually needs, and it turned the entire scene black.
A naïve voxel renderer gives every block its own mesh, so the GPU issues one draw
call per block — thousands per frame. The fix is merged chunk geometry: pack
every visible face of a 16×16×N chunk into a single BufferGeometry and draw the
chunk in one call. You hand-build three vertex attributes — position, normal,
uv — plus an index buffer.
That hand-build is where correctness gets fragile, because a lit material doesn't show you positions. It shows you light. And light depends on the normals, and the normals — in the naïve approach — depend on the winding order of your triangles. That chain is the whole story, and it's the reason this bug is a good benchmark: there is no closed-form answer to read off. The missing fact lives in the running pipeline.
The shading math, and where black comes from
The blocks used MeshLambertMaterial — diffuse Lambert shading. For one light,
the outgoing color of a fragment is:
C_out = C_tex · ( k_a·A + k_d · max(0, N̂ · L̂) )
- C_tex: the texel sampled from the atlas at the fragment's UV
- N̂: the surface normal (unit)
- L̂: the direction to the light (unit)
- k_a·A: the ambient term
- k_d · max(0, N̂ · L̂): the diffuse term
The diffuse term is a clamped dot product. The clamp is the trap: if N̂ points
away from the light — or worse, into the block — then N̂·L̂ ≤ 0, the
max(0, …) zeroes it, and you're left with only ambient:
N̂ pointing inward ⇒ N̂·L̂ ≤ 0 ⇒ C_out = C_tex · k_a · A
And ambient was tiny (more on that below), so C_out ≈ 0. Black. Not
missing, not untextured — fully shaded black, which looks like a lighting bug, or
a texture bug, or anything you want it to be.
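To make the clamp concrete, here is a minimal sketch of the formula above for a single colour channel. The function name `lambert`, the constants (`ka = 0.05`, `kd = 1`) and the overhead light direction are illustrative assumptions, not values taken from the app.

```javascript
// Single-light Lambert shading for one colour channel (illustrative sketch).
const dot = (a, b) => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];

// ka, ambient and kd are assumed values chosen only to show the effect.
function lambert(texel, normal, lightDir, { ka = 0.05, ambient = 1, kd = 1 } = {}) {
  const diffuse = kd * Math.max(0, dot(normal, lightDir)); // the clamp
  return texel * (ka * ambient + diffuse);
}

const lightDir = [0, 1, 0];                         // unit vector toward an overhead light
const outward = lambert(0.8, [0, 1, 0], lightDir);  // top face, normal pointing up
const inward = lambert(0.8, [0, -1, 0], lightDir);  // same face, normal flipped into the block

console.log({ outward, inward });
```

With the normal flipped, the diffuse term clamps to zero and only the small ambient contribution survives: roughly 0.04 instead of roughly 0.84 for the same texel.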
So: which way do the normals point? In the naïve build, you never say. You let Three.js infer them:
geo.setIndex([0,1,2, 1,3,2]) // two triangles per quad
geo.computeVertexNormals() // normals INFERRED from those triangles
And computeVertexNormals() computes each triangle's normal from the cross
product of its edges, in index order:
N = normalize( (v1 − v0) × (v2 − v0) )
The cross product is antisymmetric, a × b = −(b × a), so swapping two
indices flips the normal's sign. [0,1,2] and [0,2,1] describe the same
three points and opposite normals. The winding order of your index buffer
is, silently, your lighting.
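The sign flip is easy to check numerically. The sketch below is plain JavaScript, not Three.js code: `faceNormal` is a hypothetical helper that applies the cross-product formula above to a single triangle on the top of a unit cube. It leaves out the per-vertex accumulation that computeVertexNormals() also performs.

```javascript
const sub = (a, b) => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const cross = (a, b) => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0],
];

// Hypothetical helper: unit normal of one triangle, taken in index order.
function faceNormal(verts, [i0, i1, i2]) {
  const n = cross(sub(verts[i1], verts[i0]), sub(verts[i2], verts[i0]));
  const len = Math.hypot(n[0], n[1], n[2]);
  return n.map((c) => c / len + 0); // "+ 0" normalises -0 to 0 for readable output
}

// Three corners of a unit cube's top face (y = +0.5).
const verts = [
  [-0.5, 0.5, -0.5],
  [0.5, 0.5, -0.5],
  [-0.5, 0.5, 0.5],
];

const normalA = faceNormal(verts, [0, 1, 2]); // points down, into the block
const normalB = faceNormal(verts, [0, 2, 1]); // same points, opposite winding: points up

console.log(normalA, normalB);
```

Same three points, two index orders, two opposite normals. Nothing in the position data changes.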
Winding has a second consequence. With side: FrontSide, Three.js enables WebGL's
backface culling, which keeps a triangle only when it is wound counter-clockwise
as the camera sees it; the test is the sign of its projected screen-space area.
The same index order that flips the normal also flips the facing. So one wrong
winding does two things at once: it flips the normal (→ Lambert black) and flips
the facing (→ backface-culled, the "I only see thin/single surfaces" symptom).
One root cause, two completely different-looking bugs.
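The facing test can be sketched the same way. This is a simplified illustration, assuming the triangle has already been projected to 2D screen coordinates with x to the right and y up; `signedArea2D` is a hypothetical name, and the GPU performs the equivalent test in hardware.

```javascript
// Signed area of a projected triangle.
// Positive: counter-clockwise, front-facing under WebGL's default settings.
// Negative: clockwise, culled when back faces are being culled.
function signedArea2D([ax, ay], [bx, by], [cx, cy]) {
  return 0.5 * ((bx - ax) * (cy - ay) - (cx - ax) * (by - ay));
}

const p = [[0, 0], [1, 0], [0, 1]];

const areaCCW = signedArea2D(p[0], p[1], p[2]); // original index order
const areaCW = signedArea2D(p[0], p[2], p[1]);  // two indices swapped

console.log({ areaCCW, areaCW });
```

Swapping two indices changes the sign of the area, which is the same swap that negates the cross product in the normal computation.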
This is an inference chain: a value you care about (lit color) is derived,
link by link, from a value that's easy to get subtly wrong (winding), through a
step that hides the error (computeVertexNormals). Debugging it means proving
every link correct, by hand, for all six cube faces.
Why this bug is a discriminating benchmark
A good benchmark separates the things it tests. Most coding evals don't separate much — a stronger reasoner just scores higher across the board. This bug separates, and it does so because of two properties it happened to combine.
First: it was overdetermined — three independent, each-sufficient reasons to be black. Any one alone produces the same image.
| # | Cause | The physics | Fix |
|---|---|---|---|
| 1 | Inverted winding | N̂·L̂ ≤ 0 → diffuse 0; also backface-culled | correct winding or explicit normals |
| 2 | MeshStandardMaterial (PBR) | a metal-rough BRDF with roughness 0.85 and no environment map renders dark under a single light; the image-based lighting (IBL) that PBR relies on was absent | use MeshLambertMaterial |
| 3 | Dawn lighting | day cycle started at t=0 → sun on the horizon | start at midday / raise floor |
Cause 3 hides in plain sight. The day cycle drove the sun height as sin(t) with
t = 0 at load, so the directional light came up at max(0, sin 0)·3.5 + 0.2 = 0.2 — the sun exactly on the horizon — with a duplicate hard-coded
<ambientLight> fighting the cycle. Effective illumination ≈ 0.05. The scene
loaded at dawn; even perfect geometry renders near-black at t = 0.
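The arithmetic is quick to verify. This sketch assumes the directional light's intensity follows the expression quoted above and that midday corresponds to t = π/2; the function name `sunIntensity` is hypothetical.

```javascript
// Directional-light intensity as a function of day-cycle time t (radians).
// The 3.5 scale and 0.2 floor come from the expression quoted in the text.
function sunIntensity(t) {
  return Math.max(0, Math.sin(t)) * 3.5 + 0.2;
}

const atLoad = sunIntensity(0);             // cycle starts at t = 0
const atMidday = sunIntensity(Math.PI / 2); // assumed midday

console.log({ atLoad, atMidday });
```

At load the light sits at its 0.2 floor; at the assumed midday it reaches 3.7, which is why starting the cycle later (or raising the floor) is listed as the fix for cause 3.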
Overdetermination is the cruelest property a bug can have, because it destroys your feedback loop. Fix the winding → still black (PBR + dawn). Fix the material → still black (winding + dawn). Every correct fix looks like a failed fix. You cannot climb a gradient when every step reads as zero. A model that treats debugging as "reason harder, then patch" gets no signal that it's making progress.
Second: despite all that, the bug was cheaply isolable — one material swap, as we'll see, collapses the whole search. So the task quietly asks one question: when the screen gives you no gradient, do you keep reasoning, or do you run an experiment to manufacture a gradient? A pure deductive policy fails here; an interactive one walks right through. That's the separator. A leaderboard built from one-shot puzzles can't pose this question at all.
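A toy model shows why single fixes give no signal. It is a deliberately crude sketch: the cause names are labels for the three table rows, and `rendersBlack` is a hypothetical predicate that treats the screen as black while any one cause remains.

```javascript
const CAUSES = ['winding', 'pbrMaterial', 'dawnLighting'];

// Each cause is sufficient on its own, so the screen is black while ANY remains.
const rendersBlack = (present) => CAUSES.some((c) => present.has(c));

// Apply each correct fix in isolation and record what the screen reports.
const afterOneFix = CAUSES.map((fixed) => {
  const remaining = new Set(CAUSES);
  remaining.delete(fixed);
  return { fixed, stillBlack: rendersBlack(remaining) };
});

const afterAllFixes = rendersBlack(new Set());

console.log(afterOneFix, afterAllFixes);
```

Every individually correct fix still reports black, so the screen cannot rank the three hypotheses. Only an observation that bypasses some of the causes, such as an unlit material, can tell them apart.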
What the first model actually did
Watch the strong open-weights coder work it and you see a sharp engineer trapped in exactly that zero-gradient. It went straight at the hardest link in the chain — proving the winding by hand — and it kept getting the sign wrong, in both directions, on different attempts. Two real quotes from its trace, hours apart:
…so my "fix" actually made it worse — the original
0-1-2 was correct for the top face!
All 6 faces have inverted winding. The fix is simple: change indices from
[0,1,2],[1,3,2] to [0,2,1],[1,2,3].
Both can't be true. It hand-traced the cross product for the top face — v0 = (−½,½,−½), v1 = (½,½,−½), v2 = (−½,½,½) — got (v1−v0)×(v2−v0) = (0,−1,0),
correctly noticed that points down into the block, and then on the next pass
re-derived it the other way. It cycled the same theories — winding, then
"computeVertexNormals produces zero-length normals on degenerate triangles,"
then "the CanvasTexture isn't uploaded to the GPU," then the dawn lighting, then
PBR, then a duplicate ambient light, then outputColorSpace not being forwarded
by R3F, then drei's <Sky> overriding the background. Several of those
sub-diagnoses were correct. It announced "Now I have the full picture" five
separate times, each a different picture, and exhausted its step budget at 51
steps with the fix never landing.
It wasn't dumb. It was doing the right activity (reason about each link) on a problem where that activity can't converge: an overdetermined bug gives no gradient, and the specific link it fixated on (winding ⇄ normal sign) is a hand-traced derivation it could not keep consistent across attempts. This is precisely the failure mode a leaderboard can't expose — on a clean, single-cause puzzle this same model is excellent.
What the second model did differently
We re-pointed the coder slot at GPT-5.3 Codex. It did not win the winding argument. It refused to have the argument. From its trace:
I'm going to remove winding/normal ambiguity entirely by rebuilding chunk faces as non-indexed triangles with explicit per-face normals.
That single sentence is the whole fix, and it's a representation change. A cube
face's normal is not something to infer — it's a known constant: +Y for the
top, −Z for the back, +X for the right. So write it down instead of
letting computeVertexNormals reconstruct it from winding you might have wrong.
The fragile build (the open-weights model's, and the original):
// 4 shared verts per quad, 2 triangles via an index buffer
geo.setAttribute('position', new THREE.BufferAttribute(pos, 3))
geo.setAttribute('uv', new THREE.BufferAttribute(uv, 2))
geo.setIndex([0,1,2, 1,3,2]) // ← winding decides normals AND facing
geo.computeVertexNormals() // ← normals INFERRED; a winding bug silently corrupts them
const mat = new THREE.MeshStandardMaterial({ map: atlas, roughness: 0.85 }) // ← PBR → dark, cause #2
The robust build (Codex's):
// 6 verts per quad (two triangles, NON-indexed), explicit constant normal
const FACE_NORMAL = { top:[0,1,0], bottom:[0,-1,0], north:[0,0,-1],
                      south:[0,0,1], east:[1,0,0], west:[-1,0,0] }
for (const [a,b,c] of [[0,1,2],[0,2,3]]) { // one fixed CCW triangulation, written once
  for (const i of [a,b,c]) {
    pos.push(...quad[i])
    nrm.push(...FACE_NORMAL[face]) // ← the normal, stated, not derived
    uv.push(...cellUV[i])
  }
}
// nrm is a plain array, so use Float32BufferAttribute (BufferAttribute requires a typed array)
geo.setAttribute('normal', new THREE.Float32BufferAttribute(nrm, 3)) // explicit
const mat = new THREE.MeshLambertMaterial({ map: atlas }) // lit, cheap, correct given good normals
Look at what this removes. With the normal stated explicitly:
N̂ is a constant per face ⇒ N̂·L̂ is correct regardless of winding
⇒ the entire winding → normal → black chain is GONE
Winding can now only affect facing (cull), not lighting — and a culled face is a loud, obvious symptom, not a silent dark one. The non-indexed layout (6 verts, no shared index) also means each triangle's normal is independent — no averaging across a shared vertex can split the difference between two faces. The bug class isn't fixed; it's made unrepresentable.
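Since winding still controls culling, a build step can check that the fixed triangulation agrees with the stated normal. The sketch below is plain JavaScript with hypothetical helpers (`pushFace`, `windingAgrees`) and an assumed corner order for the top face; it is one possible guard, not code from either trace.

```javascript
// Stated normals per face, as in the build above.
const FACE_NORMAL = {
  top: [0, 1, 0], bottom: [0, -1, 0], north: [0, 0, -1],
  south: [0, 0, 1], east: [1, 0, 0], west: [-1, 0, 0],
};
const TRIANGLES = [[0, 1, 2], [0, 2, 3]]; // fixed triangulation of a 4-corner quad

// Hypothetical helper: append one quad as six non-indexed vertices.
function pushFace(out, quad, face) {
  for (const tri of TRIANGLES) {
    for (const i of tri) {
      out.position.push(...quad[i]);
      out.normal.push(...FACE_NORMAL[face]); // stated, never derived
    }
  }
}

// Hypothetical guard: does each triangle's geometric normal point the stated way?
function windingAgrees(quad, face) {
  const n = FACE_NORMAL[face];
  return TRIANGLES.every(([a, b, c]) => {
    const e1 = quad[b].map((v, k) => v - quad[a][k]);
    const e2 = quad[c].map((v, k) => v - quad[a][k]);
    const cr = [
      e1[1] * e2[2] - e1[2] * e2[1],
      e1[2] * e2[0] - e1[0] * e2[2],
      e1[0] * e2[1] - e1[1] * e2[0],
    ];
    return cr[0] * n[0] + cr[1] * n[1] + cr[2] * n[2] > 0;
  });
}

// Assumed corner order for the top face: counter-clockwise when viewed from +Y.
const topQuad = [[-0.5, 0.5, -0.5], [-0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, -0.5]];

const out = { position: [], normal: [] };
pushFace(out, topQuad, 'top');

const agrees = windingAgrees(topQuad, 'top');
const reversedAgrees = windingAgrees([...topQuad].reverse(), 'top');

console.log(out.position.length, out.normal.length, agrees, reversedAgrees);
```

The lighting no longer depends on the check passing, because the normals are written down either way. A failure only means the face would be culled, which is the loud symptom rather than the silent one.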
Then Codex did the second thing the first model never did — it isolated the remaining variable. From its trace: "I'm adding a true unlit debug toggle now so we can isolate texture-atlas sampling from lighting." In code, that's one material swap:
// MeshBasicMaterial ignores N̂·L̂ entirely: C_out = C_tex
const debug = new THREE.MeshBasicMaterial({ map: atlas })
MeshBasicMaterial drops every lighting term: C_out = C_tex, no normals, no
lights. So:
- unlit shows correct textures → the atlas/UV path is fine, the bug is lighting/normals
- unlit also shows wrong colors → the bug is atlas/UV (a flipY or cell-offset error)
One toggle collapses a two-variable search (is it the texture path or is it the lighting?) into two one-variable observations. That's the single most basic move in graphics debugging, and it's the one the spiraling model never made because it was busy re-deriving cross products. It is also, exactly, the move a leaderboard never asks for: there's nothing to deduce, only something to run.
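The logic of the experiment fits in a few lines. This is a scalar toy model, not renderer code: `shadeUnlit` and `shadeLit` stand in for the two materials, and `isolate` with its 0.1 darkness threshold is a hypothetical decision rule.

```javascript
const shadeUnlit = (texel) => texel;              // MeshBasicMaterial-style: C_out = C_tex
const shadeLit = (texel, light) => texel * light; // Lambert-style: C_out = C_tex * light term

// Hypothetical decision rule: look at the unlit result first, then the lit one.
function isolate(texel, light, darkBelow = 0.1) {
  if (shadeUnlit(texel) < darkBelow) return 'atlas/UV';
  if (shadeLit(texel, light) < darkBelow) return 'lighting/normals';
  return 'ok';
}

const badLighting = isolate(0.8, 0.05); // good texel, almost no light
const badAtlas = isolate(0.0, 1.0);     // texel sampled wrong, light is fine
const healthy = isolate(0.8, 1.0);

console.log({ badLighting, badAtlas, healthy });
```

The unlit branch never reads the light term, so whatever it reports cannot be blamed on normals, winding, or the day cycle.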
The ledger, same prompt and same broken file for both:
| Coder | Coder calls | Result |
|---|---|---|
| Open-weights coder | 51 | budget-exhausted — fix never landed |
| GPT-5.3 Codex | 31 | all-subtasks-terminal at step 14, then 19 — clean |
The benchmark only existed because the verifier was blind
There's a reason a model had to debug this blind in the first place, and it's the same reason it makes such a clean benchmark. The platform's generation verifier is deliberately objective: it type-checks, resolves imports, probes the API, runs the unit tests, and browses the page in real Chromium watching for thrown exceptions. For this app it returned, every single time:
verify_page /: renders OK (title "App")
True. The page loaded. A <canvas> painting all-black pixels throws nothing —
it's a perfectly healthy canvas full of C_out ≈ 0. Every gate was a
text-and-structure gate, and a rendered image is neither. So the only ground
truth that could distinguish a working build from a broken one lived in the pixels
— which is exactly why this task measures interaction policy and not deduction. We
added a vision-QA pass: screenshot the canvas, ask a vision model one narrow
question — is this objectively broken? It came back with the sentence no text
check could produce: "everything is dark grayscale with no visible colors." That
turned the agent's flailing from blind into merely hard — and gave us a graded
task with a real, image-grounded answer key.
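A cheap numeric pre-check can flag the same symptom before a vision model is involved. The sketch below is an assumption, not what the platform ran: `looksDarkGrayscale` is a hypothetical function over raw RGBA bytes (for example from a canvas readback), and both thresholds are illustrative.

```javascript
// Flags a frame whose pixels are, on average, both dark and nearly colourless.
// rgba: Uint8ClampedArray of [r, g, b, a, r, g, b, a, ...] bytes.
// maxLuma and maxChroma are illustrative thresholds, not tuned values.
function looksDarkGrayscale(rgba, { maxLuma = 40, maxChroma = 12 } = {}) {
  const pixels = rgba.length / 4;
  let luma = 0;
  let chroma = 0;
  for (let i = 0; i < rgba.length; i += 4) {
    const r = rgba[i], g = rgba[i + 1], b = rgba[i + 2];
    luma += 0.2126 * r + 0.7152 * g + 0.0722 * b;      // Rec. 709 weights, applied to encoded values
    chroma += Math.max(r, g, b) - Math.min(r, g, b);   // 0 for pure grey
  }
  return luma / pixels < maxLuma && chroma / pixels < maxChroma;
}

// Two-pixel stand-ins for real screenshots.
const brokenFrame = new Uint8ClampedArray([5, 5, 5, 255, 8, 8, 8, 255]);
const healthyFrame = new Uint8ClampedArray([60, 160, 50, 255, 120, 90, 60, 255]);

const brokenFlagged = looksDarkGrayscale(brokenFrame);
const healthyFlagged = looksDarkGrayscale(healthyFrame);

console.log({ brokenFlagged, healthyFlagged });
```

A check like this only catches the specific dark, colourless failure described here. A vision model can answer the broader question of whether a frame is objectively broken.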
A fingerprint of the objective
The two models knew the same graphics — the open-weights coder correctly named the PBR material, the dawn lighting, the duplicate ambient. On a knowledge benchmark they'd score the same. What separated them was the policy for when stuck, and the plan ledger records that policy in the bluntest possible form: the names of the subtasks each model created.
The open-weights model's trail, in order:
fix-chunk-material → fix-merged-material → fix-normals-and-material
→ fix-winding-texture-normals → render-fix-v4-merged
Five consecutive fix- attempts at the same target: re-attack the patch,
re-derive the winding, try again. There is no information-gathering step anywhere
in it. Codex's trail:
render-debug-isolation → render-fix-v5-rewrite
The first is an experiment — a subtask whose purpose is to learn, not to
ship (the unlit MeshBasicMaterial toggle). The second is a rewrite, not a
patch. Across 21 render-related subtasks in the project's whole history, the word
"isolation" appears exactly once, and it's Codex's. The same is true of "rewrite."
The numbers agree from two more angles:
| Signal | Open-weights coder | GPT-5.3 Codex |
|---|---|---|
| Steps → outcome | 51 → budget-exhausted (no fix) | 14, then 19 → all-subtasks-terminal (clean) |
| Per-step output tokens | median 74, max 6,400 (re-derivation bursts) | median 257, max 4,511 (steady) |
| First move when stuck | re-derive the cross product | isolate the variable |
More steps, worse result. And the shape is diagnostic: the open-weights output was bimodal — terse tool-pokes punctuated by 6,400-token analysis walls, twice reaching opposite conclusions on the same winding derivation — while Codex's was even, because it spent its tokens advancing rather than re-opening.
Read those three signals together and they point at one thing — the thing a true, environment-grounded benchmark can see and a leaderboard can't. There are problems solvable by deduction — a proof, a type error; more thinking is the path — and problems solvable only by interaction — a stateful GPU pipeline, where the missing fact lives in the running system and is reachable only by running an experiment. A model post-trained on long-horizon agentic tasks with execution feedback learns that an action's value is the uncertainty it removes; it reaches for the isolation toggle and the representation rewrite. A model whose strength is competition-grade reasoning brings that strength to every problem — including the ones where re-deriving the cross product a sixth time is motion without progress. The voxel bug is a near-perfect separator because it is both overdetermined (no deductive gradient) and cheaply isolable (one material swap): it rewards exactly the policy agentic training instills and punishes exactly the one pure reasoning instills.
The honest caveat, worth keeping: this is one trace, read from the outside. Sampling, the harness, context limits, and raw model scale all confound it, and none of it is a claim about weights. A single real bug is a sample size of one — which is the price of measuring the property leaderboards can't. But the pattern — debug by experiment, not re-derivation; restructure so the bug is unrepresentable, not out-argue it — is reproducible, and it's what the public direction of coding-agent training predicts. The most honest benchmark we ran this month wasn't a benchmark at all. It was one user's black screen, and it told us something the scorecards never would: which model to reach for when the screen stops giving answers.