
A real bug is a truer benchmark than a leaderboard

Every model ships with leaderboard numbers. They measure one-shot deductive puzzles — and tell you almost nothing about how a model behaves when it's stuck on a real problem inside a live system. We got an unplanned head-to-head: the same prompt, the same broken file, the same running app, two models. One burned its entire budget over 51 steps re-deriving cross products by hand and never fixed it; the other fixed it in 19. The gap wasn't raw intelligence. It was *policy when stuck* — and the bug was engineered, by accident, to measure exactly that.

Every model arrives with a scorecard: SWE-bench, HumanEval, a contest rating. Those numbers are real, but they share a shape — a self-contained puzzle, a known answer, one shot, graded on the diff. They measure deduction under certainty. They tell you almost nothing about the thing you actually depend on in an agent: what a model does when it's stuck on a real problem, inside a running system, with no answer key and a feedback loop that lies to it.

You can't put that on a leaderboard, because the moment a task is clean enough to score automatically, it stops resembling the work. So we mostly fly blind on the property that matters most.

This post is about one time we didn't. A user's app broke in a specific, reproducible way, and through model routing we ended up running two different coders against it — same prompt, same broken file, same live app, same verifier. Not a synthetic eval. A natural experiment. One model spent 51 steps and its entire budget and never landed the fix; the other landed it cleanly in 19. The interesting part isn't the score. It's why the gap exists — because the bug turned out to be almost perfectly engineered to separate two kinds of model, and a leaderboard would have rated them nearly tied.

To see what it measured, you have to understand the bug. So we'll go through it properly — the actual graphics, the actual formulas — and then read the two traces against each other.

The problem under test: one mesh per chunk

A user was building a Minecraft-style voxel world in React Three Fiber. The first generation was good — a textured, lit, playable world. Then came the optimization every voxel engine eventually needs, and it turned the entire scene black.

A naïve voxel renderer gives every block its own mesh, so the GPU issues one draw call per block — thousands per frame. The fix is merged chunk geometry: pack every visible face of a 16×16×N chunk into a single BufferGeometry and draw the chunk in one call. You hand-build three vertex attributes — position, normal, uv — plus an index buffer.

That hand-build is where correctness gets fragile, because a lit material doesn't show you positions. It shows you light. And light depends on the normals, and the normals — in the naïve approach — depend on the winding order of your triangles. That chain is the whole story, and it's the reason this bug is a good benchmark: there is no closed-form answer to read off. The missing fact lives in the running pipeline.

The shading math, and where black comes from

The simplest lit material to reason about is MeshLambertMaterial, plain diffuse Lambert shading, and it is the one the final build uses. For one light, the outgoing color of a fragment is:

C_out = C_tex · ( k_a·A  +  k_d · max(0, N̂ · L̂) )
                  └ ambient ┘   └─── diffuse ───┘
  • C_tex: the texel sampled from the atlas at the fragment's UV
  • N̂: the surface normal (unit); L̂: the direction to the light (unit)
  • k_a·A: the ambient term; k_d·max(0, N̂·L̂): the diffuse term

The diffuse term is a clamped dot product. The clamp is the trap: if N̂ points away from the light (or worse, into the block), then N̂·L̂ ≤ 0, the max(0, …) zeroes it, and you're left with only ambient:

N̂ pointing inward   ⇒   N̂·L̂ ≤ 0   ⇒   C_out = C_tex · k_a · A

And ambient was tiny (more on that below), so C_out ≈ 0. Black. Not missing, not untextured — fully shaded black, which looks like a lighting bug, or a texture bug, or anything you want it to be.
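To make the clamp concrete, here is a minimal sketch of the formula above in plain JavaScript. The texel color, the light direction, and the values of k_a, A, and k_d are illustrative assumptions, not values taken from the user's app.

```javascript
// Lambert shading for one light, applied per color channel.
// cTex: [r, g, b] in 0..1; n and l: unit vectors; ka, A, kd: scalars.
function dot(a, b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

function lambert(cTex, n, l, { ka, A, kd }) {
  const diffuse = kd * Math.max(0, dot(n, l)); // the clamp: never negative
  const light = ka * A + diffuse;              // ambient + diffuse
  return cTex.map((c) => c * light);
}

// Assumed inputs for illustration only.
const grass = [0.3, 0.7, 0.2];             // hypothetical texel
const sunDir = [0, 1, 0];                  // light straight overhead
const params = { ka: 0.05, A: 1, kd: 1 };  // tiny ambient, full diffuse

const outward = lambert(grass, [0, 1, 0], sunDir, params);  // normal faces the light
const inward = lambert(grass, [0, -1, 0], sunDir, params);  // normal points into the block
```

With these assumed numbers the inward-facing normal keeps only the ambient term, so every channel drops to 5% of the texel value: the same texture, shaded almost to black.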

So: which way do the normals point? In the naïve build, you never say. You let Three.js infer them:

geo.setIndex([0,1,2,  1,3,2])   // two triangles per quad
geo.computeVertexNormals()       // normals INFERRED from those triangles

And computeVertexNormals() computes each triangle's normal from the cross product of its edges, in index order:

N = normalize( (v1 − v0) × (v2 − v0) )

The cross product is antisymmetric (a × b = −(b × a)), so swapping two indices flips the normal's sign. [0,1,2] and [0,2,1] describe the same three points and opposite normals. The winding order of your index buffer is, silently, your lighting.
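The sign flip is easy to check numerically. The sketch below applies the formula above to a single triangle; it is a standalone illustration, not Three.js's actual implementation (which also accumulates the result per vertex), and the helper names are hypothetical.

```javascript
function sub(a, b) {
  return [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
}

function cross(a, b) {
  return [
    a[1] * b[2] - a[2] * b[1],
    a[2] * b[0] - a[0] * b[2],
    a[0] * b[1] - a[1] * b[0],
  ];
}

function normalize(v) {
  const len = Math.hypot(v[0], v[1], v[2]);
  return v.map((x) => x / len);
}

// N = normalize((v1 - v0) x (v2 - v0)), taking the vertices in index order.
function triangleNormal(verts, [i0, i1, i2]) {
  return normalize(cross(sub(verts[i1], verts[i0]), sub(verts[i2], verts[i0])));
}

// A triangle lying in the XY plane.
const verts = [[0, 0, 0], [1, 0, 0], [0, 1, 0]];

const nA = triangleNormal(verts, [0, 1, 2]); // z component is +1
const nB = triangleNormal(verts, [0, 2, 1]); // z component is -1
```

Same three points, opposite z: the only thing that changed is the order of two indices.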

Here's the same coin's other face. WebGL's backface culling (side: FrontSide) keeps a triangle only when it's wound counter-clockwise as the camera sees it — the sign of its projected screen-space area. The same index order that flips the normal also flips the facing. So one wrong winding does two things at once: it flips the normal (→ Lambert black) and flips the facing (→ backface-culled, the "I only see thin/single surfaces" symptom). One root cause, two completely different-looking bugs.
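The facing test can be sketched the same way. This assumes the default convention that counter-clockwise triangles are front-facing, and 2D coordinates with x to the right and y up (as in normalized device coordinates, not pixel rows that grow downward). The function names are hypothetical.

```javascript
// Signed area of a projected triangle: positive when p0 -> p1 -> p2 runs
// counter-clockwise, negative when it runs clockwise.
function signedArea2D(p0, p1, p2) {
  return 0.5 * (
    (p1[0] - p0[0]) * (p2[1] - p0[1]) -
    (p2[0] - p0[0]) * (p1[1] - p0[1])
  );
}

// With front faces defined as CCW, a non-positive area means the face is culled.
function isFrontFacing(p0, p1, p2) {
  return signedArea2D(p0, p1, p2) > 0;
}

const a = [0, 0], b = [1, 0], c = [0, 1];

const keptAsWritten = isFrontFacing(a, b, c);   // true: CCW on screen
const keptWhenSwapped = isFrontFacing(a, c, b); // false: same points, CW
```

Swapping the last two indices negates the area, which is the 2D counterpart of the flipped cross product.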

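[Diagram: index winding ([0,1,2] vs [0,2,1]) feeds two chains. Through (v1−v0)×(v2−v0) it sets the vertex normal N̂, which drives Lambert diffuse max(0, N̂·L̂) and ends in black faces. Through the signed screen area it sets front/back facing, which the FrontSide cull turns into "visible?" and ends in thin / missing faces.]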

This is an inference chain: a value you care about (lit color) is derived, link by link, from a value that's easy to get subtly wrong (winding), through a step that hides the error (computeVertexNormals). Debugging it means proving every link correct, by hand, for all six cube faces.

Why this bug is a discriminating benchmark

A good benchmark separates the things it tests. Most coding evals don't separate much — a stronger reasoner just scores higher across the board. This bug separates, and it does so because of two properties it happened to combine.

First: it was overdetermined — three independent, each-sufficient reasons to be black. Any one alone produces the same image.

# | Cause | The physics | Fix
1 | Inverted winding | N̂·L̂ ≤ 0 → diffuse 0; also backface-culled | correct winding or explicit normals
2 | MeshStandardMaterial (PBR) | a metal-rough BRDF with roughness 0.85 and no environment map gets almost no diffuse energy from a single light; the image-based lighting it relies on was absent | use MeshLambertMaterial
3 | Dawn lighting | day cycle started at t=0 → sun on the horizon | start at midday / raise floor

Cause 3 hides in plain sight. The day cycle drove the sun height as sin(t) with t = 0 at load, so the directional light came up at max(0, sin 0)·3.5 + 0.2 = 0.2 — the sun exactly on the horizon — with a duplicate hard-coded <ambientLight> fighting the cycle. Effective illumination ≈ 0.05. The scene loaded at dawn; even perfect geometry renders near-black at t = 0.
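The arithmetic above is small enough to run. This sketch reproduces the intensity expression from the paragraph; the function name and option names are hypothetical, and treating t as radians with midday at π/2 is an assumption that follows from the sin(t) form.

```javascript
// Directional light intensity as described above: max(0, sin t) * 3.5 + 0.2.
function sunIntensity(t, { peak = 3.5, floor = 0.2 } = {}) {
  return Math.max(0, Math.sin(t)) * peak + floor;
}

const atLoad = sunIntensity(0);             // 0.2: only the floor is left
const atMidday = sunIntensity(Math.PI / 2); // about 3.7: peak plus floor

// The two fixes named in the table: start the cycle at midday, or raise the floor.
const raisedFloor = sunIntensity(0, { floor: 1.0 }); // 1.0 at load (assumed value)
```

At load the light contributes roughly one eighteenth of its midday intensity, before the material or the normals have any say.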

Overdetermination is the cruelest property a bug can have, because it destroys your feedback loop. Fix the winding → still black (PBR + dawn). Fix the material → still black (winding + dawn). Every correct fix looks like a failed fix. You cannot climb a gradient when every step reads as zero. A model that treats debugging as reason harder, then patch gets no signal that it's making progress.
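A toy model shows why the feedback loop goes flat. It deliberately reduces the three causes to booleans (an assumption for illustration; the real pipeline is continuous) and asks what the screen looks like after each possible set of fixes.

```javascript
// Toy model: the scene looks lit only when all three causes are fixed.
const CAUSES = ["winding", "material", "lighting"];

function looksBlack(fixed) {
  return !CAUSES.every((cause) => fixed.has(cause));
}

// Fix exactly one cause at a time: every result is still black.
const afterSingleFix = CAUSES.map((cause) => looksBlack(new Set([cause])));

// Fix any two: still black.
const afterPairFix = CAUSES.map((skipped) =>
  looksBlack(new Set(CAUSES.filter((cause) => cause !== skipped)))
);

// Only the full set changes the picture.
const afterAllFixes = looksBlack(new Set(CAUSES));
```

Six of the seven non-empty fix sets produce an identical black screen, so a patch-and-look loop gets no signal until the very last step.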

Second: despite all that, the bug was cheaply isolable — one material swap, as we'll see, collapses the whole search. So the task quietly asks one question: when the screen gives you no gradient, do you keep reasoning, or do you run an experiment to manufacture a gradient? A pure deductive policy fails here; an interactive one walks right through. That's the separator. A leaderboard built from one-shot puzzles can't pose this question at all.

What the first model actually did

Watch the strong open-weights coder work it and you see a sharp engineer trapped in exactly that zero-gradient. It went straight at the hardest link in the chain — proving the winding by hand — and it kept getting the sign wrong, in both directions, on different attempts. Two real quotes from its trace, hours apart:

…so my "fix" actually made it worse — the original 0-1-2 was correct for the top face!

All 6 faces have inverted winding. The fix is simple: change indices from [0,1,2],[1,3,2] to [0,2,1],[1,2,3].

Both can't be true. It hand-traced the cross product for the top face, with v0 = (−½,½,−½), v1 = (½,½,−½), v2 = (−½,½,½), got (v1−v0)×(v2−v0) = (0,−1,0), correctly noticed that this points down into the block, and then on the next pass re-derived it the other way. It cycled through the same theories: winding, then "computeVertexNormals produces zero-length normals on degenerate triangles," then "the CanvasTexture isn't uploaded to the GPU," then the dawn lighting, then PBR, then a duplicate ambient light, then outputColorSpace not being forwarded by R3F, then drei's <Sky> overriding the background. Several of those sub-diagnoses were correct. It announced "Now I have the full picture" five separate times, each a different picture, and exhausted its step budget at 51 steps with the fix never landing.

It wasn't dumb. It was doing the right activity (reason about each link) on a problem where that activity can't converge: an overdetermined bug gives no gradient, and the specific link it fixated on (winding ⇄ normal sign) is a hand-traced derivation it could not keep consistent across attempts. This is precisely the failure mode a leaderboard can't expose — on a clean, single-cause puzzle this same model is excellent.

What the second model did differently

We re-pointed the coder slot at GPT-5.3 Codex. It did not win the winding argument. It refused to have the argument. From its trace:

I'm going to remove winding/normal ambiguity entirely by rebuilding chunk faces as non-indexed triangles with explicit per-face normals.

That single sentence is the whole fix, and it's a representation change. A cube face's normal is not something to infer — it's a known constant: +Y for the top, −Z for the back, +X for the right. So write it down instead of letting computeVertexNormals reconstruct it from winding you might have wrong.

The fragile build (the open-weights model's, and the original):

// 4 shared verts per quad, 2 triangles via an index buffer
geo.setAttribute('position', new THREE.BufferAttribute(pos, 3))
geo.setAttribute('uv',       new THREE.BufferAttribute(uv, 2))
geo.setIndex([0,1,2, 1,3,2])    // ← winding decides normals AND facing
geo.computeVertexNormals()       // ← normals INFERRED; a winding bug silently corrupts them
const mat = new THREE.MeshStandardMaterial({ map: atlas, roughness: 0.85 }) // ← PBR → dark, cause #2

The robust build (Codex's):

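// 6 verts per quad (two triangles, NON-indexed), explicit constant normal
const FACE_NORMAL = { top:[0,1,0], bottom:[0,-1,0], north:[0,0,-1],
                      south:[0,0,1], east:[1,0,0], west:[-1,0,0] }
const pos = [], nrm = [], uv = []              // plain arrays, filled per visible face

// per visible face: quad = its 4 corners, CCW as seen from outside; cellUV = their 4 atlas UVs
for (const [a,b,c] of [[0,1,2],[0,2,3]]) {     // one fixed CCW triangulation, written once
  for (const i of [a,b,c]) {
    pos.push(...quad[i])
    nrm.push(...FACE_NORMAL[face])             // ← the normal, stated, not derived
    uv.push(...cellUV[i])
  }
}
// Float32BufferAttribute converts plain arrays; BufferAttribute itself requires a typed array
geo.setAttribute('position', new THREE.Float32BufferAttribute(pos, 3))
geo.setAttribute('normal',   new THREE.Float32BufferAttribute(nrm, 3)) // explicit
geo.setAttribute('uv',       new THREE.Float32BufferAttribute(uv, 2))
const mat = new THREE.MeshLambertMaterial({ map: atlas })     // lit, cheap, correct given good normals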

Look at what this removes. With the normal stated explicitly:

N̂ is a constant per face  ⇒  N̂·L̂ is correct regardless of winding
                          ⇒  the entire winding → normal → black chain is GONE

Winding can now only affect facing (cull), not lighting — and a culled face is a loud, obvious symptom, not a silent dark one. The non-indexed layout (6 verts, no shared index) also means each triangle's normal is independent — no averaging across a shared vertex can split the difference between two faces. The bug class isn't fixed; it's made unrepresentable.
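Because winding still decides facing, it can be worth asserting once that the fixed triangulation agrees with the stated normal. The sketch below does that for the top face; the corner order of topQuad is a hypothetical choice (the post does not show the real one), and the helper names are illustrative.

```javascript
function sub(a, b) {
  return [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
}

function cross(a, b) {
  return [
    a[1] * b[2] - a[2] * b[1],
    a[2] * b[0] - a[0] * b[2],
    a[0] * b[1] - a[1] * b[0],
  ];
}

function dot(a, b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// True when the triangle's geometric normal points the same way as the
// stated normal, i.e. the face is wound CCW when seen from outside.
function windingMatchesNormal(quad, tri, normal) {
  const [a, b, c] = tri.map((i) => quad[i]);
  return dot(cross(sub(b, a), sub(c, a)), normal) > 0;
}

const TRIANGLES = [[0, 1, 2], [0, 2, 3]]; // the fixed triangulation from the build above

// Hypothetical corner order for the top face of a unit cube centered at the origin.
const topQuad = [
  [-0.5, 0.5, 0.5],
  [0.5, 0.5, 0.5],
  [0.5, 0.5, -0.5],
  [-0.5, 0.5, -0.5],
];

const topOk = TRIANGLES.every((t) => windingMatchesNormal(topQuad, t, [0, 1, 0]));

// The same corners listed in the opposite direction fail the check.
const reversedOk = TRIANGLES.every((t) =>
  windingMatchesNormal([...topQuad].reverse(), t, [0, 1, 0])
);
```

Run once per face at build time, a check like this turns a culled face from something you notice on screen into an assertion failure.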

Then Codex did the second thing the first model never did — it isolated the remaining variable. From its trace: "I'm adding a true unlit debug toggle now so we can isolate texture-atlas sampling from lighting." In code, that's one material swap:

// MeshBasicMaterial ignores N̂·L̂ entirely:  C_out = C_tex
const debug = new THREE.MeshBasicMaterial({ map: atlas })

MeshBasicMaterial drops the diffuse term — C_out = C_tex, no normals, no lights. So:

  • unlit shows correct textures → the atlas/UV path is fine, the bug is lighting/normals
  • unlit also shows wrong colors → the bug is atlas/UV (a flipY or cell-offset error)

One toggle collapses a two-variable search (is it geometry or is it light?) into two one-variable observations. That's the single most basic move in graphics debugging, and it's the one the spiraling model never made because it was busy re-deriving cross products. It is also, exactly, the move a leaderboard never asks for: there's nothing to deduce, only something to run.
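As a sketch, the toggle and the triage rule fit in a few lines. The Three.js namespace is passed in as a parameter here only so the snippet carries no import; makeChunkMaterial and triage are hypothetical names, not taken from either model's trace.

```javascript
// `three` is the Three.js namespace (for example, the result of importing 'three').
// unlit = true swaps in MeshBasicMaterial, which ignores normals and lights.
function makeChunkMaterial(three, atlas, { unlit = false } = {}) {
  return unlit
    ? new three.MeshBasicMaterial({ map: atlas })
    : new three.MeshLambertMaterial({ map: atlas });
}

// The two-branch reading of the unlit result, as listed above.
function triage(unlitLooksCorrect) {
  return unlitLooksCorrect
    ? "lighting/normals" // textures are fine, so the fault is downstream of them
    : "atlas/UV";        // wrong even without lighting, so look at UVs and the atlas
}
```

Usage would be something like `mesh.material = makeChunkMaterial(THREE, atlas, { unlit: true })`, followed by a look at the screen and a call to `triage`.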

The ledger, same prompt and same broken file for both:

Coder | Coder calls | Result
Open-weights coder | 51 | budget-exhausted, fix never landed
GPT-5.3 Codex | 31 | all-subtasks-terminal at step 14, then 19; clean

The benchmark only existed because the verifier was blind

There's a reason a model had to debug this blind in the first place, and it's the same reason it makes such a clean benchmark. The platform's generation verifier is deliberately objective: it type-checks, resolves imports, probes the API, runs the unit tests, and browses the page in real Chromium watching for thrown exceptions. For this app it returned, every single time:

verify_page /: renders OK (title "App")

True. The page loaded. A <canvas> painting all-black pixels throws nothing — it's a perfectly healthy canvas full of C_out ≈ 0. Every gate was a text-and-structure gate, and a rendered image is neither. So the only ground truth that could distinguish a working build from a broken one lived in the pixels — which is exactly why this task measures interaction policy and not deduction. We added a vision-QA pass: screenshot the canvas, ask a vision model one narrow question — is this objectively broken? It came back with the sentence no text check could produce: "everything is dark grayscale with no visible colors." That turned the agent's flailing from blind into merely hard — and gave us a graded task with a real, image-grounded answer key.
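A pixel-level gate does not have to be a vision model. The sketch below is a cheaper heuristic, not the vision-QA pass described above: it takes raw RGBA bytes (however you obtain them from a screenshot) and flags frames that are both dark and nearly colorless. The thresholds are assumed values that would need tuning, and the function names are hypothetical.

```javascript
// rgba: flat array of 8-bit values in R, G, B, A order.
function frameStats(rgba) {
  let lumaSum = 0;
  let chromaSum = 0;
  const pixelCount = rgba.length / 4;
  for (let i = 0; i < rgba.length; i += 4) {
    const r = rgba[i], g = rgba[i + 1], b = rgba[i + 2];
    lumaSum += 0.2126 * r + 0.7152 * g + 0.0722 * b;  // Rec. 709 luma weights
    chromaSum += Math.max(r, g, b) - Math.min(r, g, b); // 0 for pure gray
  }
  return {
    meanLuma: lumaSum / pixelCount / 255,
    meanChroma: chromaSum / pixelCount / 255,
  };
}

// Assumed thresholds: under 8% brightness and under 3% color spread.
function looksDarkGrayscale(rgba, { maxLuma = 0.08, maxChroma = 0.03 } = {}) {
  const { meanLuma, meanChroma } = frameStats(rgba);
  return meanLuma < maxLuma && meanChroma < maxChroma;
}

// Two tiny 2-pixel frames for illustration.
const darkFrame = new Uint8ClampedArray([10, 10, 10, 255, 0, 0, 0, 255]);
const grassyFrame = new Uint8ClampedArray([60, 180, 40, 255, 90, 150, 220, 255]);

const darkFlagged = looksDarkGrayscale(darkFrame);     // true
const grassyFlagged = looksDarkGrayscale(grassyFrame); // false
```

A check like this cannot say why a frame is black, but it can fail a build that every text-and-structure gate would pass.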

A fingerprint of the objective

The two models knew the same graphics — the open-weights coder correctly named the PBR material, the dawn lighting, the duplicate ambient. On a knowledge benchmark they'd score the same. What separated them was the policy for when stuck, and the plan ledger records that policy in the bluntest possible form: the names of the subtasks each model created.

The open-weights model's trail, in order:

fix-chunk-material → fix-merged-material → fix-normals-and-material
  → fix-winding-texture-normals → render-fix-v4-merged

Five consecutive fix- attempts at the same target: re-attack the patch, re-derive the winding, try again. There is no information-gathering step anywhere in it. Codex's trail:

render-debug-isolation → render-fix-v5-rewrite

The first is an experiment: a subtask whose purpose is to learn, not to ship (the unlit MeshBasicMaterial toggle). The second is a rewrite, not a patch. Across 21 render-related subtasks in the project's whole history, the word "isolation" appears exactly once, and it's Codex's. The same is true of "rewrite."

The numbers agree from two more angles:

Signal | Open-weights coder | GPT-5.3 Codex
Steps → outcome | 51 → budget-exhausted (no fix) | 14, then 19 → all-subtasks-terminal (clean)
Per-step output tokens | median 74, max 6,400 (re-derivation bursts) | median 257, max 4,511 (steady)
First move when stuck | re-derive the cross product | isolate the variable

More steps, worse result. And the shape is diagnostic: the open-weights output was bimodal — terse tool-pokes punctuated by 6,400-token analysis walls, twice reaching opposite conclusions on the same winding derivation — while Codex's was even, because it spent its tokens advancing rather than re-opening.

Read those three signals together and they point at one thing — the thing a true, environment-grounded benchmark can see and a leaderboard can't. There are problems solvable by deduction — a proof, a type error; more thinking is the path — and problems solvable only by interaction — a stateful GPU pipeline, where the missing fact lives in the running system and is reachable only by running an experiment. A model post-trained on long-horizon agentic tasks with execution feedback learns that an action's value is the uncertainty it removes; it reaches for the isolation toggle and the representation rewrite. A model whose strength is competition-grade reasoning brings that strength to every problem — including the ones where re-deriving the cross product a sixth time is motion without progress. The voxel bug is a near-perfect separator because it is both overdetermined (no deductive gradient) and cheaply isolable (one material swap): it rewards exactly the policy agentic training instills and punishes exactly the one pure reasoning instills.

The honest caveat, worth keeping: this is one trace, read from the outside. Sampling, the harness, context limits, and raw model scale all confound it, and none of it is a claim about weights. A single real bug is a sample size of one — which is the price of measuring the property leaderboards can't. But the pattern — debug by experiment, not re-derivation; restructure so the bug is unrepresentable, not out-argue it — is reproducible, and it's what the public direction of coding-agent training predicts. The most honest benchmark we ran this month wasn't a benchmark at all. It was one user's black screen, and it told us something the scorecards never would: which model to reach for when the screen stops giving answers.
