Start here · Part 3 of 4

The life of a weekend app

One Friday evening, one casual paragraph about a ski-trip expense spreadsheet, and a platform doing everything the demo never shows. This is the true story of a single app — real prompts, real timestamps, real money — from first message to a verified, frozen, costs-nothing application. Every stop on the timeline is a door into the machinery that made it boring.

Every post on this blog dissects one organ of the platform — the map, the database, the freezer. This one watches the whole organism live. It's a true story: one app, built on the platform on a Friday evening, with the actual prompts, the actual timestamps, and the actual bill. Nothing in this post is a mock-up, including the part where a build runs out of budget and the part where the first bug gets found by a human.

Friday, 18:42 — a paragraph instead of a spec

The whole specification, exactly as typed:

Me and 7 friends are going on a ski trip in February and I'm tired of the spreadsheet. I want an app where anyone in the group can add an expense they paid for (lift passes, groceries, fuel, dinners), tag who it covers, and at any point see a "who owes who" summary so we can settle up at the end with as few transfers as possible. Everything's in EUR. Needs to work well on phones since we'll add stuff on the slopes.

No schema, no pages, no tech stack. Three requirements hide in the prose, and none of them is stated as a requirement: mobile-first ("on the slopes"), shared data ("anyone in the group"), and a genuine algorithm. "Settle up at the end with as few transfers as possible" is a min-cash-flow problem, not a CRUD endpoint.

18:46 — four minutes later, the architect's plan comes back. Six features, eleven subtasks:

| # | Feature |
|---|---------|
| 0 | App shell & navigation: mobile-first, bottom tabs |
| 1 | Trip & members: the 8 friends, configurable names |
| 2 | Add an expense: who paid, what for, amount, who it covers |
| 3 | Expenses feed & balances |
| 4 | Settle up: minimum transfers to zero all balances |
| 5 | Polish & verify: seed demo data, test the math end-to-end |

All three implicit requirements made it in, plus one nobody asked for: test the math end-to-end. The plan is a contract, not a vibe — each subtask declares the files it will create and the acceptance criteria it will be judged by. Judged objectively: by compilers and probes, never by the model grading itself.

18:52 — plan approved with one word ("yes").

18:52:15 — infrastructure arrives before the first line of code

Six seconds after approval, before the agent has written anything, the project has: its own Kubernetes namespace with quotas, its own PostgreSQL branch (copy-on-write, instant, scale-to-zero), the React + tRPC scaffold, and the database schema from the plan. Provisioning is deterministic platform code, not agent work — the agent is the least trusted component here and gets handed a furnished room, not a toolbox near a datacenter.

18:52 → 21:11 — the build, with receipts

Here is what an agent actually does with its budget, from this run's tool ledger:

| What | Share of steps |
|------|----------------|
| Writing code | ~24% |
| Verifying: type checks, live contract probes, in-pod test runs, headless browsing | ~27% |
| Reading context | ~29% |
| Plan bookkeeping (which is also what triggers verification) | ~15% |
| Direct SQL checks against its own database | ~4% |

For every step of writing there was at least one of checking: about 24% of steps went to writing code, against about 31% for verification and direct SQL checks combined. That ratio isn't agent virtue. It's the platform's stance enforced by tooling: a subtask only counts as done when the verifier pipeline says so. Three moments from the ledger show what that means in practice:

A truncated file got caught at the gate. The seed-data step once emitted a 176-character db/seed.ts — a model hiccup that would have "succeeded" as a file write. The emission gate rejected it as degenerate; the retry wrote the real thing. The platform's job is making sure a model's bad day doesn't compile into the product.
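To make the idea of an emission gate concrete, here is a minimal sketch. The post does not say how the platform decides a file is degenerate, so everything here is an assumption: the `emissionGate` name, the 200-character threshold, and the naive brace-balance check (which would misfire on braces inside string literals) are hypothetical stand-ins for whatever the real gate does.

```typescript
// Hypothetical result type: either the write is accepted, or rejected with a reason.
type GateResult = { ok: true } | { ok: false; reason: string };

// Assumed heuristics, not the platform's actual rules:
// 1. a source file shorter than `minChars` is treated as truncated;
// 2. unbalanced curly braces suggest the output was cut off mid-block.
function emissionGate(path: string, content: string, minChars = 200): GateResult {
  const trimmed = content.trim();
  if (trimmed.length < minChars) {
    return { ok: false, reason: `${path}: only ${trimmed.length} characters` };
  }

  let depth = 0;
  for (const ch of trimmed) {
    if (ch === "{") depth++;
    if (ch === "}") depth--;
  }
  if (depth !== 0) {
    return { ok: false, reason: `${path}: unbalanced braces` };
  }

  return { ok: true };
}
```

Under these assumed thresholds, a 176-character `db/seed.ts` would be rejected before it ever reached the compiler, and the agent would be asked to emit the file again.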

A subtask failed honestly. The members-management page burned three steps without writing any of its declared files, and the scope gate refused to credit it: status failing, evidence "completed 3 steps without writing any of its declared files." No false green. On the next pass, an anti-spiral nudge kicked the agent out of a diagnostic loop at step 11 — and the page landed, five files, tests green.
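A scope gate of this kind can be sketched in a few lines. This is an illustration under assumptions, not the platform's code: the `SubtaskRun` and `Verdict` shapes and the `scopeGate` name are hypothetical, and a real "passing" verdict would also depend on the verifier pipeline, which this sketch leaves out.

```typescript
// Hypothetical record of one subtask attempt.
type SubtaskRun = {
  declaredFiles: string[]; // files the plan said this subtask would create
  writtenFiles: string[];  // files the agent actually wrote
  steps: number;           // tool steps consumed
};

type Verdict = { status: "passing" | "failing"; evidence: string };

// Credit a subtask only if every declared file was actually written.
function scopeGate(run: SubtaskRun): Verdict {
  const written = new Set(run.writtenFiles);
  const missing = run.declaredFiles.filter((f) => !written.has(f));

  if (run.declaredFiles.length > 0 && missing.length === run.declaredFiles.length) {
    return {
      status: "failing",
      evidence: `completed ${run.steps} steps without writing any of its declared files`,
    };
  }
  if (missing.length > 0) {
    return { status: "failing", evidence: `missing declared files: ${missing.join(", ")}` };
  }
  return { status: "passing", evidence: "all declared files written" };
}
```

The point of the design is that the evidence string is computed from the ledger rather than reported by the agent, so a run that only read files cannot mark itself green.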

The math got checked in SQL. The plan said test the settle-up end-to-end, so the agent seeded eight members and a dozen expenses, then ran WITH balances AS (...) queries directly against its branch to verify that the transfer plan nets every balance to zero — then patched its own test suite where it disagreed. The app ships with that suite; it was born testable.
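The agent's actual SQL is elided in the ledger, so the following is not a reconstruction of it. It is a small TypeScript sketch of the same invariant: compute each member's balance from the expenses, apply a transfer plan, and confirm everyone ends at zero. The `Expense` and `Transfer` shapes, the equal split, and the rule for handing out leftover cents are all assumptions made for illustration.

```typescript
// Hypothetical shapes; amounts are integer cents to avoid floating-point drift.
type Expense = { paidBy: string; cents: number; covers: string[] };
type Transfer = { from: string; to: string; cents: number };

// Positive balance = the member is owed money; negative = the member owes.
function computeBalances(expenses: Expense[]): Map<string, number> {
  const balances = new Map<string, number>();
  const add = (m: string, delta: number) =>
    balances.set(m, (balances.get(m) ?? 0) + delta);

  for (const e of expenses) {
    add(e.paidBy, e.cents);
    // Assumed rule: split equally, first participants absorb the leftover cents.
    const share = Math.floor(e.cents / e.covers.length);
    const remainder = e.cents - share * e.covers.length;
    e.covers.forEach((m, i) => add(m, -(share + (i < remainder ? 1 : 0))));
  }
  return balances;
}

// True when applying every transfer leaves every member at exactly zero.
function planNetsToZero(balances: Map<string, number>, plan: Transfer[]): boolean {
  const after = new Map(balances);
  for (const t of plan) {
    after.set(t.from, (after.get(t.from) ?? 0) + t.cents); // payer's debt shrinks
    after.set(t.to, (after.get(t.to) ?? 0) - t.cents);     // receiver is owed less
  }
  return Array.from(after.values()).every((v) => v === 0);
}
```

A check like this is independent of how the plan was produced, which is what makes it useful as a test: any settle-up algorithm, clever or naive, has to pass it.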

One more honest beat: halfway down the plan, the first run hit its step budget. A budget is a circuit breaker, not a promise that every plan fits in one pass — so the run banked its five verified subtasks, committed them, and paused, at 19:53, saying so. Then it waited — not computing, not billing, not pretending to work — for 35 minutes, until a human wandered back at 20:28 and typed "continue". The second run picked up the chain where the ledger said it stopped, skipped everything already verified, and finished the remaining six. (It also retried the page that had failed — that's where the anti-spiral save above happened.)
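The resume step amounts to a filter over the plan. A minimal sketch, assuming a ledger that records one status per subtask; the `LedgerEntry` shape and the `remainingWork` name are hypothetical, not the platform's API.

```typescript
// Hypothetical ledger entry: the last recorded verdict for a subtask.
type LedgerEntry = {
  subtaskId: string;
  status: "passing" | "failing" | "pending";
};

// Return the subtasks a resumed run still has to do, in plan order.
// Anything already verified as passing is skipped; failed and untouched
// subtasks are both retried.
function remainingWork(plan: string[], ledger: LedgerEntry[]): string[] {
  const verified = new Set(
    ledger.filter((e) => e.status === "passing").map((e) => e.subtaskId),
  );
  return plan.filter((id) => !verified.has(id));
}
```

With eleven planned subtasks, five recorded as passing and one as failing, this returns six: the failed page plus the five that were never started, which matches the shape of the second run described above.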

21:11 — done. Eleven of eleven subtasks passing. The wall clock says 2h19 from approval to done, but half an hour of that was the platform waiting, frozen mid-plan, for a human to come back from dinner. The agent's actual working time was about 1h45. The release commit is pinned to the database's exact log position (bdc2258 @ LSN 0/21933A8), which sounds like trivia until you need code and data to time-travel together.

This is the app, in production, photographed by the same headless Chromium the platform uses for its own runtime checks:

[Screenshot: Ski Trip 2025 mobile home screen, showing total spend €2,561.10, 11 expenses, per-person balance, recent expenses list, and bottom tab navigation]

[Screenshot: Settle Up screen, "who owes who with the fewest payments", showing 8 members' balances and a plan of 7 transfers moving €1,076.33]

The settle screen is the underspecified Friday sentence, shipped: "who owes who… with as few transfers as possible" became eight balances and a plan of 7 transfers. Seven is the worst case for eight people: settling n balances never needs more than n−1 transfers, and fewer suffice only when some subgroup's balances happen to cancel among themselves. And if you look closely at the negative amounts, you can see the next iteration already on camera: a doubled minus sign. Tuesday's bug report will be one sentence, like the last one.
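Why n−1 transfers always suffice is easiest to see in code. The sketch below is a simple greedy pairing, not the app's actual implementation (which the post does not show); the `Balance` and `Transfer` types and the `settleUp` name are hypothetical. Every transfer zeroes out at least one person, and the final transfer zeroes out two, so n people need at most n−1 transfers. Note that this greedy approach does not guarantee the true minimum when a subgroup's balances cancel exactly; finding that in general is a harder search problem.

```typescript
// Hypothetical shapes; amounts are integer cents.
// Positive balance = is owed money; negative = owes money.
type Balance = { member: string; cents: number };
type Transfer = { from: string; to: string; cents: number };

function settleUp(balances: Balance[]): Transfer[] {
  const total = balances.reduce((sum, b) => sum + b.cents, 0);
  if (total !== 0) throw new Error(`balances must sum to zero, got ${total}`);

  // Work on copies, largest amounts first, with debts stored as positive numbers.
  const debtors = balances
    .filter((b) => b.cents < 0)
    .map((b) => ({ member: b.member, cents: -b.cents }))
    .sort((a, b) => b.cents - a.cents);
  const creditors = balances
    .filter((b) => b.cents > 0)
    .map((b) => ({ member: b.member, cents: b.cents }))
    .sort((a, b) => b.cents - a.cents);

  const transfers: Transfer[] = [];
  let d = 0;
  let c = 0;
  while (d < debtors.length && c < creditors.length) {
    // Pay as much as possible: this settles the debtor, the creditor, or both.
    const amount = Math.min(debtors[d].cents, creditors[c].cents);
    transfers.push({ from: debtors[d].member, to: creditors[c].member, cents: amount });
    debtors[d].cents -= amount;
    creditors[c].cents -= amount;
    if (debtors[d].cents === 0) d++;
    if (creditors[c].cents === 0) c++;
  }
  return transfers;
}
```

For example, `settleUp([{ member: "a", cents: 3000 }, { member: "b", cents: -1000 }, { member: "c", cents: -2000 }])` yields two transfers for three people: c pays a 2000 cents, then b pays a 1000 cents.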

21:28 — the first bug is found by a human

Poking the preview: two strange members named "x", carrying balances, and no way to delete them. Reported the way a person actually reports things:

I see some strange users x which have a non zero balance but I don't see any expenses associated to them. And I can't delete them.

The diagnosis is a story about automated QA: during runtime verification the agent's headless browser had typed into the real add-member form, and the test members stuck. Deleting them then surfaced a real design gap — members referenced by expense splits were foreign-key-protected, and the UI had no answer.

What came back, nine minutes later, was better than a delete-unblock: a re-split design — removing a member redistributes their shares across the remaining participants — plus a regression test for exactly the case the existing suite didn't cover (the agent's own observation, from its narration). Twenty-one tests green, committed.
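The re-split idea can be sketched for a single expense. This is an illustration, not the committed fix: the `Split` shape and `resplitWithout` name are hypothetical, and the choice to redistribute equally (rather than proportionally) and to give leftover cents to the first participants is an assumption. It also ignores the case where the removed member is the one who paid.

```typescript
// Hypothetical shape: one member's share of one expense, in integer cents.
type Split = { member: string; cents: number };

// Remove `removed` from an expense's splits and spread their share
// equally over the remaining participants, preserving the expense total.
function resplitWithout(splits: Split[], removed: string): Split[] {
  const gone = splits.find((s) => s.member === removed);
  const rest = splits
    .filter((s) => s.member !== removed)
    .map((s) => ({ member: s.member, cents: s.cents }));

  if (!gone) return rest; // member was not part of this expense
  if (rest.length === 0) {
    throw new Error("cannot remove the only participant of an expense");
  }

  const share = Math.floor(gone.cents / rest.length);
  const remainder = gone.cents - share * rest.length;
  rest.forEach((s, i) => {
    s.cents += share + (i < remainder ? 1 : 0);
  });
  return rest;
}
```

The invariant worth a regression test is that the sum of the splits is unchanged after removal, so the payer's balance is unaffected by the deletion.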

That's the iteration loop this platform is built around: the bug arrives as a sentence, the fix arrives as a verified commit, and the person on the slopes never saw a stack trace.

23:00 — what a finished weekend app costs

The evening's bill, from the metering tables:

  • Build: ~4.8M tokens through the coding model across two runs and the fix iteration — about $3.50 raw model cost for a full-stack, tested, seeded application.
  • Idle: 148 seconds after the last keepalive, the platform froze the app's container at the cgroup — this very app, from this very story, freed 201 MB of RAM to swap on its first quiet evening. Its database compute scaled to zero separately. An idle weekend idea costs effectively nothing, indefinitely, by default.

The spreadsheet died at 18:42. By the end of the evening there was an app with verified math, its own database branch, a test suite, and an idle cost of zero, and nobody on the trip needs to know what an LSN is.

What the preview doesn't promise yet

Everything above is still one user's preview — and a preview makes almost no promises. The interesting part of this app's life starts when seven other people get the URL: a production deploy that forks the data timeline, real expenses from real friends that the next deploy must not touch, a risky migration, and — when it goes wrong — a rollback with explicit rules about what it may delete.

That's part two of this app's story. The mechanisms are already written up in Act 2 — this app just hasn't lived them yet. It will by Sunday.
