Time-travel for code AND data
An AI agent writes code, and we have a great undo for that: it's called git. Every generation is a commit; "go back" is a checkout. Solved.
But agents don't only write code — they run migrations, seed tables, and
issue UPDATEs. And there, "go back" is a backup-restore incident. An agent
that drops the wrong column or runs an UPDATE without a WHERE has done
something git can't touch. For autonomous, fast-moving agents, data is the
unprotected surface.
So we gave the database the same property git gave the code: point-in-time undo, tied to the same checkpoints. Click "restore" on a past version and your schema and rows roll back with the code — to a state that actually ran.
This post explains how it works: the storage model that makes it cheap, the checkpoint, the copy-on-write restore, and the cache coherence behind the compute repoint.
Why physical PITR, not "replay the migrations"
The naive version of data-undo is logical: keep the DDL, run the down-migrations. That loses your rows, can't represent arbitrary mutations, and diverges from reality the moment anything writes outside a migration. We wanted the actual bytes of the database as they were at a point in time — every row, index, sequence, and catalog entry — reconstructed exactly.
That's physical point-in-time recovery, and it's only cheap if storage and compute are separated. Adorable runs a self-hosted Neon: the pageserver stores versioned pages + WAL in object storage; compute is a stateless Postgres that reads pages from the pageserver on demand.
A timeline is the unit of history: a Postgres database reconstructed from page versions + WAL up to a given LSN (Log Sequence Number — Postgres's monotonic byte offset into the WAL). Branching a timeline is copy-on-write: the child shares the parent's unchanged pages and only writes a page when it diverges. Creating one is an O(1) metadata operation, independent of database size.
That single primitive — branch a timeline from another at a specific LSN — is the whole feature. Everything below is plumbing around it.
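To make the primitive concrete, here is a toy in-memory model of copy-on-write branching. It is a sketch for intuition only, not Neon's implementation: `ToyTimeline`, `branchAt`, `writePage`, and `readPage` are hypothetical names, LSNs are plain integers, and a page is just a string.

```typescript
// Toy model (not Neon's code): a timeline stores only the page versions
// written on it. Reads fall through to the ancestor, capped at the branch LSN.
type PageVersion = { lsn: number; data: string };

class ToyTimeline {
  private readonly pages = new Map<number, PageVersion[]>();
  private readonly ancestor: ToyTimeline | null;
  private readonly ancestorLsn: number;

  constructor(ancestor: ToyTimeline | null = null, ancestorLsn = 0) {
    this.ancestor = ancestor;
    this.ancestorLsn = ancestorLsn;
  }

  // Branching copies no pages: the child records (ancestor, LSN) and starts empty.
  branchAt(lsn: number): ToyTimeline {
    return new ToyTimeline(this, lsn);
  }

  // Assumes LSNs are written in increasing order, as WAL offsets are.
  writePage(pageNo: number, lsn: number, data: string): void {
    const versions = this.pages.get(pageNo) ?? [];
    versions.push({ lsn, data });
    this.pages.set(pageNo, versions);
  }

  // Newest version of the page at or before `lsn`, or undefined if none exists.
  readPage(pageNo: number, lsn: number): string | undefined {
    let newest: PageVersion | undefined;
    for (const version of this.pages.get(pageNo) ?? []) {
      if (version.lsn <= lsn) newest = version;
    }
    if (newest !== undefined) return newest.data;
    if (this.ancestor === null) return undefined;
    // No local version: the page is still shared with the ancestor.
    return this.ancestor.readPage(pageNo, Math.min(lsn, this.ancestorLsn));
  }
}

const dev = new ToyTimeline();
dev.writePage(1, 10, "S0");
dev.writePage(1, 20, "S1");           // a later write on the parent
const restored = dev.branchAt(15);    // fork between S0 and S1
restored.writePage(2, 30, "new row"); // the child diverges on another page
```

In this model `restored` sees page 1 as it was at the branch point, while `dev` keeps its later write. `branchAt` costs the same however many pages the parent holds, which is the property the real primitive relies on.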
A checkpoint is a two-field marker, not a copy
At each generation commit we record a checkpoint. The critical design choice:
a checkpoint forks nothing. It's a row that remembers which timeline and
which LSN — (timeline_id, lsn) — plus the git commit it corresponds to. No
storage is allocated; the fork is materialized lazily, only if you ever restore.
The hook lives at the end of the generation pipeline, right after the git commit, and is strictly best-effort — a time-travel failure must never break a build:
const commitResult = await commitGeneration(projectDir, { userPrompt, runSummary });
if (commitResult?.commitHash) {
  await recordCheckpointSafe(projectId, { commitSha: commitResult.commitHash });
}
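One way to get the best-effort behaviour is a wrapper that catches every failure and returns null. The sketch below is an assumption about the shape of such a wrapper, not the platform's actual `recordCheckpointSafe`: `makeRecordCheckpointSafe`, the `Recorder` type, and the injected `warn` logger are hypothetical.

```typescript
type CheckpointInput = { commitSha: string };
type RecordedCheckpoint = { name: string; commitSha: string };
type Recorder = (
  projectId: string,
  input: CheckpointInput,
) => Promise<RecordedCheckpoint | null>;

// Wraps a checkpoint recorder so any failure is logged and swallowed.
function makeRecordCheckpointSafe(
  recordCheckpoint: Recorder,
  warn: (message: string) => void = (message) => console.warn(message),
): Recorder {
  return async (projectId, input) => {
    try {
      // `await` inside the try block so rejections are caught here.
      return await recordCheckpoint(projectId, input);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      warn(`checkpoint skipped for ${projectId}@${input.commitSha}: ${reason}`);
      return null; // the build continues without a restore point
    }
  };
}
```

The caller never sees an exception, so the generation pipeline cannot fail because of time-travel. The cost is a commit with no restore point, which is why the failure is logged.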
recordCheckpoint reads the timeline's current head LSN from the proxy and writes
one row into project_db_branches:
const lsnInfo = await getBranchLsn(projectId); // GET /api/branches/{id}/lsn
if (!lsnInfo?.timelineId) return null; // frontend-only app: nothing to tag
return recordBranch(projectId, {
  name: checkpointBranchName(commitSha), // ckpt-<sha12>
  kind: "checkpoint",
  timelineId: lsnInfo.timelineId,
  createdLsn: lsnInfo.lastRecordLsn,
  commitSha,
  servesEnv: null, // serves no compute until restored
});
Which LSN a checkpoint pins
recordCheckpoint pins the pageserver's last_record_lsn — the high-water
mark of WAL the pageserver has durably ingested — read once via getBranchLsn.
That LSN trails the compute's most recent commit by whatever WAL is still in
flight (compute → safekeeper → pageserver), so a checkpoint captures everything
durable on the page store at that instant, not a write that hasn't landed yet:
compute commit ─(WAL in flight)→ safekeeper ─→ pageserver: last_record_lsn
                                                           ▲ checkpoint pins this durable high-water mark
In practice that's the right boundary: checkpoints are taken at generation
commits (recordCheckpointSafe runs after commitGeneration), by which point
the run's writes have propagated to the page store — so a later fork at the
checkpoint's LSN restores exactly the state it held there.
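Whether a given write is inside a checkpoint comes down to comparing LSNs. Postgres prints an LSN as two hexadecimal halves of a 64-bit WAL offset, such as 0/16B3748, so comparing the strings directly gives wrong answers. The helpers below are an illustrative sketch: `parseLsn`, `compareLsn`, and `isCapturedBy` are hypothetical names, and "captured" is simplified to "at or before the pinned LSN".

```typescript
// Splits a textual LSN ("hi/lo", both hexadecimal) into its two 32-bit halves.
function parseLsn(text: string): [number, number] {
  const match = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/.exec(text);
  if (match === null) throw new Error(`invalid LSN: ${text}`);
  const [, hi = "", lo = ""] = match;
  return [parseInt(hi, 16), parseInt(lo, 16)];
}

// Returns -1, 0, or 1, ordering LSNs by WAL position.
function compareLsn(a: string, b: string): number {
  const [aHi, aLo] = parseLsn(a);
  const [bHi, bLo] = parseLsn(b);
  if (aHi !== bHi) return aHi < bHi ? -1 : 1;
  if (aLo !== bLo) return aLo < bLo ? -1 : 1;
  return 0;
}

// Simplified rule: a write is inside a checkpoint only if its LSN is at or
// before the LSN the checkpoint pinned.
function isCapturedBy(writeLsn: string, checkpointLsn: string): boolean {
  return compareLsn(writeLsn, checkpointLsn) <= 0;
}

const pinned = "0/16B3750";
const landed = isCapturedBy("0/16B3748", pinned);   // true: already ingested
const inFlight = isCapturedBy("0/16B3758", pinned); // false: still past the mark
```

Note that `compareLsn("0/9", "0/10")` is -1 because 0x9 is less than 0x10, whereas a plain string comparison would order them the other way.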
Checkpoints accumulate (one per commit), so recordCheckpoint prunes to the
newest CHECKPOINT_KEEP (50, env-tunable) rows. They're markers, so pruning a
row reclaims nothing — restore depth is governed by Neon's PITR window, which is
the real retention knob.
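The pruning step can be sketched as a pure function that picks which marker rows to delete. This is an assumption about the logic, not the platform's code: `CheckpointMarker`, its `createdAt` field, and `selectPrunable` are hypothetical, and the default of 50 mirrors the CHECKPOINT_KEEP value mentioned above.

```typescript
type CheckpointMarker = { name: string; createdAt: number };

// Returns the rows to delete so that only the `keep` newest remain.
// The input is not mutated.
function selectPrunable(
  rows: readonly CheckpointMarker[],
  keep = 50,
): CheckpointMarker[] {
  const newestFirst = [...rows].sort((a, b) => b.createdAt - a.createdAt);
  return newestFirst.slice(Math.max(0, keep));
}

const markers: CheckpointMarker[] = [
  { name: "ckpt-a", createdAt: 1 },
  { name: "ckpt-b", createdAt: 2 },
  { name: "ckpt-c", createdAt: 3 },
  { name: "ckpt-d", createdAt: 4 },
  { name: "ckpt-e", createdAt: 5 },
];
const toDelete = selectPrunable(markers, 2).map((row) => row.name);
```

Deleting these rows only removes restore targets from the list. It frees no storage, because no storage was ever allocated for them.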
Restore = fork at the LSN, then repoint the compute
Restoring to checkpoint C is two moves: (1) materialize the historical
state as a new copy-on-write timeline, (2) point the running compute at it.
timeline T (dev):  ──S0──●──S1──▶        ● = checkpoint C's LSN (post-S0, pre-S1)
                         │
                         └─ fork ──▶ timeline T' (new dev)  ← compute repointed here
                                     contains S0, NOT S1
The dev compute now serves T' (data at S0). T is untouched: restore is
non-destructive, so you can roll forward again. The full path runs from the
backend's time-travel layer, through its Neon HTTP client, to the Rust proxy.
The fork is the pageserver primitive, exposed through the proxy. In the proxy's
pageserver client, create_timeline takes an optional ancestor:
if let Some(ancestor) = ancestor_timeline_id {
    body["ancestor_timeline_id"] = json!(ancestor);
    if let Some(lsn) = ancestor_start_lsn {
        body["ancestor_start_lsn"] = json!(lsn); // pin the branch point
    }
}
The explicit ancestorTimelineId matters for cross-timeline restore: after
one restore, the active timeline is T'; restoring to an older checkpoint
(on T) must fork T, not the current timeline. The checkpoint row carries the
timeline id, so restoreToCheckpoint forks exactly the right ancestor.
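On the backend side, the important detail is that the fork request is built from the checkpoint row and never from whatever timeline the compute happens to be serving. The sketch below illustrates that under assumptions: `buildForkRequest` and `CheckpointRef` are hypothetical, and `new_timeline_id` is an assumed field name. Only `ancestor_timeline_id` and `ancestor_start_lsn` appear in the proxy snippet above.

```typescript
type CheckpointRef = { timelineId: string; createdLsn: string };

type ForkRequest = {
  new_timeline_id: string;      // assumed field name for the child timeline
  ancestor_timeline_id: string; // as in the proxy's create_timeline body
  ancestor_start_lsn: string;   // as in the proxy's create_timeline body
};

// Builds the fork request from the checkpoint row alone. The active timeline
// is deliberately not a parameter, so it cannot be used by mistake.
function buildForkRequest(
  checkpoint: CheckpointRef,
  newTimelineId: string,
): ForkRequest {
  return {
    new_timeline_id: newTimelineId,
    ancestor_timeline_id: checkpoint.timelineId,
    ancestor_start_lsn: checkpoint.createdLsn,
  };
}

// After one restore the compute serves T', but this checkpoint lives on T.
const activeTimelineId = "timeline-T-prime";
const olderCheckpoint: CheckpointRef = { timelineId: "timeline-T", createdLsn: "0/16B3750" };
const forkRequest = buildForkRequest(olderCheckpoint, "timeline-T-second");
```

Here `forkRequest.ancestor_timeline_id` is `timeline-T` even though the active timeline is `timeline-T-prime`, which is the cross-timeline case described above.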
Cache coherence: the proxy owns the repoint
The second move, pointing the compute at the new timeline, has one non-obvious requirement that is easy to get subtly wrong.
The compute spec is rebuilt from the branch credentials stored in Vault:
(timeline_id, password, …). So "point the compute at T'" means "set the
timeline in the credentials, then restart the compute, which reads them and
boots on T'."
The catch: the proxy caches branch credentials in memory for 5 minutes
(CACHE_TTL = 300s) to keep Vault off the hot path.
A write that goes straight to Vault (for instance, from the backend) leaves
that cache stale, and the next wake rebuilds the compute from the cached (old)
timeline. Everything looks fine (the fork exists, the DB-branch row points at
it, the compute reboots) while the data never moves: a stale cache hit you
can't see.
So the repoint is owned by the proxy, not the backend.
POST /api/branches/{id}/repoint calls put_branch_credentials, which writes
Vault and updates the cache in one operation, then restarts the compute on
the new timeline.
The principle generalizes: if a value is cached behind a service, mutate it through that service. A write that bypasses the cache is a write on a clock you don't control.
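The failure mode and the fix both show up in a small model. This is a sketch, not the proxy's Rust code: `CredentialStore` is a hypothetical class, a `Map` stands in for Vault, and the clock is injected so the TTL can be exercised deterministically.

```typescript
type BranchCredentials = { timelineId: string; password: string };
type CacheEntry = { value: BranchCredentials; expiresAt: number };

class CredentialStore {
  private readonly cache = new Map<string, CacheEntry>();
  private readonly vault: Map<string, BranchCredentials>;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(vault: Map<string, BranchCredentials>, ttlMs: number, now: () => number) {
    this.vault = vault;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Serves from the cache until the entry expires, then re-reads the vault.
  get(branchId: string): BranchCredentials | undefined {
    const hit = this.cache.get(branchId);
    if (hit !== undefined && hit.expiresAt > this.now()) return hit.value;
    const fresh = this.vault.get(branchId);
    if (fresh !== undefined) this.remember(branchId, fresh);
    else this.cache.delete(branchId);
    return fresh;
  }

  // Write-through: the vault and the cache change in the same call.
  put(branchId: string, creds: BranchCredentials): void {
    this.vault.set(branchId, creds);
    this.remember(branchId, creds);
  }

  private remember(branchId: string, creds: BranchCredentials): void {
    this.cache.set(branchId, { value: creds, expiresAt: this.now() + this.ttlMs });
  }
}

let clock = 0;
const vault = new Map<string, BranchCredentials>();
const store = new CredentialStore(vault, 300 * 1000, () => clock);

store.put("dev", { timelineId: "T", password: "pw" });
vault.set("dev", { timelineId: "T-prime", password: "pw" }); // bypasses the cache
const staleRead = store.get("dev");  // still the old timeline

store.put("dev", { timelineId: "T-prime", password: "pw" }); // goes through the owner
const freshRead = store.get("dev");  // the new timeline
```

The direct `vault.set` is the backend-writes-Vault case: the stored value is correct, yet every read inside the TTL returns the old timeline. Routing the write through `put` removes that window.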
A note on the swap
repoint scales the Deployment to 0 and then back to 1; the new compute boots on T' and
the proxy's bridge reconnects clients to it. A connection that re-establishes in
the brief window while the previous pod is still draining settles onto the
restored timeline once the new compute is serving — so callers read the
post-restore state after the swap completes. (It's why the verification below
polls the result rather than reading once.)
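A caller that needs to observe the post-restore state can poll until the read settles instead of trusting a single read. The helper below is a generic sketch: `pollUntil` and `sleep` are hypothetical names, and the default attempt count and delay are arbitrary values chosen for illustration.

```typescript
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Calls `read` until `isSettled` accepts the value, waiting between attempts.
// Throws if the value has not settled after `attempts` reads.
async function pollUntil<T>(
  read: () => Promise<T>,
  isSettled: (value: T) => boolean,
  attempts = 20,
  delayMs = 250,
): Promise<T> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const value = await read();
    if (isSettled(value)) return value;
    if (attempt < attempts) await sleep(delayMs);
  }
  throw new Error(`value did not settle after ${attempts} attempts`);
}
```

In a restore check, `read` would be a query that identifies which state the connection is serving, and `isSettled` would compare it with the expected post-restore value. Both are left to the caller here because the post does not specify them.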
How it's verified
The fork → repoint → reboot path isn't special-cased for time-travel — it's the
same branch-at-LSN primitive that deploy and rollback ride on, so every
production promotion exercises it in the field. The checkpoint, fork, and
post-restore reconciliation logic are covered by the platform's branch and
restore-reconcile test suites (db-branches.test.js,
db-restore-reconcile.test.js), which stand up the real branch operations rather
than mocking them.
Design bounds
Restore depth is the PITR window. Checkpoint rows are cheap, but you can only fork an LSN the pageserver still retains, so how far back you can restore is governed by Neon's history window. The checkpoint row cap and the PITR window are kept aligned to the depth we offer.
Why it's worth it
For a human, this is a nice undo button. For an agent, it's closer to a prerequisite: it's what makes fast, otherwise-irreversible data operations safe to delegate. The agent can run the scary migration because the platform can take it back, code and data together, to a point that actually ran. Git did this for code a long time ago. The database just needed the same primitive: a branch at an LSN.