The database · Part 3 of 6

Time-travel for code AND data: undoing an AI agent's database mistakes

An AI agent will confidently run a migration that drops the wrong column. Git gives you an undo for code; nothing gives you one for data. Here's how code+data time-travel works on copy-on-write Postgres timelines — the checkpoint-at-LSN, the copy-on-write restore, and the cache coherence that keeps the compute repoint correct.

An AI agent writes code, and we have a great undo for that: it's called git. Every generation is a commit; "go back" is a checkout. Solved.

But agents don't only write code — they run migrations, seed tables, and issue UPDATEs. And there, "go back" is a backup-restore incident. An agent that drops the wrong column or runs an UPDATE without a WHERE has done something git can't touch. For autonomous, fast-moving agents, data is the unprotected surface.

So we gave the database the same property git gave the code: point-in-time undo, tied to the same checkpoints. Click "restore" on a past version and your schema and rows roll back with the code — to a state that actually ran.

This post is how it works: the storage model that makes it cheap, the checkpoint, the copy-on-write restore, and the cache coherence behind the compute repoint.


Why physical PITR, not "replay the migrations"

The naive version of data-undo is logical: keep the DDL, run the down-migrations. That loses your rows, can't represent arbitrary mutations, and diverges from reality the moment anything writes outside a migration. We wanted the actual bytes of the database as they were at a point in time — every row, index, sequence, and catalog entry — reconstructed exactly.

That's physical point-in-time recovery, and it's only cheap if storage and compute are separated. Adorable runs a self-hosted Neon: the pageserver stores versioned pages + WAL in object storage; compute is a stateless Postgres that reads pages from the pageserver on demand.

[Figure: architecture diagram.
 Components: App / agent; neon-conn-proxy (Rust, :5432); neon-compute (stateless PG); pageserver; safekeeper; Garage S3 (versioned pages); Vault.
 Edge labels: postgres wire, proj_{id}; scale 0→1, route; pages / WAL; WAL; creds / timeline.]
A timeline is the unit of history: a Postgres database reconstructed from page versions + WAL up to a given LSN (Log Sequence Number — Postgres's monotonic byte offset into the WAL). Branching a timeline is copy-on-write: the child shares the parent's unchanged pages and only writes a page when it diverges. Creating one is an O(1) metadata operation, independent of database size.
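To make the copy-on-write claim concrete, here is a toy model, not Neon's implementation. It assumes a page is just a string and an LSN is a plain number; the `Timeline` class and its method names are hypothetical. The point it illustrates is that branching stores only an (ancestor, LSN) pair, and reads fall through to the parent for pages the child never wrote.

```typescript
// Toy model of copy-on-write timelines (illustrative only).
type PageVersion = { lsn: number; data: string };

class Timeline {
  // Only pages written on THIS timeline are stored here.
  private pages = new Map<number, PageVersion[]>();
  private ancestor: Timeline | null;
  private ancestorLsn: number;

  constructor(ancestor: Timeline | null = null, ancestorLsn = 0) {
    this.ancestor = ancestor;
    this.ancestorLsn = ancestorLsn;
  }

  // Callers are assumed to write in increasing LSN order.
  write(pageNo: number, lsn: number, data: string): void {
    const versions = this.pages.get(pageNo) ?? [];
    versions.push({ lsn, data });
    this.pages.set(pageNo, versions);
  }

  // Latest version of a page at or before `lsn`. If this timeline has no
  // such version, fall back to the ancestor, capped at the branch point.
  read(pageNo: number, lsn: number): string | undefined {
    const versions = this.pages.get(pageNo) ?? [];
    for (let i = versions.length - 1; i >= 0; i--) {
      const v = versions[i];
      if (v && v.lsn <= lsn) return v.data;
    }
    if (!this.ancestor) return undefined;
    return this.ancestor.read(pageNo, Math.min(lsn, this.ancestorLsn));
  }

  // Branching copies no pages: it records only (ancestor, LSN).
  branchAt(lsn: number): Timeline {
    return new Timeline(this, lsn);
  }

  ownPageCount(): number {
    return this.pages.size;
  }
}

const main = new Timeline();
main.write(1, 10, "users: id, email");
main.write(1, 20, "users: id"); // the bad migration
const restored = main.branchAt(15); // fork between the two writes
```

In this sketch `restored` holds zero pages of its own right after the fork, yet reads page 1 as it was at LSN 15. A later write on `restored` lands only on the child and leaves `main` unchanged.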

That single primitive — branch a timeline from another at a specific LSN — is the whole feature. Everything below is plumbing around it.


A checkpoint is a two-field marker, not a copy

At each generation commit we record a checkpoint. The critical design choice: a checkpoint forks nothing. It's a row that remembers which timeline and which LSN, (timeline_id, lsn), plus the git commit it corresponds to. No storage is allocated; the fork is materialized lazily, only if you ever restore.

The hook lives at the end of the generation pipeline, right after the git commit, and is strictly best-effort — a time-travel failure must never break a build:

const commitResult = await commitGeneration(projectDir, { userPrompt, runSummary });
if (commitResult?.commitHash) {
  await recordCheckpointSafe(projectId, { commitSha: commitResult.commitHash });
}
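The post does not show the body of the "Safe" variant, so the following is only a sketch of what a best-effort wrapper could look like. `bestEffort`, `CheckpointInput`, and the warning format are hypothetical names chosen for illustration. The idea is that any failure in checkpoint recording becomes a logged warning and a `null` result rather than an exception that reaches the build.

```typescript
type CheckpointInput = { commitSha: string };

// Hypothetical helper: wraps a recorder so that it never throws.
function bestEffort<T>(
  record: (projectId: string, cp: CheckpointInput) => Promise<T>,
  warn: (msg: string) => void,
): (projectId: string, cp: CheckpointInput) => Promise<T | null> {
  return async (projectId, cp) => {
    try {
      return await record(projectId, cp);
    } catch (err) {
      // Swallow the error: time-travel is optional, the build is not.
      warn(`checkpoint skipped for ${projectId}@${cp.commitSha}: ${String(err)}`);
      return null;
    }
  };
}

// Usage with a recorder that always fails (stand-in for an unreachable proxy).
const warnings: string[] = [];
const flakyRecord = async (_projectId: string, _cp: CheckpointInput): Promise<string> => {
  throw new Error("proxy unreachable");
};
const safeRecord = bestEffort(flakyRecord, (msg) => warnings.push(msg));
```

Calling `safeRecord` here resolves to `null` and leaves one entry in `warnings`; the caller's pipeline continues as if checkpointing were simply unavailable.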

recordCheckpoint reads the timeline's current head LSN from the proxy and writes one row into project_db_branches:

const lsnInfo = await getBranchLsn(projectId);          // GET /api/branches/{id}/lsn
if (!lsnInfo?.timelineId) return null;                  // frontend-only app: nothing to tag
return recordBranch(projectId, {
  name: checkpointBranchName(commitSha),                // ckpt-<sha12>
  kind: "checkpoint",
  timelineId: lsnInfo.timelineId,
  createdLsn: lsnInfo.lastRecordLsn,
  commitSha,
  servesEnv: null,                                      // serves no compute until restored
});

Which LSN a checkpoint pins

recordCheckpoint pins the pageserver's last_record_lsn — the high-water mark of WAL the pageserver has durably ingested — read once via getBranchLsn. That LSN trails the compute's most recent commit by whatever WAL is still in flight (compute → safekeeper → pageserver), so a checkpoint captures everything durable on the page store at that instant, not a write that hasn't landed yet:

compute commit ─(WAL in flight)→ safekeeper ─→ pageserver: last_record_lsn
                                                ▲ checkpoint pins this durable high-water mark

In practice that's the right boundary: checkpoints are taken at generation commits (recordCheckpointSafe runs after commitGeneration), by which point the run's writes have propagated to the page store — so a later fork at the checkpoint's LSN restores exactly the state it held there.

Checkpoints accumulate (one per commit), so recordCheckpoint prunes to the newest CHECKPOINT_KEEP (50, env-tunable) rows. They're markers, so pruning a row reclaims nothing — restore depth is governed by Neon's PITR window, which is the real retention knob.
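As a rough illustration of the pruning rule, the sketch below picks which checkpoint rows fall outside the newest `keep`. The row shape (`createdAt` in particular) and the function name `checkpointsToPrune` are assumptions; the post only states that the newest CHECKPOINT_KEEP rows survive.

```typescript
type BranchRow = {
  name: string;
  kind: "checkpoint" | "branch";
  createdAt: number; // assumed: a sortable creation timestamp
};

const CHECKPOINT_KEEP = 50; // default named in the text; env-tunable there

// Returns the checkpoint rows to delete so that only the newest `keep` remain.
// Rows of other kinds are never candidates.
function checkpointsToPrune(rows: BranchRow[], keep: number = CHECKPOINT_KEEP): BranchRow[] {
  return rows
    .filter((r) => r.kind === "checkpoint")
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .slice(keep); // everything past the first `keep`
}
```

Because each row is only a marker, deleting the rows this returns frees no page storage. It just shortens the list of restore targets.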


Restore = fork at the LSN, then repoint the compute

Restoring to checkpoint C is two moves: (1) materialize the historical state as a new copy-on-write timeline, (2) point the running compute at it.

timeline T (dev):  ──S0──●──S1──▶          ● = checkpoint C's LSN (post-S0, pre-S1)
                          │
                          └─ fork ──▶  timeline T' (new dev)   ← compute repointed here
                                       contains S0, NOT S1

The dev compute now serves T' (data at S0). T is untouched — restore is non-destructive, so you can roll forward again. Here's the full path (the backend's time-travel layer → its Neon HTTP client → the Rust proxy):

Participants: backend (time-travel), neon-conn-proxy, pageserver, Vault, K8s (compute Deployment)

 1. backend → neon-conn-proxy:   POST /branches/{id}/fork {lsn, ancestorTimelineId}
 2. neon-conn-proxy → pageserver: create_timeline(ancestor_timeline_id, ancestor_start_lsn)
 3. pageserver → neon-conn-proxy: T' (copy-on-write, O(1))
 4. neon-conn-proxy → backend:   {timelineId: T'}
 5. backend → neon-conn-proxy:   POST /branches/{id}/repoint {timelineId: T'}
 6. neon-conn-proxy → Vault:     put_branch_credentials(timeline=T')   (Vault + in-mem cache, atomically)
 7. neon-conn-proxy → K8s:       scale compute 0 (force_sleeping + stop)
 8. neon-conn-proxy → K8s:       rebuild ConfigMap(neon.timeline_id=T') + scale 1
 9. neon-conn-proxy → backend:   ready
10. backend:                     recordBranch(dev, T') + setServingBranch(dev)

The fork is the pageserver primitive, exposed through the proxy. In the proxy's pageserver client, create_timeline takes an optional ancestor:

if let Some(ancestor) = ancestor_timeline_id {
    body["ancestor_timeline_id"] = json!(ancestor);
    if let Some(lsn) = ancestor_start_lsn {
        body["ancestor_start_lsn"] = json!(lsn);   // pin the branch point
    }
}

The explicit ancestorTimelineId matters for cross-timeline restore: after one restore, the active timeline is T'; restoring to an older checkpoint (on T) must fork T, not the current timeline. The checkpoint row carries the timeline id, so restoreToCheckpoint forks exactly the right ancestor.
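A small sketch of that rule, assuming the checkpoint row exposes the `timelineId` and `createdLsn` fields written by recordBranch and that the fork body has the `{lsn, ancestorTimelineId}` shape shown in the sequence above. `buildForkRequest` is a hypothetical helper name, and the LSN values are made up.

```typescript
type CheckpointRow = { timelineId: string; createdLsn: string; commitSha: string };
type ForkRequest = { lsn: string; ancestorTimelineId: string };

// The ancestor always comes from the checkpoint row itself. The timeline
// the compute is currently serving is deliberately not an input.
function buildForkRequest(checkpoint: CheckpointRow): ForkRequest {
  return {
    lsn: checkpoint.createdLsn,
    ancestorTimelineId: checkpoint.timelineId,
  };
}

// Scenario: one restore already happened, so the active timeline is T'.
const activeTimelineId = "T-prime";
const olderCheckpoint: CheckpointRow = {
  timelineId: "T", // recorded before the first restore
  createdLsn: "0/16B3748",
  commitSha: "abc123def456",
};
const forkRequest = buildForkRequest(olderCheckpoint);
```

Here `forkRequest.ancestorTimelineId` is `"T"` even though the compute is on `activeTimelineId`, which is the property the paragraph above depends on.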


Cache coherence: the proxy owns the repoint

The second move — point the compute at the new timeline — has one non-obvious requirement worth dwelling on, because it's the kind of thing that's easy to get subtly wrong.

The compute spec is rebuilt from the branch credentials stored in Vault: (timeline_id, password, …). So "point the compute at T'" means "set the timeline in the credentials, then restart the compute, which reads them and boots on T'."

The catch: the proxy caches branch credentials in memory for 5 minutes (CACHE_TTL = 300s) to keep Vault off the hot path. A write that goes straight to Vault (for instance, from the backend) leaves that cache stale, and the next wake rebuilds the compute from the cached (old) timeline. Everything looks fine: the fork exists, the DB-branch row points at it, and the compute reboots, while the data never moves. A stale cache hit you can't see.

So the repoint is owned by the proxy, not the backend. POST /api/branches/{id}/repoint calls put_branch_credentials, which writes Vault and updates the cache in one operation, then restarts the compute on the new timeline:

proxy-owned /repoint:
  proxy: put_branch_credentials = T'
    → Vault + cache updated together
    → stop → rebuild spec → start
    → compute boots T'

backend-driven (stale cache):
  backend writes Vault = T'
    → proxy cache still = T (5-min TTL)
    → wake builds spec from cache
    → compute boots T

The principle generalizes: if a value is cached behind a service, mutate it through that service. A write that bypasses the cache is a write on a clock you don't control.
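The two paths can be reproduced in a few lines. This is a minimal sketch, not the proxy's code: `Store` stands in for Vault, `CachingProxy` and its methods are hypothetical, and the clock is injected so the 300-second TTL can be exercised without waiting.

```typescript
type Creds = { timelineId: string };
type Store = { creds: Creds }; // stand-in for Vault

class CachingProxy {
  private entry: { creds: Creds; expiresAt: number } | null = null;
  private store: Store;
  private ttlMs: number;
  private now: () => number;

  constructor(store: Store, ttlMs: number, now: () => number) {
    this.store = store;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Read path: serve from cache until the TTL expires, then reload.
  get(): Creds {
    const t = this.now();
    if (this.entry && t < this.entry.expiresAt) return this.entry.creds;
    this.entry = { creds: this.store.creds, expiresAt: t + this.ttlMs };
    return this.entry.creds;
  }

  // Write-through: the store and the cache change in the same call.
  put(creds: Creds): void {
    this.store.creds = creds;
    this.entry = { creds, expiresAt: this.now() + this.ttlMs };
  }
}

let clock = 0;
const vault: Store = { creds: { timelineId: "T" } };
const proxy = new CachingProxy(vault, 300 * 1000, () => clock);

proxy.get(); // warms the cache with T
vault.creds = { timelineId: "T'" }; // backend writes the store directly
const staleRead = proxy.get().timelineId; // cache was bypassed, so this is still T
proxy.put({ timelineId: "T'" }); // proxy-owned repoint
const freshRead = proxy.get().timelineId;
```

In this model the bypassing write only becomes visible when the TTL lapses, while the write-through `put` is visible on the very next read.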

A note on the swap

repoint scales the Deployment 0 then 1; the new compute boots on T' and the proxy's bridge reconnects clients to it. A connection that re-establishes in the brief window while the previous pod is still draining settles onto the restored timeline once the new compute is serving — so callers read the post-restore state after the swap completes. (It's why the verification below polls the result rather than reading once.)


How it's verified

The fork → repoint → reboot path isn't special-cased for time-travel — it's the same branch-at-LSN primitive that deploy and rollback ride on, so every production promotion exercises it in the field. The checkpoint, fork, and post-restore reconciliation logic are covered by the platform's branch and restore-reconcile test suites (db-branches.test.js, db-restore-reconcile.test.js), which stand up the real branch operations rather than mocking them.


Design bounds

Restore depth is the PITR window. Checkpoint rows are cheap, but you can only fork an LSN the pageserver still retains, so how far back you can restore is governed by Neon's history window. The checkpoint row cap and the PITR window are kept aligned to the depth we offer.
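A restore could be pre-checked by comparing LSNs, sketched below. Postgres prints an LSN as two hexadecimal numbers separated by a slash (for example `0/16B3748`), high 32 bits first. Everything else here is an assumption: the helper names are hypothetical, and the post does not say how the oldest retained LSN is obtained, so it is simply passed in.

```typescript
// An LSN as [high 32 bits, low 32 bits].
type Lsn = [number, number];

function parseLsn(text: string): Lsn {
  const m = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/.exec(text);
  if (!m) throw new Error(`not an LSN: ${text}`);
  return [parseInt(m[1] ?? "", 16), parseInt(m[2] ?? "", 16)];
}

// Negative if a < b, zero if equal, positive if a > b.
function compareLsn(a: Lsn, b: Lsn): number {
  return a[0] !== b[0] ? a[0] - b[0] : a[1] - b[1];
}

// Hypothetical pre-check: a checkpoint is restorable only if its LSN is
// at or after the oldest LSN still retained (supplied by the caller).
function canFork(checkpointLsn: string, oldestRetainedLsn: string): boolean {
  return compareLsn(parseLsn(checkpointLsn), parseLsn(oldestRetainedLsn)) >= 0;
}
```

A check like this would let a restore UI grey out checkpoints that have aged past the window instead of failing at fork time.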


Why it's worth it

For a human, this is a nice undo button. For an agent, it's closer to a prerequisite: it's what makes confident, fast, irreversible data operations safe to delegate. The agent can run the scary migration because the platform can take it back — code and data, to a point that actually ran. Git did this for code a long time ago. The database just needed the same primitive: a branch at an LSN.
