Scale-to-zero Postgres: waking a database mid-connection
The pitch is simple: every app you build gets its own real Postgres, and it costs nothing while nobody's using it. In a world where most apps are half-finished experiments or quietly waiting for their first user, you cannot pay for idle database compute — there's too much of it.
"Scale to zero" is easy to say and annoying to do, because the moment you remove
the compute you've broken the contract every Postgres client assumes: connect
and it answers. So we put a small Rust service in front — neon-conn-proxy —
that makes the gap invisible. It speaks the Postgres wire protocol, and when a
connection arrives for a sleeping database it parks the client, scales the
compute from 0 to 1, and forwards the query once Postgres is live. To the app
it's a slightly slow first query. To the bill, it's nothing while the app is
quiet.
This works because Neon separates storage from compute: the pageserver holds
the data in object storage; compute is a stateless Postgres pod that reads
pages on demand. Scaling compute to zero loses nothing — the data is in the
pageserver. Each project's compute is a K8s Deployment we scale between 0 and
1 replicas.
The interesting parts split into three concerns inside the proxy: the TCP/wire path, the per-project state machine and single-flight wake, and the K8s scaling. Let's walk the path of one connection.
1. Parse just enough of the wire protocol to route
A Postgres client's first move is the startup message: a length-prefixed
blob of key/value parameters, including database and user. There's no TLS
negotiation we need to MITM and no auth yet — we only need one field, the
database name, which we've overloaded to carry the project id: proj_{id}.
// handle_connection: the TCP/wire path
let first = postgres::read_startup(&mut client).await?;
match first {
    ClientFirstMessage::Startup(startup) => {
        let db_name = startup.database().unwrap_or("postgres");
        let project_id = match db_name.strip_prefix("proj_") {
            Some(id) => id.to_string(),
            None => { /* send a Postgres ErrorResponse and bail */ }
        };
        queue.ensure_ready(&project_id).await?; // <-- may wake the compute
        // ... connect upstream, then forward
    }
    ClientFirstMessage::Cancel { .. } => Ok(()), // cancel reqs carry no routing info
}
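For readers who want to see the framing itself, here is a minimal std-only sketch of classifying the first message and pulling out the project id. The layout (self-inclusive Int32 length, Int32 code, null-terminated key/value pairs) and the three codes come from the Postgres protocol; the names FirstMessage, parse_first_message and project_id are illustrative and are not the proxy's actual postgres::read_startup. SslRequest is listed only for completeness; how the proxy treats it is not covered here.

```rust
/// What the first client message turned out to be.
#[derive(Debug, PartialEq)]
enum FirstMessage {
    /// Protocol 3.0 startup with its key/value parameters.
    Startup(Vec<(String, String)>),
    /// CancelRequest: backend PID and secret key, no parameters.
    Cancel { pid: u32, secret: u32 },
    /// SSLRequest: the client asks to upgrade to TLS first.
    SslRequest,
}

const PROTOCOL_V3: u32 = 196_608; // 3 << 16
const CANCEL_CODE: u32 = 80_877_102;
const SSL_CODE: u32 = 80_877_103;

fn be_u32(b: &[u8]) -> u32 {
    u32::from_be_bytes([b[0], b[1], b[2], b[3]])
}

/// Parses one complete first message held in `buf`.
fn parse_first_message(buf: &[u8]) -> Result<FirstMessage, String> {
    if buf.len() < 8 {
        return Err("short message".into());
    }
    // The length prefix counts its own four bytes.
    let len = be_u32(&buf[0..4]) as usize;
    if len != buf.len() {
        return Err("length prefix does not match buffer".into());
    }
    match be_u32(&buf[4..8]) {
        SSL_CODE => Ok(FirstMessage::SslRequest),
        CANCEL_CODE => {
            if len != 16 {
                return Err("bad cancel length".into());
            }
            Ok(FirstMessage::Cancel {
                pid: be_u32(&buf[8..12]),
                secret: be_u32(&buf[12..16]),
            })
        }
        PROTOCOL_V3 => {
            // Body: key\0value\0 ... followed by one extra \0.
            let mut parts = buf[8..].split(|&b| b == 0);
            let mut params = Vec::new();
            loop {
                let key = match parts.next() {
                    Some(k) if !k.is_empty() => k,
                    _ => break, // empty key marks the terminator
                };
                let value = parts.next().ok_or("parameter without value")?;
                params.push((
                    String::from_utf8_lossy(key).into_owned(),
                    String::from_utf8_lossy(value).into_owned(),
                ));
            }
            Ok(FirstMessage::Startup(params))
        }
        other => Err(format!("unsupported protocol code {other}")),
    }
}

/// Pulls the project id out of a `proj_{id}` database name.
fn project_id(params: &[(String, String)]) -> Option<&str> {
    params
        .iter()
        .find(|(k, _)| k == "database")
        .and_then(|(_, v)| v.strip_prefix("proj_"))
}
```

A real reader would first pull the four length bytes off the socket, bound the length, and only then read the rest; the sketch assumes the whole message is already in the buffer.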
Two wire-protocol details matter:
- Cancel requests (CancelRequest) look like startup messages but carry a backend PID/secret instead of parameters. They have no routing info, so we drop them rather than guess.
- We rewrite the database name back to postgres before forwarding. The app connects to proj_abc12; the real Postgres only knows postgres. So after the wake we splice a rewritten startup message onto the upstream socket:
let rewritten = postgres::rewrite_database(&startup, &config.neon_db_name); // "postgres"
upstream.write_all(&rewritten).await?;
The proxy never participates in auth — it forwards the startup (rewritten) and then becomes a dumb pipe. The client authenticates directly against the compute.
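The rewrite is more than a string swap, because the startup message carries a self-inclusive length prefix that has to be recomputed. A rough sketch of what such a rewrite could look like, assuming a protocol 3.0 startup message already held in memory; rewrite_database_param is a hypothetical stand-in, not the proxy's postgres::rewrite_database:

```rust
/// Rebuilds a startup message with the `database` parameter replaced.
/// Returns None if the input is not a well-formed startup message.
fn rewrite_database_param(msg: &[u8], new_db: &str) -> Option<Vec<u8>> {
    // 4 bytes length + 4 bytes version + at least the final terminator.
    if msg.len() < 9 || *msg.last()? != 0 {
        return None;
    }
    let declared = u32::from_be_bytes([msg[0], msg[1], msg[2], msg[3]]) as usize;
    if declared != msg.len() {
        return None;
    }
    // Keep the 4-byte protocol version untouched.
    let mut body = msg[4..8].to_vec();
    // Walk the key\0value\0 pairs, dropping the final terminator byte.
    let mut fields = msg[8..msg.len() - 1].split(|&b| b == 0);
    let mut saw_database = false;
    while let Some(key) = fields.next() {
        if key.is_empty() {
            break;
        }
        let value = fields.next()?;
        body.extend_from_slice(key);
        body.push(0);
        if key == b"database" {
            body.extend_from_slice(new_db.as_bytes());
            saw_database = true;
        } else {
            body.extend_from_slice(value);
        }
        body.push(0);
    }
    if !saw_database {
        // Postgres would otherwise default the database to the user name.
        body.extend_from_slice(b"database\0");
        body.extend_from_slice(new_db.as_bytes());
        body.push(0);
    }
    body.push(0); // terminator after the last pair
    // The length prefix counts itself, so recompute it from the new body.
    let mut out = ((body.len() + 4) as u32).to_be_bytes().to_vec();
    out.extend_from_slice(&body);
    Some(out)
}
```

Every other parameter (user, application_name, options) passes through byte for byte, which is what lets the client authenticate against the compute as if the proxy were not there.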
2. Single-flight wake: a tiny per-project state machine
The wake itself can't be naive. If ten connections hit a sleeping database at
once, you want one wake, not ten races to scale the same Deployment. So each
project has an entry with a three-state machine and a tokio::sync::Notify that
late arrivers wait on:
// per-project state machine
struct ProjectEntry {
    state: Mutex<(ProjectState, Option<String>)>, // state + last wake error, one lock
    wake_notify: Notify,
    active_conns: AtomicUsize,
    queued_conns: AtomicUsize,
    last_wake_at: Mutex<Option<Instant>>, // idle-sweep grace period
}
ensure_ready reads the state and dispatches:
- Running → verify it's actually running (a stale pod could've been reaped out from under us), then proceed.
- Sleeping → we are the waker: do_wake.
- Waking → someone else is waking it: wait_for_wake (park on the Notify).
The waker (do_wake) flips the state to Waking, runs the wake body, and on
completion flips to Running (or back to Sleeping with an error) and calls
notify_waiters() to release everyone parked. State transitions live in one
place so every error path — including ? early-returns on missing credentials —
leaves the entry clean. (An earlier version did ? straight through and left
entries stuck in Waking forever; that produced zombie projects nothing could
wake.)
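To make the shape concrete without pulling in tokio, here is a small thread-based sketch of the same three-state machine. It deliberately swaps tokio tasks and Notify for std threads and a Condvar, so it illustrates the single-flight and cleanup rules rather than the proxy's async code; Entry and the closure-based ensure_ready are illustrative names.

```rust
use std::sync::{Condvar, Mutex};

#[derive(Clone, Copy, Debug, PartialEq)]
enum ProjectState {
    Sleeping,
    Waking,
    Running,
}

/// Thread-based stand-in for the per-project entry.
struct Entry {
    state: Mutex<(ProjectState, Option<String>)>, // state + last wake error
    woke: Condvar,
}

impl Entry {
    fn new() -> Self {
        Entry {
            state: Mutex::new((ProjectState::Sleeping, None)),
            woke: Condvar::new(),
        }
    }

    /// Returns once the project is Running, or returns the wake error.
    /// `wake_body` runs at most once per wake, however many callers arrive.
    fn ensure_ready<F>(&self, wake_body: F) -> Result<(), String>
    where
        F: FnOnce() -> Result<(), String>,
    {
        let mut guard = self.state.lock().unwrap();
        let current = guard.0;
        match current {
            ProjectState::Running => return Ok(()),
            ProjectState::Waking => {
                // Someone else is the waker: park until the state changes.
                while guard.0 == ProjectState::Waking {
                    guard = self.woke.wait(guard).unwrap();
                }
                return match guard.0 {
                    ProjectState::Running => Ok(()),
                    _ => Err(guard.1.clone().unwrap_or_else(|| "wake failed".into())),
                };
            }
            // We are the waker.
            ProjectState::Sleeping => *guard = (ProjectState::Waking, None),
        }
        drop(guard); // never hold the lock across the slow wake

        let result = wake_body();

        // The only place Waking is left, on success and on every error.
        let mut guard = self.state.lock().unwrap();
        *guard = match &result {
            Ok(()) => (ProjectState::Running, None),
            Err(e) => (ProjectState::Sleeping, Some(e.clone())),
        };
        drop(guard);
        self.woke.notify_all();
        result
    }
}
```

Because the wake body returns a Result that is inspected in one place, an early return inside it cannot skip the state reset, which is the property the stuck-in-Waking bug violated.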
Connection limits are enforced with a compare-and-swap so check-and-increment is atomic under concurrency:
loop {
    let current = entry.active_conns.load(Acquire);
    if current >= max { return Err("too many connections".into()); }
    if entry.active_conns.compare_exchange_weak(current, current + 1, AcqRel, Relaxed).is_ok() {
        break;
    }
}
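The increment is only half of the accounting; the slot also has to be given back on every exit path. One way to sketch that, assuming an RAII guard (ConnGuard is a hypothetical name, and the proxy may well decrement differently), is to pair the bounded increment with a Drop impl. The standard library's fetch_update wraps the same compare-and-swap retry loop:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Holds one connection slot; the slot is returned when the guard drops.
struct ConnGuard {
    active: Arc<AtomicUsize>,
}

impl ConnGuard {
    /// Atomically claims a slot, or returns None when `max` are in use.
    fn try_acquire(active: &Arc<AtomicUsize>, max: usize) -> Option<ConnGuard> {
        active
            // Retries internally until the check-and-increment lands or is refused.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max {
                    Some(n + 1)
                } else {
                    None
                }
            })
            .ok()
            .map(|_| ConnGuard {
                active: Arc::clone(active),
            })
    }
}

impl Drop for ConnGuard {
    fn drop(&mut self) {
        // Runs on normal close, on error returns, and on panics that unwind.
        self.active.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Holding the guard for the lifetime of the bridge means the idle sweep's view of active connections stays accurate even when a connection ends with an error.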
3. The wake body: scale the Deployment, poll for Postgres
do_wake_body fetches the branch credentials (timeline + password) and calls
into the compute-scaling layer, all under a wake_timeout (15s):
// start_compute_inner: the K8s scaling
let state = check_state(cfg, project_id).await;
if state == ComputeState::Running {
    // fast path: pod already up, skip the K8s check and TCP-poll directly
    wait_for_ready(cfg, project_id, Duration::from_secs(30), /*skip_k8s_check*/ true).await?;
    return Ok(());
}
let spec = build_compute_spec(cfg, tenant_id, timeline_id, password); // compute_ctl spec
update_configmap(&client, &ns, &spec_json).await?; // mount the spec
patch_replicas(&client, &ns, COMPUTE_DEPLOY_NAME, 1).await?; // scale 0 → 1
// cold path: wait for pod readiness, then TCP-poll :55432
wait_for_ready(cfg, project_id, Duration::from_secs(60), /*skip_k8s_check*/ false).await?;
wait_for_ready does cold-start the smart way: for a fresh pod it checks K8s pod
readiness first (don't TCP-poll a pod that's still Pending), then TCP-connects
to confirm Postgres is actually accepting connections. For a cgroup-thaw wake
(pod already Running) it skips the K8s round-trip and goes straight to TCP.
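The TCP half of that check is a plain poll-until-deadline loop. A blocking, std-only sketch is below; the proxy's version is async and its signature differs, so wait_for_tcp and its parameters are illustrative. Note that a successful connect only shows the port is accepting, which is the level of confirmation described above.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Polls `addr` until a TCP connect succeeds or `deadline` elapses.
/// Returns true if something accepted the connection in time.
fn wait_for_tcp(addr: SocketAddr, deadline: Duration, interval: Duration) -> bool {
    let start = Instant::now();
    loop {
        let remaining = match deadline.checked_sub(start.elapsed()) {
            Some(d) if !d.is_zero() => d,
            _ => return false, // out of time
        };
        // Cap each attempt so one black-holed connect cannot eat the whole budget.
        let attempt = remaining.min(Duration::from_secs(1));
        if TcpStream::connect_timeout(&addr, attempt).is_ok() {
            return true;
        }
        // Refused or timed out: back off briefly, then try again.
        thread::sleep(interval.min(remaining));
    }
}
```

Checking pod readiness first, as the cold path does, mainly saves these connect attempts from being spent on a pod that cannot possibly answer yet.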
Once ready, the upstream socket is opened and the proxy bridges bidirectionally — with an idle timeout, so a connection that goes silent doesn't pin the compute awake forever and block scale-to-zero:
// bridge loop, simplified
tokio::time::timeout(idle_timeout, async {
    tokio::select! {
        r = client_read.read(&mut buf) => { /* n==0 → closed; else write upstream */ }
        r = upstream_read.read(&mut buf2) => { /* n==0 → closed; else write client */ }
    }
}).await // Err(_) → "connection idle timeout", close → lets the idle sweep reclaim it
A background idle sweep does the other half: every minute it scans projects
in Running with zero active connections, and if they've been idle past the
timeout (and aren't pinned — e.g. mid-generation) it scales them back to 0. A
short post-wake grace (last_wake_at) keeps it from instantly killing a compute
that just finished waking.
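The sweep's per-project decision is easy to isolate as a pure function, which also makes it easy to test. This is a sketch under assumptions: the struct and field names are made up for illustration, and how the real sweep tracks the last moment of activity is not shown in this post.

```rust
use std::time::{Duration, Instant};

/// Snapshot of the fields the sweep looks at (names are illustrative).
struct ProjectSnapshot {
    running: bool,
    active_conns: usize,
    pinned: bool,
    last_activity: Instant,
    last_wake_at: Option<Instant>,
}

struct SweepConfig {
    idle_timeout: Duration,
    post_wake_grace: Duration,
}

/// True if the sweep should scale this project's compute back to 0.
fn should_scale_down(p: &ProjectSnapshot, cfg: &SweepConfig, now: Instant) -> bool {
    // Only Running projects with no connections and no pin are candidates.
    if !p.running || p.active_conns > 0 || p.pinned {
        return false;
    }
    // Still inside the grace window after a wake: leave it alone.
    if let Some(woke) = p.last_wake_at {
        if now.saturating_duration_since(woke) < cfg.post_wake_grace {
            return false;
        }
    }
    now.saturating_duration_since(p.last_activity) >= cfg.idle_timeout
}
```

Passing now in as a parameter, rather than calling Instant::now() inside, keeps the rule deterministic under test.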
Single-flight wake has one correctness rule
The waiting half hinges on a subtle tokio::sync::Notify property, and it's
worth being precise about: notify_waiters() stores no permit. It only wakes
Notified futures that already exist when it's called; a waiter that shows up
afterwards gets nothing.
So the naive shape, which creates the Notified future only at the .await,
after the state has already been read, has a gap. Between ensure_ready reading
state == Waking and the waiter calling .notified(), the waker can fire
notify_waiters(). The late waiter misses it and then blocks for the full
timeout (wake_timeout + 5s):
// naive: the Notified future is only created at the await, after the state check (too late)
let result = tokio::time::timeout(
    wake_timeout + Duration::from_secs(5),
    entry.wake_notify.notified(),
).await;
The correct shape is the documented tokio idiom: bind and enable() the
Notified future before re-checking the condition, so you're a registered
waiter first, then look at the state:
let notified = entry.wake_notify.notified();
tokio::pin!(notified);
notified.as_mut().enable(); // register interest NOW
{
    let guard = entry.state.lock().await;
    match guard.0 {
        ProjectState::Running => return Ok(()), // already done
        ProjectState::Sleeping => return Err(/* wake error */), // already failed
        ProjectState::Waking => {} // still waking → await below
    }
}
let result = tokio::time::timeout(wake_timeout + Duration::from_secs(5), notified).await;
Now there's no gap: either the state re-check sees the wake already finished
(return immediately), or we're a registered waiter and notify_waiters() reaches
us. The rule, stated plainly: with Notify, register interest, then check the
condition, then await — checking before registering is a race every time.
Why bother: the economics
Separating storage from compute and putting a wake-proxy in front turns the cost model upside down. A project that nobody's using costs zero compute — the pod is gone, not idling. Launch a hundred experiments; pay for the handful that get traffic. The first query after a quiet period eats a cold start (single-digit seconds); every query after is direct (the proxy is a pass-through once the bridge is up). That tradeoff — a slow first query for a free-while-idle database — is exactly the one you want when an AI agent is spinning up databases by the thousand. (The app pods above these databases get the same treatment by a different mechanism — the cgroup v2 freezer — because for a warm dev server, unlike a stateless compute, the cheapest wake is the one that never has to reboot anything.)
Tradeoffs
- Cold-start latency. The first connection after sleep pays the scale-0→1 + Postgres-boot cost. Parking the connection hides it, but it's real; latency-sensitive paths keep the compute warm (or pin it).
- The wake path is the hot path. Every connection to every project goes through ensure_ready, so its concurrency model is load-bearing across every database, which is exactly why the waiter-registration rule above is stated so precisely.
- Idle-sweep vs. long-lived idle connections. A connection that's open but silent would pin the compute awake; the per-connection idle timeout in the bridge is what lets the sweep reclaim it, tuned against client keepalive behavior.