The database · Part 2 of 6

Scale-to-zero Postgres: a Rust proxy that wakes your database mid-connection

Every app on the platform gets its own Postgres that costs nothing while idle — the compute scales to zero. The trick is waking it transparently on the next connection: a Rust proxy speaks the Postgres wire protocol, parks the client, scales a K8s Deployment 0→1, and forwards the query once it's live. Here's the wire parsing, the single-flight wake, and the tokio Notify registration discipline that keeps the wake correct under concurrency.

The pitch is simple: every app you build gets its own real Postgres, and it costs nothing while nobody's using it. In a world where most apps are half-finished experiments or quietly waiting for their first user, you cannot pay for idle database compute — there's too much of it.

"Scale to zero" is easy to say and annoying to do, because the moment you remove the compute you've broken the contract every Postgres client assumes: connect and it answers. So we put a small Rust service in front — neon-conn-proxy — that makes the gap invisible. It speaks the Postgres wire protocol, and when a connection arrives for a sleeping database it parks the client, scales the compute from 0 to 1, and forwards the query once Postgres is live. To the app it's a slightly slow first query. To the bill, it's nothing while the app is quiet.

This works because Neon separates storage from compute: the pageserver holds the data in object storage; compute is a stateless Postgres pod that reads pages on demand. Scaling compute to zero loses nothing — the data is in the pageserver. Each project's compute is a K8s Deployment we scale between 0 and 1 replicas.

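[Diagram: an app or agent opens a connection to proj_abc12 → neon-conn-proxy (Rust, :5432) → K8s API, which scales the Deployment neon-compute (namespace adorable-abc12) 0→1 → neon-compute pod (stateless Postgres), backed by the pageserver (data). The proxy then forwards the rewritten connection to the pod.]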

The interesting parts split into three concerns inside the proxy: the TCP/wire path, the per-project state machine and single-flight wake, and the K8s scaling. Let's walk the path of one connection.


1. Parse just enough of the wire protocol to route

A Postgres client's first move is the startup message: a length-prefixed blob of key/value parameters, including database and user. There's no TLS negotiation we need to MITM and no auth yet — we only need one field, the database name, which we've overloaded to carry the project id: proj_{id}.

// handle_connection — the TCP/wire path
let first = postgres::read_startup(&mut client).await?;
match first {
    ClientFirstMessage::Startup(startup) => {
        let db_name = startup.database().unwrap_or("postgres");
        let project_id = match db_name.strip_prefix("proj_") {
            Some(id) => id.to_string(),
            None => { /* send a Postgres ErrorResponse and bail */ }
        };
        queue.ensure_ready(&project_id).await?;   // <-- may wake the compute
        // ... connect upstream, then forward
    }
    ClientFirstMessage::Cancel { .. } => Ok(()),  // cancel reqs carry no routing info
}
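
To make the routing step concrete, here is a minimal std-only sketch of parsing that first message. It is not the proxy's actual postgres module: FirstMessage, parse_first_message and project_id are hypothetical names, and the SslRequest arm is an assumption added for completeness. The numeric codes are the ones defined by the Postgres frontend/backend protocol.

```rust
use std::collections::HashMap;

// Request codes from the Postgres frontend/backend protocol.
const PROTOCOL_V3: u32 = 196_608; // 3 << 16, protocol 3.0
const CANCEL_REQUEST: u32 = 80_877_102; // (1234 << 16) | 5678
const SSL_REQUEST: u32 = 80_877_103; // (1234 << 16) | 5679

// Hypothetical type for this sketch, not the proxy's ClientFirstMessage.
#[derive(Debug, PartialEq)]
enum FirstMessage {
    Startup(HashMap<String, String>),
    Cancel { pid: u32, secret: u32 },
    SslRequest,
}

/// Parses one complete first message: Int32 length (self-inclusive),
/// Int32 code, then a code-specific payload.
fn parse_first_message(buf: &[u8]) -> Result<FirstMessage, String> {
    if buf.len() < 8 {
        return Err("short message".into());
    }
    let be32 = |at: usize| u32::from_be_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]]);
    let len = be32(0) as usize;
    if len != buf.len() {
        return Err("length prefix does not match buffer".into());
    }
    match be32(4) {
        SSL_REQUEST => Ok(FirstMessage::SslRequest),
        CANCEL_REQUEST if len == 16 => Ok(FirstMessage::Cancel {
            pid: be32(8),
            secret: be32(12),
        }),
        PROTOCOL_V3 => {
            // NUL-terminated key/value strings, closed by one extra NUL.
            let mut fields = buf[8..]
                .split(|b| *b == 0)
                .map(|s| String::from_utf8_lossy(s).into_owned());
            let mut params = HashMap::new();
            while let Some(key) = fields.next() {
                if key.is_empty() {
                    break; // the terminating NUL
                }
                let value = fields.next().ok_or("parameter without value")?;
                params.insert(key, value);
            }
            Ok(FirstMessage::Startup(params))
        }
        other => Err(format!("unsupported protocol code {other}")),
    }
}

/// Extracts the project id from a `proj_{id}` database name.
fn project_id(params: &HashMap<String, String>) -> Option<&str> {
    params.get("database")?.strip_prefix("proj_")
}
```

Fed a startup message whose database parameter is proj_abc12, project_id returns abc12; a 16-byte cancel request yields only a PID and secret, which is why there is nothing to route on.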

Two wire-protocol details matter:

  • Cancel requests (CancelRequest) look like startup messages but carry a backend PID/secret instead of parameters. They have no routing info, so we drop them rather than guess.
  • We rewrite the database name back to postgres before forwarding. The app connects to proj_abc12; the real Postgres only knows postgres. So after the wake we splice a rewritten startup message onto the upstream socket:
let rewritten = postgres::rewrite_database(&startup, &config.neon_db_name); // "postgres"
upstream.write_all(&rewritten).await?;

The proxy never participates in auth — it forwards the startup (rewritten) and then becomes a dumb pipe. The client authenticates directly against the compute.
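
The rewrite itself is small. The sketch below assumes a protocol-3.0 startup message held as raw bytes and rebuilds it with the database value swapped; the real postgres::rewrite_database takes a parsed startup and may differ, so treat this signature as hypothetical.

```rust
/// Rebuilds a protocol-3.0 startup message with the value of the
/// `database` parameter replaced by `new_db`. Hypothetical signature.
fn rewrite_database(startup: &[u8], new_db: &str) -> Result<Vec<u8>, String> {
    if startup.len() < 8 {
        return Err("short startup message".into());
    }
    // Keep the 4-byte protocol version; the length prefix is recomputed below.
    let mut body = startup[4..8].to_vec();
    let mut fields = startup[8..].split(|b| *b == 0);
    while let Some(key) = fields.next() {
        if key.is_empty() {
            break; // the terminating NUL
        }
        let value = fields.next().ok_or("parameter without value")?;
        body.extend_from_slice(key);
        body.push(0);
        if key == b"database" {
            body.extend_from_slice(new_db.as_bytes());
        } else {
            body.extend_from_slice(value);
        }
        body.push(0);
    }
    body.push(0); // terminator after the last key/value pair
    // The length prefix counts itself, hence the + 4.
    let mut out = ((body.len() + 4) as u32).to_be_bytes().to_vec();
    out.extend_from_slice(&body);
    Ok(out)
}
```

Because the new name has a different length than proj_abc12, the length prefix has to be recomputed; patching the bytes in place would corrupt the message.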


2. Single-flight wake: a tiny per-project state machine

The wake itself can't be naive. If ten connections hit a sleeping database at once, you want one wake, not ten races to scale the same Deployment. So each project has an entry with a three-state machine and a tokio::sync::Notify that late arrivers wait on:

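[State diagram: Sleeping → Waking on the first connection (do_wake); Waking → Running on compute ready → notify_waiters(); Waking → Sleeping on wake failed → notify_waiters(err); Running → Sleeping on the idle sweep (0 conns, timed out); Running → Running on subsequent connections (verify + forward).]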

// per-project state machine
struct ProjectEntry {
    state: Mutex<(ProjectState, Option<String>)>, // state + last wake error, one lock
    wake_notify: Notify,
    active_conns: AtomicUsize,
    queued_conns: AtomicUsize,
    last_wake_at: Mutex<Option<Instant>>,         // idle-sweep grace period
}

ensure_ready reads the state and dispatches:

  • Running → verify it's actually running (a stale pod could've been reaped out from under us), then proceed.
  • Sleeping → we are the waker: do_wake.
  • Waking → someone else is waking it: wait_for_wake (park on the Notify).

The waker (do_wake) flips the state to Waking, runs the wake body, and on completion flips to Running (or back to Sleeping with an error) and calls notify_waiters() to release everyone parked. State transitions live in one place so every error path — including ? early-returns on missing credentials — leaves the entry clean. (An earlier version did ? straight through and left entries stuck in Waking forever; that produced zombie projects nothing could wake.)
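
The shape of that state machine can be sketched without an async runtime. The version below is a synchronous analogue, not the proxy's code: it swaps tokio's Notify for std's Condvar, takes the wake body as a closure, and skips the "verify it's actually running" step. Entry and wake_done are names made up for the sketch.

```rust
use std::sync::{Condvar, Mutex};

#[derive(Clone, Copy, Debug, PartialEq)]
enum ProjectState {
    Sleeping,
    Waking,
    Running,
}

// Hypothetical stand-in for ProjectEntry, using blocking primitives.
struct Entry {
    state: Mutex<(ProjectState, Option<String>)>, // state + last wake error
    wake_done: Condvar,                           // stands in for the Notify
}

impl Entry {
    fn new() -> Self {
        Entry {
            state: Mutex::new((ProjectState::Sleeping, None)),
            wake_done: Condvar::new(),
        }
    }

    fn ensure_ready<F>(&self, wake_body: F) -> Result<(), String>
    where
        F: FnOnce() -> Result<(), String>,
    {
        let mut guard = self.state.lock().unwrap();
        let current = guard.0;
        match current {
            ProjectState::Running => return Ok(()),
            ProjectState::Waking => {
                // Late arriver: wait_while checks the state under the lock,
                // so a wake that finishes in between cannot be missed.
                let guard = self
                    .wake_done
                    .wait_while(guard, |s| s.0 == ProjectState::Waking)
                    .unwrap();
                return match guard.0 {
                    ProjectState::Running => Ok(()),
                    _ => Err(guard.1.clone().unwrap_or_else(|| "wake failed".to_string())),
                };
            }
            // We are the waker: claim the transition while holding the lock.
            ProjectState::Sleeping => *guard = (ProjectState::Waking, None),
        }
        drop(guard);

        // Any early return inside wake_body lands here as a Result, so the
        // entry always leaves Waking, whatever the outcome.
        let result = wake_body();
        let mut guard = self.state.lock().unwrap();
        *guard = match &result {
            Ok(()) => (ProjectState::Running, None),
            Err(e) => (ProjectState::Sleeping, Some(e.clone())),
        };
        drop(guard);
        self.wake_done.notify_all();
        result
    }
}
```

Ten threads calling ensure_ready on a sleeping entry run the closure once; the other nine either park on the condition variable or arrive late enough to see Running. The point of funnelling the body through one Result is that no error path can skip the final state update.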

Connection limits are enforced with a compare-and-swap so check-and-increment is atomic under concurrency:

loop {
    let current = entry.active_conns.load(Acquire);
    if current >= max { return Err("too many connections".into()); }
    if entry.active_conns.compare_exchange_weak(current, current+1, AcqRel, Relaxed).is_ok() {
        break;
    }
}
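
One way to package that loop, sketched here under assumptions: fetch_update from the standard library runs the same compare-and-swap retry internally, and a small guard type gives the slot back on drop so no exit path can leak a count. ConnSlot and try_acquire are hypothetical names, not the proxy's API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Holds one connection slot; releases it when dropped.
struct ConnSlot<'a> {
    counter: &'a AtomicUsize,
}

impl Drop for ConnSlot<'_> {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::AcqRel);
    }
}

/// Atomically increments `counter` if it is below `max`.
/// Returns None when the limit is already reached.
fn try_acquire(counter: &AtomicUsize, max: usize) -> Option<ConnSlot<'_>> {
    counter
        // fetch_update retries the compare-and-swap until it either
        // succeeds or the closure returns None.
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
            (n < max).then_some(n + 1)
        })
        .ok()
        .map(|_| ConnSlot { counter })
}
```

With max set to 2, the third concurrent try_acquire returns None until one of the first two guards is dropped.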

3. The wake body: scale the Deployment, poll for Postgres

do_wake_body fetches the branch credentials (timeline + password) and calls into the compute-scaling layer, all under a wake_timeout (15s):

// start_compute_inner: the K8s scaling
let state = check_state(cfg, project_id).await;
if state == ComputeState::Running {
    // fast path: the pod is already up, so only confirm Postgres answers
    wait_for_ready(cfg, project_id, Duration::from_secs(30), /*skip_k8s_check*/ true).await?;
    return Ok(());
}
let spec = build_compute_spec(cfg, tenant_id, timeline_id, password); // compute_ctl spec
update_configmap(&client, &ns, &spec_json).await?;                    // mount the spec
patch_replicas(&client, &ns, COMPUTE_DEPLOY_NAME, 1).await?;          // scale 0 → 1
// cold path: check pod readiness, then TCP-poll :55432
wait_for_ready(cfg, project_id, Duration::from_secs(60), /*skip_k8s_check*/ false).await?;
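
For a sense of what the scale step sends, here is a sketch of the request a replica patch could boil down to: a JSON merge patch against the Deployment's scale subresource. Whether patch_replicas targets the scale subresource or the Deployment object itself is an assumption, and ScalePatch and scale_patch are hypothetical names.

```rust
/// Path, content type and body for a replica-count merge patch.
struct ScalePatch {
    path: String,
    content_type: &'static str,
    body: String,
}

/// Builds a PATCH request against the Deployment's scale subresource.
/// Assumption: the real patch_replicas may patch the Deployment directly.
fn scale_patch(namespace: &str, deployment: &str, replicas: u32) -> ScalePatch {
    ScalePatch {
        path: format!(
            "/apis/apps/v1/namespaces/{namespace}/deployments/{deployment}/scale"
        ),
        content_type: "application/merge-patch+json",
        // Doubled braces are literal braces in format strings.
        body: format!(r#"{{"spec":{{"replicas":{replicas}}}}}"#),
    }
}
```

Calling scale_patch("adorable-abc12", "neon-compute", 1) yields the body {"spec":{"replicas":1}}; the idle sweep would send the same request with 0.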

wait_for_ready handles cold start the smart way: for a fresh pod it checks K8s pod readiness first (don't TCP-poll a pod that's still Pending), then TCP-connects to confirm Postgres is actually accepting connections. For a cgroup-thaw wake (pod already Running) it skips the K8s round-trip and goes straight to TCP.
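
The TCP half of that check is a connect loop with a deadline. The sketch below is a blocking, std-only version for illustration; the proxy itself is async and would use tokio's equivalents. wait_for_tcp and the 500 ms / 100 ms intervals are assumptions, not values from the real code.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Polls `addr` until a TCP connect succeeds or `deadline` elapses.
/// Returns how long the wait took.
fn wait_for_tcp(addr: SocketAddr, deadline: Duration) -> Result<Duration, String> {
    let start = Instant::now();
    loop {
        let remaining = deadline.saturating_sub(start.elapsed());
        if remaining.is_zero() {
            return Err(format!("{addr} not accepting connections after {deadline:?}"));
        }
        // Cap each attempt so one unanswered connect cannot eat the whole budget.
        let attempt = remaining.min(Duration::from_millis(500));
        if TcpStream::connect_timeout(&addr, attempt).is_ok() {
            return Ok(start.elapsed());
        }
        // Refused connections fail instantly, so pause before retrying.
        thread::sleep(Duration::from_millis(100).min(remaining));
    }
}
```

A successful connect only proves something is listening on the port. That is enough to stop parking the client, since the client's own startup and auth exchange with the compute follows immediately.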

Once ready, the upstream socket is opened and the proxy bridges bidirectionally — with an idle timeout, so a connection that goes silent doesn't pin the compute awake forever and block scale-to-zero:

// bridge loop, simplified
tokio::time::timeout(idle_timeout, async {
    tokio::select! {
        r = client_read.read(&mut buf)   => { /* n==0 → closed; else write upstream */ }
        r = upstream_read.read(&mut buf2)=> { /* n==0 → closed; else write client   */ }
    }
}).await   // Err(_) → "connection idle timeout", close → lets the idle sweep reclaim it

A background idle sweep does the other half: every minute it scans projects in Running with zero active connections, and if they've been idle past the timeout (and aren't pinned — e.g. mid-generation) it scales them back to 0. A short post-wake grace (last_wake_at) keeps it from instantly killing a compute that just finished waking.
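
The sweep's decision reduces to a pure predicate, which makes it easy to test in isolation. This is a sketch under assumptions: Snapshot, SweepConfig and the last_activity field are hypothetical, since the source only names active_conns and last_wake_at.

```rust
use std::time::{Duration, Instant};

/// Point-in-time view of one project, as the sweep might see it.
struct Snapshot {
    running: bool,
    active_conns: usize,
    pinned: bool,                  // e.g. mid-generation
    last_activity: Instant,        // assumed field: when the last connection closed
    last_wake_at: Option<Instant>, // set when a wake completes
}

struct SweepConfig {
    idle_timeout: Duration,
    wake_grace: Duration,
}

/// True when the compute should be scaled back to 0 replicas.
fn should_scale_to_zero(p: &Snapshot, cfg: &SweepConfig, now: Instant) -> bool {
    let idle_long_enough =
        now.saturating_duration_since(p.last_activity) >= cfg.idle_timeout;
    // A compute that just finished waking gets a short grace period.
    let in_grace = p
        .last_wake_at
        .map_or(false, |t| now.saturating_duration_since(t) < cfg.wake_grace);
    p.running && p.active_conns == 0 && !p.pinned && idle_long_enough && !in_grace
}
```

Passing now in as a parameter, rather than calling Instant::now() inside, keeps the predicate deterministic for tests.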


Single-flight wake has one correctness rule

The waiting half hinges on a subtle tokio::sync::Notify property, and it's worth being precise about: notify_waiters() only wakes Notified futures that already exist when it's called. Unlike notify_one(), it stores no permit for a waiter that shows up later.

So the naive shape (create the Notified future only at the .await, after the state check) has a gap. Between ensure_ready reading state == Waking and the waiter calling .notified(), the waker can fire notify_waiters(). A waiter that isn't registered yet misses it and then blocks for the whole timeout:

// naive: the Notified future is created only here, after the state check (too late)
let result = tokio::time::timeout(
    wake_timeout + Duration::from_secs(5),
    entry.wake_notify.notified(),
).await;
[Sequence diagram, naive (registers too late): the late waiter's ensure_ready sees state=Waking; the waker (do_wake) sees compute ready → notify_waiters(); the late waiter now calls .notified(), misses the signal, and blocks until wake_timeout.]

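The correct shape is the documented tokio idiom: bind and enable() the Notified future before re-checking the condition, so you're a registered waiter first, then look at the state. (Per the tokio docs, a Notified future receives notify_waiters() wakeups from the moment it is created; enable() makes the registration explicit and also covers notify_one().)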

let notified = entry.wake_notify.notified();
tokio::pin!(notified);
notified.as_mut().enable();                 // register interest NOW

{
    let guard = entry.state.lock().await;
    match guard.0 {
        ProjectState::Running  => return Ok(()),                 // already done
        ProjectState::Sleeping => return Err(/* wake error */),  // already failed
        ProjectState::Waking   => {}                              // still waking → await below
    }
}

let result = tokio::time::timeout(wake_timeout + Duration::from_secs(5), notified).await;

Now there's no gap: either the state re-check sees the wake already finished (return immediately), or we're a registered waiter and notify_waiters() reaches us. The rule, stated plainly: with Notify, register interest, then check the condition, then await — checking before registering is a race every time.

[Sequence diagram, correct (register, then check): the late waiter calls notified().enable() and is registered, then re-checks the state. If the state is already Running, it returns immediately; if it is still Waking, the waker's notify_waiters() reaches the registered waiter.]

Why bother: the economics

Separating storage from compute and putting a wake-proxy in front turns the cost model upside down. A project that nobody's using costs zero compute — the pod is gone, not idling. Launch a hundred experiments; pay for the handful that get traffic. The first query after a quiet period eats a cold start (single-digit seconds); every query after is direct (the proxy is a pass-through once the bridge is up). That tradeoff — a slow first query for a free-while-idle database — is exactly the one you want when an AI agent is spinning up databases by the thousand. (The app pods above these databases get the same treatment by a different mechanism — the cgroup v2 freezer — because for a warm dev server, unlike a stateless compute, the cheapest wake is the one that never has to reboot anything.)

Tradeoffs

  • Cold-start latency. The first connection after sleep pays the scale-0→1 + Postgres-boot cost. Parking the connection hides it, but it's real; latency-sensitive paths keep the compute warm (or pin it).
  • The wake path is the hot path. Every connection to every project goes through ensure_ready, so its concurrency model is load-bearing across every database — which is exactly why the waiter-registration rule above is stated so precisely.
  • Idle-sweep vs. long-lived idle connections. A connection that's open but silent would pin the compute awake; the per-connection idle timeout in the bridge is what lets the sweep reclaim it, tuned against client keepalive behavior.
