The runtime, Part 2 of 2

Pausing a thousand idle apps: KEDA, CRIU, and the cgroup v2 freezer

Every app on the platform must cost nothing while nobody's looking at it — and come back instantly when someone does. We chose containers over microVMs, then tried three ways to make idle free: scale-to-zero with KEDA, checkpoint/restore with CRIU, and finally the cgroup v2 freezer with swap reclaim. The first two taught us, in very different ways, what the constraint actually was. This is the chronicle of the experiments, the failure modes we hit, and the design that survived.

An AI app builder has a brutal idle profile. Users generate apps the way they open browser tabs: enthusiastically, in bulk, and then they walk away. At any given moment the overwhelming majority of apps on the platform are doing nothing — no requests, no users, no reason to exist as running processes. But each one is a Vite dev server plus a Node process tree holding a few hundred megabytes of RAM, and the owner expects it to be right there when they come back.

So the requirement is two-sided, and the sides fight each other:

  1. An idle app must cost ~zero. CPU and, critically, RAM — memory is the binding resource on a node packed with hundreds of namespaces.
  2. Wake must feel instant. The user clicked their preview link. A spinner that lasts 30 seconds reads as "my app is broken," not "my app was asleep."

We went through one foundational decision and three runtime architectures to satisfy both. The decision — containers, not VMs — set the option space; the three architectures were real implementations that ran on the platform, each failure narrowing the space until only one shape was left. The progression is worth recounting because the lesson generalizes: the cheapest way to make something stop costing resources is to not stop it at all.

VMs or containers: the first fork in the road

Every platform that runs other people's workloads faces this question exactly once, and the answer shapes everything downstream. The fashionable answer in 2026 is the microVM — the Firecracker / Fly-machines shape: every app gets a real virtual machine with its own kernel, booted on demand, destroyed on idle. It's a respectable architecture, and it's worth being precise about what it buys, because we deliberately walked away from it.

What VMs genuinely offer: a hardware-enforced isolation boundary. A guest kernel panic, a container-escape exploit, a kernel-resource exhaustion attack — all of it stops at the hypervisor. If your tenants upload arbitrary hostile binaries, that boundary is close to non-negotiable.

What VMs cost, for this workload specifically:

  • Density is the business model, and RAM doesn't share across kernels. A thousand mostly-idle apps on a node only works if their memory is one kernel's problem. Each microVM carries its own guest kernel, its own page cache, its own copy of every shared library mapping — memory that one shared kernel deduplicates across containers for free. When the binding resource is RAM and the fleet is 95% idle, paying a per-app kernel tax is paying the tax precisely where it hurts.
  • The microVM solves the wrong cold start. Firecracker's 125ms boot is a marvel, and for our workload it's a rounding error. A generated app's wake time is dominated by the guest warm-up: Node startup, Vite's dependency scan, the module graph, the JIT. VM-per-app pays for hypervisor orchestration and still eats the 10–30 seconds that actually matter. The cold-start problem we needed to solve lives above the kernel, where a hypervisor can't reach it — but, as we'll see, the kernel can.
  • The agent writes to a live filesystem. This one is unique to an AI app builder, and it was decisive. Generation and preview share one read-write volume (/data/projects): the agent writes a file, the running pod's Vite picks it up via HMR, the user sees the change — no rebuild, no image, no deploy. An ephemeral-VM model means baking images or syncing filesystems on every iteration, which turns the platform's fastest, most-executed loop — agent writes code, user sees it — into its slowest.
  • Kubernetes is a lifecycle machine we didn't have to build. A namespace per app gives us quotas, limit ranges, secrets injection, ingress routing, RBAC, and a watch stream for real-time status — the entire day-2 operations surface — as configuration rather than code. The VM world has orchestrators too, but none with this depth of ecosystem for free.

And the argument that only became visible later, the one this whole post builds to: a container is a cgroup, and a cgroup is something the kernel can operate on. Freeze it, weigh it, swap it out, thaw it — uniformly, across every app on the node, with file writes. A VM is opaque to the host kernel by design; the isolation boundary that protects the workload also seals it off from exactly the kind of fleet-wide, kernel-mediated resource surgery that makes thousands-of-idle-apps economics work. VM suspend/resume exists, but managing a thousand suspended memory images is a storage subsystem you build and operate; managing a thousand frozen cgroups is the kernel's existing swap machinery doing its job.

The tradeoff we accepted is real and we hold it consciously: a shared kernel is a weaker isolation boundary than a hypervisor. It's bounded for this platform — every app runs in its own namespace under ResourceQuota and LimitRange, nothing runs privileged, and the workloads are agent-generated application code operating under the platform's own contracts, not arbitrary uploaded binaries. And it's revisable without rearchitecting: Kubernetes' RuntimeClass makes hardened sandboxes (gVisor-style) a per-workload swap, not a platform rewrite. Containers it was. The question became — what does "idle" mean for a container?

Experiment 1: scale to zero (KEDA)

The first answer was the orthodox one. Idle means zero replicas. We deployed KEDA with its HTTP add-on: an HTTPScaledObject per project, the interceptor in the request path, scale to zero after a cooldown, scale back to one on the next HTTP request.

apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
spec:
  hosts:
    - {projectId}.{previewDomain}
  scaleTargetRef:
    name: app
    kind: Deployment
    service: web
    port: 80
  replicas:
    min: 0
    max: 1
  scaledownPeriod: 300

It worked, in the sense that idle apps consumed nothing. And it taught us the constraint the hard way: scale-to-zero conflates "stop costing" with "stop existing." When the replica count hits zero the pod is gone, so wake is a full cold start — schedule the pod, pull-check the image, mount the volumes, start Node, let Vite walk the dependency graph. For our workload that was 10–30 seconds per wake. Every single time. The database equivalent of this problem got its own Rust proxy and a mid-connection wake dance; for the app pod there is no wire protocol to park — the user is staring at a browser tab.

The secondary frictions were instructive too:

  • The autoscaler fights manual lifecycle. When a user explicitly clicked Resume, we'd scale the Deployment 0→1 — and KEDA, observing no traffic yet (the user hadn't loaded the preview, because it wasn't up), would immediately scale it back to 0. The fix was choreography: delete the HTTPScaledObject before scaling up, wait out a grace period while the interceptor registers traffic, then re-create it. Any design where you must temporarily disable the system so the system doesn't undo you is a design filing a complaint about itself.
  • The interceptor sits in the data path. Every request to every preview — including the hot, awake ones — now flowed through an extra proxy hop that existed only for the rare cold case.
  • Wake state leaked into the frontend. Keepalive pings are disabled for a sleeping project (that's the point), so the UI had to special-case polling for the health transition after a resume, and refresh stale preview tokens minted against the previous incarnation of the pod.
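The resume choreography from the first bullet can be sketched as an ordered sequence of cluster calls. This is a minimal illustration, not the platform's code: the `Cluster` trait, its method names, and the `Recorder` test double are all hypothetical stand-ins for a real Kubernetes client.

```rust
use std::time::Duration;

/// The cluster calls the manual-resume path needs.
/// Hypothetical trait standing in for a real Kubernetes client.
trait Cluster {
    fn delete_http_scaled_object(&mut self, project: &str);
    fn scale_deployment(&mut self, project: &str, replicas: u32);
    fn wait(&mut self, grace: Duration);
    fn create_http_scaled_object(&mut self, project: &str);
}

/// Resume without the autoscaler undoing it: remove the scaler first,
/// scale up, give the interceptor time to see traffic, then put it back.
fn manual_resume(cluster: &mut dyn Cluster, project: &str, grace: Duration) {
    cluster.delete_http_scaled_object(project);
    cluster.scale_deployment(project, 1);
    cluster.wait(grace);
    cluster.create_http_scaled_object(project);
}

/// Test double that records the order of calls instead of making them.
#[derive(Default)]
struct Recorder {
    calls: Vec<String>,
}

impl Cluster for Recorder {
    fn delete_http_scaled_object(&mut self, project: &str) {
        self.calls.push(format!("delete-scaler {project}"));
    }
    fn scale_deployment(&mut self, project: &str, replicas: u32) {
        self.calls.push(format!("scale {project} {replicas}"));
    }
    fn wait(&mut self, grace: Duration) {
        self.calls.push(format!("wait {}s", grace.as_secs()));
    }
    fn create_http_scaled_object(&mut self, project: &str) {
        self.calls.push(format!("create-scaler {project}"));
    }
}
```

The ordering is the whole point: if the scaler is re-created before any traffic has registered, it sees an idle target and scales it straight back to zero.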

The verdict wasn't "KEDA is bad" — it's good at what it's for. The verdict was that pod deletion is the wrong primitive for a workload whose warm state (a JIT-warmed Node process, Vite's module graph) is expensive to rebuild and free to keep. We needed the process to survive its own hibernation.

Experiment 2: checkpoint/restore (CRIU)

If the process must survive, the next idea is obvious to anyone who has read about live migration: CRIU — Checkpoint/Restore In Userspace. Snapshot the entire process tree to disk (memory pages, file descriptors, TCP connections, the works), kill the container, and later restore it exactly where it was. Idle cost: literally zero, not even swap. Wake: restore from a local image, single-digit seconds. On paper, the perfect answer.

We built it for real: a privileged Rust DaemonSet (criu-agent) on every node, shelling out to runc against k3s's containerd state — nsenter'd into the host's namespaces, because runc needs the host's cgroup hierarchy and the real /run:

// criu-agent: checkpoint a project container via the host's runc
let (stdout, stderr, ok) = run_host_cmd(
    "runc",
    &[
        "--root", "/run/k3s/containerd/runc/k8s.io",
        "checkpoint",
        "--image-path", &image_dir,
        "--tcp-established",   // snapshot live TCP, kubelet's API can't
        "--leave-running",
        "--ext-unix-sk",
        "--file-locks",
        &container_id,
    ],
    timeout,
).await;

(We went under the kubelet's own checkpoint API deliberately — it supports checkpoint but not the flags we needed, and it has no restore at all. That asymmetry should have been the first hint.)

Checkpointing worked. Images landed on disk, sizes were reasonable, the flags handled the dev server's open sockets and file locks.

Restore is where the architecture collapsed — not on a bug, but on a structural mismatch. runc restore resurrects the process tree, but it resurrects it underneath Kubernetes' model of the world, without telling Kubernetes. The kubelet believes the container it manages is the one its containerd shim is supervising; a restored process is a changeling the shim has no relationship with. Probes, lifecycle hooks, kubectl exec, resource accounting, eviction — every control-plane feature assumes the kubelet owns the process lifecycle, and after a restore it simply doesn't. You haven't restored a container; you've restored a process wearing the container's clothes, and the platform's entire management layer now disagrees with reality. Upstream Kubernetes' own checkpoint support stops at "write a forensic image" for exactly this reason — restore-in-place is the unsolved half.

We could have fought it — there are people heroically wiring CRIU restore into container runtimes — but a platform that hosts thousands of agent-generated, arbitrarily weird apps cannot sit on a foundation where pause/resume, the single most-executed lifecycle operation, runs against the grain of the orchestrator. CRIU lost not because it couldn't freeze a process, but because Kubernetes couldn't be told about it.

One artifact of the experiment survived, though, and it mattered: the per-node privileged Rust DaemonSet with a tiny HTTP API turned out to be exactly the right shape for node-local lifecycle surgery. We kept the chassis and replaced the engine.

The design that survived: cgroup v2 freezer + swap reclaim

The synthesis question, after two eliminations, writes itself. KEDA failed because the pod stopped existing. CRIU failed because the process stopped existing and then came back without Kubernetes knowing. So: what if nothing stops existing?

The kernel has had the answer all along — and this is where the containers-over-VMs decision pays its dividend. Every container is a cgroup, and cgroup v2 gives you two files:

  • cgroup.freeze: write 1 and the kernel freezes every task in the group. This is not SIGSTOP, which a parent process can observe through SIGCHLD or waitpid and whose matching SIGCONT the app itself can catch; it is a freezer deep-freeze that delivers no signals to the processes inside. Zero CPU, unconditionally, for the entire tree.
  • memory.reclaim: ask the kernel to reclaim a cgroup's pages proactively, pushing anonymous memory out to swap, without killing or even waking anything.

Put together, "pause" becomes four file operations (three writes and one read), with no pod deletion, no image, no restore, no interceptor:

// freeze-agent: the entire pause operation
fs::write(cgroup.join("memory.swap.max"), b"max")?;   // 1. allow swap
let ram_before = read_memory_current(&cgroup)?;        // 2. for the books
fs::write(cgroup.join("cgroup.freeze"), b"1")?;        // 3. stop the world
let _ = fs::write(pod_cgroup.join("memory.reclaim"),   // 4. RAM → swap (best-effort:
                  b"999999999999");                     //    EAGAIN on a partial reclaim is expected)
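The snippet calls read_memory_current without showing it. A minimal sketch of such a helper, assuming it simply parses the cgroup's memory.current file; the name read_memory_current comes from the snippet, while read_cgroup_u64 is a hypothetical addition:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read a single-integer cgroup v2 file such as `memory.current`.
/// Hypothetical helper; the original snippet only shows the call site.
fn read_cgroup_u64(cgroup: &Path, file: &str) -> io::Result<u64> {
    let raw = fs::read_to_string(cgroup.join(file))?;
    raw.trim()
        .parse::<u64>()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Bytes of memory currently charged to the cgroup.
fn read_memory_current(cgroup: &Path) -> io::Result<u64> {
    read_cgroup_u64(cgroup, "memory.current")
}
```

Recording this value before the freeze and reading it again after the reclaim gives a per-app measure of how much RAM the pause actually returned to the node.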

Frozen, the app's footprint is zero CPU and a few megabytes of unevictable kernel memory; the working set sits in swap, which is disk, which is the cheapest thing on the node. Thaw is the inverse and it is near-instant: write 0 to cgroup.freeze, restore memory.swap.max, and the process resumes mid-instruction. Pages fault back in from swap lazily, on the access pattern of the actual user — the Vite module graph, the JIT state, every open socket and timer: all still there, because the process never died. The 10–30 second rebuild that scale-to-zero charged on every wake became a page-in cost amortized over the first few interactions.
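The thaw described above can be sketched the same way as the pause. This is an illustration under assumptions: the function name and the idea of passing the saved limit in as a string are hypothetical, and the write order simply follows the sentence above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Undo a freeze: let the tasks run again, then restore the swap limit
/// recorded at freeze time. `original_swap_max` is whatever the agent read
/// from `memory.swap.max` before it wrote "max" (where that value is
/// stored is an assumption here).
fn thaw(cgroup: &Path, original_swap_max: &str) -> io::Result<()> {
    fs::write(cgroup.join("cgroup.freeze"), b"0")?;
    fs::write(cgroup.join("memory.swap.max"), original_swap_max.as_bytes())?;
    Ok(())
}
```

Nothing in this function pages memory back in. That work happens lazily, as the thawed process touches its pages.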

And — the part CRIU couldn't give us — Kubernetes never notices. The pod is Running the whole time. The kubelet's shim still owns the process. No control-plane feature breaks, because from the orchestrator's point of view nothing happened. We froze the app underneath Kubernetes' model instead of against it. (Liveness probes would notice, of course — generated app pods simply don't carry them; the platform's own idle sweep and health layer decide what "alive" means for a frozen app.)

The rest of the system fell out in a few weeks of integration:

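[Diagram: the freeze/thaw flow. The idle sweep (no keepalive for N min) sends POST /freeze to the freeze-agent (DaemonSet, Rust), which applies cgroup.freeze=1 and memory.reclaim to the container cgroup in the project namespace. A user hitting a preview reaches nginx ingress (custom-http-errors), then the default backend and the backend wake handler (styled 'waking…' page), which sends POST /thaw; the adorable/frozen label is removed.]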

  • Idle detection stayed boring on purpose: the frontend sends a keepalive ping while anyone is viewing a project; the sweep freezes whatever hasn't pinged in the timeout. No traffic interception anywhere.
  • Wake-on-HTTP runs through nginx ingress's error path rather than an inline proxy: a request to a frozen app hits custom-http-errors → default backend → a wake handler that thaws the namespace and serves a styled "waking up" page that reloads into the live app. The hot path has zero added hops; only the cold path pays.
  • The database joins the same rhythm. A frozen app's Postgres compute scales to zero independently and wakes mid-connection. The freeze state is published as a namespace label (adorable/frozen=true) so the DB proxy can read it from the K8s API directly — no shared Redis state between the two systems, and the proxy's wake budget accounts for "the whole app is thawing, not just my compute."
  • KEDA was switched off. The freezer covered everything it did — idle detection, zero idle cost, wake on request — with a faster wake and two fewer moving parts in the request path.
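The idle sweep from the first bullet reduces to a timestamp table and a cutoff. A minimal sketch, assuming an in-memory map; the type and method names are illustrative, and the real platform may keep this state elsewhere:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Last keepalive seen per project. Names here are illustrative.
struct KeepaliveTable {
    last_ping: HashMap<String, Instant>,
}

impl KeepaliveTable {
    fn new() -> Self {
        Self { last_ping: HashMap::new() }
    }

    /// Called whenever the frontend pings for a project someone is viewing.
    fn record_ping(&mut self, project_id: &str, now: Instant) {
        self.last_ping.insert(project_id.to_string(), now);
    }

    /// Projects whose last ping is at least `timeout` old: the freeze
    /// candidates for this sweep. Sorted so the output is stable.
    fn idle_projects(&self, now: Instant, timeout: Duration) -> Vec<String> {
        let mut idle = Vec::new();
        for (id, seen) in &self.last_ping {
            if now.duration_since(*seen) >= timeout {
                idle.push(id.clone());
            }
        }
        idle.sort();
        idle
    }
}
```

Passing `now` in as a parameter keeps the sweep deterministic and easy to test; a periodic task would call `idle_projects(Instant::now(), timeout)` and send each result to the freeze path.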

What the freezer demanded in return

No primitive is free, and honesty about the third design is what makes the first two eliminations meaningful. The freezer's costs are real — they're just the right shape of cost: correctness obligations in our control plane, not seconds on the user's clock.

Freeze/thaw is a multi-system transaction, and we had to treat it like one. A freeze touches the cgroup, the ingress annotations, Redis state, the namespace label, and the DB proxy's view of the world. Partially applied, the combinations are nasty — the early implementation could leave pods frozen while the ingress still routed traffic at them if a later step failed, so freeze gained a rollback chain (unwind the cgroup write if the routing flip fails). The agent itself learned the same lesson in miniature: a second freeze request arriving for an already-frozen container must be a no-op that returns the original metadata — the naïve version overwrote its memory of the container's original memory.swap.max with the max it had itself written, so the eventual thaw "restored" swap to permanently-on. Idempotency isn't a nicety in lifecycle code; it's the difference between state and corruption.
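Both lessons, the rollback and the idempotent repeat, fit in one small function. The following is a sketch under assumptions, not the agent's actual code: `FreezeAgent`, `FreezeRecord`, and the in-memory map are hypothetical, and only the two cgroup writes are modeled.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// What the agent must remember to undo a freeze. Field names are illustrative.
#[derive(Clone, Debug, PartialEq)]
struct FreezeRecord {
    original_swap_max: String,
}

/// Hypothetical agent state: container id -> record taken at the first freeze.
#[derive(Default)]
struct FreezeAgent {
    frozen: HashMap<String, FreezeRecord>,
}

impl FreezeAgent {
    /// Freeze a container. A repeat call returns the record from the first
    /// freeze untouched, so the saved swap limit is never overwritten by
    /// the "max" this function wrote itself.
    fn freeze(&mut self, id: &str, cgroup: &Path) -> io::Result<FreezeRecord> {
        if let Some(existing) = self.frozen.get(id) {
            return Ok(existing.clone());
        }
        let swap_file = cgroup.join("memory.swap.max");
        let original = fs::read_to_string(&swap_file)?.trim().to_string();
        fs::write(&swap_file, b"max")?;
        if let Err(e) = fs::write(cgroup.join("cgroup.freeze"), b"1") {
            // Roll back the one write that already landed before reporting failure.
            let _ = fs::write(&swap_file, original.as_bytes());
            return Err(e);
        }
        let record = FreezeRecord { original_swap_max: original };
        self.frozen.insert(id.to_string(), record.clone());
        Ok(record)
    }
}
```

The early return is what prevents the corruption described above: the second call never re-reads memory.swap.max, so it cannot mistake its own "max" for the original setting.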

State that lives in two places must be cross-verified. "Is this project frozen?" is answered by Redis (fast) and by the freeze-agent's view of the actual cgroup (true). After a node or backend restart these can disagree, so the platform treats the cgroup as ground truth and heals the cache against it rather than trusting either side alone.
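One way to read the ground truth is the `frozen` key in the cgroup's cgroup.events file, which the kernel sets to 1 once the freeze has completed. A minimal sketch of the check and the healing rule; the function names are hypothetical, and the cache is reduced to an `Option<bool>` for illustration:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Parse the `frozen` key out of the contents of a cgroup v2 `cgroup.events` file.
fn parse_frozen(events: &str) -> Option<bool> {
    events.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("frozen"), Some("1")) => Some(true),
            (Some("frozen"), Some("0")) => Some(false),
            _ => None,
        }
    })
}

/// Ask the kernel whether the cgroup is actually frozen.
fn cgroup_is_frozen(cgroup: &Path) -> io::Result<bool> {
    let events = fs::read_to_string(cgroup.join("cgroup.events"))?;
    parse_frozen(&events)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "no frozen key"))
}

/// The cgroup wins. Returns the value the cache should hold and whether
/// the cache had to be corrected to get there.
fn reconcile(cached: Option<bool>, actual: bool) -> (bool, bool) {
    (actual, cached != Some(actual))
}
```

A missing cache entry counts as a disagreement here, so a restart that wipes the cache heals on the first check rather than serving a guess.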

And the operations needed one owner. Freeze, thaw, pause (user-initiated stop), unpause, start, rebuild: six subsystems, each growing its own calls into this machinery, were how the early bugs got in. All of it now routes through a single ProjectLifecycle layer that owns transition ordering, holds the per-project lock, and is the only caller allowed to touch the low-level operations. The freezer is four file operations; the system around the freezer is an exercise in making four file operations transactional across a distributed platform.
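The shape of such a layer can be sketched as a transition table behind a per-project lock. Everything below is an assumption for illustration: only the name ProjectLifecycle comes from the text, the set of legal transitions is invented, and just four of the six operations are modeled.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone, Copy, Debug, PartialEq)]
enum State { Running, Frozen, Stopped }

#[derive(Clone, Copy, Debug)]
enum Op { Freeze, Thaw, Pause, Unpause }

/// Transition table. Which transitions are legal is an assumption.
fn next_state(state: State, op: Op) -> Result<State, String> {
    match (state, op) {
        (State::Running, Op::Freeze) => Ok(State::Frozen),
        (State::Frozen, Op::Freeze) => Ok(State::Frozen), // repeat freeze is a no-op
        (State::Frozen, Op::Thaw) => Ok(State::Running),
        (State::Running, Op::Thaw) => Ok(State::Running), // repeat thaw is a no-op
        (State::Running, Op::Pause) | (State::Frozen, Op::Pause) => Ok(State::Stopped),
        (State::Stopped, Op::Unpause) => Ok(State::Running),
        (s, o) => Err(format!("{:?} is not valid from {:?}", o, s)),
    }
}

/// Single owner of lifecycle transitions, with one lock per project.
#[derive(Default)]
struct ProjectLifecycle {
    projects: Mutex<HashMap<String, Arc<Mutex<State>>>>,
}

impl ProjectLifecycle {
    /// Fetch (or create) the lock for a project. New projects start Running.
    fn slot(&self, id: &str) -> Arc<Mutex<State>> {
        let mut map = self.projects.lock().unwrap();
        map.entry(id.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(State::Running)))
            .clone()
    }

    /// Apply one operation while holding the project's lock throughout.
    fn apply(&self, id: &str, op: Op) -> Result<State, String> {
        let slot = self.slot(id);
        let mut state = slot.lock().unwrap();
        let next = next_state(*state, op)?;
        // The low-level calls (cgroup writes, routing flip, label update)
        // would run here, before the new state is committed.
        *state = next;
        Ok(next)
    }
}
```

Because the map lock is released before the per-project lock is taken, a slow transition on one project does not block transitions on any other.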

The ladder, and the economics

Where it all landed is a lifecycle ladder, each rung cheaper and slower to leave than the last:

State      | CPU  | RAM           | Wake cost                       | Trigger
Running    | real | real          | n/a                             | someone's watching
Frozen     | 0    | ~0 (swapped)  | page-ins, sub-second resume     | idle timeout
Stopped    | 0    | 0             | full cold start                 | user pressed stop
DB asleep  | 0    | 0             | single-digit s, mid-connection  | independent idle sweep

The default resting state of every app on the platform is frozen, database asleep: effectively free to keep forever, fast enough to wake that "keep it running, just in case" is the default instead of a monthly bill. The inverse — an app whose users won't tolerate even a page-in warm-up — is a deliberate, priced choice (keep-warm), not an accident of architecture.

One fork in the road and three experiments, one conclusion — and it's the one we keep re-learning at every layer of this platform: don't destroy warm state and rebuild it — make it cost nothing and keep it. The database does it with copy-on-write branches and a proxy that wakes compute mid-connection. The apps do it with a kernel freezer and a swap file. The expensive thing was never the pause. It was ever having to come back from it.
