
Sixty containers on one server
One bare-metal box runs dozens to hundreds of Hoody containers. KSM (kernel samepage merging) and Btrfs deduplication keep the marginal cost of each additional container near zero.
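If you want to see the sharing for yourself, the kernel already counts it. A minimal sketch, assuming a Linux host with KSM enabled and 4 KiB pages, that reads the standard counters under /sys/kernel/mm/ksm/ and turns them into a savings figure:

```python
# Rough estimate of memory KSM is saving across the containers on this host.
# Reads the counters the kernel exposes under /sys/kernel/mm/ksm/ (Linux sysfs).
from pathlib import Path

KSM = Path("/sys/kernel/mm/ksm")
PAGE_SIZE = 4096  # assumes 4 KiB pages


def counter(name: str) -> int:
    return int((KSM / name).read_text())


pages_shared = counter("pages_shared")    # canonical copies kept after merging
pages_sharing = counter("pages_sharing")  # extra references to those copies, i.e. pages saved

print(f"{pages_shared} canonical pages back {pages_sharing} duplicates")
print(f"~{pages_sharing * PAGE_SIZE / 2**20:.0f} MiB not spent on duplicate memory")
```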
PagerDuty wakes you. You don't get up. Open the bookmark for the production terminal. PATCH the snapshot from before the bad deploy. Production is back. No bastion, no VPN, no laptop.
On-call is a triage job, not a debugging job. The terminal URL gets you in. The snapshot PATCH gets you out. The morning is for the actual fix.
Alert arrives. Phone screen on, bed light off.
Open the terminal-1 URL. tail the log. Spot the env-var change from the 11pm deploy.
PATCH /containers/[id]/snapshots/pre-deploy-2255. The container reverts.
Error rate falls back to the baseline. Channel update sent. Lights off.
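That one request is the whole runbook. Here is a sketch of it as a script instead of a phone tap: the /containers/[id]/snapshots/... path is the one from the timeline above, while the host name, the bearer token, and the request body are assumptions, not the documented Hoody API.

```python
# Revert a container to a named snapshot with a single PATCH, as in the timeline above.
# Hypothetical host, token, and body shape; only the URL path comes from the example.
import os

import requests

BASE = "https://hoody.example.com"   # assumed API host
CONTAINER_ID = "prod-api-1"          # assumed container id
SNAPSHOT = "pre-deploy-2255"         # the snapshot taken before the bad deploy

resp = requests.patch(
    f"{BASE}/containers/{CONTAINER_ID}/snapshots/{SNAPSHOT}",
    headers={"Authorization": f"Bearer {os.environ['HOODY_TOKEN']}"},
    json={"action": "restore"},      # assumed body: revert the container to this snapshot
    timeout=30,
)
resp.raise_for_status()
print("container reverted to", SNAPSHOT)
```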
Edit-on-phone is hell, so the lazy fix is the right fix. Restore the container to the snapshot you took before the bad deploy. The 11am post-mortem can decide what to actually change.
The same window, embedded in your phone browser: baseline, deploy, spike, restore, flat. Twenty-eight seconds for the restore to complete.
At 03:47 you don't fix bugs. You fix availability.
An on-call shift isn't a debugging session. It's a triage session. Snapshots make the triage instantaneous, so the actual debugging happens at 11am, by humans who slept.
Most on-call rituals are scar tissue from infrastructure that wasn't browsable on a phone. The HTTPS URL plus a snapshot PATCH replaces a stack of them.
You opened a URL on your phone and fixed production.