
Sixty containers on one server
One bare-metal box runs dozens to hundreds of Hoody containers. KSM and BTRFS dedup make the marginal cost near zero.
Your product runs hundreds of agent sessions a day. Each one writes its transcript to a SQLite URL. At 6am, a cron entry POSTs to a supervisor agent with one prompt: read yesterday's transcripts, score them, flag the worst three. By the time you sit down, the report card is already open.
the cron job is the supervisor · the supervisor is also an agent
A single 5-field cron entry POSTs to the agent service with a prompt. The supervisor container wakes, reads yesterday's SQLite traces, writes its grades back to the same database, and exits. There is no orchestrator, no rubric service, no eval pipeline.
# POST /api/v1/cron/users/me/entries
{
  "schedule": "0 6 * * *",
  "command": "curl -X POST $AGENT/api/v1/agent/tasks -d @grade.json",
  "comment": "nightly-supervisor"
}
# grade.json — the supervisor's instructions
{
  "description": "Read yesterday's transcripts from /sqlite/sessions WHERE day = '2026-05-03'. Sample 50. Score each on factuality, tool correctness, tone drift. Write findings to the report table. Flag the worst three for human review.",
  "mode": "code"
}
The cron line decides WHEN. The prompt decides WHAT. The supervisor container does the work in ~20 minutes overnight and then disappears. The graded sample is on disk by the time anyone is at their desk.
AgentOps screens show you logs. LangSmith rubrics give you scores. A graded supervisor closes the loop — it reads the transcripts, decides what is bad, and writes the verdict.
Not just metrics. The supervisor opens each session, reads tool calls, checks ground truth, weighs tone. A spreadsheet rubric counts; an agent supervisor judges.
Out of 400 runs, 397 are fine. The supervisor's job is to surface the three that aren't — by name, with a one-line note. You don't scroll a dashboard, you read four lines.
Every grade and every note lands in the same SQLite URL the agents use. Tomorrow's supervisor compares. Drift becomes a query, not a vibe.
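Because every grade lands in the same database, drift really is one GROUP BY away. A minimal sketch, assuming a hypothetical report table with day, session_id, score, and note columns (the real schema is whatever the supervisor's prompt defines):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema: one row per graded session.
db.execute("CREATE TABLE report (day TEXT, session_id INTEGER, score REAL, note TEXT)")
db.executemany(
    "INSERT INTO report VALUES (?, ?, ?, ?)",
    [("2026-05-02", i, 0.92, "ok") for i in range(50)]
    + [("2026-05-03", i, 0.81, "tone drift") for i in range(50)],
)

# Drift is a query: average score per day, oldest first.
trend = db.execute(
    "SELECT day, ROUND(AVG(score), 2) FROM report GROUP BY day ORDER BY day"
).fetchall()
print(trend)  # → [('2026-05-02', 0.92), ('2026-05-03', 0.81)]
```

Two supervisors that never ran at the same time still end up comparable, because they graded into the same table.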
Three things happen between 6:00am and 6:21am. None of them require you.
The supervisor agent queries the same SQLite URL the workers wrote to. SELECT * FROM sessions WHERE day = yesterday. Sample 50 at random.
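The sampling step needs no application code: SQLite can randomize and limit in one statement. A sketch against a hypothetical sessions(day, transcript) table standing in for the workers' trace store:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical worker-side table: one row per agent session.
db.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, day TEXT, transcript TEXT)")
db.executemany(
    "INSERT INTO sessions (day, transcript) VALUES (?, ?)",
    [("2026-05-03", f"transcript {i}") for i in range(400)],
)

# Yesterday's sessions, 50 at random, in a single query.
sample = db.execute(
    "SELECT id, transcript FROM sessions WHERE day = ? ORDER BY RANDOM() LIMIT 50",
    ("2026-05-03",),
).fetchall()
print(len(sample))  # → 50
```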
Per session: factuality, tool-call correctness, tone drift, hallucination count. Letter grade + one-line reason. Cost: a single agent task.
INSERT into the report table. Mark the worst three for human review. The page at /grades/[date] is just a SELECT on that table.
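The write-back is just as small. A sketch with hypothetical report columns (the actual rubric fields are whatever the supervisor's prompt names); flagging the worst three is an UPDATE over the lowest scores, and the page at /grades/[date] is the final SELECT:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical report table; flagged marks sessions queued for human review.
db.execute(
    "CREATE TABLE report (day TEXT, session_id INTEGER, grade TEXT,"
    " score REAL, note TEXT, flagged INTEGER DEFAULT 0)"
)

# Pretend the supervisor already scored its 50 sampled sessions.
graded = [("2026-05-03", i, "B", 0.5 + i * 0.01, "ok") for i in range(50)]
db.executemany(
    "INSERT INTO report (day, session_id, grade, score, note) VALUES (?, ?, ?, ?, ?)",
    graded,
)

# Flag the worst three for human review.
db.execute(
    "UPDATE report SET flagged = 1 WHERE session_id IN "
    "(SELECT session_id FROM report WHERE day = ? ORDER BY score ASC LIMIT 3)",
    ("2026-05-03",),
)

# The /grades/[date] page is just this SELECT.
worst = db.execute(
    "SELECT session_id, note FROM report WHERE day = ? AND flagged = 1 ORDER BY score",
    ("2026-05-03",),
).fetchall()
print(worst)  # → [(0, 'ok'), (1, 'ok'), (2, 'ok')]
```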
By 6:21am there is a graded sample on disk and three flagged transcripts queued. The grader doesn't watch the agents — it runs on a cadence and judges them, like a teacher reading homework overnight.
Numbers grounded in the cron + agent + SQLite surfaces. Not invented benchmarks.
Five fields decide when the supervisor wakes. Change the schedule, change the cadence — hourly, daily, on-demand. The line is the entire scheduler.
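The same entry at three cadences; only the schedule field changes. The expressions below are illustrative standard 5-field cron syntax, not values pulled from the product:

```
# minute  hour  day-of-month  month  day-of-week
"0 6 * * *"          # daily at 06:00 — the nightly report card
"0 * * * *"          # hourly, on the hour
"*/15 9-17 * * 1-5"  # every 15 minutes, business hours, weekdays
```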
A supervisor task that samples 50 sessions, reads each, and writes verdicts typically finishes inside 20 minutes. The container exits when the task does.
No Airflow, no eval service, no DAG scheduler. The cron entry is a row in /etc/crontab. The verdict is a row in SQLite. There is no third thing.
Standard 5-field cron expressions per Hoody Cron API. Supervisor session length depends on sample size and rubric complexity. SQLite is the same hoody-sqlite URL the worker agents already write to — no second store.
The cron job is the supervisor; the supervisor is also an agent.
The standard agent-quality stack: read-only dashboards, manual log review, and rubric tools that score but never act. The supervisor cron does all three in twenty minutes.
Stop reading logs at 11pm. Schedule an agent to do it overnight, and read its report card with your coffee.