CRON · AGENT · SQLITE

An agent that grades yesterday's agents

Your product runs hundreds of agent sessions a day. Each one writes its transcript to a SQLite URL. At 6am, a cron entry POSTs to a supervisor agent with one prompt: read yesterday's transcripts, score them, flag the worst three. By the time you sit down, the report card is already open.

Read the agent docs

One cron line, one prompt, one verdict

A single 5-field cron entry POSTs to the agent service with a prompt. The supervisor container wakes, reads yesterday's SQLite traces, writes its grades back to the same database, and exits. There is no orchestrator, no rubric service, no eval pipeline.

POST /cron/users/me/entries
POST · scheduler
# POST /api/v1/cron/users/me/entries
{
  "schedule": "0 6 * * *",
  "command": "curl -X POST $AGENT/api/v1/agent/tasks -d @grade.json",
  "comment": "nightly-supervisor"
}
grade.json · supervisor prompt
POST · supervisor
# grade.json — the supervisor's instructions
{
  "description": "Read yesterday's transcripts from /sqlite/sessions WHERE day = '2026-05-03'. Sample 50. Score each on factuality, tool correctness, tone drift. Write findings to the report table. Flag the worst three for human review.",
  "mode": "code"
}

The cron line decides WHEN. The prompt decides WHAT. The supervisor container does the work in ~20 minutes overnight and then disappears. The graded sample is on disk by the time anyone is at their desk.
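The sample prompt above names a fixed day (2026-05-03), so a nightly run needs that date to move forward. One way to handle it, sketched here under assumptions, is to regenerate the task body before each POST. The function name `build_grade_task` and the `sample_size` parameter are hypothetical and not part of the agent API; only the `description` and `mode` keys come from the example above.

```python
import json
from datetime import date, timedelta


def build_grade_task(today: date, sample_size: int = 50) -> dict:
    """Build the supervisor task body with yesterday's date filled in.

    Hypothetical helper: it only mirrors the shape of grade.json shown above.
    """
    yesterday = (today - timedelta(days=1)).isoformat()
    description = (
        f"Read yesterday's transcripts from /sqlite/sessions WHERE day = '{yesterday}'. "
        f"Sample {sample_size}. Score each on factuality, tool correctness, tone drift. "
        "Write findings to the report table. Flag the worst three for human review."
    )
    return {"description": description, "mode": "code"}


# Print the body; redirect it to grade.json before the cron command runs.
print(json.dumps(build_grade_task(date.today()), indent=2))
```

Keeping the date out of the static file means the cron line itself never has to change.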


Three things a supervisor agent does that a dashboard can't

AgentOps screens show you logs. LangSmith rubrics give you scores. A grading supervisor closes the loop: it reads the transcripts, decides which ones are bad, and writes the verdict down.

READS

It actually reads the transcripts

Not just metrics. The supervisor opens each session, reads tool calls, checks ground truth, weighs tone. A spreadsheet rubric counts; an agent supervisor judges.

DECIDES

It picks the three you should see

Out of 400 runs, 397 are fine. The supervisor's job is to surface the three that aren't, by name, each with a one-line note. You don't scroll a dashboard; you read three lines.

WRITES

It writes findings back to SQLite

Every grade and every note lands in the same SQLite URL the agents use. Tomorrow's supervisor can compare its grades against today's. Drift becomes a query, not a vibe.
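As a rough illustration of drift as a query, the sketch below compares the average score of two days. The `report` table layout (`session_id`, `day`, `score`, `note`), the numeric 0 to 1 score, and the in-memory database are all assumptions made for the example; the real schema is whatever your supervisor writes.

```python
import sqlite3

# In-memory stand-in for the shared SQLite store (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (session_id TEXT, day TEXT, score REAL, note TEXT)")
conn.executemany(
    "INSERT INTO report VALUES (?, ?, ?, ?)",
    [
        ("a1", "2026-05-02", 0.90, "ok"),
        ("a2", "2026-05-02", 0.80, "ok"),
        ("b1", "2026-05-03", 0.70, "tone drift"),
        ("b2", "2026-05-03", 0.60, "wrong tool call"),
    ],
)


def daily_averages(conn: sqlite3.Connection) -> dict:
    """Return {day: average score} across all graded sessions."""
    rows = conn.execute("SELECT day, AVG(score) FROM report GROUP BY day ORDER BY day")
    return dict(rows.fetchall())


def drift(conn: sqlite3.Connection, day: str, previous_day: str) -> float:
    """Change in average score from previous_day to day (negative means worse)."""
    averages = daily_averages(conn)
    return averages[day] - averages[previous_day]


print(round(drift(conn, "2026-05-03", "2026-05-02"), 2))
```

A negative result means the later day graded worse on average.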


From transcripts to verdict in twenty minutes

Three things happen between 6:00am and 6:21am. None of them require you.

RUNS WHILE YOU SLEEP
/cron/0 6 * * * → agent/tasks → /grades/2026-05-03
READ

Open yesterday's transcripts

The supervisor agent queries the same SQLite URL the workers wrote to. SELECT * FROM sessions WHERE day = yesterday. Sample 50 at random.
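The READ step might look something like the sketch below. The `sessions` columns (`id`, `day`, `transcript`) and the function name `sample_sessions` are assumptions, and an in-memory database stands in for the shared store. `ORDER BY RANDOM() LIMIT ?` is standard SQLite and returns each row at most once.

```python
import sqlite3


def sample_sessions(conn: sqlite3.Connection, day: str, limit: int = 50) -> list:
    """Pick up to `limit` sessions from one day, uniformly at random."""
    return conn.execute(
        "SELECT id, transcript FROM sessions WHERE day = ? ORDER BY RANDOM() LIMIT ?",
        (day, limit),
    ).fetchall()


# Demo data: 400 runs for the graded day plus one older run (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, day TEXT, transcript TEXT)")
conn.executemany(
    "INSERT INTO sessions (day, transcript) VALUES (?, ?)",
    [("2026-05-03", f"transcript {i}") for i in range(400)]
    + [("2026-05-02", "older run")],
)

sample = sample_sessions(conn, "2026-05-03")
print(len(sample))
```

Passing the day as a bound parameter keeps the query reusable for any date the prompt names.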

SCORE

Grade against each rubric

Per session: factuality, tool-call correctness, tone drift, hallucination count. Letter grade + one-line reason. Cost: a single agent task.

FLAG

Write findings · flag the bottom three

INSERT into the report table. Mark the worst three for human review. The page at /grades/[date] is just a SELECT on that table.
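The FLAG step can be sketched as one insert plus one update. Everything named here is hypothetical: the `report` columns, the numeric `score` (lower is worse), and the helpers `write_report` and `flagged`. The source only says that findings go into a report table and the worst three get marked.

```python
import sqlite3


def write_report(conn: sqlite3.Connection, day: str, grades: list) -> None:
    """Insert (session_id, score, note) grades, then flag the three lowest scores."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS report ("
        "session_id TEXT, day TEXT, score REAL, note TEXT, flagged INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO report (session_id, day, score, note) VALUES (?, ?, ?, ?)",
        [(session_id, day, score, note) for session_id, score, note in grades],
    )
    conn.execute(
        "UPDATE report SET flagged = 1 WHERE day = ? AND session_id IN ("
        "SELECT session_id FROM report WHERE day = ? ORDER BY score ASC LIMIT 3)",
        (day, day),
    )
    conn.commit()


def flagged(conn: sqlite3.Connection, day: str) -> list:
    """Session ids flagged for human review, worst first."""
    rows = conn.execute(
        "SELECT session_id FROM report WHERE day = ? AND flagged = 1 ORDER BY score ASC",
        (day,),
    )
    return [row[0] for row in rows]
```

With this layout, a page for a given date only needs to select the rows for that day, as the step above describes.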

By 6:21am there is a graded sample on disk and three flagged transcripts queued. The grader doesn't watch the agents — it runs on a cadence and judges them, like a teacher reading homework overnight.


What the cadence buys you

Numbers grounded in the cron + agent + SQLite surfaces. Not invented benchmarks.

  1. ONE CRON LINE · 0 6 * * *

    Five fields decide when the supervisor wakes. Change the schedule and you change the cadence: hourly, daily, or weekly. The line is the entire scheduler.

  2. GRADE WINDOW · ~20 min

    A supervisor task that samples 50 sessions, reads each, and writes verdicts typically finishes inside 20 minutes. The container exits when the task does.

  3. ORCHESTRATOR DAEMONS · 0

    No Airflow, no eval service, no DAG scheduler. The cron entry is a row in /etc/crontab. The verdict is a row in SQLite. There is no third thing.

Standard 5-field cron expressions, per the Hoody Cron API. Supervisor session length depends on sample size and rubric complexity. SQLite is the same hoody-sqlite URL the worker agents already write to, so there is no second store.
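For readers less used to cron syntax, here is a small sketch that names the five fields of a standard expression. The helper `split_cron` is hypothetical; it only splits and labels the fields and does not validate ranges or talk to any scheduler.

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict:
    """Label the five whitespace-separated fields of a standard cron expression."""
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 fields, got {len(fields)}: {expression!r}")
    return dict(zip(FIELD_NAMES, fields))


daily = split_cron("0 6 * * *")   # 06:00 every day, as in the entry above
hourly = split_cron("0 * * * *")  # minute 0 of every hour
print(daily["hour"], hourly["hour"])
```

Moving from a daily to an hourly grade is a one-field change: the hour goes from `6` to `*`.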


The cron job is the supervisor; the supervisor is also an agent.

yesterday · running blind
today · graded by 6:21
WHAT THE OLD LOOP LOOKED LIKE
human reads logs · weekly meeting · post-hoc rubric in a sheet
noticed drift after a week · reviewed 0.5% of runs
WHAT IT LOOKS LIKE NOW

What this replaces

The standard agent-quality stack: read-only dashboards, manual log review, and rubric tools that score but never act. The supervisor cron does all three in twenty minutes.

  • human-only agent reviews: An engineer reading logs by hand · 0.5% sample · catches drift after a week
  • weekly-meeting agent retrospectives: The drift was already a week old by the time you discussed it
  • manual log inspection: grep, scroll, hope · no rubric, no score, no record
  • AgentOps quality dashboards (read-only): Charts you have to open · the verdict was never written down
  • LangSmith eval rubrics that don't act: Scores get computed · no one is paged · no one is told
  • post-hoc spreadsheet rubrics: A Google Sheet someone fills out on Friday · stale by Monday

Stop reading logs at 11pm. Schedule an agent to do it overnight, and read its report card with your coffee.
