use-cases / run-a-local-llm-serve-it-to-your-fleet / hero
PIPE · LOCAL LLM · FAN-OUT

Run a local LLM, serve it to your whole fleet

You're running a 70B model on a single GPU box. Fifty downstream containers across your fleet need the same answer for the same query — they're scoring the same catalog, generating the same embeddings, evaluating the same experiment. Don't pay for fifty inferences. Run the model once, broadcast the tokens.

Read the pipe API
use-cases / run-a-local-llm-serve-it-to-your-fleet / mechanism

One GPU, one pipe, fifty consumers

The naive answer is an HTTP server with a queue, request batching, and lock contention. The cheaper answer for this shape: each query goes onto a pipe path with ?n=50. The model runs once. Fifty consumer containers GET the same path and stream the same tokens at the same time, fanned out by the pipe. A slow worker applies backpressure to its own connection — the others stay at line speed.

fleet-broadcast.sh
# 1× GPU box — run the model once and pipe its tokens upward.
llama-cli -m llama3-70b.gguf -p "$PROMPT" \
  | curl -T - "https://pipe.hoody.com/api/v1/pipe/llm?n=50"

# 50 consumer containers — same path, ?n=50, fanned out by the pipe.
for i in $(seq 1 50); do
  curl -sN "https://pipe.hoody.com/api/v1/pipe/llm?n=50" \
    | ./score.py --worker "$i" &
done

# Sender blocks until 50 readers have connected, then bytes flow.
# Slow workers backpressure their own connection — others stay at line speed.

PUT sends bytes upward. GET pulls them downward. The ?n=50 parameter says how many readers to wait for; the pipe holds the connection until that many connect, then fans the stream out simultaneously to all of them. No queue, no batching layer, no inference-server-with-load-balancer.
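
A minimal sketch of the same handshake at toy scale, assuming nothing beyond the PUT/GET/?n semantics above: one producer, a throwaway path, two readers. The path name demo and the reader count of 2 are placeholders.

two-reader-demo.sh
# Terminal 1: producer. The PUT blocks until two readers are attached, then bytes flow.
echo "hello, fleet" | curl -T - "https://pipe.hoody.com/api/v1/pipe/demo?n=2"

# Terminals 2 and 3: consumers. Each GET receives the same bytes, at the same time.
curl -sN "https://pipe.hoody.com/api/v1/pipe/demo?n=2"
curl -sN "https://pipe.hoody.com/api/v1/pipe/demo?n=2"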

ONE INFERENCE

The model runs exactly once per query

Fifty downstream containers want the same answer; you generate it on the GPU once. The pipe handles delivery. No request-batching framework, no token caching layer, no "please don't run it again" coordination.
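
One way to keep that invariant without any coordination, sketched below: key the pipe path on the query itself. The hash-derived suffix is an illustration, not part of the pipe API; any naming scheme that gives each distinct query its own path does the job.

per-query-path.sh
# GPU box: derive a path suffix from the prompt (illustrative naming), generate once.
QUERY_ID=$(printf '%s' "$PROMPT" | sha256sum | cut -c1-16)
llama-cli -m llama3-70b.gguf -p "$PROMPT" \
  | curl -T - "https://pipe.hoody.com/api/v1/pipe/llm-$QUERY_ID?n=50"

# Consumers: derive the same suffix from the same prompt. No registry, no dedupe layer.
curl -sN "https://pipe.hoody.com/api/v1/pipe/llm-$QUERY_ID?n=50" | ./score.py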

PIPE FAN-OUT

?n=50 fans the same bytes to fifty readers

The pipe blocks until fifty receivers connect, then streams the producer's bytes to each one in parallel. Identical copies, line-rate delivery, zero server-side storage. Up to 256 receivers per path.
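
If you want to convince yourself the copies are byte-identical, a sketch at a smaller reader count: have every reader checksum its stream. The path name verify, the tokens.txt input, and ?n=3 are placeholders.

verify-fanout.sh
# Readers first: each backgrounds a GET and hashes its own copy.
for i in 1 2 3; do
  curl -sN "https://pipe.hoody.com/api/v1/pipe/verify?n=3" | sha256sum &
done

# Producer: the upload starts once all three readers are attached.
curl -T tokens.txt "https://pipe.hoody.com/api/v1/pipe/verify?n=3"
wait
# Three identical digests means three identical byte streams.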

PER-RECEIVER BACKPRESSURE

Slow workers slow only themselves

If one consumer container is GC'ing or its disk is busy, its connection lags. The pipe applies backpressure to that receiver — the other 49 keep streaming at full speed. No head-of-line blocking, no queue depth tuning.
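
You can watch that behaviour with a deliberately throttled reader, sketched below. pv's -L rate limit stands in for a GC pause or a busy disk; pv has to be installed, and the path name, input file, and ?n=2 are placeholders.

slow-reader.sh
# Reader 1: throttled to 10 KB/s with pv, so its own connection lags.
curl -sN "https://pipe.hoody.com/api/v1/pipe/bp-demo?n=2" | pv -q -L 10k > slow.out &

# Reader 2: untouched, keeps draining at line speed regardless of reader 1.
curl -sN "https://pipe.hoody.com/api/v1/pipe/bp-demo?n=2" > fast.out &

# Producer: stream something big enough to see the gap open up.
curl -T big-output.txt "https://pipe.hoody.com/api/v1/pipe/bp-demo?n=2"
wait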

FAN-OUT CAP: 256
Per-path receiver ceiling enforced by the pipe — set ?n to wait for that many before the transfer begins.

INFERENCES PER QUERY: 1
The model runs once per query, not once per consumer. Compute cost is decoupled from fleet size.

SDK FOOTPRINT: 0 kb
Producer is curl. Consumers are curl. Anything that speaks HTTP can subscribe — container, agent, browser, shell.
use-cases / run-a-local-llm-serve-it-to-your-fleet / economics

What you stop paying for

When fifty containers want the same answer, the alternatives charge per call, per token, or per inference server. The pipe charges for one HTTP transfer. Run the model on a box you already rent.

BEFORE: Hosted API · per-token billing · 50× tokens
Bedrock or OpenAI bills you for fifty identical completions when fifty containers ask the same question. Same prompt, same answer, charged fifty times.

AFTER: Local model · pipe broadcast · 1× tokens
The GPU box you already rent generates once. The pipe carries the bytes to all fifty. The fleet scales horizontally without scaling the inference bill.

This isn't every workload — it's the shape where N containers want the same answer. When that's your shape, the pipe is the cheapest fan-out you'll wire up. Workloads with diverging prompts still want a real inference server; this pattern shines when the question is identical and the fleet is wide.

use-cases / run-a-local-llm-serve-it-to-your-fleet / punchline

One GPU, one pipe, fifty containers tasting the same tokens.

01 · ONE GPU GENERATES THE TOKENS
02 · ONE PIPE CARRIES THEM
03 · FIFTY CONTAINERS TASTE THEM AT ONCE
no inference fan-out service · the path is the broadcast
use-cases / run-a-local-llm-serve-it-to-your-fleet / replaces

What this replaces

Every "give my fleet access to a model" stack you reach for when one query needs to feed many consumers. Each one charges per call, hosts your weights, or asks you to run a load balancer in front of vLLM. The pipe broadcasts once.

  • AWS Lambda + Bedrock · Per-token billing × fleet size, weights you don't own
  • Modal Labs · Hosted GPU runners, per-second billing per worker
  • Replicate · Per-call pricing, network round-trip per consumer
  • OpenAI API at scale · Identical prompt billed once per consumer
  • vLLM/TGI behind a load balancer · Server, queue, batching tuning, ops surface to keep alive
  • Self-hosted model gateways · Routing, auth, rate limits — all DIY for one fan-out
use-cases / run-a-local-llm-serve-it-to-your-fleet / cta

Stop paying fifty inference bills for one answer. Run the model where you already rent the silicon. Open a pipe. Let the fleet read.

Read the pipe API
use-cases / run-a-local-llm-serve-it-to-your-fleet / related

Read the others