use-cases / run-a-local-llm-serve-it-to-your-fleet / hero
PIPE · LOCAL LLM · FAN-OUT

Run a local LLM, serve it to your whole fleet

You're running a 70B model on a single GPU box. Fifty downstream containers across your fleet need the same answer for the same query — they're scoring the same catalog, generating the same embeddings, evaluating the same experiment. Don't pay for fifty inferences. Run the model once, broadcast the tokens.

Read the pipe API
use-cases / run-a-local-llm-serve-it-to-your-fleet / mechanism

One GPU, one pipe, fifty consumers

The naive answer is an HTTP server with a queue, request batching, and lock contention. The cheaper answer for this shape: each query goes onto a pipe path with ?n=50. The model runs once. Fifty consumer containers GET the same path and stream the same tokens at the same time, fanned out by the pipe. A slow worker applies backpressure to its own connection — the others stay at line speed.

fleet-broadcast.sh
# 1× GPU box — run the model once and pipe its tokens upward.
llama-cli -m llama3-70b.gguf -p "$PROMPT" \
  | curl -T - "https://pipe.hoody.com/api/v1/pipe/llm?n=50"

# 50 consumer containers — same path, ?n=50, fanned out by the pipe.
for i in $(seq 1 50); do
  curl -sN "https://pipe.hoody.com/api/v1/pipe/llm?n=50" \
    | ./score.py --worker "$i" &
done

# Sender blocks until 50 readers have connected, then bytes flow.
# Slow workers backpressure their own connection — others stay at line speed.

PUT sends bytes upward. GET pulls them downward. The ?n=50 parameter says how many readers to wait for; the pipe holds the connection until that many connect, then fans the stream out simultaneously to all of them. No queue, no batching layer, no inference-server-with-load-balancer.
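
A minimal sketch of the same handshake at toy scale, assuming nothing beyond the PUT/GET/?n semantics above: one producer, a throwaway path, two readers. The path name demo and the reader count of 2 are placeholders.

two-reader-demo.sh
# Terminal 1: producer. The PUT blocks until two readers are attached, then bytes flow.
echo "hello, fleet" | curl -T - "https://pipe.hoody.com/api/v1/pipe/demo?n=2"

# Terminals 2 and 3: consumers. Each GET receives the same bytes, at the same time.
curl -sN "https://pipe.hoody.com/api/v1/pipe/demo?n=2"
curl -sN "https://pipe.hoody.com/api/v1/pipe/demo?n=2"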

ONE INFERENCE

The model runs exactly once per query

Fifty downstream containers want the same answer; you generate it on the GPU once. The pipe handles delivery. No request-batching framework, no token caching layer, no "please don't run it again" coordination.
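
One way to keep that invariant without any coordination, sketched below: key the pipe path on the query itself. The hash-derived suffix is an illustration, not part of the pipe API; any naming scheme that gives each distinct query its own path does the job.

per-query-path.sh
# GPU box: derive a path suffix from the prompt (illustrative naming), generate once.
QUERY_ID=$(printf '%s' "$PROMPT" | sha256sum | cut -c1-16)
llama-cli -m llama3-70b.gguf -p "$PROMPT" \
  | curl -T - "https://pipe.hoody.com/api/v1/pipe/llm-$QUERY_ID?n=50"

# Consumers: derive the same suffix from the same prompt. No registry, no dedupe layer.
curl -sN "https://pipe.hoody.com/api/v1/pipe/llm-$QUERY_ID?n=50" | ./score.py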

PIPE FAN-OUT

?n=50 fans the same bytes to fifty readers

The pipe blocks until fifty receivers connect, then streams the producer's bytes to each one in parallel. Identical copies, line-rate delivery, zero server-side storage. Up to 256 receivers per path.
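
If you want to convince yourself the copies are byte-identical, a sketch at a smaller reader count: have every reader checksum its stream. The path name verify, the tokens.txt input, and ?n=3 are placeholders.

verify-fanout.sh
# Readers first: each backgrounds a GET and hashes its own copy.
for i in 1 2 3; do
  curl -sN "https://pipe.hoody.com/api/v1/pipe/verify?n=3" | sha256sum &
done

# Producer: the upload starts once all three readers are attached.
curl -T tokens.txt "https://pipe.hoody.com/api/v1/pipe/verify?n=3"
wait
# Three identical digests means three identical byte streams.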

PER-RECEIVER BACKPRESSURE

Slow workers slow only themselves

If one consumer container is GC'ing or its disk is busy, its connection lags. The pipe applies backpressure to that receiver — the other 49 keep streaming at full speed. No head-of-line blocking, no queue depth tuning.
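
You can watch that behaviour with a deliberately throttled reader, sketched below. pv's -L rate limit stands in for a GC pause or a busy disk; pv has to be installed, and the path name, input file, and ?n=2 are placeholders.

slow-reader.sh
# Reader 1: throttled to 10 KB/s with pv, so its own connection lags.
curl -sN "https://pipe.hoody.com/api/v1/pipe/bp-demo?n=2" | pv -q -L 10k > slow.out &

# Reader 2: untouched, keeps draining at line speed regardless of reader 1.
curl -sN "https://pipe.hoody.com/api/v1/pipe/bp-demo?n=2" > fast.out &

# Producer: stream something big enough to see the gap open up.
curl -T big-output.txt "https://pipe.hoody.com/api/v1/pipe/bp-demo?n=2"
wait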

FAN-OUT CAP: 256
Per-path receiver ceiling enforced by the pipe — set ?n to wait for that many before the transfer begins.

INFERENCES PER QUERY: 1
The model runs once per query, not once per consumer. Compute cost is decoupled from fleet size.

SDK FOOTPRINT: 0 kb
Producer is curl. Consumers are curl. Anything that speaks HTTP can subscribe — container, agent, browser, shell.
use-cases / run-a-local-llm-serve-it-to-your-fleet / economics

What you stop paying for

When fifty containers want the same answer, the alternatives charge per call, per token, or per inference server. The pipe charges for one HTTP transfer. Run the model on a box you already rent.

BEFORE: Hosted API · per-token billing · 50× tokens
Bedrock or OpenAI bills you for fifty identical completions when fifty containers ask the same question. Same prompt, same answer, charged fifty times.

AFTER: Local model · pipe broadcast · 1× tokens
The GPU box you already rent generates once. The pipe carries the bytes to all fifty. The fleet scales horizontally without scaling the inference bill.

This isn't every workload — it's the shape where N containers want the same answer. When that's your shape, the pipe is the cheapest fan-out you'll wire up. Workloads with diverging prompts still want a real inference server; this pattern shines when the question is identical and the fleet is wide.

use-cases / run-a-local-llm-serve-it-to-your-fleet / punchline

One GPU, one pipe, fifty containers tasting the same tokens.

01 · ONE GPU GENERATES THE TOKENS
02 · ONE PIPE CARRIES THEM
03 · FIFTY CONTAINERS TASTE THEM AT ONCE
no inference fan-out service · the path is the broadcast
use-cases / run-a-local-llm-serve-it-to-your-fleet / replaces

What this replaces

Every "give my fleet access to a model" stack you reach for when one query needs to feed many consumers. Each one charges per call, hosts your weights, or asks you to run a load balancer in front of vLLM. The pipe broadcasts once.

  • AWS Lambda + Bedrock · Per-token billing × fleet size, weights you don't own
  • Modal Labs · Hosted GPU runners, per-second billing per worker
  • Replicate · Per-call pricing, network round-trip per consumer
  • OpenAI API at scale · Identical prompt billed once per consumer
  • vLLM/TGI behind a load balancer · Server, queue, batching tuning, ops surface to keep alive
  • Self-hosted model gateways · Routing, auth, rate limits — all DIY for one fan-out
use-cases / run-a-local-llm-serve-it-to-your-fleet / cta

Stop paying fifty inference bills for one answer. Run the model where you already rent the silicon. Open a pipe. Let the fleet read.

Read the pipe API
use-cases / run-a-local-llm-serve-it-to-your-fleet / related

Read the others