
Sixty containers on one server
One bare-metal box runs dozens to hundreds of Hoody containers. KSM and BTRFS dedup make the marginal cost near zero.
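If you want to sanity-check the KSM half of that claim on a box you control, the kernel exposes its sharing counters in sysfs. A minimal look, assuming KSM is compiled in, you have root, and your container runtime marks memory as mergeable:
# Enable KSM and see how many pages it is actually merging.
echo 1 | sudo tee /sys/kernel/mm/ksm/run
cat /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing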
You're running a 70B model on a single GPU box. Fifty downstream containers across your fleet need the same answer for the same query — they're scoring the same catalog, generating the same embeddings, evaluating the same experiment. Don't pay for fifty inferences. Run the model once, broadcast the tokens.
[Diagram: one path, fifty readers. The model runs once, the pipe broadcasts, slow workers slow only themselves.]
The naive answer is an HTTP server with a queue, request batching, and lock contention. The cheaper answer for this shape: each query goes onto a pipe path with ?n=50. The model runs once. Fifty consumer containers GET the same path and stream the same tokens at the same time, fanned out by the pipe. A slow worker applies backpressure to its own connection — the others stay at line speed.
# 1× GPU box — run the model once and pipe its tokens upward.
llama.cpp -m llama3-70b.gguf -p "$PROMPT" --stream \
  | curl -T - "https://pipe.hoody.com/api/v1/pipe/llm?n=50"
# 50 consumer containers — same path, ?n=50, fanned out by the pipe.
for i in $(seq 1 50); do
  curl -sN "https://pipe.hoody.com/api/v1/pipe/llm?n=50" \
    | jq -c .delta \
    | ./score.py --worker "$i" &
done
# Sender blocks until 50 readers have connected, then bytes flow.
# Slow workers backpressure their own connection — others stay at line speed.

PUT sends bytes upward. GET pulls them downward. The ?n=50 parameter says how many readers to wait for; the pipe holds the connection until that many connect, then fans the stream out simultaneously to all of them. No queue, no batching layer, no inference-server-with-load-balancer.
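The examples reuse a single path name, llm. If you broadcast more than one query at a time, each needs its own path so streams don't collide. One way to get that, sketched below with a hypothetical llm-$QID naming convention that is not part of the pipe's documented API, is to derive the path segment from a hash of the prompt:
# Hypothetical per-query path: hash the prompt, append it to the path name.
QID=$(printf '%s' "$PROMPT" | sha256sum | cut -c1-16)
# Producer streams once to the derived path.
llama.cpp -m llama3-70b.gguf -p "$PROMPT" --stream \
  | curl -T - "https://pipe.hoody.com/api/v1/pipe/llm-$QID?n=50"
# Consumers compute the same QID and GET the same derived path.
curl -sN "https://pipe.hoody.com/api/v1/pipe/llm-$QID?n=50"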
Fifty downstream containers want the same answer; you generate it on the GPU once. The pipe handles delivery. No request-batching framework, no token-caching layer, no "please don't run it again" coordination.
The pipe blocks until fifty receivers connect, then streams the producer's bytes to each one in parallel. Identical copies, line-rate delivery, zero server-side storage. Up to 256 receivers per path.
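That blocking cuts both ways: if only 49 readers ever show up, the producer waits indefinitely. A blunt guard is to cap the whole transfer with a timeout. This is a sketch using standard coreutils; the 120-second budget is arbitrary and covers generation time as well as the wait, so size it generously.
# Abort the broadcast if the transfer hasn't completed within two minutes
# (this bounds both the wait for readers and the generation itself).
llama.cpp -m llama3-70b.gguf -p "$PROMPT" --stream \
  | timeout 120 curl -T - "https://pipe.hoody.com/api/v1/pipe/llm?n=50" \
  || echo "broadcast aborted: readers missing or transfer too slow" >&2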
If one consumer container is GC'ing or its disk is busy, its connection lags. The pipe applies backpressure to that receiver — the other 49 keep streaming at full speed. No head-of-line blocking, no queue-depth tuning.
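You can watch that property directly by throttling one reader and leaving the rest alone. A rough demonstration, assuming pv is installed (any rate-limited reader works):
# 49 readers at full speed; they finish at line rate.
for _ in $(seq 1 49); do
  curl -sN "https://pipe.hoody.com/api/v1/pipe/llm?n=50" > /dev/null &
done
# One deliberately slow reader, capped at 1 KB/s; only this stream trickles.
curl -sN "https://pipe.hoody.com/api/v1/pipe/llm?n=50" | pv -q -L 1k > /dev/null &
wait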
When fifty containers want the same answer, the alternatives charge per call, per token, or per inference server. The pipe charges for one HTTP transfer. Run the model on a box you already rent.
This isn't every workload — it's the shape where N containers want the same answer. When that's your shape, the pipe is the cheapest fan-out you'll wire up. Workloads with diverging prompts still want a real inference server; this pattern shines when the question is identical and the fleet is wide.
One GPU, one pipe, fifty containers tasting the same tokens.
Compare that with every "give my fleet access to a model" stack you might otherwise reach for when one query needs to feed many consumers: each one charges per call, hosts your weights, or asks you to run a load balancer in front of vLLM. The pipe broadcasts once.
Stop paying fifty inference bills for one answer. Run the model where you already rent the silicon. Open a pipe. Let the fleet read.