Act 3: Make it serve

Act 1 made one prompt produce coherent text. Act 2 made that text fast: a KV cache so decode doesn't redo work, a SIMD backend, a multithreaded backend, a Metal GPU backend, and Q8_0 weights that quartered the memory footprint. At the end of Act 2 you have a quick command-line tool: feed it a prompt, get a continuation.

A server is a different animal. Real users don't run a CLI; they POST JSON to an HTTP endpoint, expect tokens streamed back as they're generated, and they show up concurrently. Two users, ten users, a chat client opening three tabs. Act 3 is the jump from "fast CLI" to "real OpenAI-compatible server."

That sounds like it should be a single chapter: wrap the existing code in an HTTP handler and ship. For about a chapter and a half, it is. Then you try to serve a second request at the same time as the first, and the Act 2 design starts to creak.

Why concurrency changes everything

In the Act 2 world, memory and compute have exactly one owner. The KV cache is a Vec<Tensor> owned by a single generation loop. The decode loop runs at whatever pace one request needs and no faster. There is one request, ever.

Add a second concurrent request and three problems appear at once:

  • KV memory fragments. Request A's KV cache is one contiguous block sized for its sequence; request B allocates another. When A finishes and a longer request C arrives, C's cache may not fit in the gap A left behind. The Act 2 cache adds to the churn, because it reallocates and copies the whole history on every decode step. Contiguous allocation does not survive that.
  • Decode is wasteful one sequence at a time. A single-sequence decode step multiplies a matrix by a vector. The hardware (especially a GPU) is built to multiply matrices by matrices. Run one sequence and most of the chip idles; run several together as one batched pass and the same hardware handles all of them for barely more than the cost of one.
  • Shared prefixes get recomputed. Most requests to a chat server share a prefix: a long system prompt, the earlier turns of a conversation. The Act 2 engine re-encodes and re-prefills those tokens on every request, paying the same cost again and again.

Act 3 builds the server, then fixes each of these.
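
To make the second problem concrete, here is a toy sketch of why batching is nearly free. It is not the engine's kernel: Weights, matvec, and matmul_batched are hypothetical names, the math is plain f32 with no SIMD, and the "loads" counter simply counts how many weight elements each approach reads. The point it illustrates is that a batched pass reads each weight once and reuses it for every sequence, while one-at-a-time decode streams the whole matrix again per sequence.

```rust
/// Toy dense layer: `rows x cols` weights in row-major order (illustrative only).
struct Weights {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

/// One sequence at a time: y = W * x.
/// Returns the output and the number of weight elements read.
fn matvec(w: &Weights, x: &[f32]) -> (Vec<f32>, usize) {
    assert_eq!(x.len(), w.cols);
    let mut y = vec![0.0; w.rows];
    let mut loads = 0;
    for r in 0..w.rows {
        for c in 0..w.cols {
            y[r] += w.data[r * w.cols + c] * x[c];
            loads += 1;
        }
    }
    (y, loads)
}

/// Batched: each weight is read once and applied to every sequence in the batch.
/// Returns one output per sequence and the number of weight elements read.
fn matmul_batched(w: &Weights, xs: &[Vec<f32>]) -> (Vec<Vec<f32>>, usize) {
    let mut ys = vec![vec![0.0; w.rows]; xs.len()];
    let mut loads = 0;
    for r in 0..w.rows {
        for c in 0..w.cols {
            let wv = w.data[r * w.cols + c];
            loads += 1;
            for (y, x) in ys.iter_mut().zip(xs) {
                y[r] += wv * x[c];
            }
        }
    }
    (ys, loads)
}
```

With N sequences, calling matvec N times reads N times as many weight elements as one matmul_batched call, and both produce the same outputs. Single-sequence decode is typically limited by how fast weights can be streamed from memory, which is why sharing each read across the batch pays off.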

The path

Seven chapters climb from a chat-aware CLI to a multi-tenant server.

Figure: Act III at a glance. Each chapter adds one capability; later chapters depend on earlier ones.

  • III.1: Chat pipeline. A trained chat model expects a very specific prompt format: role markers, special tokens, a slot where the assistant's reply begins. We load that chat template from the GGUF metadata, render a list of messages through it, and wire the tokenizer and decoder into a reusable chat turn. Ships an interactive chat-repl binary.
  • III.2: HTTP API. An OpenAI-compatible HTTP server (chat-server) built on axum and tokio: GET /v1/models, GET /health, and non-streaming POST /v1/chat/completions. We define the request and response JSON exactly as OpenAI's API does, so existing clients just work.
  • III.3: SSE streaming. Add stream: true. Tokens come back as Server-Sent Events: a sequence of chat.completion.chunk deltas terminated by [DONE], so the user sees text appear as it's generated instead of waiting for the whole reply.
  • III.4: Paged KV cache. Replace the contiguous, reallocating KV cache with a paged one: a fixed pool of small blocks, a page table per layer, O(1) append, no fragmentation. Selectable at runtime with --kv basic|paged.
  • III.5: Radix prefix cache. A radix tree keyed on token ids, storing a KV snapshot at each terminal node. A request whose prompt shares a prefix with a cached one skips straight past the shared tokens. LRU eviction, RAII pins so a live request can't have its KV evicted.
  • III.6: Decode scheduler. A background worker thread that owns a fixed set of slots, admits jobs into free slots, and runs the decode loop. The HTTP handlers stop calling the model directly; they submit a job and await a result. This is what lets chat-server serve concurrent conversations on one model.
  • III.7: Batched decode. When two or more slots are decoding, fuse their per-step forward passes into one batched pass: projections and the MLP become single bigger matmuls. A GuideLLM A/B benchmark measures the throughput gain.
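
As a preview of the paged cache in III.4, the sketch below shows the core idea under simplifying assumptions: one shared pool, a free list, and a page table mapping a sequence's logical positions to physical blocks. BlockPool, PagedSeq, and BLOCK_TOKENS are hypothetical names, and the sketch collapses layers and the K/V split into a single vector of `width` floats per token, so it is an illustration of the technique rather than the chapter's implementation.

```rust
/// Tokens per block (assumed for illustration; real block sizes are usually larger).
const BLOCK_TOKENS: usize = 4;

/// A pool of equally sized blocks, allocated once up front.
struct BlockPool {
    storage: Vec<f32>, // num_blocks * BLOCK_TOKENS * width floats
    free: Vec<usize>,  // indices of blocks nobody is using
    width: usize,      // floats stored per token
}

impl BlockPool {
    fn new(num_blocks: usize, width: usize) -> Self {
        Self {
            storage: vec![0.0; num_blocks * BLOCK_TOKENS * width],
            free: (0..num_blocks).rev().collect(), // pop() hands out block 0 first
            width,
        }
    }

    fn free_blocks(&self) -> usize {
        self.free.len()
    }
}

/// One sequence's view of the cache: logical block index -> physical block index.
struct PagedSeq {
    page_table: Vec<usize>,
    len: usize, // tokens stored so far
}

impl PagedSeq {
    fn new() -> Self {
        Self { page_table: Vec::new(), len: 0 }
    }

    /// Append one token's vector without moving any existing KV data.
    /// Returns false, leaving the sequence unchanged, if the pool is exhausted.
    fn append(&mut self, pool: &mut BlockPool, v: &[f32]) -> bool {
        let w = pool.width;
        assert_eq!(v.len(), w);
        if self.len % BLOCK_TOKENS == 0 {
            // The current block is full (or this is the first token): grab a new one.
            match pool.free.pop() {
                Some(block) => self.page_table.push(block),
                None => return false,
            }
        }
        let block = self.page_table[self.len / BLOCK_TOKENS];
        let start = (block * BLOCK_TOKENS + self.len % BLOCK_TOKENS) * w;
        pool.storage[start..start + w].copy_from_slice(v);
        self.len += 1;
        true
    }

    /// Read back the vector stored at logical position `pos`.
    fn get<'a>(&self, pool: &'a BlockPool, pos: usize) -> &'a [f32] {
        assert!(pos < self.len);
        let block = self.page_table[pos / BLOCK_TOKENS];
        let start = (block * BLOCK_TOKENS + pos % BLOCK_TOKENS) * pool.width;
        &pool.storage[start..start + pool.width]
    }

    /// Return every block to the pool when the request finishes.
    fn release(self, pool: &mut BlockPool) {
        pool.free.extend(self.page_table);
    }
}
```

Because any free block can back any logical position, a sequence's blocks need not be adjacent, and a finished request's blocks are immediately reusable by whoever asks next. That is what removes the "gap that doesn't fit" problem from the contiguous design.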

What "done" looks like

At the end of Act 3, chat-server --bind 0.0.0.0:8000 model.gguf is a real OpenAI-compatible server. Point any OpenAI client at it (curl, the Python SDK, a chat UI) and have a conversation. It admits multiple concurrent conversations, streams each user's tokens back over SSE, allocates KV memory in fixed-size blocks so nothing fragments, reuses cached prompt prefixes, and batches concurrent decode steps into one forward pass.
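
For a feel of what "streams tokens back over SSE" means on the wire, here is a minimal, standard-library-only sketch of the framing. Each event is a data: line followed by a blank line, each payload is a chat.completion.chunk JSON object carrying a content delta, and the stream ends with a literal [DONE] payload. The helper names (sse_event, json_escape, content_chunk, done_event) are hypothetical, the chunk object is trimmed to a few fields, and a real server would more likely build the JSON with a serialization library than by hand.

```rust
/// Frame one payload as a Server-Sent Event: a `data:` line plus a blank line.
fn sse_event(payload: &str) -> String {
    format!("data: {payload}\n\n")
}

/// Minimal JSON string escaping: quotes, backslashes, and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// One streamed piece of the assistant's reply, trimmed to a few fields.
fn content_chunk(id: &str, model: &str, text: &str) -> String {
    let json = format!(
        r#"{{"id":"{}","object":"chat.completion.chunk","model":"{}","choices":[{{"index":0,"delta":{{"content":"{}"}},"finish_reason":null}}]}}"#,
        json_escape(id),
        json_escape(model),
        json_escape(text)
    );
    sse_event(&json)
}

/// The terminator that tells the client the stream is complete.
fn done_event() -> String {
    sse_event("[DONE]")
}
```

A streaming reply is then a run of content_chunk events, one per generated piece of text, followed by a single done_event. Escaping matters here because a raw newline inside the payload would be read by the client as the end of the data line.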

Start with III.1: Chat pipeline.