Act 3: Recap

chat-server is a real server now. It speaks OpenAI's HTTP and SSE protocols, so an existing OpenAI chat client connects to it unmodified. It admits concurrent conversations through a background scheduler, stores KV memory in fixed-size paged blocks that avoid external fragmentation, reuses cached prompt prefixes across requests so a shared system prompt is prefilled once, and fuses concurrent decode steps into one batched forward pass.
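As a reminder of how little there is to the SSE side of that protocol, here is a minimal sketch of the wire framing: each event is a `data:` line followed by a blank line, and the stream ends with a literal `[DONE]` sentinel. The function names (`sse_chunk`, `sse_done`, `json_escape`) are hypothetical and not taken from chat-server, and the chunk is simplified: real chunks carry further fields such as `created`, and the final chunk sets `finish_reason`.

```rust
/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Frame one text delta as a simplified chat.completion.chunk SSE event.
/// (Hypothetical helper; a real server would serialize a typed struct.)
fn sse_chunk(id: &str, model: &str, delta: &str) -> String {
    format!(
        "data: {{\"id\":\"{}\",\"object\":\"chat.completion.chunk\",\"model\":\"{}\",\"choices\":[{{\"index\":0,\"delta\":{{\"content\":\"{}\"}},\"finish_reason\":null}}]}}\n\n",
        json_escape(id),
        json_escape(model),
        json_escape(delta)
    )
}

/// The sentinel event that tells the client the stream is finished.
fn sse_done() -> &'static str {
    "data: [DONE]\n\n"
}
```

A handler would write one `sse_chunk` frame per generated text delta and finish with `sse_done()`. The blank line after each `data:` line is what delimits events for the client.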

What you shipped in Act 3

  • A chat pipeline (III.1) that loads the model's trained prompt format (its Jinja chat template) from GGUF metadata, renders a message list through minijinja, and drives one chat turn end to end, streaming text deltas through a callback. Smoke-tested with the interactive chat-repl binary.
  • An OpenAI-compatible HTTP API (III.2) built on axum and tokio: GET /health, GET /v1/models, and non-streaming POST /v1/chat/completions, with request and response JSON shaped exactly like OpenAI's. The chat-server binary.
  • SSE streaming (III.3): stream: true returns a live sequence of chat.completion.chunk events terminated by [DONE], so a client watches the reply appear token by token.
  • A paged KV cache (III.4) replacing the contiguous, reallocating cache from II.2: a pool of fixed-size blocks, a per-layer page table, O(1) append, no external fragmentation. Selectable at runtime with --kv basic|paged.
  • A radix prefix cache (III.5): a tree keyed on token ids storing KV snapshots at terminal nodes, with LRU eviction and RAII pins so a live request's snapshot can't be evicted. Shared prompt prefixes are prefilled once and reused.
  • A decode scheduler (III.6): a background worker thread owning the model, tokenizer, backend, and prefix cache, holding a fixed set of slots, admitting jobs and interleaving their decode steps. The HTTP handlers submit jobs and await results over channels.
  • Batched decode (III.7): when two or more slots decode together, their forward passes fuse into one. Projections and the MLP become single matmuls with a real batch dimension, while attention stays a per-slot loop. A GuideLLM A/B benchmark measures the throughput gain.
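The paged cache is the piece whose mechanics are easiest to forget, so here is a minimal sketch of the idea: a shared pool of fixed-size blocks plus a per-sequence page table that maps token positions to blocks. All names (`BlockPool`, `PageTable`, `BLOCK_SIZE`) and the block size of 16 are assumptions for illustration, not the code from III.4. It stores one row of floats per token rather than separate K and V tensors per layer.

```rust
/// Token positions per block (an assumed value for this sketch).
const BLOCK_SIZE: usize = 16;

/// A shared pool of fixed-size blocks; each block holds BLOCK_SIZE rows.
struct BlockPool {
    row_len: usize,
    storage: Vec<f32>, // num_blocks * BLOCK_SIZE * row_len floats
    free: Vec<usize>,  // indices of blocks not owned by any sequence
}

impl BlockPool {
    fn new(num_blocks: usize, row_len: usize) -> Self {
        BlockPool {
            row_len,
            storage: vec![0.0; num_blocks * BLOCK_SIZE * row_len],
            free: (0..num_blocks).rev().collect(),
        }
    }
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
    fn free_blocks(&self) -> usize {
        self.free.len()
    }
}

/// One sequence's page table: logical block number -> pool block index.
struct PageTable {
    blocks: Vec<usize>,
    len: usize, // number of token rows stored so far
}

impl PageTable {
    fn new() -> Self {
        PageTable { blocks: Vec::new(), len: 0 }
    }

    /// Append one row without copying existing data. A new block is taken
    /// only when the current one is full. Returns false if the pool is empty.
    fn append(&mut self, pool: &mut BlockPool, row: &[f32]) -> bool {
        assert_eq!(row.len(), pool.row_len);
        if self.len % BLOCK_SIZE == 0 {
            match pool.alloc() {
                Some(b) => self.blocks.push(b),
                None => return false,
            }
        }
        let block = self.blocks[self.len / BLOCK_SIZE];
        let start = (block * BLOCK_SIZE + self.len % BLOCK_SIZE) * pool.row_len;
        pool.storage[start..start + pool.row_len].copy_from_slice(row);
        self.len += 1;
        true
    }

    /// Look up the row stored for token position `pos`.
    fn row<'a>(&self, pool: &'a BlockPool, pos: usize) -> &'a [f32] {
        assert!(pos < self.len);
        let block = self.blocks[pos / BLOCK_SIZE];
        let start = (block * BLOCK_SIZE + pos % BLOCK_SIZE) * pool.row_len;
        &pool.storage[start..start + pool.row_len]
    }

    /// Return every block to the pool when the sequence finishes.
    fn free(&mut self, pool: &mut BlockPool) {
        for b in self.blocks.drain(..) {
            pool.release(b);
        }
        self.len = 0;
    }
}
```

Because every block is the same size, any freed block can serve any sequence, so the pool never develops unusable gaps. The only waste is the unfilled tail of each sequence's last block.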

Together these turn the engine from a fast single-user CLI into a multi-tenant server you can point a real chat client at.

The whole journey

Three acts, one consistent arc:

  • Act 1 made it work. From a GGUF file on disk to a model-generate CLI: a binary parser, a BPE tokenizer, a Tensor type, a Backend trait with a scalar CPU implementation, the 28-layer Qwen3 forward pass, and the greedy autoregressive loop. Slow (seconds per token) but every number understood.
  • Act 2 made it fast. A benchmark harness to measure honestly, a KV cache to kill quadratic decode, then three faster backends behind the same Backend trait (SIMD, multithreaded, Metal GPU) and Q8_0 quantized weights to quarter the memory footprint. The same model, an order of magnitude quicker.
  • Act 3 made it serve. Chat templates, an OpenAI HTTP and SSE API, a paged KV cache, a radix prefix cache, a decode scheduler, and batched decode. The same fast engine, now answering many users at once.

Every act reacted to the one before it. Act 2's KV cache only mattered because you had watched Act 1's forward pass crawl. Act 3's paged cache only mattered because you had built Act 2's contiguous one and seen exactly how it would fragment under load. Nothing was abstract.

What's still missing

This is a small-but-real inference engine, not a clone of vLLM or SGLang. Several things production engines do, this one does not:

  • Quantized matmul. II.6 gave you Q8_0 weights on disk and dequantization, but a production engine fuses dequant into the matmul and supports Q4 and FP8. For anything larger than 0.6B, that is table stakes.
  • Speculative decoding. A small draft model proposes several tokens and the target model verifies them in one pass, which often yields 2-3× the decode throughput. Not built.
  • Chunked prefill. A single very long prompt currently monopolizes a slot during its prefill. Splitting prefill into chunks so it interleaves with other requests' decode steps is the next scheduler refinement.
  • Multi-GPU. The engine runs on one device. A 70B-class model needs tensor parallelism across several.
  • Structured output and tool calls. Constrained decoding (JSON mode, grammars) and function-calling APIs: all unbuilt.
  • Production robustness. Graceful shutdown under load, backpressure, admission control, per-tenant quotas, multi-model hosting. Everything a real API needs and a tutorial codebase skips.
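To make one of these gaps concrete, here is a minimal sketch of the acceptance step in speculative decoding, in its greedy form only. It assumes the target model has already been run once over the context plus the drafted tokens, producing its own argmax at every position. The function name `verify_greedy` and the token type `u32` are illustrative assumptions; sampling-based variants use a probabilistic acceptance rule that is not shown here.

```rust
/// Greedy speculative verification (illustrative sketch).
///
/// `draft`  : tokens proposed by the small draft model.
/// `target` : the target model's argmax at each position, where `target[i]`
///            is what the target would emit after the context plus
///            `draft[..i]`. It has one more entry than `draft`.
///
/// Returns the tokens to commit: the longest prefix of `draft` that the
/// target agrees with, followed by one token from the target itself
/// (a correction on mismatch, or a bonus token if every draft matched).
fn verify_greedy(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(target.len(), draft.len() + 1);
    let matched = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();
    target[..=matched].to_vec()
}
```

Each verification pass commits at least one token and at most `draft.len() + 1`, and the committed tokens are exactly what plain greedy decoding with the target model would have produced. The speedup comes from paying for one target forward pass instead of one per token.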

What you learned

More than the specific code, you learned the shape of the problem. Every production inference engine in 2026 is a variation on the primitives you built here: a tensor type, a backend abstraction, a KV cache, a paged allocator, a prefix cache, a scheduler, a batched forward pass. Open the vLLM, SGLang, or TGI source and it is no longer foreign. Reading it becomes a matter of mapping their names onto yours.

You started with a bag of numbers in a binary file. You finished with a server that streams tokens to concurrent users over an OpenAI-compatible API, and you wrote every line between.

Continue to Where to from here.