Where to from here
You started with an empty cargo new. You finished with inferno: an OpenAI-compatible HTTP server, written end-to-end in Rust, that runs Qwen3 on a paged KV cache, with a decode scheduler that batches concurrent requests and a radix-tree prefix cache. No ML framework, no tch, no candle. Every line of code, from the tensor struct to the scheduler loop, is something you built and understand.
That's the whole point. The shortest description of modern LLM inference engineering is "a tensor library, a model, a KV cache, a scheduler, a paged allocator, and a prefix cache, stitched together with an HTTP server." You now have all six. The rest of the field (vLLM, SGLang, TGI, llama.cpp, mistral.rs, TensorRT-LLM, whatever) is variations on those primitives, plus the hard-won details that come from running them in production on serious hardware.
What to read next
Now that the primitives are concrete, production engine source code becomes readable. Some places to start:
- vLLM's scheduler. The decode scheduler in III.6 is simplified; vLLM's adds admission control, preemption, and priority. The code is Python but legible.
- SGLang's radix cache. A more sophisticated sibling of the prefix cache in III.5.
- llama.cpp. The definitive CPU and mixed-precision reference. Their quantization code is the thing to read after you've taken II.6 seriously.
- Tri Dao's flash attention papers (v1, v2, v3). The attention you built in I.5 materializes the full score matrix; flash attention is the IO-aware algorithm that production engines use instead. A natural next thing to study, and a natural next thing to implement here.
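To make the flash attention item above a little more concrete, here is a minimal sketch of the online-softmax rescaling that lets attention be computed one tile of keys at a time. It is not the flash attention kernel itself, which also tiles over queries and keeps tiles in fast on-chip memory; it only shows why the full score row never needs to exist at once. The function name `tiled_attention`, the single-query setup, and the `Vec<Vec<f32>>` layout are assumptions made for illustration, not the tensor types from this series.

```rust
/// Attention output for a single query vector, computed over key/value
/// tiles with a running ("online") softmax.
/// Hypothetical helper for illustration; assumes finite inputs.
pub fn tiled_attention(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    tile: usize,
) -> Vec<f32> {
    assert_eq!(keys.len(), values.len());
    assert!(tile > 0);
    let d_v = values.first().map_or(0, |v| v.len());
    let scale = 1.0 / (q.len() as f32).sqrt();

    let mut running_max = f32::NEG_INFINITY; // largest score seen so far
    let mut denom = 0.0f32; // sum of exp(score - running_max)
    let mut acc = vec![0.0f32; d_v]; // unnormalized weighted sum of values

    for (k_tile, v_tile) in keys.chunks(tile).zip(values.chunks(tile)) {
        // Scores for this tile only: at most `tile` floats live at once.
        let scores: Vec<f32> = k_tile
            .iter()
            .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        let tile_max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(tile_max);

        // Rescale what was accumulated under the old max.
        // On the first tile this is exp(-inf) = 0, which is harmless.
        let correction = (running_max - new_max).exp();
        denom *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }

        // Fold this tile's keys and values into the accumulators.
        for (s, v) in scores.iter().zip(v_tile) {
            let w = (s - new_max).exp();
            denom += w;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += w * x;
            }
        }
        running_max = new_max;
    }

    // Normalize once at the end.
    for a in acc.iter_mut() {
        *a /= denom;
    }
    acc
}
```

Calling it with any tile size should give the same output as a plain softmax over all scores, up to floating-point rounding. That equivalence is the property a real tiled kernel relies on.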
Where to push this codebase next
If you want to keep extending what you built, the highest-leverage next steps:
- Flash attention. The attention in this codebase materializes the seq × seq score matrix in memory; flash attention tiles the computation so it never does. The payoff is that long contexts stop being quadratically expensive in memory.
- Speculative decoding. A small draft model proposes k tokens, the big model verifies them in a single forward. Slot it into the scheduler from III.6. 2–3× decode on most workloads, and the data-flow changes are smaller than you'd expect.
- A different architecture. The Model trait in this codebase has exactly one implementation: Qwen3. Wiring up a second (Llama, Mistral, Phi, or Mixtral if you're ambitious about mixture-of-experts) is the test of whether the abstractions actually generalize.
- More quantization. II.6 does 8-bit weights. The 4-bit layouts (Q4_K and friends) are where llama.cpp gets its real density, and a good excuse to learn block-wise quantization properly.
- Multi-GPU tensor parallelism. If you have the hardware. A sharded matmul, an all-reduce, a communicator. The engine grows to machines larger than a laptop.
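As a rough illustration of why the data-flow change for speculative decoding is small, here is a sketch of the acceptance step under greedy decoding. It assumes the big model has already run one forward pass over the prompt plus the k draft tokens and produced k + 1 argmax predictions, where `target[i]` is its choice given everything up to `draft[..i]`. The function name `accept_draft` and the `u32` token type are hypothetical. Sampling-based decoding needs a rejection-sampling rule instead, which this sketch does not cover.

```rust
/// Greedy verification for speculative decoding.
/// `draft`  : k tokens proposed by the small model.
/// `target` : k + 1 tokens; target[i] is the big model's argmax after
///            seeing the prompt followed by draft[..i].
/// Returns the tokens to append to the sequence this step.
/// Hypothetical helper for illustration only.
pub fn accept_draft(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(target.len(), draft.len() + 1);

    // Length of the longest prefix on which both models agree.
    let n_match = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();

    // The agreed tokens, plus one more from the big model: either its
    // correction at the first mismatch, or a bonus token if all k matched.
    target[..=n_match].to_vec()
}
```

Every call yields at least one token, so the worst case matches ordinary decoding in tokens per big-model forward pass, and the best case yields k + 1. In a scheduler like the one from III.6, the main adjustment would presumably be that a sequence can advance by a variable number of tokens per step, with KV cache entries for rejected draft tokens discarded.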
A closing thought
The field moves quickly enough that any specific framework or technique you learn today may be replaced in eighteen months by something better. The primitives won't be. A decade from now, inference engines will still be built from tensors, models, caches, schedulers, and allocators; the details will change, the shape won't. That's the bet this series makes: the point of building one yourself isn't to ship it, but to understand the shape so well that every new technique is a small delta on top of something you already know.
Go build something.