Where to from here

You started with an empty cargo new. You finished with inferno: an OpenAI-compatible HTTP server, written end-to-end in Rust, that runs Qwen3 on a paged KV cache, with a decode scheduler that batches concurrent requests and a radix-tree prefix cache. No ML framework, no tch, no candle. Every line of code, from the tensor struct to the scheduler loop, is something you built and understand.

That's the whole point. The shortest description of modern LLM inference engineering is "a tensor library, a model, a KV cache, a scheduler, a paged allocator, and a prefix cache, stitched together with an HTTP server." You now have all six. The rest of the field (vLLM, SGLang, TGI, llama.cpp, mistral.rs, TensorRT-LLM, whatever) is variations on those primitives, plus the hard-won details that come from running them in production on serious hardware.

Where to push this codebase next

If you want to keep extending what you built, the highest-leverage next steps:

Flash attention. The attention in this codebase materializes the seq × seq score matrix in memory; flash attention tiles the computation so it never does. The payoff is that attention memory grows linearly with context length instead of quadratically. The compute is still quadratic, but long contexts stop blowing up memory.
Speculative decoding. A small draft model proposes k tokens, and the big model verifies them in a single forward pass. Slot it into the scheduler from III.6. Speedups in the 2–3× range for decode are commonly reported, depending on how often the draft's tokens are accepted, and the data-flow changes are smaller than you'd expect.
A different architecture. The Model trait in this codebase has exactly one implementation: Qwen3. Wiring up a second (Llama, Mistral, Phi, or Mixtral if you're ambitious about mixture-of-experts) is the test of whether the abstractions actually generalize.
More quantization. II.6 does 8-bit weights. The 4-bit layouts (Q4_K and friends) are where llama.cpp gets its density, and a good excuse to learn block-wise quantization properly.
Multi-GPU tensor parallelism. If you have the hardware. A sharded matmul, an all-reduce, a communicator. The engine grows to machines larger than a laptop.
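
To make the flash attention item concrete: the trick that lets you tile attention is an online softmax. You walk the keys in blocks and keep a running max, a running denominator, and an unnormalized output, rescaling what you have whenever the max changes. Here is a minimal single-query CPU sketch of that idea. It assumes plain slices rather than the tensor struct from the series, and the names attend_naive, attend_tiled, and the block parameter are made up for illustration. A real kernel does this per tile in on-chip memory; this only shows the arithmetic.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Reference version: materializes every score for one query, then softmaxes.
fn attend_naive(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (e, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += e / sum * x;
        }
    }
    out
}

/// Tiled version: visits keys `block` at a time (block must be > 0) and never
/// holds more than one block of scores.
fn attend_tiled(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], block: usize) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];

    for (k_blk, v_blk) in keys.chunks(block).zip(values.chunks(block)) {
        let scores: Vec<f32> = k_blk.iter().map(|k| dot(q, k) * scale).collect();
        let blk_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(blk_max);

        // Everything accumulated so far was computed relative to the old max;
        // rescale it to the new one. On the first block this factor is 0.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        for o in acc.iter_mut() {
            *o *= correction;
        }

        for (s, v) in scores.iter().zip(v_blk) {
            let w = (s - new_max).exp();
            running_sum += w;
            for (o, x) in acc.iter_mut().zip(v) {
                *o += w * x;
            }
        }
        running_max = new_max;
    }

    // Normalize once at the end.
    acc.iter().map(|o| o / running_sum).collect()
}
```

Both functions should agree to within floating-point error for any block size, which makes attend_naive a handy oracle when you write the real kernel.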
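
For speculative decoding, the piece that touches the scheduler is the accept step. The sketch below shows the greedy variant only: a draft token is kept if it matches the big model's argmax at that position. The sampling variant uses a rejection test instead and is not shown. The function names and the layout of target_logits are assumptions for illustration, not inferno's API.

```rust
/// Index of the largest logit. Assumes a non-empty slice.
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &x) in logits.iter().enumerate() {
        if x > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// `draft` holds the k proposed tokens. `target_logits` holds k + 1 rows from
/// one forward pass of the big model: row i is its prediction for the token
/// that follows the prefix plus draft[..i].
///
/// Returns the tokens to append: the accepted draft prefix, plus exactly one
/// token chosen by the big model (a correction at the first mismatch, or a
/// bonus token if every draft token matched).
fn accept_greedy(draft: &[u32], target_logits: &[Vec<f32>]) -> Vec<u32> {
    assert_eq!(target_logits.len(), draft.len() + 1);
    let mut out = Vec::with_capacity(draft.len() + 1);
    for (i, &tok) in draft.iter().enumerate() {
        let want = argmax(&target_logits[i]);
        if want != tok {
            // First disagreement: take the big model's token and stop.
            out.push(want);
            return out;
        }
        out.push(tok);
    }
    // All k draft tokens accepted; the last row gives one more for free.
    out.push(argmax(&target_logits[draft.len()]));
    out
}
```

Each call yields between 1 and k + 1 tokens, so the scheduler has to handle a variable number of new tokens per sequence per step. With greedy acceptance the output is the same as decoding with the big model alone.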
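
If you take on 4-bit quantization, it helps to start with the simplest block-wise scheme before tackling the real layouts. The sketch below is a generic symmetric 4-bit format with one f32 scale per block of 32 weights. It is not Q4_K or any other llama.cpp layout, and the Block4 type and function names are invented for this example.

```rust
const BLOCK: usize = 32;

/// One block: a shared f32 scale plus 32 signed 4-bit values, two per byte.
struct Block4 {
    scale: f32,
    packed: [u8; BLOCK / 2],
}

/// Round v / scale to an integer in -7..=7 and store it with a +8 offset,
/// so the result fits in a nibble (1..=15).
fn quant(v: f32, scale: f32) -> u8 {
    ((v / scale).round().clamp(-7.0, 7.0) as i8 + 8) as u8
}

fn quantize_block(x: &[f32; BLOCK]) -> Block4 {
    let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Map [-max_abs, max_abs] onto the integers -7..=7.
    // An all-zero block gets a dummy scale to avoid dividing by zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut packed = [0u8; BLOCK / 2];
    for (i, pair) in x.chunks(2).enumerate() {
        let lo = quant(pair[0], scale);
        let hi = quant(pair[1], scale);
        packed[i] = lo | (hi << 4);
    }
    Block4 { scale, packed }
}

fn dequantize_block(b: &Block4) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (i, &byte) in b.packed.iter().enumerate() {
        out[2 * i] = ((byte & 0x0F) as i8 - 8) as f32 * b.scale;
        out[2 * i + 1] = ((byte >> 4) as i8 - 8) as f32 * b.scale;
    }
    out
}
```

In this layout each block costs 16 bytes of packed values plus a 4-byte scale, or 5 bits per weight, and the round-trip error per weight is at most half the block's scale. Keeping the scale per block rather than per tensor means one outlier only degrades its own 32 neighbors.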

A closing thought

The field moves quickly enough that any specific framework or technique you learn today may be replaced in eighteen months by something better. The primitives won't be. A decade from now, inference engines will still be built from tensors, models, caches, schedulers, and allocators; the details will change, the shape won't. That's the bet this series makes: the point of building one yourself isn't to ship it, but to understand the shape so well that every new technique is a small delta on top of something you already know.

Go build something.
