Build an LLM inference framework from scratch in Rust
Make it work, make it fast, make it serve.
Training frontier models gets the headlines; making them useful in the real world is a different craft. Inference engineering is the work of turning a raw checkpoint into a system that streams fast responses, keeps hardware saturated without blowing memory, and stays responsive when many users hit it at once.
In this series, you build that system end to end. One crate, inferno, grows chapter by chapter: tokenizer, tensors and GGUF loading, a forward pass you implement yourself, CPU and GPU kernels, KV caching and batching, and finally HTTP serving. It goes from an empty repo to an OpenAI-compatible server any chat client can talk to.
github.com/yashkothari42/inferno
Nothing here is pseudocode or hand-waving. Every chapter builds working Rust, and all of it is in that repo. The commit history follows the chapters one to one: nineteen commits, one per chapter, named by act and chapter number. Each commit only adds code and stands on its own. cargo check and cargo run work at every checkpoint.
Read the chapters in order, or check out a chapter's commit and read the diff with the text as commentary.
What you'll build
An OpenAI-compatible HTTP server that loads a real checkpoint, runs on GPU (with tuned CPU paths in the same codebase), and handles concurrent chat traffic: /v1/chat/completions JSON and SSE, continuous batching, paged KV, prefix caching. Don't worry if half those terms are unfamiliar: each gets its own chapter.
Everything in the nineteen chapters below is how we get there, in Rust, with no inference framework, one abstraction at a time, with every major optimization implemented and benchmarked.
What serving a request looks like
HTTP arrives. A chat template formats the messages; the tokenizer converts them to ids; the scheduler decides when and with whom this request batches; the model runs forward through the tensor kernels; token ids stream back out as SSE events.
Figure 1: Chat path. One /v1/chat/completions trace top-to-bottom; node colour = act. The red dashed loop on the right is the decode loop (the only backward arrow).
Every box in that diagram is a chapter. The ones that feel like they should "just work" (the tokenizer, the chat template) turn out to be surprisingly rich up close. The ones that sound complicated (continuous batching, paged attention) turn out to be straightforward once you have built the right abstractions.
Why build your own
Most tutorials pick one arc and stop. Two common shapes:
- PyTorch-first (or a similar stack). You get a model running, then the story is serving and deployment (containers, GPUs, batching APIs, autoscaling) while what happens inside
forward/generatestays mostly a black box. - Tensors and kernels first. You build memory layouts, maybe a matmul or attention block, climb up to a model that can sample tokens, and often end there, before you have to care about concurrent HTTP, KV fragmentation, or interleaved streams.
Each leaves half the system implicit. You can read about vLLM, SGLang, TGI, and llama.cpp all day, but until you have implemented a KV cache and seen it speed up decode 20×, written a Metal matmul kernel and watched the GPU take over from the CPU, scheduled a batch of heterogeneous requests and seen tokens stream out interleaved, the ideas stay abstract. This series walks both halves, hands on.
Scope: what "from scratch" means

Carl Sagan had it right in Cosmos -- if you want something from scratch, you invent the universe first. We're not quite going that far, but we're coming close. No black-box generate(), no mystery layers. We start from an empty repo and build the full inference stack ourselves -- from loading raw model weights and running them on custom kernels, to serving completions over a standard protocol for any client to consume.
What you bring (outside this codebase):
- A model file: a Qwen3 0.6B GGUF checkpoint on disk. The tokenizer (vocabulary and merge table) is read straight out of that file's metadata, so there is nothing else to download.
- Callers: anything that can send HTTP to the address you bind:
curl, a script, the PythonopenaiSDK, a browser chat UI. The series implements the server; it does not ship a product front-end.
Three acts
The series is organized by what the engine can do at the end of each act, not by which subsystem it touches. Make it work, make it fast, make it serve mirrors how real inference engines actually evolved (and how you would build one starting today).
Act 1: Make it work. Six chapters from file format and tokenizer through greedy generation. At the end you have model-generate: coherent text from a real Qwen3 GGUF file, one token at a time, in pure Rust. It is correct, but slow enough that Act 2 has room to run.
Act 2: Make one request as fast as possible. Six chapters: measure with a benchmark harness, then the optimization ladder: KV cache, SIMD CPU kernels, multithreading, a Metal GPU backend, and Q8 weight quantization. One request runs end-to-end on a tuned path; every step is benchmarked.
Act 3: Serve many requests at once. Seven chapters: chat templates, an OpenAI-compatible HTTP API, SSE streaming, paged KV, radix-tree prefix caching, a decode scheduler, and batched decode. Concurrent chats stream per user; work is batched and scheduled; shared prefixes reuse KV.
What you should know already
Comfort with systems programming (the codebase is Rust, but the patterns translate) and a willingness to read math (softmax(QK^T / √d)V and friends show up constantly). No prior ML experience required. The math in each chapter is developed as needed, and the transformer itself gets built up layer by layer in chapter I.5.
The full chapter map
Each box is a chapter or act page -- click to jump straight in.