Act 1: Make it work

It's tempting to skip the "make it work" phase, grab a library, call model.generate(), and dive straight into the performance work that looks interesting. Don't. The cost of that shortcut is that the performance work stays abstract: you never feel the thing you're speeding up. When the KV cache makes decode (the loop that generates one token at a time) over thirty times faster in II.2, it matters that you're comparing it against a forward pass (one full run of the model over its input) you built yourself and watched take about a second per token, each token slower than the last. Every Act 2 chapter is a reaction to something you built here.

So the point of Act 1 is narrow and non-negotiable: get a real Qwen3 checkpoint producing coherent text in a program you wrote every line of. It won't be fast or pretty or servable. It will work.

The reference model

We use Qwen3 0.6B as the running example: a modern decoder-only transformer small enough to iterate on quickly, large enough to surface every interesting bottleneck. Its architecture is representative of the current generation: RMSNorm, grouped-query attention with RoPE, and a SwiGLU MLP. Don't worry if those terms are unfamiliar; each is built up, from scratch, in chapter I.5.

Everything in the codebase is written generically against a Model trait, so swapping in a different architecture is a matter of writing one new file. Qwen3 0.6B just happens to be what every CLI calls.
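To make "written generically against a Model trait" concrete, here is a minimal sketch of what such a trait could look like. The method names (`vocab_size`, `forward`), the `ToyModel` stand-in, and the `next_token_logits` helper are all assumptions invented for illustration; the trait the codebase actually defines may differ.

```rust
/// Hypothetical shape of a model trait; the method names are illustrative only.
pub trait Model {
    /// Number of vocabulary entries, which is also the length of the logits vector.
    fn vocab_size(&self) -> usize;
    /// One forward pass over `tokens`, returning next-token logits.
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Stand-in "architecture": always favours the id after the last token seen.
pub struct ToyModel {
    pub vocab: usize,
}

impl Model for ToyModel {
    fn vocab_size(&self) -> usize {
        self.vocab
    }

    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0_f32; self.vocab];
        let last = tokens.last().copied().unwrap_or(0) as usize;
        logits[(last + 1) % self.vocab] = 1.0;
        logits
    }
}

/// Written against the trait, so it works for any type that implements it.
pub fn next_token_logits<M: Model>(model: &M, tokens: &[u32]) -> Vec<f32> {
    let logits = model.forward(tokens);
    assert_eq!(logits.len(), model.vocab_size(), "one score per vocabulary token");
    logits
}
```

Under this sketch, a new architecture would be one more `impl Model for ...` block, and callers such as `next_token_logits` would not change.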

The path

Six chapters climb from raw bytes on disk to a checkpoint generating real tokens. Each depends on the previous one, and each adds exactly one capability.

Figure: Act 1 at a glance. Each chapter produces a concrete artifact; each depends on the one above.

I.1: GGUF. The file format ggml / llama.cpp uses. Parse the header, the typed key-value metadata (hyperparameters such as layer count and dimensions, the vocab, the chat template), and the tensor index. We don't run anything yet. We learn the format and build a gguf-inspect tool to dump what's inside.
I.2: Tokenizer. Byte-pair encoding: turn text into token ids and back, handling UTF-8 correctly. The vocabulary and merge table come straight from the GGUF metadata parsed in I.1. Ships a tokenizer-demo binary.
I.3: Tensor. A minimal Tensor type (a flat Vec<f32> plus a shape) that every weight and every intermediate value (activation) flows through. Small, but it pins down the memory layout the rest of the engine assumes.
I.4: Backend. A Backend trait listing every numeric operation a transformer needs (matmul, softmax, the pieces of rmsnorm, silu, RoPE) and a plain scalar CPU implementation of all of them. This trait is the seam every faster backend in Act 2 plugs into.
I.5: Qwen3 forward. Wire the 28-layer forward pass (token embedding, RMSNorm, grouped-query attention with RoPE, SwiGLU MLP, residuals, final norm, output head) and load the real weights out of the GGUF file. One forward pass turns a prompt into next-token logits (a score for each vocabulary token).
I.6: Greedy generation. The autoregressive loop: take the highest-probability token, feed it back in, run forward again, repeat until an end-of-sequence token or a length cap. End-to-end multi-token generation behind a model-generate CLI.
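As a taste of the first step, I.1, here is a rough sketch of reading the fixed-size part of a GGUF header from a byte slice. It assumes the layout used by GGUF versions 2 and 3: the 4-byte magic `GGUF`, then a `u32` version, a `u64` tensor count, and a `u64` metadata key-value count, all little-endian. The names `GgufHeader` and `parse_header` are made up for this example and are not the chapter's actual code.

```rust
/// Fixed-size fields at the start of a GGUF file (layout assumed: versions 2 and 3).
#[derive(Debug, PartialEq)]
pub struct GgufHeader {
    pub version: u32,
    pub tensor_count: u64,
    pub metadata_kv_count: u64,
}

/// Read a little-endian u32 starting at byte offset `at`.
fn read_u32(bytes: &[u8], at: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[at..at + 4]);
    u32::from_le_bytes(buf)
}

/// Read a little-endian u64 starting at byte offset `at`.
fn read_u64(bytes: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[at..at + 8]);
    u64::from_le_bytes(buf)
}

/// Parse the first 24 bytes: magic, version, tensor count, metadata count.
pub fn parse_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err(format!("need 24 header bytes, got {}", bytes.len()));
    }
    if !bytes.starts_with(b"GGUF") {
        return Err("missing GGUF magic".to_string());
    }
    Ok(GgufHeader {
        version: read_u32(bytes, 4),
        tensor_count: read_u64(bytes, 8),
        metadata_kv_count: read_u64(bytes, 16),
    })
}
```

The typed key-value metadata and the tensor index follow these 24 bytes; parsing them is the bulk of the work in I.1 and is not attempted here.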
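The loop described in I.6 is small enough to sketch up front. In this hedged version the model is replaced by a closure, `forward`, that maps token ids to next-token logits; `generate_greedy`, `argmax`, and the parameter names are assumptions for illustration rather than the chapter's real API. Picking the largest logit is the same as picking the highest-probability token, because softmax preserves ordering.

```rust
/// Index of the largest logit (ties resolve to the lowest index).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &score) in logits.iter().enumerate() {
        if score > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Greedy decoding: repeatedly append the highest-scoring next token.
/// `forward` stands in for the model: token ids in, next-token logits out.
pub fn generate_greedy<F>(
    mut forward: F,
    prompt: &[u32],
    eos: u32,
    max_new_tokens: usize,
) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new_tokens {
        // Every step re-runs the model over the whole sequence so far.
        let logits = forward(&tokens);
        let next = argmax(&logits);
        if next == eos {
            break; // stop at end-of-sequence without emitting it
        }
        tokens.push(next);
    }
    tokens
}
```

Note that `forward` is called on the full token list at every step. That repetition is what makes this baseline slow, and it is the part the KV cache later removes.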

Getting started

Create the crate:

BASH

cargo new --lib inferno
cd inferno

cargo new --lib generates Cargo.toml and src/lib.rs. The default Cargo.toml is just package metadata:

TOML

[package]
name = "inferno"
version = "0.1.0"
edition = "2024"
 
[dependencies]

src/lib.rs comes with a sample add function and a test; delete both. Chapter I.1 starts from an empty file.

What "done" looks like

At the end of Act 1, model-generate "Once upon a time" produces a sensible continuation from a real Qwen3 checkpoint. Every line of code was written deliberately; you understand every number that moves through the model.

It will also be slow: about a second per token on an M2 Pro, and worsening as the sequence grows. That's the baseline Act 2 is built against.
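To see why each token is slower than the last, count how many token positions the model has to process. The sketch below assumes the naive loop, in which every generation step re-runs the forward pass over the prompt plus everything generated so far. The function name is hypothetical, and position counts are only a rough proxy for time.

```rust
/// Token positions processed when every step re-runs the full sequence.
/// Step `s` (counting from 0) sees the prompt plus the `s` tokens generated so far.
pub fn positions_without_cache(prompt_len: usize, new_tokens: usize) -> usize {
    (0..new_tokens).map(|step| prompt_len + step).sum()
}
```

For a 4-token prompt and 3 new tokens, the steps process 4, 5, and 6 positions, 15 in total. The per-step cost grows linearly, so the total grows roughly quadratically with the number of generated tokens.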