Act 2: Recap

The single-request path is fast. Between the Act 1 baseline (seconds per decode token on one scalar CPU core) and the end of Act 2, each chapter carved off a specific inefficiency, and every speedup was measured in the same harness built in II.1. No claim in this act was a vibe; every one of them moved a number we printed.

What moved the needle

Six chapters, in order of what they did:

II.1: Benchmark harness. No speedup: the instrument. A Metrics type that times every forward pass and reports time-to-first-token and decode tokens/second. It established the baseline everything after it is measured against, and it handed us the first clue: forward passes that got slower as the response grew.
II.2: KV cache. The single biggest lever, and an algorithmic fix rather than a faster kernel. Without it, generating token N re-runs the network over all N prior tokens, so total work over a response grows as N². With it, each decode step processes one token and reuses cached per-layer keys and values. Decode collapsed from growing-per-token to near-constant-per-token; throughput jumped 33× (0.55 → 18.4 tokens/sec).
II.3: SIMD CPU backend. The measurement that shaped the act. NEON intrinsics let the matmul process 4 floats per instruction (16 per loop iteration with unrolling) instead of one, yet decode did not move, because one core was already saturating its memory bandwidth streaming the weights. The negative result is the act's central fact, measured; the backend machinery built here carries every later chapter, and its SIMD techniques pay off in II.6.
II.4: Multithreaded CPU backend. rayon spreads matmul rows across every core: a 511-token prefill dropped from 19.4 s to 3.8 s (5.1×). Deliberately no benefit for decode: a single-row matrix-vector product has no rows to spread across cores, so the size threshold sends it straight back to the SIMD path.
II.5: Metal GPU backend. A Metal Shading Language matmul kernel, command buffers, and Apple Silicon's unified memory for copy-free data sharing. Big prefill matmuls move to the thousands of GPU threads (the 511-token prefill drops again, 3.7 s → 1.7 s); small decode matmuls fall back to the CPU, again by threshold.
II.6: Q8_0 quantization. The fix for decode. The model's weights ship as 8-bit quantized blocks, roughly a quarter the bytes of FP32, and the engine learned to multiply against them directly, quantizing the activations on the fly so the whole dot product runs in integer NEON. Decode, which is bandwidth-bound, more than doubled: 18 → 41 tokens/sec.
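
To make the II.2 entry concrete, here is a minimal sketch that counts how many token positions pass through the network with and without a cache. The function names are hypothetical, not the engine's API, and the count ignores that attention over the cached keys still grows slowly per step (the "near-constant" in the entry above).

```rust
/// Positions fed through the network to generate `n_new` tokens after a
/// `prompt_len`-token prompt when every step re-runs the whole sequence.
/// Hypothetical helper for illustration only.
fn positions_without_cache(prompt_len: u64, n_new: u64) -> u64 {
    // Step i (0-based) processes the prompt plus the i tokens generated so far.
    (0..n_new).map(|i| prompt_len + i).sum()
}

/// Same workload with a KV cache: one prefill pass over the prompt produces
/// the first token, then each remaining step processes a single position.
fn positions_with_cache(prompt_len: u64, n_new: u64) -> u64 {
    prompt_len + n_new.saturating_sub(1)
}
```

For a 16-token prompt and 100 generated tokens, the uncached count is 6550 positions against 115 with the cache, and the gap widens quadratically as the response grows.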

The one idea that explains all of it

Every chapter maps onto the split the act intro opened with: inference is prefill (compute-bound: reading the prompt, big matrix-by-matrix multiplies) and decode (memory-bandwidth-bound: writing the response one token at a time, matrix-by-vector multiplies).

Read the ladder through that lens and it stops being a list of tricks:

The KV cache helped decode: it fixed decode's algorithm.
SIMD, threads, and Metal attack compute. Threads and Metal duly delivered on prefill, whose bottleneck is compute. SIMD delivered something better than a speedup: proof. Decode didn't move at all under a 4×-faster kernel, because its bottleneck was never arithmetic: one core streaming 2.4 GB of weights per token was already at the bandwidth wall.
Q8_0 helped decode: it cut bytes, and bytes are decode's entire problem.
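
The byte cut behind Q8_0 can be sketched in scalar Rust. This is an illustration under stated assumptions, not the engine's kernel: the type and function names are hypothetical, the scale is kept as f32 to stay within the standard library (the GGML Q8_0 format stores it as a 16-bit float, giving 34 bytes per 32 weights against 128 bytes in FP32), and a plain loop stands in for the integer NEON instructions.

```rust
const BLOCK: usize = 32; // Q8_0 groups values in blocks of 32

/// One quantized block: a shared scale plus 32 signed bytes.
/// Hypothetical type; the scale is f32 here for simplicity.
#[derive(Clone, Debug)]
struct BlockQ8 {
    scale: f32,
    quants: [i8; BLOCK],
}

/// Quantize 32 floats so that x[i] is approximately scale * quants[i].
fn quantize_block(x: &[f32; BLOCK]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; BLOCK];
    for (q, v) in quants.iter_mut().zip(x.iter()) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { scale, quants }
}

/// Dot product of a weight block and an activation block: accumulate in
/// integers, then apply both scales once at the end.
fn dot_block(w: &BlockQ8, a: &BlockQ8) -> f32 {
    let acc: i32 = w
        .quants
        .iter()
        .zip(a.quants.iter())
        .map(|(&p, &q)| p as i32 * q as i32)
        .sum();
    acc as f32 * w.scale * a.scale
}
```

Quantizing the activations with the same scheme is what lets the inner loop stay in integers; the floating-point work is reduced to one multiply per block.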

If you remember one thing from Act 2, remember that decode is bound by memory traffic, not FLOPs (raw arithmetic rate). It is why "just write a faster matmul" only takes you so far, and why cutting the weight bytes to roughly a quarter did what three faster matmul backends couldn't.
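
The bandwidth bound is a single division. The sketch below assumes every weight byte is streamed once per decode token; the function name is hypothetical, and the bandwidth figure in the usage note is an illustrative placeholder, not a measurement from this act.

```rust
/// Upper bound on decode throughput when each token requires streaming
/// `weight_bytes` from memory at `bandwidth_bytes_per_sec`.
/// Hypothetical helper; real throughput sits at or below this ceiling.
fn decode_ceiling_tokens_per_sec(weight_bytes: f64, bandwidth_bytes_per_sec: f64) -> f64 {
    bandwidth_bytes_per_sec / weight_bytes
}
```

With 2.4 GB of weights and an assumed 48 GB/s of usable bandwidth, the ceiling is 20 tokens/sec no matter how fast the arithmetic is. At a quarter of the bytes the same bandwidth allows 80. The ceiling moves only when the bytes do.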

What the engine still can't do

Everything we built assumes exactly one user, one prompt, one response at a time:

No server. There's a CLI binary and that's it. No HTTP, no streaming, no request/response shape another program could talk to.
No chat formatting. Qwen3 was trained to respond to <|im_start|>... chat framing; we've been feeding it raw strings. Continuation prompts look fine; anything resembling a chat turn degrades.
No concurrency. BasicKvCache is a per-request structure. Two concurrent requests would each allocate their own, and the layout (a tensor that gets reallocated on every appended token) makes batching their decodes together impossible.
No reuse across requests. Every request re-prefills its prompt from scratch. In real workloads a large share of a prompt's tokens are identical across requests (a shared system prompt, a shared conversation prefix), and we pay the full prefill cost every single time.
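
The cost in the last item can be made concrete by measuring how many leading tokens two prompts share. This is a minimal sketch with a hypothetical function name and u32 token ids assumed for illustration; it shows only the quantity a prefix cache could exploit, not how Act 3 implements one.

```rust
/// Number of leading token ids two prompts have in common. Every one of
/// these positions is prefill work the second request repeats today.
/// Hypothetical helper for illustration only.
fn shared_prefix_len(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}
```

Two requests that begin with the same system prompt would report that prompt's full length here, and the engine as built recomputes all of it for each request.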

None of this is a "keep stacking optimizations" fix. Serving many requests is a different problem entirely. The Act 2 KV cache is excellent for one request and actively in the way of many, and the same is true of how we dispatch GPU work and structure the forward pass.

The bridge to Act 3

Act 3 turns this fast single-user engine into a real server: chat templates, an HTTP API, token-by-token SSE streaming, a paged KV layout that appends in O(1) instead of recopying the cache every token, a prefix cache that reuses prefill work across requests, and a scheduler that interleaves many requests at iteration granularity. Several Act 2 data structures get rewritten. The single-request path stays fast (we don't regress it), but the representations underneath it change to support more than one tenant.
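
The difference between recopying and paged appends can be sketched as follows. Both types are hypothetical stand-ins: RecopyCache imitates the reallocate-per-token behavior described for BasicKvCache, and PagedCache is one plausible shape for a paged layout, not the design Act 3 commits to.

```rust
/// Contiguous cache that reallocates and copies everything on each append.
/// Hypothetical stand-in; `copied` counts floats moved by reallocation.
struct RecopyCache {
    data: Vec<f32>,
    copied: usize,
}

impl RecopyCache {
    fn new() -> Self {
        RecopyCache { data: Vec::new(), copied: 0 }
    }

    fn append(&mut self, row: &[f32]) {
        let mut next = Vec::with_capacity(self.data.len() + row.len());
        next.extend_from_slice(&self.data); // the whole cache moves every token
        self.copied += self.data.len();
        next.extend_from_slice(row);
        self.data = next;
    }
}

/// Paged cache: fixed-size pages, so an append touches only the last page.
struct PagedCache {
    page_rows: usize,
    row_len: usize,
    pages: Vec<Vec<f32>>,
    rows: usize,
}

impl PagedCache {
    fn new(page_rows: usize, row_len: usize) -> Self {
        PagedCache { page_rows, row_len, pages: Vec::new(), rows: 0 }
    }

    fn append(&mut self, row: &[f32]) {
        assert_eq!(row.len(), self.row_len);
        if self.rows % self.page_rows == 0 {
            // Last page is full (or none exists yet): start a new one.
            self.pages.push(Vec::with_capacity(self.page_rows * self.row_len));
        }
        self.pages.last_mut().unwrap().extend_from_slice(row);
        self.rows += 1;
    }

    /// Look up row `i` through the page table.
    fn row(&self, i: usize) -> &[f32] {
        let page = &self.pages[i / self.page_rows];
        let off = (i % self.page_rows) * self.row_len;
        &page[off..off + self.row_len]
    }
}
```

Appending T rows to RecopyCache moves on the order of T² floats in total, while PagedCache writes each row once. Fixed-size pages are also what would let several requests draw from one shared pool.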

Continue to Act 3: Make it serve.