III.4: Paged KV cache

The server from III.2 and III.3 works, but it serves one request at a time. Before it can serve concurrent requests, one Act 2 component has to be rebuilt: the KV cache.

Recall what the KV cache does, from II.2. Attention at decode step n needs the key and value vectors of all n previous tokens. Without a cache, you'd recompute them every step: quadratic work. The cache stores them: each layer keeps a key tensor and a value tensor, you append one row per token, and decode becomes linear.

The Act 2 cache (BasicKvCache) stored each layer's keys and values as one contiguous tensor and grew it by reallocation. Appending a token row meant concat_dim0: allocate a new buffer one row larger, copy the entire old history into it, copy the new row on the end. For a single short CLI generation that's tolerable. For a server it is not. This chapter explains why, and replaces it with a paged KV cache, and makes the cache implementation selectable at runtime with --kv basic|paged.

Why contiguous storage fails a server

Two problems, both fatal at serving scale.

The copy cost compounds. Appending token n copies n rows. Over a full generation of N tokens that's 1 + 2 + ... + N row-copies, quadratic again, this time in memory bandwidth instead of compute. A 0.6B model has 28 layers; every decode step reallocates and copies 28 K-tensors and 28 V-tensors. The Act 2 KV cache eliminated quadratic compute but quietly reintroduced quadratic copying.

Contiguous blocks fragment. This is the serving-specific killer. Picture a server with several requests in flight. Each request's cache is a contiguous block sized to its sequence. Request A holds 200 tokens, B holds 1000, C holds 150. A finishes and frees its block. Now request D arrives needing 800 tokens, but the freed gap is only A's 200-token hole. The 800 tokens won't fit in the gap even though there might be 800 tokens of free memory scattered around. This is classic external fragmentation, and contiguous per-request allocation guarantees it.

The fix is the same one operating systems use for RAM: stop allocating contiguous variable-size blocks. Allocate fixed-size pages instead, and let a sequence's storage be a list of pages that need not be next to each other. A sequence that needs 800 tokens grabs 800/page-size pages from a shared pool, wherever they happen to be. Free pages go back to the pool individually and can be reused by anyone. No fragmentation, because every page is interchangeable.

This is what vLLM named paged attention. We build a simpler version of the same idea.

The shape of a paged cache

The design has a few moving parts. Per layer:

PLAINTEXT
free pool:  [ blk0 ][ blk1 ][ blk2 ][ blk3 ] ... [ blk255 ]   ← 256 fixed blocks
                                                                  each holds 16 tokens
 
a sequence's block table:  [ blk7 ][ blk2 ][ blk19 ]   ← a list of block ids
                              ↑       ↑        ↑
              tokens 0–15   tokens 16–31   tokens 32–...
 
block metadata (per block): ref_count, num_tokens, block_size, finalized
  • A block holds a fixed number of token slots (16). The actual K and V numbers live in the block's data arrays.
  • A block table is one sequence's ordered list of block ids. Token t lives in block table[t / 16], slot t % 16.
  • Block metadata tracks, per block: how many sequences reference it (ref_count), how many of its slots are filled (num_tokens), and whether it's finalized (frozen, relevant once III.5 shares blocks across sequences).
  • The allocator owns the pool of free block ids and hands them out.

Appending a token: look at the last block in the table; if it's full, grab a fresh block from the pool and append its id; write the K/V row into the next free slot. That's O(1): no copying of history, ever. Reading the whole sequence back ("materialize") walks the block table and concatenates each block's filled portion.

The III.4 commit puts this in src/cache/paged/ with four files. We'll go bottom-up.

Block metadata and the allocator

src/cache/paged/block.rs holds the block table, the metadata, and the allocator. First the block table, just an ordered list of block ids:

src/cache/paged/block.rsRUST
use super::block_allocator_trait::BlockAllocator;
 
#[derive(Debug, Clone, Default)]
pub struct BlockTable {
    entries: Vec<usize>,
}
 
impl BlockTable {
    pub fn new() -> Self {
        Self::default()
    }
 
    pub fn last(&self) -> Option<usize> {
        self.entries.last().copied()
    }
 
    pub(super) fn push(&mut self, block_id: usize) {
        self.entries.push(block_id);
    }
 
    pub fn iter(&self) -> std::slice::Iter<'_, usize> {
        self.entries.iter()
    }
 
    pub fn drain(&mut self) -> Vec<usize> {
        std::mem::take(&mut self.entries)
    }
}

last is the block we're currently filling. iter walks all blocks for materialize. drain empties the table and returns its ids, used when freeing a sequence.

Per-block metadata:

src/cache/paged/block.rsRUST
#[derive(Debug, Clone)]
pub struct BlockMeta {
    pub ref_count: usize,
    pub num_tokens: usize,
    pub block_size: usize,
    pub finalized: bool,
}
 
impl BlockMeta {
    pub fn new(block_size: usize) -> Self {
        Self {
            ref_count: 0,
            num_tokens: 0,
            block_size,
            finalized: false,
        }
    }
 
    pub fn is_full(&self) -> bool {
        self.num_tokens >= self.block_size
    }
 
    pub fn is_free(&self) -> bool {
        self.ref_count == 0
    }
 
    pub fn reset(&mut self) {
        self.ref_count = 0;
        self.num_tokens = 0;
        self.finalized = false;
    }
}

is_full says whether the block can take another token. is_free says whether anyone still references it. reset returns a block to a clean state when it's recycled. ref_count is more than one sequence can hold a block only once the prefix cache shares blocks; for now it's always 0 or 1, but the bookkeeping is here from the start.

Allocation can fail; an error enum spells out how:

src/cache/paged/block.rsRUST
#[derive(Debug, Clone, PartialEq, Eq)]
pub(super) enum AllocError {
    OutOfBlocks,
    InvalidBlock(usize),
    AlreadyFinalized(usize),
}
 
impl std::fmt::Display for AllocError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::OutOfBlocks => write!(f, "block allocator: out of physical blocks"),
            Self::InvalidBlock(id) => write!(f, "block allocator: invalid block id {id}"),
            Self::AlreadyFinalized(id) => {
                write!(f, "block allocator: cannot mutate finalized block {id}")
            }
        }
    }
}
 
impl std::error::Error for AllocError {}

OutOfBlocks is the pool being exhausted, a real possibility with a fixed pool and concurrent requests. InvalidBlock and AlreadyFinalized are programmer errors.

The allocator itself is a free-list: a vector of all block metadata, plus a stack of free ids:

src/cache/paged/block.rsRUST
#[derive(Debug)]
pub struct NaiveAllocator {
    blocks: Vec<BlockMeta>,
    free_ids: Vec<usize>,
}
 
impl NaiveAllocator {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        Self {
            blocks: (0..num_blocks)
                .map(|_| BlockMeta::new(block_size))
                .collect(),
            free_ids: (0..num_blocks).rev().collect(),
        }
    }
}

free_ids starts holding every id. It's built .rev() so that popping the stack hands out ids 0, 1, 2, ... in order, purely cosmetic, but it makes debugging output readable.

The allocator implements a BlockAllocator trait so the data layer can be written against the interface, not the concrete NaiveAllocator:

src/cache/paged/block.rsRUST
impl BlockAllocator for NaiveAllocator {
    fn meta(&self) -> &[BlockMeta] {
        &self.blocks
    }
 
    fn allocate(&mut self, table: &mut BlockTable) -> Result<usize, AllocError> {
        let id = self.free_ids.pop().ok_or(AllocError::OutOfBlocks)?;
        self.blocks[id].reset();
        self.blocks[id].ref_count = 1;
        table.push(id);
        Ok(id)
    }
 
    fn free(&mut self, table: &mut BlockTable) -> Vec<usize> {
        let ids = table.drain();
        let mut freed = Vec::new();
        for id in ids {
            if self.blocks[id].ref_count == 0 {
                continue;
            }
            self.blocks[id].ref_count -= 1;
            if self.blocks[id].is_free() {
                self.blocks[id].reset();
                self.free_ids.push(id);
                freed.push(id);
            }
        }
        freed
    }
 
    fn increment_fill(&mut self, id: usize) -> Result<(usize, usize), AllocError> {
        let m = self
            .blocks
            .get_mut(id)
            .ok_or(AllocError::InvalidBlock(id))?;
        if m.finalized {
            return Err(AllocError::AlreadyFinalized(id));
        }
        let slot = m.num_tokens;
        m.num_tokens += 1;
        Ok((id, slot))
    }
}

allocate pops a free id, resets it, marks it referenced, and appends it to a sequence's table. free drains a sequence's table and, for each block, drops the ref count; a block that hits zero refs goes back on the free list. increment_fill claims the next slot in a block: it returns (block_id, slot_index) and bumps num_tokens. The ref_count decrement in free is why blocks can be shared later: only the last referencing sequence actually returns the block to the pool.

The allocator trait

src/cache/paged/block_allocator_trait.rs defines the interface, and folds in the one piece of logic every allocator shares: "append a token, growing the table if the last block is full":

src/cache/paged/block_allocator_trait.rsRUST
use super::block::{AllocError, BlockMeta, BlockTable};
 
pub trait BlockAllocator: std::fmt::Debug + Send {
    fn meta(&self) -> &[BlockMeta];
 
    fn allocate(&mut self, table: &mut BlockTable) -> Result<usize, AllocError>;
 
    fn free(&mut self, table: &mut BlockTable) -> Vec<usize>;
 
    fn append_token_slot(&mut self, table: &mut BlockTable) -> Result<(usize, usize), AllocError> {
        let need_new = match table.last() {
            None => true,
            Some(id) => {
                let m = self.meta().get(id).ok_or(AllocError::InvalidBlock(id))?;
                m.is_full() || m.finalized
            }
        };
        if need_new {
            self.allocate(table)?;
        }
        let id = table.last().ok_or(AllocError::OutOfBlocks)?;
        self.increment_fill(id)
    }
 
    fn increment_fill(&mut self, block_id: usize) -> Result<(usize, usize), AllocError>;
}

append_token_slot is a default method, implemented once on the trait. It checks the last block: if the table is empty, or the last block is full, or finalized, allocate a fresh block. Then claim a slot in whichever block is last. This is the O(1) append: at most one new block allocation, no copying. Concrete allocators only have to supply meta, allocate, free, and increment_fill; append_token_slot comes for free.

The per-layer data

The metadata layer above doesn't store any actual numbers; BlockMeta knows a block has 5 tokens in it, but not what those tokens' K and V vectors are. src/cache/paged/layer.rs holds the real data:

src/cache/paged/layer.rsRUST
use crate::tensor::Tensor;
 
use super::block::BlockTable;
use super::block_allocator_trait::BlockAllocator;
 
pub(crate) const BLOCK_SIZE: usize = 16;
pub(crate) const NUM_BLOCKS: usize = 256;
 
#[derive(Debug, Clone)]
pub(crate) struct KvBlockData {
    k_data: Vec<f32>,
    v_data: Vec<f32>,
}
 
impl KvBlockData {
    pub(crate) fn new(block_size: usize, kv_width: usize) -> Self {
        let cap = block_size * kv_width;
        Self {
            k_data: vec![0.0; cap],
            v_data: vec![0.0; cap],
        }
    }
}

BLOCK_SIZE = 16 is tokens per block; NUM_BLOCKS = 256 is the pool size. One layer's cache is therefore 256 × 16 = 4096 token-slots, enough for several concurrent sequences of a few hundred tokens each. A KvBlockData is one block's storage: a flat f32 buffer for keys and one for values, each sized block_size × kv_width (kv_width is the per-token KV vector length, head_dim times the number of KV heads).

LayerPaged is one transformer layer's worth of paged storage:

src/cache/paged/layer.rsRUST
#[derive(Debug)]
pub(crate) struct LayerPaged {
    pub(crate) blocks: Vec<KvBlockData>,
    pub(crate) table: BlockTable,
    pub(crate) kv_width: usize,
}
 
impl LayerPaged {
    pub(crate) fn new(num_blocks: usize, block_size: usize, kv_width: usize) -> Self {
        Self {
            blocks: (0..num_blocks)
                .map(|_| KvBlockData::new(block_size, kv_width))
                .collect(),
            table: BlockTable::new(),
            kv_width,
        }
    }

blocks is the physical storage, all 256 blocks, allocated up front. table is this layer's sequence of block ids. The physical blocks exist always; the table just records which ones currently belong to the sequence.

Writing one token's K and V into a known (block, slot):

src/cache/paged/layer.rsRUST
    pub(crate) fn write_slot(&mut self, block_id: usize, slot: usize, k_row: &[f32], v_row: &[f32]) {
        debug_assert_eq!(k_row.len(), self.kv_width);
        debug_assert_eq!(v_row.len(), self.kv_width);
        let off = slot * self.kv_width;
        self.blocks[block_id].k_data[off..off + self.kv_width].copy_from_slice(k_row);
        self.blocks[block_id].v_data[off..off + self.kv_width].copy_from_slice(v_row);
    }

Slot s in a block lives at byte-offset s * kv_width into that block's buffer. copy_from_slice writes exactly kv_width floats, one token's K row and V row. This is the only copy in the whole append path, and it copies one row, not the history.

materialize is the reverse: gather every block's filled portion into one contiguous K tensor and one V tensor, the dense form attention needs:

src/cache/paged/layer.rsRUST
    pub(crate) fn materialize(&self, alloc: &dyn BlockAllocator) -> (Tensor, Tensor) {
        let mut total_tokens = 0usize;
        for &bid in self.table.iter() {
            total_tokens += alloc.meta()[bid].num_tokens;
        }
        let mut k_out = Vec::with_capacity(total_tokens * self.kv_width);
        let mut v_out = Vec::with_capacity(total_tokens * self.kv_width);
        for &bid in self.table.iter() {
            let n = alloc.meta()[bid].num_tokens;
            let end = n * self.kv_width;
            k_out.extend_from_slice(&self.blocks[bid].k_data[..end]);
            v_out.extend_from_slice(&self.blocks[bid].v_data[..end]);
        }
        (
            Tensor::new(k_out, vec![total_tokens, self.kv_width]),
            Tensor::new(v_out, vec![total_tokens, self.kv_width]),
        )
    }
 
    pub(crate) fn seq_len(&self, alloc: &dyn BlockAllocator) -> usize {
        self.table
            .iter()
            .map(|&bid| alloc.meta()[bid].num_tokens)
            .sum()
    }
}

It first sums num_tokens across the table to size the output, then walks the table again copying each block's filled prefix (..end, not the whole 16-slot block, since the last block is usually partial). The result is a [total_tokens, kv_width] tensor identical to what BasicKvCache would have held. seq_len just sums the token counts.

Materialize does copy the whole history, but only when attention needs it, not on every append. The append path stays O(1); the copy happens once per decode step at read time, which is unavoidable since attention wants a dense matrix.

The paged cache

src/cache/paged/paged_cache.rs assembles the pieces into a type implementing the KvCache trait. Recall that trait from II.2: set_prefill, push_row, materialize, seq_len, plus a snapshot method (we'll come to that). The struct holds one allocator and one LayerPaged per transformer layer:

src/cache/paged/paged_cache.rsRUST
pub(crate) struct PagedKvCache {
    allocs: Vec<NaiveAllocator>,
    layers: Vec<LayerPaged>,
    ops: Arc<dyn Backend>,
}
 
impl Debug for PagedKvCache {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("PagedKvCache")
            .field("allocs", &self.allocs.len())
            .field("layers", &self.layers.len())
            .field("ops", &"<Arc<dyn Backend>>")
            .finish()
    }
}
 
impl PagedKvCache {
    pub fn new(num_layers: usize, kv_width: usize, ops: Arc<dyn Backend>) -> Self {
        Self {
            allocs: (0..num_layers)
                .map(|_| NaiveAllocator::new(NUM_BLOCKS, BLOCK_SIZE))
                .collect(),
            layers: (0..num_layers)
                .map(|_| LayerPaged::new(NUM_BLOCKS, BLOCK_SIZE, kv_width))
                .collect(),
            ops,
        }
    }
}

One allocator and one LayerPaged per layer, each with its own 256-block pool. ops is the backend, needed because writing into blocks means pulling rows out of tensors, which goes through backend ops.

The KvCache trait requires a snapshot: a frozen copy of the cache that can be stored and later re-materialized. The prefix cache in III.5 is the consumer; we build the plumbing here. A snapshot is just the materialized dense K/V tensors per layer:

src/cache/paged/paged_cache.rsRUST
#[derive(Clone)]
struct PagedKvSnapshot {
    kv_width: usize,
    k_by_layer: Vec<Tensor>,
    v_by_layer: Vec<Tensor>,
    ops: Arc<dyn Backend>,
}
 
impl std::fmt::Debug for PagedKvSnapshot {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("PagedKvSnapshot")
            .field("num_layers", &self.k_by_layer.len())
            .field("seq_len", &self.seq_len())
            .finish()
    }
}
 
impl KvSnapshot for PagedKvSnapshot {
    fn seq_len(&self) -> usize {
        self.k_by_layer.first().map(|t| t.shape()[0]).unwrap_or(0)
    }
 
    fn materialize(&self) -> Box<dyn KvCache> {
        let mut out = PagedKvCache::new(self.k_by_layer.len(), self.kv_width, self.ops.clone());
        for li in 0..self.k_by_layer.len() {
            out.set_prefill(li, &self.k_by_layer[li], &self.v_by_layer[li]);
        }
        Box::new(out)
    }
}

materialize on a snapshot builds a fresh PagedKvCache and replays the stored tensors into it via set_prefill, turning a saved snapshot back into a usable cache.

Now the KvCache impl. snapshot materializes every layer and bundles the tensors:

src/cache/paged/paged_cache.rsRUST
impl KvCache for PagedKvCache {
    fn snapshot(&self) -> Box<dyn KvSnapshot> {
        let mut k_by_layer = Vec::with_capacity(self.layers.len());
        let mut v_by_layer = Vec::with_capacity(self.layers.len());
        for li in 0..self.layers.len() {
            let (k, v) = self.materialize(li);
            k_by_layer.push(k);
            v_by_layer.push(v);
        }
        let kv_width = self.layers.first().map(|l| l.kv_width).unwrap_or(0);
        Box::new(PagedKvSnapshot {
            kv_width,
            k_by_layer,
            v_by_layer,
            ops: self.ops.clone(),
        })
    }
 
    fn kind(&self) -> &'static str {
        "paged"
    }

set_prefill loads an entire prompt's K/V for one layer, called once after the prefill forward pass. It frees whatever the layer held, then appends every token slot by slot:

src/cache/paged/paged_cache.rsRUST
    fn set_prefill(&mut self, layer: usize, k: &Tensor, v: &Tensor) {
        let ops = self.ops.as_ref();
        debug_assert_eq!(k.shape(), v.shape());
        debug_assert_eq!(k.shape().len(), 2);
        let seq_len = k.shape()[0];
        let kv_width = k.shape()[1];
        let alloc = &mut self.allocs[layer];
        let lp = &mut self.layers[layer];
 
        let _ = alloc.free(&mut lp.table);
        for t in 0..seq_len {
            let (block_id, slot) = alloc.append_token_slot(&mut lp.table).unwrap_or_else(|e| {
                panic!(
                    "paged KV cache set_prefill: {e} (e.g. out of blocks for this prefill length)"
                )
            });
            let row_off = t * kv_width;
            let mut kr = vec![0.0f32; kv_width];
            let mut vr = vec![0.0f32; kv_width];
            ops.copy_contiguous_into(k, row_off, &mut kr);
            ops.copy_contiguous_into(v, row_off, &mut vr);
            lp.write_slot(block_id, slot, &kr, &vr);
        }
    }

For each of the prompt's tokens: append_token_slot finds (and if needed allocates) a (block, slot), copy_contiguous_into pulls that token's row out of the input tensor, and write_slot writes it into the block. A 100-token prompt fills 6 full blocks plus a partial 7th.

push_row is the decode-step append, one token:

src/cache/paged/paged_cache.rsRUST
    fn push_row(&mut self, layer: usize, k_row: &Tensor, v_row: &Tensor) {
        let ops = self.ops.as_ref();
        let w = self.layers[layer].kv_width;
        let mut kr = vec![0.0f32; w];
        let mut vr = vec![0.0f32; w];
        ops.copy_contiguous_into(k_row, 0, &mut kr);
        ops.copy_contiguous_into(v_row, 0, &mut vr);
        let alloc = &mut self.allocs[layer];
        let lp = &mut self.layers[layer];
        let (block_id, slot) = alloc
            .append_token_slot(&mut lp.table)
            .unwrap_or_else(|e| panic!("paged KV cache push_row: {e} (e.g. out of blocks)"));
        lp.write_slot(block_id, slot, &kr, &vr);
    }

This is the line that replaces BasicKvCache's concat_dim0. Where the basic cache reallocated and copied the whole history, push_row claims one slot and writes one row. O(1), every step. This is the win.

The two read methods just delegate to LayerPaged:

src/cache/paged/paged_cache.rsRUST
    fn materialize(&self, layer: usize) -> (Tensor, Tensor) {
        self.layers[layer].materialize(&self.allocs[layer])
    }
 
    fn seq_len(&self, layer: usize) -> usize {
        self.layers[layer].seq_len(&self.allocs[layer])
    }
 
    fn num_layers(&self) -> usize {
        self.layers.len()
    }
}

Selecting the cache at runtime

With two cache implementations, the codebase needs a way to choose. src/cache/kind.rs is a small enum plus a parser:

src/cache/kind.rsRUST
use std::fmt;
use std::sync::Arc;
 
use crate::backend::Backend;
 
use super::basic::BasicKvCache;
use super::kv_cache_trait::KvCache;
use super::paged::PagedKvCache;
 
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub(crate) enum KvCacheKind {
    Basic,
    Paged,
}
 
impl fmt::Display for KvCacheKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            KvCacheKind::Basic => write!(f, "basic"),
            KvCacheKind::Paged => write!(f, "paged"),
        }
    }
}
 
pub(crate) fn parse_kv_mode(mode: &str) -> Result<KvCacheKind, String> {
    match mode.trim() {
        "basic" => Ok(KvCacheKind::Basic),
        "paged" => Ok(KvCacheKind::Paged),
        other => Err(format!(
            "unknown kv cache mode {other:?} (supported: basic, paged)"
        )),
    }
}
 
impl KvCacheKind {
    pub(crate) fn boxed(
        self,
        num_layers: usize,
        kv_width: usize,
        ops: Arc<dyn Backend>,
    ) -> Box<dyn KvCache> {
        match self {
            KvCacheKind::Basic => Box::new(BasicKvCache::new(num_layers, kv_width, ops)),
            KvCacheKind::Paged => Box::new(PagedKvCache::new(num_layers, kv_width, ops)),
        }
    }
}

boxed is the factory: given a kind, produce the right boxed KvCache. The cache factory from Act 2 now routes through it:

src/cache/factory.rsRUST
pub fn create_kv_cache(
    mode: &str,
    model: Arc<dyn Model>,
    backend: Arc<dyn Backend>,
) -> Result<Box<dyn KvCache>, String> {
    let kind = parse_kv_mode(mode)?;
    let (layers, width) = model.kv_cache_layers_and_width();
    Ok(kind.boxed(layers, width, backend))
}

The cache module wires in the two new files:

src/cache/mod.rsRUST
mod basic;
mod factory;
mod kind;
mod kv_cache_trait;
mod paged;

And the --kv flag, which since Act 2 already accepted basic, learns paged:

src/cli/args.rsRUST
        Some("paged") => {
            cursor.advance();
            "paged"
        }
        Some(s) if !s.starts_with('-') => {
            panic!("invalid --kv mode '{s}' (expected basic or paged)");
        }

The two binaries' usage strings update to advertise it:

src/bin/chat-repl.rsRUST
        "usage: chat-repl [--kv [basic|paged]] [--backend scalar|simd|parallel|metal] <gguf_path> <max_new_tokens> [<user message…>]"
src/bin/model-generate.rsRUST
        "usage: model-generate [--kv [basic|paged]] [--backend scalar|simd|parallel|metal] <gguf_path> [prompt] [max_new_tokens]"

Running it

Run chat-repl with the paged cache:

BASH
cargo run --release --bin chat-repl -- --kv paged path/to/qwen3-0.6b.gguf 128 "Write a haiku about Rust."
PLAINTEXT
backend: simd
kv cache: paged
 
assistant> Strong types guard the gate,
borrow checker holds the line,
fearless, the code runs.
ttft: 39.8 ms
decode: 31.4 tok/s

The output is identical to --kv basic, same model, same math, because the paged cache stores the same numbers; it only changes how they're laid out in memory. On a single short generation the speed difference is small (the basic cache's quadratic copying only bites at long sequences). The real payoff isn't visible from one CLI run: it's that 256 fixed blocks per layer can back several concurrent sequences without fragmenting, and every append is O(1) no matter how long the sequence gets.

Where this leaves us

The KV cache is now built for a server. Fixed-size blocks come from a shared pool, a sequence is a list of block ids, appends are O(1), and freed blocks return to the pool individually so memory never fragments. The snapshot/materialize plumbing is in place too, unused for now.

That plumbing is the hook for the next optimization. Most chat requests share a prefix (a system prompt, the earlier turns of a conversation), and the engine still re-prefills those shared tokens on every request. The next chapter builds a radix tree of KV snapshots so a shared prefix is computed once and reused.