III.4: Paged KV cache

The server from III.2 and III.3 works, and it will even run concurrent requests, each on its own blocking thread. But one Act 2 component was never designed for serving: the KV cache.

Recall what the KV cache does, from II.2. Attention at decode step n needs the key and value vectors of all n previous tokens. Without a cache, you'd recompute them every step: quadratic work. The cache stores them: each layer keeps a key tensor and a value tensor, you append one row per token, and decode becomes linear.

The Act 2 cache (BasicKvCache) stored each layer's keys and values as one contiguous tensor and grew it by reallocation. Appending a token row meant concat_dim0: allocate a new buffer one row larger, copy the entire old history into it, copy the new row on the end. For a single short CLI generation that's tolerable. For a server it is not. This chapter explains why, and replaces it with a paged KV cache, and makes the cache implementation selectable at runtime with --kv basic|paged.

Why contiguous storage fails a server

Two problems, both fatal at serving scale.

The copy cost compounds. Appending token n copies n rows. Over a full generation of N tokens that's 1 + 2 + ... + N row-copies, quadratic again, this time in memory bandwidth instead of compute. A 0.6B model has 28 layers; every decode step reallocates and copies 28 K-tensors and 28 V-tensors. The Act 2 KV cache eliminated quadratic compute but quietly reintroduced quadratic copying.

Contiguous blocks fragment. This is the serving-specific killer. Picture a server with several requests in flight. Each request's cache is a contiguous block sized to its sequence. Request A holds 200 tokens, B holds 1000, C holds 150. A finishes and frees its block. Now request D arrives needing 800 tokens, but the freed gap is only A's 200-token hole. The 800 tokens won't fit in the gap even though there might be 800 tokens of free memory scattered around. This is classic external fragmentation, and contiguous per-request allocation guarantees it.

Two memory strips: in the contiguous layout, freed space is split into gaps too small for a new request; in the paged layout, the same request fills scattered free blocks.

Figure: the failure and the fix. Fixed-size blocks make every free slot interchangeable, so "enough total memory" always means "enough memory."

The fix is the same one operating systems use for RAM: stop allocating contiguous variable-size blocks. Allocate fixed-size pages instead, and let a sequence's storage be a list of pages that need not be next to each other. In a full serving engine the pages come from one pool shared by every request: a sequence that needs 800 tokens grabs 800/page-size pages wherever they happen to be, and freed pages go back for anyone to reuse. No fragmentation, because every page is interchangeable.

This is what vLLM named paged attention. We build the same machinery (blocks, block tables, an allocator) at a smaller scope: in this chapter each request constructs its own cache with its own pool, so the cross-request sharing doesn't happen yet (a production engine hoists the pool above the requests). What a single request feels immediately is the other win: O(1) append, with no per-step reallocation or copying.

The shape of a paged cache

The design has a few moving parts. Per layer:

PLAINTEXT

free pool:  [ blk0 ][ blk1 ][ blk2 ][ blk3 ] ... [ blk255 ]   ← 256 fixed blocks
                                                                  each holds 16 tokens
 
a sequence's block table:  [ blk7 ][ blk2 ][ blk19 ]   ← a list of block ids
                              ↑       ↑        ↑
              tokens 0–15   tokens 16–31   tokens 32–...
 
block metadata (per block): ref_count, num_tokens, block_size, finalized

A block holds a fixed number of token slots (16). The actual K and V numbers live in the block's data arrays.
A block table is one sequence's ordered list of block ids. Token t lives in block table[t / 16], slot t % 16.
Block metadata tracks, per block: how many sequences reference it (ref_count), how many of its slots are filled (num_tokens), and whether it's finalized (frozen). ref_count and finalized are scaffolding for a design that shares blocks across sequences. This series doesn't build one, so ref_count stays 0 or 1 and no block is ever finalized.
The allocator owns the pool of free block ids and hands them out.

Appending a token: look at the last block in the table; if it's full, grab a fresh block from the pool and append its id; write the K/V row into the next free slot. That's O(1): no copying of history, ever. Reading the whole sequence back ("materialize") walks the block table and concatenates each block's filled portion.

The III.4 commit puts this in src/cache/paged/ with four files. We'll go bottom-up.

Block metadata and the allocator

src/cache/paged/block.rs holds the block table, the metadata, and the allocator. First the block table, just an ordered list of block ids:

RUST

use super::block_allocator_trait::BlockAllocator;
 
#[derive(Debug, Clone, Default)]
pub struct BlockTable {
    entries: Vec<usize>,
}
 
impl BlockTable {
    pub fn new() -> Self {
        Self::default()
    }
 
    pub fn last(&self) -> Option<usize> {
        self.entries.last().copied()
    }
 
    pub(super) fn push(&mut self, block_id: usize) {
        self.entries.push(block_id);
    }
 
    pub fn iter(&self) -> std::slice::Iter<'_, usize> {
        self.entries.iter()
    }
 
    pub fn drain(&mut self) -> Vec<usize> {
        std::mem::take(&mut self.entries)
    }
}

last is the block we're currently filling. iter walks all blocks for materialize. drain empties the table and returns its ids, used when freeing a sequence.

Per-block metadata:

RUST

#[derive(Debug, Clone)]
pub struct BlockMeta {
    pub ref_count: usize,
    pub num_tokens: usize,
    pub block_size: usize,
    pub finalized: bool,
}
 
impl BlockMeta {
    pub fn new(block_size: usize) -> Self {
        Self {
            ref_count: 0,
            num_tokens: 0,
            block_size,
            finalized: false,
        }
    }
 
    pub fn is_full(&self) -> bool {
        self.num_tokens >= self.block_size
    }
 
    pub fn is_free(&self) -> bool {
        self.ref_count == 0
    }
 
    pub fn reset(&mut self) {
        self.ref_count = 0;
        self.num_tokens = 0;
        self.finalized = false;
    }
}

is_full says whether the block can take another token. is_free says whether anyone still references it. reset returns a block to a clean state when it's recycled. ref_count would exceed 1 only in a design where several sequences hold the same block; nothing in this series does that, so it stays 0 or 1; the bookkeeping is block-sharing scaffolding, here from the start.

Allocation can fail; an error enum spells out how:

RUST

#[derive(Debug, Clone, PartialEq, Eq)]
pub(super) enum AllocError {
    OutOfBlocks,
    InvalidBlock(usize),
    AlreadyFinalized(usize),
}
 
impl std::fmt::Display for AllocError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::OutOfBlocks => write!(f, "block allocator: out of physical blocks"),
            Self::InvalidBlock(id) => write!(f, "block allocator: invalid block id {id}"),
            Self::AlreadyFinalized(id) => {
                write!(f, "block allocator: cannot mutate finalized block {id}")
            }
        }
    }
}
 
impl std::error::Error for AllocError {}

OutOfBlocks is the pool being exhausted, a real possibility with a fixed pool and concurrent requests. InvalidBlock and AlreadyFinalized are programmer errors.

The allocator itself is a free-list: a vector of all block metadata, plus a stack of free ids:

RUST

#[derive(Debug)]
pub struct NaiveAllocator {
    blocks: Vec<BlockMeta>,
    free_ids: Vec<usize>,
}
 
impl NaiveAllocator {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        Self {
            blocks: (0..num_blocks)
                .map(|_| BlockMeta::new(block_size))
                .collect(),
            free_ids: (0..num_blocks).rev().collect(),
        }
    }
}

free_ids starts holding every id. It's built .rev() so that popping the stack hands out ids 0, 1, 2, ... in order, purely cosmetic, but it makes debugging output readable.

The allocator implements a BlockAllocator trait so the data layer can be written against the interface, not the concrete NaiveAllocator:

RUST

impl BlockAllocator for NaiveAllocator {
    fn meta(&self) -> &[BlockMeta] {
        &self.blocks
    }
 
    fn allocate(&mut self, table: &mut BlockTable) -> Result<usize, AllocError> {
        let id = self.free_ids.pop().ok_or(AllocError::OutOfBlocks)?;
        self.blocks[id].reset();
        self.blocks[id].ref_count = 1;
        table.push(id);
        Ok(id)
    }
 
    fn free(&mut self, table: &mut BlockTable) -> Vec<usize> {
        let ids = table.drain();
        let mut freed = Vec::new();
        for id in ids {
            if self.blocks[id].ref_count == 0 {
                continue;
            }
            self.blocks[id].ref_count -= 1;
            if self.blocks[id].is_free() {
                self.blocks[id].reset();
                self.free_ids.push(id);
                freed.push(id);
            }
        }
        freed
    }
 
    fn increment_fill(&mut self, id: usize) -> Result<(usize, usize), AllocError> {
        let m = self
            .blocks
            .get_mut(id)
            .ok_or(AllocError::InvalidBlock(id))?;
        if m.finalized {
            return Err(AllocError::AlreadyFinalized(id));
        }
        let slot = m.num_tokens;
        m.num_tokens += 1;
        Ok((id, slot))
    }
}

allocate pops a free id, resets it, marks it referenced, and appends it to a sequence's table. free drains a sequence's table and, for each block, drops the ref count; a block that hits zero refs goes back on the free list. increment_fill claims the next slot in a block: it returns (block_id, slot_index) and bumps num_tokens. The ref_count decrement in free is why blocks can be shared later: only the last referencing sequence actually returns the block to the pool.

The allocator trait

src/cache/paged/block_allocator_trait.rs defines the interface, and folds in the one piece of logic every allocator shares: "append a token, growing the table if the last block is full":

RUST

use super::block::{AllocError, BlockMeta, BlockTable};
 
pub trait BlockAllocator: std::fmt::Debug + Send {
    fn meta(&self) -> &[BlockMeta];
 
    fn allocate(&mut self, table: &mut BlockTable) -> Result<usize, AllocError>;
 
    fn free(&mut self, table: &mut BlockTable) -> Vec<usize>;
 
    fn append_token_slot(&mut self, table: &mut BlockTable) -> Result<(usize, usize), AllocError> {
        let need_new = match table.last() {
            None => true,
            Some(id) => {
                let m = self.meta().get(id).ok_or(AllocError::InvalidBlock(id))?;
                m.is_full() || m.finalized
            }
        };
        if need_new {
            self.allocate(table)?;
        }
        let id = table.last().ok_or(AllocError::OutOfBlocks)?;
        self.increment_fill(id)
    }
 
    fn increment_fill(&mut self, block_id: usize) -> Result<(usize, usize), AllocError>;
}

append_token_slot is a default method, implemented once on the trait. It checks the last block: if the table is empty, or the last block is full, or finalized, allocate a fresh block. Then claim a slot in whichever block is last. This is the O(1) append: at most one new block allocation, no copying. Concrete allocators only have to supply meta, allocate, free, and increment_fill; append_token_slot comes for free.

The per-layer data

The metadata layer above doesn't store any actual numbers; BlockMeta knows a block has 5 tokens in it, but not what those tokens' K and V vectors are. src/cache/paged/layer.rs holds the real data:

RUST

use crate::tensor::Tensor;
 
use super::block::BlockTable;
use super::block_allocator_trait::BlockAllocator;
 
pub(crate) const BLOCK_SIZE: usize = 16;
pub(crate) const NUM_BLOCKS: usize = 256;
 
#[derive(Debug, Clone)]
pub(crate) struct KvBlockData {
    k_data: Vec<f32>,
    v_data: Vec<f32>,
}
 
impl KvBlockData {
    pub(crate) fn new(block_size: usize, kv_width: usize) -> Self {
        let cap = block_size * kv_width;
        Self {
            k_data: vec![0.0; cap],
            v_data: vec![0.0; cap],
        }
    }
}

BLOCK_SIZE = 16 is tokens per block; NUM_BLOCKS = 256 is the pool size. One layer's cache is therefore 256 × 16 = 4096 token-slots, and since each request constructs its own PagedKvCache, that is one sequence's budget: room for a 4096-token context. A KvBlockData is one block's storage: a flat f32 buffer for keys and one for values, each sized block_size × kv_width (kv_width is the per-token KV vector length, head_dim times the number of KV heads).

LayerPaged is one transformer layer's worth of paged storage:

RUST

#[derive(Debug)]
pub(crate) struct LayerPaged {
    pub(crate) blocks: Vec<KvBlockData>,
    pub(crate) table: BlockTable,
    pub(crate) kv_width: usize,
}
 
impl LayerPaged {
    pub(crate) fn new(num_blocks: usize, block_size: usize, kv_width: usize) -> Self {
        Self {
            blocks: (0..num_blocks)
                .map(|_| KvBlockData::new(block_size, kv_width))
                .collect(),
            table: BlockTable::new(),
            kv_width,
        }
    }

blocks is the physical storage, all 256 blocks, allocated up front. table is this layer's sequence of block ids. The physical blocks exist always; the table just records which ones currently belong to the sequence.

Writing one token's K and V into a known (block, slot):

RUST

    pub(crate) fn write_slot(&mut self, block_id: usize, slot: usize, k_row: &[f32], v_row: &[f32]) {
        debug_assert_eq!(k_row.len(), self.kv_width);
        debug_assert_eq!(v_row.len(), self.kv_width);
        let off = slot * self.kv_width;
        self.blocks[block_id].k_data[off..off + self.kv_width].copy_from_slice(k_row);
        self.blocks[block_id].v_data[off..off + self.kv_width].copy_from_slice(v_row);
    }

Slot s in a block lives at byte-offset s * kv_width into that block's buffer. copy_from_slice writes exactly kv_width floats, one token's K row and V row. This is the only copy in the whole append path, and it copies one row, not the history.

materialize is the reverse: gather every block's filled portion into one contiguous K tensor and one V tensor, the dense form attention needs:

RUST

    pub(crate) fn materialize(&self, alloc: &dyn BlockAllocator) -> (Tensor, Tensor) {
        let mut total_tokens = 0usize;
        for &bid in self.table.iter() {
            total_tokens += alloc.meta()[bid].num_tokens;
        }
        let mut k_out = Vec::with_capacity(total_tokens * self.kv_width);
        let mut v_out = Vec::with_capacity(total_tokens * self.kv_width);
        for &bid in self.table.iter() {
            let n = alloc.meta()[bid].num_tokens;
            let end = n * self.kv_width;
            k_out.extend_from_slice(&self.blocks[bid].k_data[..end]);
            v_out.extend_from_slice(&self.blocks[bid].v_data[..end]);
        }
        (
            Tensor::new(k_out, vec![total_tokens, self.kv_width]),
            Tensor::new(v_out, vec![total_tokens, self.kv_width]),
        )
    }
 
    pub(crate) fn seq_len(&self, alloc: &dyn BlockAllocator) -> usize {
        self.table
            .iter()
            .map(|&bid| alloc.meta()[bid].num_tokens)
            .sum()
    }
}

It first sums num_tokens across the table to size the output, then walks the table again copying each block's filled prefix (..end, not the whole 16-slot block, since the last block is usually partial). The result is a [total_tokens, kv_width] tensor identical to what BasicKvCache would have held. seq_len just sums the token counts.

Materialize does copy the whole history, but only when attention needs it, not on every append. The append path stays O(1); the copy happens once per decode step at read time, which is unavoidable since attention wants a dense matrix.

The paged cache

src/cache/paged/paged_cache.rs assembles the pieces into a type implementing the KvCache trait. Recall that trait from II.2: set_prefill, push_row, materialize, seq_len, plus a snapshot method (we'll come to that). The struct holds one allocator and one LayerPaged per transformer layer:

RUST

pub(crate) struct PagedKvCache {
    allocs: Vec<NaiveAllocator>,
    layers: Vec<LayerPaged>,
    ops: Arc<dyn Backend>,
}
 
impl Debug for PagedKvCache {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("PagedKvCache")
            .field("allocs", &self.allocs.len())
            .field("layers", &self.layers.len())
            .field("ops", &"<Arc<dyn Backend>>")
            .finish()
    }
}
 
impl PagedKvCache {
    pub fn new(num_layers: usize, kv_width: usize, ops: Arc<dyn Backend>) -> Self {
        Self {
            allocs: (0..num_layers)
                .map(|_| NaiveAllocator::new(NUM_BLOCKS, BLOCK_SIZE))
                .collect(),
            layers: (0..num_layers)
                .map(|_| LayerPaged::new(NUM_BLOCKS, BLOCK_SIZE, kv_width))
                .collect(),
            ops,
        }
    }
}

One allocator and one LayerPaged per layer, each with its own 256-block pool. ops is the backend, needed because writing into blocks means pulling rows out of tensors, which goes through backend ops.

The KvCache trait requires a snapshot: a frozen copy of the cache that can be stored and later re-materialized. The prefix cache in III.5 is the consumer; we build the plumbing here. A snapshot is just the materialized dense K/V tensors per layer:

RUST

#[derive(Clone)]
struct PagedKvSnapshot {
    kv_width: usize,
    k_by_layer: Vec<Tensor>,
    v_by_layer: Vec<Tensor>,
    ops: Arc<dyn Backend>,
}
 
impl std::fmt::Debug for PagedKvSnapshot {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("PagedKvSnapshot")
            .field("num_layers", &self.k_by_layer.len())
            .field("seq_len", &self.seq_len())
            .finish()
    }
}
 
impl KvSnapshot for PagedKvSnapshot {
    fn seq_len(&self) -> usize {
        self.k_by_layer.first().map(|t| t.shape()[0]).unwrap_or(0)
    }
 
    fn materialize(&self) -> Box<dyn KvCache> {
        let mut out = PagedKvCache::new(self.k_by_layer.len(), self.kv_width, self.ops.clone());
        for li in 0..self.k_by_layer.len() {
            out.set_prefill(li, &self.k_by_layer[li], &self.v_by_layer[li]);
        }
        Box::new(out)
    }
}

materialize on a snapshot builds a fresh PagedKvCache and replays the stored tensors into it via set_prefill, turning a saved snapshot back into a usable cache.

Now the KvCache impl. snapshot materializes every layer and bundles the tensors:

RUST

impl KvCache for PagedKvCache {
    fn snapshot(&self) -> Box<dyn KvSnapshot> {
        let mut k_by_layer = Vec::with_capacity(self.layers.len());
        let mut v_by_layer = Vec::with_capacity(self.layers.len());
        for li in 0..self.layers.len() {
            let (k, v) = self.materialize(li);
            k_by_layer.push(k);
            v_by_layer.push(v);
        }
        let kv_width = self.layers.first().map(|l| l.kv_width).unwrap_or(0);
        Box::new(PagedKvSnapshot {
            kv_width,
            k_by_layer,
            v_by_layer,
            ops: self.ops.clone(),
        })
    }
 
    fn kind(&self) -> &'static str {
        "paged"
    }

set_prefill loads an entire prompt's K/V for one layer, called once after the prefill forward pass. It frees whatever the layer held, then appends every token slot by slot:

RUST

    fn set_prefill(&mut self, layer: usize, k: &Tensor, v: &Tensor) {
        let ops = self.ops.as_ref();
        debug_assert_eq!(k.shape(), v.shape());
        debug_assert_eq!(k.shape().len(), 2);
        let seq_len = k.shape()[0];
        let kv_width = k.shape()[1];
        let alloc = &mut self.allocs[layer];
        let lp = &mut self.layers[layer];
 
        let _ = alloc.free(&mut lp.table);
        for t in 0..seq_len {
            let (block_id, slot) = alloc.append_token_slot(&mut lp.table).unwrap_or_else(|e| {
                panic!(
                    "paged KV cache set_prefill: {e} (e.g. out of blocks for this prefill length)"
                )
            });
            let row_off = t * kv_width;
            let mut kr = vec![0.0f32; kv_width];
            let mut vr = vec![0.0f32; kv_width];
            ops.copy_contiguous_into(k, row_off, &mut kr);
            ops.copy_contiguous_into(v, row_off, &mut vr);
            lp.write_slot(block_id, slot, &kr, &vr);
        }
    }

For each of the prompt's tokens: append_token_slot finds (and if needed allocates) a (block, slot), copy_contiguous_into pulls that token's row out of the input tensor, and write_slot writes it into the block. A 100-token prompt fills 6 full blocks plus a partial 7th.

push_row is the decode-step append, one token:

RUST

    fn push_row(&mut self, layer: usize, k_row: &Tensor, v_row: &Tensor) {
        let ops = self.ops.as_ref();
        let w = self.layers[layer].kv_width;
        let mut kr = vec![0.0f32; w];
        let mut vr = vec![0.0f32; w];
        ops.copy_contiguous_into(k_row, 0, &mut kr);
        ops.copy_contiguous_into(v_row, 0, &mut vr);
        let alloc = &mut self.allocs[layer];
        let lp = &mut self.layers[layer];
        let (block_id, slot) = alloc
            .append_token_slot(&mut lp.table)
            .unwrap_or_else(|e| panic!("paged KV cache push_row: {e} (e.g. out of blocks)"));
        lp.write_slot(block_id, slot, &kr, &vr);
    }

This is the line that replaces BasicKvCache's concat_dim0. Where the basic cache reallocated and copied the whole history, push_row claims one slot and writes one row. O(1), every step.

The two read methods just delegate to LayerPaged:

RUST

    fn materialize(&self, layer: usize) -> (Tensor, Tensor) {
        self.layers[layer].materialize(&self.allocs[layer])
    }
 
    fn seq_len(&self, layer: usize) -> usize {
        self.layers[layer].seq_len(&self.allocs[layer])
    }
 
    fn num_layers(&self) -> usize {
        self.layers.len()
    }
}

Selecting the cache at runtime

With two cache implementations, the codebase needs a way to choose. src/cache/kind.rs is a small enum plus a parser:

RUST

use std::fmt;
use std::sync::Arc;
 
use crate::backend::Backend;
 
use super::basic::BasicKvCache;
use super::kv_cache_trait::KvCache;
use super::paged::PagedKvCache;
 
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub(crate) enum KvCacheKind {
    Basic,
    Paged,
}
 
impl fmt::Display for KvCacheKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            KvCacheKind::Basic => write!(f, "basic"),
            KvCacheKind::Paged => write!(f, "paged"),
        }
    }
}
 
pub(crate) fn parse_kv_mode(mode: &str) -> Result<KvCacheKind, String> {
    match mode.trim() {
        "basic" => Ok(KvCacheKind::Basic),
        "paged" => Ok(KvCacheKind::Paged),
        other => Err(format!(
            "unknown kv cache mode {other:?} (supported: basic, paged)"
        )),
    }
}
 
impl KvCacheKind {
    pub(crate) fn boxed(
        self,
        num_layers: usize,
        kv_width: usize,
        ops: Arc<dyn Backend>,
    ) -> Box<dyn KvCache> {
        match self {
            KvCacheKind::Basic => Box::new(BasicKvCache::new(num_layers, kv_width, ops)),
            KvCacheKind::Paged => Box::new(PagedKvCache::new(num_layers, kv_width, ops)),
        }
    }
}

boxed is the factory: given a kind, produce the right boxed KvCache. The cache factory from Act 2 now routes through it. Its import swaps from BasicKvCache (which kind.rs now names instead) to the parser:

RUST

use super::kind::parse_kv_mode;

RUST

pub fn create_kv_cache(
    mode: &str,
    model: Arc<dyn Model>,
    backend: Arc<dyn Backend>,
) -> Result<Box<dyn KvCache>, String> {
    let kind = parse_kv_mode(mode)?;
    let (layers, width) = model.kv_cache_layers_and_width();
    Ok(kind.boxed(layers, width, backend))
}

The cache module wires in the two new files:

RUST

mod basic;
mod factory;
mod kind;
mod kv_cache_trait;
mod paged;

And the --kv flag, which since Act 2 already accepted basic, learns paged:

RUST

        Some("paged") => {
            cursor.advance();
            "paged"
        }
        Some(s) if !s.starts_with('-') => {
            panic!("invalid --kv mode '{s}' (expected basic or paged)");
        }

The two binaries' usage strings update to advertise it:

RUST

        "usage: chat-repl [--kv [basic|paged]] [--backend scalar|simd|parallel|metal] <gguf_path> <max_new_tokens> [<user message…>]"

RUST

        "usage: model-generate [--kv [basic|paged]] [--backend scalar|simd|parallel|metal] <gguf_path> [prompt] [max_new_tokens]"

Running it

Run chat-repl with the paged cache:

BASH

cargo run --release --bin chat-repl -- --kv paged path/to/qwen3-0.6b-q8_0.gguf 512 "Write a haiku about Rust."

PLAINTEXT

backend: simd
kv cache: paged
 
assistant> === rendered prompt (exact string passed to tokenizer) ===
<|im_start|>user
Write a haiku about Rust.<|im_end|>
<|im_start|>assistant
 
 
=== prompt token ids (count = 15) ===
[151644, 872, 198, 7985, 264, 6386, 38242, 911, 33789, 13, 151645, 198, 151644, 77091, 198]
 
<think>
Okay, the user wants a haiku about Rust. Let me start by recalling what a haiku is. It's a traditional Japanese poem with three lines, syllable structure of 5-7-5. So I need to make sure each line has the right number of syllables.
 
[... ~220 more thinking tokens ...]
</think>
 
Rust's syntax sharpens the mind
Concurrency thrives in Rust's code
A language that builds with care.
chat: user_message_token_count=7 (raw user text, no template)
chat: templated_prompt_token_count=15 (after chat template, input to model prefill)
metrics:
  time_to_first_token_ms: 358.661
  decode_tokens_per_second: 29.680
  per_forward_ms: min 23.993  max 358.661  mean 34.773  (n=301)
    forward 1: 358.661 ms
    forward 2: 24.181 ms
    ...
    forward 301: 44.383 ms

The output is identical to --kv basic, same model, same math, because the paged cache stores the same numbers; it only changes how they're laid out in memory. On a single short generation the speed difference is small (the basic cache's quadratic copying only bites at long sequences). The payoff isn't visible from one CLI run: every append is now O(1) no matter how long the sequence gets (no reallocation, no copying of history), and the block, table, and allocator machinery is what a shared cross-request pool is made of, ready for an engine that hoists it above the requests.

Where this leaves us

The KV cache is now shaped like a server's. Fixed-size blocks come from a pool, a sequence is a list of block ids, appends are O(1) with no copying of history, and freed blocks return to the pool for reuse. The pool here lives inside each request (a production engine hoists one shared pool above all requests, which is where the anti-fragmentation story fully pays off), but the design is vLLM's, at per-request scope. The snapshot/materialize plumbing is in place too, unused for now.

That plumbing is the hook for the next optimization. Most chat requests share a prefix (a system prompt, the earlier turns of a conversation), and the engine still re-prefills those shared tokens on every request. The next chapter builds a radix tree of KV snapshots so a shared prefix is computed once and reused.