III.5: Radix prefix cache

III.4 gave us a KV cache built for serving (fixed-size blocks, O(1) append, no fragmentation), and it implemented the snapshot method the KvCache trait has carried since II.2: freeze a cache into a re-materializable form. This chapter is what that method was for.

Here is the waste we're attacking. A chat server's requests are not independent strings. They overlap, heavily:

A shared system prompt. Every request to one deployment usually begins with the same several-hundred-token system prompt.
Multi-turn conversations. Turn 3 of a chat contains all of turns 1 and 2 verbatim; recall from III.1 that each turn re-renders and re-prefills the whole message history.

The engine prefills every one of those shared tokens every single time, running the prompt through the model in one forward pass. Prefilling a 500-token shared prefix is real work: it's most of a request's time-to-first-token. Doing it identically for request after request is pure waste.

The fix: cache the result of prefilling a prefix (its KV snapshot) keyed by the token ids of that prefix. When a new request's prompt starts with a prefix we've already prefilled, skip straight past the shared tokens and reuse the saved KV. The data structure that makes "longest cached prefix of these token ids" a fast lookup is a radix tree.

What a radix tree buys us

A prefix cache needs one operation: given a list of token ids, find the longest prefix of that list we've stored, and return its saved KV. A hash map keyed on the full id list can't do this; it only matches exact keys, and we want the longest partial match.
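To make the mismatch concrete, here is a minimal sketch of what an exact-match map would force on us: probing every prefix length, longest first. The function name and the `Snapshot` label type are hypothetical stand-ins, not part of the chapter's code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a saved KV snapshot: just a label.
type Snapshot = &'static str;

/// Longest-prefix lookup forced onto an exact-match map: probe every prefix
/// length from longest to shortest. Each probe hashes the whole candidate
/// slice, so a prompt of n ids can cost O(n^2) hashing work in the worst case.
fn longest_prefix_by_probing(
    map: &HashMap<Vec<usize>, Snapshot>,
    ids: &[usize],
) -> Option<(usize, Snapshot)> {
    (1..=ids.len())
        .rev()
        .find_map(|len| map.get(&ids[..len]).map(|snap| (len, *snap)))
}
```

It works, but every miss on the way down rehashes a slightly shorter copy of the same prompt. A tree that shares the common path does the same job in a single walk.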

A prefix tree is built for exactly this. (Strictly, what we build is a trie, one token per edge; a radix tree is the compressed variant that merges single-child chains into one edge. We skip the compression and keep the name, following SGLang's usage.) It's a tree where each edge is labeled with one token id, so a path from the root spells out a sequence of ids. Store the system prompt [A, B, C] and the tree has a path root → A → B → C. Now a request whose prompt is [A, B, C, D, E] walks the tree from the root: it follows A, B, C (matched) then hits no edge for D and stops. We matched the 3-token prefix [A, B, C] and learned the request only needs to prefill [D, E].

PLAINTEXT

root
 └─ A
     └─ B
         └─ C  ●  ← terminal: KV snapshot for [A,B,C] stored here
             └─ D
                 └─ E  ●  ← terminal: KV snapshot for [A,B,C,D,E]

Nodes marked ● are terminal: a prefix we actually prefilled and saved. A node can be non-terminal (a waypoint on the path to longer entries). Lookup walks ids one at a time, remembering the deepest terminal it passed; that's the longest cached prefix.
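Before the real thing, the walk can be sketched in a few lines. This is a toy under stated assumptions: `Node`, `store`, and `longest_stored_prefix` are illustrative names, and a terminal is a bare flag rather than a stored snapshot.

```rust
use std::collections::HashMap;

/// Toy trie node: one child per token id, plus a flag marking a stored prefix.
/// (A real entry would carry a snapshot here; this sketch stores nothing.)
#[derive(Default)]
struct Node {
    children: HashMap<usize, Node>,
    terminal: bool,
}

/// Mark `ids` as a stored prefix, creating nodes along the path as needed.
fn store(root: &mut Node, ids: &[usize]) {
    let mut node = root;
    for &id in ids {
        node = node.children.entry(id).or_default();
    }
    node.terminal = true;
}

/// Length of the longest stored prefix of `ids`, or 0 if none matches.
fn longest_stored_prefix(root: &Node, ids: &[usize]) -> usize {
    let mut node = root;
    let mut best = 0;
    for (i, id) in ids.iter().enumerate() {
        match node.children.get(id) {
            Some(next) => node = next,
            None => break, // no edge for this id: the walk ends here
        }
        if node.terminal {
            best = i + 1; // deepest terminal seen so far
        }
    }
    best
}
```

With ids 1 through 5 standing in for A through E, storing [1, 2, 3] and [1, 2, 3, 4, 5] and then looking up [1, 2, 3, 4, 9] yields 3: the walk passes the non-terminal node for 4, finds no edge for 9, and falls back to the deepest terminal it recorded.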

Tree with a root, a shared 291-token prefix node, and two terminal nodes for turn 1 and turn 2; a dashed arrow shows a new request reusing the shared prefix.

Figure: this chapter's demo conversation as a tree. Turn 2's prompt extends turn 1's, so its lookup walks straight through the stored prefix and only the new suffix needs prefill.

Two things make it production-shaped rather than a toy:

It's bounded. A server runs forever; the cache can't grow without limit. We cap the number of entries and evict the least-recently-used when full.
It's safe under concurrency. Several requests touch the cache at once, so it's behind a Mutex. And eviction must never yank an entry out from under a request that's still reading it, so each entry carries a pin count, incremented on lookup and decremented (via Drop) once the request has materialized the snapshot into its own cache. Pinned entries are skipped by the evictor.

The III.5 commit puts this in src/cache/prefix/radix.rs.

The nodes

src/cache/prefix/radix.rs starts with the tree's nodes:

RUST

use std::collections::{HashMap, VecDeque};
use std::sync::Arc;
use std::sync::Mutex;
use std::sync::atomic::{AtomicUsize, Ordering};
 
use crate::cache::KvSnapshot;
 
pub type TokenId = usize;
 
struct RadixNode {
    children: HashMap<TokenId, Box<RadixNode>>,
    terminal: Option<TerminalEntry>,
}
 
struct TerminalEntry {
    path: Arc<[TokenId]>,
    snapshot: Arc<dyn KvSnapshot>,
    next_id: usize,
    pin_count: Arc<AtomicUsize>,
}
 
impl RadixNode {
    fn new() -> Self {
        Self {
            children: HashMap::new(),
            terminal: None,
        }
    }
}

A RadixNode has children keyed by token id (the labeled edges) and an optional terminal. The HashMap for children means following an edge is an expected O(1) hash lookup; a node can branch into many children when different prompts diverge after a shared prefix.

TerminalEntry is the payload stored at a ● node:

path: the full id sequence from root to this node, kept so the LRU list and the evictor can identify the entry. Arc<[TokenId]> so it's cheaply shareable.
snapshot: the KvSnapshot from III.4: the frozen KV state after prefilling path.
next_id: the first token the model predicted after the prefix. Stored so a full hit (the new prompt equals a cached prefix exactly) needs zero forward passes; we already know the next token.
pin_count: an AtomicUsize counting live readers. Atomic so a reader can release its pin (the decrement in Drop) without taking the tree's Mutex.

The cache struct

The cache wraps a tree behind a Mutex and adds an LRU list:

RUST

pub struct RadixPrefixCache {
    inner: Mutex<Inner>,
    max_entries: usize,
}
 
struct Inner {
    root: Box<RadixNode>,
    lru: VecDeque<Arc<[TokenId]>>,
    lru_positions: HashMap<Arc<[TokenId]>, ()>,
    entries: usize,
}
 
impl RadixPrefixCache {
    pub fn new(max_entries: usize) -> Arc<RadixPrefixCache> {
        Arc::new(Self {
            inner: Mutex::new(Inner {
                root: Box::new(RadixNode::new()),
                lru: VecDeque::new(),
                lru_positions: HashMap::new(),
                entries: 0,
            }),
            max_entries,
        })
    }
 
    pub fn max_entries(&self) -> usize {
        self.max_entries
    }
 
    pub fn entries(&self) -> usize {
        self.inner.lock().map(|g| g.entries).unwrap_or(0)
    }

Inner is everything the Mutex guards: the tree root, the LRU ordering, and the entry count. lru is a deque of entry paths: the front is least-recently-used, the back most-recent. lru_positions is a set of which paths are in the LRU (a HashMap<_, ()> used as a set) so we can check membership without scanning the deque. new returns the cache already wrapped in an Arc, since every part of the server shares one instance.

Lookup

lookup_longest walks the tree following the prompt's ids, tracking the deepest terminal:

RUST

    pub fn lookup_longest(&self, ids: &[TokenId]) -> Option<CacheHit> {
        if self.max_entries == 0 || ids.is_empty() {
            return None;
        }
 
        let mut guard = self.inner.lock().ok()?;
 
        let mut node: &RadixNode = &guard.root;
        let mut best: Option<(usize, &TerminalEntry)> = None;
        for (i, &tid) in ids.iter().enumerate() {
            let Some(next) = node.children.get(&tid) else {
                break;
            };
            node = next.as_ref();
            if let Some(ref t) = node.terminal {
                best = Some((i + 1, t));
            }
        }
 
        let (prefix_len, path_for_lru, pin_count, snapshot, next_id) = {
            let (prefix_len, term) = best?;
            (
                prefix_len,
                Arc::clone(&term.path),
                Arc::clone(&term.pin_count),
                Arc::clone(&term.snapshot),
                term.next_id,
            )
        };
 
        pin_count.fetch_add(1, Ordering::AcqRel);
        Self::touch_lru(&mut guard, &path_for_lru);
        drop(guard);
 
        Some(CacheHit {
            prefix_len,
            snapshot,
            next_id,
            pin_count,
        })
    }

The walk: for each id, try to follow the matching child edge; if there's no edge, stop. Every time the walk lands on a node with a terminal, record it as best along with i + 1 (the length of the prefix matched so far). When the loop ends, best holds the deepest terminal we passed, the longest cached prefix.

If we found one, three reference-counted handles are cloned out of the entry (path, pin_count, snapshot) plus next_id. Then pin_count.fetch_add(1, ...) pins the entry: a live reader now exists, and the evictor must leave this entry alone. touch_lru marks the entry most-recently-used. We drop the lock and hand back a CacheHit.

Insert

insert adds a freshly prefilled prefix to the tree:

RUST

    pub fn insert(
        &self,
        ids: Vec<TokenId>,
        snapshot: Arc<dyn KvSnapshot>,
        next_id: usize,
    ) -> Result<(), String> {
        if self.max_entries == 0 || ids.is_empty() {
            return Ok(());
        }
 
        let mut guard = self
            .inner
            .lock()
            .map_err(|_| "cache lock poisoned".to_string())?;
 
        let mut node = guard.root.as_mut();
        for &tid in &ids {
            node = node
                .children
                .entry(tid)
                .or_insert_with(|| Box::new(RadixNode::new()));
        }
 
        let is_new = node.terminal.is_none();
        let pin_count = if let Some(ref old) = node.terminal {
            Arc::clone(&old.pin_count)
        } else {
            Arc::new(AtomicUsize::new(0))
        };
 
        let path: Arc<[TokenId]> = match &node.terminal {
            Some(old) => Arc::clone(&old.path),
            None => ids.into(),
        };
 
        node.terminal = Some(TerminalEntry {
            path: Arc::clone(&path),
            snapshot,
            next_id,
            pin_count,
        });
 
        if is_new {
            guard.entries += 1;
            guard.lru.push_back(Arc::clone(&path));
            let _ = guard.lru_positions.insert(path, ());
        } else {
            Self::touch_lru(&mut guard, &path);
        }
 
        self.evict_if_needed(&mut guard);
        Ok(())
    }

It walks the ids from the root, creating any missing child node along the way (entry(...).or_insert_with(...)). When it reaches the final node it installs a TerminalEntry. If that node was already terminal (re-inserting an existing prefix) it reuses the old pin_count and path so any live reader's pin stays valid; otherwise it starts a fresh pin count at 0.

A new entry bumps the count and joins the back of the LRU; a re-insert just touches the LRU. Either way, evict_if_needed runs at the end to enforce the cap.

touch_lru moves an entry to the most-recently-used end:

RUST

    fn touch_lru(inner: &mut Inner, path: &Arc<[TokenId]>) {
        if inner.lru_positions.contains_key(path) {
            inner.lru.retain(|p| !Arc::ptr_eq(p, path));
            inner.lru.push_back(Arc::clone(path));
        }
    }

Remove the path from wherever it sits and push it to the back. Arc::ptr_eq compares by pointer identity, not by contents; the same allocation is being re-positioned.
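The difference between pointer identity and content equality is easy to check in isolation. A small sketch, with `compare_paths` as a hypothetical helper:

```rust
use std::sync::Arc;

/// Returns (same_allocation, same_contents) for two shared paths.
fn compare_paths(a: &Arc<[usize]>, b: &Arc<[usize]>) -> (bool, bool) {
    (Arc::ptr_eq(a, b), a == b)
}
```

A clone of an Arc compares (true, true) with its source; a second Arc built from an equal Vec compares (false, true). That is why insert reuses the old path on a re-insert: touch_lru only finds the LRU slot if it is handed the same allocation.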

Eviction

evict_if_needed drops least-recently-used entries until the cache is back within its cap, skipping any entry that's pinned:

RUST

    fn evict_if_needed(&self, inner: &mut Inner) {
        while inner.entries > self.max_entries {
            let mut evict_idx = None;
            for (i, p) in inner.lru.iter().enumerate() {
                if Self::is_pinned(inner, p) {
                    continue;
                }
                evict_idx = Some(i);
                break;
            }
            let Some(i) = evict_idx else {
                break;
            };
            let path = inner.lru.remove(i).expect("evict_idx valid");
            inner.lru_positions.remove(&path);
            if Self::remove_terminal(&mut inner.root, &path) {
                inner.entries -= 1;
            }
        }
    }

While over the cap, scan the LRU from the front (least-recent) for the first un-pinned entry. If every entry is pinned, give up; better to run slightly over the cap than to evict an entry out from under a request that is still materializing it. Otherwise remove that entry from the LRU and from the tree.
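The same policy, stripped of the tree, fits in a short sketch. `Entry` and `evict_to_cap` are hypothetical names, and the pin count here is a plain integer rather than an atomic:

```rust
use std::collections::VecDeque;

/// Toy LRU entry: a name plus the number of live readers.
struct Entry {
    name: &'static str,
    pins: usize,
}

/// Evict from the front (least recent) until `lru.len() <= cap`, skipping
/// pinned entries. Stops early if everything left is pinned.
/// Returns the evicted names in eviction order.
fn evict_to_cap(lru: &mut VecDeque<Entry>, cap: usize) -> Vec<&'static str> {
    let mut evicted = Vec::new();
    while lru.len() > cap {
        // First unpinned entry, scanning from the least-recently-used end.
        let Some(idx) = lru.iter().position(|e| e.pins == 0) else {
            break; // all pinned: tolerate running over the cap
        };
        let entry = lru.remove(idx).expect("index came from position()");
        evicted.push(entry.name);
    }
    evicted
}
```

Given a pinned "a" at the front followed by unpinned "b" and "c", a cap of 2 evicts "b": the oldest entry survives because a reader still holds it.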

is_pinned walks to a path's terminal and checks its count:

RUST

    fn is_pinned(inner: &Inner, path: &Arc<[TokenId]>) -> bool {
        let mut node: &RadixNode = &inner.root;
        for &tid in path.iter() {
            let Some(next) = node.children.get(&tid) else {
                return false;
            };
            node = next.as_ref();
        }
        node.terminal
            .as_ref()
            .map(|t| t.pin_count.load(Ordering::Acquire) > 0)
            .unwrap_or(false)
    }

remove_terminal clears the terminal marker at a path:

RUST

    fn remove_terminal(root: &mut RadixNode, path: &[TokenId]) -> bool {
        fn walk(node: &mut RadixNode, path: &[TokenId]) -> bool {
            if path.is_empty() {
                return node.terminal.take().is_some();
            }
            let Some(child) = node.children.get_mut(&path[0]) else {
                return false;
            };
            walk(child, &path[1..])
        }
        walk(root, path)
    }
}

walk recurses down one id at a time; at the end it take()s the terminal, dropping the stored snapshot. The intermediate nodes are left in place (they may be on the path to other entries); only the terminal marker is removed.

The hit handle and its pin

CacheHit is what lookup_longest returns, and it manages the pin via Drop:

RUST

pub struct CacheHit {
    pub prefix_len: usize,
    pub snapshot: Arc<dyn KvSnapshot>,
    pub next_id: usize,
    pin_count: Arc<AtomicUsize>,
}
 
impl CacheHit {
    pub fn is_full_hit(&self, prompt_len: usize) -> bool {
        self.prefix_len == prompt_len
    }
}
 
impl Drop for CacheHit {
    fn drop(&mut self) {
        self.pin_count.fetch_sub(1, Ordering::AcqRel);
    }
}
 
impl std::fmt::Debug for CacheHit {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("CacheHit")
            .field("prefix_len", &self.prefix_len)
            .field("next_id", &self.next_id)
            .finish()
    }
}

This is the RAII pin, RAII being the C++/Rust idiom of tying cleanup to a value's destructor. lookup_longest incremented pin_count; the Drop impl decrements it. As long as the CacheHit value is alive (a request that has looked the entry up but not yet finished materializing its snapshot) the entry counts as pinned and the evictor skips it. The moment the CacheHit is dropped, the pin releases automatically. No manual unpin call to forget; because the decrement lives in Drop, every path that drops the CacheHit (normal return, early ?, unwinding panic) releases the pin, so the count balances. The pin covers the lookup-to-materialize window; beyond it, the Arc around the snapshot means even an evicted entry stays alive for anyone still holding it.
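The pattern is small enough to try on its own. A minimal sketch, with `PinGuard` and `live_pins` as hypothetical names standing in for CacheHit and the evictor's check:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical guard: holds one pin for as long as it lives.
struct PinGuard {
    count: Arc<AtomicUsize>,
}

impl PinGuard {
    /// Take a pin by incrementing the shared counter.
    fn acquire(count: &Arc<AtomicUsize>) -> PinGuard {
        count.fetch_add(1, Ordering::AcqRel);
        PinGuard { count: Arc::clone(count) }
    }
}

impl Drop for PinGuard {
    /// Release the pin whenever the guard goes out of scope or is dropped.
    fn drop(&mut self) {
        self.count.fetch_sub(1, Ordering::AcqRel);
    }
}

/// Number of live pins on the counter.
fn live_pins(count: &Arc<AtomicUsize>) -> usize {
    count.load(Ordering::Acquire)
}
```

Acquire two guards and live_pins reads 2; drop one explicitly and it reads 1; let the other fall out of scope and it reads 0, with no unpin call anywhere in the caller.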

is_full_hit reports whether the cached prefix covered the entire prompt, the case where no prefill is needed at all.

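The module re-exports: the prefix module declares radix and re-exports the type, and the parent cache module declares prefix and re-exports it in turn: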

RUST

mod radix;
 
pub use radix::RadixPrefixCache;

RUST

pub(crate) mod prefix;

RUST

pub use prefix::RadixPrefixCache;

and lib.rs's cache re-export grows to match, so binaries can name the type:

RUST

pub use cache::{create_kv_cache, KvCache, RadixPrefixCache};

Using the cache in the prefill path

The prefix cache is wired into prepare_chat_decode_state from III.1, the function that builds the KV cache and runs prefill. It gains a prefix_cache parameter and a three-way branch.

RUST

pub(crate) fn prepare_chat_decode_state(
    model: Arc<dyn Model>,
    tokenizer: &dyn Tokenizer,
    backend: Arc<dyn Backend>,
    messages: &[ChatTemplateMessage],
    kv_mode: &str,
    prefix_cache: Option<&Arc<RadixPrefixCache>>,
) -> Result<ChatDecodeState, String> {
    let (prompt, prompt_ids, eos_token_id) = chat_prompt_details(tokenizer, messages)?;
    log_verbose_prompt(&prompt, &prompt_ids);
 
    let mut metrics = Metrics::default();
    let mut cache: Box<dyn KvCache>;
    let mut next_id: usize;
    let mut full_hit = false;

The first branch: there is a prefix cache, and lookup_longest found a hit.

RUST

    if let Some(pc) = prefix_cache {
        if let Some(hit) = pc.lookup_longest(&prompt_ids) {
            let prefix_len = hit.prefix_len;
            next_id = hit.next_id;
            cache = hit.snapshot.materialize();
            drop(hit);
 
            if prefix_len == prompt_ids.len() {
                full_hit = true;
            } else {
                let suffix = &prompt_ids[prefix_len..];
                let mut last_logits = None;
                for (i, &tid) in suffix.iter().enumerate() {
                    let pos = prefix_len + i;
                    let logits = metrics.record_timed(|| {
                        model
                            .as_ref()
                            .forward_decode_with_kv_cache(tid, pos, cache.as_mut())
                    });
                    last_logits = Some(logits);
                }
                let logits = last_logits.expect("non-empty suffix must produce logits");
                next_id = next_token_id_from_logits(backend.as_ref(), &logits).0;
            }
        } else {

On a hit, materialize rebuilds a usable KV cache from the saved snapshot: the shared prefix's KV state, restored rather than recomputed. Then drop(hit) releases the pin; we've copied the snapshot's contents into our own cache, so the cached entry no longer needs protecting from this request.

Two sub-cases. If the hit covered the whole prompt (prefix_len == prompt_ids.len()), it's a full_hit: no prefill at all, and next_id is the cached next_id. Otherwise the prompt has a suffix the cache didn't have; those tokens still need running, but with forward_decode_with_kv_cache one at a time (cheap decode steps) instead of a full prefill of the whole prompt. The last suffix token's logits give the first reply token.
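The position bookkeeping in the partial-hit case is worth seeing without the model in the way. A sketch, with `suffix_steps` as a hypothetical helper that lists the (position, token) pairs the loop would feed:

```rust
/// Split a prompt at the cached prefix length and pair each remaining token
/// with the absolute position it is fed at. An empty result is a full hit.
fn suffix_steps(prompt_ids: &[usize], prefix_len: usize) -> Vec<(usize, usize)> {
    prompt_ids[prefix_len..]
        .iter()
        .enumerate()
        .map(|(i, &tid)| (prefix_len + i, tid))
        .collect()
}
```

For a five-token prompt with a three-token cached prefix, the steps are positions 3 and 4: the suffix continues numbering where the snapshot left off rather than restarting at 0.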

The other two branches are the original prefill path, used on a cache miss and when there's no prefix cache at all:

RUST

            cache = create_kv_cache(kv_mode, model.clone(), backend.clone())?;
            let logits = metrics.record_timed(|| {
                model
                    .as_ref()
                    .forward_prefill_with_kv_cache(&prompt_ids, cache.as_mut())
            });
            next_id = next_token_id_from_logits(backend.as_ref(), &logits).0;
        }
    } else {
        cache = create_kv_cache(kv_mode, model.clone(), backend.clone())?;
        let logits = metrics.record_timed(|| {
            model
                .as_ref()
                .forward_prefill_with_kv_cache(&prompt_ids, cache.as_mut())
        });
        next_id = next_token_id_from_logits(backend.as_ref(), &logits).0;
    }

A fresh cache, a full prefill: the III.1 behavior.

Then, after prefill by whichever path, insert this prompt's result into the cache so the next request can hit it:

RUST

    if let Some(pc) = prefix_cache {
        if !full_hit {
            let snap = cache.snapshot();
            let _ = pc.insert(prompt_ids.clone(), Arc::from(snap), next_id);
        }
    }

We skip insertion on a full_hit; that exact prefix is already in the tree. Otherwise snapshot the cache and insert it keyed by the full prompt ids. The next request with this prompt as a prefix gets a hit.
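The Arc::from(snap) call relies on the standard conversion from a Box to an Arc, which also works for trait objects. A sketch, assuming snapshot() hands back a boxed trait object (the call suggests it, but the signature isn't shown here); `Snapshot`, `FrozenKv`, `take_snapshot`, and `share` are hypothetical stand-ins:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the chapter's KvSnapshot trait.
trait Snapshot {
    fn token_count(&self) -> usize;
}

struct FrozenKv {
    tokens: usize,
}

impl Snapshot for FrozenKv {
    fn token_count(&self) -> usize {
        self.tokens
    }
}

/// Produce a uniquely owned, boxed snapshot (the shape this sketch assumes).
fn take_snapshot(tokens: usize) -> Box<dyn Snapshot> {
    Box::new(FrozenKv { tokens })
}

/// Convert the uniquely owned box into a shared, reference-counted handle.
fn share(snap: Box<dyn Snapshot>) -> Arc<dyn Snapshot> {
    Arc::from(snap)
}
```

Once shared, the handle can be cloned cheaply: the tree keeps one reference, and each reader that looks the entry up gets another.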

The prefix_cache parameter threads up through run_chat_turn_with_prefix and run_chat_turn_streaming_with_prefix; both gain it and pass it down. For now the HTTP handlers pass None; the next chapter, which gives the server a real scheduler, is where the server wires in a live cache. chat-repl, though, can use it today.

Wiring chat-repl

chat-repl learns a --prefix-cache-max N flag, the entry cap, defaulting to 0 (disabled). It takes a number, not a string, so the parser first gains a tiny helper that reads the next argument as a usize or dies with a clear message:

RUST

fn parse_usize(cursor: &mut ArgCursor<'_>, flag: &str) -> usize {
    let s = cursor.expect_value(flag);
    s.parse()
        .unwrap_or_else(|_| panic!("{flag} must be a non-negative integer, got {s:?}"))
}
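To see which inputs that parse call accepts, here is a hypothetical variant that takes the raw string directly and returns the message instead of panicking, which makes it easy to test. `parse_flag_value` is an illustrative name, not part of the project's parser:

```rust
/// Parse a flag value as usize, returning an error message instead of panicking.
fn parse_flag_value(flag: &str, raw: &str) -> Result<usize, String> {
    raw.parse::<usize>()
        .map_err(|_| format!("{flag} must be a non-negative integer, got {raw:?}"))
}
```

"16" parses to 16, while "-1" and "abc" are both rejected: usize parsing accepts no minus sign, so the "non-negative" in the message is enforced by the type.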

Then the flag itself, in the same four places as ever (field, local, arm, struct literal), followed by an accessor that applies the default:

RUST

    prefix_cache_max: Option<usize>,
// ...
        let mut prefix_cache_max = None;
// ...
                Some("--prefix-cache-max") => {
                    cur.advance();
                    prefix_cache_max = Some(parse_usize(&mut cur, "--prefix-cache-max"));
                }
// ...
            prefix_cache_max,

RUST

    pub fn prefix_cache_max(&self, default: usize) -> usize {
        self.prefix_cache_max.unwrap_or(default)
    }

chat-repl's use inferno::{...} list gains RadixPrefixCache, and its usage string gains [--prefix-cache-max N]:

RUST

    Backend, ChatTemplateMessage, CliArgs, Metrics, Model, RadixPrefixCache, Tokenizer,
// ...
        "usage: chat-repl [--kv [basic|paged]] [--prefix-cache-max N] [--backend scalar|simd|parallel|metal] <gguf_path> <max_new_tokens> [<user message…>]"

main reads the flag, then builds the cache when it's positive (and a KV cache is on, since a prefix cache with no KV mode has nothing to snapshot):

RUST

    let prefix_cache_max = args.prefix_cache_max(0);
// ...
    let prefix_cache = if kv_mode.is_some() && prefix_cache_max > 0 {
        Some(RadixPrefixCache::new(prefix_cache_max))
    } else {
        None
    };
    eprintln!("prefix cache max: {}", prefix_cache_max);

and threads prefix_cache.as_ref() into run_one_shot / run_repl, which pass it on to run_chat_turn_streaming_with_prefix. In the REPL, where the conversation history grows every turn, this is the multi-turn case: turn 2's prompt has turn 1's prompt as a prefix, so turn 2 reuses turn 1's KV instead of re-prefilling it.

Running it

To see the cache pay off, the shared prefix has to be long. A two-word question costs almost nothing to re-prefill; a pasted document costs seconds. So the demo is: turn 1 pastes a ten-sentence document and asks a question about it; turn 2 asks a short follow-up. Turn 2's prompt re-renders the full history, so it begins with turn 1's entire prompt, which is already an entry in the cache.

Run the REPL with a small prefix cache:

BASH

cargo run --release --bin chat-repl -- --kv paged --prefix-cache-max 16 path/to/qwen3-0.6b-q8_0.gguf 512

Turn 1 pastes the document, 291 tokens once templated (the transcript below elides the rendered-prompt echo and the middle of the think block):

PLAINTEXT

backend: simd
kv cache: paged
prefix cache max: 16
 
Enter user messages (empty line to quit). Ctrl-D EOF also exits.
user> I am going to paste a short excerpt from our deployment notes and then ask you questions about it. The inference server runs on a single Apple M2 Pro with 32 GB of unified memory. Models are stored as GGUF files on the local disk and are loaded exactly once at startup, then shared by every request. The default compute backend is the SIMD CPU path, which decodes at roughly 38 tokens per second on the Q8_0 export of Qwen3 0.6B. Prefill cost grows linearly with prompt length, so long prompts dominate time-to-first-token. The paged KV cache hands out fixed-size blocks of 16 tokens from a block pool, so appending a token never copies history. The radix prefix cache stores KV snapshots keyed by token ids and evicts the least-recently-used entry when it is full. Pinned entries are never evicted, because a live request may still be materializing them. Streaming responses are delivered over server-sent events as chat.completion.chunk objects, and the stream always ends with the sentinel [DONE]. The health check is served at /health, and the list of hosted models is served at /v1/models. The server binds to 127.0.0.1:8000 unless a --bind flag overrides it. First question: what hardware does the server run on?
assistant> === rendered prompt (exact string passed to tokenizer) ===
[... the same message, wrapped in <|im_start|>user ... <|im_end|> markup ...]
 
=== prompt token ids (count = 291) ===
[151644, 872, 198, 40, 1079, 2087, 311, 24937, ...] ... (195 ids omitted) ... [..., 3538, 1598, 389, 30, 151645, 198, 151644, 77091, 198]
 
<think>
Okay, let's see. The user is asking what hardware the server runs on. They provided a deployment note excerpt. Let me go through the text again to find the relevant information.
 
[... ~160 more thinking tokens ...]
</think>
 
The server runs on a **single Apple M2 Pro with 32 GB of unified memory**.
metrics:
  time_to_first_token_ms: 7436.796
  decode_tokens_per_second: 20.754
  per_forward_ms: min 40.966  max 7436.796  mean 81.467  (n=222)
    forward 1: 7436.796 ms
    forward 2: 41.544 ms
    ...

Prefilling that 291-token prompt in one forward pass took 7.4 seconds. That's the cost we want to stop paying twice. Now the follow-up:

PLAINTEXT

user> And what does the stream end with?
assistant> === rendered prompt (exact string passed to tokenizer) ===
[... the whole turn-1 exchange re-rendered, then: ...]
<|im_start|>user
And what does the stream end with?<|im_end|>
<|im_start|>assistant
 
 
=== prompt token ids (count = 330) ===
[151644, 872, 198, 40, 1079, 2087, 311, 24937, ...] ... (234 ids omitted) ... [..., 3036, 1128, 1558, 279, 4269, 835, 448, 30, 151645, 198, 151644, 77091, 198]
 
<think>
Okay, the user is asking about what the stream ends with. Let me recall the information provided.
 
[... ~110 more thinking tokens ...]
</think>
 
The stream ends with the sentinel **[DONE]**.
metrics:
  time_to_first_token_ms: 41.873
  decode_tokens_per_second: 20.941
  per_forward_ms: min 41.167  max 60.610  mean 47.722  (n=187)
    forward 1: 41.873 ms
    forward 2: 41.369 ms
    ...

Read turn 2's metrics carefully, because the headline number flatters us. Turn 2's prompt is 330 tokens; the deepest cached prefix is turn 1's 291-token prompt, so the cache hit skips the prefill entirely and only the 39-token suffix (turn 1's answer plus the new question) runs through the model, one cheap decode step at a time. The printed time_to_first_token_ms: 41.873 is just the first of those suffix steps; the time a user waits for the first reply token is all 39 of them, about 1.67 s in this run. The number that proves the win is max 60.610: no forward pass in turn 2 looked anything like a prefill.

The fair comparison is against the same conversation with the cache off. Rerun without --prefix-cache-max and turn 2 prints:

PLAINTEXT

metrics:
  time_to_first_token_ms: 8571.875
  decode_tokens_per_second: 20.308
  per_forward_ms: min 43.844  max 8571.875  mean 101.209  (n=164)

There's the prefill we skipped: 8.57 s to re-prefill the whole 330-token history, versus ~1.67 s of suffix decode steps with the cache on. That's a roughly 5× cut in time-to-first-token, and the gap widens with every turn as the shared prefix grows. The document was prefilled once, in turn 1, and never again.
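The arithmetic behind those figures, as a sketch using only the numbers from this run (`suffix_len` and `speedup` are hypothetical helpers):

```rust
/// Tokens still to run through the model after a prefix hit.
fn suffix_len(prompt_len: usize, prefix_len: usize) -> usize {
    prompt_len - prefix_len
}

/// Ratio of cache-off to cache-on time to the first reply token.
fn speedup(cache_off_ms: f64, cache_on_ms: f64) -> f64 {
    cache_off_ms / cache_on_ms
}
```

suffix_len(330, 291) is 39, and speedup(8571.875, 1670.0) comes out a little over 5.1. Dividing the ~1670 ms by the 39 steps gives just under 43 ms per step, in line with the per-forward times printed in the hit run.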

Bar chart of turn-2 time to first reply token: 8.57 seconds with the cache off versus 1.67 seconds on a cache hit.

Figure: what the cache buys on turn 2. With the cache off, the server re-prefills all 330 tokens of history; on a hit it only decodes the 39-token suffix.

Where this leaves us

Shared prompt prefixes are no longer recomputed. A radix tree keyed on token ids stores KV snapshots; a request whose prompt extends a cached prefix skips straight past the shared tokens. LRU eviction keeps the cache bounded, and RAII pins keep an entry resident for the window between lookup and materialize; after that the request has its own copy, and the Arc-owned snapshot means eviction can never hurt a reader.

But the HTTP handlers still pass None; the server can't use this yet, because there is no component with the lifetime to own a shared cache: each request lives and dies on its own blocking thread. The next chapter builds the decode scheduler: a background worker that owns the model, the prefix cache, and a set of slots, and runs concurrent requests under its control.