III.1: Chat pipeline

By the end of Act 2 the engine could take a string, run it through a fast Qwen3 forward pass, and generate a continuation. That is completion: you give it "Once upon a time" and it keeps writing the story.

A chat model is the same network, fine-tuned to behave differently: after base training, it was trained further on conversations, which are not free text but transcripts wrapped in a strict markup that labels who said what. If you hand a chat model raw text, it free-associates. If you hand it text in exactly the format it was trained on, it answers as an assistant.

This chapter builds the layer that produces that exact format. It is the bridge between "a list of {role, content} messages" (the shape every chat API speaks) and "the precise token sequence the model wants." We'll load the model's chat template, render messages through it, wire the result into a reusable chat turn, and ship an interactive chat-repl binary to try it out.

What a chat template is

When you call a chat API you send something like:

JSON

[
  {"role": "user", "content": "What is 2 + 2?"}
]

The model never sees that JSON. It sees a single flat string of tokens. Something has to turn the list into the string, and not just any string, but the one Qwen3 was fine-tuned on. For Qwen3 that string looks like:

PLAINTEXT

<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant

Get this wrong (wrong markers, a missing newline, the assistant header omitted) and the model still produces tokens, but they're worse: it might continue the user's sentence, or echo the markup, or refuse to stop. The format is the contract the fine-tune was trained against.

The exact format differs per model family. So model authors ship the format with the model, as a chat template: a small program, written in the Jinja2 templating language, stored as a string in the GGUF metadata under the key tokenizer.chat_template. Our tokenizer already parses GGUF metadata (back in I.2), so the template string is already in hand; we just need to run it.

A Jinja template is text with {{ ... }} holes and {% ... %} control flow. Qwen3's looks roughly like {% for message in messages %}<|im_start|>{{ message.role }}\n{{ message.content }}<|im_end|>\n{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}. Rendering it with a list of messages produces the string above. We don't write a Jinja interpreter; we pull in the minijinja crate.
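
To make the target string concrete, here is a minimal hand-written sketch of what the simplified template above produces. It is not how the engine renders prompts (that goes through minijinja and the real template from the GGUF file); render_chatml is a hypothetical helper that takes (role, content) pairs and exists only for illustration.

```rust
/// Hand-written stand-in for the simplified template shown above.
/// `render_chatml` is a hypothetical helper, not part of the engine.
fn render_chatml(messages: &[(&str, &str)], add_generation_prompt: bool) -> String {
    let mut out = String::new();
    for (role, content) in messages {
        // One turn: <|im_start|>{role}\n{content}<|im_end|>\n
        out.push_str("<|im_start|>");
        out.push_str(role);
        out.push('\n');
        out.push_str(content);
        out.push_str("<|im_end|>\n");
    }
    if add_generation_prompt {
        // The dangling assistant header that the model will complete.
        out.push_str("<|im_start|>assistant\n");
    }
    out
}
```

Calling render_chatml(&[("user", "What is 2 + 2?")], true) yields the three-line prompt shown earlier, ending in the open assistant header. With the flag set to false the string stops after the last <|im_end|> and its newline.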

The crate

Four new dependencies and one new binary:

TOML

minijinja = { version = "2.19", features = ["serde"] }
minijinja-contrib = { version = "2.19", features = ["pycompat"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"

TOML

[[bin]]
name = "chat-repl"
path = "src/bin/chat-repl.rs"

minijinja is the Jinja2 engine. minijinja-contrib with the pycompat feature adds Python-style methods (.split(), .strip(), and friends) that real-world chat templates lean on; Qwen3's template calls some of them. serde and serde_json let us hand structured Rust values to the template as the rendering context.

The library root grows a chat module:

RUST

mod chat;

RUST

pub use chat::{ChatTemplateMessage, run_chat_turn_streaming_with_prefix};

ChatTemplateMessage is one {role, content} message; run_chat_turn_streaming_with_prefix is the function a binary calls to run one assistant turn. The chat module itself has three files: a template renderer, a prompt builder, and the turn driver:

RUST

mod generate;
mod prompt;
mod template;
 
pub use generate::{ChatTurnResult, run_chat_turn_streaming_with_prefix};
pub use template::ChatTemplateMessage;

Rendering the template

src/chat/template.rs defines the message type and the function that runs a template. First the message:

RUST

use serde::Serialize;
 
#[derive(Clone, Debug, Serialize)]
pub struct ChatTemplateMessage {
    pub role: String,
    pub content: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub reasoning_content: Option<String>,
}
 
impl ChatTemplateMessage {
    pub fn pair(role: impl Into<String>, content: impl Into<String>) -> Self {
        Self {
            role: role.into(),
            content: content.into(),
            reasoning_content: None,
        }
    }
}

role is "user", "assistant", or "system". content is the text. reasoning_content is an optional field some templates expect for chain-of-thought turns, which carry the model's reasoning separately from its answer; skip_serializing_if means it simply vanishes from the rendering context when it's None. pair is a convenience constructor for the common case of just a role and content.

The #[derive(Serialize)] is what lets minijinja see the struct: the template will reference message.role and message.content, and serde is what exposes those fields by name.

Templates need more than the message list. They also expect a flag saying whether to append the trailing assistant header, and a (possibly empty) list of tools. We bundle all three into a context struct:

RUST

#[derive(Serialize)]
struct ChatTemplateContext {
    messages: Vec<ChatTemplateMessage>,
    add_generation_prompt: bool,
    tools: Vec<serde_json::Value>,
}

add_generation_prompt is the flag: when true, the template emits the dangling <|im_start|>assistant\n at the end. We always want that, because we're always about to generate a reply. tools is for function-calling templates; we don't use tools, so it's an empty Vec, but we still define the field so that every reference to tools in the template resolves to a real (empty) list rather than an undefined value.

Now the renderer:

RUST

pub(crate) fn render_chat_completion(template: &str, messages: &[ChatTemplateMessage]) -> String {
    assert!(!messages.is_empty());
 
    let ctx = ChatTemplateContext {
        messages: messages.to_vec(),
        add_generation_prompt: true,
        tools: vec![],
    };
 
    let mut env = minijinja::Environment::new();
    minijinja_contrib::add_to_environment(&mut env);
    env.set_unknown_method_callback(minijinja_contrib::pycompat::unknown_method_callback);
 
    let tmpl = env.template_from_str(template).unwrap();
    tmpl.render(&ctx).unwrap()
}

An Environment is minijinja's central object: it holds the templates, filters, and functions that rendering draws on. The two minijinja_contrib lines install the Python-compatibility layer: add_to_environment registers the contrib crate's extra filters and functions, and set_unknown_method_callback is what makes "foo".strip() work inside a template: when minijinja hits a method it doesn't recognize, it asks the pycompat callback. Without these, Qwen3's template fails to render.

template_from_str compiles the GGUF template string; render(&ctx) runs it against the context and returns the final prompt string. The .unwrap()s reflect the same philosophy as the rest of the codebase: a broken chat template baked into a model file is a programmer/packaging error, not a runtime condition to recover from.

From messages to token ids

Flowchart from a list of role and content messages, through the minijinja-rendered GGUF template, to a prompt string with im_start markers, the BPE tokenizer, 16 token ids, and the prefill and decode loop.

Figure: the chat path this chapter builds. The template turns structure into a string; the tokenizer turns the string into ids; everything downstream is Act 1 and 2 machinery.

src/chat/prompt.rs glues the renderer to the tokenizer. It answers one question: given a tokenizer and a message list, what should the model prefill?

RUST

use crate::chat::template::{ChatTemplateMessage, render_chat_completion};
use crate::tokenizer::Tokenizer;
 
pub(crate) fn chat_prompt_details(
    tokenizer: &dyn Tokenizer,
    messages: &[ChatTemplateMessage],
) -> Result<(String, Vec<usize>, usize), String> {
    let template = tokenizer
        .chat_template()
        .ok_or_else(|| "tokenizer has no tokenizer.chat_template in GGUF metadata".to_string())?;
    let prompt = render_chat_completion(template, messages);
    let prompt_ids = tokenizer.encode(&prompt);
    if prompt_ids.is_empty() {
        return Err("tokenizer returned no ids for rendered prompt".to_string());
    }
    Ok((prompt, prompt_ids, tokenizer.eos_token_id()))
}

Three steps. Pull the template string out of the GGUF metadata (error if the model didn't ship one, as a base completion model wouldn't). Render the messages through it. Tokenize the rendered string into ids. The function returns all three things a turn needs: the rendered prompt (handy for debugging), its token ids, and the model's end-of-sequence token id so the decode loop knows when to stop.

Driving one chat turn

src/chat/generate.rs is the largest file. It runs one assistant turn end to end: prefill the prompt, decode tokens one at a time until EOS or a length cap, and report what happened. We'll take it in pieces.

First, the result of a turn:

RUST

pub struct ChatTurnResult {
    pub text: String,
    pub rendered_prompt: String,
    pub prompt_tokens: usize,
    pub generated_tokens: usize,
    pub hit_stop: bool,
    pub metrics: Metrics,
    pub prompt_ids: Vec<usize>,
    pub full_ids: Vec<usize>,
}

text is the decoded reply. hit_stop is true if generation ended on EOS rather than the token cap; that is the difference between the "stop" and "length" finish reasons in OpenAI's API, which matters once we serve HTTP. metrics carries the timing numbers (TTFT, decode throughput) from Act 2's Metrics type. The rest is bookkeeping the caller may want.

A small helper trims trailing EOS tokens off the generated ids so they don't show up in the decoded text:

RUST

pub(crate) fn strip_trailing_stops(generated: &mut Vec<usize>, eos_token_id: usize) {
    while generated.last().is_some_and(|id| *id == eos_token_id) {
        generated.pop();
    }
}

The decode loop carries state between steps. We collect it in one struct:

RUST

pub(crate) struct ChatDecodeState {
    pub prompt: String,
    pub prompt_ids: Vec<usize>,
    pub eos_token_id: usize,
    pub cache: Box<dyn KvCache>,
    pub ids: Vec<usize>,
    pub next_id: usize,
    pub metrics: Metrics,
}

ids is the full token sequence so far (prompt plus everything generated). next_id is the token the last forward pass predicted, the one we're about to feed in. cache is the KV cache from II.2. Each decode step appends to ids, runs one forward pass, and replaces next_id.

The first step of a turn is prefill: running the whole prompt through the model in one pass to populate the KV cache and predict the first reply token:

RUST

pub(crate) fn prepare_chat_decode_state(
    model: Arc<dyn Model>,
    tokenizer: &dyn Tokenizer,
    backend: Arc<dyn Backend>,
    messages: &[ChatTemplateMessage],
    kv_mode: &str,
) -> Result<ChatDecodeState, String> {
    let (prompt, prompt_ids, eos_token_id) = chat_prompt_details(tokenizer, messages)?;
    log_verbose_prompt(&prompt, &prompt_ids);
 
    let mut metrics = Metrics::default();
    let mut cache: Box<dyn KvCache> = create_kv_cache(kv_mode, model.clone(), backend.clone())?;
    let logits = metrics.record_timed(|| {
        model
            .as_ref()
            .forward_prefill_with_kv_cache(&prompt_ids, cache.as_mut())
    });
    let next_id = next_token_id_from_logits(backend.as_ref(), &logits).0;
 
    let ids = prompt_ids.clone();
    Ok(ChatDecodeState {
        prompt,
        prompt_ids,
        eos_token_id,
        cache,
        ids,
        next_id,
        metrics,
    })
}

Render and tokenize, build a fresh KV cache, run the prefill pass, and read off the first predicted token from the returned logits. metrics.record_timed wraps the prefill so the time-to-first-token measurement is exactly the prefill duration. The state goes back with ids initialized to just the prompt and next_id holding token one of the reply.
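
The timing wrapper is easy to picture in isolation. The sketch below is a hypothetical stand-in, not Act 2's actual Metrics type: ToyMetrics, its forward_durations field, and time_to_first_token are names invented here. It only shows the idea of running a closure, recording how long it took, and passing the result through.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the engine's `Metrics`; only the timing idea is shown.
#[derive(Default)]
struct ToyMetrics {
    forward_durations: Vec<Duration>,
}

impl ToyMetrics {
    /// Runs `f`, records how long it took, and passes its result through unchanged.
    fn record_timed<T>(&mut self, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.forward_durations.push(start.elapsed());
        out
    }

    /// In this sketch, time to first token is the first recorded duration (the prefill).
    fn time_to_first_token(&self) -> Option<Duration> {
        self.forward_durations.first().copied()
    }
}
```

Because the prefill is the first call wrapped this way, the first recorded duration doubles as the time to first token, and each later decode step adds one more entry.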

Each subsequent step does the same dance: append the predicted token, check the stop conditions, run one decode forward pass. We split that into a model-free "push" phase and the forward call, because the scheduler in III.6 will reuse the push phase on its own. First, the push phase:

RUST

#[derive(Debug)]
pub(crate) enum KvPushPhase {
    Finished(ChatDecodeStep),
    NeedForward { token_id: usize, position: usize },
}
 
pub(crate) fn kv_decode_push_phase(
    ids: &mut Vec<usize>,
    next_id: &mut usize,
    prompt_ids_len: usize,
    eos_token_id: usize,
    max_new_tokens: usize,
) -> KvPushPhase {
    let gen_before = ids.len().saturating_sub(prompt_ids_len);
    if gen_before >= max_new_tokens {
        return KvPushPhase::Finished(ChatDecodeStep::Finished { hit_stop: false });
    }
 
    ids.push(*next_id);
    if *next_id == eos_token_id {
        return KvPushPhase::Finished(ChatDecodeStep::Finished { hit_stop: true });
    }
 
    let gen_after = ids.len().saturating_sub(prompt_ids_len);
    if gen_after >= max_new_tokens {
        return KvPushPhase::Finished(ChatDecodeStep::Finished { hit_stop: false });
    }
 
    let pos = ids.len() - 1;
    let tid_at_pos = ids[pos];
    KvPushPhase::NeedForward {
        token_id: tid_at_pos,
        position: pos,
    }
}

This is purely decision-making; no model calls. It checks three things in order. Was the token cap already reached before pushing? If not, it appends next_id to the sequence and asks: was the token we pushed EOS? Did the push itself reach the cap? If any of those holds, the turn is Finished; hit_stop is true only for the EOS case. Otherwise it returns NeedForward with the token and its position, which is what the model needs to compute the next token. ChatDecodeStep is the small enum that says whether to continue:

RUST

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub(crate) enum ChatDecodeStep {
    Continue,
    Finished { hit_stop: bool },
}

chat_decode_kv_one_step ties the push phase to the forward call:

RUST

pub(crate) fn chat_decode_kv_one_step(
    ids: &mut Vec<usize>,
    next_id: &mut usize,
    cache: &mut dyn KvCache,
    prompt_ids_len: usize,
    eos_token_id: usize,
    max_new_tokens: usize,
    model: &dyn Model,
    backend: &dyn Backend,
    metrics: &mut Metrics,
    mut on_token: impl FnMut(usize),
) -> Result<ChatDecodeStep, String> {
    let phase = kv_decode_push_phase(
        ids,
        next_id,
        prompt_ids_len,
        eos_token_id,
        max_new_tokens,
    );
    match phase {
        KvPushPhase::Finished(step) => Ok(step),
        KvPushPhase::NeedForward {
            token_id,
            position,
        } => {
            on_token(token_id);
            let logits = metrics.record_timed(|| {
                model.forward_decode_with_kv_cache(token_id, position, cache)
            });
            *next_id = next_token_id_from_logits(backend, &logits).0;
            Ok(ChatDecodeStep::Continue)
        }
    }
}

If the push phase says Finished, we're done. Otherwise: call on_token with the token just committed (this is the hook streaming uses), run one decode forward pass, and stash the prediction in next_id for the next call.
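
To watch the step logic run without a model, here is a self-contained toy. It folds the push-phase decisions and the forward call into one function and replaces the forward pass with a fake predictor that returns the previous token plus one. Step, toy_step, run_toy_turn, and the predictor are all hypothetical names for this sketch; only the order of the checks is taken from the listings above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Step {
    Continue,
    Finished { hit_stop: bool },
}

/// Same checks, in the same order, as the push phase plus forward call in the text.
/// `predict` is a hypothetical stand-in for the forward pass plus argmax.
fn toy_step(
    ids: &mut Vec<usize>,
    next_id: &mut usize,
    prompt_len: usize,
    eos: usize,
    max_new_tokens: usize,
    predict: impl Fn(usize, usize) -> usize,
    mut on_token: impl FnMut(usize),
) -> Step {
    if ids.len().saturating_sub(prompt_len) >= max_new_tokens {
        return Step::Finished { hit_stop: false }; // cap reached before pushing
    }
    ids.push(*next_id);
    if *next_id == eos {
        return Step::Finished { hit_stop: true }; // pushed token was EOS
    }
    if ids.len().saturating_sub(prompt_len) >= max_new_tokens {
        return Step::Finished { hit_stop: false }; // cap reached by this push
    }
    let position = ids.len() - 1;
    on_token(ids[position]);
    *next_id = predict(ids[position], position);
    Step::Continue
}

/// Drives a whole toy turn; returns (all ids, ids seen by on_token, hit_stop).
fn run_toy_turn(prompt: &[usize], eos: usize, max_new_tokens: usize) -> (Vec<usize>, Vec<usize>, bool) {
    let predict = |token: usize, _position: usize| token + 1; // fake model: counts upward
    let mut ids = prompt.to_vec();
    // Stand-in for prefill: predict the first reply token from the end of the prompt.
    let mut next_id = predict(*prompt.last().expect("non-empty prompt"), prompt.len() - 1);
    let mut streamed = Vec::new();
    loop {
        let step = toy_step(&mut ids, &mut next_id, prompt.len(), eos, max_new_tokens, &predict, |t| {
            streamed.push(t)
        });
        if let Step::Finished { hit_stop } = step {
            return (ids, streamed, hit_stop);
        }
    }
}
```

With prompt [1, 2], EOS id 5, and a generous cap, the toy turn ends on EOS with ids [1, 2, 3, 4, 5], and the callback sees 3 and 4. With a cap of 2 and an EOS id that never comes up, it ends with ids [1, 2, 3, 4] while the callback sees only 3: in this sketch, as in the listing it mirrors, a token committed on the step that hits the cap is appended to ids without reaching on_token.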

A thin wrapper hides the field-by-field unpacking when we have a ChatDecodeState:

RUST

fn chat_decode_one_step(
    state: &mut ChatDecodeState,
    model: &dyn Model,
    backend: &dyn Backend,
    max_new_tokens: usize,
    on_token: impl FnMut(usize),
) -> Result<ChatDecodeStep, String> {
    let prompt_len = state.prompt_ids.len();
    chat_decode_kv_one_step(
        &mut state.ids,
        &mut state.next_id,
        state.cache.as_mut(),
        prompt_len,
        state.eos_token_id,
        max_new_tokens,
        model,
        backend,
        &mut state.metrics,
        on_token,
    )
}

For debugging it's useful to see exactly what string went into the tokenizer and what ids came out. log_verbose_prompt prints the rendered prompt and a head/tail preview of the token ids:

RUST

fn log_verbose_prompt(prompt: &str, prompt_ids: &[usize]) {
    println!("=== rendered prompt (exact string passed to tokenizer) ===");
    println!("{prompt}");
    println!();
    println!("=== prompt token ids (count = {}) ===", prompt_ids.len());
    const SHOW: usize = 48;
    if prompt_ids.len() <= SHOW + SHOW {
        println!("{prompt_ids:?}");
    } else {
        let head = &prompt_ids[..SHOW];
        let tail = &prompt_ids[prompt_ids.len() - SHOW..];
        println!(
            "{head:?} ... ({} ids omitted) ... {tail:?}",
            prompt_ids.len() - 2 * SHOW
        );
    }
    println!();
}
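
The head/tail preview is easier to check when it returns a string instead of printing. The following is a small sketch under that assumption; preview_ids and its show parameter are hypothetical, and the formatting simply copies the println! above.

```rust
/// Hypothetical helper: the same head/tail preview, returned as a String.
/// `show` is the number of ids kept at each end.
fn preview_ids(ids: &[usize], show: usize) -> String {
    if ids.len() <= show + show {
        // Short enough to print in full.
        format!("{ids:?}")
    } else {
        let head = &ids[..show];
        let tail = &ids[ids.len() - show..];
        format!(
            "{head:?} ... ({} ids omitted) ... {tail:?}",
            ids.len() - 2 * show
        )
    }
}
```

For six ids and show = 2, this gives "[0, 1] ... (2 ids omitted) ... [4, 5]"; anything with at most 2 * show ids is printed whole.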

When the loop ends, build_chat_turn_result packages everything into a ChatTurnResult:

RUST

pub(crate) fn build_chat_turn_result(
    prompt: String,
    prompt_ids: Vec<usize>,
    eos_token_id: usize,
    full_ids: Vec<usize>,
    tokenizer: &dyn Tokenizer,
    metrics: &Metrics,
) -> ChatTurnResult {
    let mut generated: Vec<usize> = full_ids[prompt_ids.len()..].to_vec();
    strip_trailing_stops(&mut generated, eos_token_id);
 
    let generated_tokens = full_ids.len().saturating_sub(prompt_ids.len());
    let hit_stop = full_ids.last().is_some_and(|id| *id == eos_token_id);
    let text = tokenizer.decode(&generated).trim().to_string();
 
    ChatTurnResult {
        text,
        rendered_prompt: prompt,
        prompt_tokens: prompt_ids.len(),
        generated_tokens,
        hit_stop,
        metrics: metrics.clone(),
        prompt_ids,
        full_ids,
    }
}

The generated ids are everything past the prompt. We strip trailing EOS tokens, decode the rest to text, and trim whitespace. hit_stop checks whether the last id was EOS. generated_tokens counts everything generated (including the EOS) because that's the token count a usage report should bill.
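
The accounting is worth seeing with numbers. This sketch reproduces only the counting rules described above on plain id slices; TurnCounts and count_turn are hypothetical names, and no tokenizer is involved.

```rust
/// Hypothetical summary of the accounting described above.
#[derive(Debug, PartialEq)]
struct TurnCounts {
    generated_tokens: usize, // everything past the prompt, EOS included
    hit_stop: bool,          // true when the last id is EOS
    text_ids: Vec<usize>,    // ids that would be decoded to text
}

fn count_turn(prompt_len: usize, full_ids: &[usize], eos: usize) -> TurnCounts {
    let mut text_ids = full_ids[prompt_len..].to_vec();
    // Trailing EOS tokens are dropped before decoding.
    while text_ids.last() == Some(&eos) {
        text_ids.pop();
    }
    TurnCounts {
        generated_tokens: full_ids.len().saturating_sub(prompt_len),
        hit_stop: full_ids.last() == Some(&eos),
        text_ids,
    }
}
```

With a two-token prompt and full ids [1, 2, 7, 8, 99] where 99 is EOS, the turn counts three generated tokens, reports hit_stop as true, and decodes only [7, 8]. The same ids without the trailing 99 count two generated tokens with hit_stop false.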

Now the turn driver. run_chat_turn_with_prefix runs a whole turn: prefill, then the decode loop:

RUST

pub(crate) fn run_chat_turn_with_prefix<F>(
    model: Arc<dyn Model>,
    tokenizer: &dyn Tokenizer,
    backend: Arc<dyn Backend>,
    messages: &[ChatTemplateMessage],
    max_new_tokens: usize,
    kv_mode: Option<&'static str>,
    metrics: &mut Metrics,
    mut on_token: F,
) -> Result<ChatTurnResult, String>
where
    F: FnMut(usize),
{
    let Some(mode) = kv_mode else {
        let (prompt, prompt_ids, eos_token_id) = chat_prompt_details(tokenizer, messages)?;
        log_verbose_prompt(&prompt, &prompt_ids);
        let mut cache = None;
        let full_ids = crate::decode::greedy_generate(
            model.as_ref(),
            backend.as_ref(),
            tokenizer,
            &prompt_ids,
            max_new_tokens,
            eos_token_id,
            metrics,
            &mut cache,
            on_token,
        );
        return Ok(build_chat_turn_result(
            prompt,
            prompt_ids,
            eos_token_id,
            full_ids,
            tokenizer,
            metrics,
        ));
    };
 
    let mut state = prepare_chat_decode_state(
        model.clone(),
        tokenizer,
        backend.clone(),
        messages,
        mode,
    )?;
    loop {
        match chat_decode_one_step(&mut state, model.as_ref(), backend.as_ref(), max_new_tokens, |t| {
            on_token(t)
        })? {
            ChatDecodeStep::Continue => {}
            ChatDecodeStep::Finished { .. } => break,
        }
    }
    let out = build_chat_turn_result(
        state.prompt,
        state.prompt_ids,
        state.eos_token_id,
        state.ids,
        tokenizer,
        &state.metrics,
    );
    *metrics = out.metrics.clone();
    Ok(out)
}

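kv_mode is an Option: None means "no KV cache", which falls back to the slow Act 1 path via greedy_generate and is mostly useful for comparison. Some(mode) is the real path: prepare the decode state (which does the prefill), then loop chat_decode_one_step until a step reports Finished. On this path the on_token callback fires for each generated token that gets fed back into the model, so the EOS token never reaches it and, as the push phase is written, neither does a final token that lands exactly on the cap. It is the seam the next two layers stream through.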

The token-id callback is awkward for a user-facing stream, though. A user wants text deltas, not token numbers, and one token doesn't always decode to a clean piece of text, because a multi-byte UTF-8 character can be split across several tokens. stream_delta solves that: it decodes the whole sequence so far and returns only the new suffix of text since the last call:

RUST

pub(crate) fn stream_delta(
    tokenizer: &dyn Tokenizer,
    tokens: &[usize],
    prev_text: &mut String,
) -> Option<String> {
    let full = tokenizer.decode(tokens);
    let delta = (full.len() > prev_text.len() && full.starts_with(prev_text.as_str()))
        .then(|| full[prev_text.len()..].to_string())
        .filter(|d| !d.is_empty());
    *prev_text = full;
    delta
}

Decoding the whole sequence each step is cheap relative to a forward pass, and it sidesteps every partial-character problem: if the new token only completes half of a UTF-8 character, the decoded string doesn't grow yet, prev_text stays the same, and we emit nothing until the character is whole. (That holds as long as decode omits an incomplete trailing byte sequence instead of substituting a replacement character for it.)
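
A toy run makes the partial-character case visible. The sketch below assumes a made-up byte-level tokenizer in which every token id is a single byte and an incomplete trailing UTF-8 sequence is held back by the decoder; toy_decode, toy_stream_delta, and stream_all are hypothetical names, and the real tokenizer's decode may behave differently.

```rust
/// Toy decoder for illustration: each token id is one byte, and an incomplete
/// trailing UTF-8 sequence is held back instead of being replaced.
fn toy_decode(tokens: &[usize]) -> String {
    let bytes: Vec<u8> = tokens.iter().map(|&t| t as u8).collect();
    match std::str::from_utf8(&bytes) {
        Ok(s) => s.to_string(),
        // Keep only the prefix that is already valid UTF-8.
        Err(e) => String::from_utf8_lossy(&bytes[..e.valid_up_to()]).into_owned(),
    }
}

/// Same delta logic as the listing above, wired to the toy decoder.
fn toy_stream_delta(tokens: &[usize], prev_text: &mut String) -> Option<String> {
    let full = toy_decode(tokens);
    let delta = (full.len() > prev_text.len() && full.starts_with(prev_text.as_str()))
        .then(|| full[prev_text.len()..].to_string());
    *prev_text = full;
    delta
}

/// Feeds the tokens one at a time and collects the delta produced at each step.
fn stream_all(tokens: &[usize]) -> Vec<Option<String>> {
    let mut prev = String::new();
    (1..=tokens.len())
        .map(|n| toy_stream_delta(&tokens[..n], &mut prev))
        .collect()
}
```

Feeding the bytes of "hé" one at a time (0x68, then 0xC3 and 0xA9 for the accented letter) produces Some("h"), then None while the character is only half present, then Some("é") once it is complete.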

Finally, the public function, the one binaries and HTTP handlers call. It wraps run_chat_turn_with_prefix and converts the token-id callback into a text-delta callback:

RUST

pub fn run_chat_turn_streaming_with_prefix<F>(
    model: Arc<dyn Model>,
    tokenizer: Arc<dyn Tokenizer>,
    backend: &Arc<dyn Backend>,
    messages: &[ChatTemplateMessage],
    max_new_tokens: usize,
    kv_mode: Option<&'static str>,
    metrics: &mut Metrics,
    mut on_delta: F,
) -> Result<ChatTurnResult, String>
where
    F: FnMut(&str),
{
    let mut generated_ids: Vec<usize> = Vec::new();
    let mut prev_text = String::new();
 
    run_chat_turn_with_prefix(
        model,
        tokenizer.as_ref(),
        backend.clone(),
        messages,
        max_new_tokens,
        kv_mode,
        metrics,
        |tid| {
            generated_ids.push(tid);
            if let Some(delta) = stream_delta(tokenizer.as_ref(), &generated_ids, &mut prev_text) {
                on_delta(&delta);
            }
        },
    )
}

The caller passes on_delta: FnMut(&str). Internally we keep a running generated_ids and prev_text; each token-id callback appends the id, recomputes the delta, and forwards any new text to on_delta. A binary printing to the terminal and an HTTP handler pushing SSE chunks both plug straight in here.

Generation gets a token hook

One small change in Act 1's greedy_generate: it gains an on_token callback so the no-cache fallback path can stream too.

RUST

pub fn greedy_generate(
    // ...
    metrics: &mut Metrics,
    cache: &mut Option<Box<dyn KvCache>>,
    mut on_token: impl FnMut(usize),
) -> Vec<usize> {

Inside the loop, after a token is accepted and the EOS check passes, the callback fires:

RUST

        on_token(next_id);

model-generate doesn't stream, so it passes an empty closure:

RUST

        &mut cache,
        |_| {},

The chat REPL

src/bin/chat-repl.rs is the binary. The name is the classic read-eval-print loop: read a user message, run a turn, print the reply, repeat. It runs either a single message (chat-repl model.gguf 128 "hello") or an interactive loop. The top of the file:

RUST

use std::io::{self, BufRead, Write};
use std::path::Path;
use std::sync::Arc;
 
use inferno::{
    Backend, ChatTemplateMessage, CliArgs, Metrics, Model, Tokenizer,
    create_backend, load_from_gguf_path, run_chat_turn_streaming_with_prefix,
    rust_log_enables_trace,
};
 
fn usage() -> ! {
    eprintln!(
        "usage: chat-repl [--kv [basic]] [--backend scalar|simd|parallel|metal] <gguf_path> <max_new_tokens> [<user message…>]"
    );
    std::process::exit(2);
}

main parses arguments, loads the model, and dispatches to one of two modes:

RUST

fn main() {
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();
 
    let args = CliArgs::from_env();
    let backend_name = args.backend("simd");
    let kv_mode = args.kv_cache_mode();
 
    let positional = args.positionals();
    if positional.len() < 2 {
        usage();
    }
 
    let gguf_path = Path::new(&positional[0]);
    let max_new_tokens: usize = positional[1].parse().unwrap_or_else(|_| {
        eprintln!("error: max_new_tokens must be a positive integer");
        usage();
    });
    if max_new_tokens < 1 {
        eprintln!("error: max_new_tokens must be >= 1");
        std::process::exit(1);
    }
 
    let user_message: Option<String> = if positional.len() > 2 {
        Some(positional[2..].join(" "))
    } else {
        None
    };
 
    let backend = create_backend(&backend_name, rust_log_enables_trace()).unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(2);
    });
    let (model, tokenizer) = load_from_gguf_path(gguf_path, backend.clone()).unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(1);
    });
 
    if tokenizer.chat_template().is_none() {
        eprintln!("error: GGUF has no tokenizer.chat_template metadata (required for chat-repl)");
        std::process::exit(1);
    }
 
    eprintln!("backend: {}", backend_name);
    eprintln!("kv cache: {}", kv_mode.unwrap_or("off"));
    eprintln!();
 
    match user_message {
        Some(text) => run_one_shot(
            model.clone(),
            tokenizer.clone(),
            &backend,
            &text,
            max_new_tokens,
            kv_mode,
        ),
        None => run_repl(
            model.clone(),
            tokenizer.clone(),
            &backend,
            max_new_tokens,
            kv_mode,
        ),
    }
}

It refuses to run on a model with no chat template; there is nothing sensible to do without one. The explicit chat_template().is_none() check turns that into a clear error instead of a panic deep in the renderer.
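
The positional-argument rules in main can be isolated into a function that returns errors instead of exiting, which makes them easy to test. This is a sketch only: ReplArgs and parse_positionals are hypothetical names, and the real binary goes through CliArgs and calls usage() or process::exit directly.

```rust
/// Hypothetical parsed form of chat-repl's positional arguments.
#[derive(Debug, PartialEq)]
struct ReplArgs {
    gguf_path: String,
    max_new_tokens: usize,
    user_message: Option<String>, // None selects the interactive loop
}

fn parse_positionals(positional: &[String]) -> Result<ReplArgs, String> {
    if positional.len() < 2 {
        return Err("expected <gguf_path> <max_new_tokens> [<user message>]".to_string());
    }
    let max_new_tokens: usize = positional[1]
        .parse()
        .map_err(|_| "max_new_tokens must be a positive integer".to_string())?;
    if max_new_tokens < 1 {
        return Err("max_new_tokens must be >= 1".to_string());
    }
    // Everything after the token budget is joined into one message.
    let user_message = (positional.len() > 2).then(|| positional[2..].join(" "));
    Ok(ReplArgs {
        gguf_path: positional[0].clone(),
        max_new_tokens,
        user_message,
    })
}
```

Given model.gguf 128 hello there, this yields a budget of 128 and the single message "hello there"; with only the path and budget, user_message is None, which corresponds to REPL mode.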

One-shot mode runs a single turn and prints the metrics:

RUST

fn run_one_shot(
    model: Arc<dyn Model>,
    tokenizer: Arc<dyn Tokenizer>,
    backend: &Arc<dyn Backend>,
    user_text: &str,
    max_new_tokens: usize,
    kv_mode: Option<&'static str>,
) {
    let user_only_token_count = tokenizer.as_ref().encode(user_text).len();
    let messages = vec![ChatTemplateMessage::pair("user", user_text)];
 
    let mut metrics = Metrics::default();
 
    eprint!("assistant> ");
    let _ = io::stderr().flush();
    let result = run_chat_turn_streaming_with_prefix(
        model.clone(),
        tokenizer.clone(),
        backend,
        &messages,
        max_new_tokens,
        kv_mode,
        &mut metrics,
        |delta| {
            eprint!("{delta}");
            let _ = io::stderr().flush();
        },
    );
    eprintln!();
    let result = result.unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(1);
    });
 
    eprintln!(
        "chat: user_message_token_count={} (raw user text, no template)",
        user_only_token_count
    );
    eprintln!(
        "chat: templated_prompt_token_count={} (after chat template, input to model prefill)",
        result.prompt_tokens
    );
 
    metrics.print_summary();
}

The on_delta closure just prints each text delta to stderr and flushes; that's what makes the reply appear token by token. Afterward run_one_shot prints both token counts: the raw user message length, and the templated prompt length. The gap between them is the template's overhead (the role markers and special tokens), and seeing it is the whole point of this binary.

Interactive mode keeps a growing messages history so the conversation has memory:

RUST

fn run_repl(
    model: Arc<dyn Model>,
    tokenizer: Arc<dyn Tokenizer>,
    backend: &Arc<dyn Backend>,
    max_new_tokens: usize,
    kv_mode: Option<&'static str>,
) {
    let stdin = io::stdin();
    let mut messages: Vec<ChatTemplateMessage> = Vec::new();
    let mut line = String::new();
    let mut reader = stdin.lock();
 
    eprintln!("Enter user messages (empty line to quit). Ctrl-D EOF also exits.");
    loop {
        print!("user> ");
        let _ = io::stdout().flush();
        line.clear();
        if reader.read_line(&mut line).unwrap_or(0) == 0 {
            break;
        }
        let user_text = line.trim();
        if user_text.is_empty() {
            break;
        }
 
        messages.push(ChatTemplateMessage::pair("user", user_text));
 
        let mut metrics = Metrics::default();
 
        print!("assistant> ");
        let _ = io::stdout().flush();
        let result = run_chat_turn_streaming_with_prefix(
            model.clone(),
            tokenizer.clone(),
            backend,
            &messages,
            max_new_tokens,
            kv_mode,
            &mut metrics,
            |delta| {
                print!("{delta}");
                let _ = io::stdout().flush();
            },
        );
        println!();
        let result = match result {
            Ok(r) => r,
            Err(e) => {
                eprintln!("error: {e}");
                messages.pop();
                continue;
            }
        };
 
        messages.push(ChatTemplateMessage::pair("assistant", &result.text));
        metrics.print_summary();
        println!();
    }
}

Each turn: read a line, push it as a user message, run the turn, push the reply back as an assistant message. Because the whole messages vector is re-rendered and re-prefilled every turn, the model sees the entire conversation each time, which is how a stateless model remembers context. (It also means a long chat re-prefills the same prefix repeatedly; III.5 is the chapter that fixes that waste.) If a turn errors, the failed user message is popped so the history stays consistent.
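
The history bookkeeping can be exercised on its own with a fake turn. In this sketch, fake_turn and submit are hypothetical, messages are plain (role, content) pairs, and the fake turn fails whenever the latest user message is the literal string "fail".

```rust
/// Hypothetical stand-in for one chat turn: sees the whole history,
/// returns a reply or an error.
fn fake_turn(history: &[(String, String)]) -> Result<String, String> {
    let last = &history.last().expect("history is non-empty").1;
    if last == "fail" {
        Err("turn failed".to_string())
    } else {
        Ok(format!("reply #{} to {last}", history.len()))
    }
}

/// Mirrors the REPL's bookkeeping: push user, run, then push assistant or roll back.
fn submit(history: &mut Vec<(String, String)>, user_text: &str) -> Result<(), String> {
    history.push(("user".to_string(), user_text.to_string()));
    match fake_turn(history) {
        Ok(reply) => {
            history.push(("assistant".to_string(), reply));
            Ok(())
        }
        Err(e) => {
            history.pop(); // drop the user message that never got a reply
            Err(e)
        }
    }
}
```

A successful turn grows the history by two entries; a failed one leaves it exactly as it was, so user and assistant messages keep alternating.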

Running it

BASH

cargo run --release --bin chat-repl -- --kv basic path/to/qwen3-0.6b-q8_0.gguf 512 "What is 2 + 2?"

One thing to know before reading the output. Qwen3 is a thinking model: it was fine-tuned to reason before it answers. With this chat template, greedy decode always opens a <think>...</think> block first, so the first thing you'll see stream out is not the answer but the model talking itself through the problem. That's also why the command gives it a 512-token budget: the thinking costs far more tokens than the answer does.

PLAINTEXT

backend: simd
kv cache: basic
 
assistant> === rendered prompt (exact string passed to tokenizer) ===
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
 
 
=== prompt token ids (count = 16) ===
[151644, 872, 198, 3838, 374, 220, 17, 488, 220, 17, 30, 151645, 198, 151644, 77091, 198]
 
<think>
Okay, so the question is asking, "What is 2 + 2?" Hmm, let me think. Well, addition is a basic arithmetic operation. So, when you add two numbers together, you're combining them. In this case, both numbers are 2.
 
[... ~190 more thinking tokens ...]
 
I guess that's all there is to it. The answer is 4.
</think>
 
2 + 2 equals 4.
 
**Answer:** 4
chat: user_message_token_count=8 (raw user text, no template)
chat: templated_prompt_token_count=16 (after chat template, input to model prefill)
metrics:
  time_to_first_token_ms: 368.217
  decode_tokens_per_second: 29.157
  per_forward_ms: min 24.551  max 368.217  mean 35.515  (n=274)
    forward 1: 368.217 ms
    forward 2: 24.551 ms
    ...
    forward 274: 46.329 ms

The rendered prompt shows the chat markup wrapped around the question, and the assistant header dangling at the end with nothing after it. The token-count lines make the template's cost concrete: 8 tokens of question became a 16-token prompt; the other 8 are <|im_start|>, <|im_end|>, the role names, and newlines. The model then completes the dangling assistant turn: thinking first, answering second.
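
The overhead arithmetic can be checked against the ids in the log. In this sketch the ids are copied from the run above, template_overhead is a hypothetical helper, and the range 3..11 for the user text is an assumption read off by lining the ids up with the rendered prompt.

```rust
/// Prompt token ids copied from the run above.
const PROMPT_IDS: [usize; 16] = [
    151644, 872, 198, 3838, 374, 220, 17, 488, 220, 17, 30, 151645, 198, 151644, 77091, 198,
];

/// Returns (user_tokens, overhead_tokens), given where the user text sits in the prompt.
fn template_overhead(prompt_ids: &[usize], user_range: std::ops::Range<usize>) -> (usize, usize) {
    let user_tokens = user_range.len();
    (user_tokens, prompt_ids.len() - user_tokens)
}
```

template_overhead(&PROMPT_IDS, 3..11) gives (8, 8): eight tokens of question and eight of markup, matching the two counts the binary printed.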

Where this leaves us

The engine now speaks chat. It turns a message list into the exact prompt format Qwen3 was fine-tuned on, runs a turn, and streams text deltas back through a callback. The codebase is built so a binary and an HTTP handler share that whole pipeline; chat-repl is just the first caller.

The next chapter writes the second caller. III.2 wraps this pipeline in an axum HTTP server that speaks OpenAI's /v1/chat/completions protocol, so any OpenAI client can talk to it.