II.1: Benchmark harness

At the end of Act 1, model-generate works. It also takes the better part of a second per token, slower every step, and right now that's all we know: "slow," felt by watching the terminal. That is not a number you can optimize against. Before we change a single line of the forward pass, we need an instrument: something that turns "it feels slow" into "the first token took about 907 ms for this 17-token prompt, and decode is crawling at 0.8 tokens per second, and slowing as the sequence grows."

This chapter builds that instrument. It is the shortest chapter in the act and the most important, because every chapter after it is a claim ("the KV cache made decode 33× faster"), and a claim with no measurement behind it is just a vibe. We add a Metrics type that times every forward pass, computes the two numbers that matter, and prints a summary. We also wire in structured logging so we can watch tokens stream out as they're generated.

Nothing here makes the model faster. That is the point. This chapter establishes the baseline that the next five chapters are measured against.

The two numbers that matter

A generation request produces a sequence of forward passes. The first one is special: it processes the whole prompt and emits the first response token. Every forward pass after it processes one token and emits the next. So two numbers describe the request:

Time to first token (TTFT): how long the first forward pass took. This is prefill: the model reading your prompt. It's the latency a user feels before anything appears on screen.
Decode tokens per second: the throughput of every forward pass after the first. This is decode: the model writing the response. It's how fast text streams once it starts.

We keep them separate on purpose. They have different bottlenecks (the act intro explains why: prefill is compute-bound, decode is memory-bandwidth-bound), and over the rest of Act 2 they move for different reasons. A single "tokens per second" number that blends them would hide exactly the distinction we most need to see.

To compute both, we only need one raw measurement: the wall-clock duration of each individual forward pass. TTFT is the first entry. Decode throughput is derived from the rest. So the Metrics type is little more than a Vec<Duration> (one entry per forward pass) with some methods to slice it.

Two new dependencies

We pull in tracing for structured logging. It's the de-facto standard logging framework in the Rust ecosystem: instead of println!, you emit events with named fields, and a subscriber decides how to format and filter them. We use it so the per-token output (which token, what probability, prefill or decode) can be switched on and off with an environment variable without touching code.

TOML

[dependencies]
regex = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }

tracing is the API you emit events with; tracing-subscriber is the implementation that prints them. The fmt feature gives a human-readable formatter; env-filter lets the RUST_LOG environment variable control which events appear. That second feature matters later: in II.3 we use RUST_LOG=trace to switch on a per-operation profiler.

The Metrics type

The whole metrics module lives in one new file, src/decode/metrics.rs. Start with the type and a small helper:

RUST

use std::time::{Duration, Instant};
 
#[derive(Debug, Clone, Default)]
pub struct Metrics {
    per_forward: Vec<Duration>,
}
 
impl Metrics {
    fn duration_ms(d: Duration) -> f64 {
        d.as_secs_f64() * 1_000.0
    }

Metrics holds one field: per_forward, the list of forward-pass durations in the order they happened. #[derive(Default)] means Metrics::default() gives us an empty one, with no constructor needed. duration_ms is a private convenience that converts a Duration into milliseconds as an f64, because every number we print is in milliseconds and Duration's own API is built around seconds and nanoseconds.

Next, the method that does the actual measuring:

RUST

    pub fn record_timed<F, T>(&mut self, f: F) -> T
    where
        F: FnOnce() -> T,
    {
        let start = Instant::now();
        let out = f();
        self.per_forward.push(start.elapsed());
        out
    }

record_timed is a timing wrapper. You hand it a closure (the closure being "run one forward pass") and it notes the time, runs the closure, records how long it took, and hands back whatever the closure returned. Instant::now() is a monotonic clock reading; start.elapsed() is the Duration since that reading. The generics are simple: F is the closure type, T is whatever it returns (here, the logits Tensor), and FnOnce means the closure runs exactly once.

The win of this shape is that the timing is invisible at the call site. The generation loop wraps model.forward(&ids) in record_timed(...) and otherwise looks identical. No Instant plumbing leaks into the loop.

RUST

    pub fn record_forward_share(&mut self, batch_wall: Duration, batch_size: usize) {
        if batch_size == 0 {
            return;
        }
        self.per_forward
            .push(batch_wall / batch_size as u32);
    }

record_forward_share is the alternative recorder. Some forward passes (much later, when Act 3 batches several sequences through the model together) don't have a clean one-call-one-token shape. There you time the whole batch and want to attribute an equal share to each sequence in it. This method takes the batch's wall time and its size and pushes the average. We don't call it in this chapter; it's here so the Metrics type is the one place forward-pass timing is recorded, regardless of which decode path produced it.

Now the two derived numbers. TTFT first:

RUST

    pub fn time_to_first_token_ms(&self) -> Option<f64> {
        self.per_forward.first().map(|d| Self::duration_ms(*d))
    }

The first forward pass is prefill, so its duration is the time to first token. first() returns an Option because the vector might be empty (a request with zero tokens generated), and we propagate that Option rather than guessing a value.

RUST

    pub fn decode_tokens_per_second(&self) -> Option<f64> {
        if self.per_forward.len() <= 1 {
            return None;
        }
        let decode = &self.per_forward[1..];
        let n = decode.len();
        let total: Duration = decode.iter().sum();
        let s = total.as_secs_f64();
        if s > 0.0 {
            Some(n as f64 / s)
        } else {
            Some(0.0)
        }
    }

Decode throughput is the count of decode forward passes divided by their total time. The two slicing decisions are the whole point of the method. self.per_forward[1..] drops the first entry, because prefill is not decode and folding it in would pollute the throughput number with a much slower pass. And len() <= 1 returns None: if only the prompt was processed and no response token was generated, "decode tokens per second" is undefined, not zero. The s > 0.0 guard is just defensive. On a fast enough machine with a coarse clock, total time could round to zero.

Finally, the summary printer:

RUST

    pub fn print_summary(&self) {
        println!("metrics:");
        match self.per_forward.first() {
            Some(d) => println!("  time_to_first_token_ms: {:.3}", Self::duration_ms(*d)),
            None => println!("  time_to_first_token_ms: —"),
        }
        match self.decode_tokens_per_second() {
            Some(tps) => println!("  decode_tokens_per_second: {:.3}", tps),
            None => println!("  decode_tokens_per_second: —"),
        }
        let n = self.per_forward.len();
        if n == 0 {
            println!("  per_forward_ms: —");
            return;
        }
        let ms: Vec<f64> = self
            .per_forward
            .iter()
            .map(|d| Self::duration_ms(*d))
            .collect();
        let min_ms = ms.iter().copied().fold(f64::INFINITY, f64::min);
        let max_ms = ms.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        let mean_ms = ms.iter().sum::<f64>() / n as f64;
        println!(
            "  per_forward_ms: min {:.3}  max {:.3}  mean {:.3}  (n={})",
            min_ms, max_ms, mean_ms, n
        );
        for (i, m) in ms.iter().enumerate() {
            println!("    forward {}: {:.3} ms", i + 1, m);
        }
    }
}

print_summary reports TTFT, decode throughput, then the distribution of forward-pass durations (min, max, mean, count), and finally a per-pass dump. Two details. The empty case (n == 0) prints an em-dash and returns early rather than dividing by zero. And the min/max folds seed with f64::INFINITY and f64::NEG_INFINITY, the standard trick for finding extremes, since any real measurement beats positive infinity as a minimum and negative infinity as a maximum.

The per-pass dump (forward 1, forward 2, …) is verbose, but it's invaluable when reading these numbers. The KV cache chapter is the obvious case: with the cache off you'll see every forward pass getting slower than the last; with it on they flatten out. Seeing that pattern in the per-pass list is how you confirm an optimization did what you think it did, instead of trusting a single averaged number.

Wiring it into the decode module

src/decode/mod.rs gains the new module and re-exports Metrics:

RUST

mod greedy;
mod metrics;
mod sampling;
 
pub use greedy::greedy_generate;
pub use metrics::Metrics;
pub(crate) use sampling::next_token_id_from_logits;

And the crate root re-exports it so a binary can name it:

RUST

mod backend;
mod cli;
mod decode;
mod gguf;
mod model;
mod tensor;
mod tokenizer;
 
pub use cli::CliArgs;
pub use decode::{greedy_generate, Metrics};
pub use gguf::{GGUF, TensorInfo};
pub use model::{load_from_gguf_path, Model};
pub use backend::{Backend, create_backend};
pub use tokenizer::{BpeTokenizer, Tokenizer};

Threading metrics through the generation loop

The greedy loop from I.6 needs three changes: take a &mut Metrics, wrap the forward call in record_timed, and emit a structured event per token. Here's the whole updated greedy_generate:

RUST

use tracing::info;
 
use crate::backend::Backend;
use crate::decode::{Metrics, next_token_id_from_logits};
use crate::model::Model;
use crate::tokenizer::Tokenizer;
 
pub fn greedy_generate(
    model: &dyn Model,
    ops: &dyn Backend,
    tokenizer: &dyn Tokenizer,
    prompt_ids: &[usize],
    max_new_tokens: usize,
    eos_token_id: usize,
    metrics: &mut Metrics,
) -> Vec<usize> {
    let mut ids = prompt_ids.to_vec();
 
    for step in 0..max_new_tokens {
        let logits = metrics.record_timed(|| model.forward(&ids));
 
        let (next_id, prob) = next_token_id_from_logits(ops, &logits);
        ids.push(next_id);
 
        info!(
            step = step + 1,
            token_id = next_id,
            prob = prob * 100.0,
            token = %tokenizer.decode(&[next_id]),
        );
 
        if next_id == eos_token_id {
            info!("[EOS reached]");
            break;
        }
    }
 
    ids
}

The signature gains two parameters: tokenizer, so the loop can decode each token id back to text for logging, and metrics, the accumulator we just built. The loop body has three changes from the Act 1 version.

First, model.forward(&ids) is now wrapped: metrics.record_timed(|| model.forward(&ids)). The closure captures ids by reference, runs the forward pass, and record_timed notes the duration. The returned logits are then used exactly as before. The forward pass itself is untouched, and so it had better be, because this chapter measures the Act 1 baseline; changing the thing being measured would defeat the purpose.

Second, every accepted token emits a tracing event. info!(...) records an event at the "info" level with four named fields: step (1-based), token_id, prob (the model's confidence, as a percentage), and token, the decoded text. The % in token = %tokenizer.decode(...) tells tracing to record the value using its Display formatting rather than Debug, so the token prints as readable text rather than a quoted, escaped string.

Third, when the end-of-sequence token appears, an event marks it before the loop breaks.

The next_token_id_from_logits call is unchanged. Act 1 already returned a probability alongside the token id; we just kept it (prob) this time instead of discarding it, so it can go in the log line.

Wiring it into the binary

Last, src/bin/model-generate.rs. It needs to install a tracing subscriber, create a Metrics, pass it to greedy_generate, and print the summary:

RUST

use std::path::Path;
 
use inferno::{CliArgs, Metrics, create_backend, greedy_generate, load_from_gguf_path};
 
fn usage() -> ! {
    eprintln!("usage: model-generate <gguf_path> [prompt] [max_new_tokens]");
    std::process::exit(2);
}
 
fn main() {
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();

The first thing main does is install the subscriber. tracing_subscriber::fmt() builds the human-readable formatter; .with_env_filter(EnvFilter::from_default_env()) makes it honor the RUST_LOG environment variable; .init() registers it as the global default. After this line, every info! in greedy_generate goes somewhere. Run with RUST_LOG=info and you see per-token events; run with no RUST_LOG and they're filtered out. The metrics: summary still prints, because that uses plain println!.

The middle of main (parsing arguments, creating the backend, loading the model) is unchanged from Act 1:

RUST

    let args = CliArgs::from_env();
    let positional = args.positionals();
    if positional.is_empty() {
        usage();
    }
 
    let gguf_path = Path::new(&positional[0]);
    let prompt = positional
        .get(1)
        .map(|s| s.as_str())
        .unwrap_or("hello world");
    let max_new_tokens = positional
        .get(2)
        .and_then(|s| s.parse().ok())
        .unwrap_or(20usize);
 
    let backend = create_backend("scalar").unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(2);
    });
    let (model, tokenizer) = load_from_gguf_path(gguf_path, backend.clone()).unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(1);
    });
    let eos_token_id = tokenizer.eos_token_id();
 
    let prompt_ids = tokenizer.encode(prompt);
    println!("prompt: {:?}", prompt);
    println!("prompt tokens ({}): {:?}", prompt_ids.len(), prompt_ids);
    println!("eos token id: {}", eos_token_id);
    println!();

The backend is still hard-coded to "scalar", the only backend that exists right now. II.3 makes that a --backend flag. And the end of main creates the Metrics, passes it and the tokenizer into greedy_generate, then prints the summary:

RUST

    let mut metrics = Metrics::default();
 
    let full_ids = greedy_generate(
        model.as_ref(),
        &*backend,
        tokenizer.as_ref(),
        &prompt_ids,
        max_new_tokens,
        eos_token_id,
        &mut metrics,
    );
 
    println!();
    metrics.print_summary();
    println!();
    let generated_ids = &full_ids[prompt_ids.len()..];
    println!(
        "generated tokens ({}): {:?}",
        generated_ids.len(),
        generated_ids
    );
    println!("generated text: {:?}", tokenizer.decode(generated_ids));
}

Metrics::default() makes the empty accumulator; &mut metrics hands it to the loop, which fills it; metrics.print_summary() prints the result. The rest (slicing off the generated tokens and decoding them to text) is the Act 1 ending, unchanged.

Running it

Build and run with RUST_LOG=info so the per-token events show up alongside the summary:

BASH

RUST_LOG=info cargo run --release --bin model-generate -- path/to/Qwen3-0.6B-FP32.gguf "Once upon a time, in a small village by the sea, there lived a baker" 16

You get the prompt header, a stream of per-token info events as each one is generated, then the metrics block:

PLAINTEXT

prompt: "Once upon a time, in a small village by the sea, there lived a baker"
prompt tokens (17): [12522, 5193, 264, 882, 11, 304, 264, 2613, 14126, 553, 279, 9396, 11, 1052, 12163, 264, 75828]
eos token id: 151645
 
2026-07-14T05:53:06.750249Z  INFO inferno::decode::greedy: step=1 token_id=6941 prob=43.262046813964844 token= named
2026-07-14T05:53:07.664597Z  INFO inferno::decode::greedy: step=2 token_id=4392 prob=5.320432186126709 token= Mr
2026-07-14T05:53:08.628011Z  INFO inferno::decode::greedy: step=3 token_id=13 prob=99.18336486816406 token=.
...
2026-07-14T05:53:26.129788Z  INFO inferno::decode::greedy: step=16 token_id=264 prob=71.0180892944336 token= a
 
metrics:
  time_to_first_token_ms: 906.980
  decode_tokens_per_second: 0.774
  per_forward_ms: min 906.980  max 1726.418  mean 1267.376  (n=16)
    forward 1: 906.980 ms
    forward 2: 913.800 ms
    forward 3: 962.899 ms
    forward 4: 1016.485 ms
    forward 5: 1070.827 ms
    forward 6: 1136.112 ms
    forward 7: 1174.295 ms
    forward 8: 1227.107 ms
    forward 9: 1272.132 ms
    forward 10: 1326.159 ms
    forward 11: 1381.036 ms
    forward 12: 1432.807 ms
    forward 13: 1500.357 ms
    forward 14: 1577.638 ms
    forward 15: 1652.961 ms
    forward 16: 1726.418 ms
 
generated tokens (16): [6941, 4392, 13, 9082, 13, 1260, 1030, 264, 2613, 8061, 448, 264, 22360, 1965, 323, 264]
generated text: " named Mr. Smith. He had a small shop with a wooden table and a"

These numbers are the baseline the rest of Act 2 demolishes. Time to first token: nearly a full second, and that's with a seventeen-token prompt. Decode throughput: 0.77 tokens per second, roughly one token every 1.3 seconds. And look at the per-forward dump: the durations grow as generation proceeds (forward 16, at 1726 ms, takes nearly twice as long as forward 1 at 907 ms; a smaller ratio than you'd see from a shorter prompt, because the sequence already starts 17 tokens long, but the same steady climb, more than 50 ms added per step). That steady climb is a bug in the algorithm showing through, not measurement noise, and the per-pass list is what makes it visible.

Where this leaves us

We have an instrument. model-generate now prints time to first token, decode tokens per second, and the duration of every forward pass, and it can stream per-token events when RUST_LOG=info is set. The model is exactly as slow as it was at the end of Act 1; we have changed nothing in the forward pass and that was deliberate.

But the harness already handed us our first clue. The forward passes get slower as the response gets longer. Generating token 16 takes more wall time than generating token 2, even though each one produces exactly one token. Something is doing work proportional to the length so far on every single step. The next chapter finds that something (the model is re-reading the entire sequence on every step) and fixes it with the single biggest speedup in the act: the KV cache.