II.1: Benchmark harness
At the end of Act 1, model-generate works. It also takes seconds per token, and right now that's all we know: "seconds," felt by watching the terminal. That is not a number you can optimize against. Before we change a single line of the forward pass, we need an instrument: something that turns "it feels slow" into "the first token took 4,210 ms and decode is running at 0.3 tokens per second."
This chapter builds that instrument. It is the shortest chapter in the act and the most important, because every chapter after it is a claim ("the KV cache made decode 40× faster"), and a claim with no measurement behind it is just a vibe. We add a Metrics type that times every forward pass, computes the two numbers that matter, and prints a summary. We also wire in structured logging so we can watch tokens stream out as they're generated.
Nothing here makes the model faster. That is the point. This chapter establishes the baseline that the next five chapters are measured against.
The two numbers that matter
A generation request produces a sequence of forward passes. The first one is special: it processes the whole prompt and emits the first response token. Every forward pass after it processes one token and emits the next. So two numbers describe the request:
- Time to first token (TTFT): how long the first forward pass took. This is prefill: the model reading your prompt. It's the latency a user feels before anything appears on screen.
- Decode tokens per second: the throughput of every forward pass after the first. This is decode: the model writing the response. It's how fast text streams once it starts.
We keep them separate on purpose. They have different bottlenecks (the act intro explains why: prefill is compute-bound, decode is memory-bandwidth-bound), and over the rest of Act 2 they move for different reasons. A single "tokens per second" number that blends them would hide exactly the distinction we most need to see.
To compute both, we only need one raw measurement: the wall-clock duration of each individual forward pass. TTFT is the first entry. Decode throughput is derived from the rest. So the Metrics type is, at its core, a Vec<Duration> (one entry per forward pass) with some methods to slice it.
Two new dependencies
We pull in tracing for structured logging. It's the de-facto standard logging framework in the Rust ecosystem: instead of println!, you emit events with named fields, and a subscriber decides how to format and filter them. We use it so the per-token output (which token, what probability, prefill or decode) can be switched on and off with an environment variable without touching code.
[dependencies]
regex = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }tracing is the API you emit events with; tracing-subscriber is the implementation that prints them. The fmt feature gives a human-readable formatter; env-filter lets the RUST_LOG environment variable control which events appear. That second feature matters later: in II.3 we use RUST_LOG=trace to switch on a per-operation profiler.
The Metrics type
The whole metrics module lives in one new file, src/decode/metrics.rs. Start with the type and a small helper:
use std::time::{Duration, Instant};
#[derive(Debug, Clone, Default)]
pub struct Metrics {
per_forward: Vec<Duration>,
}
impl Metrics {
fn duration_ms(d: Duration) -> f64 {
d.as_secs_f64() * 1_000.0
}Metrics holds one field: per_forward, the list of forward-pass durations in the order they happened. #[derive(Default)] means Metrics::default() gives us an empty one, with no constructor needed. duration_ms is a private convenience that converts a Duration into milliseconds as an f64, because every number we print is in milliseconds and Duration's own API is built around seconds and nanoseconds.
Next, the method that does the actual measuring:
pub fn record_timed<F, T>(&mut self, f: F) -> T
where
F: FnOnce() -> T,
{
let start = Instant::now();
let out = f();
self.per_forward.push(start.elapsed());
out
}record_timed is a timing wrapper. You hand it a closure (the closure being "run one forward pass") and it notes the time, runs the closure, records how long it took, and hands back whatever the closure returned. Instant::now() is a monotonic clock reading; start.elapsed() is the Duration since that reading. The generics are simple: F is the closure type, T is whatever it returns (here, the logits Tensor), and FnOnce means the closure runs exactly once.
The win of this shape is that the timing is invisible at the call site. The generation loop wraps model.forward(&ids) in record_timed(...) and otherwise looks identical. No Instant plumbing leaks into the loop.
pub fn record_forward_share(&mut self, batch_wall: Duration, batch_size: usize) {
if batch_size == 0 {
return;
}
self.per_forward
.push(batch_wall / batch_size as u32);
}record_forward_share is the alternative recorder. Some forward passes (much later, when Act 3 batches several sequences through the model together) don't have a clean one-call-one-token shape. There you time the whole batch and want to attribute an equal share to each sequence in it. This method takes the batch's wall time and its size and pushes the average. We don't call it in this chapter; it's here so the Metrics type is the one place forward-pass timing is recorded, regardless of which decode path produced it.
Now the two derived numbers. TTFT first:
pub fn time_to_first_token_ms(&self) -> Option<f64> {
self.per_forward.first().map(|d| Self::duration_ms(*d))
}The first forward pass is prefill, so its duration is the time to first token. first() returns an Option because the vector might be empty (a request with zero tokens generated), and we propagate that Option rather than guessing a value.
pub fn decode_tokens_per_second(&self) -> Option<f64> {
if self.per_forward.len() <= 1 {
return None;
}
let decode = &self.per_forward[1..];
let n = decode.len();
let total: Duration = decode.iter().sum();
let s = total.as_secs_f64();
if s > 0.0 {
Some(n as f64 / s)
} else {
Some(0.0)
}
}Decode throughput is the count of decode forward passes divided by their total time. The two slicing decisions are the whole point of the method. self.per_forward[1..] drops the first entry, because prefill is not decode and folding it in would pollute the throughput number with a much slower pass. And len() <= 1 returns None: if only the prompt was processed and no response token was generated, "decode tokens per second" is genuinely undefined, not zero. The s > 0.0 guard is just defensive. On a fast enough machine with a coarse clock, total time could round to zero.
Finally, the summary printer:
pub fn print_summary(&self) {
println!("metrics:");
match self.per_forward.first() {
Some(d) => println!(" time_to_first_token_ms: {:.3}", Self::duration_ms(*d)),
None => println!(" time_to_first_token_ms: —"),
}
match self.decode_tokens_per_second() {
Some(tps) => println!(" decode_tokens_per_second: {:.3}", tps),
None => println!(" decode_tokens_per_second: —"),
}
let n = self.per_forward.len();
if n == 0 {
println!(" per_forward_ms: —");
return;
}
let ms: Vec<f64> = self
.per_forward
.iter()
.map(|d| Self::duration_ms(*d))
.collect();
let min_ms = ms.iter().copied().fold(f64::INFINITY, f64::min);
let max_ms = ms.iter().copied().fold(f64::NEG_INFINITY, f64::max);
let mean_ms = ms.iter().sum::<f64>() / n as f64;
println!(
" per_forward_ms: min {:.3} max {:.3} mean {:.3} (n={})",
min_ms, max_ms, mean_ms, n
);
for (i, m) in ms.iter().enumerate() {
println!(" forward {}: {:.3} ms", i + 1, m);
}
}
}print_summary reports TTFT, decode throughput, then the distribution of forward-pass durations (min, max, mean, count), and finally a per-pass dump. Two details worth a look. The empty case (n == 0) prints an em-dash and returns early rather than dividing by zero. And the min/max folds seed with f64::INFINITY and f64::NEG_INFINITY, the standard trick for finding extremes, since any real measurement beats positive infinity as a minimum and negative infinity as a maximum.
The per-pass dump (forward 1, forward 2, …) is verbose, but it's invaluable when reading these numbers. The KV cache chapter is the obvious case: with the cache off you'll see every forward pass getting slower than the last; with it on they flatten out. Seeing that pattern in the per-pass list is how you confirm an optimization did what you think it did, instead of trusting a single averaged number.
Wiring it into the decode module
src/decode/mod.rs gains the new module and re-exports Metrics:
mod greedy;
mod metrics;
mod sampling;
pub use greedy::greedy_generate;
pub use metrics::Metrics;
pub(crate) use sampling::next_token_id_from_logits;And the crate root re-exports it so a binary can name it:
mod backend;
mod cli;
mod decode;
mod gguf;
mod model;
mod tensor;
mod tokenizer;
pub use cli::CliArgs;
pub use decode::{greedy_generate, Metrics};
pub use gguf::{GGUF, TensorInfo};
pub use model::{load_from_gguf_path, Model};
pub use backend::{Backend, create_backend};
pub use tokenizer::{BpeTokenizer, Tokenizer};Threading metrics through the generation loop
The greedy loop from I.6 needs three changes: take a &mut Metrics, wrap the forward call in record_timed, and emit a structured event per token. Here's the whole updated greedy_generate:
use tracing::info;
use crate::backend::Backend;
use crate::decode::{Metrics, next_token_id_from_logits};
use crate::model::Model;
use crate::tokenizer::Tokenizer;
pub fn greedy_generate(
model: &dyn Model,
ops: &dyn Backend,
tokenizer: &dyn Tokenizer,
prompt_ids: &[usize],
max_new_tokens: usize,
eos_token_id: usize,
metrics: &mut Metrics,
) -> Vec<usize> {
let mut ids = prompt_ids.to_vec();
for step in 0..max_new_tokens {
let logits = metrics.record_timed(|| model.forward(&ids));
let (next_id, prob) = next_token_id_from_logits(ops, &logits);
ids.push(next_id);
info!(
step = step + 1,
token_id = next_id,
prob = prob * 100.0,
token = %tokenizer.decode(&[next_id]),
);
if next_id == eos_token_id {
info!("[EOS reached]");
break;
}
}
ids
}The signature gains two parameters: tokenizer, so the loop can decode each token id back to text for logging, and metrics, the accumulator we just built. The loop body has three changes from the Act 1 version.
First, model.forward(&ids) is now wrapped: metrics.record_timed(|| model.forward(&ids)). The closure captures ids by reference, runs the forward pass, and record_timed notes the duration. The returned logits are then used exactly as before. The forward pass itself is untouched, and so it had better be, because this chapter measures the Act 1 baseline; changing the thing being measured would defeat the purpose.
Second, every accepted token emits a tracing event. info!(...) records an event at the "info" level with four named fields: step (1-based), token_id, prob (the model's confidence, as a percentage), and token, the decoded text. The % in token = %tokenizer.decode(...) tells tracing to record the value using its Display formatting rather than Debug, so the token prints as readable text rather than a quoted, escaped string.
Third, when the end-of-sequence token appears, an event marks it before the loop breaks.
The next_token_id_from_logits call is unchanged. Act 1 already returned a probability alongside the token id; we just kept it (prob) this time instead of discarding it, so it can go in the log line.
Wiring it into the binary
Last, src/bin/model-generate.rs. It needs to install a tracing subscriber, create a Metrics, pass it to greedy_generate, and print the summary:
use std::path::Path;
use inferno::{CliArgs, Metrics, create_backend, greedy_generate, load_from_gguf_path};
fn usage() -> ! {
eprintln!("usage: model-generate <gguf_path> [prompt] [max_new_tokens]");
std::process::exit(2);
}
fn main() {
tracing_subscriber::fmt()
.with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
.init();The first thing main does is install the subscriber. tracing_subscriber::fmt() builds the human-readable formatter; .with_env_filter(EnvFilter::from_default_env()) makes it honor the RUST_LOG environment variable; .init() registers it as the global default. After this line, every info! in greedy_generate goes somewhere. Run with RUST_LOG=info and you see per-token events; run with no RUST_LOG and they're filtered out. The metrics: summary still prints, because that uses plain println!.
The middle of main (parsing arguments, creating the backend, loading the model) is unchanged from Act 1:
let args = CliArgs::from_env();
let positional = args.positionals();
if positional.is_empty() {
usage();
}
let gguf_path = Path::new(&positional[0]);
let prompt = positional
.get(1)
.map(|s| s.as_str())
.unwrap_or("hello world");
let max_new_tokens = positional
.get(2)
.and_then(|s| s.parse().ok())
.unwrap_or(20usize);
let backend = create_backend("scalar").unwrap_or_else(|e| {
eprintln!("error: {e}");
std::process::exit(2);
});
let (model, tokenizer) = load_from_gguf_path(gguf_path, backend.clone()).unwrap_or_else(|e| {
eprintln!("error: {e}");
std::process::exit(1);
});
let eos_token_id = tokenizer.eos_token_id();
let prompt_ids = tokenizer.encode(prompt);
println!("prompt: {:?}", prompt);
println!("prompt tokens ({}): {:?}", prompt_ids.len(), prompt_ids);
println!("eos token id: {}", eos_token_id);
println!();The backend is still hard-coded to "scalar", the only backend that exists right now. II.3 makes that a --backend flag. And the end of main creates the Metrics, passes it and the tokenizer into greedy_generate, then prints the summary:
let mut metrics = Metrics::default();
let full_ids = greedy_generate(
model.as_ref(),
&*backend,
tokenizer.as_ref(),
&prompt_ids,
max_new_tokens,
eos_token_id,
&mut metrics,
);
println!();
metrics.print_summary();
println!();
let generated_ids = &full_ids[prompt_ids.len()..];
println!(
"generated tokens ({}): {:?}",
generated_ids.len(),
generated_ids
);
println!("generated text: {:?}", tokenizer.decode(generated_ids));
}Metrics::default() makes the empty accumulator; &mut metrics hands it to the loop, which fills it; metrics.print_summary() prints the result. The rest (slicing off the generated tokens and decoding them to text) is the Act 1 ending, unchanged.
Running it
Build and run with RUST_LOG=info so the per-token events show up alongside the summary:
RUST_LOG=info cargo run --release --bin model-generate -- path/to/qwen3-0.6b.gguf "Once upon a time" 16You get the prompt header, a stream of per-token info events as each one is generated, then the metrics block:
prompt: "Once upon a time"
prompt tokens (4): [12522, 5193, 264, 882]
eos token id: 151645
2026-05-19T22:14:03Z INFO model-generate: step=1 token_id=11 prob=46.2 token=,
2026-05-19T22:14:07Z INFO model-generate: step=2 token_id=1052 prob=38.9 token=" there"
2026-05-19T22:14:12Z INFO model-generate: step=3 token_id=572 prob=64.1 token=" was"
...
2026-05-19T22:15:09Z INFO model-generate: step=16 token_id=11 prob=29.7 token=,
metrics:
time_to_first_token_ms: 4218.553
decode_tokens_per_second: 0.271
per_forward_ms: min 3691.402 max 5402.881 mean 4012.774 (n=16)
forward 1: 4218.553 ms
forward 2: 3691.402 ms
forward 3: 3724.918 ms
...
forward 16: 5402.881 ms
generated tokens (16): [11, 1052, 572, ...]
generated text: ", there was a small village ..."Read those numbers, because they are the baseline the rest of Act 2 demolishes. Time to first token: over four seconds. Decode throughput: 0.27 tokens per second, roughly one token every four seconds. And look at the per-forward dump: the durations grow as generation proceeds (forward 16 is slower than forward 2). That last detail is not noise. It is a bug in the algorithm, hiding in plain sight, and the per-pass list is what makes it visible.
Where this leaves us
We have an instrument. model-generate now prints time to first token, decode tokens per second, and the duration of every forward pass, and it can stream per-token events when RUST_LOG=info is set. The model is exactly as slow as it was at the end of Act 1; we have changed nothing in the forward pass and that was deliberate.
But the harness already handed us our first clue. The forward passes get slower as the response gets longer. Generating token 16 takes more wall time than generating token 2, even though each one produces exactly one token. Something is doing work proportional to the length so far on every single step. The next chapter finds that something (the model is re-reading the entire sequence on every step) and fixes it with the single biggest speedup in the act: the KV cache.