I.6: Greedy generation
I.5 built the forward pass: hand it a prompt and get back logits, a score for every vocabulary entry at every position. That is everything needed to predict one token. But a language model's job is to produce text, a whole continuation, dozens or hundreds of tokens. This chapter closes that gap. It's the shortest chapter in Act 1, and the most satisfying, because at the end of it the engine actually generates.
The mechanism is autoregressive generation, and it is exactly as simple as it sounds. Run the forward pass on the prompt. Look at the logits for the last position; those predict what comes after the prompt. Pick a token. Append it to the sequence. Run the forward pass again, now on the prompt-plus-one-new-token. Pick the next token. Append. Repeat. Each new token becomes part of the input for predicting the one after it; the model feeds on its own output. You stop when the model emits its end-of-sequence token, or when you hit a length cap.
"Pick a token" is the one decision point, and there are many strategies for it. We use the simplest: greedy decoding, always taking the single highest-scoring token. No randomness, no creativity knob; given a prompt, greedy decoding always produces the same output. That determinism is exactly what you want for a first, verifiable result. (Fancier sampling such as temperature, top-k, or top-p is a later concern; greedy is the correct foundation.)
Two more backend operations
Greedy decoding needs two new things from the Backend. We add them to the trait:
fn copy_row_2d(&self, x: &Tensor, row: usize) -> Tensor;
fn argmax_with_prob(&self, x: &Tensor) -> (usize, f32);
copy_row_2d extracts a single row of a 2-D tensor as a small 1 × cols tensor; we need it because the forward pass returns logits for every position, but generation only cares about the last one. argmax_with_prob finds the largest element of a tensor and returns its index along with its value. That is "pick the highest-scoring token," the whole of greedy decoding.
The CPU implementations are both tiny:
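fn copy_row_2d(&self, x: &Tensor, row: usize) -> Tensor {
    let s = x.shape();
    assert_eq!(s.len(), 2);
    let cols = s[1];
    // Row-major layout: row `row` starts at offset row * cols.
    let start = row * cols;
    Tensor::new(
        x.as_f32_slice()[start..start + cols].to_vec(),
        vec![1, cols],
    )
}
fn argmax_with_prob(&self, x: &Tensor) -> (usize, f32) {
    x.as_f32_slice()
        .iter()
        .copied()
        .enumerate()
        // NaN has no ordering; treat such comparisons as Equal instead of panicking.
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal))
        .expect("non-empty tensor")
}
copy_row_2d uses the row-major layout from I.3: row r is the contiguous slice [r*cols .. (r+1)*cols], copied out as a fresh 1 × cols tensor. argmax_with_prob walks the flat buffer with enumerate (so each value carries its index) and takes the max_by value. The partial_cmp(...).unwrap_or(Equal) handles NaN, which has no defined ordering: partial_cmp returns None for any comparison involving NaN, and a bare unwrap() there would panic. The fallback avoids the panic, but it does not make the result NaN-safe. A NaN compares as Equal to everything, and max_by keeps the later of two equal elements, so a NaN logit can displace the running maximum. Logits from a healthy model contain no NaN, so this stays a guard against crashing rather than a feature. The function returns (index, value).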
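To make the two operations concrete, here is a minimal standalone sketch on plain slices. It assumes nothing from the repository: copy_row and argmax are illustrative stand-ins for the trait methods, not the engine's own code. It also probes what the Equal fallback does when a NaN sits in the middle of a row, which is worth seeing once.

```rust
use std::cmp::Ordering;

/// Copy row `row` out of a row-major buffer with `cols` columns.
/// (Illustrative stand-in for copy_row_2d, working on a plain slice.)
fn copy_row(data: &[f32], cols: usize, row: usize) -> Vec<f32> {
    let start = row * cols;
    data[start..start + cols].to_vec()
}

/// Same comparison as argmax_with_prob, applied to a plain slice.
fn argmax(values: &[f32]) -> (usize, f32) {
    values
        .iter()
        .copied()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap_or(Ordering::Equal))
        .expect("non-empty slice")
}

fn main() {
    // A 2 x 3 matrix stored row-major: [[1, 5, 2], [7, 0, 3]].
    let data = [1.0, 5.0, 2.0, 7.0, 0.0, 3.0];
    let last = copy_row(&data, 3, 1);
    println!("last row: {:?}", last); // [7.0, 0.0, 3.0]
    println!("argmax:   {:?}", argmax(&last)); // (0, 7.0)

    // With a NaN in the middle, every comparison against it is Equal,
    // and max_by keeps the later element, so the true maximum (5.0) is lost.
    println!("with NaN: {:?}", argmax(&[5.0, f32::NAN, 1.0])); // (2, 1.0)
}
```

The last line is the reason to treat the fallback as panic protection only: if NaN-robust selection ever mattered, the comparison would need to skip NaN explicitly.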
Picking the next token
The decode module is new; it holds the generation logic. sampling.rs is the "pick a token" step:
use crate::backend::Backend;
use crate::tensor::Tensor;
pub(crate) fn next_token_id_from_logits(ops: &dyn Backend, logits: &Tensor) -> (usize, f32) {
    assert_eq!(logits.shape().len(), 2);
    let seq = logits.shape()[0];
    assert!(seq >= 1, "logits must have at least one position");
    // Only the last position predicts the next token.
    let last_row = ops.copy_row_2d(logits, seq - 1);
    let probs = ops.softmax_rows(&last_row);
    ops.argmax_with_prob(&probs)
}
The forward pass returns a seq × vocab_size matrix. Generation wants the prediction for the next token, which is the last position's row, so copy_row_2d(logits, seq - 1) pulls row seq - 1. Then softmax_rows (from I.4) turns that row of raw logits into a probability distribution, and argmax_with_prob returns the index of the most probable token and its probability.
One subtlety worth noting: for greedy decoding the softmax is not strictly necessary; softmax is monotonic, so the argmax of the logits and the argmax of the probabilities are the same token. We run softmax anyway because the second return value, the probability, is a genuinely useful number: it tells you how confident the model is in its choice, which is informative when you're debugging output. The token id is what matters; the probability is a free diagnostic.
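The claim that softmax leaves the argmax unchanged is easy to check numerically. The sketch below is standalone and assumes its own small softmax and argmax helpers on plain slices; they stand in for softmax_rows and argmax_with_prob and are not the repository's implementations.

```rust
/// Numerically stable softmax over one row (subtract the max before exp).
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Index of the largest element (assumes a non-empty slice with no NaN).
fn argmax(values: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in values.iter().enumerate() {
        if v > values[best] {
            best = i;
        }
    }
    best
}

fn main() {
    let logits = [2.0, -1.0, 4.5, 0.3];
    let probs = softmax(&logits);

    // exp is strictly increasing and the divisor is shared by every entry,
    // so the ordering of the entries is preserved.
    assert_eq!(argmax(&logits), argmax(&probs));

    let id = argmax(&probs);
    println!("token {} with probability {:.3}", id, probs[id]);
}
```

The token id is identical either way; only the second number, the confidence, needs the softmax.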
The generation loop
greedy.rs is the autoregressive loop itself, the predict-append-repeat described at the top:
use crate::backend::Backend;
use crate::decode::next_token_id_from_logits;
use crate::model::Model;
pub fn greedy_generate(
    model: &dyn Model,
    ops: &dyn Backend,
    prompt_ids: &[usize],
    max_new_tokens: usize,
    eos_token_id: usize,
) -> Vec<usize> {
    let mut ids = prompt_ids.to_vec();
    for _ in 0..max_new_tokens {
        // Full forward pass over everything generated so far.
        let logits = model.forward(&ids);
        let (next_id, _prob) = next_token_id_from_logits(ops, &logits);
        ids.push(next_id);
        if next_id == eos_token_id {
            break;
        }
    }
    ids
}
This is the whole of generation. ids starts as a copy of the prompt and grows. Each iteration: run the full forward pass on the current ids; pick the next token; append it. If the picked token is the end-of-sequence token (the special token the model emits to signal "this completion is done," whose id the tokenizer gave us back in I.2), we stop early. Otherwise we stop after max_new_tokens to keep a runaway generation bounded. The function returns the full sequence, prompt and generated tokens together.
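The loop's two exits are easiest to see with a model small enough to predict by hand. The following is a sketch under loud assumptions: ToyModel and Counter are hypothetical stand-ins invented for this example (the real Model returns a full logits tensor), and greedy mirrors the shape of greedy_generate without using any repository types.

```rust
/// Hypothetical stand-in for the chapter's Model: returns only the
/// last position's logits over a tiny vocabulary.
trait ToyModel {
    fn last_logits(&self, ids: &[usize]) -> Vec<f32>;
}

/// Toy rule: always prefer (last id + 1) modulo the vocabulary size.
struct Counter {
    vocab: usize,
}

impl ToyModel for Counter {
    fn last_logits(&self, ids: &[usize]) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab];
        let last = *ids.last().expect("non-empty prompt");
        logits[(last + 1) % self.vocab] = 1.0;
        logits
    }
}

/// Predict, append, repeat; stop on EOS or after `max_new` tokens.
fn greedy(model: &dyn ToyModel, prompt: &[usize], max_new: usize, eos: usize) -> Vec<usize> {
    let mut ids = prompt.to_vec();
    for _ in 0..max_new {
        let logits = model.last_logits(&ids);
        // Greedy pick: index of the largest logit.
        let mut next = 0;
        for (i, &v) in logits.iter().enumerate() {
            if v > logits[next] {
                next = i;
            }
        }
        ids.push(next);
        if next == eos {
            break;
        }
    }
    ids
}

fn main() {
    let model = Counter { vocab: 6 };
    // Treat id 4 as EOS: the loop stops early, and EOS is part of the result.
    println!("{:?}", greedy(&model, &[1], 10, 4)); // [1, 2, 3, 4]
    // With a cap of 2 the length limit ends the loop first.
    println!("{:?}", greedy(&model, &[1], 2, 4)); // [1, 2, 3]
}
```

Note that the EOS id is pushed before the break, so a caller that slices off the prompt still sees EOS as the final generated token.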
Notice what this loop is not doing, and it's the central fact of Act 1's performance story. Every iteration calls model.forward(&ids) on the entire sequence so far. To generate token 50 it re-runs the forward pass over all 49 previous tokens, recomputing every key and value for every one of them, even though those tokens haven't changed since the last step. The work per step grows with the sequence length, and almost all of it is repeated. That is the single biggest reason Act 1's engine is slow, and fixing it (caching those keys and values so each step only processes the one new token) is the KV cache, the first real optimization of Act 2. For now we generate the simple, wasteful, obviously-correct way. The signature of the I.5 attention function already returns the K/V tensors a cache would store; the seam is in place, we just don't use it yet.
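A rough count shows how much of that work is repeated. The sketch below tallies only how many token positions pass through the forward pass; it ignores that attention itself gets more expensive per position as the sequence grows, so treat it as a lower bound on the waste. Both function names are made up for this illustration.

```rust
/// Token positions processed when every step re-runs the full sequence:
/// step i (starting at 0) sees prompt_len + i tokens.
fn positions_without_cache(prompt_len: usize, new_tokens: usize) -> usize {
    (0..new_tokens).map(|step| prompt_len + step).sum()
}

/// Token positions processed if each position were computed only once:
/// the prompt on the first step, then one new token per later step.
fn positions_with_cache(prompt_len: usize, new_tokens: usize) -> usize {
    if new_tokens == 0 {
        0
    } else {
        prompt_len + new_tokens - 1
    }
}

fn main() {
    // A 4-token prompt and 20 new tokens.
    println!("{}", positions_without_cache(4, 20)); // 270
    println!("{}", positions_with_cache(4, 20)); // 23
}
```

In closed form the uncached count is N·P + N(N−1)/2 for a prompt of P tokens and N new tokens, so it grows quadratically in N, while the cached count grows linearly.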
The decode module file exports the two functions:
mod greedy;
mod sampling;
pub use greedy::greedy_generate;
pub(crate) use sampling::next_token_id_from_logits;
A backend factory
We now want a binary that picks a backend by name. A one-function file does that:
use std::sync::Arc;
use super::Backend;
use super::CpuBackend;
pub fn create_backend(name: &str) -> Result<Arc<dyn Backend>, String> {
    match name.trim() {
        "scalar" => Ok(Arc::new(CpuBackend)),
        other => Err(format!("unknown backend {other:?} (supported: scalar)")),
    }
}
create_backend maps a string to a backend. Today there's exactly one ("scalar", the CpuBackend from I.4) so the match looks like overkill. It is, deliberately. Act 2 adds "simd", "multicore", "metal" as new arms here, and a --backend flag will choose between them. Setting up the factory now means those additions touch only this file. The backend module file gains mod factory;, a pub use factory::create_backend;, and a pub(crate) use cpu::CpuBackend;.
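The factory's behavior at the edges (whitespace, case, unknown names) can be exercised in isolation. This sketch copies the create_backend body but swaps in a hypothetical one-method Backend trait and an empty CpuBackend so that it compiles on its own; the name method exists only for this example.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the chapter's Backend trait.
trait Backend {
    fn name(&self) -> &'static str;
}

/// Empty placeholder for the scalar CPU backend.
struct CpuBackend;

impl Backend for CpuBackend {
    fn name(&self) -> &'static str {
        "scalar"
    }
}

fn create_backend(name: &str) -> Result<Arc<dyn Backend>, String> {
    match name.trim() {
        "scalar" => Ok(Arc::new(CpuBackend)),
        other => Err(format!("unknown backend {other:?} (supported: scalar)")),
    }
}

fn main() {
    // Surrounding whitespace is trimmed before matching.
    let backend = create_backend(" scalar ").expect("scalar is supported");
    println!("{}", backend.name()); // scalar

    // Matching is otherwise exact, so case matters.
    match create_backend("Scalar") {
        Ok(_) => println!("unexpectedly accepted"),
        Err(e) => println!("{e}"), // unknown backend "Scalar" (supported: scalar)
    }
}
```

Returning a String error keeps the binary's error handling to a single eprintln!; a dedicated error type would be a reasonable alternative once there are more failure modes.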
The model-generate binary
Everything comes together in model-generate, the first program in this series that produces real text. It's declared in Cargo.toml:
[[bin]]
name = "model-generate"
path = "src/bin/model-generate.rs"
src/lib.rs re-exports the two new public names, greedy_generate and create_backend, alongside the existing Backend:
mod backend;
mod cli;
mod decode;
mod gguf;
mod model;
mod tensor;
mod tokenizer;
pub use cli::CliArgs;
pub use decode::greedy_generate;
pub use gguf::{GGUF, TensorInfo};
pub use model::{load_from_gguf_path, Model};
pub use backend::{Backend, create_backend};
pub use tokenizer::{BpeTokenizer, Tokenizer};
The binary parses arguments, loads the model, and generates:
use std::path::Path;
use inferno::{CliArgs, create_backend, greedy_generate, load_from_gguf_path};
fn usage() -> ! {
    eprintln!("usage: model-generate <gguf_path> [prompt] [max_new_tokens]");
    std::process::exit(2);
}
fn main() {
    let args = CliArgs::from_env();
    let positional = args.positionals();
    if positional.is_empty() {
        usage();
    }
    let gguf_path = Path::new(&positional[0]);
    let prompt = positional
        .get(1)
        .map(|s| s.as_str())
        .unwrap_or("hello world");
    let max_new_tokens = positional
        .get(2)
        .and_then(|s| s.parse().ok())
        .unwrap_or(20usize);
It takes three positional arguments: the GGUF path (required), an optional prompt, and an optional token cap, all read through the CliArgs helper from I.1. The prompt defaults to "hello world" and the cap to 20; a cap that fails to parse as an integer also falls back to 20 rather than raising an error.
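The get / and_then / unwrap_or chain is compact enough that its fallback behavior is easy to miss. Here it is lifted into a standalone function for a quick check. The sketch assumes the positionals behave like a slice of String (which the .as_str() call above suggests); parse_cap and to_args are hypothetical helpers for this example only.

```rust
/// Same parsing chain as the binary's token cap, on a plain slice of strings.
fn parse_cap(positional: &[String]) -> usize {
    positional
        .get(2)
        .and_then(|s| s.parse().ok())
        .unwrap_or(20usize)
}

/// Convenience helper to build owned argument lists for the demo.
fn to_args(v: &[&str]) -> Vec<String> {
    v.iter().map(|s| s.to_string()).collect()
}

fn main() {
    // Third positional present and numeric: used as-is.
    println!("{}", parse_cap(&to_args(&["m.gguf", "hi", "5"]))); // 5
    // Third positional missing: default.
    println!("{}", parse_cap(&to_args(&["m.gguf", "hi"]))); // 20
    // Third positional present but not a number: also the default, silently.
    println!("{}", parse_cap(&to_args(&["m.gguf", "hi", "lots"]))); // 20
}
```

The silent fallback is convenient for a demo binary; a stricter tool might prefer to call usage() when the cap is present but malformed.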
    let backend = create_backend("scalar").unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(2);
    });
    let (model, tokenizer) = load_from_gguf_path(gguf_path, backend.clone()).unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(1);
    });
    let eos_token_id = tokenizer.eos_token_id();
It builds the scalar backend, then calls load_from_gguf_path (from I.5), the one call that parses the file, builds the tokenizer, loads all the weights, and hands back a ready Model and Tokenizer. The model receives its own handle to the backend via backend.clone() (cloning an Arc is cheap; it bumps a reference count), while main keeps the original handle to pass to the generation loop. The EOS token id comes from the tokenizer; the generation loop needs it to know when to stop.
    let prompt_ids = tokenizer.encode(prompt);
    println!("prompt: {:?}", prompt);
    println!("prompt tokens ({}): {:?}", prompt_ids.len(), prompt_ids);
    println!("eos token id: {}", eos_token_id);
    println!();
    let full_ids = greedy_generate(
        model.as_ref(),
        &*backend,
        &prompt_ids,
        max_new_tokens,
        eos_token_id,
    );
    // Drop the prompt; keep only what the model produced.
    let generated_ids = &full_ids[prompt_ids.len()..];
    println!(
        "generated tokens ({}): {:?}",
        generated_ids.len(),
        generated_ids
    );
    println!("generated text: {:?}", tokenizer.decode(generated_ids));
}
The prompt is encoded to ids (the BPE encoder from I.2), some diagnostics are printed, and greedy_generate runs the loop. It returns the prompt and the generated tokens; slicing off the prompt's length leaves just what the model produced. tokenizer.decode turns those ids back into a string, and that string is the model's completion. Every stage of the pipeline built across Act 1 is in those few lines: parse, tokenize, forward, sample, detokenize.
Running it
cargo run --release --bin model-generate -- path/to/qwen3-0.6b.gguf "Once upon a time" 20
The --release flag matters here; a debug build of a scalar matmul is painfully slow. Even in a release build, this will not be fast: it's a single-threaded, scalar, FP32 forward pass re-run from scratch every token. Expect to wait. The output:
prompt: "Once upon a time"
prompt tokens (4): [12522, 5193, 264, 882]
eos token id: 151645
generated tokens (20): [11, 1052, 572, 264, 2613, 7459, 6342, 13362, 13, 1340, 6342, ...]
generated text: ", there was a small village named Lily. She named ..."
A real Qwen3 checkpoint, loaded and run by code you wrote top to bottom, continuing a prompt into coherent English. No ML framework anywhere in the stack: just the GGUF parser, the BPE tokenizer, the Tensor, the CpuBackend, and the forward pass, with this chapter's autoregressive loop tying them together.
Where this leaves us
Act 1 is complete. model-generate takes a prompt and produces a sensible continuation from a real model, and every number that moved through it went through code in this repository. The engine works.
It is also slow (seconds per token) and the loop in this chapter is the prime suspect: re-running the full forward pass over the entire sequence on every single step, recomputing keys and values that never changed. That is not a flaw to be embarrassed about; it is the honest, simple baseline, and it is exactly the thing Act 2 is built to attack. The Act 1 recap takes stock of everything built here and lays out, concretely, what's slow and why.