I.6: Greedy generation

I.5 built the forward pass: hand it a prompt and get back logits, a score for every vocabulary entry at every position. That is everything needed to predict one token. But a language model's job is to produce text: a whole continuation, dozens or hundreds of tokens long. This chapter closes that gap. It's the shortest chapter in Act 1, and the most satisfying, because at the end of it the engine actually generates.

The mechanism is autoregressive generation, and it is as simple as it sounds. Run the forward pass on the prompt. Look at the logits for the last position; those predict what comes after the prompt. Pick a token. Append it to the sequence. Run the forward pass again, now on the prompt-plus-one-new-token. Pick the next token. Append. Repeat. Each new token becomes part of the input for predicting the one after it; the model feeds on its own output. You stop when the model emits its end-of-sequence token, or when you hit a length cap.

"Pick a token" is the one decision point, and there are many strategies for it. We use the simplest: greedy decoding, always taking the single highest-scoring token. No randomness, no creativity knob; given a prompt, greedy decoding always produces the same output. That determinism is what you want for a first, verifiable result. (Fancier sampling such as temperature, top-k, or top-p is a later concern; greedy is the correct foundation.)

Two more backend operations

Greedy decoding needs two new things from the Backend. We add them to the trait:

RUST

    fn copy_row_2d(&self, x: &Tensor, row: usize) -> Tensor;
 
    fn argmax_with_prob(&self, x: &Tensor) -> (usize, f32);

copy_row_2d extracts a single row of a 2-D tensor as a small 1 × cols tensor; we need it because the forward pass returns logits for every position, but generation only cares about the last one. argmax_with_prob finds the largest element of a tensor and returns its index along with its value. That is "pick the highest-scoring token," the whole of greedy decoding.

The CPU implementations are both tiny:

RUST

    fn copy_row_2d(&self, x: &Tensor, row: usize) -> Tensor {
        let s = x.shape();
        assert_eq!(s.len(), 2);
        let cols = s[1];
        let start = row * cols;
        Tensor::new(
            x.as_f32_slice()[start..start + cols].to_vec(),
            vec![1, cols],
        )
    }
 
    fn argmax_with_prob(&self, x: &Tensor) -> (usize, f32) {
        x.as_f32_slice()
            .iter()
            .copied()
            .enumerate()
            .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal))
            .expect("non-empty tensor")
    }

copy_row_2d uses the row-major layout from I.3: row r is the contiguous slice [r*cols .. (r+1)*cols], copied out as a fresh 1 × cols tensor. argmax_with_prob walks the flat buffer with enumerate (so each value carries its index) and uses max_by to keep the pair with the largest value, returning it as (index, value). The partial_cmp(...).unwrap_or(Equal) guards against NaN, which has no defined ordering: partial_cmp returns None when either side is NaN, and a bare unwrap() on that result would panic. The fallback only avoids the panic (it doesn't guarantee a NaN can't be picked), but logits from our own forward pass shouldn't be NaN in the first place. One detail worth knowing: when several elements compare as equal maxima, max_by returns the last of them, so an exact tie goes to the higher index.
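To make the tie and NaN behaviour concrete, here is a minimal sketch that applies the same comparison to a plain slice instead of a Tensor. The free function argmax is a stand-in introduced only for this illustration; it is not part of the Backend trait.

```rust
use std::cmp::Ordering;

/// Same comparison as the CPU argmax_with_prob, but on a plain slice.
/// (Stand-in for illustration; the real method takes a &Tensor.)
fn argmax(values: &[f32]) -> (usize, f32) {
    values
        .iter()
        .copied()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap_or(Ordering::Equal))
        .expect("non-empty slice")
}

fn main() {
    // Ordinary case: index 1 holds the largest value.
    assert_eq!(argmax(&[0.1, 2.5, -1.0]), (1, 2.5));

    // Exact tie: max_by returns the last of several equal maxima.
    assert_eq!(argmax(&[3.0, 1.0, 3.0]), (2, 3.0));

    // NaN: the Equal fallback avoids a panic, but the NaN itself can win.
    let (idx, val) = argmax(&[1.0, f32::NAN]);
    assert_eq!(idx, 1);
    assert!(val.is_nan());
}
```

The last case is the reason the fallback counts as a panic guard rather than a NaN filter: a NaN that compares as "equal" to the current best simply replaces it.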

Picking the next token

The decode module is new; it holds the generation logic. sampling.rs is the "pick a token" step:

RUST

use crate::backend::Backend;
use crate::tensor::Tensor;
 
pub(crate) fn next_token_id_from_logits(ops: &dyn Backend, logits: &Tensor) -> (usize, f32) {
    assert_eq!(logits.shape().len(), 2);
    let seq = logits.shape()[0];
    assert!(seq >= 1, "logits must have at least one position");
    let last_row = ops.copy_row_2d(logits, seq - 1);
    let probs = ops.softmax_rows(&last_row);
    ops.argmax_with_prob(&probs)
}

The forward pass returns a seq × vocab_size matrix. Generation wants the prediction for the next token, which is the last position's row, so copy_row_2d(logits, seq - 1) pulls row seq - 1. Then softmax_rows (from I.4) turns that row of raw logits into a probability distribution, and argmax_with_prob returns the index of the most probable token and its probability.

One subtlety: for greedy decoding the softmax is not strictly necessary; softmax is monotonic, so the argmax of the logits and the argmax of the probabilities are the same token. We run softmax anyway because the second return value, the probability, is a useful number: it tells you how confident the model is in its choice, which is informative when you're debugging output. The token id is what matters; the probability is a free diagnostic.
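The claim is easy to check on a small example. The sketch below uses its own softmax and argmax helpers on plain slices (both are stand-ins written for this illustration, not the softmax_rows and argmax_with_prob from the backend) and confirms that the winning index is the same before and after softmax.

```rust
/// Numerically stable softmax over one row: subtract the row maximum
/// before exponentiating, then normalise. (Illustrative helper.)
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Index of the largest element; the first one wins on a tie.
fn argmax(values: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in values.iter().enumerate() {
        if v > values[best] {
            best = i;
        }
    }
    best
}

fn main() {
    let logits = [2.0_f32, -1.0, 5.5, 0.3];
    let probs = softmax(&logits);

    // Softmax rescales the scores but does not reorder them.
    assert_eq!(argmax(&logits), argmax(&probs));

    // The probabilities sum to 1 (up to rounding).
    let total: f32 = probs.iter().sum();
    assert!((total - 1.0).abs() < 1e-5);

    let winner = argmax(&probs);
    println!("token {} with probability {:.3}", winner, probs[winner]);
}
```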

A row of logit bars over candidate tokens with one clear winner and an arrow showing argmax selecting the same token before and after softmax

Figure: picking the winner. Softmax rescales the bars but never reorders them, so argmax lands on the same token with or without it.

The generation loop

greedy.rs is the autoregressive loop itself, the predict-append-repeat described at the top:

RUST

use crate::backend::Backend;
use crate::decode::next_token_id_from_logits;
use crate::model::Model;
 
pub fn greedy_generate(
    model: &dyn Model,
    ops: &dyn Backend,
    prompt_ids: &[usize],
    max_new_tokens: usize,
    eos_token_id: usize,
) -> Vec<usize> {
    let mut ids = prompt_ids.to_vec();
 
    for _ in 0..max_new_tokens {
        let logits = model.forward(&ids);
        let (next_id, _prob) = next_token_id_from_logits(ops, &logits);
        ids.push(next_id);
        if next_id == eos_token_id {
            break;
        }
    }
 
    ids
}

This is the whole of generation. ids starts as a copy of the prompt and grows. Each iteration runs the full forward pass on the current ids, picks the next token, and appends it. If the picked token is the end-of-sequence token (the special token the model emits to signal "this completion is done," whose id the tokenizer gave us back in I.2), we stop early; the EOS token has already been pushed at that point, so it is the last element of the result. Otherwise we stop after max_new_tokens to keep a runaway generation bounded. The function returns the full sequence, prompt and generated tokens together.
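The two stopping conditions can be exercised without loading a real model. The sketch below replaces model.forward and next_token_id_from_logits with a toy function, toy_last_logits, that always favours "last token + 1" over a four-token vocabulary. The toy model, its vocabulary, and the helper names are assumptions made up for this illustration; only the loop shape matches greedy_generate.

```rust
/// Toy stand-in for the forward pass: returns logits for the last position
/// only, favouring the token after the most recent one. (Hypothetical.)
fn toy_last_logits(ids: &[usize]) -> Vec<f32> {
    const VOCAB: usize = 4;
    let last = *ids.last().expect("non-empty sequence");
    let favourite = (last + 1) % VOCAB;
    (0..VOCAB)
        .map(|t| if t == favourite { 1.0 } else { 0.0 })
        .collect()
}

/// Index of the largest element.
fn argmax(values: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in values.iter().enumerate() {
        if v > values[best] {
            best = i;
        }
    }
    best
}

/// Same predict-append-repeat shape as greedy_generate.
fn greedy(prompt: &[usize], max_new_tokens: usize, eos: usize) -> Vec<usize> {
    let mut ids = prompt.to_vec();
    for _ in 0..max_new_tokens {
        let next = argmax(&toy_last_logits(&ids));
        ids.push(next);
        if next == eos {
            break;
        }
    }
    ids
}

fn main() {
    // EOS (id 3) is reached on the third step and is part of the result.
    assert_eq!(greedy(&[0], 10, 3), vec![0, 1, 2, 3]);

    // With a cap of 2 the loop stops before EOS is ever produced.
    assert_eq!(greedy(&[0], 2, 3), vec![0, 1, 2]);

    // A cap of 0 returns the prompt unchanged.
    assert_eq!(greedy(&[0], 0, 3), vec![0]);
}
```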

Notice what this loop is not doing, and it's the central fact of Act 1's performance story. Every iteration calls model.forward(&ids) on the entire sequence so far. To generate token 50 it re-runs the forward pass over all 49 previous tokens, recomputing every key and value for every one of them, even though those tokens haven't changed since the last step. The work per step grows with the sequence length, and almost all of it is repeated. That is the single biggest reason Act 1's engine is slow, and fixing it (caching those keys and values so each step only processes the one new token) is the KV cache, the first real optimization of Act 2. For now we generate the simple, wasteful, obviously-correct way. The signature of the I.5 attention function already returns the K/V tensors a cache would store; the seam is in place, we just don't use it yet.
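A rough way to size the waste is to count how many token positions pass through the forward pass in total. The sketch below does that for the loop as written and for an idealised cached version, under the assumption that a cache processes the prompt once and then exactly one new position per step. Both function names are made up for this illustration, and position counts are only a proxy: attention cost inside each pass also grows with sequence length.

```rust
/// Positions processed when every step re-runs the whole sequence:
/// step 0 sees prompt_len tokens, step 1 sees prompt_len + 1, and so on.
fn positions_without_cache(prompt_len: usize, new_tokens: usize) -> usize {
    (0..new_tokens).map(|step| prompt_len + step).sum()
}

/// Positions processed with an idealised KV cache (assumption: the prompt
/// is processed once, then one new position per remaining step).
fn positions_with_cache(prompt_len: usize, new_tokens: usize) -> usize {
    if new_tokens == 0 {
        0
    } else {
        prompt_len + (new_tokens - 1)
    }
}

fn main() {
    // A 17-token prompt and 20 generated tokens.
    assert_eq!(positions_without_cache(17, 20), 530);
    assert_eq!(positions_with_cache(17, 20), 36);
}
```

For a 17-token prompt and 20 new tokens, that is 530 positions against 36, and the gap widens as the sequence grows.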

The greedy decode cycle: token ids feed a forward pass, logits feed argmax, the chosen token is appended to the ids, and the loop repeats with the longer sequence

Figure: the autoregressive loop. Each trip around appends one token, then runs forward over the whole grown sequence again; that repeated attention over prior tokens is what the KV cache removes in Act 2.

Figure: three trips around the loop. Each picked token joins the sequence and rides through the next forward pass.

The decode module file exports the two functions:

RUST

mod greedy;
mod sampling;
 
pub use greedy::greedy_generate;
pub(crate) use sampling::next_token_id_from_logits;

A backend factory

We now want a binary that picks a backend by name. A one-function file does that:

RUST

use std::sync::Arc;
 
use super::Backend;
use super::CpuBackend;
 
pub fn create_backend(name: &str) -> Result<Arc<dyn Backend>, String> {
    match name.trim() {
        "scalar" => Ok(Arc::new(CpuBackend)),
        other => Err(format!("unknown backend {other:?} (supported: scalar)")),
    }
}

create_backend maps a string to a backend. Today there's exactly one ("scalar", the CpuBackend from I.4), so the match looks like overkill. It is, deliberately. Act 2 adds "simd", "multicore", and "metal" as new arms here, and a --backend flag will choose between them. Setting up the factory now means those additions touch only this file. The backend module file gains three lines: mod factory;, pub use factory::create_backend;, and pub(crate) use cpu::CpuBackend;.
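As a sketch of what "a new arm" might look like, the example below adds a second backend to a standalone copy of the factory. Everything except the shape of create_backend is an assumption: the Backend trait here is a one-method stand-in so the example compiles on its own, and SimdBackend is a hypothetical placeholder, not the Act 2 implementation.

```rust
use std::sync::Arc;

// Stand-in trait so the sketch is self-contained; the real Backend trait
// carries the tensor operations instead of a name.
trait Backend {
    fn name(&self) -> &'static str;
}

struct CpuBackend;
struct SimdBackend; // hypothetical second backend

impl Backend for CpuBackend {
    fn name(&self) -> &'static str {
        "scalar"
    }
}

impl Backend for SimdBackend {
    fn name(&self) -> &'static str {
        "simd"
    }
}

fn create_backend(name: &str) -> Result<Arc<dyn Backend>, String> {
    match name.trim() {
        "scalar" => Ok(Arc::new(CpuBackend)),
        // Adding a backend is one new arm plus an updated error message.
        "simd" => Ok(Arc::new(SimdBackend)),
        other => Err(format!("unknown backend {other:?} (supported: scalar, simd)")),
    }
}

fn main() {
    // Surrounding whitespace is trimmed before matching.
    assert_eq!(create_backend(" simd ").unwrap().name(), "simd");

    // Unknown names come back as an Err carrying the offending string.
    let err = create_backend("gpu").err().unwrap();
    assert!(err.contains("unknown backend \"gpu\""));
}
```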

The model-generate binary

Everything comes together in model-generate, the first program in this series that produces real text. It's declared in Cargo.toml:

TOML

[[bin]]
name = "model-generate"
path = "src/bin/model-generate.rs"

src/lib.rs re-exports the two new public names, greedy_generate and create_backend, alongside the existing Backend:

RUST

mod backend;
mod cli;
mod decode;
mod gguf;
mod model;
mod tensor;
mod tokenizer;
 
pub use cli::CliArgs;
pub use decode::greedy_generate;
pub use gguf::{GGUF, TensorInfo};
pub use model::{load_from_gguf_path, Model};
pub use backend::{Backend, create_backend};
pub use tokenizer::{BpeTokenizer, Tokenizer};

The binary parses arguments, loads the model, and generates:

RUST

use std::path::Path;
 
use inferno::{CliArgs, create_backend, greedy_generate, load_from_gguf_path};
 
fn usage() -> ! {
    eprintln!("usage: model-generate <gguf_path> [prompt] [max_new_tokens]");
    std::process::exit(2);
}
 
fn main() {
    let args = CliArgs::from_env();
    let positional = args.positionals();
    if positional.is_empty() {
        usage();
    }
 
    let gguf_path = Path::new(&positional[0]);
    let prompt = positional
        .get(1)
        .map(|s| s.as_str())
        .unwrap_or("hello world");
    let max_new_tokens = positional
        .get(2)
        .and_then(|s| s.parse().ok())
        .unwrap_or(20usize);

It takes three positional arguments, read through the CliArgs helper from I.1: the GGUF path (required), an optional prompt, and an optional token cap. The prompt defaults to "hello world" and the cap to 20. A cap that fails to parse as a usize also falls back to 20 rather than raising an error.
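The fallback behaviour of the token cap is worth seeing in isolation, since a typo in the third argument is silently ignored. The sketch below pulls the same and_then/unwrap_or chain into a helper; the name parse_cap is made up for this illustration and does not exist in the binary.

```rust
/// Mirrors how the binary reads the optional token cap: a missing or
/// unparsable argument both yield the default of 20. (Hypothetical helper.)
fn parse_cap(arg: Option<&str>) -> usize {
    arg.and_then(|s| s.parse().ok()).unwrap_or(20)
}

fn main() {
    assert_eq!(parse_cap(None), 20); // argument omitted
    assert_eq!(parse_cap(Some("5")), 5); // valid number
    assert_eq!(parse_cap(Some("five")), 20); // not a number: default
    assert_eq!(parse_cap(Some("-3")), 20); // negative: not a usize, default
}
```

If a silent default is not what you want, replacing .ok() with explicit error handling and a call to usage() would be one way to surface the mistake.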

RUST

    let backend = create_backend("scalar").unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(2);
    });
    let (model, tokenizer) = load_from_gguf_path(gguf_path, backend.clone()).unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(1);
    });
    let eos_token_id = tokenizer.eos_token_id();

It builds the scalar backend, then calls load_from_gguf_path (from I.5), the one call that parses the file, builds the tokenizer, loads all the weights, and hands back a ready Model and Tokenizer. The backend is shared between the two via backend.clone() (cloning an Arc is cheap; it bumps a reference count). The EOS token id comes from the tokenizer; the generation loop needs it to know when to stop.

RUST

    let prompt_ids = tokenizer.encode(prompt);
    println!("prompt: {:?}", prompt);
    println!("prompt tokens ({}): {:?}", prompt_ids.len(), prompt_ids);
    println!("eos token id: {}", eos_token_id);
    println!();
 
    let full_ids = greedy_generate(
        model.as_ref(),
        &*backend,
        &prompt_ids,
        max_new_tokens,
        eos_token_id,
    );
 
    let generated_ids = &full_ids[prompt_ids.len()..];
    println!(
        "generated tokens ({}): {:?}",
        generated_ids.len(),
        generated_ids
    );
    println!("generated text: {:?}", tokenizer.decode(generated_ids));
}

The prompt is encoded to ids (the BPE encoder from I.2), some diagnostics are printed, and greedy_generate runs the loop. It returns the prompt and the generated tokens; slicing off the prompt's length leaves just what the model produced. tokenizer.decode turns those ids back into a string, and that string is the model's completion. Every stage of the pipeline built across Act 1 is in those few lines: parse, tokenize, forward, sample, detokenize.
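Because greedy_generate pushes the EOS token before breaking, a run that ends early leaves that id at the end of the generated slice. Whether it shows up in the printed text depends on how decode treats special tokens. If you would rather drop it before decoding, a minimal sketch might look like the following; generated_part is a hypothetical helper, not something the binary defines.

```rust
/// Slice off the prompt, then drop one trailing EOS id if present.
/// (Hypothetical helper for illustration.)
fn generated_part(full_ids: &[usize], prompt_len: usize, eos: usize) -> &[usize] {
    let generated = &full_ids[prompt_len..];
    match generated.split_last() {
        Some((&last, rest)) if last == eos => rest,
        _ => generated,
    }
}

fn main() {
    // Generation stopped on EOS (id 99): the EOS id is removed.
    assert_eq!(generated_part(&[10, 11, 5, 6, 99], 2, 99).to_vec(), vec![5, 6]);

    // Generation hit the length cap: nothing to remove.
    assert_eq!(generated_part(&[10, 11, 5, 6], 2, 99).to_vec(), vec![5, 6]);

    // Nothing generated at all: the slice is empty.
    assert!(generated_part(&[10, 11], 2, 99).is_empty());
}
```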

Running it

BASH

cargo run --release --bin model-generate -- path/to/Qwen3-0.6B-FP32.gguf "Once upon a time, in a small village by the sea, there lived a baker" 20

The --release flag matters here; a debug build of a scalar matmul is painfully slow. Even in release mode, this will not be fast: it's a single-threaded, scalar, FP32 forward pass re-run from scratch every token. Expect to wait. The output:

PLAINTEXT

prompt: "Once upon a time, in a small village by the sea, there lived a baker"
prompt tokens (17): [12522, 5193, 264, 882, 11, 304, 264, 2613, 14126, 553, 279, 9396, 11, 1052, 12163, 264, 75828]
eos token id: 151645
 
generated tokens (20): [6941, 4392, 13, 9082, 13, 1260, 1030, 264, 2613, 8061, 448, 264, 22360, 1965, 323, 264, 22360, 10496, 13, 3776]
generated text: " named Mr. Smith. He had a small shop with a wooden table and a wooden chair. One"

A real Qwen3 checkpoint, loaded and run by code you wrote top to bottom, continuing a prompt into coherent English. (Greedy decoding always picks the single most likely token, and given a proper story opening the model earnestly writes a story: the baker gets a name and a modest shop with wooden furniture. Deterministic, reproducible: you should see these exact twenty ids.) No ML framework anywhere in the stack: just the GGUF parser, the BPE tokenizer, the Tensor, the CpuBackend, and the forward pass, with this chapter's autoregressive loop tying them together.

Where this leaves us

Act 1 is complete. model-generate takes a prompt and produces a sensible continuation from a real model, and every number that moved through it went through code in this repository. The engine works.

It is also slow (most of a second per token on an M2 Pro, and it gets worse as the sequence grows), and the loop in this chapter is the prime suspect: it re-runs the full forward pass over the entire sequence on every single step, recomputing keys and values that never changed. That is not a flaw to be embarrassed about; it is the honest, simple baseline, and it is the thing Act 2 is built to attack. The Act 1 recap takes stock of everything built here and lays out, concretely, what's slow and why.