I.1: GGUF

A trained language model, on disk, is just a large bag of numbers: a few hundred matrices (the weights) plus a description of how they fit together. Before we can run one, we have to read that file. So that's where we start: not with math, not with a tokenizer, but with a binary file format.

The model we target throughout this series, Qwen3 0.6B, ships as GGUF, the file format used by llama.cpp and its ecosystem. A GGUF file is a single binary blob with four parts: a tiny header; a block of typed key-value metadata (hyperparameters, the tokenizer vocabulary, the chat template); an index of every tensor, recording each one's name, shape, and where its bytes live; and finally the tensor payloads themselves, concatenated into one region. A tensor here is just an n-dimensional block of numbers.

It is simple enough to parse with the Rust standard library alone, no crates, which is what we want. The whole premise of this project is that nothing in the load path is hidden behind someone else's wrapper.

This chapter's goal is deliberately narrow. We parse a GGUF file (header, metadata, tensor index) and build a small command-line tool, gguf-inspect, that dumps what's inside. We do not read tensor data into memory yet, and we don't touch the quantized tensor layouts (formats that store weights in fewer bits to shrink the file): a Tensor type and weight loading arrive in I.3. For now we just learn the format, because the first time you open an unfamiliar model file you want to see its shape.

The crate

From the empty crate the act intro set up, the whole of Cargo.toml for this chapter is package metadata plus a declaration of the one binary we ship:

TOML

[package]
name = "inferno"
version = "0.1.0"
edition = "2024"
 
[[bin]]
name = "gguf-inspect"
path = "src/bin/gguf-inspect.rs"

The library crate's root, src/lib.rs, is just a table of contents: it declares the modules we add this chapter and re-exports the handful of types a binary needs to name:

RUST

mod cli;
mod gguf;
 
pub use cli::CliArgs;
pub use gguf::{GGUF, TensorInfo};

mod gguf is the file parser. mod cli is a tiny command-line argument helper. Every binary in this series takes arguments, so we set up one shared parser now and grow it as later chapters add flags. We'll write cli last; first, the format.

One convention for the whole series: every module directory has a mod.rs that is nothing but a table of contents like this one, mod lines plus re-exports. When the text says a module "declares" or "re-exports" something without showing a block, that's the whole change; the file is always in the repo if you want to compare. Everything with actual behavior gets shown in full.

File layout

GGUF is straight binary, little-endian, and version-tagged. Top to bottom:

PLAINTEXT

┌─ magic "GGUF" (4 bytes)
│  version       : u32     (we accept 2 and 3)
│  tensor_count  : u64
│  metadata_count: u64
├─ metadata entries  (metadata_count of them)
│    key   : length-prefixed UTF-8 string
│    type  : u32
│    value : depends on type (primitive, string, or array)
├─ tensor index      (tensor_count of them)
│    name   : length-prefixed UTF-8 string
│    n_dims : u32
│    dims   : u64 × n_dims
│    type   : u32   (0 = F32; others are quantized layouts we meet later)
│    offset : u64   (byte offset into the tensor data region)
├─ padding to a `general.alignment`-byte boundary
└─ tensor data region  (every tensor payload, concatenated)

Two details matter. First, every metadata value is type-tagged with a small enum: u8, i8, u16, … f64, plus String and a recursive Array. You read the tag, then you know how to read the value. Second, a tensor's offset is relative to the start of the tensor data region, not the start of the file. To turn an offset into a real file position we need tensor_data_start: the byte where the index ends, rounded up to the alignment boundary (typically 32).

[Image: annotated GGUF file layout, with zoom-ins on one metadata entry and one tensor index entry]

Figure: the same layout, annotated. A metadata entry is a key, a type tag, and a value; a tensor index entry is a name, a shape, a type, and an offset that points into the tensor data region at the bottom of the file.

Everything below is just reading those fields in order.

Reading primitives

GGUF is full of fixed-width little-endian integers and length-prefixed strings. Rather than scatter byte-twiddling everywhere, we put five small readers in their own file, src/gguf/read.rs. Each takes anything that implements std::io::Read and pulls one value:

RUST

use std::io::Read;
 
pub(crate) fn read_arr<const N: usize>(r: &mut impl Read) -> [u8; N] {
    let mut b = [0u8; N];
    r.read_exact(&mut b).expect("read GGUF");
    b
}
 
pub(crate) fn read_u8(r: &mut impl Read) -> u8 {
    read_arr::<1>(r)[0]
}
 
pub(crate) fn read_u32(r: &mut impl Read) -> u32 {
    u32::from_le_bytes(read_arr(r))
}
 
pub(crate) fn read_u64(r: &mut impl Read) -> u64 {
    u64::from_le_bytes(read_arr(r))
}
 
pub(crate) fn read_string(r: &mut impl Read) -> String {
    let n = read_u64(r) as usize;
    let mut v = vec![0u8; n];
    r.read_exact(&mut v).expect("read string");
    String::from_utf8_lossy(&v).into_owned()
}

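read_arr is the workhorse: it reads exactly N bytes into a fixed array, and the integer readers just hand that array to from_le_bytes. A GGUF string is a u64 length followed by that many UTF-8 bytes; read_string reads the count, then the bytes. Note that from_utf8_lossy substitutes U+FFFD for any invalid byte sequence instead of panicking, so malformed string bytes are tolerated rather than rejected.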
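
To see the string encoding concretely, here is a minimal sketch that round-trips one string through an in-memory buffer. encode_gguf_string and round_trip are hypothetical helpers invented for this illustration (the crate only ever reads), and read_string is repeated in a self-contained form so the block compiles on its own.

```rust
use std::io::{Cursor, Read};

// Hypothetical helper for this sketch only: encode a string the way GGUF
// stores it (u64 little-endian byte length, then the raw UTF-8 bytes).
fn encode_gguf_string(s: &str) -> Vec<u8> {
    let mut out = (s.len() as u64).to_le_bytes().to_vec();
    out.extend_from_slice(s.as_bytes());
    out
}

// Same behavior as the chapter's read_string, written out in full so the
// sketch does not depend on the other readers.
fn read_string(r: &mut impl Read) -> String {
    let mut len = [0u8; 8];
    r.read_exact(&mut len).expect("read string length");
    let mut bytes = vec![0u8; u64::from_le_bytes(len) as usize];
    r.read_exact(&mut bytes).expect("read string bytes");
    String::from_utf8_lossy(&bytes).into_owned()
}

// Hypothetical helper: encode a string, then decode it from a Cursor.
fn round_trip(s: &str) -> String {
    let mut cursor = Cursor::new(encode_gguf_string(s));
    read_string(&mut cursor)
}
```

Because the readers accept any Read, a std::io::Cursor over a Vec<u8> can stand in for the file, which is a convenient way to exercise them without a model on disk.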

A note on the .expect(...) calls. This crate treats a corrupt or truncated GGUF file as a programmer error in whatever produced the file, not a runtime condition to recover from. If the bytes are wrong we panic with a clear message. The payoff is that the rest of the codebase never has to thread a Result through weight loading just in case the file is broken.

Metadata and tensor types

Next, the data types the parser produces. Two of them, in src/gguf/types.rs. TensorInfo is one entry in the tensor index: everything we know about a tensor except its actual numbers:

RUST

use std::io::{Read, Seek};
 
use super::read::{read_arr, read_string, read_u32, read_u64, read_u8};
 
#[derive(Debug, Clone)]
pub struct TensorInfo {
    pub name: String,
    pub dims: Vec<u64>,
    pub ggml_type: u32,
    pub offset: u64,
}
 
#[derive(Debug, Clone)]
pub enum MetadataValue {
    Uint8(u8),
    Int8(i8),
    Uint16(u16),
    Int16(i16),
    Uint32(u32),
    Int32(i32),
    Float32(f32),
    Bool(bool),
    String(String),
    Array(Vec<MetadataValue>),
    Uint64(u64),
    Int64(i64),
    Float64(f64),
}

MetadataValue is a Rust enum mirroring GGUF's type tags one-to-one. The Array variant holds a Vec of more MetadataValues; that recursion is how the file stores things like the tokenizer's 150k-entry vocabulary (an array of strings).

Reading a value means dispatching on its u32 type tag, which the caller has already read and passes in as ty:

RUST

impl MetadataValue {
    pub fn read_metadata_value<R: Read + Seek>(r: &mut R, ty: u32) -> Self {
        match ty {
            0 => MetadataValue::Uint8(read_u8(r)),
            1 => MetadataValue::Int8(i8::from_le_bytes(read_arr(r))),
            2 => MetadataValue::Uint16(u16::from_le_bytes(read_arr(r))),
            3 => MetadataValue::Int16(i16::from_le_bytes(read_arr(r))),
            4 => MetadataValue::Uint32(read_u32(r)),
            5 => MetadataValue::Int32(i32::from_le_bytes(read_arr(r))),
            6 => MetadataValue::Float32(f32::from_le_bytes(read_arr(r))),
            7 => {
                let b = read_u8(r);
                assert!(b <= 1, "GGUF bool must be 0 or 1, got {b}");
                MetadataValue::Bool(b != 0)
            }
            8 => MetadataValue::String(read_string(r)),
            9 => {
                let elem_type = read_u32(r);
                let n = read_u64(r);
                let mut elements = Vec::with_capacity(n as usize);
                for _ in 0..n {
                    elements.push(Self::read_metadata_value(r, elem_type));
                }
                MetadataValue::Array(elements)
            }
            10 => MetadataValue::Uint64(read_u64(r)),
            11 => MetadataValue::Int64(i64::from_le_bytes(read_arr(r))),
            12 => MetadataValue::Float64(f64::from_le_bytes(read_arr(r))),
            x => panic!("unknown GGUF metadata value type {x}"),
        }
    }
}

Tag 9 (Array) is the only recursive case: it reads the element type once, then calls back into read_metadata_value for each of the n elements.
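
One consequence worth seeing in bytes: the element type is stored once for the whole array, so the elements themselves carry no tags. The sketch below is a cut-down stand-in for read_metadata_value that assumes only two of the thirteen tags (4 and 9); Value, read_value, sample_array_bytes, and decode_sample are names made up for this example.

```rust
use std::io::{Cursor, Read};

// Cut-down stand-in for MetadataValue: just enough variants for the demo.
#[derive(Debug, PartialEq)]
enum Value {
    Uint32(u32),
    Array(Vec<Value>),
}

fn read_u32(r: &mut impl Read) -> u32 {
    let mut b = [0u8; 4];
    r.read_exact(&mut b).expect("read u32");
    u32::from_le_bytes(b)
}

fn read_u64(r: &mut impl Read) -> u64 {
    let mut b = [0u8; 8];
    r.read_exact(&mut b).expect("read u64");
    u64::from_le_bytes(b)
}

// Dispatch on a tag the caller already read. Only tags 4 and 9 are handled.
fn read_value(r: &mut impl Read, ty: u32) -> Value {
    match ty {
        4 => Value::Uint32(read_u32(r)),
        9 => {
            let elem_type = read_u32(r); // stored once for the whole array
            let n = read_u64(r);
            let mut elements = Vec::new();
            for _ in 0..n {
                elements.push(read_value(r, elem_type));
            }
            Value::Array(elements)
        }
        x => panic!("type {} is not covered by this sketch", x),
    }
}

// Bytes of an array value holding the two u32s 7 and 9.
fn sample_array_bytes() -> Vec<u8> {
    let mut b = Vec::new();
    b.extend_from_slice(&4u32.to_le_bytes()); // element type: Uint32
    b.extend_from_slice(&2u64.to_le_bytes()); // element count
    b.extend_from_slice(&7u32.to_le_bytes()); // element 0, no tag
    b.extend_from_slice(&9u32.to_le_bytes()); // element 1, no tag
    b
}

fn decode_sample() -> Value {
    let mut cursor = Cursor::new(sample_array_bytes());
    read_value(&mut cursor, 9)
}
```

The whole value is 4 + 8 + 2 × 4 = 20 bytes: one element tag, one count, and two bare u32 payloads.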

Once values are in hand, you want easy ways to pull them back out. The same logical number can be stored under any of several integer tags depending on which tool wrote the file, so we add one getter that accepts all of them and yields a u64:

RUST

impl MetadataValue {
    pub fn as_u64(&self) -> Option<u64> {
        match self {
            MetadataValue::Uint8(x) => Some(*x as u64),
            MetadataValue::Uint16(x) => Some(*x as u64),
            MetadataValue::Uint32(x) => Some(*x as u64),
            MetadataValue::Uint64(x) => Some(*x),
            MetadataValue::Int8(x) if *x >= 0 => Some(*x as u64),
            MetadataValue::Int16(x) if *x >= 0 => Some(*x as u64),
            MetadataValue::Int32(x) if *x >= 0 => Some(*x as u64),
            MetadataValue::Int64(x) if *x >= 0 => Some(*x as u64),
            _ => None,
        }
    }
}

This chapter only needs as_u64, for one field. Later chapters add more getters (as_str, as_f32, …) to the same impl block as the tokenizer and model config start mining the metadata.
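
The if *x >= 0 guards matter because a bare as cast would turn a negative value into a huge positive one. Here is a small sketch of the difference, using the standard library's checked conversion as one possible alternative spelling of the same guard. non_negative and unchecked are hypothetical helpers, not part of the crate.

```rust
// In the prelude since edition 2021; imported explicitly so the sketch also
// builds on older editions.
use std::convert::TryFrom;

// Hypothetical helper: what the signed arms of as_u64 do, written with a
// checked conversion. Negative inputs yield None.
fn non_negative(x: i64) -> Option<u64> {
    u64::try_from(x).ok()
}

// Hypothetical helper showing the hazard the guards avoid: `as` sign-extends,
// so -1 becomes u64::MAX instead of being rejected.
fn unchecked(x: i32) -> u64 {
    x as u64
}
```

With these definitions, non_negative(32) is Some(32) and non_negative(-1) is None, while unchecked(-1) silently produces u64::MAX, which would be a nonsensical alignment or count.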

The parser

Now the file itself. src/gguf/gguf.rs holds the GGUF struct (the parsed result) and the single function that fills it in:

RUST

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufReader, Read, Seek};
use std::path::Path;
 
use super::read::{read_string, read_u32, read_u64};
use super::types::{MetadataValue, TensorInfo};
 
#[derive(Debug)]
pub struct GGUF {
    pub version: u32,
    pub metadata: HashMap<String, MetadataValue>,
    pub tensor_data_start: u64,
    pub tensors: Vec<TensorInfo>,
}

GGUF is everything the file describes: its version, the metadata map, the tensor index, and tensor_data_start (the absolute file offset where tensor payloads begin). Notice what it doesn't hold: any tensor numbers. Parsing tells us what's in the file and where; pulling the actual weight bytes into memory is a separate job, and it waits until I.3, where there's a Tensor to put them in.

parse reads the file straight through, in spec order. The header first:

RUST

impl GGUF {
    pub fn parse(path: &Path) -> Self {
        let file = File::open(path).expect("open gguf");
        let mut r = BufReader::new(file);
 
        let mut magic = [0u8; 4];
        r.read_exact(&mut magic).expect("read magic");
        assert_eq!(&magic, b"GGUF", "invalid GGUF magic");
 
        let version = read_u32(&mut r);
        assert!(
            (2..=3).contains(&version),
            "unsupported GGUF version {version}"
        );
 
        let tensor_count = read_u64(&mut r) as usize;
        let metadata_count = read_u64(&mut r);

Four bytes of magic that must spell GGUF, a version we check is 2 or 3, then the two counts. Those counts drive the next two loops. Metadata first: read a key, a type tag, and a value, metadata_count times:

RUST

        let mut metadata = HashMap::new();
        for _ in 0..metadata_count {
            let key = read_string(&mut r);
            let ty = read_u32(&mut r);
            let value = MetadataValue::read_metadata_value(&mut r, ty);
            metadata.insert(key, value);
        }

Then the tensor index: for each tensor, its name, its shape (n_dims then that many u64 dimensions), its type tag, and its offset:

RUST

        let mut tensors = Vec::with_capacity(tensor_count);
        for _ in 0..tensor_count {
            let name = read_string(&mut r);
            let n = read_u32(&mut r) as usize;
            let mut dims = Vec::with_capacity(n);
            for _ in 0..n {
                dims.push(read_u64(&mut r));
            }
            let ggml_type = read_u32(&mut r);
            let offset = read_u64(&mut r);
            tensors.push(TensorInfo {
                name,
                dims,
                ggml_type,
                offset,
            });
        }
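
Since names and shapes vary in length, index entries have no fixed size, which is why the loop has to walk them in order rather than jump straight to entry i. As a rough illustration, the byte length of one entry can be computed from the layout listed earlier; tensor_entry_len is a hypothetical helper that the parser itself does not need.

```rust
// Byte length of one tensor index entry, field by field:
//   name   : u64 length prefix + the name's bytes
//   n_dims : u32
//   dims   : one u64 per dimension
//   type   : u32
//   offset : u64
// Hypothetical helper for illustration only.
fn tensor_entry_len(name: &str, n_dims: usize) -> usize {
    8 + name.len() + 4 + 8 * n_dims + 4 + 8
}
```

For example, a two-dimensional tensor named output.weight takes 8 + 13 + 4 + 16 + 4 + 8 = 53 bytes of index, and a one-dimensional output_norm.weight takes 50.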

After the index, the reader's cursor sits just past the last index entry. The tensor data is padded to start on an aligned boundary, so the final step rounds that position up to the boundary:

RUST

        let pos = r.stream_position().expect("stream position");
        let alignment = metadata
            .get("general.alignment")
            .and_then(MetadataValue::as_u64)
            .unwrap_or(32);
        let tensor_data_start = match alignment {
            0 => pos,
            a => pos + (a - pos % a) % a,
        };
 
        Self {
            version,
            metadata,
            tensor_data_start,
            tensors,
        }
    }
}

The alignment value lives in metadata (general.alignment), defaulting to 32 if absent, so we have to finish parsing metadata before we can compute it. The only tricky line is (a - pos % a) % a: "round pos up to the next multiple of a, but if it's already a multiple, add zero, not a." The outer % a is what handles the already-aligned case.

A tensor's real file position is then tensor_data_start + tensor.offset. We don't use that yet, but it's the bridge to I.3.
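
To make the rounding and the offset arithmetic concrete, here is a small sketch. align_up repeats the match from parse as a standalone function, and tensor_file_position is a hypothetical name for the addition described above. The sample numbers in the note that follows are the ones gguf-inspect prints for the FP32 file in the Running it section.

```rust
// Round pos up to the next multiple of alignment. An alignment of 0 is
// treated as "no alignment", mirroring the match in parse.
fn align_up(pos: u64, alignment: u64) -> u64 {
    match alignment {
        0 => pos,
        // (a - pos % a) is in 1..=a; the outer % a maps a back to 0 so an
        // already-aligned pos is left alone.
        a => pos + (a - pos % a) % a,
    }
}

// Hypothetical helper: absolute file position of a tensor payload, given the
// start of the data region and the tensor's region-relative offset.
fn tensor_file_position(tensor_data_start: u64, offset: u64) -> u64 {
    tensor_data_start + offset
}
```

With the values from that run, align_up(5951648, 32) leaves the position unchanged because it is already a multiple of 32, whereas align_up(100, 32) would give 128. output_norm.weight, at offset 622329856, would sit at byte 5951648 + 622329856 = 628281504 of the file.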

Wiring the module

src/gguf/mod.rs ties the three files together and decides what the outside world can see:

RUST

mod gguf;
mod read;
mod types;
 
pub use gguf::GGUF;
pub use types::TensorInfo;

read stays entirely private; those byte helpers are an implementation detail. GGUF and TensorInfo are re-exported because the gguf-inspect binary needs to name them. MetadataValue stays internal for now; it surfaces later when the tokenizer and model config need it.

A command line to look inside

Two more files and we can run something. First the argument helper, src/cli/args.rs. This chapter's needs are minimal (gguf-inspect takes a single file path), but rather than reach for std::env::args() ad hoc in every binary, we wrap it once. ArgCursor walks the argument vector; CliArgs is the parsed result:

RUST

struct ArgCursor<'a> {
    args: &'a [String],
    i: usize,
}
 
impl<'a> ArgCursor<'a> {
    fn new(args: &'a [String]) -> Self {
        Self { args, i: 1 }
    }
 
    fn has_more(&self) -> bool {
        self.i < self.args.len()
    }
 
    fn advance(&mut self) -> Option<&str> {
        if self.i >= self.args.len() {
            return None;
        }
        let s = self.args[self.i].as_str();
        self.i += 1;
        Some(s)
    }
 
    fn take(&mut self) -> String {
        self.advance()
            .map(str::to_string)
            .expect("ArgCursor::take called past end of argv")
    }
}
 
pub struct CliArgs {
    positionals: Vec<String>,
}
 
impl CliArgs {
    pub fn from_env() -> Self {
        Self::parse(std::env::args().collect())
    }
 
    pub fn parse(args: Vec<String>) -> Self {
        let mut positionals = Vec::new();
 
        let mut cur = ArgCursor::new(&args);
        while cur.has_more() {
            positionals.push(cur.take());
        }
 
        Self { positionals }
    }
 
    pub fn positionals(&self) -> &[String] {
        &self.positionals
    }
}

The cursor starts at index 1 to skip the program name. Right now parse does the simplest thing: every argument is a positional. There are no flags yet. That's intentional: later binaries want --backend, --kv, and friends, and when they do we add a match arm here. Introducing the structure now means those additions are purely additive. The module file just exposes CliArgs:

RUST

mod args;
 
pub use args::CliArgs;
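
For intuition, today's parse is equivalent to skipping the first argument and keeping the rest. Here is a one-function sketch of that behavior, where positionals is a free function written for this example rather than the crate's method.

```rust
// Iterator-based equivalent of what CliArgs::parse does in this chapter:
// skip argv[0] (the program name) and keep everything else as a positional.
// Hypothetical free function, for illustration only.
fn positionals(argv: &[String]) -> Vec<String> {
    argv.iter().skip(1).cloned().collect()
}
```

The cursor-based version in the crate does the same work in more lines; the extra structure only pays off once flags with values need to consume more than one argument at a time.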

Now the binary itself, src/bin/gguf-inspect.rs. It takes a path, parses the file, and prints a summary plus a preview of the tensor index:

RUST

use std::collections::HashMap;
use std::path::Path;
use std::process;
 
use inferno::{CliArgs, GGUF, TensorInfo};
 
const TENSOR_PREVIEW: usize = 24;
 
fn usage() -> ! {
    eprintln!("usage: gguf-inspect <model.gguf>");
    process::exit(2);
}
 
fn main() {
    let args = CliArgs::from_env();
    let path = args
        .positionals()
        .first()
        .map(|s| s.as_str())
        .unwrap_or_else(|| usage());
    let path = Path::new(path);
    let g = GGUF::parse(path);
 
    println!("file:                  {}", path.display());
    println!("version:               {}", g.version);
    println!("metadata_keys:         {}", g.metadata.len());
    println!("tensors:               {}", g.tensors.len());
    println!("tensor_data_start:     {}", g.tensor_data_start);
 
    let counts: HashMap<u32, usize> = g.tensors.iter().fold(HashMap::new(), |mut m, t| {
        *m.entry(t.ggml_type).or_default() += 1;
        m
    });
    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort_by_key(|(k, _)| *k);
    let type_summary = pairs
        .iter()
        .map(|(id, n)| format!("{id} -> {n}"))
        .collect::<Vec<_>>()
        .join(", ");
    println!("tensor data types (id → count): [{}]", type_summary);
 
    let n = g.tensors.len().min(TENSOR_PREVIEW);
    print_tensor_table(&g.tensors[..n]);
    if g.tensors.len() > TENSOR_PREVIEW {
        println!("... +{} more", g.tensors.len() - TENSOR_PREVIEW);
    }
}

main resolves the path from the first positional argument (printing usage and exiting if there is none), parses the file, and prints the header fields. The counts fold tallies how many tensors carry each ggml_type tag, a quick way to see whether a file is all-FP32 (type 0) or quantized. The last helper just lays the preview out in aligned columns:

RUST

fn print_tensor_table(tensors: &[TensorInfo]) {
    let max_name_len = tensors.iter().map(|t| t.name.len()).max().unwrap_or(0);
    let max_dim_len = tensors
        .iter()
        .map(|t| format!("{:?}", t.dims).len())
        .max()
        .unwrap_or(0);
 
    println!(
        "{:<name$}  {:<dim$}  type  offset",
        "name",
        "dims",
        name = max_name_len,
        dim = max_dim_len
    );
    for t in tensors {
        println!(
            "{:<name$}  {:<dim$}  {:4}  {}",
            t.name,
            format!("{:?}", t.dims),
            t.ggml_type,
            t.offset,
            name = max_name_len,
            dim = max_dim_len
        );
    }
}
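
A note on the tally in main: it collects into a HashMap and then sorts the pairs, because HashMap iteration order is unspecified. One alternative, sketched here under the assumption that we only need counts in ascending id order, is a BTreeMap, which iterates in key order on its own. tally_types is a hypothetical helper, not something the binary defines.

```rust
use std::collections::BTreeMap;

// Count how many times each type id occurs, returned in ascending id order.
// BTreeMap keeps its keys sorted, so no separate sort step is needed.
fn tally_types(type_ids: &[u32]) -> Vec<(u32, usize)> {
    let mut counts: BTreeMap<u32, usize> = BTreeMap::new();
    for &id in type_ids {
        *counts.entry(id).or_default() += 1;
    }
    counts.into_iter().collect()
}
```

Either approach gives the same summary line; with only a handful of distinct type ids the performance difference is irrelevant, so this is purely a matter of taste.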

Running it

BASH

cargo run --bin gguf-inspect -- path/to/Qwen3-0.6B-FP32.gguf

You get a header (version, how many metadata keys, how many tensors, where the data region starts), a per-type count, and a preview of the first 24 index entries (trimmed here for space):

PLAINTEXT

file:                  path/to/Qwen3-0.6B-FP32.gguf
version:               3
metadata_keys:         34
tensors:               311
tensor_data_start:     5951648
tensor data types (id → count): [0 -> 311]
name                      dims            type  offset
output.weight             [1024, 151936]     0  0
output_norm.weight        [1024]             0  622329856
token_embd.weight         [1024, 151936]     0  622333952
blk.0.attn_k.weight       [1024, 1024]       0  1244663808
blk.0.attn_k_norm.weight  [128]              0  1248858112
blk.0.attn_norm.weight    [1024]             0  1248858624
blk.0.attn_output.weight  [2048, 1024]       0  1248862720
blk.0.attn_q.weight       [1024, 2048]       0  1257251328
blk.0.attn_q_norm.weight  [128]              0  1265639936
blk.0.attn_v.weight       [1024, 1024]       0  1265640448
blk.0.ffn_down.weight     [3072, 1024]       0  1269834752
blk.0.ffn_gate.weight     [1024, 3072]       0  1282417664
blk.0.ffn_norm.weight     [1024]             0  1295000576
blk.0.ffn_up.weight       [1024, 3072]       0  1295004672
...

Those tensor names are a convention we'll lean on hard later: per-layer tensors are blk.<i>.<rest> (for example, blk.0.attn_q.weight, blk.1.ffn_gate.weight), and the count of distinct blk.<i> prefixes tells you how many transformer layers the model has. The type column is the GGML type id: 0 means FP32, and since Act 1 runs the FP32 export of Qwen3 0.6B, that's the only type here. Quantized layouts like Q8_0 (type 8) are what II.6 handles; when we switch to the Q8_0 export there, you can point this same tool at it and watch the type column change.
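
Counting layers from names takes only a few lines. This is a sketch that assumes the blk.<i>.<rest> naming convention just described; layer_index and count_layers are hypothetical helpers, not functions from the crate.

```rust
use std::collections::BTreeSet;

// Extract <i> from a name shaped like "blk.<i>.<rest>".
// Returns None for names that do not follow that pattern.
fn layer_index(name: &str) -> Option<u32> {
    let rest = name.strip_prefix("blk.")?;
    let (index, _) = rest.split_once('.')?;
    index.parse().ok()
}

// Number of distinct layer indices among the given tensor names.
fn count_layers<'a>(names: impl IntoIterator<Item = &'a str>) -> usize {
    names
        .into_iter()
        .filter_map(layer_index)
        .collect::<BTreeSet<u32>>()
        .len()
}
```

Fed the names from a parsed file, for instance g.tensors.iter().map(|t| t.name.as_str()), count_layers ignores the non-layer tensors such as output.weight and token_embd.weight and counts each blk.<i> once, however many tensors share it.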

Where this leaves us

The "weights live in an opaque binary on disk" problem is now a plain function call: GGUF::parse(path) hands back the version, every metadata key, and an index of all 311 tensors. We haven't read a single weight value into memory, and we've deliberately ignored quantized layouts, but we can see the file, and gguf-inspect gives us a quick sanity check on the parser: the counts, shapes, and offsets it prints are plausible and consistent with one another.
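
One way to convince yourself of that is to check the offsets against the shapes. In an F32 file each element is 4 bytes, so a tensor's payload is the product of its dims times 4, and in the listing from the Running it section each offset equals the previous offset plus the previous payload size. Here is a sketch, with f32_payload_bytes and is_contiguous as hypothetical helpers:

```rust
// Size in bytes of an F32 tensor payload: element count times 4.
fn f32_payload_bytes(dims: &[u64]) -> u64 {
    dims.iter().product::<u64>() * 4
}

// True when next_offset starts exactly where an F32 tensor at `offset`
// with shape `dims` ends. Does not model padding between payloads.
fn is_contiguous(offset: u64, dims: &[u64], next_offset: u64) -> bool {
    offset + f32_payload_bytes(dims) == next_offset
}
```

For output.weight, 1024 × 151936 × 4 = 622329856, which is exactly the offset printed for output_norm.weight. The check works here because every payload in the preview happens to be a multiple of the 32-byte alignment; a file whose payloads are not could have padding between tensors, which this sketch does not account for.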

The metadata map we just parsed holds more than hyperparameters; it also carries the tokenizer's entire vocabulary and merge table. That's exactly what the next chapter needs: a byte-pair encoder that turns text into the token ids a model actually consumes.