I.3: Tensor

Everything a transformer does is arithmetic on arrays of numbers. The weights are arrays. The activations flowing between layers are arrays. The attention scores, the intermediate MLP values, the final logits: all arrays. So before we can run a single layer, we need a type to hold one: a tensor.

The word "tensor" sounds like it should be heavy mathematical machinery. For our purposes it is not. A tensor, in an inference engine, is just two things bolted together: a flat block of numbers, and a shape that says how to interpret that flat block as a multi-dimensional grid. A 1024×151936 weight matrix and a 1024-element bias vector are both, underneath, a single contiguous Vec<f32>; the shape is what tells them apart.

This chapter is short and foundational. We define the Tensor type, and we teach the GGUF parser one new trick: actually reading tensor data off disk. I.1 parsed the tensor index (every tensor's name, shape, and offset) but deliberately never touched the payloads. Now we have somewhere to put them, so GGUF::load_tensor reads the bytes for a named tensor and hands back a Tensor.

Row-major layout: the one idea that matters

A computer's memory is one-dimensional: a long line of bytes addressed 0, 1, 2, …. A matrix is two-dimensional. To store a matrix in memory you have to flatten it, and there is a choice in how. We use row-major order: lay down the first row, then the second row immediately after it, and so on. The matrix

  ┌ 1  2  3 ┐
  │ 4  5  6 │
  └ 7  8  9 ┘

becomes the flat buffer [1, 2, 3, 4, 5, 6, 7, 8, 9]. The element at row r, column c of an R × C matrix lives at flat index r * C + c, with rows and columns counted from zero. That single formula, r * C + c, is the entire contract. Every kernel we write in the next chapter assumes it. Every weight we load assumes it. A single routine that disagrees about it does not crash; it produces plausible-looking garbage, so we pin it down here, once, and never revisit the question.
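To make the contract concrete, here is a minimal sketch of the index formula applied to the 3 × 3 matrix above. The helpers flat_index and get are names made up for this illustration; they are not part of the engine's code.

```rust
/// Flat index of element (r, c) in a row-major matrix with `cols` columns.
/// `flat_index` is a hypothetical helper used only for this illustration.
fn flat_index(r: usize, c: usize, cols: usize) -> usize {
    r * cols + c
}

/// Looks up element (r, c) of a row-major buffer, checking bounds first.
fn get(buf: &[f32], rows: usize, cols: usize, r: usize, c: usize) -> f32 {
    assert_eq!(buf.len(), rows * cols, "buffer does not match shape");
    assert!(r < rows && c < cols, "index out of bounds");
    buf[flat_index(r, c, cols)]
}
```

With the buffer [1, 2, …, 9] and cols = 3, get(&buf, 3, 3, 1, 2) reads flat index 1 * 3 + 2 = 5, which holds 6: row 1, column 2, counting from zero.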

Row-major also has a practical consequence: walking a row is walking consecutive memory, which is exactly what a CPU's cache wants. Walking a column jumps C elements each step. This is why some of the kernels in I.4 are written the way they are, but that's getting ahead of ourselves.
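The difference in access pattern is easy to see by listing the flat indices each walk touches. This is an illustrative sketch with made-up helper names (row_walk, col_walk), not code from the engine:

```rust
/// Flat indices visited when walking row `r` of a row-major matrix.
/// Consecutive: the stride between neighbours is 1.
fn row_walk(r: usize, cols: usize) -> Vec<usize> {
    (0..cols).map(|c| r * cols + c).collect()
}

/// Flat indices visited when walking column `c`.
/// Each step jumps `cols` elements ahead.
fn col_walk(c: usize, rows: usize, cols: usize) -> Vec<usize> {
    (0..rows).map(|r| r * cols + c).collect()
}
```

For the 3 × 3 example, row_walk(1, 3) yields [3, 4, 5] while col_walk(1, 3, 3) yields [1, 4, 7]. On a 1024-column matrix of f32 the column stride is 1024 × 4 = 4096 bytes per step, far more than a cache line (commonly 64 bytes), so every step of a column walk lands on a different line.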

GGUF itself stores tensor dimensions in the opposite order to how we want them; it lists the fastest-varying dimension first. A weight matrix the model treats as [rows, cols] appears in the GGUF index as dims = [cols, rows]. We'll reverse that when we load. Keeping our internal convention consistent (shape[0] is rows, the slowest-varying axis) is worth the small reversal at the door.

The Tensor type

The new tensor module is two files. mod.rs is the usual one-liner:

src/tensor/mod.rs (Rust)
mod tensor;

pub use tensor::Tensor;

tensor.rs holds the type. We start with the data itself, wrapped in an enum:

src/tensor/tensor.rs (Rust)
#[derive(Clone)]
pub(crate) enum TensorData {
    Fp32(Vec<f32>),
}

impl std::fmt::Debug for TensorData {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            TensorData::Fp32(v) => write!(f, "Fp32({} elems)", v.len()),
        }
    }
}

TensorData is an enum with exactly one variant today: Fp32, a Vec of 32-bit floats. A single-variant enum looks pointless, and right now it nearly is. But it is a deliberate hook. In II.6 we add a Q8_0 variant for quantized weights, where the numbers are stored as 8-bit integers plus per-block scale factors. Making TensorData an enum now means that change is purely additive (a new variant, new match arms) rather than a type rename that ripples through every file. The hand-written Debug keeps log output short: a tensor with a million elements prints Fp32(1000000 elems), not a million numbers.
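To see why the change stays additive, here is a hedged sketch of what a second variant could look like. The Q8_0 fields below (quants, scales) and the type name TensorDataSketch are guesses for illustration only; the actual layout is defined in II.6 and may differ.

```rust
// Illustration only: a stand-in for TensorData with a hypothetical
// quantized variant. Field names and types are assumptions, not the
// layout II.6 actually uses.
#[derive(Clone)]
enum TensorDataSketch {
    Fp32(Vec<f32>),
    Q8_0 { quants: Vec<i8>, scales: Vec<f32> },
}

impl std::fmt::Debug for TensorDataSketch {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            // Existing arm, unchanged.
            TensorDataSketch::Fp32(v) => write!(f, "Fp32({} elems)", v.len()),
            // New arm: the only edit the new variant forces in this impl.
            TensorDataSketch::Q8_0 { quants, scales } => write!(
                f,
                "Q8_0({} elems, {} blocks)",
                quants.len(),
                scales.len()
            ),
        }
    }
}
```

Because the match has no wildcard arm, any code that matches on the enum stops compiling until it handles the new variant, so the compiler lists every place that needs a new arm.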

Now the Tensor struct itself:

src/tensor/tensor.rs (Rust)
#[derive(Clone)]
pub struct Tensor {
    pub(in crate::tensor) data: TensorData,
    pub(in crate::tensor) shape: Vec<usize>,
}

Two fields: the flat data, and the shape. The shape is a Vec<usize> ([1024] for a vector, [1024, 151936] for a matrix). The pub(in crate::tensor) visibility means these fields are reachable from anywhere inside the tensor module but nowhere else. Other modules go through methods. That keeps the "flat buffer plus shape, and the two always agree" invariant enforceable in one place.

A custom Debug prints a tensor compactly, as its dimensions followed by its memory footprint:

src/tensor/tensor.rs (Rust)
impl std::fmt::Debug for Tensor {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let dims: Vec<_> = self.shape.iter().map(|d| d.to_string()).collect();
        let (bytes, unit) = match &self.data {
            // FP32 stores four bytes per element.
            TensorData::Fp32(v) => fmt_bytes(v.len() * 4),
        };
        write!(f, "[{}]({}{})", dims.join(","), bytes, unit)
    }
}

// Picks the largest binary unit that fits (1 MB here is 1 << 20 bytes).
// The `as u64` cast truncates, so 593.5 MB prints as 593MB.
fn fmt_bytes(bytes: usize) -> (u64, &'static str) {
    if bytes >= 1 << 20 {
        ((bytes as f64 / (1 << 20) as f64) as u64, "MB")
    } else if bytes >= 1 << 10 {
        ((bytes as f64 / (1 << 10) as f64) as u64, "KB")
    } else {
        (bytes as u64, "B")
    }
}

A tensor prints as something like [1024,151936](593MB): shape, then size in the largest sensible unit. fmt_bytes does the unit selection. Trivial, but when you're debugging a model with hundreds of tensors, a one-line summary per tensor is the difference between a readable log and a wall of numbers.
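As a quick check of the unit selection, the sketch below carries a free-standing copy of the fmt_bytes logic plus a small helper, fp32_footprint, which is a hypothetical name introduced only for this example. Note that the division truncates rather than rounds, and that the units are powers of two.

```rust
// Free-standing copy of the fmt_bytes logic so this example stands alone.
fn fmt_bytes(bytes: usize) -> (u64, &'static str) {
    if bytes >= 1 << 20 {
        ((bytes as f64 / (1 << 20) as f64) as u64, "MB")
    } else if bytes >= 1 << 10 {
        ((bytes as f64 / (1 << 10) as f64) as u64, "KB")
    } else {
        (bytes as u64, "B")
    }
}

/// Footprint of an FP32 tensor with the given shape, formatted the same
/// way the Debug impl formats it. `fp32_footprint` is a hypothetical helper.
fn fp32_footprint(shape: &[usize]) -> String {
    let numel: usize = shape.iter().product();
    let (n, unit) = fmt_bytes(numel * 4);
    format!("{n}{unit}")
}
```

A [1024, 151936] matrix holds 155,582,464 floats, or 622,329,856 bytes, which is 593.5 binary megabytes and prints as 593MB. A [1024] vector is 4096 bytes and prints as 4KB; a [100] vector is 400 bytes and prints as 400B.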

Finally the one constructor:

src/tensor/tensor.rs (Rust)
impl Tensor {
    pub fn new(data: Vec<f32>, shape: Vec<usize>) -> Self {
        let expected: usize = shape.iter().product();
        assert_eq!(data.len(), expected, "tensor shape mismatch");
        Self {
            data: TensorData::Fp32(data),
            shape,
        }
    }
}

Tensor::new takes a flat Vec<f32> and a shape, and crucially checks they agree. The number of elements a shape implies is the product of its dimensions; if the data length doesn't match, that is a bug, and we panic immediately rather than let a mis-shaped tensor wander into a kernel and corrupt something far away. This single assertion is the guardrail behind the row-major contract. Every tensor in the system is born here, and every one is checked.
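The guardrail can be exercised in isolation. The sketch below uses a cut-down stand-in type, TensorSketch, rather than the real Tensor, and a hypothetical helper, construction_panics, that relies on std::panic::catch_unwind purely to make the panic observable.

```rust
// Minimal stand-in for the chapter's Tensor, reduced to what the
// constructor check needs. Not the engine's actual type.
#[derive(Debug)]
struct TensorSketch {
    data: Vec<f32>,
    shape: Vec<usize>,
}

impl TensorSketch {
    fn new(data: Vec<f32>, shape: Vec<usize>) -> Self {
        // The element count a shape implies is the product of its dims.
        let expected: usize = shape.iter().product();
        assert_eq!(data.len(), expected, "tensor shape mismatch");
        Self { data, shape }
    }
}

/// Returns true if building a tensor from `len` zeros and `shape` panics.
/// catch_unwind is used here only so the example can report the outcome.
fn construction_panics(len: usize, shape: Vec<usize>) -> bool {
    std::panic::catch_unwind(move || TensorSketch::new(vec![0.0; len], shape)).is_err()
}
```

Six elements with shape [2, 3] construct fine; five elements with the same shape panic at the assertion. One edge case worth knowing: the product of an empty shape is 1, so a shape of [] describes a single scalar, not an empty tensor.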

That's the whole type for now. No arithmetic; that's the Backend's job in I.4. Tensor is a pure data structure. Its job is to be the one container every weight and activation flows through.

Loading tensor data from GGUF

Now the second half of the chapter: getting real numbers off disk into a Tensor. This is an extension of the gguf module. First a tiny utility file:

src/gguf/util.rs (Rust)
use super::types::TensorInfo;

pub(crate) fn tensor_numel(t: &TensorInfo) -> u64 {
    t.dims
        .iter()
        .copied()
        .try_fold(1u64, |acc, d| acc.checked_mul(d))
        .expect("tensor size overflow")
}

tensor_numel computes how many elements a tensor has: the product of its dimensions, taken from the TensorInfo in the index. The try_fold with checked_mul is overflow-safe: a corrupt file claiming absurd dimensions overflows the multiply and panics with a clear message rather than wrapping silently into a tiny number and under-allocating. The gguf module file gains the line mod util; to pull it in.
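To see what the checked fold protects against, here is a small sketch that runs the same fold over a plain slice and compares it with a wrapping product. Both function names are hypothetical, and the Option return is a choice made for this example so the overflow case can be inspected instead of panicking.

```rust
/// Same fold as tensor_numel, but over a plain slice and returning an
/// Option instead of panicking. `checked_numel` is a hypothetical name.
fn checked_numel(dims: &[u64]) -> Option<u64> {
    dims.iter()
        .copied()
        .try_fold(1u64, |acc, d| acc.checked_mul(d))
}

/// What an unchecked product does in a release build: wrap modulo 2^64.
fn wrapping_numel(dims: &[u64]) -> u64 {
    dims.iter().fold(1u64, |acc, &d| acc.wrapping_mul(d))
}
```

For dims of [1024, 151936] both agree on 155,582,464. For a corrupt pair such as [2^32, 2^32] the checked version returns None, while the wrapping product comes out as 0, which would size a zero-length read.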

The real work happens in gguf.rs. The GGUF struct grows one field:

src/gguf/gguf.rs (Rust)
#[derive(Debug)]
pub struct GGUF {
    pub(crate) reader: BufReader<File>,
    pub version: u32,
    pub metadata: HashMap<String, MetadataValue>,
    pub tensor_data_start: u64,
    pub tensors: Vec<TensorInfo>,
}

The new field is reader, the buffered file handle, kept open after parsing. In I.1 the reader was a local variable inside parse; it was dropped the moment parsing finished, because we only needed the index. Now we need to come back and read payloads on demand, so the GGUF value owns the open file. parse changes only in that it moves r into the new reader field instead of letting it drop:

src/gguf/gguf.rs (Rust)
        Self {
            reader: r,
            version,
            metadata,
            tensor_data_start,
            tensors,
        }

The interesting new method is load_tensor. Given a tensor's name, it finds the index entry, works out the shape, reads the bytes, and builds a Tensor:

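src/gguf/gguf.rs (Rust)
    pub fn load_tensor(&mut self, name: &str) -> Tensor {
        let info = self
            .tensors
            .iter()
            .find(|t| t.name == name)
            .unwrap_or_else(|| panic!("GGUF tensor not found: {name}"));
        // Copy the plain fields out now; `info` is not used again once
        // the read below needs `&mut self`.
        let ggml_type = info.ggml_type;
        let offset = info.offset;

        // GGUF lists the fastest-varying dimension first, so a matrix
        // arrives as [cols, rows]; flip it to our [rows, cols].
        let shape: Vec<usize> = match info.dims.len() {
            1 => vec![info.dims[0] as usize],
            2 => vec![info.dims[1] as usize, info.dims[0] as usize],
            n => panic!("{name}: unsupported tensor rank {n}"),
        };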
        let numel = tensor_numel(info);

First it looks the tensor up by name in the index, panicking if there's no such tensor, which during model loading means the file isn't the model we think it is. Then the shape. This is where the dimension-order reversal happens, exactly as flagged earlier. A 1-D tensor's shape is taken as-is. A 2-D tensor's GGUF dims are [cols, rows], so we reverse them to [rows, cols] ([info.dims[1], info.dims[0]]) to match our row-major shape[0] = rows convention. Ranks other than 1 or 2 don't occur in this model, so we panic on them rather than guess. numel is the element count, used next to size the read.

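src/gguf/gguf.rs (Rust)
        match ggml_type {
            // Type tag 0 is FP32: four little-endian bytes per element.
            0 => {
                let count = numel as usize;
                // Checked, like the other size arithmetic in this module.
                let byte_count = count.checked_mul(4).expect("tensor byte size overflow");
                let bytes = self.read_tensor_bytes(offset, byte_count);
                let mut data = Vec::with_capacity(count);
                for chunk in bytes.chunks_exact(4) {
                    data.push(f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]));
                }
                Tensor::new(data, shape)
            }
            ty => panic!("{name}: unsupported GGML type {ty}"),
        }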
    }

ggml_type is the GGUF type tag from I.1; 0 means FP32. For FP32 we read numel * 4 bytes (four bytes per float), then walk them four at a time with chunks_exact(4), decoding each group with f32::from_le_bytes. The decoded Vec<f32> and the computed shape go into Tensor::new, whose assertion confirms they agree. Any other type tag panics; the model we use in Act 1 is FP32 throughout, and the quantized Q8_0 layout (type 8) is decoded in II.6, where adding it is just a second match arm.
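To see the decoding step in isolation, here is a small sketch that works on an in-memory byte slice instead of a file. decode_f32_le is a hypothetical helper; it mirrors the loop above but uses map and collect, and it rejects lengths that are not a multiple of four, a check added for this example.

```rust
/// Decodes little-endian FP32 values from a byte slice.
/// Returns None if the length is not a multiple of four, since
/// chunks_exact would otherwise silently skip the trailing bytes.
fn decode_f32_le(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

The value 1.0 has the bit pattern 0x3F800000, so on disk it appears least significant byte first, as 00 00 80 3F. Feeding those four bytes to decode_f32_le gives back a one-element vector holding 1.0.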

The actual byte read is one small private helper:

src/gguf/gguf.rs (Rust)
    // Relies on std::io::{Read, Seek, SeekFrom} being in scope in gguf.rs.
    fn read_tensor_bytes(&mut self, offset: u64, byte_count: usize) -> Vec<u8> {
        // Offsets in the index are relative to the tensor data region.
        let start = self
            .tensor_data_start
            .checked_add(offset)
            .expect("tensor offset overflow");
        self.reader
            .seek(SeekFrom::Start(start))
            .expect("seek tensor data");
        let mut raw = vec![0u8; byte_count];
        self.reader.read_exact(&mut raw).expect("read tensor bytes");
        raw
    }

This is where the I.1 groundwork pays off. Recall that a tensor's offset is relative to the tensor data region, not the file start. The real file position is tensor_data_start + offset; the checked_add guards against a corrupt offset overflowing. We seek there, allocate a buffer of exactly the right size, and read_exact fills it. The BufReader mattered most during parsing, where reads were many and tiny. It contributes little here: seeking discards its internal buffer, and a read larger than that buffer goes straight to the underlying file, so each tensor costs roughly one seek plus one large read.
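The offset arithmetic can be tried without a model file by reading from an in-memory Cursor. The sketch below is an illustration under stated assumptions: read_region and fake_file are hypothetical names, the function is generic over any Read + Seek source, and it returns io::Result instead of panicking, which is a choice made for this example and not the chapter's behaviour.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

/// Reads `byte_count` bytes located `offset` bytes into the tensor data
/// region, which itself starts at `data_start` in the source.
fn read_region<R: Read + Seek>(
    src: &mut R,
    data_start: u64,
    offset: u64,
    byte_count: usize,
) -> std::io::Result<Vec<u8>> {
    // Absolute position = start of data region + relative offset.
    let start = data_start.checked_add(offset).ok_or_else(|| {
        std::io::Error::new(std::io::ErrorKind::InvalidData, "tensor offset overflow")
    })?;
    src.seek(SeekFrom::Start(start))?;
    let mut raw = vec![0u8; byte_count];
    // read_exact fails with UnexpectedEof if the source is too short.
    src.read_exact(&mut raw)?;
    Ok(raw)
}

/// Builds a fake "file": 8 header bytes followed by a 6-byte data region.
fn fake_file() -> Cursor<Vec<u8>> {
    let mut bytes = vec![0xAAu8; 8];
    bytes.extend_from_slice(&[10, 11, 12, 13, 14, 15]);
    Cursor::new(bytes)
}
```

With data_start = 8 and offset = 2 the read begins at absolute position 10, so asking for three bytes returns [12, 13, 14]. Asking for more bytes than remain yields an error rather than a short buffer, which is the property read_exact is chosen for.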

The gguf module file gets two new lines (mod util; and the use for Tensor) and src/lib.rs adds mod tensor;. Note Tensor is not re-exported from lib.rs yet: no binary needs to name it this chapter. It becomes part of the public surface later, when binaries start handling model outputs directly.

Where this leaves us

We now have a number container: Tensor, a flat Vec<f32> plus a shape, with the row-major layout r * C + c nailed down, and a way to fill it from a real model file. Calling load_tensor("blk.0.attn_q.weight") on a parsed GGUF returns the actual weights of that matrix, shaped correctly, ready to use.

What we can't do yet is anything with a tensor. There's no add, no multiply, no matmul. A tensor just sits there. The next chapter fixes that: it defines the Backend trait (every numeric operation a transformer needs) and a plain scalar CPU implementation of all of them. That trait is the seam every faster backend in Act 2 will plug into.