III.6: Decode scheduler

Since III.2, every request has run inside its own spawn_blocking closure: a whole chat turn (prefill plus every decode step) start to finish on its own thread. And it does work concurrently. The weights are immutable, Model and Backend are Send + Sync, and each request owns its own KV cache, so two simultaneous turns run in parallel without stepping on each other.

The problem is that nothing is in charge. Every request gets its own blocking thread (tokio's blocking pool will spawn up to 512 of them), so ten concurrent turns means ten full decode loops fighting over a handful of cores, with the OS scheduler deciding who makes progress. There is no queue, no admission control (nothing decides when a request may start), no fairness. And because each thread runs its own private forward pass, there is no point in the system where two requests' decode steps could ever be combined: the batching III.7 is building toward is structurally impossible in this shape.

So concurrency needs a different shape. Instead of "each request runs its own turn to completion," we want one component that owns the model, holds several requests' worth of state side by side, and advances all of them by interleaving their decode steps: request A's step, request B's step, A's step, B's step. To the model it's still one caller; to the users it's concurrent, and this time it's concurrency the server controls.

That component is the decode scheduler. This chapter builds it: a background worker thread that owns the model, holds a fixed set of decode slots, admits requests into free slots, and runs the decode loop across all active slots. The HTTP handlers stop calling the model; they submit a job and await a result.

The architecture

PLAINTEXT

HTTP handler ──submit(job)──> ┌─────────────────────────────┐
HTTP handler ──submit(job)──> │  scheduler worker thread     │
HTTP handler ──submit(job)──> │   owns: model, tokenizer,    │
        ▲                     │         backend, prefix cache│
        │                     │                              │
        │                     │   slots: [A][B][ ][ ]        │
        │                     │   waiting queue: [C][D]      │
        └────result───────────│                              │
                              │   loop: drain submits        │
                              │         evict finished       │
                              │         admit from queue     │
                              │         decode all slots     │
                              └─────────────────────────────┘

One worker thread owns everything heavy. HTTP handlers communicate with it over channels: a handler sends a SubmitRequest (the job plus a reply channel) and waits for the reply. The worker has a fixed array of slots (each slot is one in-flight request's full decode state) and a waiting queue for jobs that arrived when all slots were full.

The worker runs a loop: pull new submissions into the queue, retire any finished slots (sending their results back), admit queued jobs into freed slots, then run one decode step on every active slot. Round and round. Each pass advances every active request by one token, so N requests make progress together.
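Before the real types arrive, the cycle can be tried out on toy data. The sketch below is not the chapter's worker: ToySlot and simulate are hypothetical names, a "decode step" is just a counter decrement, and submissions are all queued up front rather than arriving over a channel. It only shows the evict, admit, decode rhythm and the completion order that falls out of it.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a slot: a request id and the decode steps it still needs.
struct ToySlot {
    id: usize,
    remaining: usize,
}

/// Runs the evict / admit / decode cycle over `jobs` (steps needed per request,
/// in submission order) and returns the request ids in completion order.
fn simulate(max_slots: usize, jobs: &[usize]) -> Vec<usize> {
    let mut queue: VecDeque<ToySlot> = jobs
        .iter()
        .enumerate()
        .map(|(id, &remaining)| ToySlot { id, remaining })
        .collect();
    let mut slots: Vec<Option<ToySlot>> = (0..max_slots).map(|_| None).collect();
    let mut finished = Vec::new();

    loop {
        // Evict: retire every slot that has no steps left, freeing its entry.
        for entry in slots.iter_mut() {
            if entry.as_ref().is_some_and(|s| s.remaining == 0) {
                finished.push(entry.take().unwrap().id);
            }
        }
        // Admit: fill free entries from the front of the waiting queue.
        for entry in slots.iter_mut().filter(|e| e.is_none()) {
            match queue.pop_front() {
                Some(job) => *entry = Some(job),
                None => break,
            }
        }
        // Nothing occupied after admission means nothing is left to do.
        if slots.iter().all(Option::is_none) {
            return finished;
        }
        // Decode tick: every occupied slot advances by exactly one step.
        for slot in slots.iter_mut().flatten() {
            slot.remaining = slot.remaining.saturating_sub(1);
        }
    }
}
```

With four slots, simulate(4, &[5, 2, 5]) returns [1, 0, 2]: the short request overtakes the long ones. With one slot the same jobs come back as [0, 1, 2], strictly in submission order.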

This chapter interleaves the slots; each slot still gets its own forward pass per tick. III.7 fuses those per-slot passes into one batched pass. The slot and worker structure built here is what III.7 extends.

The III.6 commit adds a scheduler module with four files: a job type, the scheduler handle, the slot, and the worker loop.

RUST

mod job;
mod scheduler;
mod slot;
mod worker;
 
pub use job::ChatJob;
pub use scheduler::InferenceScheduler;

The job

src/scheduler/job.rs defines what a caller submits and the envelope it travels in:

RUST

use std::sync::mpsc::Sender;
 
use tokio::sync::mpsc::UnboundedSender;
 
use crate::chat::{ChatTemplateMessage, ChatTurnResult};
 
pub struct ChatJob {
    pub messages: Vec<ChatTemplateMessage>,
    pub max_new_tokens: usize,
    pub kv_mode: &'static str,
    pub stream_delta: Option<UnboundedSender<String>>,
}
 
pub(crate) struct SubmitRequest {
    pub job: ChatJob,
    pub reply: Sender<Result<ChatTurnResult, String>>,
}

A ChatJob is one chat turn's request: the messages, the token cap, the KV cache mode. stream_delta is an optional tokio channel sender. If the request is streaming (SSE), the handler passes a sender here, and the slot pushes each text delta into it as tokens are generated. If non-streaming, it's None and the slot just builds the final result.

SubmitRequest wraps a job with a reply channel, a std::sync::mpsc::Sender the worker uses to send the finished ChatTurnResult back to whoever submitted. Note the two channel kinds: std::sync::mpsc for the reply (the worker thread and the blocking submitter are both plain threads) and tokio::sync::mpsc for stream deltas (the async runtime consumes those).

The scheduler handle

src/scheduler/scheduler.rs is the public face, what HTTP handlers hold. It's a thin handle around a channel to the worker:

RUST

use std::sync::Arc;
use std::sync::mpsc::{self, Sender};
 
use crate::cache::RadixPrefixCache;
use crate::chat::ChatTurnResult;
use crate::model::Model;
use crate::backend::Backend;
use crate::tokenizer::Tokenizer;
 
use super::job::{ChatJob, SubmitRequest};
use super::worker;
 
pub struct InferenceScheduler {
    submit_tx: Sender<SubmitRequest>,
}
 
impl InferenceScheduler {
    pub fn new(
        max_slots: usize,
        model: Arc<dyn Model>,
        tokenizer: Arc<dyn Tokenizer>,
        backend: Arc<dyn Backend>,
        prefix_cache: Option<Arc<RadixPrefixCache>>,
    ) -> Arc<Self> {
        assert!(max_slots >= 1, "scheduler max_slots must be >= 1");
        let submit_tx = worker::spawn(max_slots, model, tokenizer, backend, prefix_cache);
        Arc::new(Self { submit_tx })
    }

InferenceScheduler holds just one thing: submit_tx, the sending end of the channel into the worker. new spawns the worker thread (handing it the model, tokenizer, backend, and prefix cache; the worker owns them now) and keeps the returned sender. The prefix cache from III.5 finally has a home: the worker owns it, so every request the worker admits shares one cache.

Two submit methods. submit is synchronous, for callers already on a blocking thread:

RUST

    pub fn submit(&self, job: ChatJob) -> Result<ChatTurnResult, String> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.submit_tx
            .send(SubmitRequest {
                job,
                reply: reply_tx,
            })
            .map_err(|_| "inference scheduler shut down".to_string())?;
        reply_rx
            .recv()
            .map_err(|_| "scheduler reply channel closed".to_string())?
    }

Make a fresh one-shot reply channel, send the job plus the reply sender into the worker, then block on reply_rx.recv() until the worker sends the result back. The caller's thread parks; the worker does the work.
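The same envelope-plus-reply pattern works with nothing but the standard library. Here is a minimal sketch, assuming a toy worker that counts characters instead of running a model; Request, spawn_worker, and this free-standing submit are hypothetical stand-ins for SubmitRequest, the worker, and InferenceScheduler::submit.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Hypothetical envelope: a payload plus the sender the worker replies on.
struct Request {
    text: String,
    reply: Sender<Result<usize, String>>,
}

/// Spawns a toy worker that "processes" a request by counting its characters.
fn spawn_worker() -> Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // The loop ends once every Sender<Request> has been dropped.
        for req in rx {
            let result = if req.text.is_empty() {
                Err("empty request".to_string())
            } else {
                Ok(req.text.chars().count())
            };
            // The submitter may have given up already; ignore a failed send.
            let _ = req.reply.send(result);
        }
    });
    tx
}

/// Blocking submit: a fresh reply channel per call, then park on its receiver.
fn submit(worker: &Sender<Request>, text: &str) -> Result<usize, String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    worker
        .send(Request {
            text: text.to_string(),
            reply: reply_tx,
        })
        .map_err(|_| "worker shut down".to_string())?;
    // recv yields the worker's own Result; `?` strips only the channel error.
    reply_rx
        .recv()
        .map_err(|_| "reply channel closed".to_string())?
}
```

Because each call owns its reply channel, any number of threads can share the worker's sender and each still gets back exactly its own result.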

submit_async is the version an axum handler calls; it wraps submit in spawn_blocking so the async runtime isn't blocked:

RUST

    pub async fn submit_async(self: &Arc<Self>, job: ChatJob) -> Result<ChatTurnResult, String> {
        let sched = Arc::clone(self);
        match tokio::task::spawn_blocking(move || sched.submit(job)).await {
            Ok(result) => result,
            Err(e) => Err(format!("scheduler join error: {e}")),
        }
    }
}

The spawn_blocking here is not running the model; it's only parking on reply_rx.recv(). The model runs on the worker thread. The blocking thread this uses just waits.

The slot

src/scheduler/slot.rs is the in-flight state of one request inside the worker. It's essentially the ChatDecodeState from III.1 plus the bookkeeping a server needs: where to send the reply, whether to stream, whether it's done.

The scheduler drives III.1's chat internals directly, so src/chat/mod.rs opens them up to the rest of the crate with a pub(crate) re-export block (this is the full file now):

RUST

mod generate;
mod prompt;
mod template;
 
pub use generate::{ChatTurnResult, run_chat_turn_streaming_with_prefix};
pub use template::ChatTemplateMessage;
 
pub(crate) use generate::{
    ChatDecodeStep, KvPushPhase, build_chat_turn_result, chat_decode_kv_one_step,
    kv_decode_push_phase, prepare_chat_decode_state, stream_delta,
};

With those exposed, slot.rs opens with its imports and the Slot struct:

RUST

use std::sync::mpsc::Sender;
use std::sync::Arc;
 
use tokio::sync::mpsc::UnboundedSender;
 
use crate::cache::{KvCache, RadixPrefixCache};
use crate::chat::{
    ChatDecodeStep, ChatTurnResult, build_chat_turn_result, chat_decode_kv_one_step,
    prepare_chat_decode_state, stream_delta,
};
use crate::decode::Metrics;
use crate::model::Model;
use crate::backend::Backend;
use crate::tokenizer::Tokenizer;
 
use super::job::SubmitRequest;
 
pub(crate) struct Slot {
    pub prompt: String,
    pub prompt_ids: Vec<usize>,
    pub eos_token_id: usize,
    pub ids: Vec<usize>,
    pub next_id: usize,
    pub cache: Box<dyn KvCache>,
    pub metrics: Metrics,
    pub max_new_tokens: usize,
    pub reply: Sender<Result<ChatTurnResult, String>>,
    pub stream: Option<UnboundedSender<String>>,
    pub prev_text: String,
    pub finished: bool,
    pub err: Option<String>,
}

The first chunk, prompt through max_new_tokens, is the decode state: the prompt and its ids, the running id sequence, the predicted next token, the KV cache, timing, the token cap. The rest is server bookkeeping: reply is where the final result goes; stream is the optional delta channel; prev_text tracks streamed text for the delta computation; finished and err say whether and how the slot ended.

Building a slot runs the prefill: III.1's prepare_chat_decode_state, now with the prefix cache wired in:

RUST

impl Slot {
    /// Builds a slot, or sends the failure on the reply channel and returns `None`.
    pub fn build(
        req: SubmitRequest,
        model: &Arc<dyn Model>,
        tokenizer: &dyn Tokenizer,
        backend: &Arc<dyn Backend>,
        prefix_cache: Option<&Arc<RadixPrefixCache>>,
    ) -> Option<Slot> {
        let job = req.job;
 
        let st = match prepare_chat_decode_state(
            Arc::clone(model),
            tokenizer,
            Arc::clone(backend),
            &job.messages,
            job.kv_mode,
            prefix_cache,
        ) {
            Ok(st) => st,
            Err(e) => {
                let _ = req.reply.send(Err(e));
                return None;
            }
        };
 
        Some(Slot {
            prompt: st.prompt,
            prompt_ids: st.prompt_ids,
            eos_token_id: st.eos_token_id,
            ids: st.ids,
            next_id: st.next_id,
            cache: st.cache,
            metrics: st.metrics,
            max_new_tokens: job.max_new_tokens,
            reply: req.reply,
            stream: job.stream_delta,
            prev_text: String::new(),
            finished: false,
            err: None,
        })
    }
 
    pub fn is_active(&self) -> bool {
        !self.finished && self.err.is_none()
    }

build runs prefill, including a prefix-cache lookup since prepare_chat_decode_state does that now. If prefill fails, the error goes straight back on the reply channel and build returns None (no slot created). Otherwise the decode state moves field-by-field into a fresh Slot. is_active is the worker's "still has work" predicate.

decode_one advances the slot by one token, the per-tick step:

RUST

    pub fn decode_one(
        &mut self,
        model: &dyn Model,
        tokenizer: &dyn Tokenizer,
        backend: &dyn Backend,
    ) {
        if !self.is_active() {
            return;
        }
 
        let ids_before = self.ids.len();
        let step = chat_decode_kv_one_step(
            &mut self.ids,
            &mut self.next_id,
            self.cache.as_mut(),
            self.prompt_ids.len(),
            self.eos_token_id,
            self.max_new_tokens,
            model,
            backend,
            &mut self.metrics,
            |_| {},
        );
        self.finish_step(tokenizer, ids_before, step);
    }
 
    pub fn fail(&mut self, err: String) {
        self.err = Some(err);
        self.finished = true;
    }

decode_one calls chat_decode_kv_one_step from III.1, the same single-step function chat-repl uses. It records ids_before so finish_step can tell whether a token was produced, then hands the step result to finish_step.

finish_step handles streaming and end-of-turn:

RUST

    pub fn finish_step(
        &mut self,
        tokenizer: &dyn Tokenizer,
        ids_before: usize,
        step: Result<ChatDecodeStep, String>,
    ) {
        if self.stream.is_some() && self.ids.len() > ids_before {
            self.emit_stream_delta(tokenizer);
        }
        match step {
            Ok(ChatDecodeStep::Continue) => {}
            Ok(ChatDecodeStep::Finished { .. }) => self.finished = true,
            Err(e) => {
                self.err = Some(e);
                self.finished = true;
            }
        }
    }

If this slot is streaming and a token landed, push a delta. Then act on the step: Continue does nothing, Finished marks the slot done, an error records the error and marks it done.

When a slot finishes, send_result builds the final ChatTurnResult and sends it on the reply channel:

RUST

    pub fn send_result(self, tokenizer: &dyn Tokenizer) {
        if let Some(e) = self.err {
            let _ = self.reply.send(Err(e));
            return;
        }
 
        let turn = build_chat_turn_result(
            self.prompt,
            self.prompt_ids,
            self.eos_token_id,
            self.ids,
            tokenizer,
            &self.metrics,
        );
        let _ = self.reply.send(Ok(turn));
    }
 
    fn emit_stream_delta(&mut self, tokenizer: &dyn Tokenizer) {
        let new_tokens = &self.ids[self.prompt_ids.len()..];
        if let Some(delta) = stream_delta(tokenizer, new_tokens, &mut self.prev_text)
            && let Some(tx) = &self.stream
        {
            let _ = tx.send(delta);
        }
    }
}

send_result takes self by value; the slot is consumed. On error it sends the error; otherwise it builds the result with III.1's build_chat_turn_result and sends it. emit_stream_delta reuses III.1's stream_delta, which decodes the generated tokens, diffs the text against prev_text, and returns only the new part; emit_stream_delta then pushes that delta down the slot's stream channel.
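The diffing half of that is easy to picture in isolation. This is a sketch of the idea only, not III.1's stream_delta: the hypothetical text_delta below takes already-decoded text rather than a tokenizer and token ids, and it assumes the decoded text only ever grows by appending.

```rust
/// Returns the text appended to `full` since the previous call and records it
/// in `prev`. Returns None when there is nothing new to send.
fn text_delta(full: &str, prev: &mut String) -> Option<String> {
    // If `full` does not extend `prev`, this sketch sends nothing and leaves
    // `prev` untouched (an assumption, not necessarily what III.1 does).
    let delta = full.strip_prefix(prev.as_str())?;
    if delta.is_empty() {
        return None;
    }
    let delta = delta.to_string();
    prev.push_str(&delta);
    Some(delta)
}
```

Called with "Hel" and then "Hello" against the same prev, it yields "Hel" and then "lo"; a third call with "Hello" yields None, so the channel never carries a duplicate.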

Figure: one request's path through the scheduler. A request arrives on a channel, waits in the queue, is admitted when a slot frees, prefills on the worker, decodes interleaved with other slots, and finishes, freeing the slot for the next request. The queue and the fixed slot count are the admission control; everything after admission runs on the single worker thread.

The worker loop

src/scheduler/worker.rs is the engine. It spawns the thread and runs the loop. First, spawn:

RUST

use std::collections::VecDeque;
use std::sync::Arc;
use std::sync::mpsc::{self, Receiver, Sender, TryRecvError};
use std::thread;
 
use crate::cache::RadixPrefixCache;
use crate::model::Model;
use crate::backend::Backend;
use crate::tokenizer::Tokenizer;
 
use super::job::SubmitRequest;
use super::slot::Slot;
 
/// Spawns the worker thread and returns the submit sender.
pub(super) fn spawn(
    max_slots: usize,
    model: Arc<dyn Model>,
    tokenizer: Arc<dyn Tokenizer>,
    backend: Arc<dyn Backend>,
    prefix_cache: Option<Arc<RadixPrefixCache>>,
) -> Sender<SubmitRequest> {
    let (ready_tx, ready_rx) = mpsc::channel();
    thread::Builder::new()
        .name("inferno-scheduler".into())
        .spawn(move || {
            let (submit_tx, submit_rx) = mpsc::channel();
            ready_tx
                .send(submit_tx)
                .expect("hand off submit sender to InferenceScheduler");
            let mut worker = WorkerLoop {
                submit_rx,
                model,
                tokenizer,
                backend,
                prefix_cache,
                slots: (0..max_slots).map(|_| None).collect(),
                waiting_queue: VecDeque::new(),
                submit_gone: false,
            };
            worker.main_loop();
        })
        .expect("spawn inference scheduler thread");
    ready_rx
        .recv()
        .expect("scheduler worker failed to start")
}

spawn starts a named thread. Inside it, the worker creates its own submit channel (submit_tx/submit_rx) and hands submit_tx back to spawn over a one-shot ready channel. spawn returns that sender; it's what InferenceScheduler keeps. The dance ensures the channel is created on the worker side and the caller only gets the handle once the worker thread is running.
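The handoff is easier to see with the model stripped away. A minimal sketch, assuming a toy worker that just sums the numbers it receives; spawn_summer and the thread name are made up for illustration, and unlike the chapter's spawn it also returns the JoinHandle so the total can be read back.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Spawns a toy worker that sums everything it receives and returns the total
/// once every sender has been dropped.
fn spawn_summer() -> (Sender<u64>, JoinHandle<u64>) {
    let (ready_tx, ready_rx) = mpsc::channel();
    let handle = thread::Builder::new()
        .name("toy-worker".into())
        .spawn(move || {
            // The work channel is created on the worker thread...
            let (work_tx, work_rx) = mpsc::channel::<u64>();
            // ...and its sending half is handed back over the ready channel.
            ready_tx.send(work_tx).expect("hand off work sender");
            // Iterating the receiver ends when all senders are gone.
            work_rx.iter().sum::<u64>()
        })
        .expect("spawn toy worker");
    // Blocks until the worker thread is running and has sent its sender back.
    let work_tx = ready_rx.recv().expect("worker failed to start");
    (work_tx, handle)
}
```

Since the worker gives away its only sender, dropping the returned Sender is what closes the channel and lets the thread finish, the same signal the real worker tracks with submit_gone.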

WorkerLoop is the state the loop owns:

RUST

struct WorkerLoop {
    submit_rx: Receiver<SubmitRequest>,
    model: Arc<dyn Model>,
    tokenizer: Arc<dyn Tokenizer>,
    backend: Arc<dyn Backend>,
    prefix_cache: Option<Arc<RadixPrefixCache>>,
    slots: Vec<Option<Slot>>,
    waiting_queue: VecDeque<SubmitRequest>,
    submit_gone: bool,
}

slots is a fixed-length Vec<Option<Slot>>, where None is a free slot and Some is occupied. waiting_queue holds submissions when every slot is full. submit_gone records that the submit channel has disconnected, meaning every sender (and so the InferenceScheduler that holds it) has been dropped and it is time to shut down.

The main loop:

RUST

impl WorkerLoop {
    fn main_loop(&mut self) {
        loop {
            if !self.submit_gone {
                self.drain_submits();
            }
 
            self.evict_finished();
            self.admit();
 
            let any_active = self
                .slots
                .iter()
                .any(|s| s.as_ref().is_some_and(Slot::is_active));
 
            if any_active {
                self.decode_tick();
            } else if self.waiting_queue.is_empty() && self.slots.iter().all(Option::is_none) {
                if self.submit_gone {
                    return;
                }
                // Fully idle: block until the next job arrives.
                match self.submit_rx.recv() {
                    Ok(req) => self.waiting_queue.push_back(req),
                    Err(_) => {
                        self.submit_gone = true;
                        return;
                    }
                }
            }
        }
    }

Each iteration: drain newly submitted jobs into the queue, retire finished slots, admit queued jobs into free slots, then, if any slot is active, run one decode step on each active slot. If nothing is active and nothing is queued, the worker is fully idle, so it blocks on submit_rx.recv() instead of spinning, sleeping until the next request wakes it. When the last submit sender is dropped the channel closes; the worker finishes whatever is still queued or in flight, then returns.

The four steps. drain_submits non-blockingly pulls every queued submission:

RUST

    fn drain_submits(&mut self) {
        loop {
            match self.submit_rx.try_recv() {
                Ok(req) => self.waiting_queue.push_back(req),
                Err(TryRecvError::Empty) => return,
                Err(TryRecvError::Disconnected) => {
                    self.submit_gone = true;
                    return;
                }
            }
        }
    }

try_recv doesn't block; it grabs everything available and returns the moment the channel is empty, so a busy worker keeps decoding rather than waiting on new work.
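The three outcomes of try_recv can be exercised on their own. A small sketch, assuming a generic free function (drain is a hypothetical name) that returns the "senders are gone" flag instead of storing it in a field:

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{Receiver, TryRecvError};

/// Moves everything currently buffered in `rx` into `queue` without blocking.
/// Returns true once the channel is empty and every sender has been dropped.
fn drain<T>(rx: &Receiver<T>, queue: &mut VecDeque<T>) -> bool {
    loop {
        match rx.try_recv() {
            Ok(item) => queue.push_back(item),
            // Nothing buffered right now, but senders still exist.
            Err(TryRecvError::Empty) => return false,
            // Nothing buffered and nobody left to send more.
            Err(TryRecvError::Disconnected) => return true,
        }
    }
}
```

Messages buffered before the senders went away are still delivered; std's channel reports Disconnected only after the buffer is empty, so no submission is lost at shutdown.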

evict_finished retires done slots:

RUST

    fn evict_finished(&mut self) {
        let tokenizer = self.tokenizer.as_ref();
        for entry in self.slots.iter_mut() {
            let done = entry.as_ref().is_some_and(|s| s.finished);
            if done {
                entry.take().unwrap().send_result(tokenizer);
            }
        }
    }

For each finished slot, take() removes it (leaving None) and send_result consumes it, sending the final result and freeing the slot for the next job.

admit fills free slots from the queue:

RUST

    fn admit(&mut self) {
        for entry in self.slots.iter_mut() {
            if entry.is_some() {
                continue;
            }
            let Some(req) = self.waiting_queue.pop_front() else { return };
 
            if let Some(slot) = Slot::build(
                req,
                &self.model,
                self.tokenizer.as_ref(),
                &self.backend,
                self.prefix_cache.as_ref(),
            ) {
                *entry = Some(slot);
            }
        }
    }

For each empty slot, pull the next queued job and Slot::build it, which runs that request's prefill. If build returns None (prefill failed; the error already went back to the caller), the slot stays empty until the next pass through the loop.

decode_tick advances every active slot one token:

RUST

    fn decode_tick(&mut self) {
        for slot in self.slots.iter_mut().flatten() {
            slot.decode_one(
                self.model.as_ref(),
                self.tokenizer.as_ref(),
                self.backend.as_ref(),
            );
        }
    }
}

One decode_one per occupied slot. This is the interleave: across many ticks, every active request advances by one token per tick, so they all make steady progress. (It's also the function III.7 replaces, fusing the per-slot forward passes into one.)

Wiring the server to the scheduler

ChatServerState no longer holds the model directly; it holds the scheduler. The router's imports swap accordingly. The model/backend/tokenizer imports go, the scheduler comes in:

RUST

use crate::chat::ChatTemplateMessage;
// ...
use crate::scheduler::{ChatJob, InferenceScheduler};

RUST

pub struct ChatServerState {
    pub scheduler: Arc<InferenceScheduler>,
    pub default_max_tokens: usize,
    pub model_label: String,
    pub kv_cache_mode: &'static str,
}

The non-streaming handler stops calling the model and submits a job instead:

RUST

    let job = ChatJob {
        messages,
        max_new_tokens: max_tokens,
        kv_mode: state.kv_cache_mode,
        stream_delta: None,
    };
    let out = state.scheduler.submit_async(job).await.unwrap();

stream_delta is None because this path is non-streaming. submit_async hands the job to the worker and awaits the result. The handler is now tiny: build a job, submit, shape the response.

Streaming is similar. The handler's streaming branch now passes the scheduler instead of the whole state:

RUST

        return stream_response(
            messages,
            max_tokens,
            model_name,
            Arc::clone(&state.scheduler),
            state.kv_cache_mode,
        );

and stream_response itself bridges the slot's tokio delta channel to the SSE channel from III.3:

RUST

fn stream_response(
    messages: Vec<ChatTemplateMessage>,
    max_tokens: usize,
    model_name: String,
    sched: Arc<InferenceScheduler>,
    kv_mode: &'static str,
) -> Response {
    let (sse_tx, sse_rx) = tokio::sync::mpsc::unbounded_channel::<String>();
    let (delta_tx, mut delta_rx) = tokio::sync::mpsc::unbounded_channel::<String>();
 
    let id = new_completion_id();
    let created = unix_secs();
 
    {
        let tx = sse_tx.clone();
        let id = id.clone();
        let model = model_name.clone();
        tokio::spawn(async move {
            while let Some(token) = delta_rx.recv().await {
                let chunk = build_chunk(&id, created, &model, None, Some(token), None);
                if tx.send(chunk).is_err() {
                    break;
                }
            }
        });
    }

Two channels now. delta_tx/delta_rx is the slot's raw-text delta channel; sse_tx/sse_rx is the SSE chunk channel. A small tokio::spawned task bridges them: read each raw delta from delta_rx, wrap it with build_chunk into a chat.completion.chunk, push it to sse_tx. The slot doesn't know about SSE framing; it just sends text.

The job is submitted with delta_tx as its stream_delta:

RUST

    tokio::task::spawn_blocking(move || {
        sse_tx
            .send(build_chunk(
                &id,
                created,
                &model_name,
                Some("assistant"),
                None,
                None,
            ))
            .unwrap();
 
        let job = ChatJob {
            messages,
            max_new_tokens: max_tokens,
            kv_mode,
            stream_delta: Some(delta_tx),
        };
        let out = sched.submit(job).unwrap();
 
        let reason = if out.hit_stop { "stop" } else { "length" };
        let _ = sse_tx.send(build_chunk(
            &id,
            created,
            &model_name,
            None,
            None,
            Some(reason),
        ));
        let _ = sse_tx.send("[DONE]".into());
    });

Send the opening role chunk, submit the job (Some(delta_tx); the worker's slot now streams deltas through it), block on the result, then send the finish chunk and [DONE]. The rest of stream_response, turning sse_rx into the SSE body, is unchanged from III.3.

The chat-server binary

For binaries to name the scheduler, lib.rs declares the new module and re-exports its two public types:

RUST

mod scheduler;
// ...
pub use scheduler::{ChatJob, InferenceScheduler};

chat-server builds an InferenceScheduler instead of holding the model. Its import list grows to match:

RUST

    ChatServerState, CliArgs, InferenceScheduler, RadixPrefixCache, chat_router, create_backend,
    load_from_gguf_path, rust_log_enables_trace,

It learns --max-concurrent for the slot count, defaulting to the machine's parallelism:

RUST

fn default_max_concurrent() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get()) // NonZeroUsize is never zero, so no clamp is needed
        .unwrap_or(4)
}

and --prefix-cache-max from III.5 is read here too:

RUST

    let prefix_cache_max = args.prefix_cache_max(0);

RUST

    let max_concurrent = args.max_concurrent(default_max_concurrent());

--max-concurrent is one more flag in the CLI parser (field, local, arm, struct literal, and a getter), the same four-places-plus-getter as every flag before it:

RUST

    max_concurrent: Option<usize>,
// ...
        let mut max_concurrent = None;
// ...
                Some("--max-concurrent") => {
                    cur.advance();
                    let n = parse_usize(&mut cur, "--max-concurrent");
                    assert!(n >= 1, "--max-concurrent must be >= 1");
                    max_concurrent = Some(n);
                }
// ...
            max_concurrent,

RUST

    pub fn max_concurrent(&self, default: usize) -> usize {
        self.max_concurrent.unwrap_or(default)
    }

After loading the model, main builds the prefix cache and the scheduler, then puts the scheduler in the state:

RUST

    let prefix_cache = if prefix_cache_max > 0 {
        Some(RadixPrefixCache::new(prefix_cache_max))
    } else {
        None
    };
 
    eprintln!(
        "model: {model_label} ({}), backend: {}, kv: {kv_cache_mode}, prefix max: {}, slots: {}",
        gguf_path.display(),
        backend_name,
        prefix_cache_max,
        max_concurrent,
    );
 
    let scheduler =
        InferenceScheduler::new(max_concurrent, model, tokenizer, backend, prefix_cache);
 
    let state = Arc::new(ChatServerState {
        scheduler,
        default_max_tokens,
        model_label,
        kv_cache_mode,
    });

InferenceScheduler::new consumes the model, tokenizer, backend, and prefix cache, spawns the worker thread, and returns the handle. From here on, every request flows through the scheduler.

Running it

Start the server with 4 slots and a prefix cache:

BASH

cargo run --release --bin chat-server -- \
  --kv paged --max-concurrent 4 --prefix-cache-max 32 path/to/qwen3-0.6b-q8_0.gguf 512

PLAINTEXT

model: qwen3-0.6b-q8_0 (path/to/qwen3-0.6b-q8_0.gguf), backend: simd, kv: paged, prefix max: 32, slots: 4
listening on http://127.0.0.1:8000

Now fire several requests at once, three curls backgrounded from a single shell loop:

BASH

for q in "Name a color." "Name an animal." "Name a planet."; do
  curl -s http://127.0.0.1:8000/v1/chat/completions \
    -H 'content-type: application/json' \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"$q\"}]}" &
done; wait

PLAINTEXT

{"id":"chatcmpl-18c20f6b2e26e908","object":"chat.completion",...,"choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user asked me to name an animal. Let me think. [... ~170 more thinking tokens ...] \n</think>\n\nA common and easy-to-remember animal is a **cat**."},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":195,"total_tokens":207}}
{"id":"chatcmpl-18c20f6d8dd967e8","object":"chat.completion",...,"choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user asked me to name a planet. Let me think. [... ~290 more thinking tokens ...] \n</think>\n\nOne of the planets in our solar system is **Mercury**."},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":321,"total_tokens":333}}
{"id":"chatcmpl-18c20f6d8e2c3338","object":"chat.completion",...,"choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user asked me to name a color. Let me think. [... ~280 more thinking tokens ...] \n</think>\n\nHello! Here are some common color names:  \n- **Red**  \n- **Blue**  \n- **Green**  [...]"},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":321,"total_tokens":333}}

All three are admitted into separate slots and decoded together: the worker interleaves their steps instead of running them strictly back to back. The responses come back in completion order, not submission order; the animal question thought for fewer tokens (195 against 321), so it finished first. With --max-concurrent 1 the same three requests would queue and run one after another; with 4 slots they overlap.

Where this leaves us

The server handles concurrent requests. A background worker owns the model, tokenizer, backend, and prefix cache; HTTP handlers submit jobs and await results over channels. A fixed pool of slots holds in-flight requests, a waiting queue holds the overflow, and the worker loop admits, decodes, and retires slots in a steady cycle.

But decode_tick runs each slot's forward pass separately: four active slots means four independent forward passes per tick. Each of those passes does a matrix-times-vector projection, leaving the hardware mostly idle. Run four sequences as one batched pass and the projections become a matrix-times-matrix, the work the hardware is actually built for. The last chapter, III.7, fuses the per-slot passes into one batched decode.
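To make the matrix-times-vector versus matrix-times-matrix point concrete, here is a toy sketch in plain Rust. It is not III.7's implementation: matvec and matmat are hypothetical names, the weights are a flat row-major f32 slice, and nothing here is optimized. It only shows that the batched form computes the same numbers while walking the weight matrix once for all sequences.

```rust
/// y = W x, with W stored row-major as rows x cols.
fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            let row = &w[r * cols..(r + 1) * cols];
            row.iter().zip(x).map(|(w_i, x_i)| w_i * x_i).sum::<f32>()
        })
        .collect()
}

/// Batched form: one walk over W serves every input vector in `xs`.
/// Returns one output vector per input vector.
fn matmat(w: &[f32], rows: usize, cols: usize, xs: &[Vec<f32>]) -> Vec<Vec<f32>> {
    assert_eq!(w.len(), rows * cols);
    let mut out = vec![vec![0.0f32; rows]; xs.len()];
    for r in 0..rows {
        // Each weight row is sliced once...
        let row = &w[r * cols..(r + 1) * cols];
        // ...and reused for every sequence in the batch.
        for (seq, x) in xs.iter().enumerate() {
            assert_eq!(x.len(), cols);
            out[seq][r] = row.iter().zip(x).map(|(w_i, x_i)| w_i * x_i).sum::<f32>();
        }
    }
    out
}
```

Calling matvec once per slot reads all of W once per slot; matmat reads it once per tick regardless of how many slots are active, which is roughly where the batched pass gets its advantage when weight reads dominate.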