Model bank — bench metrics

RTX 5090 32 GB · ollama 0.31.1 (CUDA) · UD-Q4_K_XL quants · 256-token decode, warm · 2026-07-04

Peak decode

217 tok/s
GLM-4.7-Flash

Best MTP gain

+70%
Qwen3.6-35B-A3B · 58 → 99 tok/s

Models banked

5
76 GB · mirrored to HF

Engine upgrade

+56%
gemma4 · 0.20.5 → 0.31.1

Decode speed by model

tokens/second, single stream · MTP = multi-token-prediction speculative decoding (draft-mtp)

[Bar chart per model: baseline decode vs. decode with MTP drafting; values are listed in the full table.]
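A minimal sketch of how a single-stream decode figure like these could be reproduced. It relies on the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields that Ollama's `/api/generate` response reports. The endpoint URL is the default local one, while the helper names and the sample response are hypothetical. This is not necessarily how the numbers on this page were collected.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default local endpoint


def decode_tok_per_s(resp: dict) -> float:
    """Decode throughput from an Ollama generate response.

    eval_count is the number of generated tokens; eval_duration is the time
    spent generating them, in nanoseconds.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def bench(model: str, prompt: str, num_predict: int = 256) -> float:
    """Run one non-streaming generation and return decode tokens/second."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},  # cap decode length at 256 tokens
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return decode_tok_per_s(json.load(r))


# Offline check with a made-up response: 256 tokens in 2 s -> 128.0 tok/s
sample = {"eval_count": 256, "eval_duration": 2_000_000_000}
print(decode_tok_per_s(sample))
```

For a warm measurement, one would call `bench(...)` once to load the model, discard that result, and report the next run.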

Draft acceptance by position

P(token k accepted | k−1 accepted) · Qwen3.6-27B, draft_num_predict 4

Acceptance summary

per-run drafting statistics from the serve log

model              acceptance   mean len   speedup
Qwen3.6-27B        0.55         3.00       +60%
Qwen3.6-35B-A3B    0.47         2.89       +70%
Qwen3.5-9B         0.36–0.47    2.4–2.9    n/m

Acceptance decays quickly past draft position 2, so lowering draft_num_predict to 2–3 is the tuning lever if verification overhead shows up at higher batch sizes.
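To illustrate why the later draft positions add so little, here is a sketch that turns per-position conditional acceptance probabilities into an expected number of tokens per verification step. The probabilities are made up for illustration and are not the measured values. The function name is hypothetical, and the model assumes the target contributes exactly one token of its own on every step.

```python
from itertools import accumulate
from operator import mul


def expected_tokens_per_step(cond_accept: list[float]) -> float:
    """Expected tokens emitted per verification step.

    cond_accept[k] is P(draft token k+1 accepted | all earlier ones accepted).
    A draft token survives only if every earlier one did, so its survival
    probability is the running product of the conditionals. The leading 1.0
    is the token the target model supplies each step (an assumption).
    """
    survival = accumulate(cond_accept, mul)
    return 1.0 + sum(survival)


# Hypothetical decaying acceptance profile (NOT the measured values).
profile = [0.8, 0.6, 0.4, 0.2]
for n in range(1, len(profile) + 1):
    print(n, round(expected_tokens_per_step(profile[:n]), 3))
```

With this made-up profile, the fourth draft token adds only about 0.04 expected tokens per step, while still costing a draft pass and a verification slot. That is the trade-off a shorter draft length targets.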

Full table

everything measured, including what the charts summarize

model                  size    baseline tok/s   MTP tok/s   Δ      acceptance   note
GLM-4.7-Flash          17 GB   217              -           -      -            fastest banked; MoE, no MTP head
Qwen3.5-9B-MTP         6 GB    n/m              174         -      0.36–0.47    T0 utility tier
Qwen3.6-27B-MTP        17 GB   68               110         +60%   0.55         dense T1 daily driver
Qwen3.6-35B-A3B-MTP    22 GB   58               99          +70%   0.47         T2 flagship; needs OLLAMA_LOAD_TIMEOUT=15m
Devstral-Small-2-24B   14 GB   86               -           -      -            agent-tuned; banked for tool discipline
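The Δ column can be sanity-checked from the two throughput columns. The sketch below uses a hypothetical helper, `speedup_pct`, on the rounded tok/s values from the table. Those values give roughly +62% and +71%, consistent with the reported +60% and +70% if the reported figures are rounded. That rounding is an assumption, not something the page states.

```python
def speedup_pct(baseline: float, mtp: float) -> float:
    """Relative decode speedup of MTP over baseline, in percent."""
    return (mtp / baseline - 1.0) * 100.0


# (baseline tok/s, MTP tok/s) taken from the table rows that have both values.
rows = {
    "Qwen3.6-27B-MTP": (68, 110),
    "Qwen3.6-35B-A3B-MTP": (58, 99),
}

for model, (base, mtp) in rows.items():
    print(f"{model}: {speedup_pct(base, mtp):+.1f}%")
# Qwen3.6-27B-MTP: +61.8%
# Qwen3.6-35B-A3B-MTP: +70.7%
```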