Model bank — bench metrics

RTX 5090 32 GB · ollama 0.31.1 (CUDA) · UD-Q4_K_XL quants · 256-token decode, warm · 2026-07-04

Peak decode

217 tok/s

GLM-4.7-Flash

Best MTP gain

+70%

Qwen3.6-35B-A3B · 58 → 99 tok/s

Models banked

5

76 GB · mirrored to HF

Engine upgrade

+56%

gemma4 · ollama 0.20.5 → 0.31.1

Decode speed by model

tokens/second, single stream · MTP = multi-token-prediction speculative decoding (draft-mtp)

baseline decode · with MTP drafting
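
A minimal sketch of how a single-stream decode figure like the ones charted here could be measured. It assumes ollama's documented /api/generate response fields eval_count and eval_duration (nanoseconds) are unchanged in 0.31.1. The bench harness actually used is not shown on this page; bench_once, its host default, and the prompt are hypothetical.

```python
import json
import urllib.request


def decode_tok_per_s(response: dict) -> float:
    """Decode throughput in tokens/second from an ollama generate response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] * 1e9 / response["eval_duration"]


def bench_once(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Hypothetical helper: one non-streaming 256-token generation."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 256},  # matches the 256-token decode above
    }
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return decode_tok_per_s(json.load(resp))
```

For a warm number, one option is to call bench_once once to load the model, discard that result, and report a later call.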

P(token k accepted | k−1 accepted) · Qwen3.6-27B, draft_num_predict 4

per-run drafting statistics from the serve log

model	acceptance	mean len	speedup
Qwen3.6-27B	0.55	3.00	+60%
Qwen3.6-35B-A3B	0.47	2.89	+70%
Qwen3.5-9B	0.36–0.47	2.4–2.9	n/m

Acceptance decays quickly past draft position 2. If verification overhead shows up at higher batch sizes, lowering draft_num_predict from 4 to 2–3 is the tuning lever.
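
To see why shortening the draft costs little when acceptance decays, here is a rough sketch of the expected number of accepted draft tokens per verification step. It assumes the usual speculative-decoding rule that draft token k counts only if every earlier draft token was accepted. The per-position probabilities below are hypothetical placeholders, not the measured Qwen3.6-27B values.

```python
def expected_accepted(cond_accept):
    """Expected accepted draft tokens per verification step.

    cond_accept[i] is P(draft token i+1 accepted | all earlier ones accepted).
    """
    expected = 0.0
    survive = 1.0  # probability that the chain is still unbroken
    for p in cond_accept:
        survive *= p
        expected += survive
    return expected


# HYPOTHETICAL conditional acceptance rates for draft positions 1..4.
probs = [0.8, 0.6, 0.3, 0.2]
for n in range(1, len(probs) + 1):
    print(f"draft length {n}: {expected_accepted(probs[:n]):.3f} expected accepted tokens")
```

With rates shaped like these, most of the expected yield comes from the first two positions, so a shorter draft gives up little while reducing the tokens the main model must verify.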

everything measured, including what the charts summarize

model	size	baseline tok/s	MTP tok/s	Δ	acceptance	note
GLM-4.7-Flash	17 GB	217	—	—	—	fastest banked; MoE, no MTP head
Qwen3.5-9B-MTP	6 GB	n/m	174	—	0.36–0.47	T0 utility tier
Qwen3.6-27B-MTP	17 GB	68	110	+60%	0.55	dense T1 daily driver
Qwen3.6-35B-A3B-MTP	22 GB	58	99	+70%	0.47	T2 flagship; needs OLLAMA_LOAD_TIMEOUT=15m
Devstral-Small-2-24B	14 GB	86	—	—	—	agent-tuned; banked for tool discipline
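
The Δ column reads as the relative change from baseline to MTP decode speed. A small sketch that recomputes it from the tok/s columns follows; because those figures are rounded, the results can differ by a point or two from the reported +60% and +70%. The function name is an assumption for illustration.

```python
def mtp_gain_pct(baseline_tok_s: float, mtp_tok_s: float) -> float:
    """Relative decode speedup of MTP over baseline, in percent."""
    return (mtp_tok_s / baseline_tok_s - 1.0) * 100.0


# (baseline tok/s, MTP tok/s) copied from the table rows that have both.
rows = {
    "Qwen3.6-27B-MTP": (68, 110),
    "Qwen3.6-35B-A3B-MTP": (58, 99),
}

for name, (baseline, mtp) in rows.items():
    print(f"{name}: {mtp_gain_pct(baseline, mtp):+.1f}%")
```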

Source of truth: packages/local-models/MODELS.md (commits 481ed3c, ef98c24) · weights mirrored at hf.co/todie/model-bank · n/m = not measured
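
The Qwen3.6-35B-A3B-MTP row notes that it needs OLLAMA_LOAD_TIMEOUT=15m. One way to apply that when launching the server from a script is sketched here. It assumes the ollama binary is on PATH and that the variable is read at server start; serve_env is a hypothetical helper, not part of packages/local-models.

```python
import os


def serve_env(load_timeout: str = "15m", base=None) -> dict:
    """Copy of the environment with OLLAMA_LOAD_TIMEOUT set for `ollama serve`."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_LOAD_TIMEOUT"] = load_timeout
    return env


# Usage (assumes `ollama` is installed and on PATH):
#   import subprocess
#   server = subprocess.Popen(["ollama", "serve"], env=serve_env())
```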