agentic / local-models
Summary cards: Peak decode · Best MTP gain · Models banked · Engine upgrade
tokens/second, single stream · MTP = multi-token-prediction speculative decoding (draft-mtp)
P(token k accepted | token k−1 accepted), by draft position · Qwen3.6-27B, draft_num_predict 4
per-run drafting statistics from the serve log · n/m = not measured
| model | acceptance | mean len | speedup |
|---|---|---|---|
| Qwen3.6-27B | 0.55 | 3.00 | +60% |
| Qwen3.6-35B-A3B | 0.47 | 2.89 | +70% |
| Qwen3.5-9B | 0.36–0.47 | 2.4–2.9 | n/m |
Acceptance decays fast past position 2, so lowering draft_num_predict to 2–3 is the tuning lever if verification overhead shows up at higher batch sizes.
everything measured, including what the charts summarize
| model | size | baseline tok/s | MTP tok/s | Δ | acceptance | note |
|---|---|---|---|---|---|---|
| GLM-4.7-Flash | 17 GB | 217 | — | — | — | fastest banked; MoE, no MTP head |
| Qwen3.5-9B-MTP | 6 GB | n/m | 174 | — | 0.36–0.47 | T0 utility tier |
| Qwen3.6-27B-MTP | 17 GB | 68 | 110 | +60% | 0.55 | dense T1 daily driver |
| Qwen3.6-35B-A3B-MTP | 22 GB | 58 | 99 | +70% | 0.47 | T2 flagship; needs OLLAMA_LOAD_TIMEOUT=15m |
| Devstral-Small-2-24B | 14 GB | 86 | — | — | — | agent-tuned; banked for tool discipline |
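The Δ column can be re-derived from the two throughput columns. The sketch below assumes Δ is the relative change in tokens/second over baseline; `mtp_gain_pct` is a hypothetical helper name, and the inputs are the two rows of the table that have both measurements.

```python
def mtp_gain_pct(baseline_tps, mtp_tps):
    """Relative throughput change of MTP over baseline, in percent."""
    return (mtp_tps / baseline_tps - 1.0) * 100.0


# (baseline tok/s, MTP tok/s) taken from the table rows with both values.
rows = {
    "Qwen3.6-27B-MTP": (68, 110),
    "Qwen3.6-35B-A3B-MTP": (58, 99),
}

for name, (baseline, mtp) in rows.items():
    print(f"{name}: {mtp_gain_pct(baseline, mtp):+.1f}%")
```

This gives about +61.8% and +70.7%, consistent with the table's +60% and +70% if those figures are rounded.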