TMB Scoreboard
“The cheapest model is the one that doesn’t create rework.” — The Monocle Bear
The TMB Scoreboard (The Monocle Bear) is how we decide which models to actually deploy. Not a leaderboard scraped from a paper — real, demanding tasks graded on strict rubrics, run on the same hardware we serve from (an M3 Ultra cluster, single-Mac runtimes, and cloud APIs side by side). It is the evidence base behind every routing decision in the stack — and the source of truth for CoeOS, the benchmark-composed virtual model.
Why we benchmark this way
Public benchmarks answer “which model is smartest in the abstract”. We need a different answer: which model do I run, for which task, on the hardware I own. So TMB tests the work we actually do:
- General: long-form creative writing, GDPR / AI-Act legal analysis, Python.
- Coding: 4 languages across 8–9 tasks (CLI, debug, architecture, data pipeline, React, Bash/infra, refactoring, Swift CLI, SwiftUI).
- Planning: the ability to plan before coding — decomposition, spotting the trap, edge cases, an executable handoff.
Every task has a hand-written rubric, a single run at temperature 0, and an immediate evaluation. The result is not “model X is good” but “model X scores 49/50 on debug, 33/50 on Swift” — granular enough to route a request to the model proven best at that skill.
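As a minimal sketch of how per-task scores could drive routing, the snippet below picks the highest-scoring model for a given skill. The function name `best_model_for`, the table layout, and the second model (`model-y`) with its scores are assumptions made for illustration; only the 49/50 and 33/50 figures for model X come from the text above, and this is not the actual TMB or CoeOS routing code.

```python
from typing import Dict

# Per-task scores out of 50. "model-x" mirrors the example in the text;
# "model-y" and its values are hypothetical placeholders, not TMB results.
SCORES: Dict[str, Dict[str, int]] = {
    "model-x": {"debug": 49, "swift": 33},
    "model-y": {"debug": 41, "swift": 44},
}


def best_model_for(task: str, scores: Dict[str, Dict[str, int]]) -> str:
    """Return the model with the highest score on `task`.

    Models that were never graded on the task are skipped. On a tie,
    the model listed first in `scores` wins.
    """
    candidates = {model: s[task] for model, s in scores.items() if task in s}
    if not candidates:
        raise KeyError(f"no model has a score for task {task!r}")
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    for task in ("debug", "swift"):
        print(task, "->", best_model_for(task, SCORES))
```

With these placeholder numbers, a debug request goes to `model-x` and a Swift request to `model-y`, which is the kind of per-skill split a single aggregate score would hide.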
The three scoreboards
| Scoreboard | What it measures | Tests |
|---|---|---|
| General | Creative, legal/GDPR, Python — the everyday surface | T01–T07 |
| Coding | 4 languages, 9 tasks, write + debug + architect | C01–C09 |
| Planning | Plan before code — decompose, trap, edge cases, spec | P01–P04 |
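The test ID ranges in the table can be read as a simple prefix scheme. The sketch below maps an ID such as `C03` back to its scoreboard and rejects IDs outside the listed ranges. The helper name `scoreboard_for` and the two-digit ID format are assumptions inferred from the table, not part of any published TMB tooling.

```python
import re

# Prefix -> (scoreboard name, highest test number), taken from the table:
# T01-T07, C01-C09, P01-P04.
SCOREBOARDS = {
    "T": ("General", 7),
    "C": ("Coding", 9),
    "P": ("Planning", 4),
}


def scoreboard_for(test_id: str) -> str:
    """Return the scoreboard name for a test ID like 'C03'."""
    match = re.fullmatch(r"([TCP])(\d{2})", test_id)
    if match is None:
        raise ValueError(f"malformed test ID: {test_id!r}")
    prefix, number = match.group(1), int(match.group(2))
    name, last = SCOREBOARDS[prefix]
    if not 1 <= number <= last:
        raise ValueError(f"{test_id!r} is outside {prefix}01-{prefix}{last:02d}")
    return name


if __name__ == "__main__":
    for test_id in ("T02", "C03", "P04"):
        print(test_id, "->", scoreboard_for(test_id))
```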
The models that keep winning, across the boards:
- MiniMax M3 Q6 — the frontier local. #1 on coding (97.3%), #2 general (94.2%), top planner. The first local model to dominate writing and debug at once. One Mac Studio, no cluster.
- Kimi K2.7 Code — the reasoning coder. #2 coding (94.9%), absolute records on architecture (C03) and data pipeline (C04). The Aider model.
- GLM 5.2 — split personality: record creative (T01 480/500, above Opus 4.7) in the cloud, and the best debug + refactoring analysis (C02/C06). Quantization guts the creative, spares the analytical.
- Qwen 3.5 (397B / 122B) — the analyst polyglot. #1 on GDPR (T02 50/50); the 122B at 38 tok/s is the fast ACT workhorse.
- Nex N2 Pro Q9 — the agentic fine-tune that beats its own base by ~5 points, #1 local on architecture.
Frontier reference (out of the local ranking): Claude Opus 4.6/4.7 is the ceiling, at 95.0% general and 94.8% coding. On the general board, M3 Q6 sits 0.8 pt below it (94.2% vs 95.0%), and it matches or beats it on several individual tasks.
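The gap to the frontier reference can be checked with a short calculation. This sketch uses only the board-level percentages quoted on this page; the variable names are illustrative, and it assumes the percentages for the two models are directly comparable.

```python
# Board-level percentages as quoted on this page.
M3_Q6 = {"general": 94.2, "coding": 97.3}
OPUS = {"general": 95.0, "coding": 94.8}

# Positive means M3 Q6 is ahead of the reference, negative means behind.
# Rounding to one decimal avoids floating-point noise such as -0.7999999.
gaps = {board: round(M3_Q6[board] - OPUS[board], 1) for board in M3_Q6}

if __name__ == "__main__":
    for board, gap in gaps.items():
        print(f"{board}: {gap:+.1f} pt")
```

On these figures the 0.8 pt deficit applies to the general board, while the quoted coding percentage for M3 Q6 is above the reference.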
Models tested
The panel spans local and cloud, across many quantizations and infra setups:
- MiniMax — M3 (Q6), M2.7 (8-bit, 12k/36k)
- Qwen — 3.5 (397B BF16 / Q9, 122B Q8), 3-6, 3-Next (INST/THNK), 3-Coder-Next-80B, 3.5-35B-A3B
- GLM — 5.1 (754B, OR/Q8), 5.2 (Cloud / Q8 / Q6)
- Tencent Hy3-preview (Q9)
- Kimi — K2.7 Code, K2.5 (Q8)
- Nex N2 Pro (Q9)
- Coder-480B (Q8 / FP16, tensor / pipeline / thinking)
- Mistral — Large, Medium 3.1, Small 2603; Devstral 2512; Ministral 14B
- Frontier reference — Claude Opus 4.6/4.7, Gemini 3.1 Pro / 3.5 Flash
Infrastructure & evaluators
Benchmarks run on the Odysseus cluster (4× M3 Ultra), single-node M3 Ultra 512 (OdyssAI-X / Telemak), Inferencer ×2, plus OpenRouter and the Anthropic API. Runs are graded by Claude Opus 4.6 / Sonnet 4.6 and the TMB Benchmark Evaluator (MiniMax M3) against TMB v2 rubrics. Companion memory is disabled for every benchmark run.
Read next
- General scoreboard →
- Coding scoreboard →
- Planning scoreboard →
- CoeOS → — how these scores become a virtual model.