TMB Scoreboard

“The cheapest model is the one that doesn’t create rework.” — The Monocle Bear

The TMB (The Monocle Bear) Scoreboard is how we decide which models to actually deploy. It is not a leaderboard scraped from a paper: it is real, demanding tasks graded on strict rubrics, run on the same infrastructure we serve from (an M3 Ultra cluster, single-Mac runtimes, and cloud APIs side by side). It is the evidence base behind every routing decision in the stack, and the source of truth for CoeOS, the benchmark-composed virtual model.

Why we benchmark this way

Public benchmarks answer “which model is smartest in the abstract”. We need a different answer: which model do we run, for which task, on the hardware we own. So TMB tests the work we actually do:

General: long-form creative writing, GDPR / AI-Act legal analysis, Python.
Coding: 4 languages across 9 tasks (CLI, debug, architecture, data pipeline, React, Bash/infra, refactoring, Swift CLI, SwiftUI).
Planning: the ability to plan before coding — decomposition, spotting the trap, edge cases, an executable handoff.

Every task has a hand-written rubric, a single run at temperature 0, and an immediate evaluation. The result is not “model X is good” but “model X scores 49/50 on debug, 33/50 on Swift” — granular enough to route a request to the model proven best at that skill.
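
To make the routing idea concrete, here is a minimal sketch of picking a model per skill from rubric scores. It is an illustration under stated assumptions, not the actual CoeOS logic: the function name best_model_for and the score table are hypothetical, "model_x" mirrors the 49/50 and 33/50 example above, and "model_y" is a made-up placeholder rather than a TMB result.

```python
# Illustrative rubric scores out of 50.
# "model_x" mirrors the example in the text; "model_y" is a made-up
# placeholder for illustration only, not a measured TMB result.
SCORES = {
    "model_x": {"debug": 49, "swift": 33},
    "model_y": {"debug": 41, "swift": 45},
}


def best_model_for(skill, scores=SCORES):
    """Return the model with the highest rubric score for one skill."""
    candidates = {
        model: per_skill[skill]
        for model, per_skill in scores.items()
        if skill in per_skill
    }
    if not candidates:
        raise KeyError(f"no model has been scored on {skill!r}")
    # Iterating in sorted order makes ties deterministic:
    # max() keeps the first maximal item, i.e. the alphabetically first model.
    return max(sorted(candidates), key=candidates.get)


if __name__ == "__main__":
    for skill in ("debug", "swift"):
        print(skill, "->", best_model_for(skill))
```

With per-skill scores, a debug request and a Swift request can go to different models, which a single aggregate score would hide.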

The three scoreboards

Scoreboard	What it measures	Tests
General	Creative, legal/GDPR, Python — the everyday surface	T01–T07
Coding	4 languages, 9 tasks, write + debug + architect	C01–C09
Planning	Plan before code — decompose, trap, edge cases, spec	P01–P04
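
As a rough sketch, the table above can be held as data and the test ranges expanded into individual test IDs. The names SCOREBOARDS and expand_tests are hypothetical, and the code assumes IDs are one letter followed by a zero-padded number, as in the table.

```python
import re

# Scoreboard name -> test range, copied from the table above.
SCOREBOARDS = {
    "General": "T01-T07",
    "Coding": "C01-C09",
    "Planning": "P01-P04",
}


def expand_tests(span):
    """Expand a range such as 'T01-T07' into ['T01', ..., 'T07'].

    Accepts either a hyphen or an en dash between the two IDs.
    """
    match = re.fullmatch(r"([A-Z])(\d+)[-–]([A-Z])(\d+)", span)
    if match is None or match.group(1) != match.group(3):
        raise ValueError(f"not a valid test range: {span!r}")
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(4))
    width = len(match.group(2))  # keep the zero padding of the first ID
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


if __name__ == "__main__":
    for board, span in SCOREBOARDS.items():
        tests = expand_tests(span)
        print(f"{board}: {len(tests)} tests ({tests[0]} to {tests[-1]})")
```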

Top 5

The models that keep winning, across the boards:

MiniMax M3 Q6 — the frontier local. #1 on coding (97.3%), #2 general (94.2%), top planner. The first local model to dominate writing and debug at once. One Mac Studio, no cluster.
Kimi K2.7 Code — the reasoning coder. #2 coding (94.9%), absolute records on architecture (C03) and data pipeline (C04). The Aider model.
GLM 5.2 — split personality: record creative (T01 480/500, above Opus 4.7) in the cloud, and the best debug + refactoring analysis (C02/C06). Quantization guts the creative, spares the analytical.
Qwen 3.5 (397B / 122B) — the analyst polyglot. #1 on GDPR (T02 50/50); the 122B at 38 tok/s is the fast ACT workhorse.
Nex N2 Pro Q9 — the agentic fine-tune that beats its own base by ~5 points, #1 local on architecture.

Frontier reference (out of the local ranking): Claude Opus 4.6/4.7 is the ceiling, at 95.0% general and 94.8% coding. On the general board, M3 Q6 (94.2%) sits 0.8 pt below it, and it matches or beats it on several individual tasks.

Models tested

The panel spans local and cloud, across many quantizations and infra setups:

MiniMax — M3 (Q6), M2.7 (8-bit, 12k/36k)
Qwen — 3.5 (397B BF16 / Q9, 122B Q8), 3-6, 3-Next (INST/THNK), 3-Coder-Next-80B, 3.5-35B-A3B
GLM — 5.1 (754B, OR/Q8), 5.2 (Cloud / Q8 / Q6)
Tencent Hy3-preview (Q9)
Kimi — K2.7 Code, K2.5 (Q8)
Nex N2 Pro (Q9)
Coder-480B (Q8 / FP16, tensor / pipeline / thinking)
Mistral — Large, Medium 3.1, Small 2603; Devstral 2512; Ministral 14B
Frontier reference — Claude Opus 4.6/4.7, Gemini 3.1 Pro / 3.5 Flash

Infrastructure & evaluators

Benchmarks run on the Odysseus cluster (4× M3 Ultra), single-node M3 Ultra 512 (OdyssAI-X / Telemak), Inferencer ×2, plus OpenRouter and the Anthropic API. Outputs are graded by Claude Opus 4.6 / Sonnet 4.6 and the TMB Benchmark Evaluator (MiniMax M3), against TMB v2 rubrics. Companion memory is disabled for every benchmark run.
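
The run conditions stated on this page (a single run, temperature 0, companion memory off) could be pinned in a small settings object so that no run drifts from them. This is only a sketch: RunConfig and its field names are hypothetical and do not describe the actual TMB harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical settings for one benchmark run (names are illustrative)."""

    model: str
    task_id: str
    temperature: float = 0.0        # temperature 0, as stated on this page
    runs: int = 1                   # a single run per task
    companion_memory: bool = False  # memory is disabled for benchmark runs

    def __post_init__(self):
        # Reject any configuration that departs from the stated conditions.
        if self.temperature != 0.0:
            raise ValueError("benchmark runs use temperature 0")
        if self.runs != 1:
            raise ValueError("benchmark tasks are run exactly once")
        if self.companion_memory:
            raise ValueError("companion memory must be disabled")


if __name__ == "__main__":
    config = RunConfig(model="MiniMax M3 Q6", task_id="C02")
    print(config)
```

Validating in the constructor means a misconfigured run fails before any tokens are generated, rather than producing a score that cannot be compared with the others.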