TMB Scoreboard — Coding
The Monocle Bear · April–June 2026
- Tests: TMB Coding C01–C08 (4 languages, /460) + C09 Mandelbrot Tkinter (/50)
- Evaluators: Claude Opus 4.6 / Sonnet 4.6 — TMB Coding v2 rubrics
- Infrastructure: Odysseus cluster 4× M3 Ultra + Inferencer + Ultra-512 MLX + OpenRouter
The battery
9 tests (C01–C08 plus C04-2), 4 languages, /460. A single model can finish the full battery in a day.
| Test | Language | Subject | Max |
|---|---|---|---|
| C01 | Python | CLI — LLM Inference Benchmarker | /50 |
| C02 | Python | Debug & Fix — 7 hidden bugs | /50 |
| C03 | Python | Architecture — Cluster Monitor System | /60 |
| C04 | Python | Data Pipeline — Benchmark Cleaner | /50 |
| C04-2 | React/JS | Dashboard — TMB Scoreboard Viewer | /50 |
| C05 | Bash + Python | System & Infra — MLX Model Deployer | /50 |
| C06 | Python | Refactoring — Scoreboard Analyzer | /50 |
| C07 | Swift | macOS CLI — Cluster Health Monitor | /50 |
| C08 | SwiftUI | iOS App — EXO Model Manager | /50 |
| TOTAL | | | /460 |
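Global thresholds: 360+ (88%+) Full-stack autonomous · 290+ (71%+) Assisted senior · 225+ (55%+) Junior · <160 Unusable. The percentages match the point cutoffs on the 8-test /410 scale (360/410 ≈ 88%, 290/410 ≈ 71%, 225/410 ≈ 55%), not /460.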
Scoreboard C01–C08
Nine models, plus Claude Opus 4.6 as the frontier reference. MiniMax M3 Q6 is #1 (447.5/460, 97.3%). Kimi K2.7 Code is #2 (436.5/460, 94.9%; 347/360 = 96.4% on the 7 non-Swift tests).
| Test | Max | M3 Q6 ★ | Kimi K2.7 ◆ | Hy3 Q9 | Coder-480B Q8 | Qwen3.5-397B BF16 | GLM 5.2 Cloud | Nex N2 Pro Q9 | Qwen3.5-122B Q8 | Qwen3-Coder-Next-80B | Opus 4.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C01 Python CLI | 50 | 49.5 | 49 | 42 | 47 | 40 | 45 | 44 | 44 | 41 | 50 |
| C02 Debug Fix | 50 | 50 | 47 | 34 | 28 | 44 | 49 | 48 | 47 | 42 | 50 |
| C03 Architecture | 60 | 55 | 59 ⭐ | 40 | 39 | 46 | 47 | 49 | 50 | 45 | 55 |
| C04 Data Pipeline | 50 | 49.5 | 50 ⭐ | 47 | 48 | 42 | 49 | 48 | 44 | 45 | 49 |
| C04-2 React | 50 | 48.5 | 46 | 44 | 42 | 36 | 44 | 44 | 43 | — | 46 |
| C05 Bash+Py | 50 | 47.5 | 48 | 36 | 35.5 | 41 | 46 | 45 | 44 | 37 | 47 |
| C06 Refactoring | 50 | 49.5 | 48 | 37 | 31 | 43 | 49 | 47 | 44 | 46 | 47 |
| C07 Swift CLI | 50 | 49.5 | 43 | — | 38.5 | 40 | 38 | 38 | 33 | 32 | 43 |
| C08 SwiftUI | 50 | 48.5 | 46.5 | — | 26 | 32.5 | 43 | 44 | 36 | 37 | 49 |
| TOTAL | 460 | 447.5 (97.3%) | 436.5 (94.9%) ◆ | 280/360 (77.8%) ‡ | 335 (72.8%) | 364.5 (79.2%) | 410 (89.1%) | 407 (88.5%) | 385 (83.7%) | 325/410 (79.3%) † | 436 (94.8%) |
⭐ Best-in-class on that test. Among the nine tested models, MiniMax M3 Q6 has the top score on 6/9 tests (C01, C02, C04-2, C06, C07, C08); Kimi K2.7 has it on C03, C04 and C05 (48 vs 47.5).

◆ Kimi K2.7 Code (reasoning, local 4-bit, ~17.8 tok/s): C01–C08 plus C09 Mandelbrot (50/50 ⭐, absolute record). Role: coder model for Aider.

† Qwen3-Coder-Next-80B: C04-2 React not tested, so the score is out of 410.

‡ Hy3-preview Q9: C07/C08 not tested, so the score is out of 360.
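The totals and the † / ‡ denominators can be re-derived from the per-test rows. The sketch below is only an illustration: the `total` helper and the dictionary layout are assumptions made for this example, and the scores are copied from two columns of the scoreboard.

```python
# Max points per test, from the battery table.
MAX_POINTS = {
    "C01": 50, "C02": 50, "C03": 60, "C04": 50, "C04-2": 50,
    "C05": 50, "C06": 50, "C07": 50, "C08": 50,
}

def total(scores):
    """Return (points, max_points, percent) over the tests actually run.

    `scores` maps test id -> points, with None for a test that was not run.
    """
    run = {test: pts for test, pts in scores.items() if pts is not None}
    points = sum(run.values())
    max_points = sum(MAX_POINTS[test] for test in run)
    return points, max_points, round(100 * points / max_points, 1)

# Two columns copied from the scoreboard.
m3_q6 = {"C01": 49.5, "C02": 50, "C03": 55, "C04": 49.5, "C04-2": 48.5,
         "C05": 47.5, "C06": 49.5, "C07": 49.5, "C08": 48.5}
coder_next = {"C01": 41, "C02": 42, "C03": 45, "C04": 45, "C04-2": None,
              "C05": 37, "C06": 46, "C07": 32, "C08": 37}

print(total(m3_q6))       # (447.5, 460, 97.3)
print(total(coder_next))  # (325, 410, 79.3)
```

Normalizing by the points actually available is what makes a partial run (325/410) comparable with a full one.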
Infra & speed
| Model | Active params | Quant | RAM | Avg speed | Mode |
|---|---|---|---|---|---|
| Coder-480B Q8 | 35B/480B MoE | Q8 | ~480 GB | 13.3 t/s | Exo Tensor RDMA 4 nodes |
| Qwen3.5-397B BF16 | 17B/397B MoE | BF16 | ~715 GB | 16.8 t/s | Exo Pipeline RDMA 4 nodes |
| Qwen3.5-122B Q8 HD16 | ~10B/122B MoE | Q8 + BF16 heads | ~120 GB | 38.3 t/s | 1× M3 Ultra (telecode) — multi-instance |
| Nex N2 Pro Q9 | 17B/397B MoE | MLX-9bit | ~415 GB | 24.3 t/s | Nex MLX 3 nodes M3 Ultra 256 GB |
| MiniMax M3 Q6 | 21B/427B MoE | MLX 6-bit | ~320 GB | ~20 t/s | 1× M3 Ultra 512 GB solo |
| Hy3-preview Q9 | 21B/295B MoE | Q9 | ~332 GB | ~19.6 t/s | 1× M3 Ultra 512 GB solo |
| Kimi K2.7 Code | — | 4-bit | — | ~17.8 t/s | Local 4-bit (reasoning) |
Qwen3.5-122B Q8 HD16 is the fastest model in the table (38.3 tok/s) and, on a single node, outscores Qwen3.5-397B BF16 by 20.5 pts (385 vs 364.5). Nex N2 Pro still beats it on quality (407 vs 385).
Model profiles
MiniMax M3 Q6 — “the complete expert, first local to cross the frontier”
MiniMax M3 (MoE 427B, in-house MLX 6-bit MSA-fixed, ~320 GB) on 1× M3 Ultra 512 GB scores 447.5/460 (97.3%): the new absolute #1, ahead of Opus 4.6 (436) and GLM 5.2 Cloud (410). It is the first local model to beat the frontier estimate on this benchmark.
What M3 breaks: the “coder ≠ analyst” split that dominated the benchmark (Coder-480B 95% writing / 56% debug; Qwen3.5 the inverse). M3 does 99% writing AND 100% debug — first model to dominate both axes at once.
Only execution slip: C03 storage bug — 14 cur.execute() without await (persistence + alerting broken at runtime). Trivial fix, but only caught at execution, not in static review.
| Domain | Score | Note |
|---|---|---|
| Python (writing + debug) | 99% / 100% | Total domination |
| Bash / infra | 95% | Zero GNU-isms, bounded parallelism |
| React / frontend | 97% | ScatterChart quadrants on real median |
| Swift | 99% | The only model with no compile error |
| Architecture | 91.7% | Async storage bug in C03 (−4), trivial fix |
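The C03 slip is easy to miss in static review because a coroutine call without await is still valid Python. The sketch below reproduces the failure mode with a hypothetical `FakeCursor`; it is not the storage layer from the test, and the driver behind `cur.execute()` is not named in the results.

```python
import asyncio

class FakeCursor:
    """Hypothetical stand-in for an async DB cursor (not a real driver)."""

    def __init__(self):
        self.rows = []

    async def execute(self, sql, params=()):
        self.rows.append((sql, params))

async def save_broken(cur, value):
    # Bug: this only creates a coroutine object; the body never runs.
    coro = cur.execute("INSERT INTO metrics VALUES (?)", (value,))
    coro.close()  # closed here only to silence the "never awaited" warning

async def save_fixed(cur, value):
    # Fix: awaiting the coroutine actually executes the statement.
    await cur.execute("INSERT INTO metrics VALUES (?)", (value,))

def demo():
    broken, fixed = FakeCursor(), FakeCursor()
    asyncio.run(save_broken(broken, 1.0))
    asyncio.run(save_fixed(fixed, 1.0))
    return len(broken.rows), len(fixed.rows)

print(demo())  # (0, 1)
```

Nothing raises in the broken variant, which is consistent with the bug surfacing only at execution time.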
Kimi K2.7 Code — “the reasoning coder, 0.5% behind M3”
Reasoning model (local 4-bit, ~17.8 tok/s): 436.5/460 (94.9%), #2. Declared role: coder model for Aider. Two absolute records: C03 Architecture (59/60, #1) and C04 Data Pipeline (50/50, #1). C03 beats both M3 Q6 (55) and Opus 4.6 (55). Thinking blocks are visible in the deliverables, so configure Aider to strip them before diff parsing (its reasoning_tag model setting does this; --reasoning-effort only sets the effort level). Strong on Python/Bash, no real gaps; the only weak spots are a React multi-quant selection bug (C04-2, 46) and 6/7 bugs found in debug (C02, 47). Aider verdict: strong on Python/Bash/React.
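Visible thinking blocks can also be removed in a small post-processing step before any diff parser sees the output. This is a minimal sketch that assumes the reasoning is wrapped in `<think>…</think>` tags; the tag name and the `strip_thinking` helper are assumptions, not something the benchmark specifies.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>; adjust the tag to
# whatever the model actually emits.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove reasoning blocks so only the deliverable (e.g. a diff) remains."""
    return THINK_BLOCK.sub("", text)

raw = "<think>\nplan the fix\n</think>\n--- a/app.py\n+++ b/app.py\n"
print(strip_thinking(raw))
```

The non-greedy match with `re.DOTALL` keeps several separate thinking blocks from swallowing the code between them.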
Hy3-preview Q9 — “a 295B that performs like a 480B — on one node”
Tencent Hy3-preview (MoE 295B / 21B active, Q9, ~332 GB) on 1× M3 Ultra 512 GB: 280/360 (77.8%) on 7 comparable tests (C07/C08 not run). It frees the other 3 nodes during execution. Clear pattern: it excels at single-deliverable, precise-spec tasks and drops to Junior the moment the task needs analysis or documentation. Versus Coder-480B on 4 nodes: Hy3 scores higher on 5/7 common tests (280 vs 270.5, +9.5 pts), on one node instead of four, and runs roughly 50% faster (19.6 vs 13.3 tok/s). Mandelbrot: 48/50, #2 behind Kimi K2.7 Code. But C06 Section D = 1/10 (it delivers refactored code with no documentation), the same pattern as C02 (34/50): it executes, but doesn’t reason about code.
Coder-480B Q8 — “codes like a senior, explains like an intern”
The most asymmetric profile of the benchmark. It crushes writing (+6 to +7 vs Qwen3.5-397B on C01/C04/C04-2) but collapses on anything needing explanation, diagnosis or justification (C02 at 56%, C06 at 62%). The rubric breakdown shows why: 0/15 on Diagnostic in C02 and 0/10 on Justification in C06. The fixes work but come with no “why”: a mute fixer. Qwen3.5 finds 7/7 bugs with full root cause on C02.
Qwen3.5-397B BF16 — “the polyglot analyst”
Wins overall vs Coder-480B (+29.5 pts) precisely on the tests where explanation counts: C02 debug 88% (7/7 bugs, zero false positives) and C06 86% (before/after justification). Weakness: modern frontend, with React at 72% (no TypeScript, hard-coded data) and SwiftUI at 65% (pre-2023 patterns, no real streaming).
GLM 5.2 Cloud — “the all-rounder that tops the ceiling on analytical tests”
410/460 (89.1%), between Opus 4.6 (94.8%) and Nex N2 Pro (88.5%). Homogeneous, with no spectacular collapse. Two scores second only to MiniMax M3 Q6 in the panel: C02 (49/50) and C06 (49/50). On C02 it gives the panel’s most complete analysis (a full shell-injection exploitation scenario, not just the fix); on C06, 13 documented changes with before/after/why. Only significant gap: C07 Swift CLI (38/50), with an ArgumentParser compile bug and a wrong Apple Silicon page size (4096 instead of 16384 bytes, a ×4 RAM under-estimate).
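The ×4 factor in the C07 page-size bug follows directly from the two page sizes. The sketch below works it through in Python rather than Swift; the page count is a made-up figure and `pages_to_gib` is a hypothetical helper, used only to show the arithmetic.

```python
import os

def pages_to_gib(pages: int, page_size: int) -> float:
    """Convert a page count (as reported by the OS) to GiB."""
    return pages * page_size / 2**30

# Hypothetical page count, for illustration only.
pages = 1_000_000

wrong = pages_to_gib(pages, 4096)    # hard-coded 4 KiB page
right = pages_to_gib(pages, 16384)   # 16 KiB page, the value cited above
print(right / wrong)                 # 4.0

# Safer than hard-coding: ask the OS for the page size (Unix-only call).
print(os.sysconf("SC_PAGE_SIZE"))
```

Querying the page size at runtime avoids the bug on any platform, whatever value it reports.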
Nex N2 Pro Q9 — “the agentic fine-tune that validates the base model”
Nex N2 Pro (MLX-9bit, 3× M3 Ultra 256 GB, ~24.3 tok/s) scores 407/460 (88.5%), just under GLM 5.2 Cloud. Same base architecture as Qwen3.5-397B BF16, fine-tuned for agentic workflows. Gains vs the base on 8 of 9 tests (+3 to +11.5 pts), for example C01 +4, C04 +6, C04-2 +8 and C08 +11.5, correlated with domain depth. Strong points: C03 Architecture (49/60) and C04-2 React (44/50). Only regression: C07 Swift (−2).
Qwen3.5-122B Q8 HD16 — “the 10B-active that beats the 17B, with review bugs”
Qwen3.5-122B-A10B (MoE 122B / ~10B active, Q8 + BF16 heads, MLX, telecode) on 1 node: 385/460 (83.7%), #5. Counter-intuitive: it beats the 397B BF16 (17B active) with 7B fewer active params, so quality per active expert beats total parameter volume. Speed: 38.3 tok/s solo, multi-instance. Pattern: conceptually correct code with local (1–5 line) implementation bugs that are caught on the first run. Strong on Python/debug (near Nex). Weak only on Swift (C07 33/50, C08 36/50, with compile errors on dead code). With an automatic compile step in the Cline loop, ~80% of these are auto-corrected.
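An automatic compile step of the kind mentioned above could look roughly like this. It is a sketch under assumptions: `swiftc` is on PATH, and `fix` is a hypothetical callback that asks the model for a corrected file. How Cline would actually invoke such a gate is not described in the results.

```python
import subprocess

def typecheck(path, run=subprocess.run):
    """Return (ok, diagnostics) for one Swift file.

    Assumes `swiftc` is on PATH; `run` is injectable so the gate can be
    exercised without a Swift toolchain.
    """
    result = run(["swiftc", "-typecheck", path],
                 capture_output=True, text=True)
    return result.returncode == 0, result.stderr

def gate(path, fix, max_rounds=3, run=subprocess.run):
    """Feed compiler diagnostics back to `fix` until the file type-checks.

    `fix(path, diagnostics)` is a hypothetical callback that rewrites the
    file at `path` using the compiler output.
    """
    for _ in range(max_rounds):
        ok, diagnostics = typecheck(path, run)
        if ok:
            return True
        fix(path, diagnostics)
    return typecheck(path, run)[0]
```

Bounding the loop with `max_rounds` keeps a model that cannot fix its own error from cycling forever.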
Qwen3-Coder-Next-80B — “the honest Python-first, the broken Swift-last”
80B dense (not MoE). 8 tests, 325/410 (79.3%): “assisted senior”. Distinctive trait: analytical integrity. On C02 it finds 6/7 bugs and explicitly refuses to invent a 7th (“that would be malpractice”), with zero false positives. Python is excellent (C04 90%, C06 92%). Swift is the absolute floor (C07 64%, C08 74%), including a fundamental syntax error: the computed property let id: String { hostname } must be declared with var. Bash multi-process is broken (subshell IPC). For Cline ACT: recommended for Python with a structured plan; not for Swift without a compile pipeline.
C02 — Debug Fix — expanded panel (16 entries)
C02 is the most discriminating test: detect 7 hidden bugs, structured diagnosis, fix, security analysis.
| # | Model | /50 | Bugs/7 | Shell fix | Bug #7 | Analysis |
|---|---|---|---|---|---|---|
| 1 | MiniMax M3 Q6 ⭐ | 50 | 7/7 | ✓ | ✓ | Perfect — 7/7 + structured exploitation scenario |
| 2 | GLM 5.2 Cloud | 49 | 7/7 | ✓ | ✓ | Complete (root cause + exploitation scenario) |
| 3 | Nex N2 Pro Q9 | 48 | 7/7 | ✓ | ✓ | Complete (root cause + trigger scenario) |
| 4 | Kimi K2.7 Code | 47 | 6/7 | ✓ | ✓ | Complete — Bug 5 implicit |
| 5 | Qwen3.5-397B BF16 | 44 | 7/7 | ✓ | ✓ | Complete |
| 6 | Qwen3-Coder-Next-80B | 42 | 6/7 | ✓ | ✗ | Complete — refuses the 7th |
| 7 | Coder-480B Q8 Tensor | 39 | 5/7 | ✓ | ✗ | Structured |
| 8 | Kimi K2.5 Q8 Pipeline / Tensor | 35 | 6–7/7 | ✓ | mixed | Inline |
| 10 | Qwen3.5-397B Q9 Exo | 35 | 6/7 | ⚠ | ✗ | Missing |
| 11 | Hy3-preview Q9 | 34 | 6/7 | ✓ | ✓ | Partial — no root cause, no exploitation |
| 12 | Coder-480B Q8 Pipeline / FP16 | 28 | 5/7 | ✓ | ✗ | Minimal |
| 14 | Qwen3.5-35B-A3B Q9 | 26 | 6/7 | ✓ | ✗ | None |
| 15 | Mistral Small 3.1 24B Q8 | 23 | 4–5/7 | ✓ | ✗ | None |
| 16 | Coder-480B Q8 Thinking / LongCat Flash Lite Q9 | 22 | 5/7 | ⚠ | mixed | Invisible / none |
Findings: MiniMax M3 Q6 (50), GLM 5.2 Cloud (49) and Nex N2 Pro (48) lead the analytical panel. Four models find 7/7 with full analysis: M3 Q6, GLM 5.2 Cloud, Nex N2 Pro and Qwen3.5-397B BF16. Inference mode matters on the same model: Coder-480B scores 39 in Tensor mode vs 28 in Pipeline mode (+11), and Thinking is even worse (22, because the analysis migrates into the invisible block). A 3B-active MoE (Qwen3.5-35B, 76.4 t/s) beats a dense 24B (Mistral Small) on quality and speed.
C08 — SwiftUI iOS (8 models)
| # | Model | /50 | Level |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 49 | iOS Senior |
| 2 | MiniMax M3 Q6 | 48.5 | iOS Senior |
| 3 | Kimi K2.7 Code | 46.5 | iOS Senior |
| 4 | Nex N2 Pro Q9 | 44 | Good iOS |
| 5 | Kimi K2.5 Q8 / GLM 5.2 Cloud | 43–43.5 | Good iOS |
| 6 | GLM-5 Q8 | 41 | Good iOS |
| 7 | Qwen3-Coder-Next-80B | 37 | Intermediate |
| 8 | Qwen3.5-397B BF16 | 32.5 | Junior |
| 9 | Coder-480B Q8 | 26 | Basic |
C09 — Mandelbrot Tkinter
Interactive GUI: threading, zoom rectangle, view history, color palette, numpy forbidden, inverted Y axis.
| # | Model | /50 | Runs? | Threading |
|---|---|---|---|---|
| 1 | Kimi K2.7 Code ⭐ | 50 | ✓✓ | Real thread + smooth coloring |
| 2 | Hy3-preview Q9 | 48 | ✓✓ | Real thread, HSV inline |
| 3 | GLM 5.2 Cloud / Qwen 3.6 Plus | 47 | ✓✓ | Real thread, HSV colorsys |
| 5 | Qwen3.5-397B Q9 / Qwen3.5-122B Q8 | 46 | ✓ | mixed |
| 7 | Nex N2 Pro Q9 | 38 | ⚠ | fake async (after) |
| 8 | LongCat Flash Lite / Mistral Small | 29–32 | ✗ | — |
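Three of the C09 constraints (no numpy, inverted Y axis, a real worker thread) can be shown without opening a window. The sketch below is not any model's submission: the function names and the row-by-row queue hand-off are assumptions, and in a Tkinter app the main thread would drain the queue from an `after` callback instead of doing the computation there.

```python
import queue
import threading

def pixel_to_complex(px, py, width, height, x_min, x_max, y_min, y_max):
    """Map a canvas pixel to the complex plane.

    Canvas Y grows downward, so row 0 maps to y_max (inverted Y axis).
    """
    real = x_min + px * (x_max - x_min) / (width - 1)
    imag = y_max - py * (y_max - y_min) / (height - 1)
    return complex(real, imag)

def escape_time(c, max_iter=100):
    """Iterations before |z| exceeds 2, or max_iter if the point never escapes."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

def render(width, height, view, out):
    """Worker: compute one row at a time and hand it to the GUI thread."""
    for py in range(height):
        row = [escape_time(pixel_to_complex(px, py, width, height, *view))
               for px in range(width)]
        out.put((py, row))
    out.put(None)  # sentinel: rendering finished

rows = queue.Queue()
view = (-2.0, 1.0, -1.5, 1.5)  # x_min, x_max, y_min, y_max
worker = threading.Thread(target=render, args=(60, 40, view, rows), daemon=True)
worker.start()
worker.join()  # a GUI would poll `rows` instead of blocking here
```

Running the whole computation inside an `after` callback keeps it on the main thread, which is the "fake async" pattern flagged in the table.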
Cross-cutting patterns
- C02 reveals the analytical profile. It’s the only test that separates a model that understands code from one that reproduces patterns. All models produce fixes; the 7/7 results with a complete analysis come from MiniMax M3 Q6, GLM 5.2, Nex N2 Pro and Qwen3.5-397B BF16. Missing explanation is near-systematic in code-specialist models (Coder-480B: 0/15 Diagnostic).
- The agentic fine-tune is measurable. Qwen3.5-397B BF16 vs Nex N2 Pro Q9 (same base): +4.7 pts average per test (+42.5 total), max C08 +11.5, only regression C07 −2, and +45% speed. Quality and speed both improve, with no trade-off.
- Thinking mode is counter-productive on code. Coder-480B Thinking 22 vs standard 39 on C02; Qwen3-Next THNK 39.5 vs INST 49 on T03. The analysis migrates into the invisible block. Implicit/adaptive thinking (Nex) is handled better than explicit.
- Inference mode affects quality, not just speed. Coder-480B Tensor 39 vs Pipeline 28 on C02 with the same model, temperature and prompt: +11 pts.
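The fine-tune comparison can be recomputed from the two scoreboard columns. The sketch below only restates published scores and speeds; the variable names are illustrative.

```python
# Per-test scores copied from the C01–C08 scoreboard.
base = {"C01": 40, "C02": 44, "C03": 46, "C04": 42, "C04-2": 36,
        "C05": 41, "C06": 43, "C07": 40, "C08": 32.5}   # Qwen3.5-397B BF16
nex = {"C01": 44, "C02": 48, "C03": 49, "C04": 48, "C04-2": 44,
       "C05": 45, "C06": 47, "C07": 38, "C08": 44}      # Nex N2 Pro Q9

delta = {test: nex[test] - base[test] for test in base}
regressions = [test for test, d in delta.items() if d < 0]
best = max(delta, key=delta.get)

print(sum(delta.values()))         # 42.5 points over the 9 tests
print(regressions, best)           # ['C07'] C08
print(round(24.3 / 16.8 - 1, 3))   # 0.446, relative speed gain
```

Keeping the per-test deltas rather than only the totals is what exposes the single C07 regression.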
Coding routing matrix
| Task | Model | Infra | Why |
|---|---|---|---|
| Everything (generalist) | MiniMax M3 Q6 | 1× M3 Ultra 512 local | 97.3%, #1 overall, top panel score on 6/9 tests |
| Python writing + debug | MiniMax M3 Q6 | 1× M3 Ultra 512 | 99% / 100% — first to dominate both |
| React dashboard | MiniMax M3 Q6 | 1× M3 Ultra 512 | 97% C04-2 |
| Swift CLI / SwiftUI | MiniMax M3 Q6 | 1× M3 Ultra 512 | 99% C07, 97% C08 |
| Python/Bash/React (Cline, fast) | Qwen3.5-122B Q8 HD16 | 1× node (telecode) | 83.7%, 38.3 tok/s, multi-instance |
| System architecture (fast local) | Nex N2 Pro Q9 | Nex MLX 3 nodes | 88.5%, 24.3 tok/s, faster than M3 Q6 (~20 tok/s) |
| Debug + critical analysis (cloud) | GLM 5.2 Cloud | OpenRouter | 98% C02 — exploitation scenario |
| Deep refactoring (cloud) | GLM 5.2 Cloud | OpenRouter | 98% C06 — 13 documented changes |
| Aider, long reasoning session | Kimi K2.7 Code | local 4-bit | thinking visible, fits Aider |
| Reference ceiling | Claude Opus 4.6 | Anthropic API | 94.8% (436/460) — passed by M3 Q6 |
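For tooling, the matrix can be encoded as a plain lookup table. This is a minimal sketch: the task labels and the `route` helper are hypothetical, and only the model and infra values come from the rows above.

```python
# (model, infra) per task, copied from the routing matrix.
# The dictionary keys are hypothetical labels chosen for this example.
ROUTES = {
    "generalist": ("MiniMax M3 Q6", "1× M3 Ultra 512 local"),
    "fast-cline": ("Qwen3.5-122B Q8 HD16", "1× node (telecode)"),
    "architecture-fast-local": ("Nex N2 Pro Q9", "Nex MLX 3 nodes"),
    "debug-analysis-cloud": ("GLM 5.2 Cloud", "OpenRouter"),
    "deep-refactoring-cloud": ("GLM 5.2 Cloud", "OpenRouter"),
    "aider-long-session": ("Kimi K2.7 Code", "local 4-bit"),
}

def route(task: str) -> tuple[str, str]:
    """Return (model, infra) for a task label, defaulting to the generalist."""
    return ROUTES.get(task, ROUTES["generalist"])

print(route("debug-analysis-cloud"))  # ('GLM 5.2 Cloud', 'OpenRouter')
```

Falling back to the generalist row mirrors the first line of the matrix, where MiniMax M3 Q6 covers "Everything".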
TMB Coding Benchmark — The Monocle Bear — April–June 2026. Panel C01-C08: 9 models · C02 expanded: 16 entries · C08: 9 models · C09: 9 models. Updated 23 June 2026.