TMB Scoreboard — Coding

The Monocle Bear · April–June 2026

  • Tests: TMB Coding C01–C08 (4 languages, /460) + C09 Mandelbrot Tkinter (/50)
  • Evaluators: Claude Opus 4.6 / Sonnet 4.6 — TMB Coding v2 rubrics
  • Infrastructure: Odysseus cluster 4× M3 Ultra + Inferencer + Ultra-512 MLX + OpenRouter

Eight numbered tests (nine scored entries counting C04-2), 4 languages, /460. A single model can finish the whole suite in a day.

| Test | Language | Subject | Max |
|---|---|---|---|
| C01 | Python | CLI — LLM Inference Benchmarker | /50 |
| C02 | Python | Debug & Fix — 7 hidden bugs | /50 |
| C03 | Python | Architecture — Cluster Monitor System | /60 |
| C04 | Python | Data Pipeline — Benchmark Cleaner | /50 |
| C04-2 | React/JS | Dashboard — TMB Scoreboard Viewer | /50 |
| C05 | Bash + Python | System & Infra — MLX Model Deployer | /50 |
| C06 | Python | Refactoring — Scoreboard Analyzer | /50 |
| C07 | Swift | macOS CLI — Cluster Health Monitor | /50 |
| C08 | SwiftUI | iOS App — EXO Model Manager | /50 |
| TOTAL | | | /460 |

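Global thresholds: 360+ (88%+) Full-stack autonomous · 290+ (71%+) Assisted senior · 225+ (55%+) Junior · <160 Unusable. The point cut-offs correspond to the stated percentages on a /410 scale (360/410 ≈ 88%), not on /460.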


Nine models. MiniMax M3 Q6 = #1 (97.3%). Kimi K2.7 Code = #2 (436.5/460, 94.9%; 96.4% on the 7 tests excluding C07/C08). Claude Opus 4.6 = frontier reference.

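| Test | Max | M3 Q6 | Kimi K2.7 ◆ | Hy3 Q9 ‡ | Coder-480B Q8 | Qwen3.5-397B BF16 | GLM 5.2 Cloud | Nex N2 Pro Q9 | Qwen3.5-122B Q8 | Qwen3-Coder-Next-80B † | Opus 4.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C01 Python CLI | 50 | 49.5 | 49 | 42 | 47 | 40 | 45 | 44 | 44 | 41 | 50 |
| C02 Debug Fix | 50 | 50 | 47 | 34 | 28 | 44 | 49 | 48 | 47 | 42 | 50 |
| C03 Architecture | 60 | 55 | 59 | 40 | 39 | 46 | 47 | 49 | 50 | 45 | 55 |
| C04 Data Pipeline | 50 | 49.5 | 50 | 47 | 48 | 42 | 49 | 48 | 44 | 45 | 49 |
| C04-2 React | 50 | 48.5 | 46 | 44 | 42 | 36 | 44 | 44 | 43 | - | 46 |
| C05 Bash+Py | 50 | 47.5 | 48 | 36 | 35.5 | 41 | 46 | 45 | 44 | 37 | 47 |
| C06 Refactoring | 50 | 49.5 | 48 | 37 | 31 | 43 | 49 | 47 | 44 | 46 | 47 |
| C07 Swift CLI | 50 | 49.5 | 43 | - | 38.5 | 40 | 38 | 38 | 33 | 32 | 43 |
| C08 SwiftUI | 50 | 48.5 | 46.5 | - | 26 | 32.5 | 43 | 44 | 36 | 37 | 49 |
| TOTAL | 460 | 447.5 (97.3%) | 436.5 (94.9%) | 280/360 (77.8%) | 335 (72.8%) | 364.5 (79.2%) | 410 (89.1%) | 407 (88.5%) | 385 (83.7%) | 325/410 (79.3%) | 436 (94.8%) |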

⭐ Best-in-class on that test. Excluding the Opus 4.6 reference, MiniMax M3 Q6 has the top score on 6/9 tests (C01, C02, C04-2, C06, C07, C08); Kimi K2.7 has it on C03, C04 and C05.
◆ Kimi K2.7 Code (reasoning, local 4-bit, ~17.8 tok/s): C01–C08 + C09 Mandelbrot (50/50 ⭐ absolute record). Role: coder model for Aider.
† Qwen3-Coder-Next-80B: C04-2 React not tested, score over /410.
‡ Hy3-preview Q9: C07/C08 not tested, score over /360.
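
The reduced denominators in the footnotes (/410, /360) follow from summing only the tests a model actually ran. Here is a minimal sketch of that bookkeeping, with scores copied from the table above. The names `MAX_POINTS`, `SCORES` and `total` are illustrative and not part of the TMB tooling.

```python
# Maximum points per test, as listed in the test table.
MAX_POINTS = {
    "C01": 50, "C02": 50, "C03": 60, "C04": 50, "C04-2": 50,
    "C05": 50, "C06": 50, "C07": 50, "C08": 50,
}

# Per-test scores copied from the scoreboard; None marks a test that was not run.
SCORES = {
    "MiniMax M3 Q6": {"C01": 49.5, "C02": 50, "C03": 55, "C04": 49.5, "C04-2": 48.5,
                      "C05": 47.5, "C06": 49.5, "C07": 49.5, "C08": 48.5},
    "Hy3 Q9": {"C01": 42, "C02": 34, "C03": 40, "C04": 47, "C04-2": 44,
               "C05": 36, "C06": 37, "C07": None, "C08": None},
    "Qwen3-Coder-Next-80B": {"C01": 41, "C02": 42, "C03": 45, "C04": 45, "C04-2": None,
                             "C05": 37, "C06": 46, "C07": 32, "C08": 37},
}


def total(scores):
    """Return (earned, possible, percent), counting only the tests that were run."""
    tested = {test: pts for test, pts in scores.items() if pts is not None}
    earned = sum(tested.values())
    possible = sum(MAX_POINTS[test] for test in tested)
    return earned, possible, round(100 * earned / possible, 1)


for model, scores in SCORES.items():
    earned, possible, pct = total(scores)
    print(f"{model}: {earned}/{possible} ({pct}%)")
```

Percentages computed this way are only comparable across models when the skipped tests are of similar difficulty, which is why the table shows the reduced denominator next to the score.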


| Model | Active params | Quant | RAM | Avg speed | Mode |
|---|---|---|---|---|---|
| Coder-480B Q8 | 35B/480B MoE | Q8 | ~480 GB | 13.3 t/s | Exo Tensor RDMA 4 nodes |
| Qwen3.5-397B BF16 | 17B/397B MoE | BF16 | ~715 GB | 16.8 t/s | Exo Pipeline RDMA 4 nodes |
| Qwen3.5-122B Q8 HD16 | ~10B/122B MoE | Q8 + BF16 heads | ~120 GB | 38.3 t/s | 1× M3 Ultra (telecode) — multi-instance |
| Nex N2 Pro Q9 | 17B/397B MoE | MLX-9bit | ~415 GB | 24.3 t/s | Nex MLX 3 nodes M3 Ultra 256 GB |
| MiniMax M3 Q6 | 21B/427B MoE | MLX 6-bit | ~320 GB | ~20 t/s | 1× M3 Ultra 512 GB solo |
| Hy3-preview Q9 | 21B/295B MoE | Q9 | ~332 GB | ~19.6 t/s | 1× M3 Ultra 512 GB solo |
| Kimi K2.7 Code | | 4-bit | | ~17.8 t/s | Local 4-bit (reasoning) |

Qwen3.5-122B Q8 HD16 (38.3 tok/s) is the fastest model in this table and, on a single node, scores higher than Qwen3.5-397B BF16 (+20.5 pts). Nex N2 Pro still scores higher (407 vs 385).


MiniMax M3 Q6 — “the complete expert, first local to cross the frontier”

MiniMax M3 (MoE 427B, in-house MLX 6-bit MSA-fixed, ~320 GB) on 1× M3 Ultra 512 GB at 447.5/460 (97.3%) — new absolute #1, ahead of Opus 4.6 (436) and GLM 5.2 Cloud (410). First local model to beat the frontier estimate on this benchmark.

What M3 breaks: the “coder ≠ analyst” split that dominated the benchmark (Coder-480B 95% writing / 56% debug; Qwen3.5 the inverse). M3 does 99% writing AND 100% debug — first model to dominate both axes at once.

Only execution slip: C03 storage bug — 14 cur.execute() without await (persistence + alerting broken at runtime). Trivial fix, but only caught at execution, not in static review.
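
The failure mode is easy to reproduce in isolation. The sketch below is not the C03 deliverable: `FakeAsyncCursor` is a hypothetical stand-in for an async database cursor, used only to show that calling an `async def` method without `await` creates a coroutine object and never runs its body.

```python
import asyncio
import warnings


class FakeAsyncCursor:
    """Hypothetical stand-in for an async DB cursor; records executed statements."""

    def __init__(self):
        self.executed = []

    async def execute(self, sql, params=()):
        await asyncio.sleep(0)  # simulate an I/O round trip
        self.executed.append((sql, params))


async def save_buggy(cur, node, load):
    # Bug pattern: the coroutine is created and discarded, so nothing is written.
    cur.execute("INSERT INTO metrics VALUES (?, ?)", (node, load))


async def save_fixed(cur, node, load):
    # Fix: await the coroutine so the statement actually runs.
    await cur.execute("INSERT INTO metrics VALUES (?, ?)", (node, load))


def run_demo():
    """Return how many statements each variant really executed."""
    buggy, fixed = FakeAsyncCursor(), FakeAsyncCursor()
    with warnings.catch_warnings():
        # Python reports "coroutine ... was never awaited" as a RuntimeWarning.
        warnings.simplefilter("ignore", RuntimeWarning)
        asyncio.run(save_buggy(buggy, "node1", 0.5))
    asyncio.run(save_fixed(fixed, "node1", 0.5))
    return len(buggy.executed), len(fixed.executed)


print(run_demo())
```

Nothing in the buggy variant raises an exception, which is consistent with the bug passing static review and only surfacing at execution. Running the test suite with `-W error::RuntimeWarning` is one way to turn the un-awaited coroutine warning into a hard failure.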

| Domain | Score | Note |
|---|---|---|
| Python (writing + debug) | 99% / 100% | Total domination |
| Bash / infra | 95% | Zero GNU-isms, bounded parallelism |
| React / frontend | 97% | ScatterChart quadrants on real median |
| Swift | 99% | The only model with no compile error |
| Architecture | 91.7% | Async storage bug in C03 (−4), trivial fix |

Kimi K2.7 Code — “the reasoning coder, 2.4% behind M3”

Reasoning model (local 4-bit, ~17.8 tok/s): 436.5/460 (94.9%), #2. Declared role: coder model for Aider. Two absolute records: C03 Architecture (59/60, #1) and C04 Data Pipeline (50/50, #1). On C03 it beats even M3 Q6 (55) and Opus 4.6 (55). Thinking blocks are visible in the deliverables, so configure Aider (--reasoning-effort) to strip them before diff parsing. Strong on Python/Bash, no real gaps; the only weak spots are a React multi-quant selection bug (C04-2, 46) and 6/7 bugs found in debug (C02, 47). Aider verdict: strong on Python/Bash/React.
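
If the thinking blocks have to be removed outside Aider, a small pre-processing step can do it. The sketch below assumes the blocks are delimited by `<think>` and `</think>` tags, which is an assumption about the output format and not something the benchmark notes confirm. `strip_thinking` is a hypothetical helper name.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>; adjust the tag if the model differs.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every <think>...</think> block so only the deliverable remains."""
    return THINK_BLOCK.sub("", text)


raw = "<think>plan the edit\nthen write the diff</think>\ndiff --git a/app.py b/app.py"
print(strip_thinking(raw))
```

The non-greedy `.*?` with `re.DOTALL` keeps each match to a single block, so several thinking blocks in one response are removed independently.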

Hy3-preview Q9 — “a 295B that performs like a 480B — on one node”

Tencent Hy3-preview (MoE 295B / 21B active, Q9, ~332 GB) on 1× M3 Ultra 512: 280/360 (77.8%) on 7 comparable tests (C07/C08 not run). Running on one node frees the other 3 during execution. Clear pattern: excels at single-deliverable, precise-spec tasks; drops to Junior the moment the task needs analysis or documentation. Vs Coder-480B on 4 nodes: Hy3 scores higher on 5 of the 7 common tests (Coder keeps C01 and C04) and is +9.5 pts ahead overall, on 1 node instead of 4 and about 50% faster (~19.6 vs 13.3 t/s). Mandelbrot (C09) = panel #2 (48/50), behind Kimi K2.7 Code (50/50). But C06 Section D = 1/10 (delivers refactored code with no documentation), the same pattern as C02 (34/50): it executes, but doesn’t reason about code.

Coder-480B Q8 — “codes like a senior, explains like an intern”

The most asymmetric profile of the benchmark. Crushes writing (+6 to +7 vs Qwen3.5 on C01/C04/C04-2) but collapses on anything needing explanation, diagnosis or justification (C02 at 56%, C06 at 62%). The rubric detail shows it: 0/15 Diagnostic on C02 and 0/10 Justification on C06. Functional fixes with no “why”: a mute fixer. Qwen3.5 finds 7/7 bugs with full root cause on C02.

Qwen3.5-397B BF16 — “the polyglot analyst”

Wins overall vs Coder (+29.5 pts) precisely on the tests where explanation counts: C02 debug 88% (7/7 bugs, zero false positives), C06 86% (before/after justification). Weakness: modern frontend — React 72% (no TypeScript, hard-coded data), SwiftUI 65% (pre-2023 patterns, no real streaming).

GLM 5.2 Cloud — “the all-rounder that tops the ceiling on analytical tests”

410/460 (89.1%), between Opus 4.6 (94.8%) and Nex N2 Pro (88.5%). Homogeneous — no spectacular collapse. Two absolute records: C02 (49/50) and C06 (49/50) — on C02 the panel’s most complete analysis (full shell-injection exploitation scenario, not just the fix); on C06, 13 documented changes with before/after/why. Only significant gap: C07 Swift CLI (38/50) — ArgumentParser compile bug + Apple Silicon page size wrong (4096 vs 16384 → ×4 RAM under-estimate).
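
The ×4 factor comes straight from the page-count arithmetic: memory counters reported in pages must be multiplied by the real page size. A minimal sketch follows, in Python rather than the Swift used in C07. The page count is a made-up illustrative value and `pages_to_gib` is a hypothetical helper.

```python
import mmap


def pages_to_gib(page_count, page_size=mmap.PAGESIZE):
    """Convert a page count to GiB; defaults to the page size reported by the OS."""
    return page_count * page_size / 2**30


free_pages = 2_000_000  # hypothetical page count, for illustration only

assumed_4k = pages_to_gib(free_pages, 4096)    # hard-coded 4 KiB pages
actual_16k = pages_to_gib(free_pages, 16384)   # 16 KiB pages, as on Apple Silicon

print(f"4 KiB assumption: {assumed_4k:.1f} GiB, 16 KiB pages: {actual_16k:.1f} GiB")
print(f"ratio: {actual_16k / assumed_4k:.0f}x")
```

Querying the page size at run time (here through `mmap.PAGESIZE`) instead of hard-coding 4096 avoids this class of error on any platform.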

Nex N2 Pro Q9 — “the agentic fine-tune that validates the base model”

Nex N2 Pro (MLX-9bit, 3× M3 Ultra 256 GB, ~24.3 tok/s) at 407/460 (88.5%), just under GLM 5.2 Cloud. Same base architecture as Qwen3.5-397B BF16, fine-tuned for agentic workflows. Gains vs base on 8 of 9 tests (+3 to +11.5 pts/test), including C01 +4, C04 +6, C04-2 +8 and C08 +11.5, correlated with domain depth. Local records: C03 Architecture (49/60, #1), C04-2 React (44/50, #1 tied). Only regression: C07 Swift (−2).
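
The per-test comparison against the base model can be recomputed from the main scoreboard. The sketch below uses the Qwen3.5-397B BF16 and Nex N2 Pro Q9 columns as printed there; the variable and function names are illustrative.

```python
TESTS = ["C01", "C02", "C03", "C04", "C04-2", "C05", "C06", "C07", "C08"]

# Scores copied from the main scoreboard, in the order of TESTS.
BASE_QWEN_397B = [40, 44, 46, 42, 36, 41, 43, 40, 32.5]
NEX_N2_PRO = [44, 48, 49, 48, 44, 45, 47, 38, 44]


def per_test_delta(tuned, base, tests=TESTS):
    """Return {test: tuned score minus base score}."""
    return {test: t - b for test, t, b in zip(tests, tuned, base)}


deltas = per_test_delta(NEX_N2_PRO, BASE_QWEN_397B)
regressions = [test for test, delta in deltas.items() if delta < 0]
largest_gain = max(deltas, key=deltas.get)

for test, delta in deltas.items():
    print(f"{test}: {delta:+g}")
print("regressions:", regressions, "largest gain:", largest_gain)
```

Listing the deltas test by test makes it easy to see where the fine-tune helps most and to spot a regression that a single total would hide.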

Qwen3.5-122B Q8 HD16 — “the 10B-active that beats the 17B, with review bugs”

Qwen3.5-122B-A10B (MoE 122B / ~10B active, Q8 + BF16 heads, MLX, telecode) on 1 node — 385/460 (83.7%), #5. Counter-intuitive: beats the 397B BF16 (17B active) with 7B fewer active params — quality per active expert beats total parameter volume. Speed: 38.3 tok/s solo, multi-instance. Pattern: conceptually correct code with local (1–5 line) implementation bugs caught on the first run. Strong on Python/debug (near Nex). Weak only on Swift (33/50 C07, 36/50 C08 — compile errors on dead code). With an automatic compile step in the Cline loop, ~80% auto-corrected.
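
An automatic compile step of this kind can be as simple as running the compiler and handing its diagnostics back to the model. The sketch below is one possible shape and does not describe the actual Cline setup. `compile_check` and `feedback_for_model` are hypothetical helpers, and the Swift command in the usage note is an assumption about the toolchain.

```python
import subprocess
import sys


def compile_check(command, timeout=120):
    """Run a compile or type-check command; return (ok, combined diagnostics)."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0, (result.stdout + result.stderr).strip()


def feedback_for_model(command):
    """Return None when the build passes, else a message to send back to the model."""
    ok, diagnostics = compile_check(command)
    if ok:
        return None
    return "The build failed. Fix these errors:\n" + diagnostics


# Self-contained demo: a Python one-liner stands in for a failing compiler.
failing = [sys.executable, "-c",
           "import sys; sys.stderr.write('error: demo failure'); sys.exit(1)"]
print(feedback_for_model(failing))
```

For Swift deliverables the command could be something like `["swiftc", "-typecheck", "main.swift"]`, where the file name is a placeholder. The loop then repeats generation until `feedback_for_model` returns `None` or a retry limit is reached.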

Qwen3-Coder-Next-80B — “the honest Python-first, the broken Swift-last”

80B dense (not MoE). 8 tests, 325/410 (79.3%) — “assisted senior”. Distinctive trait: analytical integrity — on C02 it finds 6/7 bugs and explicitly refuses to invent a 7th (“that would be malpractice”), zero false positives. Python excellent (C04 90%, C06 92%). Swift is the absolute floor (C07 64%, C08 74% — fundamental syntax error, let id: String { hostname } must be var). Bash multi-process broken (subshell IPC). For Cline ACT: recommended for Python with a structured plan; not for Swift without a compile pipeline.


C02 — Debug Fix — expanded panel (16 entries)

C02 is the most discriminating test: detect 7 hidden bugs, structured diagnosis, fix, security analysis.

| # | Model | /50 | Bugs/7 | Shell fix / Bug #7 | Analysis |
|---|---|---|---|---|---|
| 1 | MiniMax M3 Q6 | 50 | 7/7 | | Perfect — 7/7 + structured exploitation scenario |
| 2 | GLM 5.2 Cloud | 49 | 7/7 | | Complete (root cause + exploitation scenario) |
| 3 | Nex N2 Pro Q9 | 48 | 7/7 | | Complete (root cause + trigger scenario) |
| 4 | Kimi K2.7 Code | 47 | 6/7 | | Complete — Bug 5 implicit |
| 5 | Qwen3.5-397B BF16 | 44 | 7/7 | | Complete |
| 6 | Qwen3-Coder-Next-80B | 42 | 6/7 | | Complete — refuses the 7th |
| 7 | Coder-480B Q8 Tensor | 39 | 5/7 | | Structured |
| 8 | Hy3-preview Q9 | 34 | 6/7 | | Partial — no root cause, no exploitation |
| 9 | Kimi K2.5 Q8 Pipeline / Tensor | 35 | 6–7/7 | mixed | Inline |
| 11 | Qwen3.5-397B Q9 Exo | 35 | 6/7 | | Missing |
| 12 | Coder-480B Q8 Pipeline / FP16 | 28 | 5/7 | | Minimal |
| 14 | Qwen3.5-35B-A3B Q9 | 26 | 6/7 | | None |
| 15 | Mistral Small 3.1 24B Q8 | 23 | 4–5/7 | | None |
| 16 | Coder-480B Q8 Thinking / LongCat Flash Lite Q9 | 22 | 5/7 | mixed | Invisible / none |

Findings: MiniMax M3 Q6 (50), GLM 5.2 Cloud (49) and Nex N2 Pro (48) lead the analytical panel. Four models find 7/7 with full analysis: M3 Q6, GLM Cloud, Nex, Qwen3.5 BF16. Inference mode matters on the same model: Coder-480B Tensor 39 vs Pipeline 28 (+11), and Thinking is even worse (22; the analysis migrates into the invisible block). A 3B-active MoE (Qwen3.5-35B, 76.4 t/s) beats a dense 24B (Mistral Small) on quality and speed.


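| # | Model | C08 SwiftUI (/50) | Level |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 49 | iOS Senior |
| 2 | MiniMax M3 Q6 | 48.5 | iOS Senior |
| 3 | Kimi K2.7 Code | 46.5 | iOS Senior |
| 4 | Nex N2 Pro Q9 | 44 | Good iOS |
| 5 | Kimi K2.5 Q8 / GLM 5.2 Cloud | 43–43.5 | Good iOS |
| 6 | GLM-5 Q8 | 41 | Good iOS |
| 7 | Qwen3-Coder-Next-80B | 37 | Intermediate |
| 8 | Qwen3.5-397B BF16 | 32.5 | Junior |
| 9 | Coder-480B Q8 | 26 | Basic |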

C09 Mandelbrot Tkinter (/50), interactive GUI: threading, zoom rectangle, view history, color palette, numpy forbidden, inverted Y axis.

| # | Model | C09 (/50) | Runs? | Threading |
|---|---|---|---|---|
| 1 | Kimi K2.7 Code | 50 | ✓ | ✓ Real thread + smooth coloring |
| 2 | Hy3-preview Q9 | 48 | ✓ | ✓ Real thread, HSV inline |
| 3 | GLM 5.2 Cloud / Qwen 3.6 Plus | 47 | ✓ | ✓ Real thread, HSV colorsys |
| 5 | Qwen3.5-397B Q9 / Qwen3.5-122B Q8 | 46 | | mixed |
| 7 | Nex N2 Pro Q9 | 38 | | fake async (after) |
| 8 | LongCat Flash Lite / Mistral Small | 29–32 | | |
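
Three of the C09 constraints (no numpy, inverted Y axis, a real worker thread) can be illustrated without any GUI code. The sketch below is not a reference solution to the test. The function names are hypothetical, and a Tkinter front end is assumed to poll the queue from the main thread (for example with `after`) and paint each row as it arrives.

```python
import queue
import threading


def pixel_to_complex(px, py, width, height, view):
    """Map a pixel to the complex plane (width and height must be > 1).

    Screen Y grows downward, so the imaginary axis is flipped: row 0 maps to y_max.
    """
    x_min, x_max, y_min, y_max = view
    re = x_min + (x_max - x_min) * px / (width - 1)
    im = y_max - (y_max - y_min) * py / (height - 1)
    return complex(re, im)


def escape_time(c, max_iter=100):
    """Pure-Python escape-time iteration: steps before |z| exceeds 2."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def render_rows(width, height, view, out, max_iter=100):
    """Worker body: push (row_index, iteration_counts) per row, then None when done."""
    for py in range(height):
        row = [escape_time(pixel_to_complex(px, py, width, height, view), max_iter)
               for px in range(width)]
        out.put((py, row))
    out.put(None)


def render_in_thread(width, height, view, max_iter=100):
    """Start the computation in a background thread; return (thread, result queue)."""
    out = queue.Queue()
    worker = threading.Thread(target=render_rows,
                              args=(width, height, view, out, max_iter),
                              daemon=True)
    worker.start()
    return worker, out
```

Passing results through a `queue.Queue` keeps all widget updates on the main thread, which Tkinter requires, while the computation stays off the event loop.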

  • C02 reveals the analytical profile. It’s the only test that separates a model that understands code from one that reproduces patterns. All produce fixes; only GLM 5.2, Nex N2 Pro and Qwen3.5 explain why with full root cause. Missing explanation is near-systematic in code-specialist models (Coder-480B: 0/15 Diagnostic).
  • The agentic fine-tune is measurable. Qwen3.5-397B BF16 vs Nex N2 Pro Q9 (same base): +4.7 pts per test on average (+42.5 over 9 tests), max C08 +11.5, only regression C07 −2, and +45% speed (24.3 vs 16.8 t/s). Quality and speed both improve, with no trade-off.
  • Thinking mode is counter-productive on code. Coder-480B Thinking 22 vs standard 39 on C02; Qwen3-Next THNK 39.5 vs INST 49 on T03. The analysis migrates into the invisible block. Implicit/adaptive thinking (Nex) is handled better than explicit.
  • Inference mode affects quality, not just speed. Coder-480B Tensor 39 vs Pipeline 28 on C02 — same model, temperature, prompt. +11 pts.

| Task | Model | Infra | Why |
|---|---|---|---|
| Everything (generalist) | MiniMax M3 Q6 | 1× M3 Ultra 512 local | 97.3% — #1 overall, top panel score on 6/9 tests |
| Python writing + debug | MiniMax M3 Q6 | 1× M3 Ultra 512 | 99% / 100% — first to dominate both |
| React dashboard | MiniMax M3 Q6 | 1× M3 Ultra 512 | 97% C04-2 |
| Swift CLI / SwiftUI | MiniMax M3 Q6 | 1× M3 Ultra 512 | 99% C07, 97% C08 |
| Python/Bash/React (Cline, fast) | Qwen3.5-122B Q8 HD16 | 1× node (telecode) | 83.7%, 38.3 tok/s, multi-instance |
| System architecture (fast local) | Nex N2 Pro Q9 | Nex MLX 3 nodes | 88.5% — #1 local quality, 24.3 tok/s |
| Debug + critical analysis (cloud) | GLM 5.2 Cloud | OpenRouter | 98% C02 — exploitation scenario |
| Deep refactoring (cloud) | GLM 5.2 Cloud | OpenRouter | 98% C06 — 13 documented changes |
| Aider, long reasoning session | Kimi K2.7 Code | local 4-bit | thinking visible, fits Aider |
| Reference ceiling | Claude Opus 4.6 | Anthropic API | 94.8% (436/460) — passed by M3 Q6 |

TMB Coding Benchmark — The Monocle Bear — April–June 2026. Panel C01-C08: 9 models · C02 expanded: 16 entries · C08: 9 models · C09: 9 models. Updated 23 June 2026.