TMB Scoreboard — Coding

The Monocle Bear · April–June 2026

Tests: TMB Coding C01–C08 (4 languages, /460) + C09 Mandelbrot Tkinter (/50)
Evaluators: Claude Opus 4.6 / Sonnet 4.6 — TMB Coding v2 rubrics
Infrastructure: Odysseus cluster 4× M3 Ultra + Inferencer + Ultra-512 MLX + OpenRouter

The battery

9 tests (C01–C08 plus C04-2), 4 languages, /460. A single model can finish the full battery in a day.

Test	Language	Subject	Max
C01	Python	CLI — LLM Inference Benchmarker	/50
C02	Python	Debug & Fix — 7 hidden bugs	/50
C03	Python	Architecture — Cluster Monitor System	/60
C04	Python	Data Pipeline — Benchmark Cleaner	/50
C04-2	React/JS	Dashboard — TMB Scoreboard Viewer	/50
C05	Bash + Python	System & Infra — MLX Model Deployer	/50
C06	Python	Refactoring — Scoreboard Analyzer	/50
C07	Swift	macOS CLI — Cluster Health Monitor	/50
C08	SwiftUI	iOS App — EXO Model Manager	/50
TOTAL			/460

Global thresholds (by percentage): 88%+ Full-stack autonomous · 71%+ Assisted senior · 55%+ Junior. The point cut-offs 360+ / 290+ / 225+ (and <160 Unusable) correspond to those percentages on a /410 total, not on /460.
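As a minimal sketch of how the tiers can be applied consistently to totals with different maxima (/460, /410, /360), the classification below works on percentages only. The function name `tier` and the label "Below Junior" are illustrative assumptions; the source names only "Unusable" below the Junior band.

```python
# Tier cut-offs as percentages, taken from the thresholds line above.
TIERS = [
    (88.0, "Full-stack autonomous"),
    (71.0, "Assisted senior"),
    (55.0, "Junior"),
]

def tier(points: float, max_points: float) -> str:
    """Return the tier label for a score, using percentage cut-offs."""
    pct = 100.0 * points / max_points
    for cutoff, label in TIERS:
        if pct >= cutoff:
            return label
    # Assumption: everything under the Junior cut-off is reported generically,
    # because the source does not name the band between Unusable and Junior.
    return "Below Junior"

print(tier(447.5, 460))  # MiniMax M3 Q6
print(tier(325, 410))    # Qwen3-Coder-Next-80B, scored over /410
```

Working in percentages avoids having to restate the point cut-offs each time a model skips a test.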

Scoreboard C01–C08

Nine models plus the Opus 4.6 reference. MiniMax M3 Q6 = #1 (447.5/460, 97.3%). Kimi K2.7 Code = #2 (436.5/460, 94.9%; 96.4% on the 7 tests C01–C06). Claude Opus 4.6 = frontier reference (436/460, 94.8%).

Test	Max	M3 Q6 ★	Kimi K2.7 ◆	Hy3 Q9	Coder-480B Q8	Qwen3.5-397B BF16	GLM 5.2 Cloud	Nex N2 Pro Q9	Qwen3.5-122B Q8	Qwen3-Coder-Next-80B	Opus 4.6
C01 Python CLI	50	49.5	49	42	47	40	45	44	44	41	50
C02 Debug Fix	50	50	47	34	28	44	49	48	47	42	50
C03 Architecture	60	55	59 ⭐	40	39	46	47	49	50	45	55
C04 Data Pipeline	50	49.5	50 ⭐	47	48	42	49	48	44	45	49
C04-2 React	50	48.5	46	44	42	36	44	44	43	—	46
C05 Bash+Py	50	47.5	48	36	35.5	41	46	45	44	37	47
C06 Refactoring	50	49.5	48	37	31	43	49	47	44	46	47
C07 Swift CLI	50	49.5	43	—	38.5	40	38	38	33	32	43
C08 SwiftUI	50	48.5	46.5	—	26	32.5	43	44	36	37	49
TOTAL	460	447.5 (97.3%)	436.5 (94.9%) ◆	280/360 (77.8%) ‡	335 (72.8%)	364.5 (79.2%)	410 (89.1%)	407 (88.5%)	385 (83.7%)	325/410 (79.3%) †	436 (94.8%)

⭐ Best-in-class on that test. Among panel models (Opus 4.6 reference excluded), MiniMax M3 Q6 has the top score on 6/9 tests (C01, C02, C04-2, C06, C07, C08); Kimi K2.7 leads on C03, C04 and C05.
◆ Kimi K2.7 Code (reasoning, local 4-bit, ~17.8 tok/s): C01–C08 + C09 Mandelbrot (50/50 ⭐ absolute record). Role: coder model for Aider.
† Qwen3-Coder-Next-80B: C04-2 React not tested, score over /410.
‡ Hy3-preview Q9: C07/C08 not tested, score over /360.
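The partial totals in the † and ‡ notes follow from dropping untested tests from both the numerator and the maximum. A small sketch of that arithmetic, using the per-test scores from the table (the helper name `total` and the use of `None` for a dash are illustrative choices):

```python
# Maximum points per test, from the battery table.
MAX_POINTS = {"C01": 50, "C02": 50, "C03": 60, "C04": 50, "C04-2": 50,
              "C05": 50, "C06": 50, "C07": 50, "C08": 50}

def total(scores):
    """Sum points and maxima over the tests a model actually ran.

    Tests set to None (a dash in the table) are excluded from both sums.
    """
    ran = {test: pts for test, pts in scores.items() if pts is not None}
    points = sum(ran.values())
    max_points = sum(MAX_POINTS[test] for test in ran)
    return points, max_points, round(100 * points / max_points, 1)

hy3 = {"C01": 42, "C02": 34, "C03": 40, "C04": 47, "C04-2": 44,
       "C05": 36, "C06": 37, "C07": None, "C08": None}
coder_next = {"C01": 41, "C02": 42, "C03": 45, "C04": 45, "C04-2": None,
              "C05": 37, "C06": 46, "C07": 32, "C08": 37}

print(total(hy3))         # (280, 360, 77.8)
print(total(coder_next))  # (325, 410, 79.3)
```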

Infra & speed

Model	Active params	Quant	RAM	Avg speed	Mode
Coder-480B Q8	35B/480B MoE	Q8	~480 GB	13.3 t/s	Exo Tensor RDMA 4 nodes
Qwen3.5-397B BF16	17B/397B MoE	BF16	~715 GB	16.8 t/s	Exo Pipeline RDMA 4 nodes
Qwen3.5-122B Q8 HD16	~10B/122B MoE	Q8 + BF16 heads	~120 GB	38.3 t/s	1× M3 Ultra (telecode) — multi-instance
Nex N2 Pro Q9	17B/397B MoE	MLX-9bit	~415 GB	24.3 t/s	Nex MLX 3 nodes M3 Ultra 256 GB
MiniMax M3 Q6	21B/427B MoE	MLX 6-bit	~320 GB	~20 t/s	1× M3 Ultra 512 GB solo
Hy3-preview Q9	21B/295B MoE	Q9	~332 GB	~19.6 t/s	1× M3 Ultra 512 GB solo
Kimi K2.7 Code	—	4-bit	—	~17.8 t/s	Local 4-bit (reasoning)

Qwen3.5-122B Q8 HD16 (38.3 tok/s) is #1 in speed and, on a single node, scores higher than Qwen3.5-397B BF16 on four nodes (385 vs 364.5, +20.5 pts). Nex N2 Pro still scores higher than the 122B (407 vs 385).

Model profiles

MiniMax M3 Q6 — “the complete expert, first local to cross the frontier”

MiniMax M3 (MoE 427B, in-house MLX 6-bit MSA-fixed, ~320 GB) on 1× M3 Ultra 512 GB at 447.5/460 (97.3%) — new absolute #1, ahead of Opus 4.6 (436) and GLM 5.2 Cloud (410). First local model to beat the frontier estimate on this benchmark.

What M3 breaks: the “coder ≠ analyst” split that dominated the benchmark (Coder-480B 95% writing / 56% debug; Qwen3.5 the inverse). M3 does 99% writing AND 100% debug — first model to dominate both axes at once.

Only execution slip: C03 storage bug — 14 cur.execute() without await (persistence + alerting broken at runtime). Trivial fix, but only caught at execution, not in static review.
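The class of bug described here can be illustrated with a stand-in cursor. This is not the C03 code; `FakeCursor`, `save_buggy` and `save_fixed` are hypothetical names. The sketch shows why a missing `await` passes a read-through: calling an async method without awaiting it only creates a coroutine object, and CPython reports this with a RuntimeWarning at run time rather than an error.

```python
import asyncio

class FakeCursor:
    """Hypothetical stand-in for an async database cursor."""

    def __init__(self):
        self.rows = []

    async def execute(self, sql, params=()):
        await asyncio.sleep(0)  # pretend to do I/O
        self.rows.append((sql, params))

async def save_buggy(cur):
    # BUG: without await, this only builds a coroutine object;
    # the statement never runs and nothing is persisted.
    coro = cur.execute("INSERT INTO metrics VALUES (?)", (1,))
    coro.close()  # closed only to silence the "never awaited" warning here

async def save_fixed(cur):
    await cur.execute("INSERT INTO metrics VALUES (?)", (1,))

async def main():
    buggy, fixed = FakeCursor(), FakeCursor()
    await save_buggy(buggy)
    await save_fixed(fixed)
    return len(buggy.rows), len(fixed.rows)

print(asyncio.run(main()))  # (0, 1)
```

The fix is a one-word change per call, which matches the "trivial fix" assessment; the cost is in finding all occurrences.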

Domain	Score	Note
Python (writing + debug)	99% / 100%	Total domination
Bash / infra	95%	Zero GNU-isms, bounded parallelism
React / frontend	97%	ScatterChart quadrants on real median
Swift	99%	The only model with no compile error
Architecture	91.7%	Async storage bug in C03 (−4), trivial fix

Kimi K2.7 Code — “the reasoning coder, 2.4 pts behind M3”

Reasoning model (local 4-bit, ~17.8 tok/s): 436.5/460 (94.9%), #2. Declared role: coder model for Aider. Two absolute records: C03 Architecture (59/60, #1) and C04 Data Pipeline (50/50, #1). C03 beats even M3 Q6 (55) and Opus 4.6 (55); Kimi also edges M3 on C04 (50 vs 49.5) and C05 (48 vs 47.5). Thinking blocks are visible in the deliverables, so configure Aider (--reasoning-effort) to strip them before diff parsing. Strong on Python/Bash with no real gaps; the weak spots are Swift CLI (C07, 43, its lowest score), a React multi-quant selection bug (C04-2, 46) and 6/7 in debug (C02, 47). Aider verdict: strong on Python/Bash/React.
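Whichever tool does the stripping, the operation itself is simple. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags, which is a common convention but not confirmed by the source for this model; the tag name and the function `strip_thinking` are assumptions to adapt to the actual output.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>. Adjust the tag
# to whatever the serving stack actually emits.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove reasoning blocks so only the deliverable remains."""
    return THINK_BLOCK.sub("", text)

raw = "<think>plan the patch\nstep 2</think>\ndef add(a, b):\n    return a + b\n"
print(strip_thinking(raw))
```

The non-greedy `.*?` with `re.DOTALL` keeps multi-line blocks from swallowing code that sits between two separate thinking sections.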

Hy3-preview Q9 — “a 295B that performs like a 480B — on one node”

Tencent Hy3-preview (MoE 295B / 21B active, Q9, ~332 GB) on 1× M3 Ultra 512: 280/360 (77.8%) on 7 comparable tests (C07/C08 not run). Running on one node frees the other 3 during execution. Clear pattern: excels at single-deliverable, precise-spec tasks; drops to Junior the moment the task needs analysis or documentation. Vs Coder-480B on 4 nodes: Hy3 > Coder on 5/7 common tests (+9.5 pts overall, 280 vs 270.5), on a quarter of the hardware and ~47% faster (19.6 vs 13.3 t/s). Mandelbrot = panel #2 (48/50, behind Kimi K2.7 Code). But C06 Section D = 1/10 (delivers refactored code with no documentation), same pattern as C02 (34/50): it executes but doesn't reason about code.

Coder-480B Q8 — “codes like a senior, explains like an intern”

The most asymmetric profile of the benchmark. Crushes writing (+6 to +7 vs Qwen3.5-397B BF16 on C01/C04/C04-2) but collapses on anything needing explanation, diagnosis or justification (C02 at 56%, C06 at 62%). The sub-scores show it: 0/15 Diagnostic on C02 and 0/10 Justification on C06. It ships functional fixes with no “why”: a mute fixer. Qwen3.5-397B BF16 finds 7/7 bugs with full root cause on C02.

Qwen3.5-397B BF16 — “the polyglot analyst”

Wins overall vs Coder (+29.5 pts) precisely on the tests where explanation counts: C02 debug 88% (7/7 bugs, zero false positives), C06 86% (before/after justification). Weakness: modern frontend — React 72% (no TypeScript, hard-coded data), SwiftUI 65% (pre-2023 patterns, no real streaming).

GLM 5.2 Cloud — “the all-rounder that tops the ceiling on analytical tests”

410/460 (89.1%), between Opus 4.6 (94.8%) and Nex N2 Pro (88.5%). Homogeneous, with no spectacular collapse. Two near-ceiling scores: C02 (49/50) and C06 (49/50), behind only M3 Q6 among panel models (50 and 49.5). On C02 it delivers the panel’s most complete analysis (full shell-injection exploitation scenario, not just the fix); on C06, 13 documented changes with before/after/why. Only significant gap: C07 Swift CLI (38/50), with an ArgumentParser compile bug and the wrong Apple Silicon page size (4096 vs 16384 → ×4 RAM under-estimate).
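The page-size error is easy to quantify. The sketch below is in Python rather than Swift and uses a made-up page count; it only illustrates why hard-coding 4096 under-reports memory by a factor of 4 when the real page size is 16384, and that the value can be queried from the OS instead (`os.sysconf` is POSIX-only).

```python
import os

def pages_to_bytes(pages: int, page_size: int) -> int:
    """Convert a page count (e.g. parsed from VM statistics) to bytes."""
    return pages * page_size

# Ask the OS instead of hard-coding a value (POSIX systems only).
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

free_pages = 1_000_000  # hypothetical page count, for illustration only
wrong = pages_to_bytes(free_pages, 4096)    # hard-coded 4 KiB assumption
right = pages_to_bytes(free_pages, 16384)   # 16 KiB pages, per the text above
print(right // wrong)  # 4
```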

Nex N2 Pro Q9 — “the agentic fine-tune that validates the base model”

Nex N2 Pro (MLX-9bit, 3× M3 Ultra 256 GB, ~24.3 tok/s) at 407/460 (88.5%), just under GLM 5.2 Cloud. Same base architecture as Qwen3.5-397B BF16, fine-tuned for agentic workflows. Gains vs base on 8 of 9 tests (+3 to +11.5 pts/test): C01 +4, C04 +6, C04-2 +8, C08 +11.5, correlated with domain depth. Best results: C03 Architecture (49/60) and C04-2 React (44/50), both now below M3 Q6 and Kimi K2.7 in the scoreboard. Only regression: C07 Swift (−2).
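The per-test deltas can be recomputed directly from the two scoreboard columns. A short sketch, with the scores copied from the C01–C08 table (the variable names are illustrative):

```python
TESTS = ["C01", "C02", "C03", "C04", "C04-2", "C05", "C06", "C07", "C08"]
BASE = [40, 44, 46, 42, 36, 41, 43, 40, 32.5]   # Qwen3.5-397B BF16
NEX = [44, 48, 49, 48, 44, 45, 47, 38, 44]      # Nex N2 Pro Q9

# Positive delta = the fine-tune scores higher than the base model.
deltas = {test: nex - base for test, base, nex in zip(TESTS, BASE, NEX)}
best = max(deltas, key=deltas.get)
worst = min(deltas, key=deltas.get)

print(deltas)
print(best, deltas[best])    # C08 11.5
print(worst, deltas[worst])  # C07 -2
```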

Qwen3.5-122B Q8 HD16 — “the 10B-active that beats the 17B, with review bugs”

Qwen3.5-122B-A10B (MoE 122B / ~10B active, Q8 + BF16 heads, MLX, telecode) on 1 node — 385/460 (83.7%), #5. Counter-intuitive: beats the 397B BF16 (17B active) with 7B fewer active params — quality per active expert beats total parameter volume. Speed: 38.3 tok/s solo, multi-instance. Pattern: conceptually correct code with local (1–5 line) implementation bugs caught on the first run. Strong on Python/debug (near Nex). Weak only on Swift (33/50 C07, 36/50 C08 — compile errors on dead code). With an automatic compile step in the Cline loop, ~80% auto-corrected.

Qwen3-Coder-Next-80B — “the honest Python-first, the broken Swift-last”

80B dense (not MoE). 8 tests, 325/410 (79.3%) — “assisted senior”. Distinctive trait: analytical integrity — on C02 it finds 6/7 bugs and explicitly refuses to invent a 7th (“that would be malpractice”), zero false positives. Python excellent (C04 90%, C06 92%). Swift is the absolute floor (C07 64%, C08 74% — fundamental syntax error, let id: String { hostname } must be var). Bash multi-process broken (subshell IPC). For Cline ACT: recommended for Python with a structured plan; not for Swift without a compile pipeline.

C02 — Debug Fix — expanded panel (17 entries)

C02 is the most discriminating test: detect 7 hidden bugs, structured diagnosis, fix, security analysis.

#	Model	/50	Bugs/7	Shell fix	Bug #7	Analysis
1	MiniMax M3 Q6 ⭐	50	7/7	✓	✓	Perfect — 7/7 + structured exploitation scenario
2	GLM 5.2 Cloud ⭐	49	7/7	✓	✓	Complete (root cause + exploitation scenario)
3	Nex N2 Pro Q9	48	7/7	✓	✓	Complete (root cause + trigger scenario)
4	Kimi K2.7 Code	47	6/7	✓	✓	Complete — Bug 5 implicit
5	Qwen3.5-397B BF16	44	7/7	✓	✓	Complete
6	Qwen3-Coder-Next-80B	42	6/7	✓	✗	Complete — refuses the 7th
7	Coder-480B Q8 Tensor	39	5/7	✓	✗	Structured
8	Kimi K2.5 Q8 Pipeline / Tensor	35	6–7/7	✓	mixed	Inline
10	Qwen3.5-397B Q9 Exo	35	6/7	⚠	✗	Missing
11	Hy3-preview Q9	34	6/7	✓	✓	Partial — no root cause, no exploitation
12	Coder-480B Q8 Pipeline / FP16	28	5/7	✓	✗	Minimal
14	Qwen3.5-35B-A3B Q9	26	6/7	✓	✗	None
15	Mistral Small 3.1 24B Q8	23	4–5/7	✓	✗	None
16	Coder-480B Q8 Thinking / LongCat Flash Lite Q9	22	5/7	⚠	mixed	Invisible / none

Findings: MiniMax M3 Q6 (50), GLM 5.2 Cloud (49) and Nex N2 Pro (48) lead the analytical panel. Four models find 7/7 with full analysis: M3 Q6, GLM Cloud, Nex, Qwen3.5 BF16. Inference mode matters on the same model: Coder-480B Tensor 39 vs Pipeline 28 (+11), Thinking even worse (22, as the analysis migrates into the invisible block). A 3B-active MoE (Qwen3.5-35B, 76.4 t/s) beats a dense 24B (Mistral Small) on quality and speed.
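The "Shell fix" column in the C02 table tracks whether the shell-injection bug was fixed. The C02 source code is not reproduced in this scoreboard, so the following is only a generic sketch of that class of bug and its usual fix, assuming a POSIX system with `echo` available; `run_unsafe` and `run_safe` are hypothetical names.

```python
import subprocess

def run_unsafe(name: str) -> str:
    # Vulnerable pattern: user input interpolated into a shell command line.
    result = subprocess.run(f"echo {name}", shell=True,
                            capture_output=True, text=True)
    return result.stdout

def run_safe(name: str) -> str:
    # Argument list, no shell: the input stays a single argv entry.
    result = subprocess.run(["echo", name], capture_output=True, text=True)
    return result.stdout

payload = "model; echo INJECTED"
print(run_unsafe(payload))  # the shell runs a second command
print(run_safe(payload))    # the payload is printed as plain text
```

An exploitation scenario of the kind the higher-scoring analyses describe would replace the harmless second `echo` with an arbitrary command.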

C08 — SwiftUI iOS (10 models)

#	Model	/50	Level
1	Claude Opus 4.6	49	iOS Senior
2	MiniMax M3 Q6	48.5	iOS Senior
3	Kimi K2.7 Code	46.5	iOS Senior
4	Nex N2 Pro Q9	44	Good iOS
5	Kimi K2.5 Q8 / GLM 5.2 Cloud	43–43.5	Good iOS
7	GLM-5 Q8	41	Good iOS
8	Qwen3-Coder-Next-80B	37	Intermediate
9	Qwen3.5-397B BF16	32.5	Junior
10	Coder-480B Q8	26	Basic

C09 — Mandelbrot Tkinter

Interactive GUI: threading, zoom rectangle, view history, color palette, numpy forbidden, inverted Y axis.
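Three of these constraints (no numpy, inverted Y axis, real threading) can be sketched without any GUI code. The example below is not a reference solution for C09; it is a minimal pure-Python illustration with hypothetical function names. In a Tkinter app the main thread would typically poll the queue with `root.after(...)` and draw rows as they arrive, while the computation stays in the worker thread.

```python
import queue
import threading

def pixel_to_complex(px, py, width, height, x_min, x_max, y_min, y_max):
    """Map a canvas pixel to the complex plane.

    Canvas Y grows downward, so the imaginary axis is inverted:
    row 0 maps to y_max.
    """
    real = x_min + (x_max - x_min) * px / (width - 1)
    imag = y_max - (y_max - y_min) * py / (height - 1)
    return complex(real, imag)

def escape_time(c, max_iter):
    """Iterations before |z| exceeds 2, or max_iter if it never does."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag > 4.0:
            return i
    return max_iter

def render(width, height, view, max_iter, out):
    """Worker: compute rows off the GUI thread and hand them over a queue."""
    for py in range(height):
        row = [escape_time(pixel_to_complex(px, py, width, height, *view),
                           max_iter)
               for px in range(width)]
        out.put((py, row))
    out.put(None)  # sentinel: rendering finished

results = queue.Queue()
view = (-2.0, 1.0, -1.5, 1.5)  # x_min, x_max, y_min, y_max
worker = threading.Thread(target=render, args=(8, 8, view, 50, results),
                          daemon=True)
worker.start()
worker.join()
```

The "fake async (after)" entry in the table below refers to the opposite design, where the computation itself is scheduled on the GUI thread.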

#	Model	/50	Runs?	Threading
1	Kimi K2.7 Code ⭐	50	✓✓	Real thread + smooth coloring
2	Hy3-preview Q9	48	✓✓	Real thread, HSV inline
3	GLM 5.2 Cloud / Qwen 3.6 Plus	47	✓✓	Real thread, HSV colorsys
5	Qwen3.5-397B Q9 / Qwen3.5-122B Q8	46	✓	mixed
7	Nex N2 Pro Q9	38	⚠	fake async (after)
8	LongCat Flash Lite / Mistral Small	29–32	✗	—

Cross-cutting patterns

C02 reveals the analytical profile. It’s the only test that separates a model that understands code from one that reproduces patterns. All produce fixes; only M3 Q6, GLM 5.2, Nex N2 Pro and Qwen3.5-397B BF16 explain why with full root cause. Missing explanation is near-systematic in code-specialist models (Coder-480B: 0/15 Diagnostic).
The agentic fine-tune is measurable. Qwen3.5-397B BF16 vs Nex N2 Pro Q9 (same base): +4.7 pts average per test (+42.5 over 9 tests), max C08 +11.5, only regression C07 −2, and +45% speed (24.3 vs 16.8 t/s). Quality and speed improve together, with no trade-off.
Thinking mode is counter-productive on code. Coder-480B Thinking 22 vs standard 39 on C02; Qwen3-Next THNK 39.5 vs INST 49 on T03. The analysis migrates into the invisible block. Implicit/adaptive thinking (Nex) is handled better than explicit.
Inference mode affects quality, not just speed. Coder-480B Tensor 39 vs Pipeline 28 on C02 — same model, temperature, prompt. +11 pts.

Coding routing matrix

Task	Model	Infra	Why
Everything (generalist)	MiniMax M3 Q6	1× M3 Ultra 512 local	97.3%, #1 overall, top panel score on 6/9 tests
Python writing + debug	MiniMax M3 Q6	1× M3 Ultra 512	99% / 100% — first to dominate both
React dashboard	MiniMax M3 Q6	1× M3 Ultra 512	97% C04-2
Swift CLI / SwiftUI	MiniMax M3 Q6	1× M3 Ultra 512	99% C07, 97% C08
Python/Bash/React (Cline, fast)	Qwen3.5-122B Q8 HD16	1× node (telecode)	83.7%, 38.3 tok/s, multi-instance
System architecture (fast local)	Nex N2 Pro Q9	Nex MLX 3 nodes	88.5% overall, C03 49/60, 24.3 tok/s
Debug + critical analysis (cloud)	GLM 5.2 Cloud	OpenRouter	98% C02 — exploitation scenario
Deep refactoring (cloud)	GLM 5.2 Cloud	OpenRouter	98% C06 — 13 documented changes
Aider, long reasoning session	Kimi K2.7 Code	local 4-bit	thinking visible, fits Aider
Reference ceiling	Claude Opus 4.6	Anthropic API	94.8% (436/460) — passed by M3 Q6

TMB Coding Benchmark — The Monocle Bear — April–June 2026. Panel C01-C08: 9 models · C02 expanded: 17 entries · C08: 10 models · C09: 9 models. Updated 23 June 2026.