TMB Scoreboard — General
The Monocle Bear · updated 20 June 2026
- Tests: 5 (T01 creative + T02/T05 analytical-legal + T03/T04 code)
- Scope: 19 main scoreboard configs + 5 T01-only evaluations
- Infrastructure: Odysseus cluster 4× M3 Ultra + Inferencer 2× M3 Ultra 96 GB + OdyssAI-X (1 node) + OpenRouter + Anthropic API
- Evaluators: Claude Opus 4.6 / Sonnet 4.6 / TMB Benchmark Evaluator (MiniMax M3) — TMB v2 rubrics
Main scoreboard
Section titled “Main scoreboard”Overall score = each test normalized /100, arithmetic mean of the 5. Bold = best score on that test.
| # | Model | Params | Arch | Infra | T01 /500 | T02 /50 | T03 /50 | T04 /50 | T05 /50 | Avg % |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen 3.5 397B BF16 | 397B | MoE | Exo ×4 BF16 | 435 | 50 | 48.5 | 47 | 47 | 94.4% |
| 2 | MiniMax M3 Q6 | 427B | MoE | OdyssAI-X, 1 node | 435 | 46 | 48.5 | 48.5 | 49 | 94.2% |
| 3 | GLM-5.1 754B | 754B | MoE | OpenRouter | 447.5 | 48 | 48 | 48.5 | 45.5 | 93.9% |
| 4 | Qwen 3.5 397B Q9 | 397B | MoE | Exo ×4 Q9 | 385 | 50 | 49.5 | 49 | 47.5 | 93.8% |
| 5 | MiniMax M2.7-8bit 36k | 239B | MoE | OdyssAI-X, 1 node | 375 | 49 | 50 | 49 | 46 | 92.6% |
| 6 | Hy3-preview Q9 local | 295B | MoE | OdyssAI-X, 1 node | 452.5 | 44.5 | 48 | 49 | 44 | 92.3% |
| 7 | GLM-5.1 754B Q8 | 754B | MoE | Exo ×4 Q8 | 442.5 | 46.5 | 44 | 49 | 45 | 91.5% |
| 8 | GLM 5.2 Q6 local | — | MoE | OdyssAI-X, 4 nodes | 402.5 | 45.5 | 49 | 46 | 45 | 90.3% |
| 9 | Qwen 3-6 | — | Dense | OpenRouter | 315 | 47 | 48 | 48 | 49 | 89.4% |
| 10 | GLM 5.2 Cloud | — | MoE | OpenRouter | 480 ⭐ | 43 | 43 | 46 | 43 | 89.2% |
| 11 | Qwen3-Next INST | 80B-A3B | MoE | Omlx-Next 8bit | 367.5 | 43.5 | 49 | 45.5 | 45 | 87.9% |
| 12 | Qwen 3.5 122B Q8 | 122B | MoE | Inferencer ×2 | 227.5 | 46 | 47.5 | 47.5 | 44 | 83.1% |
| 13 | MiniMax M2.7 8bit | 239B | MoE | OdyssAI-X, 1 node | 332.5 | 44.5 | 48 | 44.5 | 33 | 81.3% |
| 14 | Qwen3-Next THNK | 80B-A3B | MoE | Omlx-Next 8bit | 282.5 ⚠ | 35 | 39.5 | 48.5 | 45 | 78.5% |
| 15 | Mistral Large 2512 | 675B | MoE | OpenRouter | 420 | 47.5 | 47 | 48 | 2 | 74.6% |
| 16 | Mistral Medium 3.1 | — | MoE | Cloud | 420 | 43 | 42 | 43.5 | 3 | 69.4% |
| 17 | Devstral 2512 | — | MoE | Cloud | 262.5 | 45.5 | 42 | 50 | 2 | 66.3% |
| 18 | Ministral 14B 2512 | 14B | Dense | Exo | 350 | 44 | 28 | 48.5 | 4 | 63.8% |
| 19 | Mistral Small 2603 | — | MoE | Cloud | 320 | 48 | 36.5 | 33 | 3 | 61.0% |
⚠ Qwen3-Next THNK T01: text entirely in English — C5 violation, score penalized. ⭐ GLM 5.2 T01 = 480/500 — new absolute benchmark record, above Claude Opus 4.7 (475).
MiniMax M3 Q6 (#2, 94.2%) — 0.2 pt off the local leader Qwen 3.5 BF16 (94.4%) and 0.8 pt off the frontier reference Opus (95.0%). Frontier profile on analytical/code (T03 48.5, T04 48.5, T05 49 = best local, tied), held back by T02 (46) and creative T01 (435, Very good — not Exceptional). The first single-node local model to reach the top of the scoreboard. (Pure code C01-C08: M3 #1 at 97.3% — see the Coding scoreboard.)
Frontier reference — Claude Opus 4.6/4.7
Section titled “Frontier reference — Claude Opus 4.6/4.7”Out of the local ranking. Official scores evaluated May 2026.
| Model | T01 /500 | T02 /50 | T03 /50 | T04 /50 | T05 /50 | Avg % |
|---|---|---|---|---|---|---|
| Claude Opus 4.7/4.6 ✦ | 475 | 46 | 46.5 | 49.5 | 48 | 95.0% |
✦ T01 = Opus 4.7. T02-T05 = Opus 4.6 (evaluated May 2026).
If ranked: #1 (95.0% vs Qwen 3.5 397B BF16 94.4% and MiniMax M3 Q6 94.2%), though a frontier/local comparison isn’t apples-to-apples — Opus is the reference ceiling. Notably, local M3 Q6 is only 0.8 pt under it, and matches/beats it on T05 (49 vs 48) and T03/T04.
T01 note: GLM 5.2 Cloud (480/500, 19 June 2026) beats Opus 4.7 on T01 — the first model in the benchmark to do so. Cloud-only (OpenRouter), on the creative criterion alone.
T01-only — partial evaluations
Section titled “T01-only — partial evaluations”| Model | Params | Infra | T01 /500 | Level |
|---|---|---|---|---|
| GLM 5.2 Q8 local | — | Local Q8 | 452.5 | Exceptional |
| Hy3-preview Q9 local | 295B | OdyssAI-X, 1 node | 452.5 | Exceptional |
| Hy3-preview Q9 OR | 295B | OpenRouter | 445 | Very good |
| Hy3-preview Q9 local (no-think) | 295B | OdyssAI-X, 1 node | 412.5 | Very good |
| MiniMax 2.7 FP16 | — | Cloud FP16 | 330 | Fair |
Per-test ranking
Section titled “Per-test ranking”T01 — The White Noise (/500)
Section titled “T01 — The White Noise (/500)”A short story (~5,000 words), the character Pip observing a drone. 6 criteria: stylistic progression, anti-slop, psychological consistency, Pip as mirror, form constraints, final image.
Thresholds: 450+ Exceptional · 400+ Very good · 350+ Good · 300+ Fair · 200+ Poor
| # | Model | Score | Level |
|---|---|---|---|
| 1 | GLM 5.2 Cloud ⭐ | 480 | Exceptional |
| 2 | Claude Opus 4.7 ✦ | 475 | Exceptional |
| 3 | GLM 5.2 Q8 local / Hy3-preview Q9 local (thinking HC) | 452.5 | Exceptional |
| 5 | GLM-5.1 OpenRouter | 447.5 | Very good |
| 6 | Hy3-preview Q9 OR | 445 | Very good |
| 7 | GLM-5.1 Q8 local | 442.5 | Very good |
| 8 | Qwen 3.5 BF16 / MiniMax M3 Q6 | 435 | Very good |
| 10 | Mistral Large 2512 / Mistral Medium 3.1 | 420 | Very good |
| 12 | Hy3-preview Q9 local (no-think) | 412.5 | Very good |
| 13 | GLM 5.2 Q6 local | 402.5 | Very good |
| 14 | Qwen 3.5 Q9 | 385 | Good |
| — | Gemini 3.1 Pro ✦ | 380 | Good |
| — | MiniMax M2.7-8bit 36k | 375 | Good |
| — | Qwen3-Next INST | 367.5 | Good |
| — | Gemini 3.5 Flash ✦ | 357.5 | Good |
| — | Ministral 14B | 350 | Good |
| — | MiniMax M2.7 8bit | 332.5 | Fair |
| — | Mistral Small | 320 | Fair |
| — | Qwen 3-6 | 315 | Good |
| — | Qwen3-Next THNK ⚠ | 282.5 | Poor (English text) |
| — | Devstral 2512 | 262.5 | Poor |
| — | Qwen 3.5 122B Q8 | 227.5 | Poor |
⭐ GLM 5.2 Cloud (480) = new absolute T01 record, first model to beat Claude Opus 4.7 (475) on the White Noise. ✦ Frontier cloud, out of the local ranking, T01 only.
T02 — GDPR (/50)
Section titled “T02 — GDPR (/50)”Compliance analysis for a Franco-American pharma SaaS on Azure OpenAI. Strict format: risks, scenarios, recommendation. Thresholds: 45+ Expert · 38+ Senior · 30+ Junior · 20+ Intern.
| # | Model | Score | Level |
|---|---|---|---|
| 1 | Qwen 3.5 Q9 / BF16 | 50 | Expert |
| 2 | MiniMax M2.7-8bit 36k | 49 | Expert |
| 3 | GLM-5.1 OpenRouter / Mistral Small | 48 | Expert |
| 5 | Mistral Large | 47.5 | Expert |
| 6 | Qwen 3-6 | 47 | Expert |
| 7 | GLM-5.1 Q8 | 46.5 | Expert |
| — | Claude Opus 4.6 ✦ | 46 | Expert |
| 8 | Qwen 3.5 122B Q8 / MiniMax M3 Q6 | 46 | Expert |
| 9 | Devstral / GLM 5.2 Q6 local | 45.5 | Expert |
| 11 | Ministral 14B | 44 | Senior |
| 12 | MiniMax M2.7 8bit / Hy3-preview Q9 local | 44.5 | Senior |
| 13 | Qwen3-Next INST | 43.5 | Senior |
| 14 | GLM 5.2 Cloud / Mistral Medium | 43 | Senior |
| 16 | Qwen3-Next THNK | 35 | Junior |
T03 — Python CLI (/50)
Section titled “T03 — Python CLI (/50)”A Python CLI that queries an OpenAI-compatible API: lists models, measures TTFT, tokens/s, prints metrics. SSE streaming required. Thresholds: 45+ Production-ready · 38+ Good draft · 30+ Functional · 20+ Rough.
| # | Model | Score | Level |
|---|---|---|---|
| — | Claude Opus 4.6 ✦ | 46.5 | Production-ready |
| 1 | MiniMax M2.7-8bit 36k | 50 | Production-ready |
| 2 | Qwen 3.5 Q9 | 49.5 | Production-ready |
| 3 | Qwen3-Next INST / GLM 5.2 Q6 local | 49 | Production-ready |
| 5 | Qwen 3.5 BF16 / MiniMax M3 Q6 | 48.5 | Production-ready |
| 7 | GLM-5.1 OR / Qwen 3-6 / MiniMax M2.7 8bit / Hy3-preview Q9 | 48 | Production-ready |
| 10 | Qwen 3.5 122B Q8 | 47.5 | Production-ready |
| 11 | Mistral Large | 47 | Production-ready |
| 12 | GLM 5.2 Q8 local | 46.5 | Production-ready |
| 13 | GLM 5.2 Cloud / Devstral / Mistral Medium | 42–43 | Good draft |
| 16 | Qwen3-Next THNK | 39.5 | Good draft |
| 17 | Mistral Small | 36.5 | Functional |
| 18 | Ministral 14B | 28 | Rough |
T04 — Mandelbrot Python (/50)
Section titled “T04 — Mandelbrot Python (/50)”NumPy + matplotlib, vectorized, argparse with zoom, PNG export. Thresholds: 45+ Production-ready · 38+ Good draft · 30+ Functional.
| # | Model | Score |
|---|---|---|
| — | Claude Opus 4.6 ✦ | 49.5 |
| 1 | Devstral 2512 | 50 |
| 2 | GLM 5.2 Q8 local | 49.5 |
| 3 | Qwen 3.5 Q9 / GLM-5.1 Q8 / Hy3-preview Q9 / MiniMax M2.7-8bit 36k | 49 |
| 7 | GLM-5.1 OR / Qwen3-Next THNK / MiniMax M3 Q6 | 48.5 |
| 10 | Qwen 3-6 / Mistral Large | 48 |
| 12 | Qwen 3.5 122B Q8 / BF16 | 47.5 / 47 |
| 14 | GLM 5.2 Cloud / GLM 5.2 Q6 local | 46 |
| 16 | Qwen3-Next INST | 45.5 |
| 17 | MiniMax M2.7 8bit | 44.5 |
| 18 | Mistral Medium | 43.5 |
| 19 | Mistral Small | 33 |
T05 — Advanced multi-jurisdictional GDPR (/50)
Section titled “T05 — Advanced multi-jurisdictional GDPR (/50)”PharmaBel SA/Inc + TechServe India — GDPR + HIPAA + AI Act + GxP. 5 sections: violation mapping, transfer flows, remediation plan, AI-Act classification, target architecture. Thresholds: 45+ Expert (Big4) · 38+ Senior · 30+ Junior · 20+ Intern.
| # | Model | Score | Level |
|---|---|---|---|
| — | Claude Opus 4.6 ✦ | 48 | Expert (Big4) |
| 1 | Qwen 3-6 / MiniMax M3 Q6 | 49 | Expert (Big4) |
| 3 | Qwen 3.5 Q9 | 47.5 | Expert (Big4) |
| 4 | Qwen 3.5 BF16 | 47 | Expert (Big4) |
| 5 | MiniMax M2.7-8bit 36k | 46 | Expert |
| 6 | GLM-5.1 OR | 45.5 | Expert (Big4) |
| 7 | GLM-5.1 Q8 / Qwen3-Next INST / Qwen3-Next THNK / GLM 5.2 Q6 local | 45 | Expert |
| 11 | Hy3-preview Q9 local / Qwen 3.5 122B Q8 | 44 | Senior |
| 13 | GLM 5.2 Cloud | 43 | Senior |
| 14 | MiniMax M2.7 8bit | 33 | Junior |
| — | Mistral Large / Medium / Devstral / Ministral / Small | 2–4 | Fail (truncated or off-topic) |
Mistral (all variants except Small 2603) fails T05 with scores of 2–4 — responses non-compliant with the required format. The benchmark’s main discriminator.
Model profiles
Section titled “Model profiles”MiniMax M3 Q6 — the local frontier (#2)
Section titled “MiniMax M3 Q6 — the local frontier (#2)”In-house MLX port of M3 (MoE 427B), Q6 MSA-fixed (~320 GB), one node (Ultra-512) — not the 4× cluster. Lands #2 at 94.2% — 0.2 pt off the local leader (which runs on 4 nodes BF16) and 0.8 pt off Opus.
| Axis | Score | Profile |
|---|---|---|
| T01 — Creative | 435/500 (87%) | Very good, not Exceptional — under target length, the creative ceiling shared by the local panel |
| T02 — GDPR | 46/50 (92%) | Expert (only gap: Cloud Act absent) |
| T03 — Python | 48.5/50 (97%) | Production-ready, executed — full SSE flow, exit 0 |
| T04 — Mandelbrot | 48.5/50 (97%) | Production-ready, executed — top tier |
| T05 — Advanced GDPR | 49/50 (98%) | Best local (tied) — matches the cloud on hard analysis |
Profile: frontier on analysis AND code, single-node local. On the T02–T05 axis it scores 96.0%, and co-dominates T05 (the hardest test). Creative T01 (435) is its only ceiling. The port glitch never bit the code (clean execution on T03/T04). T06 (logic trap, 47) + T07 (structured calc, 48.5) confirm: 5/5 outside creative at Expert/Production-ready level. On pure code (C01-C08), M3 Q6 is #1 at 97.3% — see the Coding scoreboard.
MiniMax M2.7-8bit — the token-budget effect
Section titled “MiniMax M2.7-8bit — the token-budget effect”The only model tested in two token-budget configs, same content, different output length allowed.
| Axis | 8bit (12k) | 8bit (36k) | Δ |
|---|---|---|---|
| T01 — Creative | 332.5 (66.5%) | 375 (75%) | +8.5% |
| T02 — GDPR | 44.5 (89%) | 49 (98%) | +9% |
| T03 — Python | 48 (96%) | 50 (100%) | +4% |
| T04 — Mandelbrot | 44.5 (89%) | 49 (98%) | +9% |
| T05 — Advanced GDPR | 33 (66%) | 46 (92%) | +26% |
| Average | 81.3% | 92.6% | +11.3 pts |
The 12k run couldn’t reach T05’s Section 5 before the limit. With 36k it’s complete. The token limit was the only real problem of the 8bit run — M2.7’s analytical/technical quality is consistently top-5 when it can finish. Residual tokenizer bleed (Chinese artifacts in French analytical text): 6 occurrences in ~7,600 words at 36k — unacceptable in production without post-processing, but no longer score-degrading. T01 stays under target length even at 36k (asks 4,500–5,500 words, produces ~2,000–2,500). The creative ceiling isn’t a token problem.
Qwen 3.5 122B Q8 — solid analyst, failing writer
Section titled “Qwen 3.5 122B Q8 — solid analyst, failing writer”| Axis | Score | vs 397B BF16 | Δ |
|---|---|---|---|
| T01 — Creative | 227.5 (45.5%) | 435 (87%) | −41.5% |
| T02 — GDPR | 46 (92%) | 50 (100%) | −8% |
| T03 — Python | 47.5 (95%) | 48.5 (97%) | −2% |
| T04 — Mandelbrot | 47.5 (95%) | 47 (94%) | +1% |
| T05 — Advanced GDPR | 44 (88%) | 47 (94%) | −6% |
On T02–T05 alone: 92.5% average — between Qwen 3-6 and GLM-5.1 Q8 on the analytical/code axes. Remarkable for a model 3× smaller than the 397B; T03/T04 are nearly identical to the big model. T01 reveals the limit: a good section 1 (~750 words), then sections 2–3 duplicated with cosmetic edits — 1,763 words against 4,500–5,500 expected. The 5,000-word arc needs planning the 122B lacks. Optimal routing: at 36.6 tok/s it’s 2× faster than the 397B Q9 for code, at near-identical quality — the daily technical workhorse.
Qwen3-Next INST vs THNK — thinking isn’t a universal win
Section titled “Qwen3-Next INST vs THNK — thinking isn’t a universal win”| Axis | INST | THNK | Δ |
|---|---|---|---|
| T01 — Creative | 367.5 (73.5%) | 282.5 ⚠ (56.5%) | −17% |
| T02 — GDPR | 43.5 (87%) | 35 (70%) | −17% |
| T03 — Python | 49 (98%) | 39.5 (79%) | −19% |
| T04 — Mandelbrot | 45.5 (91%) | 48.5 (97%) | +6% |
| T05 — Advanced GDPR | 45 (90%) | 45 (90%) | 0% |
T04 is the only test where THNK beats INST — thinking solved a tricky --zoom nargs=4 detail the INST variant got wrong on negative coordinates. On T01 THNK wrote entirely in English (eliminatory C5 error); on T02 INST produced the required comparison table, THNK wrote sequential prose; on T03 THNK dumped everything in main(). Conclusion: thinking improves precise algorithmic problem-solving — not creativity, regulatory precision, or code structure. The TTFT cost (111–238 s on T03/T04) is only justified on T04.
Analytical/code axis — T02–T05 isolated
Section titled “Analytical/code axis — T02–T05 isolated”To compare on the analytical and technical axes alone (without T01’s influence):
Model T02 T03 T04 T05 Avg%━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━MiniMax M2.7-8bit 36k 49 50 49 46 98.5%Qwen 3.5 397B Q9 50 49.5 49 47.5 98.0%Qwen 3.5 397B BF16 50 48.5 47 47 96.3%MiniMax M3 Q6 46 48.5 48.5 49 96.0%Qwen 3-6 47 48 48 49 96.0%GLM-5.1 754B OR 48 48 48.5 45.5 95.0%Hy3-preview Q9 local 44.5 48 49 44 92.8%GLM 5.2 Q6 local 45.5 49 46 45 92.8%Qwen 3.5 122B Q8 46 47.5 47.5 44 92.5%GLM-5.1 754B Q8 46.5 44 49 45 92.3%Qwen3-Next INST 43.5 49 45.5 45 91.0%GLM 5.2 Cloud OR 43 43 46 43 87.5%MiniMax M2.7 8bit 44.5 48 44.5 33 85.0%Qwen3-Next THNK 35 39.5 48.5 45 84.0%MiniMax M2.7-8bit 36k leads this axis (98.5%) — the token budget systematically closes the 12k run’s gaps.
Usage matrix
Section titled “Usage matrix”| Task | Recommended model | Infra | Why |
|---|---|---|---|
| Creative / long-form writing | Qwen 3.5 397B BF16 | Exo ×4 (1.28 TB) | Best local T01; 87% White Noise |
| Expert GDPR / AI-Act analysis | Qwen 3.5 397B Q9 | Exo ×4 | T02=50, T05=47.5, top-2 T02–T05 |
| Complex multi-regulatory (GDPR+HIPAA+GxP) | Qwen 3-6 | OpenRouter | T05 leader (49), balanced analytical |
| Code / Python scripts | Qwen 3.5 122B Q8 | Inferencer ×2 | T03=47.5, T04=47.5, 36.6 tok/s, 2× faster |
| Cloud generalist | GLM-5.1 754B OR | OpenRouter | Rank 2, solid across the board |
| Exceptional creative | Hy3-preview Q9 local | OdyssAI-X, 1 node | T01=452.5 (Exceptional); T02–T05 solid |
GLM 5.2 — quant’s effect on creative vs analytical
Section titled “GLM 5.2 — quant’s effect on creative vs analytical”Three configs of the same model — the panel’s clearest result on quantization impact.
| Config | T01 /500 | Avg T02–T05 | Global | Rank |
|---|---|---|---|---|
| GLM 5.2 Cloud (OR) | 480 | 87.5% | 89.2% | #10 |
| GLM 5.2 Q8 local | 452.5 | — (T01 only) | — | T01 #3= |
| GLM 5.2 Q6 local | 402.5 | 92.8% | 90.3% | #8 |
Quantization destroys creative, spares analytical. T01 loses 78 points cloud→Q6 (480→402.5): creative needs fine length control, declarative restraint, 5,000-word stylistic consistency — the first thing quant erosion damages. On T02–T05 the Q6 beats the cloud (45.5/49/46/45 vs 43/43/46/43): calculation, GDPR rigor, Python — robust to quant. Routing: Exceptional creative → GLM Cloud (480) or Q8 (452.5); analytical/code → Q6 is enough and ties the cloud; tight Q6 budget → M3 Q6 stays the best local creative (435 vs 402).
GLM 5.2 Cloud — absolute T01 record, average analyst
Section titled “GLM 5.2 Cloud — absolute T01 record, average analyst”The benchmark’s most counter-intuitive result. GLM 5.2 Cloud produces 480/500 on T01 — the first model to beat Claude Opus 4.7 (475) on the White Noise in 14 months of TMB. But that’s where the surprise stops: on T02–T05 it scores 43–46/50 — solid Senior, no more (avg 87.5%, 11 pts under M2.7-8bit 36k).
| Axis | Score | Profile |
|---|---|---|
| T01 — Creative | 480/500 (96%) | Absolute record — above Opus 4.7 |
| T02 — GDPR | 43/50 (86%) | Senior — misses distinct DPIA and AI Act |
| T03 — Python CLI | 43/50 (86%) | Good draft — SSE ok, misses type hints |
| T04 — Mandelbrot | 46/50 (92%) | Production-ready — NumPy vectorized |
| T05 — Advanced GDPR | 43/50 (86%) | Senior — misses prominent DPIA and CE marking |
T01–T07: T06 (logic trap) 46/50 (Expert), T07 (structured calc) 49/50. Coding (C01-C09): 410/460 = 89.1% — records on C02 Debug (49/50) and C06 Refactoring (49/50); weak on C07 Swift CLI (38/50). Diagnosis: an exceptional literary model, an average analyst — the widest T01-vs-T02-T05 gap in the benchmark. Recommended for long creative and code (C02/C04/C06), not for analytical routing.
TMB Benchmark v4 — The Monocle Bear — April 2026, updated 20 June 2026. 5 tests · 19 main configs · Evaluators: Claude Opus 4.6 / Sonnet 4.6 / TMB Benchmark Evaluator (MiniMax M3 / Sonnet 4.6).