Skip to content

TMB Scoreboard — General

The Monocle Bear · updated 20 June 2026

  • Tests: 5 (T01 creative + T02/T05 analytical-legal + T03/T04 code)
  • Scope: 19 main scoreboard configs + 5 T01-only evaluations
  • Infrastructure: Odysseus cluster 4× M3 Ultra + Inferencer 2× M3 Ultra 96 GB + OdyssAI-X (1 node) + OpenRouter + Anthropic API
  • Evaluators: Claude Opus 4.6 / Sonnet 4.6 / TMB Benchmark Evaluator (MiniMax M3) — TMB v2 rubrics

Overall score = each test normalized /100, arithmetic mean of the 5. Bold = best score on that test.

#ModelParamsArchInfraT01 /500T02 /50T03 /50T04 /50T05 /50Avg %
1Qwen 3.5 397B BF16397BMoEExo ×4 BF164355048.5474794.4%
2MiniMax M3 Q6427BMoEOdyssAI-X, 1 node4354648.548.54994.2%
3GLM-5.1 754B754BMoEOpenRouter447.5484848.545.593.9%
4Qwen 3.5 397B Q9397BMoEExo ×4 Q93855049.54947.593.8%
5MiniMax M2.7-8bit 36k239BMoEOdyssAI-X, 1 node3754950494692.6%
6Hy3-preview Q9 local295BMoEOdyssAI-X, 1 node452.544.548494492.3%
7GLM-5.1 754B Q8754BMoEExo ×4 Q8442.546.544494591.5%
8GLM 5.2 Q6 localMoEOdyssAI-X, 4 nodes402.545.549464590.3%
9Qwen 3-6DenseOpenRouter3154748484989.4%
10GLM 5.2 CloudMoEOpenRouter4804343464389.2%
11Qwen3-Next INST80B-A3BMoEOmlx-Next 8bit367.543.54945.54587.9%
12Qwen 3.5 122B Q8122BMoEInferencer ×2227.54647.547.54483.1%
13MiniMax M2.7 8bit239BMoEOdyssAI-X, 1 node332.544.54844.53381.3%
14Qwen3-Next THNK80B-A3BMoEOmlx-Next 8bit282.5 ⚠3539.548.54578.5%
15Mistral Large 2512675BMoEOpenRouter42047.54748274.6%
16Mistral Medium 3.1MoECloud420434243.5369.4%
17Devstral 2512MoECloud262.545.54250266.3%
18Ministral 14B 251214BDenseExo350442848.5463.8%
19Mistral Small 2603MoECloud3204836.533361.0%

⚠ Qwen3-Next THNK T01: text entirely in English — C5 violation, score penalized. ⭐ GLM 5.2 T01 = 480/500 — new absolute benchmark record, above Claude Opus 4.7 (475).

MiniMax M3 Q6 (#2, 94.2%) — 0.2 pt off the local leader Qwen 3.5 BF16 (94.4%) and 0.8 pt off the frontier reference Opus (95.0%). Frontier profile on analytical/code (T03 48.5, T04 48.5, T05 49 = best local, tied), held back by T02 (46) and creative T01 (435, Very good — not Exceptional). The first single-node local model to reach the top of the scoreboard. (Pure code C01-C08: M3 #1 at 97.3% — see the Coding scoreboard.)

Frontier reference — Claude Opus 4.6/4.7

Section titled “Frontier reference — Claude Opus 4.6/4.7”

Out of the local ranking. Official scores evaluated May 2026.

ModelT01 /500T02 /50T03 /50T04 /50T05 /50Avg %
Claude Opus 4.7/4.64754646.549.54895.0%

✦ T01 = Opus 4.7. T02-T05 = Opus 4.6 (evaluated May 2026).

If ranked: #1 (95.0% vs Qwen 3.5 397B BF16 94.4% and MiniMax M3 Q6 94.2%), though a frontier/local comparison isn’t apples-to-apples — Opus is the reference ceiling. Notably, local M3 Q6 is only 0.8 pt under it, and matches/beats it on T05 (49 vs 48) and T03/T04.

T01 note: GLM 5.2 Cloud (480/500, 19 June 2026) beats Opus 4.7 on T01 — the first model in the benchmark to do so. Cloud-only (OpenRouter), on the creative criterion alone.

ModelParamsInfraT01 /500Level
GLM 5.2 Q8 localLocal Q8452.5Exceptional
Hy3-preview Q9 local295BOdyssAI-X, 1 node452.5Exceptional
Hy3-preview Q9 OR295BOpenRouter445Very good
Hy3-preview Q9 local (no-think)295BOdyssAI-X, 1 node412.5Very good
MiniMax 2.7 FP16Cloud FP16330Fair

A short story (~5,000 words), the character Pip observing a drone. 6 criteria: stylistic progression, anti-slop, psychological consistency, Pip as mirror, form constraints, final image.

Thresholds: 450+ Exceptional · 400+ Very good · 350+ Good · 300+ Fair · 200+ Poor

#ModelScoreLevel
1GLM 5.2 Cloud480Exceptional
2Claude Opus 4.7 ✦475Exceptional
3GLM 5.2 Q8 local / Hy3-preview Q9 local (thinking HC)452.5Exceptional
5GLM-5.1 OpenRouter447.5Very good
6Hy3-preview Q9 OR445Very good
7GLM-5.1 Q8 local442.5Very good
8Qwen 3.5 BF16 / MiniMax M3 Q6435Very good
10Mistral Large 2512 / Mistral Medium 3.1420Very good
12Hy3-preview Q9 local (no-think)412.5Very good
13GLM 5.2 Q6 local402.5Very good
14Qwen 3.5 Q9385Good
Gemini 3.1 Pro ✦380Good
MiniMax M2.7-8bit 36k375Good
Qwen3-Next INST367.5Good
Gemini 3.5 Flash ✦357.5Good
Ministral 14B350Good
MiniMax M2.7 8bit332.5Fair
Mistral Small320Fair
Qwen 3-6315Good
Qwen3-Next THNK ⚠282.5Poor (English text)
Devstral 2512262.5Poor
Qwen 3.5 122B Q8227.5Poor

⭐ GLM 5.2 Cloud (480) = new absolute T01 record, first model to beat Claude Opus 4.7 (475) on the White Noise. ✦ Frontier cloud, out of the local ranking, T01 only.

Compliance analysis for a Franco-American pharma SaaS on Azure OpenAI. Strict format: risks, scenarios, recommendation. Thresholds: 45+ Expert · 38+ Senior · 30+ Junior · 20+ Intern.

#ModelScoreLevel
1Qwen 3.5 Q9 / BF1650Expert
2MiniMax M2.7-8bit 36k49Expert
3GLM-5.1 OpenRouter / Mistral Small48Expert
5Mistral Large47.5Expert
6Qwen 3-647Expert
7GLM-5.1 Q846.5Expert
Claude Opus 4.6 ✦46Expert
8Qwen 3.5 122B Q8 / MiniMax M3 Q646Expert
9Devstral / GLM 5.2 Q6 local45.5Expert
11Ministral 14B44Senior
12MiniMax M2.7 8bit / Hy3-preview Q9 local44.5Senior
13Qwen3-Next INST43.5Senior
14GLM 5.2 Cloud / Mistral Medium43Senior
16Qwen3-Next THNK35Junior

A Python CLI that queries an OpenAI-compatible API: lists models, measures TTFT, tokens/s, prints metrics. SSE streaming required. Thresholds: 45+ Production-ready · 38+ Good draft · 30+ Functional · 20+ Rough.

#ModelScoreLevel
Claude Opus 4.6 ✦46.5Production-ready
1MiniMax M2.7-8bit 36k50Production-ready
2Qwen 3.5 Q949.5Production-ready
3Qwen3-Next INST / GLM 5.2 Q6 local49Production-ready
5Qwen 3.5 BF16 / MiniMax M3 Q648.5Production-ready
7GLM-5.1 OR / Qwen 3-6 / MiniMax M2.7 8bit / Hy3-preview Q948Production-ready
10Qwen 3.5 122B Q847.5Production-ready
11Mistral Large47Production-ready
12GLM 5.2 Q8 local46.5Production-ready
13GLM 5.2 Cloud / Devstral / Mistral Medium42–43Good draft
16Qwen3-Next THNK39.5Good draft
17Mistral Small36.5Functional
18Ministral 14B28Rough

NumPy + matplotlib, vectorized, argparse with zoom, PNG export. Thresholds: 45+ Production-ready · 38+ Good draft · 30+ Functional.

#ModelScore
Claude Opus 4.6 ✦49.5
1Devstral 251250
2GLM 5.2 Q8 local49.5
3Qwen 3.5 Q9 / GLM-5.1 Q8 / Hy3-preview Q9 / MiniMax M2.7-8bit 36k49
7GLM-5.1 OR / Qwen3-Next THNK / MiniMax M3 Q648.5
10Qwen 3-6 / Mistral Large48
12Qwen 3.5 122B Q8 / BF1647.5 / 47
14GLM 5.2 Cloud / GLM 5.2 Q6 local46
16Qwen3-Next INST45.5
17MiniMax M2.7 8bit44.5
18Mistral Medium43.5
19Mistral Small33

T05 — Advanced multi-jurisdictional GDPR (/50)

Section titled “T05 — Advanced multi-jurisdictional GDPR (/50)”

PharmaBel SA/Inc + TechServe India — GDPR + HIPAA + AI Act + GxP. 5 sections: violation mapping, transfer flows, remediation plan, AI-Act classification, target architecture. Thresholds: 45+ Expert (Big4) · 38+ Senior · 30+ Junior · 20+ Intern.

#ModelScoreLevel
Claude Opus 4.6 ✦48Expert (Big4)
1Qwen 3-6 / MiniMax M3 Q649Expert (Big4)
3Qwen 3.5 Q947.5Expert (Big4)
4Qwen 3.5 BF1647Expert (Big4)
5MiniMax M2.7-8bit 36k46Expert
6GLM-5.1 OR45.5Expert (Big4)
7GLM-5.1 Q8 / Qwen3-Next INST / Qwen3-Next THNK / GLM 5.2 Q6 local45Expert
11Hy3-preview Q9 local / Qwen 3.5 122B Q844Senior
13GLM 5.2 Cloud43Senior
14MiniMax M2.7 8bit33Junior
Mistral Large / Medium / Devstral / Ministral / Small2–4Fail (truncated or off-topic)

Mistral (all variants except Small 2603) fails T05 with scores of 2–4 — responses non-compliant with the required format. The benchmark’s main discriminator.


In-house MLX port of M3 (MoE 427B), Q6 MSA-fixed (~320 GB), one node (Ultra-512) — not the 4× cluster. Lands #2 at 94.2% — 0.2 pt off the local leader (which runs on 4 nodes BF16) and 0.8 pt off Opus.

AxisScoreProfile
T01 — Creative435/500 (87%)Very good, not Exceptional — under target length, the creative ceiling shared by the local panel
T02 — GDPR46/50 (92%)Expert (only gap: Cloud Act absent)
T03 — Python48.5/50 (97%)Production-ready, executed — full SSE flow, exit 0
T04 — Mandelbrot48.5/50 (97%)Production-ready, executed — top tier
T05 — Advanced GDPR49/50 (98%)Best local (tied) — matches the cloud on hard analysis

Profile: frontier on analysis AND code, single-node local. On the T02–T05 axis it scores 96.0%, and co-dominates T05 (the hardest test). Creative T01 (435) is its only ceiling. The port glitch never bit the code (clean execution on T03/T04). T06 (logic trap, 47) + T07 (structured calc, 48.5) confirm: 5/5 outside creative at Expert/Production-ready level. On pure code (C01-C08), M3 Q6 is #1 at 97.3% — see the Coding scoreboard.

MiniMax M2.7-8bit — the token-budget effect

Section titled “MiniMax M2.7-8bit — the token-budget effect”

The only model tested in two token-budget configs, same content, different output length allowed.

Axis8bit (12k)8bit (36k)Δ
T01 — Creative332.5 (66.5%)375 (75%)+8.5%
T02 — GDPR44.5 (89%)49 (98%)+9%
T03 — Python48 (96%)50 (100%)+4%
T04 — Mandelbrot44.5 (89%)49 (98%)+9%
T05 — Advanced GDPR33 (66%)46 (92%)+26%
Average81.3%92.6%+11.3 pts

The 12k run couldn’t reach T05’s Section 5 before the limit. With 36k it’s complete. The token limit was the only real problem of the 8bit run — M2.7’s analytical/technical quality is consistently top-5 when it can finish. Residual tokenizer bleed (Chinese artifacts in French analytical text): 6 occurrences in ~7,600 words at 36k — unacceptable in production without post-processing, but no longer score-degrading. T01 stays under target length even at 36k (asks 4,500–5,500 words, produces ~2,000–2,500). The creative ceiling isn’t a token problem.

Qwen 3.5 122B Q8 — solid analyst, failing writer

Section titled “Qwen 3.5 122B Q8 — solid analyst, failing writer”
AxisScorevs 397B BF16Δ
T01 — Creative227.5 (45.5%)435 (87%)−41.5%
T02 — GDPR46 (92%)50 (100%)−8%
T03 — Python47.5 (95%)48.5 (97%)−2%
T04 — Mandelbrot47.5 (95%)47 (94%)+1%
T05 — Advanced GDPR44 (88%)47 (94%)−6%

On T02–T05 alone: 92.5% average — between Qwen 3-6 and GLM-5.1 Q8 on the analytical/code axes. Remarkable for a model 3× smaller than the 397B; T03/T04 are nearly identical to the big model. T01 reveals the limit: a good section 1 (~750 words), then sections 2–3 duplicated with cosmetic edits — 1,763 words against 4,500–5,500 expected. The 5,000-word arc needs planning the 122B lacks. Optimal routing: at 36.6 tok/s it’s 2× faster than the 397B Q9 for code, at near-identical quality — the daily technical workhorse.

Qwen3-Next INST vs THNK — thinking isn’t a universal win

Section titled “Qwen3-Next INST vs THNK — thinking isn’t a universal win”
AxisINSTTHNKΔ
T01 — Creative367.5 (73.5%)282.5 ⚠ (56.5%)−17%
T02 — GDPR43.5 (87%)35 (70%)−17%
T03 — Python49 (98%)39.5 (79%)−19%
T04 — Mandelbrot45.5 (91%)48.5 (97%)+6%
T05 — Advanced GDPR45 (90%)45 (90%)0%

T04 is the only test where THNK beats INST — thinking solved a tricky --zoom nargs=4 detail the INST variant got wrong on negative coordinates. On T01 THNK wrote entirely in English (eliminatory C5 error); on T02 INST produced the required comparison table, THNK wrote sequential prose; on T03 THNK dumped everything in main(). Conclusion: thinking improves precise algorithmic problem-solving — not creativity, regulatory precision, or code structure. The TTFT cost (111–238 s on T03/T04) is only justified on T04.


Analytical/code axis — T02–T05 isolated

Section titled “Analytical/code axis — T02–T05 isolated”

To compare on the analytical and technical axes alone (without T01’s influence):

Model T02 T03 T04 T05 Avg%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MiniMax M2.7-8bit 36k 49 50 49 46 98.5%
Qwen 3.5 397B Q9 50 49.5 49 47.5 98.0%
Qwen 3.5 397B BF16 50 48.5 47 47 96.3%
MiniMax M3 Q6 46 48.5 48.5 49 96.0%
Qwen 3-6 47 48 48 49 96.0%
GLM-5.1 754B OR 48 48 48.5 45.5 95.0%
Hy3-preview Q9 local 44.5 48 49 44 92.8%
GLM 5.2 Q6 local 45.5 49 46 45 92.8%
Qwen 3.5 122B Q8 46 47.5 47.5 44 92.5%
GLM-5.1 754B Q8 46.5 44 49 45 92.3%
Qwen3-Next INST 43.5 49 45.5 45 91.0%
GLM 5.2 Cloud OR 43 43 46 43 87.5%
MiniMax M2.7 8bit 44.5 48 44.5 33 85.0%
Qwen3-Next THNK 35 39.5 48.5 45 84.0%

MiniMax M2.7-8bit 36k leads this axis (98.5%) — the token budget systematically closes the 12k run’s gaps.


TaskRecommended modelInfraWhy
Creative / long-form writingQwen 3.5 397B BF16Exo ×4 (1.28 TB)Best local T01; 87% White Noise
Expert GDPR / AI-Act analysisQwen 3.5 397B Q9Exo ×4T02=50, T05=47.5, top-2 T02–T05
Complex multi-regulatory (GDPR+HIPAA+GxP)Qwen 3-6OpenRouterT05 leader (49), balanced analytical
Code / Python scriptsQwen 3.5 122B Q8Inferencer ×2T03=47.5, T04=47.5, 36.6 tok/s, 2× faster
Cloud generalistGLM-5.1 754B OROpenRouterRank 2, solid across the board
Exceptional creativeHy3-preview Q9 localOdyssAI-X, 1 nodeT01=452.5 (Exceptional); T02–T05 solid

GLM 5.2 — quant’s effect on creative vs analytical

Section titled “GLM 5.2 — quant’s effect on creative vs analytical”

Three configs of the same model — the panel’s clearest result on quantization impact.

ConfigT01 /500Avg T02–T05GlobalRank
GLM 5.2 Cloud (OR)48087.5%89.2%#10
GLM 5.2 Q8 local452.5— (T01 only)T01 #3=
GLM 5.2 Q6 local402.592.8%90.3%#8

Quantization destroys creative, spares analytical. T01 loses 78 points cloud→Q6 (480→402.5): creative needs fine length control, declarative restraint, 5,000-word stylistic consistency — the first thing quant erosion damages. On T02–T05 the Q6 beats the cloud (45.5/49/46/45 vs 43/43/46/43): calculation, GDPR rigor, Python — robust to quant. Routing: Exceptional creative → GLM Cloud (480) or Q8 (452.5); analytical/code → Q6 is enough and ties the cloud; tight Q6 budget → M3 Q6 stays the best local creative (435 vs 402).

GLM 5.2 Cloud — absolute T01 record, average analyst

Section titled “GLM 5.2 Cloud — absolute T01 record, average analyst”

The benchmark’s most counter-intuitive result. GLM 5.2 Cloud produces 480/500 on T01 — the first model to beat Claude Opus 4.7 (475) on the White Noise in 14 months of TMB. But that’s where the surprise stops: on T02–T05 it scores 43–46/50 — solid Senior, no more (avg 87.5%, 11 pts under M2.7-8bit 36k).

AxisScoreProfile
T01 — Creative480/500 (96%)Absolute record — above Opus 4.7
T02 — GDPR43/50 (86%)Senior — misses distinct DPIA and AI Act
T03 — Python CLI43/50 (86%)Good draft — SSE ok, misses type hints
T04 — Mandelbrot46/50 (92%)Production-ready — NumPy vectorized
T05 — Advanced GDPR43/50 (86%)Senior — misses prominent DPIA and CE marking

T01–T07: T06 (logic trap) 46/50 (Expert), T07 (structured calc) 49/50. Coding (C01-C09): 410/460 = 89.1% — records on C02 Debug (49/50) and C06 Refactoring (49/50); weak on C07 Swift CLI (38/50). Diagnosis: an exceptional literary model, an average analyst — the widest T01-vs-T02-T05 gap in the benchmark. Recommended for long creative and code (C02/C04/C06), not for analytical routing.


TMB Benchmark v4 — The Monocle Bear — April 2026, updated 20 June 2026. 5 tests · 19 main configs · Evaluators: Claude Opus 4.6 / Sonnet 4.6 / TMB Benchmark Evaluator (MiniMax M3 / Sonnet 4.6).