TMB Scoreboard — General

The Monocle Bear · updated 20 June 2026

Tests: 5 (T01 creative + T02/T05 analytical-legal + T03/T04 code)
Scope: 19 main scoreboard configs + 5 T01-only evaluations
Infrastructure: Odysseus cluster 4× M3 Ultra + Inferencer 2× M3 Ultra 96 GB + OdyssAI-X (1 node) + OpenRouter + Anthropic API
Evaluators: Claude Opus 4.6 / Sonnet 4.6 / TMB Benchmark Evaluator (MiniMax M3) — TMB v2 rubrics

Main scoreboard

Overall score = each test normalized /100, arithmetic mean of the 5. Bold = best score on that test.

#	Model	Params	Arch	Infra	T01 /500	T02 /50	T03 /50	T04 /50	T05 /50	Avg %
1	Qwen 3.5 397B BF16	397B	MoE	Exo ×4 BF16	435	50	48.5	47	47	94.4%
2	MiniMax M3 Q6	427B	MoE	OdyssAI-X, 1 node	435	46	48.5	48.5	49	94.2%
3	GLM-5.1 754B	754B	MoE	OpenRouter	447.5	48	48	48.5	45.5	93.9%
4	Qwen 3.5 397B Q9	397B	MoE	Exo ×4 Q9	385	50	49.5	49	47.5	93.8%
5	MiniMax M2.7-8bit 36k	239B	MoE	OdyssAI-X, 1 node	375	49	50	49	46	92.6%
6	Hy3-preview Q9 local	295B	MoE	OdyssAI-X, 1 node	452.5	44.5	48	49	44	92.3%
7	GLM-5.1 754B Q8	754B	MoE	Exo ×4 Q8	442.5	46.5	44	49	45	91.5%
8	GLM 5.2 Q6 local	—	MoE	OdyssAI-X, 4 nodes	402.5	45.5	49	46	45	90.3%
9	Qwen 3-6	—	Dense	OpenRouter	315	47	48	48	49	89.4%
10	GLM 5.2 Cloud	—	MoE	OpenRouter	480 ⭐	43	43	46	43	89.2%
11	Qwen3-Next INST	80B-A3B	MoE	Omlx-Next 8bit	367.5	43.5	49	45.5	45	87.9%
12	Qwen 3.5 122B Q8	122B	MoE	Inferencer ×2	227.5	46	47.5	47.5	44	83.1%
13	MiniMax M2.7 8bit	239B	MoE	OdyssAI-X, 1 node	332.5	44.5	48	44.5	33	81.3%
14	Qwen3-Next THNK	80B-A3B	MoE	Omlx-Next 8bit	282.5 ⚠	35	39.5	48.5	45	78.5%
15	Mistral Large 2512	675B	MoE	OpenRouter	420	47.5	47	48	2	74.6%
16	Mistral Medium 3.1	—	MoE	Cloud	420	43	42	43.5	3	69.4%
17	Devstral 2512	—	MoE	Cloud	262.5	45.5	42	50	2	66.3%
18	Ministral 14B 2512	14B	Dense	Exo	350	44	28	48.5	4	63.8%
19	Mistral Small 2603	—	MoE	Cloud	320	48	36.5	33	3	61.0%

⚠ Qwen3-Next THNK T01: text entirely in English — C5 violation, score penalized. ⭐ GLM 5.2 T01 = 480/500 — new absolute benchmark record, above Claude Opus 4.7 (475).

MiniMax M3 Q6 (#2, 94.2%) — 0.2 pt off the local leader Qwen 3.5 BF16 (94.4%) and 0.8 pt off the frontier reference Opus (95.0%). Frontier profile on analytical/code (T03 48.5, T04 48.5, T05 49 = best local, tied), held back by T02 (46) and creative T01 (435, Very good — not Exceptional). The first single-node local model to reach the top of the scoreboard. (Pure code C01-C08: M3 #1 at 97.3% — see the Coding scoreboard.)

Frontier reference — Claude Opus 4.6/4.7

Out of the local ranking. Official scores evaluated May 2026.

Model	T01 /500	T02 /50	T03 /50	T04 /50	T05 /50	Avg %
Claude Opus 4.7/4.6 ✦	475	46	46.5	49.5	48	95.0%

✦ T01 = Opus 4.7. T02-T05 = Opus 4.6 (evaluated May 2026).

If ranked: #1 (95.0% vs Qwen 3.5 397B BF16 94.4% and MiniMax M3 Q6 94.2%), though a frontier/local comparison isn’t apples-to-apples — Opus is the reference ceiling. Notably, local M3 Q6 is only 0.8 pt under it, and matches/beats it on T05 (49 vs 48) and T03/T04.

T01 note: GLM 5.2 Cloud (480/500, 19 June 2026) beats Opus 4.7 on T01 — the first model in the benchmark to do so. Cloud-only (OpenRouter), on the creative criterion alone.

T01-only — partial evaluations

Model	Params	Infra	T01 /500	Level
GLM 5.2 Q8 local	—	Local Q8	452.5	Exceptional
Hy3-preview Q9 local	295B	OdyssAI-X, 1 node	452.5	Exceptional
Hy3-preview Q9 OR	295B	OpenRouter	445	Very good
Hy3-preview Q9 local (no-think)	295B	OdyssAI-X, 1 node	412.5	Very good
MiniMax 2.7 FP16	—	Cloud FP16	330	Fair

Per-test ranking

T01 — The White Noise (/500)

A short story (~5,000 words), the character Pip observing a drone. 6 criteria: stylistic progression, anti-slop, psychological consistency, Pip as mirror, form constraints, final image.

Thresholds: 450+ Exceptional · 400+ Very good · 350+ Good · 300+ Fair · 200+ Poor

#	Model	Score	Level
1	GLM 5.2 Cloud ⭐	480	Exceptional
2	Claude Opus 4.7 ✦	475	Exceptional
3	GLM 5.2 Q8 local / Hy3-preview Q9 local (thinking HC)	452.5	Exceptional
5	GLM-5.1 OpenRouter	447.5	Very good
6	Hy3-preview Q9 OR	445	Very good
7	GLM-5.1 Q8 local	442.5	Very good
8	Qwen 3.5 BF16 / MiniMax M3 Q6	435	Very good
10	Mistral Large 2512 / Mistral Medium 3.1	420	Very good
12	Hy3-preview Q9 local (no-think)	412.5	Very good
13	GLM 5.2 Q6 local	402.5	Very good
14	Qwen 3.5 Q9	385	Good
—	Gemini 3.1 Pro ✦	380	Good
—	MiniMax M2.7-8bit 36k	375	Good
—	Qwen3-Next INST	367.5	Good
—	Gemini 3.5 Flash ✦	357.5	Good
—	Ministral 14B	350	Good
—	MiniMax M2.7 8bit	332.5	Fair
—	Mistral Small	320	Fair
—	Qwen 3-6	315	Good
—	Qwen3-Next THNK ⚠	282.5	Poor (English text)
—	Devstral 2512	262.5	Poor
—	Qwen 3.5 122B Q8	227.5	Poor

⭐ GLM 5.2 Cloud (480) = new absolute T01 record, first model to beat Claude Opus 4.7 (475) on the White Noise. ✦ Frontier cloud, out of the local ranking, T01 only.

Compliance analysis for a Franco-American pharma SaaS on Azure OpenAI. Strict format: risks, scenarios, recommendation. Thresholds: 45+ Expert · 38+ Senior · 30+ Junior · 20+ Intern.

#	Model	Score	Level
1	Qwen 3.5 Q9 / BF16	50	Expert
2	MiniMax M2.7-8bit 36k	49	Expert
3	GLM-5.1 OpenRouter / Mistral Small	48	Expert
5	Mistral Large	47.5	Expert
6	Qwen 3-6	47	Expert
7	GLM-5.1 Q8	46.5	Expert
—	Claude Opus 4.6 ✦	46	Expert
8	Qwen 3.5 122B Q8 / MiniMax M3 Q6	46	Expert
9	Devstral / GLM 5.2 Q6 local	45.5	Expert
11	Ministral 14B	44	Senior
12	MiniMax M2.7 8bit / Hy3-preview Q9 local	44.5	Senior
13	Qwen3-Next INST	43.5	Senior
14	GLM 5.2 Cloud / Mistral Medium	43	Senior
16	Qwen3-Next THNK	35	Junior

T03 — Python CLI (/50)

A Python CLI that queries an OpenAI-compatible API: lists models, measures TTFT, tokens/s, prints metrics. SSE streaming required. Thresholds: 45+ Production-ready · 38+ Good draft · 30+ Functional · 20+ Rough.

#	Model	Score	Level
—	Claude Opus 4.6 ✦	46.5	Production-ready
1	MiniMax M2.7-8bit 36k	50	Production-ready
2	Qwen 3.5 Q9	49.5	Production-ready
3	Qwen3-Next INST / GLM 5.2 Q6 local	49	Production-ready
5	Qwen 3.5 BF16 / MiniMax M3 Q6	48.5	Production-ready
7	GLM-5.1 OR / Qwen 3-6 / MiniMax M2.7 8bit / Hy3-preview Q9	48	Production-ready
10	Qwen 3.5 122B Q8	47.5	Production-ready
11	Mistral Large	47	Production-ready
12	GLM 5.2 Q8 local	46.5	Production-ready
13	GLM 5.2 Cloud / Devstral / Mistral Medium	42–43	Good draft
16	Qwen3-Next THNK	39.5	Good draft
17	Mistral Small	36.5	Functional
18	Ministral 14B	28	Rough

T04 — Mandelbrot Python (/50)

NumPy + matplotlib, vectorized, argparse with zoom, PNG export. Thresholds: 45+ Production-ready · 38+ Good draft · 30+ Functional.

#	Model	Score
—	Claude Opus 4.6 ✦	49.5
1	Devstral 2512	50
2	GLM 5.2 Q8 local	49.5
3	Qwen 3.5 Q9 / GLM-5.1 Q8 / Hy3-preview Q9 / MiniMax M2.7-8bit 36k	49
7	GLM-5.1 OR / Qwen3-Next THNK / MiniMax M3 Q6	48.5
10	Qwen 3-6 / Mistral Large	48
12	Qwen 3.5 122B Q8 / BF16	47.5 / 47
14	GLM 5.2 Cloud / GLM 5.2 Q6 local	46
16	Qwen3-Next INST	45.5
17	MiniMax M2.7 8bit	44.5
18	Mistral Medium	43.5
19	Mistral Small	33

PharmaBel SA/Inc + TechServe India — GDPR + HIPAA + AI Act + GxP. 5 sections: violation mapping, transfer flows, remediation plan, AI-Act classification, target architecture. Thresholds: 45+ Expert (Big4) · 38+ Senior · 30+ Junior · 20+ Intern.

#	Model	Score	Level
—	Claude Opus 4.6 ✦	48	Expert (Big4)
1	Qwen 3-6 / MiniMax M3 Q6	49	Expert (Big4)
3	Qwen 3.5 Q9	47.5	Expert (Big4)
4	Qwen 3.5 BF16	47	Expert (Big4)
5	MiniMax M2.7-8bit 36k	46	Expert
6	GLM-5.1 OR	45.5	Expert (Big4)
7	GLM-5.1 Q8 / Qwen3-Next INST / Qwen3-Next THNK / GLM 5.2 Q6 local	45	Expert
11	Hy3-preview Q9 local / Qwen 3.5 122B Q8	44	Senior
13	GLM 5.2 Cloud	43	Senior
14	MiniMax M2.7 8bit	33	Junior
—	Mistral Large / Medium / Devstral / Ministral / Small	2–4	Fail (truncated or off-topic)

Mistral (all variants except Small 2603) fails T05 with scores of 2–4 — responses non-compliant with the required format. The benchmark’s main discriminator.

Model profiles

MiniMax M3 Q6 — the local frontier (#2)

In-house MLX port of M3 (MoE 427B), Q6 MSA-fixed (~320 GB), one node (Ultra-512) — not the 4× cluster. Lands #2 at 94.2% — 0.2 pt off the local leader (which runs on 4 nodes BF16) and 0.8 pt off Opus.

Axis	Score	Profile
T01 — Creative	435/500 (87%)	Very good, not Exceptional — under target length, the creative ceiling shared by the local panel
T02 — GDPR	46/50 (92%)	Expert (only gap: Cloud Act absent)
T03 — Python	48.5/50 (97%)	Production-ready, executed — full SSE flow, exit 0
T04 — Mandelbrot	48.5/50 (97%)	Production-ready, executed — top tier
T05 — Advanced GDPR	49/50 (98%)	Best local (tied) — matches the cloud on hard analysis

Profile: frontier on analysis AND code, single-node local. On the T02–T05 axis it scores 96.0%, and co-dominates T05 (the hardest test). Creative T01 (435) is its only ceiling. The port glitch never bit the code (clean execution on T03/T04). T06 (logic trap, 47) + T07 (structured calc, 48.5) confirm: 5/5 outside creative at Expert/Production-ready level. On pure code (C01-C08), M3 Q6 is #1 at 97.3% — see the Coding scoreboard.

MiniMax M2.7-8bit — the token-budget effect

The only model tested in two token-budget configs, same content, different output length allowed.

Axis	8bit (12k)	8bit (36k)	Δ
T01 — Creative	332.5 (66.5%)	375 (75%)	+8.5%
T02 — GDPR	44.5 (89%)	49 (98%)	+9%
T03 — Python	48 (96%)	50 (100%)	+4%
T04 — Mandelbrot	44.5 (89%)	49 (98%)	+9%
T05 — Advanced GDPR	33 (66%)	46 (92%)	+26%
Average	81.3%	92.6%	+11.3 pts

The 12k run couldn’t reach T05’s Section 5 before the limit. With 36k it’s complete. The token limit was the only real problem of the 8bit run — M2.7’s analytical/technical quality is consistently top-5 when it can finish. Residual tokenizer bleed (Chinese artifacts in French analytical text): 6 occurrences in ~7,600 words at 36k — unacceptable in production without post-processing, but no longer score-degrading. T01 stays under target length even at 36k (asks 4,500–5,500 words, produces ~2,000–2,500). The creative ceiling isn’t a token problem.

Qwen 3.5 122B Q8 — solid analyst, failing writer

Axis	Score	vs 397B BF16	Δ
T01 — Creative	227.5 (45.5%)	435 (87%)	−41.5%
T02 — GDPR	46 (92%)	50 (100%)	−8%
T03 — Python	47.5 (95%)	48.5 (97%)	−2%
T04 — Mandelbrot	47.5 (95%)	47 (94%)	+1%
T05 — Advanced GDPR	44 (88%)	47 (94%)	−6%

On T02–T05 alone: 92.5% average — between Qwen 3-6 and GLM-5.1 Q8 on the analytical/code axes. Remarkable for a model 3× smaller than the 397B; T03/T04 are nearly identical to the big model. T01 reveals the limit: a good section 1 (~750 words), then sections 2–3 duplicated with cosmetic edits — 1,763 words against 4,500–5,500 expected. The 5,000-word arc needs planning the 122B lacks. Optimal routing: at 36.6 tok/s it’s 2× faster than the 397B Q9 for code, at near-identical quality — the daily technical workhorse.

Qwen3-Next INST vs THNK — thinking isn’t a universal win

Axis	INST	THNK	Δ
T01 — Creative	367.5 (73.5%)	282.5 ⚠ (56.5%)	−17%
T02 — GDPR	43.5 (87%)	35 (70%)	−17%
T03 — Python	49 (98%)	39.5 (79%)	−19%
T04 — Mandelbrot	45.5 (91%)	48.5 (97%)	+6%
T05 — Advanced GDPR	45 (90%)	45 (90%)	0%

T04 is the only test where THNK beats INST — thinking solved a tricky --zoom nargs=4 detail the INST variant got wrong on negative coordinates. On T01 THNK wrote entirely in English (eliminatory C5 error); on T02 INST produced the required comparison table, THNK wrote sequential prose; on T03 THNK dumped everything in main(). Conclusion: thinking improves precise algorithmic problem-solving — not creativity, regulatory precision, or code structure. The TTFT cost (111–238 s on T03/T04) is only justified on T04.

Analytical/code axis — T02–T05 isolated

To compare on the analytical and technical axes alone (without T01’s influence):

Model                  T02   T03   T04   T05   Avg%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MiniMax M2.7-8bit 36k  49    50    49    46    98.5%
Qwen 3.5 397B Q9       50    49.5  49    47.5  98.0%
Qwen 3.5 397B BF16     50    48.5  47    47    96.3%
MiniMax M3 Q6          46    48.5  48.5  49    96.0%
Qwen 3-6               47    48    48    49    96.0%
GLM-5.1 754B OR        48    48    48.5  45.5  95.0%
Hy3-preview Q9 local   44.5  48    49    44    92.8%
GLM 5.2 Q6 local       45.5  49    46    45    92.8%
Qwen 3.5 122B Q8       46    47.5  47.5  44    92.5%
GLM-5.1 754B Q8        46.5  44    49    45    92.3%
Qwen3-Next INST        43.5  49    45.5  45    91.0%
GLM 5.2 Cloud OR       43    43    46    43    87.5%
MiniMax M2.7 8bit      44.5  48    44.5  33    85.0%
Qwen3-Next THNK        35    39.5  48.5  45    84.0%

MiniMax M2.7-8bit 36k leads this axis (98.5%) — the token budget systematically closes the 12k run’s gaps.

Usage matrix

Task	Recommended model	Infra	Why
Creative / long-form writing	Qwen 3.5 397B BF16	Exo ×4 (1.28 TB)	Best local T01; 87% White Noise
Expert GDPR / AI-Act analysis	Qwen 3.5 397B Q9	Exo ×4	T02=50, T05=47.5, top-2 T02–T05
Complex multi-regulatory (GDPR+HIPAA+GxP)	Qwen 3-6	OpenRouter	T05 leader (49), balanced analytical
Code / Python scripts	Qwen 3.5 122B Q8	Inferencer ×2	T03=47.5, T04=47.5, 36.6 tok/s, 2× faster
Cloud generalist	GLM-5.1 754B OR	OpenRouter	Rank 2, solid across the board
Exceptional creative	Hy3-preview Q9 local	OdyssAI-X, 1 node	T01=452.5 (Exceptional); T02–T05 solid

GLM 5.2 — quant’s effect on creative vs analytical

Three configs of the same model — the panel’s clearest result on quantization impact.

Config	T01 /500	Avg T02–T05	Global	Rank
GLM 5.2 Cloud (OR)	480	87.5%	89.2%	#10
GLM 5.2 Q8 local	452.5	— (T01 only)	—	T01 #3=
GLM 5.2 Q6 local	402.5	92.8%	90.3%	#8

Quantization destroys creative, spares analytical. T01 loses 78 points cloud→Q6 (480→402.5): creative needs fine length control, declarative restraint, 5,000-word stylistic consistency — the first thing quant erosion damages. On T02–T05 the Q6 beats the cloud (45.5/49/46/45 vs 43/43/46/43): calculation, GDPR rigor, Python — robust to quant. Routing: Exceptional creative → GLM Cloud (480) or Q8 (452.5); analytical/code → Q6 is enough and ties the cloud; tight Q6 budget → M3 Q6 stays the best local creative (435 vs 402).

GLM 5.2 Cloud — absolute T01 record, average analyst

The benchmark’s most counter-intuitive result. GLM 5.2 Cloud produces 480/500 on T01 — the first model to beat Claude Opus 4.7 (475) on the White Noise in 14 months of TMB. But that’s where the surprise stops: on T02–T05 it scores 43–46/50 — solid Senior, no more (avg 87.5%, 11 pts under M2.7-8bit 36k).

Axis	Score	Profile
T01 — Creative	480/500 (96%)	Absolute record — above Opus 4.7
T02 — GDPR	43/50 (86%)	Senior — misses distinct DPIA and AI Act
T03 — Python CLI	43/50 (86%)	Good draft — SSE ok, misses type hints
T04 — Mandelbrot	46/50 (92%)	Production-ready — NumPy vectorized
T05 — Advanced GDPR	43/50 (86%)	Senior — misses prominent DPIA and CE marking

T01–T07: T06 (logic trap) 46/50 (Expert), T07 (structured calc) 49/50. Coding (C01-C09): 410/460 = 89.1% — records on C02 Debug (49/50) and C06 Refactoring (49/50); weak on C07 Swift CLI (38/50). Diagnosis: an exceptional literary model, an average analyst — the widest T01-vs-T02-T05 gap in the benchmark. Recommended for long creative and code (C02/C04/C06), not for analytical routing.

TMB Benchmark v4 — The Monocle Bear — April 2026, updated 20 June 2026. 5 tests · 19 main configs · Evaluators: Claude Opus 4.6 / Sonnet 4.6 / TMB Benchmark Evaluator (MiniMax M3 / Sonnet 4.6).