TMB Scoreboard — Planning

TMB Planner Bench V1 · The Monocle Bear · 20 June 2026

Benchmark: 4 tests scored out of 50 each, 200 points total
Goal: measure the ability to plan before coding — not to code
Evaluator: Claude Sonnet 4.6 · Temperature 0 · one pass per test

Main scoreboard

Rank	Model	Quant	Infra	P01	P02	P03	P04	Total	%	Level
—	GLM 5.2 Q6	Q6	Local Ultra-512	41	42	47	48	178	89.0%	Architect ⭐ ¹
1	Kimi K2.7 Code	4-bit	Local Ultra-512	44	38	47	48	177	88.5%	Senior (top)
2	MiniMax M3 local	Q6H16	Local Ultra-512	40	38	48	50	176	88.0%	Senior → Architect
3	MiniMax M3 cloud	—	Cloud API	39	37	47	47	170	85.0%	Senior
—	Hy3-preview Q9	Q9	Local Ultra-512	0	46	45	36	127	63.5%	Junior ²

¹ GLM 5.2 Q6 is ranked out of competition for the production planner seat: it is too slow (TTFT + decode). On the merits, it is the best reasoner of the bench.
² Hy3 P01 = 0: it delivered SQL code instead of a plan. It is also the only model to ask clarifying questions before planning (P02 = 46/50, the best of the whole panel).

Thresholds: 180–200 Architect · 150–179 Senior · 120–149 Junior · 80–119 Intern · <80 Unusable.
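
The totals, percentages and bands can be recomputed from the per-test scores. The sketch below is illustrative only: the names SCORES, BANDS, level and summarize are hypothetical, and the band floors are read directly from the thresholds line above.

```python
# Per-test scores (P01..P04) copied from the main scoreboard.
SCORES = {
    "GLM 5.2 Q6": (41, 42, 47, 48),
    "Kimi K2.7 Code": (44, 38, 47, 48),
    "MiniMax M3 local": (40, 38, 48, 50),
    "MiniMax M3 cloud": (39, 37, 47, 47),
    "Hy3-preview Q9": (0, 46, 45, 36),
}

# Lower bound of each band, highest first, as listed in the thresholds line.
BANDS = [(180, "Architect"), (150, "Senior"), (120, "Junior"), (80, "Intern")]


def level(total: int) -> str:
    """Return the band name for a /200 total; anything under 80 is Unusable."""
    for floor, name in BANDS:
        if total >= floor:
            return name
    return "Unusable"


def summarize(scores: dict) -> dict:
    """Map each model to (total, percent, band)."""
    out = {}
    for model, tests in scores.items():
        total = sum(tests)
        out[model] = (total, round(100 * total / 200, 1), level(total))
    return out


if __name__ == "__main__":
    for model, (total, pct, band) in summarize(SCORES).items():
        print(f"{model}: {total}/200 ({pct}%) {band}")
```

This mapping is purely numeric. The Level column in the scoreboard carries qualitative qualifiers (for example "Senior (top)") that it does not reproduce. Applied literally, 178 falls in the Senior band, so the Architect label on the GLM row is presumably a qualitative call.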

Per-test scores

P01 — Decomposition & sequencing (/50)

Rank	Model	Score	Key notes
1	Kimi K2.7 Code	44	Only one with 15/15 decomposition + 10/10 critical steps. 401 vs 403 explicitly distinguished. Tests included.
2	GLM 5.2 Q6	41	7 files named, correct ownership. No tests (−3), 404 instead of 403 (−2).
3	MiniMax M3 local	40	Exemplary mapping (strict MVC). Forgot tests. 404 instead of 403.
4	MiniMax M3 cloud	39	Richer MVC architecture (13 files). Same omissions as local.
—	Hy3-preview Q9	0	Delivered SQL migrations instead of the plan. Plan mode ignored.

P02 — Understanding the need / the trap (/50)

Discriminating test: identify the real bottlenecks (taxes 2.8 s + email 800 ms), avoid the DB trap, ask the business question before diving in.

Rank	Model	Score	Key notes
1	Hy3-preview Q9	46	Only one to ask 3 discriminating questions before planning. Perfect diagnosis. Think-chain exposed.
2	GLM 5.2 Q6	42	Names the “classic trap” on line 1. Even offers to contact the provider. Questions come after the plan (−5).
3	Kimi K2.7 Code / MiniMax M3 local	38	“Trap to avoid” in a dedicated block. Diagnosis perfect, but no clarification reflex.
5	MiniMax M3 cloud	37	Same as local. The missing clarification reflex is a trait of the M3 architecture, not of the hardware.

Shared blind spot (except Hy3): none asks “does the business accept an estimated tax?” before planning.

P03 — Edge cases & risks (/50)

Rank	Model	Score	Key notes
1	MiniMax M3 local	48	11/11 risks + best prioritization (executive summary).
2	GLM 5.2 Q6 / Kimi K2.7 / MiniMax M3 cloud	47	11/11. GLM alone sees “accidental overwrite by partial CSV”. Kimi adds 7 blocking questions. M3 cloud has the most sophisticated concurrency.
5	Hy3-preview Q9	45	11/11. Clean rollback. Flat prioritization.
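
The Rank column in the per-test tables appears to follow standard competition ranking: tied models share a rank and the next rank is skipped accordingly (1, 2, 2, 2, 5 for P03). A minimal sketch of that rule, with the hypothetical helper competition_ranks:

```python
def competition_ranks(scores: dict) -> list:
    """Return (rank, model, score) rows, best first; ties share a rank."""
    # sorted() is stable, so tied models keep their insertion order.
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    for position, (model, score) in enumerate(ordered, start=1):
        if rows and rows[-1][2] == score:
            rank = rows[-1][0]  # same score as the previous row: share its rank
        else:
            rank = position     # otherwise the rank is the 1-based position
        rows.append((rank, model, score))
    return rows


# P03 scores copied from the table above.
P03 = {
    "MiniMax M3 local": 48,
    "GLM 5.2 Q6": 47,
    "Kimi K2.7 Code": 47,
    "MiniMax M3 cloud": 47,
    "Hy3-preview Q9": 45,
}

if __name__ == "__main__":
    for rank, model, score in competition_ranks(P03):
        print(rank, model, score)
```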

P04 — Executable handoff (/50)

A spec precise enough that a coder implements without re-asking the intent.

Rank	Model	Score	Key notes
1	MiniMax M3 local	50	Textbook spec. Global timeout 2,500 ms. 8-entry behavior table. Clear out-of-scope section.
2	GLM 5.2 Q6 / Kimi K2.7	48	GLM: exact global-timeout behavior with literal error message. Kimi: 5 env-var constants, explicit parallel execution.
4	MiniMax M3 cloud	47	More production-ready (env vars, logging), but adds unrequested out-of-scope features. No explicit global timeout.
5	Hy3-preview Q9	36	Sequential/parallel decision left to the coder. `timestamp` and `latency_ms` absent.

Judgment vs mechanics split

Model	Judgment (P02+P03)	Mechanics (P01+P04)
Hy3	91/100 ⭐	36/100
GLM 5.2 Q6	89/100	89/100
Kimi K2.7 Code	85/100	92/100 ⭐
M3 local	86/100	90/100
M3 cloud	84/100	86/100

The Hy3 paradox in one line: judgment #1, mechanics last — the exact inverse of the initial prediction.
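
The split is a simple regrouping of the four test scores: judgment is P02 + P03, mechanics is P01 + P04. A short sketch that reproduces the table, where the names SCORES and split are hypothetical:

```python
# Per-test scores (P01, P02, P03, P04) copied from the main scoreboard.
SCORES = {
    "Hy3-preview Q9": (0, 46, 45, 36),
    "GLM 5.2 Q6": (41, 42, 47, 48),
    "Kimi K2.7 Code": (44, 38, 47, 48),
    "MiniMax M3 local": (40, 38, 48, 50),
    "MiniMax M3 cloud": (39, 37, 47, 47),
}


def split(tests: tuple) -> dict:
    """Regroup four /50 scores into two /100 axes."""
    p01, p02, p03, p04 = tests
    return {"judgment": p02 + p03, "mechanics": p01 + p04}


if __name__ == "__main__":
    for model, tests in SCORES.items():
        axes = split(tests)
        print(f"{model}: judgment {axes['judgment']}/100, "
              f"mechanics {axes['mechanics']}/100")
```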

Profiles

Kimi K2.7 Code: the operational planner. 177/200 (88.5%), #1 in the production ranking. Balanced: best P01 of the panel (44/50), near-perfect P04 (48/50), exhaustive P03 (47/50). Its thinking chain is visible in the answers (prefixed “We need…”) and should be stripped in production. Weakness: no clarification reflex (P02 = 38/50). For Aider: ensure thinking blocks are stripped before diff parsing.
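
Stripping the thinking chain before diff parsing could look like the sketch below. It rests on an assumption the bench does not confirm: that the reasoning arrives wrapped in <think>...</think> tags. If the model emits its reasoning as untagged prose, a different delimiter is needed. The helper name strip_thinking is hypothetical.

```python
import re

# ASSUMPTION: reasoning is wrapped in <think>...</think> tags. The bench only
# says the chain is visible and prefixed "We need…"; adjust the pattern to
# whatever delimiter the serving stack actually emits.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every tagged reasoning block and any leading whitespace left over."""
    return THINK_BLOCK.sub("", text).lstrip()


if __name__ == "__main__":
    raw = "<think>We need to list the files first.</think>\nPlan:\n1. Add the route."
    print(strip_thinking(raw))
```

The non-greedy `.*?` with `re.DOTALL` keeps each match limited to one block even when the reasoning spans several lines or appears more than once.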

MiniMax M3 local (Q6H16): the reference planner. 176/200 (88.0%), #2 in the production ranking but #1 on P04 (50/50, reference-grade: 8-case error table, 2,500 ms global timeout, clear out-of-scope section; a coder can implement it without a single question about intent). Best edge-case coverage (P03 48/50). Weaknesses: P02 clarification (−12), P01 slightly sub-optimal (forgot tests). For the Cline pipeline: recommended planner seat. Local beats cloud by 6 pts overall, 3 of them on P04.

GLM 5.2 Q6: best reasoner, out of production. 178/200 (89.0%), the highest score, but out of competition (TTFT + decode too slow for an interactive pipeline). Qualitatively: finest P02 diagnosis (names the trap first), best-specified global timeout in P04, and the only model to flag the “partial-CSV overwrite” risk in P03. Use: offline / batch planning and spec review, not pipeline execution.

MiniMax M3 cloud: the conservative senior. 170/200 (85.0%). 6 pts behind local, 3 of them on P04 (unrequested features: env vars, advisory lock). Shares the local model’s blind spots. Prefer it when the context forces the use of an API.

Hy3-preview Q9: the paradox. 127/200 (63.5%), Junior. The 0/50 on P01 (code instead of a plan) is disqualifying. Yet it has the best P02 of the panel (46/50) and is the only model to ask the architecture-changing clarifying questions. Diagnosis: Hy3 thinks well (visible reasoning, 91/100 on judgment) but doesn’t stay in its lane: it treats “Plan mode, don’t code” as optional. A signal of architecture, not intelligence. Its natural role is execution, not planning.

Recommended routing

Context	Recommended model
Interactive Cline pipeline (planner)	Kimi K2.7 Code or MiniMax M3 local
Very precise spec required (P04)	MiniMax M3 local (50/50)
Offline / batch planning	GLM 5.2 Q6 (best reasoner)
Cloud-only context	MiniMax M3 cloud
Coder in the pipeline	Kimi K2.7 Code or Nex N2 Pro Q9
Avoid as planner	Hy3 (P01 mode-fail)
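
The routing table can be expressed as a lookup with ordered preferences. The sketch below is illustrative: the context keys and the pick helper are hypothetical names, and only the model names come from the table.

```python
# Ordered preferences per context, copied from the routing table.
# The context keys are hypothetical labels chosen for this sketch.
ROUTING = {
    "interactive_planner": ["Kimi K2.7 Code", "MiniMax M3 local"],
    "precise_spec": ["MiniMax M3 local"],
    "offline_batch": ["GLM 5.2 Q6"],
    "cloud_only": ["MiniMax M3 cloud"],
    "coder": ["Kimi K2.7 Code", "Nex N2 Pro Q9"],
}

# Models the bench advises against for the planner seat.
AVOID_AS_PLANNER = {"Hy3-preview Q9"}


def pick(context: str, available: set):
    """Return the first recommended model that is available, else None."""
    for model in ROUTING[context]:
        if model in available:
            return model
    return None


if __name__ == "__main__":
    print(pick("interactive_planner", {"MiniMax M3 local", "MiniMax M3 cloud"}))
```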

Method notes

4 tests × /50 = /200. All models tested in Plan mode (no code execution).
P02 is the discriminator — it separates planners that ask the right questions from those that rush to a good plan.
P01 is disqualifying: a mode-fail (code delivered) scores 0/50 regardless of content.
Shared blind spot (except Hy3): the business-clarification reflex. Systemic fix: add to the system prompt — “Before planning a task with an implicit business constraint, state an explicit assumption and ask for validation.”
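
One way to apply the systemic fix is to append the clarification rule to the planner's system prompt. In the sketch below, only the rule text comes from the note above. The base prompt and the build_system_prompt helper are hypothetical.

```python
# Rule text quoted from the method notes.
CLARIFICATION_RULE = (
    "Before planning a task with an implicit business constraint, "
    "state an explicit assumption and ask for validation."
)


def build_system_prompt(base: str) -> str:
    """Append the clarification rule once; calling it again changes nothing."""
    if CLARIFICATION_RULE in base:
        return base
    return base.rstrip() + "\n\n" + CLARIFICATION_RULE


if __name__ == "__main__":
    # Hypothetical base prompt, not the one used in the bench.
    base_prompt = "You are in Plan mode. Produce a plan. Do not write code."
    print(build_system_prompt(base_prompt))
```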

TMB Planner Bench V1 · The Monocle Bear · June 2026. 5 models: GLM 5.2 Q6, Kimi K2.7 Code, MiniMax M3 local Q6H16, MiniMax M3 cloud, Hy3-preview Q9.