TMB Scoreboard — Planning

TMB Planner Bench V1 · The Monocle Bear · 20 June 2026

  • Benchmark: 4 tests /50, total /200
  • Goal: measure the ability to plan before coding — not to code
  • Evaluator: Claude Sonnet 4.6 · Temperature 0 · one pass per test

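| Rank | Model | Quant | Infra | P01 | P02 | P03 | P04 | Total | % | Level |
|---|---|---|---|---|---|---|---|---|---|---|
| - | GLM 5.2 Q6 | Q6 | Local Ultra-512 | 41 | 42 | 47 | 48 | 178 | 89.0% | Architect ⭐ ¹ |
| 1 | Kimi K2.7 Code | 4-bit | Local Ultra-512 | 44 | 38 | 47 | 48 | 177 | 88.5% | Senior (top) |
| 2 | MiniMax M3 local | Q6H16 | Local Ultra-512 | 40 | 38 | 48 | 50 | 176 | 88.0% | Senior → Architect |
| 3 | MiniMax M3 cloud | - | Cloud API | 39 | 37 | 47 | 47 | 170 | 85.0% | Senior |
| - | Hy3-preview Q9 | Q9 | Local Ultra-512 | 0 | 46 | 45 | 36 | 127 | 63.5% | Junior ² |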

¹ GLM 5.2 Q6 is ranked out of competition for the production planner seat: too slow (TTFT + decode). On the merits, it is the best reasoner of the bench.

² Hy3 P01 = 0: it delivered SQL code instead of a plan. It is also the only model to ask clarifying questions (P02 = 46/50, best of the whole panel).

Thresholds: 180–200 Architect · 150–179 Senior · 120–149 Junior · 80–119 Intern · <80 Unusable.
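The totals, percentages, and bands above follow simple arithmetic, which can be sketched as below. This is a minimal illustration, not the bench's own tooling; the names `BANDS`, `total_and_percent`, and `level` are hypothetical.

```python
# Score bands copied from the Thresholds line (totals are out of 200).
BANDS = [
    (180, "Architect"),
    (150, "Senior"),
    (120, "Junior"),
    (80, "Intern"),
    (0, "Unusable"),
]


def total_and_percent(scores):
    """Sum four /50 test scores and express the total as a percentage of 200."""
    total = sum(scores)
    return total, round(100 * total / 200, 1)


def level(total):
    """Return the band name for a total in the range 0..200."""
    for floor, name in BANDS:
        if total >= floor:
            return name
    raise ValueError("total must be non-negative")


if __name__ == "__main__":
    total, pct = total_and_percent([44, 38, 47, 48])  # Kimi K2.7 Code row
    print(total, pct, level(total))
```

The Level column of the scoreboard also carries editorial qualifiers (for example "Senior → Architect"), which this purely numeric mapping does not try to reproduce.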


P01 (/50)

| Rank | Model | Score | Key notes |
|---|---|---|---|
| 1 | Kimi K2.7 Code | 44 | Only one with 15/15 decomposition + 10/10 critical steps. 401 vs 403 explicitly distinguished. Tests included. |
| 2 | GLM 5.2 Q6 | 41 | 7 files named, correct ownership. No tests (−3), 404 instead of 403 (−2). |
| 3 | MiniMax M3 local | 40 | Exemplary mapping (strict MVC). Forgot tests. 404 instead of 403. |
| 4 | MiniMax M3 cloud | 39 | Richer MVC architecture (13 files). Same omissions as local. |
| - | Hy3-preview Q9 | 0 | Delivered SQL migrations instead of the plan. Plan mode ignored. |
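The notes above credit an explicit 401 vs 403 distinction and penalize 404 where 403 was expected. The P01 task statement is not reproduced here, so the following is only a hypothetical ownership check showing the usual meaning of the three codes; `status_for`, `user_id`, and `owner_id` are illustrative names, not part of the bench.

```python
from http import HTTPStatus
from typing import Optional


def status_for(user_id: Optional[int], owner_id: Optional[int]) -> HTTPStatus:
    """Pick a response status for a request on an owned resource.

    user_id is None when the caller is not authenticated;
    owner_id is None when the resource does not exist.
    """
    if user_id is None:
        return HTTPStatus.UNAUTHORIZED  # 401: the caller's identity is unknown
    if owner_id is None:
        return HTTPStatus.NOT_FOUND     # 404: there is nothing to act on
    if user_id != owner_id:
        return HTTPStatus.FORBIDDEN     # 403: known caller, but not the owner
    return HTTPStatus.OK
```

Under these assumptions, answering 404 for an existing resource owned by someone else hides the permission problem from the client, which may be the reason the evaluator docked points for it.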

P02 — Understanding the need / the trap (/50)

Discriminating test: identify the real bottlenecks (taxes 2.8 s + email 800 ms), avoid the DB trap, ask the business question before diving in.

| Rank | Model | Score | Key notes |
|---|---|---|---|
| 1 | Hy3-preview Q9 | 46 | Only one to ask 3 discriminating questions before planning. Perfect diagnosis. Think-chain exposed. |
| 2 | GLM 5.2 Q6 | 42 | Names the “classic trap” on line 1. Even offers to contact the provider. Questions come after the plan (−5). |
| 3 | Kimi K2.7 Code / MiniMax M3 local | 38 | “Trap to avoid” in a dedicated block. Diagnosis perfect, but no clarification reflex. |
| 5 | MiniMax M3 cloud | 37 | Same as local. The missing clarification reflex is a matter of M3 architecture, not hardware. |

Shared blind spot (except Hy3): none asks “does the business accept an estimated tax?” before planning.

P03 (/50)

| Rank | Model | Score | Key notes |
|---|---|---|---|
| 1 | MiniMax M3 local | 48 | 11/11 risks + best prioritization (executive summary). |
| 2 | GLM 5.2 Q6 / Kimi K2.7 / MiniMax M3 cloud | 47 | 11/11. GLM alone sees “accidental overwrite by partial CSV”. Kimi adds 7 blocking questions. M3 cloud has the most sophisticated concurrency. |
| 5 | Hy3-preview Q9 | 45 | 11/11. Clean rollback. Flat prioritization. |

P04 (/50)

Goal: a spec precise enough that a coder can implement it without asking about intent again.

| Rank | Model | Score | Key notes |
|---|---|---|---|
| 1 | MiniMax M3 local | 50 | Textbook spec. Global timeout 2,500 ms. 8-entry behavior table. Clear out-of-scope section. |
| 2 | GLM 5.2 Q6 / Kimi K2.7 | 48 | GLM: exact global-timeout behavior with literal error message. Kimi: 5 env-var constants, explicit parallel execution. |
| 4 | MiniMax M3 cloud | 47 | More production-ready (env vars, logging), but adds unrequested, out-of-scope features. No explicit global timeout. |
| 5 | Hy3-preview Q9 | 36 | Sequential/parallel decision left to the coder. timestamp and latency_ms absent. |
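The notes above reward an explicit global timeout (2,500 ms), an explicit parallel-execution decision, and the presence of `timestamp` and `latency_ms`. The P04 task itself is not reproduced here, so the following is only a sketch of that pattern with `asyncio`; `run_with_global_timeout`, `fake_call`, and the `status` and `results` keys are hypothetical.

```python
import asyncio
import time
from datetime import datetime, timezone

GLOBAL_TIMEOUT_S = 2.5  # the 2,500 ms global timeout cited in the notes


async def run_with_global_timeout(calls, timeout_s=GLOBAL_TIMEOUT_S):
    """Run coroutines in parallel under one shared deadline."""
    started = time.monotonic()
    try:
        # gather() schedules every call concurrently; wait_for() cancels
        # all of them if the shared deadline expires first.
        results = await asyncio.wait_for(asyncio.gather(*calls), timeout=timeout_s)
        status = "ok"
    except asyncio.TimeoutError:
        results, status = None, "timeout"
    return {
        "status": status,
        "results": results,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "latency_ms": round((time.monotonic() - started) * 1000),
    }


async def fake_call(name, delay_s):
    """Stand-in for a real downstream call (hypothetical)."""
    await asyncio.sleep(delay_s)
    return name


if __name__ == "__main__":
    calls = [fake_call("a", 0.01), fake_call("b", 0.02)]
    print(asyncio.run(run_with_global_timeout(calls)))
```

Stating the deadline and the concurrency model in the spec removes exactly the kind of decision that the lowest-scoring answer left to the coder.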

| Model | Judgment (P02+P03) | Mechanics (P01+P04) |
|---|---|---|
| Hy3 | 91/100 | 36/100 |
| GLM 5.2 Q6 | 89/100 | 89/100 |
| Kimi K2.7 Code | 85/100 | 92/100 |
| M3 local | 86/100 | 90/100 |
| M3 cloud | 84/100 | 86/100 |

The Hy3 paradox in one line: judgment #1, mechanics last — the exact inverse of the initial prediction.
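The judgment and mechanics figures are sums of the per-test scores from the scoreboard, which a short sketch can recompute. The names `SCORES` and `split` are illustrative only.

```python
# Per-test scores (P01, P02, P03, P04) copied from the scoreboard.
SCORES = {
    "Hy3": (0, 46, 45, 36),
    "GLM 5.2 Q6": (41, 42, 47, 48),
    "Kimi K2.7 Code": (44, 38, 47, 48),
    "M3 local": (40, 38, 48, 50),
    "M3 cloud": (39, 37, 47, 47),
}


def split(scores):
    """Return (judgment, mechanics), each out of 100."""
    p01, p02, p03, p04 = scores
    return p02 + p03, p01 + p04


if __name__ == "__main__":
    for model, scores in SCORES.items():
        judgment, mechanics = split(scores)
        print(f"{model}: judgment {judgment}/100, mechanics {mechanics}/100")
```

Sorting by each component separately shows Hy3 first on judgment and last on mechanics.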


Kimi K2.7 Code — the operational planner. 177/200 (88.5%), #1 production. Balanced: best P01 of the panel (44/50), near-perfect P04, exhaustive P03. Thinking chain visible in the answers (prefixed “We need…”) — strip it in production. Weakness: no clarification reflex (P02 = 38/50). For Aider: ensure thinking blocks are stripped before diff parsing.
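Stripping the thinking chain before diff parsing could look like the sketch below. It assumes the reasoning arrives wrapped in `<think>...</think>` tags, which is a guess: the source only says the chain is visible and prefixed “We need…”, so the delimiter must be adapted to the actual output format. `strip_thinking` is a hypothetical helper, not an Aider or Cline API.

```python
import re

# Assumed delimiter: reasoning wrapped in <think>...</think>.
# Adjust the pattern to whatever the serving stack really emits.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(answer: str) -> str:
    """Remove delimited reasoning blocks and return the remaining answer."""
    return THINK_BLOCK.sub("", answer).strip()


if __name__ == "__main__":
    raw = "<think>We need to list the files first.</think>\n1. Create the route"
    print(strip_thinking(raw))
```

Answers that contain no such block pass through unchanged, so the filter can sit in front of the parser unconditionally.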

MiniMax M3 local (Q6H16): the reference planner. 176/200 (88.0%), #2 overall but #1 on P04 (50/50, reference-grade: 8-case error table, 2,500 ms global timeout, clear out-of-scope section; a coder can implement it without a single question about intent). Best edge-case coverage (P03 48/50). Weaknesses: P02 clarification (−12) and a slightly sub-optimal P01 (forgot tests). For the Cline pipeline: recommended planner seat. Local beats cloud by 6 pts overall, 3 of them on P04.

GLM 5.2 Q6 — best reasoner, out of production. 178/200 (89.0%) — highest score, out of competition (TTFT + decode too slow for an interactive pipeline). Qualitatively: finest P02 (names the trap first), best-specified global timeout in P04, the only “partial-CSV overwrite” risk in P03. Use: offline / batch planning, spec review — not pipeline execution.

MiniMax M3 cloud: the conservative senior. 170/200 (85.0%), 6 pts behind local, 3 of them on P04 (unrequested features: env vars, advisory lock). Shares the local version's blind spots. Prefer it when the context forces the API.

Hy3-preview Q9 — the paradox. 127/200 (63.5%), Junior. The 0/50 on P01 (code instead of a plan) is eliminatory. But: best P02 of the panel (46/50), the only model to ask the architecture-changing clarifying questions. Diagnosis: Hy3 thinks well (visible reasoning, 91/100 on judgment) but doesn’t stay in its lane — “Plan mode, don’t code” is treated as optional. A signal of architecture, not intelligence. Its natural role is execution, not planning.


| Context | Recommended model |
|---|---|
| Interactive Cline pipeline (planner) | Kimi K2.7 Code or MiniMax M3 local |
| Very precise spec required (P04) | MiniMax M3 local (50/50) |
| Offline / batch planning | GLM 5.2 Q6 (best reasoner) |
| Cloud-only context | MiniMax M3 cloud |
| Coder in the pipeline | Kimi K2.7 Code or Nex N2 Pro Q9 |
| Avoid as planner | Hy3 (P01 mode-fail) |

  • 4 tests × /50 = /200. All models tested in Plan mode (no code execution).
  • P02 is the discriminator — it separates planners that ask the right questions from those that rush to a good plan.
  • P01 is eliminatory — a mode-fail (code delivered) scores 0/50 regardless of content.
  • Shared blind spot (except Hy3): the business-clarification reflex. Systemic fix: add to the system prompt — “Before planning a task with an implicit business constraint, state an explicit assumption and ask for validation.”
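The systemic fix in the last bullet amounts to appending one sentence to the planner's system prompt. A minimal sketch, assuming the prompt is assembled as a plain string; `CLARIFICATION_RULE` and `with_clarification_rule` are hypothetical names.

```python
# Sentence quoted from the systemic fix above.
CLARIFICATION_RULE = (
    "Before planning a task with an implicit business constraint, "
    "state an explicit assumption and ask for validation."
)


def with_clarification_rule(system_prompt: str) -> str:
    """Append the rule once; prompts that already contain it are returned as-is."""
    if CLARIFICATION_RULE in system_prompt:
        return system_prompt
    parts = [p for p in (system_prompt.strip(), CLARIFICATION_RULE) if p]
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(with_clarification_rule("You are in Plan mode. Do not write code."))
```

Making the helper idempotent keeps the rule from being duplicated when prompts pass through several layers of a pipeline.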

TMB Planner Bench V1 — The Monocle Bear — June 2026. 5 models: GLM 5.2 Q6, Kimi K2.7 Code, MiniMax M3 local Q6, MiniMax M3 cloud, Hy3-preview Q9.