TMB Scoreboard — Planning
TMB Planner Bench V1 · The Monocle Bear · 20 June 2026
- Benchmark: 4 tests scored /50 each, total /200
- Goal: measure the ability to plan before coding — not to code
- Evaluator: Claude Sonnet 4.6 · Temperature 0 · one pass per test
Main scoreboard
| Rank | Model | Quant | Infra | P01 | P02 | P03 | P04 | Total | % | Level |
|---|---|---|---|---|---|---|---|---|---|---|
| — | GLM 5.2 Q6 | Q6 | Local Ultra-512 | 41 | 42 | 47 | 48 | 178 | 89.0% | Architect ⭐ ¹ |
| 1 | Kimi K2.7 Code | 4-bit | Local Ultra-512 | 44 | 38 | 47 | 48 | 177 | 88.5% | Senior (top) |
| 2 | MiniMax M3 local | Q6H16 | Local Ultra-512 | 40 | 38 | 48 | 50 | 176 | 88.0% | Senior → Architect |
| 3 | MiniMax M3 cloud | — | Cloud API | 39 | 37 | 47 | 47 | 170 | 85.0% | Senior |
| — | Hy3-preview Q9 | Q9 | Local Ultra-512 | 0 | 46 | 45 | 36 | 127 | 63.5% | Junior ² |
¹ GLM 5.2 Q6 is ranked out of competition for the production planner seat: too slow (TTFT + decode). On the merits, it is the best reasoner of the bench.

² Hy3 P01 = 0: it delivered SQL code instead of a plan. It is also the only model to ask clarifying questions (P02 = 46/50, best of the whole panel).
Thresholds: 180–200 Architect · 150–179 Senior · 120–149 Junior · 80–119 Intern · <80 Unusable.
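The threshold bands above can be applied mechanically. The following is a minimal sketch, assuming the bands are inclusive at their lower bound; the names `THRESHOLDS`, `level_for`, and `percent` are illustrative and not part of the bench tooling.

```python
# Bands copied from the thresholds line; function and constant names are hypothetical.
THRESHOLDS = [
    (180, "Architect"),
    (150, "Senior"),
    (120, "Junior"),
    (80, "Intern"),
]


def level_for(total: int) -> str:
    """Return the threshold band for a total out of 200."""
    if not 0 <= total <= 200:
        raise ValueError(f"total must be in 0..200, got {total}")
    for floor, name in THRESHOLDS:
        if total >= floor:
            return name
    return "Unusable"


def percent(total: int, maximum: int = 200) -> float:
    """Express a total as a percentage, rounded to one decimal."""
    return round(100 * total / maximum, 1)


if __name__ == "__main__":
    for model, total in [("Kimi K2.7 Code", 177), ("Hy3-preview Q9", 127)]:
        print(model, total, percent(total), level_for(total))
```

Applied strictly, a total of 178 falls in the Senior band, so the Level column of the main scoreboard appears to layer a qualitative judgment on top of the numeric bands.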
Per-test scores
P01 — Decomposition & sequencing (/50)
| Rank | Model | Score | Key notes |
|---|---|---|---|
| 1 | Kimi K2.7 Code | 44 | Only one with 15/15 decomposition + 10/10 critical steps. 401 vs 403 explicitly distinguished. Tests included. |
| 2 | GLM 5.2 Q6 | 41 | 7 files named, correct ownership. No tests (−3), 404 instead of 403 (−2). |
| 3 | MiniMax M3 local | 40 | Exemplary mapping (strict MVC). Forgot tests. 404 instead of 403. |
| 4 | MiniMax M3 cloud | 39 | Richer MVC architecture (13 files). Same omissions as local. |
| — | Hy3-preview Q9 | 0 | Delivered SQL migrations instead of the plan. Plan mode ignored. |
P02 — Understanding the need / the trap (/50)
Discriminating test: identify the real bottlenecks (taxes 2.8 s + email 800 ms), avoid the DB trap, ask the business question before diving in.
| Rank | Model | Score | Key notes |
|---|---|---|---|
| 1 | Hy3-preview Q9 | 46 | Only one to ask 3 discriminating questions before planning. Perfect diagnosis. Think-chain exposed. |
| 2 | GLM 5.2 Q6 | 42 | Names the “classic trap” on line 1. Even offers to contact the provider. Questions come after the plan (−5). |
| 3 | Kimi K2.7 Code / MiniMax M3 local | 38 | “Trap to avoid” in a dedicated block. Diagnosis perfect, but no clarification reflex. |
| 5 | MiniMax M3 cloud | 37 | Same as local. The missing clarification reflex comes from the M3 architecture, not from the hardware. |
Shared blind spot (except Hy3): none asks “does the business accept an estimated tax?” before planning.
P03 — Edge cases & risks (/50)
| Rank | Model | Score | Key notes |
|---|---|---|---|
| 1 | MiniMax M3 local | 48 | 11/11 risks + best prioritization (executive summary). |
| 2 | GLM 5.2 Q6 / Kimi K2.7 / MiniMax M3 cloud | 47 | 11/11. GLM alone sees “accidental overwrite by partial CSV”. Kimi adds 7 blocking questions. M3 cloud has the most sophisticated concurrency. |
| 5 | Hy3-preview Q9 | 45 | 11/11. Clean rollback. Flat prioritization. |
P04 — Executable handoff (/50)
A spec precise enough that a coder implements without re-asking the intent.
| Rank | Model | Score | Key notes |
|---|---|---|---|
| 1 | MiniMax M3 local | 50 | Textbook spec. Global timeout 2,500 ms. 8-entry behavior table. Clear out-of-scope section. |
| 2 | GLM 5.2 Q6 / Kimi K2.7 | 48 | GLM: exact global-timeout behavior with literal error message. Kimi: 5 env-var constants, explicit parallel execution. |
| 4 | MiniMax M3 cloud | 47 | More production-ready (env vars, logging), but adds unrequested, out-of-scope features. No explicit global timeout. |
| 5 | Hy3-preview Q9 | 36 | Sequential/parallel decision left to the coder. timestamp and latency_ms absent. |
Judgment vs mechanics split
| Model | Judgment (P02+P03) | Mechanics (P01+P04) |
|---|---|---|
| Hy3 | 91/100 ⭐ | 36/100 |
| GLM 5.2 Q6 | 89/100 | 89/100 |
| Kimi K2.7 Code | 85/100 | 92/100 ⭐ |
| M3 local | 86/100 | 90/100 |
| M3 cloud | 84/100 | 86/100 |
The Hy3 paradox in one line: judgment #1, mechanics last — the exact inverse of the initial prediction.
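The split is plain arithmetic on the per-test scores. As a sketch, the table can be reproduced from the main scoreboard like this; the `SCORES` dictionary and the `split` helper are illustrative names, and the numbers are copied from the scoreboard above.

```python
# Per-test scores copied from the main scoreboard (each test is out of 50).
SCORES = {
    "Hy3": {"P01": 0, "P02": 46, "P03": 45, "P04": 36},
    "GLM 5.2 Q6": {"P01": 41, "P02": 42, "P03": 47, "P04": 48},
    "Kimi K2.7 Code": {"P01": 44, "P02": 38, "P03": 47, "P04": 48},
    "M3 local": {"P01": 40, "P02": 38, "P03": 48, "P04": 50},
    "M3 cloud": {"P01": 39, "P02": 37, "P03": 47, "P04": 47},
}


def split(scores: dict[str, int]) -> tuple[int, int]:
    """Return (judgment, mechanics), each out of 100."""
    judgment = scores["P02"] + scores["P03"]
    mechanics = scores["P01"] + scores["P04"]
    return judgment, mechanics


if __name__ == "__main__":
    for model, scores in SCORES.items():
        judgment, mechanics = split(scores)
        print(f"{model}: judgment {judgment}/100, mechanics {mechanics}/100")
```

Taking the maximum of each column recovers the two stars in the table: Hy3 leads on judgment and Kimi K2.7 Code leads on mechanics.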
Profiles
Kimi K2.7 Code — the operational planner. 177/200 (88.5%), #1 in the production ranking. Balanced: best P01 of the panel (44/50), near-perfect P04 (48/50), exhaustive P03. The thinking chain is visible in the answers (prefixed “We need…”) and should be stripped in production. Weakness: no clarification reflex (P02 = 38/50). For Aider: ensure thinking blocks are stripped before diff parsing.
MiniMax M3 local (Q6H16) — the reference planner. 176/200 (88.0%), #2 overall but #1 on P04 (50/50, reference-grade: 8-case error table, 2,500 ms global timeout, clear out-of-scope; a coder implements without a single question about intent). Best edge-case coverage (P03 48/50). Weaknesses: P02 clarification (−12), P01 slightly sub-optimal (forgot tests). For the Cline pipeline: recommended planner seat. Local beats cloud by 6 pts overall, 3 of them on P04.
GLM 5.2 Q6 — best reasoner, out of production. 178/200 (89.0%) — highest score, out of competition (TTFT + decode too slow for an interactive pipeline). Qualitatively: finest P02 (names the trap first), best-specified global timeout in P04, the only “partial-CSV overwrite” risk in P03. Use: offline / batch planning, spec review — not pipeline execution.
MiniMax M3 cloud — the conservative senior. 170/200 (85.0%). 6 pts below local, 3 of them on P04 (unrequested features: env vars, advisory lock). Shares the local version’s blind spots. Prefer it only when the context forces use of the cloud API.
Hy3-preview Q9 — the paradox. 127/200 (63.5%), Junior. The 0/50 on P01 (code instead of a plan) is eliminatory. But: best P02 of the panel (46/50), the only model to ask the architecture-changing clarifying questions. Diagnosis: Hy3 thinks well (visible reasoning, 91/100 on judgment) but doesn’t stay in its lane — “Plan mode, don’t code” is treated as optional. A signal of architecture, not intelligence. Its natural role is execution, not planning.
Recommended routing
| Context | Recommended model |
|---|---|
| Interactive Cline pipeline (planner) | Kimi K2.7 Code or MiniMax M3 local |
| Very precise spec required (P04) | MiniMax M3 local (50/50) |
| Offline / batch planning | GLM 5.2 Q6 (best reasoner) |
| Cloud-only context | MiniMax M3 cloud |
| Coder in the pipeline | Kimi K2.7 Code or Nex N2 Pro Q9 |
| Avoid as planner | Hy3 (P01 mode-fail) |
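One way to make the routing table usable from a pipeline script is a plain lookup. This is only a sketch: the context keys, `ROUTING`, `AVOID_AS_PLANNER`, and `candidates` are hypothetical names introduced here, and the model names are copied from the table above.

```python
# Routing table transcribed from the recommendations; keys are hypothetical labels.
ROUTING = {
    "interactive_planner": ["Kimi K2.7 Code", "MiniMax M3 local"],
    "precise_spec": ["MiniMax M3 local"],
    "offline_batch": ["GLM 5.2 Q6"],
    "cloud_only": ["MiniMax M3 cloud"],
    "coder": ["Kimi K2.7 Code", "Nex N2 Pro Q9"],
}

# Models the scoreboard advises against for the planner seat.
AVOID_AS_PLANNER = {"Hy3"}


def candidates(context: str) -> list[str]:
    """Return the recommended models for a context, in table order."""
    try:
        return list(ROUTING[context])
    except KeyError:
        known = ", ".join(sorted(ROUTING))
        raise ValueError(f"unknown context {context!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(candidates("interactive_planner"))
```

Keeping the table as data rather than branching logic means a new bench run only changes the dictionary, not the calling code.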
Method notes
- 4 tests × 50 pts = 200 pts. All models tested in Plan mode (no code execution).
- P02 is the discriminator — it separates planners that ask the right questions from those that rush to a good plan.
- P01 is eliminatory — a mode-fail (code delivered) scores 0/50 regardless of content.
- Shared blind spot (except Hy3): the business-clarification reflex. Systemic fix: add to the system prompt — “Before planning a task with an implicit business constraint, state an explicit assumption and ask for validation.”
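Two of the method notes above translate directly into small helpers: the eliminatory rule for P01 and the suggested system-prompt fix. The sketch below assumes a rubric score has already been computed by the evaluator; `p01_score` and `with_clarification_rule` are hypothetical names, and only the rule text itself is taken from the note.

```python
# Rule text quoted from the method notes; everything else is illustrative.
CLARIFICATION_RULE = (
    "Before planning a task with an implicit business constraint, "
    "state an explicit assumption and ask for validation."
)


def p01_score(rubric_points: int, delivered_code: bool) -> int:
    """Apply the eliminatory rule: a mode-fail scores 0/50 regardless of content."""
    if not 0 <= rubric_points <= 50:
        raise ValueError(f"rubric_points must be in 0..50, got {rubric_points}")
    return 0 if delivered_code else rubric_points


def with_clarification_rule(system_prompt: str) -> str:
    """Append the clarification rule once; leave the prompt alone if already present."""
    if CLARIFICATION_RULE in system_prompt:
        return system_prompt
    base = system_prompt.rstrip()
    return f"{base}\n\n{CLARIFICATION_RULE}" if base else CLARIFICATION_RULE


if __name__ == "__main__":
    print(p01_score(44, delivered_code=False))
    print(with_clarification_rule("You are in Plan mode. Do not write code."))
```

Whether the added rule actually closes the blind spot would need a re-run of P02; the helper only shows where the text would go.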
TMB Planner Bench V1 · The Monocle Bear · June 2026. 5 models: GLM 5.2 Q6, Kimi K2.7 Code, MiniMax M3 local Q6H16, MiniMax M3 cloud, Hy3-preview Q9.