OdyssAI-X — troubleshooting

Most failures fall into one of the categories below. The boot log usually names the culprit.

| Symptom | Fix |
| --- | --- |
| `docker compose up` exits immediately | Bad `~/.odysseus/topology.yaml` or an unreachable SSH target. Run `docker logs odyssai-odysseus`; the boot log names the bad key. |
| Engine up, `/v1/models` empty | No model is loaded. Run the load step (`POST /admin/<cluster>/load`). |
| Dashboard reachable, but a cluster shows “down” | The orchestrator can’t SSH into a node, or the node’s runtime isn’t provisioned. Re-run the Engine bootstrap (it is idempotent). |

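
The load step can be scripted. The sketch below only builds the `POST /admin/<cluster>/load` request so it can be inspected before anything is sent. The base URL, the cluster name `studio`, and the `model` body field are assumptions for illustration, not confirmed parts of the engine's API.

```python
import json
import urllib.request


def build_load_request(base_url: str, cluster: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /admin/<cluster>/load."""
    # ASSUMPTION: the payload names the model under a "model" key.
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/admin/{cluster}/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, cluster, and model names.
req = build_load_request("http://localhost:8000", "studio", "my-model")
print(req.get_method(), req.full_url)
# prints: POST http://localhost:8000/admin/studio/load
```

Once the values match your setup, `urllib.request.urlopen(req)` sends the request.
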
| Symptom | Fix |
| --- | --- |
| Shape mismatch at runner init | A sharding mismatch. Tensor parallel needs KV-heads divisible by the node count; big MoEs need pipeline: pass `"sharding":"pipeline"` in the load payload. See Inference modes. |
| Load hangs forever | The model isn’t present under `models_dir` on every node. Use the dashboard Sync matrix to rsync it, or use a shared mount. |
| Paths mismatch between nodes | `models_dir` must be the same path on every node. |

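
The divisibility rule for tensor parallel can be checked before loading. This is a minimal sketch with a hypothetical helper name. It assumes the engine uses tensor parallel when no `sharding` key is sent, and it only ever emits the `"sharding":"pipeline"` value named above.

```python
def sharding_overrides(num_kv_heads: int, node_count: int, large_moe: bool = False) -> dict:
    """Return extra load-payload fields needed for this model/cluster pair."""
    if num_kv_heads < 1 or node_count < 1:
        raise ValueError("num_kv_heads and node_count must be positive")
    # Tensor parallel splits KV heads evenly across nodes, so the head
    # count must be divisible by the node count.
    tensor_parallel_ok = num_kv_heads % node_count == 0
    if large_moe or not tensor_parallel_ok:
        return {"sharding": "pipeline"}
    # ASSUMPTION: omitting "sharding" leaves the engine on its default mode.
    return {}


print(sharding_overrides(num_kv_heads=8, node_count=4))  # prints: {}
print(sharding_overrides(num_kv_heads=8, node_count=3))  # prints: {'sharding': 'pipeline'}
```

Merge the returned dict into the load payload before sending it.
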
| Symptom | Fix |
| --- | --- |
| Topology Build: “X cannot reach Y” | A TB5 cable is unplugged or a node is off. Fix the cabling, then Build again. |
| Cluster silently falls back to ring, or Build fails on a fresh Mac | A brand-new Mac’s Thunderbolt ports have no IPv6 link-local (`fe80`) address, which JACCL needs. Provision the node once, at its console (never over SSH): Configurator → node-setup → network. See The cluster. |
| errno 16 / 96 / 2 after several sessions | JACCL queue-pair degradation, a known upstream MLX bug on RDMA re-init. Reboot the affected nodes (dashboard → Reboot all). Not a data risk; it accumulates over many load/unload cycles. |

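
To confirm whether an interface already has the link-local address JACCL needs, the addresses reported for it can be tested with Python's standard `ipaddress` module. This is a sketch only: how you collect the address list (for example from `ifconfig` output) is left out, and the interface name `en5` in the example is hypothetical.

```python
import ipaddress


def has_ipv6_link_local(addresses: list[str]) -> bool:
    """Return True if any address is an IPv6 link-local (fe80::/10) address."""
    for raw in addresses:
        # Drop a zone index such as "%en5" before parsing.
        host = raw.split("%", 1)[0]
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP address, ignore it
        # IPv4 169.254.x.x is also "link-local", so check the version too.
        if ip.version == 6 and ip.is_link_local:
            return True
    return False


print(has_ipv6_link_local(["169.254.10.2", "fe80::1%en5"]))  # prints: True
print(has_ipv6_link_local(["169.254.10.2", "2001:db8::1"]))  # prints: False
```
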
| Symptom | Fix |
| --- | --- |
| Reply is mostly `<think>…</think>` | A reasoner with `max_tokens` too low. Raise it so there’s room for the answer after the reasoning. |
| Thinking on when you didn’t ask | Reasoner models default to `enable_thinking=true`. Set the server default in dashboard → Settings, or pass `enable_thinking:false` per request. |
| Gibberish / broken French | A quant too aggressive for stable output (try a higher quant), or a wrong chat template on the routed alias. |

| Symptom | Fix |
| --- | --- |
| A cloud alias doesn’t appear in `/v1/models` | Re-check the provider key in dashboard → Settings → Cloud providers. Aliases appear immediately on a valid key. |
| LiteLLM model update didn’t stick | Use `PATCH /model/{id}/update`; `PUT` returns success without persisting. (LiteLLM is a legacy fallback; prefer the engine’s own cloud passthrough.) |
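
For the LiteLLM case, the point is the HTTP method. The sketch below builds a `PATCH` request to `/model/{id}/update` without sending it. The base URL, model id, API key, bearer-token header, and the `model_name` field in the example body are assumptions for illustration; take the real body shape from the LiteLLM documentation.

```python
import json
import urllib.request


def build_model_update(base_url: str, model_id: str, fields: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH to /model/{id}/update."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/model/{model_id}/update",
        data=json.dumps(fields).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: the proxy expects a bearer token.
            "Authorization": f"Bearer {api_key}",
        },
        method="PATCH",  # PUT reports success but does not persist
    )


# Hypothetical values throughout.
update = build_model_update(
    "http://localhost:4000", "abc123", {"model_name": "my-alias"}, "sk-example"
)
print(update.get_method(), update.full_url)
# prints: PATCH http://localhost:4000/model/abc123/update
```
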