OdyssAI-X — troubleshooting
Most failures fall into one of these. The boot log usually names the culprit.
Install & boot
Section titled “Install & boot”| Symptom | Fix |
|---|---|
docker compose up exits immediately | Bad ~/.odysseus/topology.yaml or an unreachable SSH target. docker logs odyssai-odysseus — the boot log names the bad key. |
Engine up, /v1/models empty | No model loaded — run the load step (POST /admin/<cluster>/load). |
| Dashboard reachable, but a cluster shows “down” | The orchestrator can’t SSH a node, or the node’s runtime isn’t provisioned. Re-run the Engine bootstrap (idempotent). |
Loading a model
Section titled “Loading a model”| Symptom | Fix |
|---|---|
Shape mismatch at runner init | A sharding mismatch. Tensor parallel needs KV-heads divisible by the node count; big MoEs need pipeline — pass "sharding":"pipeline" in the load payload. See Inference modes. |
| Load hangs forever | The model isn’t present under models_dir on every node. Use the dashboard Sync matrix to rsync, or a shared mount. |
| Paths mismatch between nodes | models_dir must be the same path on every node. |
RDMA / Thunderbolt
Section titled “RDMA / Thunderbolt”| Symptom | Fix |
|---|---|
| Topology Build: “X cannot reach Y” | A TB5 cable is unplugged or a node is off. Fix the cabling, Build again. |
Cluster silently falls back to ring, or Build fails on a fresh Mac | A brand-new Mac’s Thunderbolt ports have no IPv6 link-local (fe80) address, which JACCL needs. Provision the node once, at its console (never over SSH): Configurator → node-setup → network. See The cluster. |
errno 16 / 96 / 2 after several sessions | JACCL queue-pair degradation — a known upstream MLX bug on RDMA re-init. Reboot the affected nodes (dashboard → Reboot all). Not a data risk; it accumulates over many load/unload cycles. |
Inference behaviour
Section titled “Inference behaviour”| Symptom | Fix |
|---|---|
Reply is mostly <think>…</think> | A reasoner with max_tokens too low — bump it so there’s room after reasoning. |
| Thinking on when you didn’t ask | Reasoner models default enable_thinking=true. Set the server default in dashboard → Settings, or pass enable_thinking:false per request. |
| Gibberish / broken French | A quant too aggressive for stable output (try a higher quant), or a wrong chat template on the routed alias. |
Cloud & LiteLLM
Section titled “Cloud & LiteLLM”| Symptom | Fix |
|---|---|
A cloud alias doesn’t appear in /v1/models | Re-check the provider key in dashboard → Settings → Cloud providers. Aliases appear immediately on a valid key. |
| LiteLLM model update didn’t stick | Use PATCH /model/{id}/update — PUT returns success without persisting. (LiteLLM is a legacy fallback; prefer the engine’s own cloud passthrough.) |
Read next
Section titled “Read next”- The cluster — RDMA, the JACCL quirk, fresh-node onboarding.
- Inference modes — the
Shape mismatchcause. - Cluster health — what to watch before things break.