Skip to content

OdyssAI-X — troubleshooting

Most failures fall into one of these. The boot log usually names the culprit.

Install & boot

Symptom	Fix
`docker compose up` exits immediately	Bad `~/.odysseus/topology.yaml` or an unreachable SSH target. `docker logs odyssai-odysseus` — the boot log names the bad key.
Engine up, `/v1/models` empty	No model loaded — run the load step (`POST /admin/<cluster>/load`).
Dashboard reachable, but a cluster shows “down”	The orchestrator can’t SSH a node, or the node’s runtime isn’t provisioned. Re-run the Engine bootstrap (idempotent).

Loading a model

Symptom	Fix
`Shape mismatch` at runner init	A sharding mismatch. Tensor parallel needs KV-heads divisible by the node count; big MoEs need pipeline — pass `"sharding":"pipeline"` in the load payload. See Inference modes.
Load hangs forever	The model isn’t present under `models_dir` on every node. Use the dashboard Sync matrix to rsync, or a shared mount.
Paths mismatch between nodes	`models_dir` must be the same path on every node.

RDMA / Thunderbolt

Symptom	Fix
Topology Build: “X cannot reach Y”	A TB5 cable is unplugged or a node is off. Fix the cabling, Build again.
Cluster silently falls back to `ring`, or Build fails on a fresh Mac	A brand-new Mac’s Thunderbolt ports have no IPv6 link-local (`fe80`) address, which JACCL needs. Provision the node once, at its console (never over SSH): Configurator → node-setup → network. See The cluster.
`errno 16 / 96 / 2` after several sessions	JACCL queue-pair degradation — a known upstream MLX bug on RDMA re-init. Reboot the affected nodes (dashboard → Reboot all). Not a data risk; it accumulates over many load/unload cycles.

Inference behaviour

Symptom	Fix
Reply is mostly `<think>…</think>`	A reasoner with `max_tokens` too low — bump it so there’s room after reasoning.
Thinking on when you didn’t ask	Reasoner models default `enable_thinking=true`. Set the server default in dashboard → Settings, or pass `enable_thinking:false` per request.
Gibberish / broken French	A quant too aggressive for stable output (try a higher quant), or a wrong chat template on the routed alias.

Cloud & LiteLLM

Symptom	Fix
A cloud alias doesn’t appear in `/v1/models`	Re-check the provider key in dashboard → Settings → Cloud providers. Aliases appear immediately on a valid key.
LiteLLM model update didn’t stick	Use `PATCH /model/{id}/update` — `PUT` returns success without persisting. (LiteLLM is a legacy fallback; prefer the engine’s own cloud passthrough.)

Read next

The cluster — RDMA, the JACCL quirk, fresh-node onboarding.
Inference modes — the Shape mismatch cause.
Cluster health — what to watch before things break.