Cluster health
A handful of checks tell you everything: is the engine up, is the model warm, are the nodes alive, is the RDMA mesh intact.
Is the engine up?
    curl -s http://<server>:8000/health
    # {"status":"idle|busy", "version":"…"}

idle = up, no active generation. busy = serving. No response = the container
is down or restarting (docker logs odyssai-odysseus).
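The three outcomes above can be turned into a small classifier for use in scripts. This is a minimal sketch: `classify_health` is a hypothetical helper name, and it matches only the `status` field shown in the sample response, treating an empty body as "no response".

```shell
#!/usr/bin/env bash
# classify_health BODY
# Maps a raw /health response body to one of the cases described above.
# An empty body stands for "no response" (curl failed or timed out).
classify_health() {
  local body="$1"
  case "$body" in
    *'"status":"idle"'*|*'"status": "idle"'*)
      echo "up: idle, no active generation" ;;
    *'"status":"busy"'*|*'"status": "busy"'*)
      echo "up: busy, serving" ;;
    "")
      echo "down: no response, check docker logs odyssai-odysseus"
      return 1 ;;
    *)
      echo "unknown: unexpected body: $body"
      return 2 ;;
  esac
}

# Usage (SERVER is a placeholder you set yourself):
#   classify_health "$(curl -s --max-time 5 "http://$SERVER:8000/health")"
```

The non-zero return codes let a calling script branch on "down" without parsing the message.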
Are the clusters loaded?
    curl -s http://<server>:8000/admin/clusters   # per-cluster status
    curl -s http://<server>:8000/v1/models        # what's servable right now

A cluster with a loaded pool shows in both. Empty /v1/models = nothing loaded.
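To check the "empty /v1/models" case from a script, something like the following may help. It is a sketch under an assumption this page does not confirm: that /v1/models returns an OpenAI-style body with a top-level `"data"` array. `models_state` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# models_state: reads a /v1/models body on stdin and prints
# "empty", "loaded" or "unknown".
# ASSUMPTION: the body has an OpenAI-style top-level "data" array.
models_state() {
  local body
  # Strip whitespace so pretty-printed JSON matches the same patterns.
  body=$(tr -d ' \t\r\n')
  case "$body" in
    *'"data":[]'*) echo "empty" ;;
    *'"data":['*)  echo "loaded" ;;
    *)             echo "unknown" ;;
  esac
}

# Usage (SERVER is a placeholder you set yourself):
#   curl -s --max-time 5 "http://$SERVER:8000/v1/models" | models_state
```

If the real response has a different shape, the helper prints "unknown" rather than guessing.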
Watch a load
    docker logs -f odyssai-odysseus

Each rank moves loading → idle. When all ranks are idle, the pool serves. The
dashboard’s per-cluster card shows the same, plus live tok/s, the current phase
(prefill / decode / streaming / idle), and per-pool activity.
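If you would rather script the wait than watch the log, one option is to poll a readiness check until it succeeds or a timeout expires. This is a minimal sketch: `wait_for` is a hypothetical helper, not part of the engine, and the readiness check in the usage comment assumes /v1/models lists each servable model with an `"id"` field.

```shell
#!/usr/bin/env bash
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds.
# Returns 1 once roughly TIMEOUT_SECONDS have been spent sleeping.
# Use whole seconds of at least 1 for the interval.
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage (SERVER is a placeholder; the "id" check is an assumption):
#   pool_ready() {
#     curl -s --max-time 5 "http://$SERVER:8000/v1/models" | grep -q '"id"'
#   }
#   wait_for 600 5 pool_ready || echo "load did not finish in time"
```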
Are the nodes alive?
    for ip in <node-ips>; do
      ssh admin@$ip "hostname; vm_stat | head -1"
    done

A node that doesn’t answer SSH won’t be reachable by the orchestrator either.
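A dead node can make a plain SSH loop hang, so a scripted version usually wants a connection timeout and a summary of which nodes failed. The sketch below is one way to do that. `check_nodes`, `ssh_probe` and the `PROBE` override are hypothetical names introduced here; `BatchMode` and `ConnectTimeout` are standard OpenSSH options.

```shell
#!/usr/bin/env bash
# ssh_probe IP: succeeds if the node answers SSH within 5 seconds.
# BatchMode=yes makes ssh fail instead of prompting for a password.
ssh_probe() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "admin@$1" "hostname" >/dev/null 2>&1
}

# check_nodes IP...
# Prints one line per node and returns non-zero if any node did not answer.
# PROBE (hypothetical hook) swaps the probe, e.g. for a dry run.
check_nodes() {
  local probe="${PROBE:-ssh_probe}" failed=0 ip
  for ip in "$@"; do
    if "$probe" "$ip"; then
      echo "ok   $ip"
    else
      echo "DOWN $ip"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: check_nodes <node-ips>
```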
Is the RDMA mesh intact?
On the engine nodes (jaccl clusters only):

    ssh admin@<node> "ibv_devinfo 2>&1 | awk '/^hca_id/{n=\$2} /state:/{print n,\$2}'"

Every Thunderbolt port should report an active state. A port that’s down means a cable moved; rebuild the topology (Configurator → Topology → Rebuild).
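The command above prints one `<device> <state>` pair per port. To flag only the ports that are not active, its output can be piped through a small filter. This is a sketch: `down_ports` is a hypothetical helper, and it assumes an active port is reported as `PORT_ACTIVE`, the state string in standard `ibv_devinfo` output.

```shell
#!/usr/bin/env bash
# down_ports: reads "<device> <state>" lines on stdin, prints the ones
# whose state is not PORT_ACTIVE, and exits non-zero if it printed any.
# ASSUMPTION: active ports are reported as PORT_ACTIVE.
down_ports() {
  awk 'NF >= 2 && $2 != "PORT_ACTIVE" { print; bad = 1 }
       END { exit bad + 0 }'
}

# Usage (NODE is a placeholder you set yourself):
#   ssh "admin@$NODE" "ibv_devinfo 2>&1 | awk '/^hca_id/{n=\$2} /state:/{print n,\$2}'" \
#     | down_ports || echo "at least one port is not active"
```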
The recurring one: JACCL degradation
After many load/unload cycles on a jaccl cluster you may see errno 16 / 96 / 2
and failed collectives. It’s the known queue-pair quirk: reboot the nodes
(dashboard → Reboot all) to reset the RDMA state. A long-lived loaded cluster
is unaffected; the degradation accumulates across reloads. See
The cluster.
A quick health routine
- GET /health → idle/busy.
- GET /admin/clusters → each cluster’s pool present.
- Dashboard card → tok/s moving, phase sane, no stuck loading.
- If a load failed: docker logs for the rank that didn’t reach idle.
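The scriptable parts of this routine can be wrapped in a small pass/fail runner. This is a minimal sketch: `run_step` and the `failures` counter are hypothetical names, and the dashboard check stays manual.

```shell
#!/usr/bin/env bash
# run_step LABEL COMMAND [ARGS...]
# Runs one check quietly, prints PASS or FAIL, and counts failures.
failures=0
run_step() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    failures=$((failures + 1))
  fi
}

# Usage (SERVER is a placeholder; curl -f fails on HTTP error statuses):
#   run_step "engine answers /health" \
#     curl -sf --max-time 5 "http://$SERVER:8000/health"
#   run_step "admin API answers /admin/clusters" \
#     curl -sf --max-time 5 "http://$SERVER:8000/admin/clusters"
#   [ "$failures" -eq 0 ] || echo "$failures check(s) failed"
```

These two calls only confirm that the endpoints respond; whether each cluster's pool is present still has to be read from the /admin/clusters output.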
Read next
- Troubleshooting: when a check comes back wrong.
- The cluster: the RDMA and JACCL details.
- Deploy: pushing a code change without breaking the running engine.