Cluster health

A handful of checks cover the essentials: is the engine up, is the model warm, are the nodes alive, and is the RDMA mesh intact?

curl -s http://<server>:8000/health
# {"status":"idle|busy", "version":"…"}

idle = up, no active generation. busy = serving. No response = the container is down or restarting; check docker logs odyssai-odysseus.
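For scripting, the three outcomes above can be folded into one helper. The sketch below is illustrative only: `health_state` is a hypothetical name, and it assumes the response body carries the `"status"` field in the shape shown above. It extracts the field with sed rather than a JSON parser so that it needs nothing beyond a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper: classify a /health response body.
# Prints idle, busy, down (empty body), or unknown (anything else).
health_state() {
  body="$1"
  if [ -z "$body" ]; then
    echo "down"
    return 1
  fi
  # Pull the value of the "status" field; assumes the shape shown above.
  status=$(printf '%s' "$body" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
  case "$status" in
    idle|busy) echo "$status" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

# Usage (the host is a placeholder, as in the command above):
#   health_state "$(curl -s --max-time 5 "http://<server>:8000/health")"
```

An empty body is treated as "down" because curl -s prints nothing when the connection fails; --max-time keeps the call from hanging while the container restarts.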

curl -s http://<server>:8000/admin/clusters # per-cluster status
curl -s http://<server>:8000/v1/models # what's servable right now

A cluster with a loaded pool shows in both. Empty /v1/models = nothing loaded.

docker logs -f odyssai-odysseus

Each rank moves loading → idle. When all ranks are idle, the pool serves. The dashboard’s per-cluster card shows the same — plus live tok/s, the current phase (prefill / decode / streaming / idle), and per-pool activity.

for ip in <node-ips>; do
ssh admin@$ip "hostname; vm_stat | head -1"
done

A node that doesn’t answer SSH won’t be reachable by the orchestrator either.
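A dead node can make a plain ssh loop hang or prompt for a password. The sketch below is one way to guard against that, not the project's own tooling: `probe_node`, `sweep_nodes`, and the `PROBE` override are hypothetical names, while `BatchMode` and `ConnectTimeout` are standard OpenSSH options.

```shell
#!/bin/sh
# Hypothetical sweep: report which nodes answer, without hanging on a dead one.
# PROBE may name a replacement command (useful for dry runs); by default the
# probe is an ssh call that never prompts and gives up after 5 seconds.
probe_node() {
  if [ -n "$PROBE" ]; then
    "$PROBE" "$1"
  else
    ssh -o BatchMode=yes -o ConnectTimeout=5 "admin@$1" hostname
  fi
}

sweep_nodes() {
  failed=0
  for ip in "$@"; do
    if probe_node "$ip" >/dev/null 2>&1; then
      echo "$ip ok"
    else
      echo "$ip unreachable"
      failed=$((failed + 1))
    fi
  done
  # Non-zero exit status if any node failed to answer.
  [ "$failed" -eq 0 ]
}

# Usage: sweep_nodes <node-ips>
```

The exit status lets the sweep gate later steps, for example skipping a model load when any node is unreachable.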

On the engine nodes (jaccl clusters only):

ssh admin@<node> "ibv_devinfo 2>&1 | awk '/^hca_id/{n=\$2} /state:/{print n,\$2}'"

Every Thunderbolt port should report an active state. A port that’s down means a cable moved — rebuild the topology (Configurator → Topology → Rebuild).
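To surface only the ports that need attention, the same awk idea can be inverted. This is a sketch under stated assumptions: `inactive_ports` is a hypothetical name, and it relies on `PORT_ACTIVE` being the state string that ibv_devinfo prints for a healthy port.

```shell
#!/bin/sh
# Hypothetical filter: read ibv_devinfo output on stdin and print every
# device whose port state is anything other than PORT_ACTIVE.
# Exit status is non-zero if at least one such port was found.
inactive_ports() {
  awk '
    /^hca_id/      { dev = $2 }
    $1 == "state:" { if ($2 != "PORT_ACTIVE") { print dev, $2; bad = 1 } }
    END            { exit bad + 0 }
  '
}

# Usage: ssh admin@<node> ibv_devinfo 2>&1 | inactive_ports
```

Matching on the first field being exactly `state:` keeps the filter from also picking up `phys_state:` lines if ibv_devinfo is run in verbose mode.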

After many load/unload cycles on a jaccl cluster you may see errno 16 / 96 / 2 and failed collectives. It’s the known queue-pair quirk — reboot the nodes (dashboard → Reboot all) to reset the RDMA state. A long-lived loaded cluster is unaffected; the degradation accumulates across reloads. See The cluster.

  1. GET /health → idle/busy.
  2. GET /admin/clusters → each cluster’s pool present.
  3. Dashboard card → tok/s moving, phase sane, no stuck loading.
  4. If a load failed: docker logs for the rank that didn’t reach idle.
  • Troubleshooting — when a check comes back wrong.
  • The cluster — the RDMA and JACCL details.
  • Deploy — pushing a code change without breaking the running engine.