The cluster
Distributed inference is not free speed. It buys you models that don’t fit on one machine — and asks for cabling, coordination, and the occasional reboot in return.
OdyssAI-X spreads one model across 1–5 Mac Studios. The orchestrator (a cheap Mac mini) holds no weights; it SSH-spawns an MLX runner on each node and routes requests to them. This page is the mental model: what a cluster gives you, the two transports, and the one known sharp edge.
What you gain, what you pay
| You gain | You pay |
|---|---|
| Models that don’t fit one Mac (200 B–700 B MoE) | One Thunderbolt 5 cable per node-to-node link |
| More aggregate memory bandwidth | Coordination latency on every token (collectives) |
| Higher throughput on big models | A topology to build and keep wired correctly |
The 80/20 rule still holds: if your model fits one Mac, run Telemak. Reach for a cluster when the weights are bigger than your biggest machine.
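A rough way to apply that rule is to compare the size of the weights with the memory of your largest machine. The sketch below is a back-of-the-envelope estimate only: it counts weights and ignores KV cache and activations, and both `usable_fraction` and the 512 GB figure in the usage lines are illustrative assumptions, not values taken from OdyssAI-X.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def needs_cluster(
    params_billion: float,
    bits_per_weight: float,
    largest_node_gb: float,
    usable_fraction: float = 0.75,  # assumption: leave headroom for cache and OS
) -> bool:
    """True when the weights exceed the usable memory of the largest node."""
    budget_gb = largest_node_gb * usable_fraction
    return weight_footprint_gb(params_billion, bits_per_weight) > budget_gb


# Hypothetical 512 GB node: a 122 B model at 8 bits fits, a 700 B model does not.
print(needs_cluster(122, 8, largest_node_gb=512))
print(needs_cluster(700, 8, largest_node_gb=512))
```

If the answer is `False`, the single-machine path is likely the simpler choice.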
Two transports
A cluster picks a backend: how the nodes exchange tensors during a forward pass.
| Backend | What it is | When |
|---|---|---|
| ring | TCP collectives over normal Ethernet (~10 G). | The safe default. Always works, no special cabling. Throughput-limited. |
| jaccl | RDMA over Thunderbolt 5. ~2× faster on big models. | When you’ve wired a TB5 mesh and want the throughput. Has a known queue-pair quirk (below). |
You set the backend per cluster in ~/.odysseus/topology.yaml (the Configurator
writes it for you). Start on ring; move to jaccl once the mesh is cabled and
validated.
The JACCL queue-pair quirk
jaccl is faster, but after several consecutive load/unload cycles the RDMA
queue pairs can degrade — you’ll see errno 16 / 96 / 2 and failed collectives.
This is a known upstream MLX/JACCL bug in RDMA connection re-initialization, not
a data risk: it surfaces on model load/unload, not mid-inference.
The fix is a reboot. Rebooting the affected nodes resets the RDMA state. The dashboard has a Reboot all button for exactly this. In practice it’s an ops chore, not a stability problem — a long-lived loaded cluster runs fine; the degradation accumulates across many reloads.
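When triaging, it can help to see what the operating system calls those numbers. The helper below is a small sketch that assumes the values in the logs are standard OS errno codes. Names and messages differ between platforms, so run it on the affected node rather than on your laptop. `describe_errno` is a hypothetical helper, not part of OdyssAI-X.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Format an errno value with the local platform's symbolic name and message."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"errno {code} ({name}): {os.strerror(code)}"


# The three values mentioned above; output depends on the platform.
for code in (16, 96, 2):
    print(describe_errno(code))
```

Whatever the names turn out to be, the remedy stays the same: reboot the affected nodes.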
Fresh-node RDMA onboarding
A brand-new Mac runs the default Thunderbolt Bridge (bridge0), which gives
the TB ports no IPv6 link-local (fe80) address — and fe80 per port is
exactly what JACCL and the wiring auto-discovery need. No fe80, no RDMA mesh.
Provision the node once, at its console (never over SSH — the driver refuses
when SSH_CONNECTION is set, and a half-applied switch over SSH can strand the
machine): Configurator → node-setup → network. It installs a dedicated
odyssai network location that yields the fe80 addresses and re-asserts the
setup forever via a root LaunchDaemon. You do this once per node.
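To confirm that provisioning worked, you can check whether the addresses reported for the Thunderbolt ports include an IPv6 link-local entry. The sketch below only classifies address strings you have already collected, for example by copying them from the node's network settings. The interface names and addresses in the usage lines are made up for illustration.

```python
import ipaddress


def link_local_v6(addresses):
    """Return the entries that are IPv6 link-local (fe80::/10) addresses."""
    found = []
    for raw in addresses:
        # Drop an optional zone suffix such as "%en5" before parsing.
        bare = raw.split("%", 1)[0]
        try:
            addr = ipaddress.ip_address(bare)
        except ValueError:
            continue  # not an IP address at all
        # IPv4 169.254.0.0/16 is also "link-local", so check the version too.
        if addr.version == 6 and addr.is_link_local:
            found.append(raw)
    return found


# Hypothetical addresses gathered from one node.
seen = ["fe80::1c2b:3aff:fe4d:5e6f%en5", "192.168.1.10", "169.254.3.4"]
print(link_local_v6(seen))
```

An empty result for a Thunderbolt port suggests the node is still on the default bridge.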
Building and rebuilding the topology
The Configurator’s Topology → Build step probes the wiring (IPv6 neighbour
discovery on each TB5 link), generates the rdma_to: matrix, validates mesh
symmetry (every cable on both ends, N·(N−1) edges), and writes
~/.odysseus/topology.yaml — backing up the old one and preserving other
clusters.
Moved a cable or added a node? Topology → Rebuild re-probes, shows a before/after diff, re-validates, rewrites the file. No hand-editing.
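The symmetry rule is easy to state in code. The sketch below is not the Configurator's implementation. It assumes the rdma_to: matrix can be viewed as a mapping from each node to the list of peers it links to, which is a guess at the shape rather than the real file layout, and the node names are hypothetical.

```python
def validate_mesh(rdma_to):
    """Return a list of problems; an empty list means a full symmetric mesh."""
    nodes = set(rdma_to)
    errors = []
    for node, peers in rdma_to.items():
        for peer in peers:
            if peer == node:
                errors.append(f"{node}: link to itself")
            elif peer not in nodes:
                errors.append(f"{node}: unknown peer {peer}")
            elif node not in rdma_to[peer]:
                # A cable must be visible from both ends.
                errors.append(f"{node} -> {peer}: no reverse link")
    # A full mesh of N nodes has N * (N - 1) directed edges.
    edges = sum(len(set(peers)) for peers in rdma_to.values())
    expected = len(nodes) * (len(nodes) - 1)
    if edges != expected:
        errors.append(f"expected {expected} directed edges, found {edges}")
    return errors


full = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
broken = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
print(validate_mesh(full))
print(validate_mesh(broken))
```

In the `broken` example, node `c` does not report its link to `b`, so both the reverse-link check and the edge count flag it.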
How a model is split
Two sharding strategies decide how the weights spread across ranks (covered in detail in Inference modes):
- Tensor parallel: splits each layer across ranks. Requires the model’s KV-head count to be divisible by the node count. Suits classic dense and MoE models (Qwen, Llama).
- Pipeline parallel: splits the layers across ranks. No KV-head constraint. Required for the big MoEs that ship a PipelineMixin (DeepSeek v2/v3, GLM MoE, HunYuan-3, …).
The load endpoint picks a sane default; you override with "sharding":"pipeline"
in the load payload when a big MoE needs it.
Loading and watching
```bash
# Load (the model must already exist under models_dir on every node)
curl -X POST http://<server>:8000/admin/<cluster>/load \
  -H 'Content-Type: application/json' \
  -d '{"model":"mlx-community/Qwen3.5-122B-A10B-8bit"}'

# Watch each rank go loading → idle
docker logs -f odyssai-odysseus
```

When every rank reports idle, the cluster serves. The dashboard’s Argo card
shows per-pool activity, tokens/s, and the live phase.
Read next
- Inference modes: tensor vs pipeline, KV-heads, when to use which.
- HTTP API: the endpoints the cluster exposes.
- Troubleshooting: JACCL errno, shape mismatch, empty /v1/models.
- Install the stack: building the cluster with the Configurator.