The cluster

Distributed inference is not free speed. It buys you models that don’t fit on one machine — and asks for cabling, coordination, and the occasional reboot in return.

OdyssAI-X spreads one model across 1–5 Mac Studios. The orchestrator (a cheap Mac mini) holds no weights; it SSH-spawns an MLX runner on each node and routes requests to them. This page is the mental model: what a cluster gives you, the two transports, and the one known sharp edge.

What you gain, what you pay

You gain	You pay
Models that don’t fit one Mac (200B–700B-parameter MoEs)	With jaccl, one Thunderbolt 5 cable per node pair
More aggregate memory bandwidth	Coordination latency on every token (collectives)
Higher throughput on big models	A topology to build and keep wired correctly

The 80/20 rule still holds: if your model fits one Mac, run Telemak. Reach for a cluster when the weights are bigger than your biggest machine.
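A quick way to decide is to compare the weight footprint with a node's memory. The sketch below assumes weights dominate (parameters × bits ÷ 8 bytes) and uses a hypothetical 75% usable-memory factor to leave headroom for the OS, KV cache and activations. Neither number comes from OdyssAI-X; tune them for your machines.

```python
import math


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: parameters x bytes per weight."""
    return params_billion * bits_per_weight / 8


def nodes_needed(params_billion: float, bits_per_weight: float,
                 mem_per_node_gb: float, usable_fraction: float = 0.75) -> int:
    """Smallest node count whose usable memory holds the weights.

    usable_fraction is a hypothetical headroom factor, not a product setting.
    """
    usable_gb = mem_per_node_gb * usable_fraction
    return math.ceil(weight_gb(params_billion, bits_per_weight) / usable_gb)
```

For the 8-bit 122B Qwen model used later on this page, `weight_gb(122, 8)` is about 122 GB, so on a hypothetical 256 GB node `nodes_needed(122, 8, 256)` returns 1 and Telemak is the better fit. A 700B model at 8 bits on the same nodes returns 4.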

Two transports

A cluster picks a backend — how the nodes exchange tensors during a forward pass.

Backend	What it is	When
`ring`	TCP collectives over normal Ethernet (~10 GbE).	The safe default. Always works, no special cabling. Throughput-limited.
`jaccl`	RDMA over Thunderbolt 5. ~2× faster on big models.	When you’ve wired a TB5 mesh and want the throughput. Has a known queue-pair quirk (below).

You set the backend per cluster in ~/.odysseus/topology.yaml (the Configurator writes it for you). Start on ring; move to jaccl once the mesh is cabled and validated.

The JACCL queue-pair quirk

jaccl is faster, but after several consecutive load/unload cycles the RDMA queue pairs can degrade — you’ll see errno 16 / 96 / 2 and failed collectives. This is a known upstream MLX/JACCL bug in RDMA connection re-initialization, not a data risk: it surfaces on model load/unload, not mid-inference.

The fix is a reboot. Rebooting the affected nodes resets the RDMA state. The dashboard has a Reboot all button for exactly this. In practice it’s an ops chore, not a stability problem — a long-lived loaded cluster runs fine; the degradation accumulates across many reloads.
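If you want to spot the quirk in logs before a load fails outright, a small scanner is enough. This is a sketch under assumptions: it supposes failing lines contain text like `errno 16` (the real MLX/JACCL message format may differ), and the reboot threshold is a made-up policy knob. The remedy itself is still a reboot of the affected nodes.

```python
import re

# errno values this page associates with degraded JACCL queue pairs.
QP_ERRNOS = {16, 96, 2}

# Assumed log shape: "errno <N>", "errno=<N>" or "errno: <N>" somewhere in a line.
ERRNO_RE = re.compile(r"\berrno[ =:]*(\d+)\b", re.IGNORECASE)


def qp_errno_hits(log_lines):
    """Count mentions of each queue-pair errno across the given log lines."""
    hits = {}
    for line in log_lines:
        for match in ERRNO_RE.finditer(line):
            code = int(match.group(1))
            if code in QP_ERRNOS:
                hits[code] = hits.get(code, 0) + 1
    return hits


def should_reboot(log_lines, threshold=1):
    """Suggest a reboot once any queue-pair errno appears `threshold` times."""
    return any(count >= threshold for count in qp_errno_hits(log_lines).values())
```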

Fresh-node RDMA onboarding

A brand-new Mac runs the default Thunderbolt Bridge (bridge0), which leaves the TB ports without an IPv6 link-local (fe80) address. A per-port fe80 address is exactly what JACCL and the wiring auto-discovery need: no fe80, no RDMA mesh.
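To check whether a node's TB ports already have fe80 addresses, you can parse `ifconfig` output. In this sketch the sample text is invented for illustration, and which `enN` interfaces are your TB5 ports varies per machine, so the port list is an input you supply.

```python
import ipaddress

# Made-up sample of macOS `ifconfig` output (not captured from a real node):
# en2 has a link-local address, en3 does not.
SAMPLE = """\
bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\tether 36:aa:bb:cc:dd:01
en2: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\tinet6 fe80::1c2b:3dff:fe4e:5f60%en2 prefixlen 64 scopeid 0x12
en3: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\tether 36:aa:bb:cc:dd:03
"""


def link_local_by_iface(ifconfig_text):
    """Map each interface in `ifconfig` output to its IPv6 link-local addresses."""
    result = {}
    current = None
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            # Interface header line, e.g. "en2: flags=... mtu 1500".
            current = line.split(":", 1)[0]
            result[current] = []
        elif current and line.strip().startswith("inet6 "):
            addr = line.split()[1].split("%", 1)[0]  # drop the %scope suffix
            if ipaddress.IPv6Address(addr).is_link_local:
                result[current].append(addr)
    return result


def ports_missing_fe80(ifconfig_text, ports):
    """Return the given ports that have no fe80 link-local address."""
    table = link_local_by_iface(ifconfig_text)
    return [port for port in ports if not table.get(port)]
```

On a real node you would pass the live output instead, for example `subprocess.run(["ifconfig"], capture_output=True, text=True).stdout`.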

Provision each node once, at its own console: Configurator → node-setup → network. Never run this over SSH. The driver refuses when SSH_CONNECTION is set, and a half-applied network switch over SSH can strand the machine. The step installs a dedicated odyssai network location that yields the fe80 addresses, and a root LaunchDaemon keeps re-asserting that setup from then on.
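The SSH rule is simple to express in code. The guard below is a hypothetical illustration of that check, not the Configurator's implementation: it refuses whenever SSH_CONNECTION is present in the environment, a variable OpenSSH sets for the sessions it starts.

```python
import os


def refuse_over_ssh(environ=None):
    """Raise if this process appears to run inside an SSH session.

    `environ` defaults to the real process environment; pass a dict to test.
    """
    env = os.environ if environ is None else environ
    if "SSH_CONNECTION" in env:
        raise RuntimeError(
            "network setup must be run at the node's console, not over SSH "
            "(SSH_CONNECTION is set)"
        )
```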

Building and rebuilding the topology

The Configurator’s Topology → Build step probes the wiring (IPv6 neighbour discovery on each TB5 link), generates the rdma_to: matrix, validates mesh symmetry (every cable on both ends, N·(N−1) edges), and writes ~/.odysseus/topology.yaml — backing up the old one and preserving other clusters.
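The symmetry rule can be sketched as a check over the rdma_to matrix. The matrix shape here (node → peer → local TB5 interface) and the node names are assumptions for illustration; the real topology.yaml layout may differ.

```python
def validate_mesh(rdma_to):
    """Return a list of problems with a full-mesh rdma_to matrix (empty = OK).

    Assumed shape: rdma_to[node][peer] = local TB5 interface on `node`
    cabled to `peer`.
    """
    nodes = set(rdma_to)
    n = len(nodes)
    problems = []
    for a in sorted(nodes):
        expected = nodes - {a}
        listed = set(rdma_to[a])
        for b in sorted(expected - listed):
            problems.append(f"missing edge {a} -> {b}")
        for b in sorted(listed - expected):
            problems.append(f"unexpected edge {a} -> {b}")
    # A full mesh of N nodes has N*(N-1) directed edges: each cable seen from both ends.
    edges = sum(len(peers) for peers in rdma_to.values())
    if edges != n * (n - 1):
        problems.append(f"expected {n * (n - 1)} directed edges, found {edges}")
    return problems
```

A cable that only one end reports shows up both as a missing edge and as a short edge count.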

Moved a cable or added a node? Topology → Rebuild re-probes, shows a before/after diff, re-validates, rewrites the file. No hand-editing.
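A before/after diff like the one Rebuild shows can be computed by treating each matrix as a set of directed edges. This sketch assumes rdma_to maps node → peer → local interface, which may not match the real file; with that shape, a cable moved to another port appears as one removed and one added edge.

```python
def diff_topology(before, after):
    """Compare two rdma_to matrices edge by edge.

    Assumed shape: matrix[node][peer] = local interface. Returns the directed
    (node, peer, interface) edges that were added and removed.
    """
    def edges(matrix):
        return {
            (node, peer, iface)
            for node, peers in matrix.items()
            for peer, iface in peers.items()
        }

    old, new = edges(before), edges(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```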

How a model is split

Two sharding strategies decide how the weights spread across ranks — covered in detail in Inference modes:

Tensor parallel: splits each layer across ranks. The model’s KV-head count must be divisible by the node count. Suits classic dense and MoE models (Qwen, Llama).
Pipeline parallel: splits the layers across ranks. No KV-head constraint. Required for the big MoEs that ship a PipelineMixin (DeepSeek v2/v3, GLM MoE, HunYuan-3, …).
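The two constraints can be folded into a small decision function. It is a hypothetical policy, not the load endpoint's real default logic: it prefers tensor parallel when the KV heads divide evenly and falls back to pipeline when the model supports it.

```python
def pick_sharding(num_kv_heads, num_nodes, supports_pipeline):
    """Pick 'tensor' or 'pipeline' from the constraints described above."""
    if num_kv_heads % num_nodes == 0:
        return "tensor"      # every rank gets the same number of KV heads
    if supports_pipeline:
        return "pipeline"    # whole layers per rank, no KV-head constraint
    raise ValueError(
        f"{num_kv_heads} KV heads do not split evenly over {num_nodes} nodes "
        "and the model has no pipeline support"
    )
```

For example, 8 KV heads split cleanly over 4 nodes but not over 3. For the big MoEs listed above, the safer move is still to request pipeline explicitly rather than rely on any default.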

The load endpoint picks a sane default; you can override it with "sharding":"pipeline" in the load payload when a big MoE needs it.
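The override is one more key in the JSON body. This sketch builds the same POST as the curl example in the next section using only the standard library; the host and cluster names in the usage are placeholders, and the function name is hypothetical.

```python
import json
import urllib.request


def build_load_request(server, cluster, model, sharding=None):
    """Build (but do not send) a POST to /admin/<cluster>/load."""
    payload = {"model": model}
    if sharding is not None:
        payload["sharding"] = sharding  # e.g. "pipeline" for a big MoE
    return urllib.request.Request(
        f"http://{server}:8000/admin/{cluster}/load",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is `urllib.request.urlopen(build_load_request("orchestrator.local", "my-cluster", model, sharding="pipeline"))`, where both names are stand-ins for your own server and cluster.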

Loading and watching

```bash
# Load (the model must already exist under models_dir on every node)
curl -X POST http://<server>:8000/admin/<cluster>/load \
  -H 'Content-Type: application/json' \
  -d '{"model":"mlx-community/Qwen3.5-122B-A10B-8bit"}'

# Watch each rank go loading → idle
docker logs -f odyssai-odysseus
```

When every rank reports idle, the cluster serves. The dashboard’s Argo card shows per-pool activity, tokens/s, and the live phase.
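The readiness rule (every rank idle) can be expressed as a fold over status events. How you extract (rank, phase) pairs from `docker logs` is left open because the log format is not documented here; the pair shape and the function name are assumptions.

```python
def cluster_ready(events, num_ranks):
    """True once the latest reported phase of every rank 0..num_ranks-1 is 'idle'.

    events: iterable of (rank, phase) pairs in arrival order.
    """
    latest = {}
    for rank, phase in events:
        latest[rank] = phase  # later reports overwrite earlier ones
    return all(latest.get(rank) == "idle" for rank in range(num_ranks))
```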