Inference modes

A model can be split two ways across the cluster. One slices every layer; the other hands whole layers down a line. Pick the wrong one and the load fails at init.

When OdyssAI-X loads a model across N nodes, it has to decide how to cut the weights. There are two strategies, and some models only accept one.

In tensor parallel, each layer is sliced across all ranks: every node holds a shard of every layer, and the nodes exchange partial results on every token.

  • Best for classic dense and MoE models (Qwen, Llama, most of mlx-community).
  • The constraint: the model’s KV-head count must be divisible by the node count. A model with 8 KV heads shards cleanly across 2 or 4 nodes, but not across 3.
  • Cost: an all-reduce collective on every token. This is where the transport matters: jaccl over Thunderbolt 5 hides that cost far better than ring over TCP.

If you load a model with tensor sharding on a node count that doesn’t divide its KV-head count, the runner fails at init with a Shape mismatch. That is the signal to either change the node count or switch to pipeline.
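The divisibility rule above can be checked before you ever hit the load endpoint. The following is a minimal sketch, not part of OdyssAI-X: the function names `tensor_shardable` and `valid_node_counts` are hypothetical, and you are assumed to already know the model's KV-head count (for example from its config file).

```python
def tensor_shardable(num_kv_heads: int, node_count: int) -> bool:
    """Return True when each node would receive a whole number of KV heads."""
    if num_kv_heads < 1 or node_count < 1:
        raise ValueError("num_kv_heads and node_count must be positive")
    return num_kv_heads % node_count == 0


def valid_node_counts(num_kv_heads: int, max_nodes: int) -> list[int]:
    """List every node count up to max_nodes that divides the KV heads evenly."""
    return [n for n in range(1, max_nodes + 1) if tensor_shardable(num_kv_heads, n)]


if __name__ == "__main__":
    # The 8-KV-head example from the text: 2 and 4 nodes work, 3 does not.
    for nodes in (2, 3, 4):
        print(nodes, tensor_shardable(8, nodes))
```

A check like this only predicts the KV-head constraint described here; it says nothing about memory or any other reason a load might fail.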

In pipeline parallel, the model’s layers are split across ranks: node 0 runs the first block of layers, node 1 the next, and so on. Activations are passed down the line from one node to the next.

  • No KV-head constraint. Any node count works.
  • Required for the big MoEs that ship a PipelineMixin: DeepSeek v2 / v3 / v3.2, GLM MoE (and MoE-Lite), Ministral-3, HunYuan-3, and similar frontier mixtures. Tensor parallel simply isn’t available for them.
  • Cost: the pipeline has a fill/drain bubble, so throughput is best when many tokens are in flight. Interactive single-stream use gets less out of it, but pipeline is still the only way to run a 400B+ MoE across a handful of Macs.
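To make the layer split and the bubble cost concrete, here is a rough sketch. It rests on two assumptions the text does not confirm: that layers are divided into contiguous, near-equal blocks (the actual split policy may differ), and that the bubble follows the textbook idealized estimate (p - 1) / (m + p - 1) for p stages and m units of work in flight, with every stage taking equal time. Both function names are hypothetical.

```python
def partition_layers(num_layers: int, num_ranks: int) -> list[range]:
    """Assign contiguous layer blocks to ranks; earlier ranks absorb the remainder.

    Assumption: an even contiguous split. The real runner may balance differently.
    """
    if num_ranks < 1 or num_layers < num_ranks:
        raise ValueError("need at least one layer per rank")
    base, extra = divmod(num_layers, num_ranks)
    blocks = []
    start = 0
    for rank in range(num_ranks):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def bubble_fraction(num_ranks: int, units_in_flight: int) -> float:
    """Idealized share of time a stage sits idle during pipeline fill and drain."""
    if num_ranks < 1 or units_in_flight < 1:
        raise ValueError("num_ranks and units_in_flight must be positive")
    return (num_ranks - 1) / (units_in_flight + num_ranks - 1)


if __name__ == "__main__":
    print(partition_layers(10, 3))   # three contiguous blocks covering layers 0..9
    print(bubble_fraction(4, 1))     # single stream on 4 nodes
    print(bubble_fraction(4, 29))    # many units in flight on 4 nodes
```

Under this idealized model, a single stream on 4 nodes leaves each stage idle 75% of the time, while 29 units in flight brings that below 10%, which is the sense in which throughput improves with more tokens in flight.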

The load endpoint picks a sane default from the model architecture. Override it explicitly when you need a different strategy, for example to force pipeline for a big MoE:

curl -X POST http://<server>:8000/admin/<cluster>/load \
-H 'Content-Type: application/json' \
-d '{"model":"…DeepSeek-V3…","sharding":"pipeline"}'
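The same request can be built from a script. This is a sketch using only the Python standard library; the helper name `build_load_request` is hypothetical, the host and cluster values are placeholders you supply, and accepting "tensor" as a `sharding` value is an assumption (the request shown here only demonstrates "pipeline"). The response format is not described here, so the sketch stops at building the request.

```python
import json
import urllib.request


def build_load_request(server: str, cluster: str, model: str,
                       sharding: str) -> urllib.request.Request:
    """Build the POST to /admin/<cluster>/load without sending it."""
    # Assumption: these are the only two accepted values.
    if sharding not in ("tensor", "pipeline"):
        raise ValueError(f"unknown sharding: {sharding!r}")
    body = json.dumps({"model": model, "sharding": sharding}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{server}:8000/admin/{cluster}/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own server, cluster, and model id.
    req = build_load_request("host.example", "demo", "some-model-id", "pipeline")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Keeping construction separate from sending makes it easy to log or inspect the payload before it reaches the cluster.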
If the model is… | Use
Dense or MoE with KV heads divisible by your node count | tensor (default)
A big MoE with a PipelineMixin (DeepSeek v2/v3, GLM MoE, HunYuan-3, …) | pipeline (required)
Dense but KV heads don’t divide your node count | pipeline, or change the node count
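The decision table can be written as a small function. This is an illustrative sketch only: `choose_sharding` and its `has_pipeline_mixin` flag are hypothetical names, and how the server actually detects a PipelineMixin is not described here.

```python
def choose_sharding(has_pipeline_mixin: bool, num_kv_heads: int,
                    node_count: int) -> str:
    """Mirror the decision table: return "tensor" or "pipeline"."""
    if num_kv_heads < 1 or node_count < 1:
        raise ValueError("num_kv_heads and node_count must be positive")
    # Big MoEs that ship a PipelineMixin have no tensor-parallel option.
    if has_pipeline_mixin:
        return "pipeline"
    # Tensor parallel needs the KV heads to divide evenly across nodes.
    if num_kv_heads % node_count == 0:
        return "tensor"
    # Otherwise fall back to pipeline (or change the node count instead).
    return "pipeline"


if __name__ == "__main__":
    print(choose_sharding(False, 8, 4))
    print(choose_sharding(False, 8, 3))
    print(choose_sharding(True, 8, 2))
```

The last branch picks pipeline, but the table's other option, changing the node count, may be the better choice when you can add or remove a node.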

Sharding (tensor vs pipeline) is what gets split; the backend (jaccl vs ring, see The cluster) is how the shards talk. They’re independent choices: you can run tensor-parallel over ring, or pipeline over jaccl. Tensor parallel is the one that benefits most from jaccl’s RDMA, because it does a collective on every single token.

  • The cluster: the transports, RDMA wiring, the JACCL quirk.
  • Troubleshooting: Shape mismatch at load and what it means.
  • HTTP API: the load endpoint and the rest of the surface.