Inference modes
A model can be split two ways across the cluster. One slices every layer; the other hands whole layers down a line. Pick the wrong one and the load fails at init.
When OdyssAI-X loads a model across N nodes, it has to decide how to cut the weights. There are two strategies, and some models only accept one.
Tensor parallel
Each layer is sliced across all ranks: every node holds a shard of every layer, and the ranks exchange partial results on each token.
- Best for classic dense and MoE models (Qwen, Llama, most of mlx-community).
- The constraint: the model’s KV head count must be divisible by the node count. A model with 8 KV heads shards cleanly across 2 or 4 nodes, but not across 3.
- Cost: a collective (all-reduce) on every token. This is where the transport matters: jaccl over Thunderbolt 5 hides it far better than ring over TCP.
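To make the per-token collective concrete, here is a toy sketch of one sharded layer in plain Python. It is not OdyssAI-X code: the helper names (`matvec`, `shard_rows`, `all_reduce_sum`) are hypothetical, and a real runner shards attention heads and MLP weights rather than a single small matrix. The point is only that each rank produces a partial result and the sum across ranks recovers the unsharded output.

```python
# Toy illustration only: one "layer" y = x @ W, split across ranks.
# All names here are hypothetical; this is not the runner's implementation.

def matvec(x, w):
    """Multiply row vector x (length k) by matrix w (k rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def shard_rows(values, world_size):
    """Split a list into world_size equal contiguous chunks."""
    if len(values) % world_size != 0:
        raise ValueError("dimension must be divisible by the rank count")
    step = len(values) // world_size
    return [values[r * step:(r + 1) * step] for r in range(world_size)]

def all_reduce_sum(partials):
    """Stand-in for the collective: element-wise sum of every rank's partial."""
    return [sum(column) for column in zip(*partials)]

x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]

world_size = 2
x_shards = shard_rows(x, world_size)   # each rank sees a slice of the input
w_shards = shard_rows(w, world_size)   # and the matching rows of the weights

# Each rank computes a partial output from its own shard.
partials = [matvec(xs, ws) for xs, ws in zip(x_shards, w_shards)]

# The all-reduce combines the partials; this step happens for every token.
sharded_result = all_reduce_sum(partials)
full_result = matvec(x, w)
```

In this sketch `sharded_result` equals `full_result`. The `all_reduce_sum` step is the part that crosses the wire on a real cluster.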
If you load a tensor-parallel model on a node count that doesn’t divide its KV
heads, the runner fails at init with a Shape mismatch — that’s the signal
to either change the node count or switch to pipeline.
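The divisibility rule is easy to check before you attempt a load. The sketch below is a standalone helper, not part of OdyssAI-X; the function names are hypothetical, and you would read the KV head count from the model's own configuration.

```python
def tensor_parallel_fits(num_kv_heads: int, node_count: int) -> bool:
    """Return True when the KV heads divide evenly across the nodes."""
    if num_kv_heads <= 0 or node_count <= 0:
        raise ValueError("num_kv_heads and node_count must be positive")
    return num_kv_heads % node_count == 0

def valid_node_counts(num_kv_heads: int, max_nodes: int) -> list[int]:
    """List every node count up to max_nodes that shards the KV heads cleanly."""
    return [n for n in range(1, max_nodes + 1)
            if tensor_parallel_fits(num_kv_heads, n)]

# The example from the text: 8 KV heads on clusters of up to 4 nodes.
print(valid_node_counts(8, 4))       # [1, 2, 4]
print(tensor_parallel_fits(8, 3))    # False
```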
Pipeline parallel
The model’s layers are split across ranks: node 0 runs the first block of layers, node 1 the next, and so on. The activation is passed down the line.
- No KV-head constraint. Any node count works.
- Required for the big MoEs that ship a PipelineMixin: DeepSeek v2 / v3 / v3.2, GLM MoE (and MoE-Lite), Ministral-3, HunYuan-3, and similar frontier mixtures. Tensor parallel simply isn’t available for them.
- Cost: the pipeline has a fill/drain bubble; throughput is best when many tokens are in flight. For interactive single-stream use it’s still the only way to run a 400 B+ MoE across a handful of Macs.
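The two ideas above, contiguous blocks of layers per rank and a fill/drain bubble, can be sketched in a few lines. This rests on assumptions: the source does not say how OdyssAI-X balances layers, so the near-even split here is only one plausible policy. The bubble estimate is the textbook figure for a simple micro-batched fill/drain schedule, with "micro-batches" standing in loosely for units of work in flight. The function names are hypothetical.

```python
def split_layers(num_layers: int, world_size: int) -> list[range]:
    """Assign contiguous layer ranges to ranks; earlier ranks absorb the remainder."""
    if world_size <= 0 or num_layers < world_size:
        raise ValueError("need at least one layer per rank")
    base, extra = divmod(num_layers, world_size)
    ranges = []
    start = 0
    for rank in range(world_size):
        count = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges

def bubble_fraction(world_size: int, micro_batches: int) -> float:
    """Idle fraction of a simple fill/drain schedule: (p - 1) / (m + p - 1)."""
    return (world_size - 1) / (micro_batches + world_size - 1)

# 10 layers over 3 nodes: node 0 gets layers 0-3, node 1 gets 4-6, node 2 gets 7-9.
for rank, layers in enumerate(split_layers(10, 3)):
    print(rank, list(layers))

print(bubble_fraction(4, 1))    # 0.75   (single stream: most stages sit idle)
print(bubble_fraction(4, 13))   # 0.1875 (more work in flight shrinks the bubble)
```

Under this model the idle fraction falls as more work is queued, which matches the note that throughput is best with many tokens in flight.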
Which one do I get?
The load endpoint picks a sane default from the model architecture. You override it explicitly when a big MoE needs pipeline:

    curl -X POST http://<server>:8000/admin/<cluster>/load \
      -H 'Content-Type: application/json' \
      -d '{"model":"…DeepSeek-V3…","sharding":"pipeline"}'

| If the model is… | Use |
|---|---|
| Dense or MoE with KV heads divisible by your node count | tensor (default) |
| A big MoE with a PipelineMixin (DeepSeek v2/v3, GLM MoE, HunYuan-3, …) | pipeline (required) |
| Dense but KV heads don’t divide your node count | pipeline, or change the node count |
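The table reduces to a small decision function. The sketch below mirrors it and builds a JSON body with the same `model` and `sharding` fields as the load request. It is written under stated assumptions: `choose_sharding`, `load_request_body`, and the `requires_pipeline` flag are hypothetical names, the model identifier is a placeholder, and `"tensor"` as an accepted `sharding` value is inferred from the table rather than confirmed by the source. Nothing is sent over the network.

```python
import json

def choose_sharding(num_kv_heads: int, node_count: int,
                    requires_pipeline: bool = False) -> str:
    """Mirror the table: pipeline if required or if the KV heads don't divide."""
    if requires_pipeline:                   # e.g. a big MoE with a PipelineMixin
        return "pipeline"
    if num_kv_heads % node_count == 0:      # heads shard cleanly
        return "tensor"
    return "pipeline"                       # or change the node count instead

def load_request_body(model: str, sharding: str) -> str:
    """Build the JSON body for the load endpoint (field names from the curl example)."""
    return json.dumps({"model": model, "sharding": sharding})

# Placeholder model name; substitute the identifier you actually load.
sharding = choose_sharding(num_kv_heads=8, node_count=3)
print(load_request_body("example-model", sharding))
```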
A note on the transport
Sharding (tensor vs pipeline) is what gets split; the backend (jaccl vs
ring, see The cluster) is how the shards talk. They’re
independent choices: you can run tensor-parallel over ring, or pipeline over
jaccl. Tensor parallel is the one that benefits most from jaccl’s RDMA, because
it does a collective on every single token.
Read next
- The cluster: the transports, RDMA wiring, the JACCL quirk.
- Troubleshooting: Shape mismatch at load and what it means.
- HTTP API: the load endpoint and the rest of the surface.