Inference modes

A model can be split two ways across the cluster. One slices every layer; the other hands whole layers down a line. Pick the wrong one and the load fails at init.

When OdyssAI-X loads a model across N nodes, it has to decide how to cut the weights. There are two strategies, and some models only accept one.

Tensor parallel

Each layer is sliced across all ranks — every node holds a shard of every layer and they exchange partial results on each token.

Best for classic dense and MoE models (Qwen, Llama, most of mlx-community).
The constraint: the model’s KV-head count must be divisible by the node count. A model with 8 KV heads shards cleanly across 2 or 4 nodes, but not across 3.
Cost: a collective (all-reduce) on every token. This is where the transport matters: jaccl over Thunderbolt 5 hides it far better than ring over TCP.

If you load a tensor-parallel model on a node count that doesn’t divide its KV heads, the runner fails at init with a Shape mismatch — that’s the signal to either change the node count or switch to pipeline.
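The divisibility rule is easy to check before you attempt a load. The sketch below is illustrative only: `can_tensor_shard` and `valid_tensor_node_counts` are hypothetical helper names, not part of OdyssAI-X, and they assume the KV-head count is the only constraint, as described above.

```python
def can_tensor_shard(num_kv_heads: int, num_nodes: int) -> bool:
    """True when every node would receive the same whole number of KV heads."""
    return num_nodes >= 1 and num_kv_heads % num_nodes == 0


def valid_tensor_node_counts(num_kv_heads: int, max_nodes: int) -> list[int]:
    """Node counts from 1 to max_nodes that divide the KV-head count evenly."""
    if num_kv_heads < 1 or max_nodes < 1:
        raise ValueError("num_kv_heads and max_nodes must be positive")
    return [n for n in range(1, max_nodes + 1) if can_tensor_shard(num_kv_heads, n)]


# Hypothetical example: a model with 8 KV heads on a cluster of up to 4 nodes.
print(valid_tensor_node_counts(8, 4))
print(can_tensor_shard(8, 3))
```

For the 8-head model in the example, 3 nodes is rejected while 1, 2, and 4 are accepted, which matches the rule stated in the list above.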

Pipeline parallel

The model’s layers are split across ranks — node 0 runs the first block of layers, node 1 the next, and so on. The activation is passed down the line.

No KV-head constraint. Any node count works.
Required for the big MoEs that ship a PipelineMixin: DeepSeek v2 / v3 / v3.2, GLM MoE (and MoE-Lite), Ministral-3, HunYuan-3, and similar frontier mixtures. Tensor parallel simply isn’t available for them.
Cost: the pipeline has a fill/drain bubble, so throughput is best when many tokens are in flight. Interactive single-stream use feels that bubble the most, but pipeline is still the only way to run a 400B+ MoE across a handful of Macs.
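To make the layer split and the bubble concrete, here is a minimal sketch. It assumes layers are divided into contiguous, near-equal blocks and that every stage costs the same; the runner's actual split may differ. `split_layers` and `bubble_fraction` are hypothetical names, and the bubble estimate is the textbook fill/drain model rather than a measurement of OdyssAI-X.

```python
def split_layers(num_layers: int, num_nodes: int) -> list[range]:
    """Assign contiguous layer blocks to ranks, as evenly as possible."""
    if num_nodes < 1 or num_layers < num_nodes:
        raise ValueError("need at least one layer per node")
    base, extra = divmod(num_layers, num_nodes)
    blocks, start = [], 0
    for rank in range(num_nodes):
        # The first `extra` ranks take one additional layer each.
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def bubble_fraction(num_nodes: int, in_flight: int) -> float:
    """Idle fraction of a simple fill/drain schedule with equal-cost stages."""
    if num_nodes < 1 or in_flight < 1:
        raise ValueError("num_nodes and in_flight must be positive")
    return (num_nodes - 1) / (in_flight + num_nodes - 1)


# Hypothetical example: 10 layers over 3 nodes, then 4 nodes with 1 vs 13 items in flight.
for rank, block in enumerate(split_layers(10, 3)):
    print(f"node {rank}: layers {block.start}..{block.stop - 1}")
print(bubble_fraction(4, 1), bubble_fraction(4, 13))
```

Under this model, a 4-node pipeline with a single item in flight leaves each node idle 75% of the time, and the idle share shrinks as more work is in flight.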

Which one do I get?

The load endpoint picks a sane default from the model architecture. To override it, for example when a big MoE needs pipeline, pass sharding explicitly:

curl -X POST http://<server>:8000/admin/<cluster>/load \
  -H 'Content-Type: application/json' \
  -d '{"model":"…DeepSeek-V3…","sharding":"pipeline"}'

If the model is…	Use
Dense or MoE with KV heads divisible by your node count	tensor (default)
A big MoE with a `PipelineMixin` (DeepSeek v2/v3, GLM MoE, HunYuan-3, …)	pipeline (required)
Dense but KV heads don’t divide your node count	pipeline, or change the node count
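The table can be read as a small decision function. The sketch below is an assumption-laden illustration, not the endpoint's actual logic: `choose_sharding` and `load_body` are hypothetical helpers, `has_pipeline_mixin` is a flag you would set yourself from the model family, and the value `"tensor"` for the `sharding` field is inferred from the table (only `"pipeline"` appears in the request shown above).

```python
import json


def choose_sharding(num_kv_heads: int, num_nodes: int, has_pipeline_mixin: bool) -> str:
    """Mirror the table: pipeline when required or when KV heads don't divide."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be positive")
    if has_pipeline_mixin:
        return "pipeline"  # tensor parallel is not available for these models
    if num_kv_heads % num_nodes == 0:
        return "tensor"  # assumed string for the default mode
    return "pipeline"  # alternatively, change the node count


def load_body(model: str, sharding: str) -> str:
    """Build the JSON body for the load request, using the fields from the curl example."""
    return json.dumps({"model": model, "sharding": sharding})


# Hypothetical example: a dense model with 8 KV heads on 3 nodes.
print(load_body("example-model", choose_sharding(8, 3, has_pipeline_mixin=False)))
```

The resulting string is what you would pass to `-d` in the curl command; the model name here is a placeholder.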

A note on the transport

Sharding (tensor vs pipeline) is what gets split; the backend (jaccl vs ring, see The cluster) is how the shards talk. They’re independent choices: you can run tensor-parallel over ring, or pipeline over jaccl. Tensor parallel is the one that benefits most from jaccl’s RDMA, because it does a collective on every single token.