Skip to content

Telemak — performance

Numbers from real hardware, not a spec sheet. Your mileage moves with the model, the quant, and how much you ask it to write.

On an M3 Ultra 96 GB, Telemak (native Swift) serving a Qwen3.6-35B-A3B MoE 8-bit reaches ~66 tok/s decode, with TTFT around 3.9 s cold. That’s the peak-throughput config in our own benchmarks — a MoE with few active parameters on a single big-memory Mac is where Telemak shines.

For comparison, a dense 31 B model 8-bit on a single node runs closer to ~18 tok/s — dense models move every parameter per token, so they’re slower than a same-size MoE.

ClassExampleRough decode
MoE, few active paramsQwen3.6-35B-A3B 8-bit~60–66 tok/s
Dense mid-sizea 31 B dense 8-bit~18 tok/s
Big MoE80–122 B MoE 8-bitlower, memory-bound on bigger weights

These are single-stream interactive numbers. Treat them as the shape of what to expect, then measure your own model on your own Mac.

Telemak keeps models in wired memory so the GPU always has them. The practical ceiling is your RAM minus headroom for macOS and anything else running. Telemak tracks each model’s wired_limit_mb and won’t let the OS page weights out — which is exactly what keeps tok/s stable instead of collapsing when memory gets tight.

A rough fit guide (leave ~10–20% headroom):

RAMComfortable load
64 GBone 35 B MoE 8-bit, or two ~30 B co-loaded (chat + embedder)
96–128 GBa 70–80 B MoE 8-bit
256 GBa 122 B MoE 8-bit
512 GBa 200 B+ MoE (mixed-quant)

The first turn of a long conversation pays full prefill. The second turn reuses the on-disk KV prefix (~/.telemak/sessions/) and comes back much faster — in practice several times faster on a multi-thousand-token context. This is why keeping a model warm and a conversation going beats reloading.

Telemak runs in production on M3 Ultra and M3 Max machines. A recorded endurance mark: 96 h 44 min non-stop, 23 549 inferences, zero failed requests, with two 30 B models co-loaded (73.9 / 96 GB wired). The LaunchAgent restarts the daemon on crash and across reboots, so uptime is the daemon’s job, not yours.

  • Architecture — why wired memory and the KV cache behave this way.
  • The menu bar — watching tok/s and the wired footprint live.
  • Comparison — Telemak vs Ollama, LM Studio, vLLM, exo.