Telemak — performance
Numbers from real hardware, not a spec sheet. Your mileage moves with the model, the quant, and how much you ask it to write.
A measured data point
On an M3 Ultra with 96 GB, Telemak (native Swift) serving Qwen3.6-35B-A3B (MoE, 8-bit) reaches ~66 tok/s decode, with a cold time to first token (TTFT) of around 3.9 s. That’s the peak-throughput config in our own benchmarks: a MoE with few active parameters on a single big-memory Mac is where Telemak shines.
For comparison, a dense 31 B model at 8-bit on a single node runs at ~18 tok/s. Dense models read every parameter for each token, so they decode more slowly than a same-size MoE.
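A back-of-envelope bandwidth bound shows why the gap points in this direction. If decode is limited by how fast weights stream from memory, tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The sketch below is an illustration, not Telemak code. The function name is hypothetical, and the 800 GB/s figure is a placeholder to replace with your own machine's bandwidth. It assumes 8-bit means about one byte per weight and that "A3B" denotes roughly 3 B active parameters. It ignores KV-cache reads, quantization metadata, and compute.

```python
def decode_ceiling_tok_s(active_params_b: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    # Billions of parameters times bytes per weight gives GB read per token.
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumption: placeholder bandwidth, not a measured or quoted figure.
BANDWIDTH_GB_S = 800.0

dense_ceiling = decode_ceiling_tok_s(31, 8, BANDWIDTH_GB_S)  # all 31 B weights read
moe_ceiling = decode_ceiling_tok_s(3, 8, BANDWIDTH_GB_S)     # ~3 B active weights read

print(f"dense 31 B ceiling:      {dense_ceiling:.1f} tok/s")
print(f"MoE ~3 B active ceiling: {moe_ceiling:.1f} tok/s")
```

A bound like this only explains the direction of the gap. Measured throughput, such as the ~66 and ~18 tok/s figures here, depends on much more than weight reads.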
| Class | Example | Rough decode |
|---|---|---|
| MoE, few active params | Qwen3.6-35B-A3B 8-bit | ~60–66 tok/s |
| Dense mid-size | a 31 B dense 8-bit | ~18 tok/s |
| Big MoE | 80–122 B MoE 8-bit | lower, memory-bound on bigger weights |
These are single-stream interactive numbers. Treat them as the shape of what to expect, then measure your own model on your own Mac.
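To measure your own setup, record a timestamp when the request is sent and another as each streamed token arrives, then reduce them to the two numbers quoted above. The helper below is a minimal sketch under that assumption. `RunStats` and `summarize_run` are hypothetical names, and how you collect the timestamps (for example with `time.perf_counter()` in your client) is up to you.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunStats:
    ttft_s: float        # time from request start to the first token
    decode_tok_s: float  # steady-state rate after the first token


def summarize_run(request_start: float, token_times: Sequence[float]) -> RunStats:
    """Reduce per-token arrival times (in seconds) to TTFT and decode tok/s."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute a decode rate")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    if decode_window <= 0:
        raise ValueError("token timestamps must be increasing")
    # The first token belongs to prefill, so only the intervals after it count.
    return RunStats(ttft, (len(token_times) - 1) / decode_window)


# Toy timestamps: first token after 2.0 s, then one token every 0.5 s.
stats = summarize_run(0.0, [2.0, 2.5, 3.0, 3.5])
print(stats)
```

Keeping the first token out of the decode rate stops a long prefill from dragging down the tok/s figure, which is why TTFT is reported separately.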
The wired-memory budget
Telemak keeps models in wired memory so the GPU always has them. The practical
ceiling is your RAM minus headroom for macOS and anything else running. Telemak
tracks each model’s wired_limit_mb and won’t let the OS page weights out — which
is exactly what keeps tok/s stable instead of collapsing when memory gets tight.
A rough fit guide (leave ~10–20% headroom):
| RAM | Comfortable load |
|---|---|
| 64 GB | one 35 B MoE 8-bit, or two ~30 B co-loaded (chat + embedder) |
| 96–128 GB | a 70–80 B MoE 8-bit |
| 256 GB | a 122 B MoE 8-bit |
| 512 GB | a 200 B+ MoE (mixed-quant) |
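The arithmetic behind a fit guide like this can be sketched in a few lines. This is a rough estimate, not how Telemak computes `wired_limit_mb`. The function names are hypothetical, the weight size is approximated as parameters times bits per weight, and KV cache, quantization metadata, and runtime overhead are left out, so treat a narrow pass as a fail.

```python
from typing import Iterable


def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a model with params_b billion parameters."""
    return params_b * bits_per_weight / 8


def fits(ram_gb: float, model_sizes_gb: Iterable[float], headroom: float = 0.15) -> bool:
    """True if the co-loaded models fit in RAM after reserving a headroom fraction."""
    budget_gb = ram_gb * (1 - headroom)
    return sum(model_sizes_gb) <= budget_gb


print(fits(64, [weights_gb(35, 8)]))    # one 35 B model at 8-bit on 64 GB
print(fits(64, [weights_gb(122, 8)]))   # a 122 B model at 8-bit on 64 GB
print(fits(256, [weights_gb(122, 8)]))  # the same model on 256 GB
```

The `headroom` default of 0.15 sits in the middle of the 10–20% range suggested above. Raise it if other memory-hungry apps share the machine.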
The KV-cache win
The first turn of a long conversation pays full prefill. The second turn reuses
the on-disk KV prefix (~/.telemak/sessions/) and comes back much faster —
in practice several times faster on a multi-thousand-token context. This is why
keeping a model warm and a conversation going beats reloading.
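The saving comes from prefix matching: only the tokens after the longest shared prefix need a fresh prefill. The sketch below models that idea with plain lists of token IDs. It is an illustration only. How Telemak stores and matches prefixes under ~/.telemak/sessions/ is not described here, and both function names are hypothetical.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Length of the longest common prefix of two token-ID sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Tokens that still need prefill once the cached prefix is reused."""
    return len(prompt) - shared_prefix_len(cached, prompt)


# Toy conversation: a 4000-token first turn, then 50 new tokens on the second.
first_turn = list(range(4000))
second_turn = first_turn + list(range(9000, 9050))

print(tokens_to_prefill([], first_turn))           # cold: every token is prefilled
print(tokens_to_prefill(first_turn, second_turn))  # warm: only the new suffix
```

In this toy case the second turn prefills 50 tokens instead of 4050, which is the kind of gap that makes a continued conversation feel much faster than a reload.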
Reliability
Telemak runs in production on M3 Ultra and M3 Max machines. A recorded endurance mark: 96 h 44 min non-stop, 23 549 inferences, zero failed requests, with two 30 B models co-loaded (73.9 / 96 GB wired). The LaunchAgent restarts the daemon on crash and across reboots, so uptime is the daemon’s job, not yours.
Read next
- Architecture — why wired memory and the KV cache behave this way.
- The menu bar — watching tok/s and the wired footprint live.
- Comparison — Telemak vs Ollama, LM Studio, vLLM, exo.