Mac Mini Cluster for Local AI 2026 — Run 70B+ Models with EXO and Thunderbolt 5 RDMA
macOS 26.2 added kernel-level RDMA over Thunderbolt 5 and EXO 1.0 shipped day-0 support — turning a stack of M4 Pro Mac Minis into the cheapest practical way to run DeepSeek V3 671B and Llama 4 Maverick at home. Per-tier shopping list, real benchmarks, and a clear decision rule.
Compute Market Team
For two years the answer to "what's the cheapest way to run a frontier MoE model at home?" was "you can't — buy an H100 or rent the cloud." That answer changed in April 2026.
macOS 26.2 added kernel-level RDMA over Thunderbolt 5. EXO 1.0 shipped day-0 support. And suddenly four to eight $1,399 Mac Minis chained together became the cheapest practical way to serve a 671B-parameter model from your desk. The pattern took over r/LocalLLaMA in a single weekend.
This guide turns the Reddit-shaped tribal knowledge into a single buying decision. We cover what EXO is, why Thunderbolt 5 RDMA matters, the four cluster recipes that actually convert dollars to tokens-per-second, and the clear rule for when a cluster beats a single GPU and when it doesn't.
Why People Are Suddenly Building Mac Mini Clusters in 2026
In April 2026, macOS 26.2 added kernel-level RDMA over Thunderbolt 5, and EXO 1.0 added day-0 support — together making a cluster of 4–8 M4 Pro Mac Minis the cheapest practical way to run frontier MoE models like DeepSeek V3 671B or Llama 4 Maverick at home. A published 8× M4 Pro Mac Mini cluster runs DeepSeek V3 671B at 5.37 tokens per second, and a hybrid 1× NVIDIA DGX Spark + 1× Mac Studio M4 Max setup delivers 2.8× the throughput of either device alone via disaggregated inference. For local-AI builders priced out of an H100, the cluster has replaced the single flagship GPU as the right answer at the >100GB-of-weights tier.
Three things had to happen for this to work:
- The right model architecture. DeepSeek V3 (671B), Llama 4 Maverick (400B / 40B active), and Qwen 3.6 are all Mixture-of-Experts models — only a small fraction of parameters compute per token. A cluster can hold the full weight set distributed across nodes and only ship the activations of the active experts between them. That's why Thunderbolt-class bandwidth is enough (the back-of-envelope sketch after this list shows the arithmetic).
- The right transport. Pre-26.2, EXO worked, but inter-node latency over standard Thunderbolt was the bottleneck on every benchmark. Kernel-level RDMA closes that gap — EXO Labs reports ~99% latency reduction versus the previous transport.
- The right price point. A 64GB M4 Pro Mac Mini starts at $1,399. Eight of them cost less than a single H100 PCIe ($25,000 – $33,000), and they collectively pack 512GB of unified memory — enough to hold a quantized 671B model plus context.
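To put numbers on that first point, here's a back-of-envelope sketch. The hidden-state size and precision are illustrative assumptions (roughly DeepSeek V3-class, not EXO-published figures); the point is the order of magnitude of data crossing the links per generated token.

```python
# Back-of-envelope: per-token inter-node traffic for pipeline-parallel MoE inference.
# hidden_dim and precision are illustrative assumptions, not EXO-published numbers.
hidden_dim = 7_168           # assumed hidden size, roughly DeepSeek V3-class
bytes_per_value = 2          # bf16 activations
hops = 7                     # activations cross 7 node boundaries on an 8-node pipeline

per_hop_bytes = hidden_dim * bytes_per_value      # ~14 KB per token per hop
per_token_bytes = per_hop_bytes * hops            # ~98 KB per generated token

tb5_bytes_per_sec = 80e9 / 8                      # 80 Gb/s ≈ 10 GB/s per TB5 link
transfer_ms = per_token_bytes / tb5_bytes_per_sec * 1_000
decode_budget_ms = 1_000 / 5.37                   # ~186 ms per token at the published 5.37 tok/s

print(f"~{per_token_bytes / 1024:.0f} KB moved per token, "
      f"{transfer_ms:.3f} ms on the wire vs a ~{decode_budget_ms:.0f} ms decode budget")
```

The raw bytes are trivial next to the decode budget. What hurt before macOS 26.2 was the per-message latency of crossing the kernel on every hop, which is exactly what RDMA removes.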
For broader context on what hardware to choose for local LLMs, see our Local LLM Hardware Hub. For the single-machine multi-GPU sister case, see our multi-GPU local LLM setup guide.
What EXO Actually Is (and How It's Different from Multi-GPU)
EXO is open-source software from EXO Labs that partitions a model across heterogeneous devices — Mac, NVIDIA, Linux boxes — and routes activations over the fastest available link (Thunderbolt 5 RDMA > 10GbE > Wi-Fi). It exposes an OpenAI-compatible HTTP endpoint, so any client built for the OpenAI API works against an EXO cluster unchanged.
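Because the endpoint speaks the OpenAI wire format, pointing an existing client at the cluster is a one-line change. A minimal sketch with the official openai Python package follows; the host name and model id are placeholders for whatever your head node and loaded model actually are, and the API key is a dummy since a local endpoint typically ignores it.

```python
from openai import OpenAI

# Point a standard OpenAI client at the EXO head node instead of api.openai.com.
# "exo-head.local" and "deepseek-v3" are placeholders for your own setup.
client = OpenAI(base_url="http://exo-head.local:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-v3",
    messages=[{"role": "user", "content": "Explain RDMA over Thunderbolt 5 in two sentences."}],
)
print(response.choices[0].message.content)
```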
The contrast with single-machine multi-GPU is sharp:
| Dimension | Single-machine multi-GPU | EXO cluster |
|---|---|---|
| Memory pooled across | 2–4 GPUs in one chassis | 2–8 nodes over a network |
| Interconnect | NVLink / PCIe (32–600 GB/s) | TB5 RDMA / 10GbE (1–10 GB/s) |
| Best for | Dense models, batch serving | MoE models, latency-tolerant |
| Heterogeneous hardware | Limited (matched GPUs ideal) | First-class (Apple + NVIDIA + Linux) |
| Failure domain | One machine | One node — others keep serving |
The other key insight: only 37B of DeepSeek V3's 671B parameters compute per token. Llama 4 Maverick is 400B total but routes through ~40B per token. That's why sharding a MoE model across nodes works at all — the cluster is mostly storing weights, not constantly shipping intermediate state. Dense 405B models like Llama 3.1 405B are a different animal and don't cluster nearly as well.
EXO uses pipeline-parallel partitioning by default — sequential layers assigned to different nodes — and falls back to tensor-parallel when the topology supports it. For a deeper read on partitioning strategies, our multi-GPU guide covers the underlying mechanics.
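As a rough illustration of what memory-weighted pipeline partitioning looks like (a simplified sketch, not EXO's actual placement algorithm), contiguous layer ranges get assigned in proportion to each node's advertised memory:

```python
def partition_layers(num_layers: int, node_memory_gb: list[float]) -> list[range]:
    """Assign contiguous layer ranges to nodes in proportion to each node's memory.

    Simplified sketch of memory-weighted pipeline parallelism; EXO's real
    partitioner also weighs link topology and can switch to tensor parallelism.
    """
    total_mem = sum(node_memory_gb)
    shares, start = [], 0
    for i, mem in enumerate(node_memory_gb):
        # The last node absorbs rounding so no layer is dropped or duplicated.
        if i == len(node_memory_gb) - 1:
            count = num_layers - start
        else:
            count = round(num_layers * mem / total_mem)
        shares.append(range(start, start + count))
        start += count
    return shares

# 61 transformer layers (DeepSeek V3-sized) across a Mac Studio head node and four Mac Minis:
print(partition_layers(61, [192, 64, 64, 64, 64]))
# [range(0, 26), range(26, 35), range(35, 44), range(44, 53), range(53, 61)]
```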
"EXO is built on the assumption that the next decade of model architectures will be MoE-dominated and that home builders will accumulate heterogeneous hardware over time, not buy matched fleets. Both of those assumptions favor distributed inference over scale-up." — paraphrased from the EXO Labs Transparent Benchmarks series, blog.exolabs.net/day-1/
The macOS 26.2 RDMA Breakthrough
The single change that converted Mac Mini clustering from "fun science project" to "actually usable" is the RDMA-over-Thunderbolt-5 capability that landed in macOS 26.2 in April 2026.
RDMA — Remote Direct Memory Access — lets one node read or write another node's memory without going through the kernel's TCP/IP stack on either end. The data path bypasses the OS, kernel buffer copies disappear, and tail latency collapses. RDMA is the reason InfiniBand has dominated HPC and AI training clusters for two decades.
What macOS 26.2 changed:
- Pre-26.2: EXO worked over Thunderbolt, but every activation transfer paid a kernel-traversal tax. On 8-node clusters, this was the dominant cost — bigger than compute on each node.
- Post-26.2: Kernel-level RDMA over Thunderbolt 5. EXO Labs benchmarks report ~99% inter-node latency reduction on the same hardware. // NEEDS VERIFICATION — figure cited from EXO Labs blog; benchmark methodology not independently audited.
- Compatible silicon: Any Mac with Thunderbolt 5 — M4 Pro, M4 Max, M3 Ultra. The base M4 Mac Mini does not have TB5 and is excluded.
The practical effect: the same 8× Mac Mini cluster that struggled to hit 1 tok/s on DeepSeek V3 in March 2026 hit 5.37 tok/s in April. That moves a cluster from "demo" to "I can actually use this for real work."
For Mac power users wondering how this compares to a Mac Studio M4 Max with NVIDIA, see our DGX Spark vs Mac Studio deep-dive — that benchmark was run on the same hybrid EXO topology this guide describes.
Cluster Recipes — What to Buy at Each Tier
Four recipes, ordered by total cost. Each one is a complete bill of materials — you should be able to copy any tier into your cart and have a functional cluster on arrival. Pull current prices from the linked product pages; tier totals below are rough brackets, not promises.
Entry tier — 4× M4 Pro Mac Mini, ~$5,600 (runs Llama 4 Maverick at ~12–15 tok/s)
| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Mini M4 Pro 64GB | 4 | $1,399 – $1,599 each | The cluster's bread and butter |
| Thunderbolt 5 cables (1m) | 3 | ~$80 each | Direct mesh, no hub needed at 4 nodes |
| Samsung 990 Pro 4TB NVMe (external) | 1–2 | $289 – $339 each | For shared model storage; optional if you keep weights on each Mac's internal SSD |
What it runs: 256GB of pooled unified memory across four nodes is enough for Llama 4 Maverick at 4-bit quant with comfortable context, plus dense 70B-class models like Llama 3 70B. Expected throughput: ~12–15 tokens per second on Maverick, ~8–10 tok/s on dense 70B.
Topology: Three TB5 cables form the mesh — Mac A↔Mac B, Mac C↔Mac D, Mac A↔Mac C — with the fourth edge (Mac B↔Mac D) filled by daisy-chained TB5 (TB5 supports up to 6 daisy-chained devices per port).
Why this tier exists: If you've been on the fence about buying a single Mac Mini M4 Pro for AI work (see our single Mac Mini vs RTX 5060 Ti review), buying four of them is the natural next step once you accept that 70B is your target floor.
Sweet spot — 8× M4 Pro Mac Mini, ~$11,000 (runs DeepSeek V3 671B at the published 5.37 tok/s)
| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Mini M4 Pro 64GB | 8 | $1,399 – $1,599 each | The published EXO Labs benchmark configuration |
| Thunderbolt 5 hub or 4–6 TB5 cables | 1 hub or 6 cables | ~$300–$500 | At 8 nodes, a hub simplifies cabling |
| Samsung 990 Pro 4TB NVMe per node | 1 per node (8) | $289 – $339 each | Local weight storage on each Mac to avoid network model loading |
| UniFi Dream Machine Pro | 1 | $379 – $449 | Cluster network controller, VLAN isolation |
What it runs: 512GB pooled unified memory — the published EXO Labs benchmark target for DeepSeek V3 671B at 5.37 tok/s. Also runs Llama 4 Behemoth 405B, Qwen 3.6, and any future MoE model in the 400B–700B class without further hardware changes.
Why 8 is the magic number: 8 × 64GB = 512GB unified, which holds DeepSeek V3 at 4-bit quant (~340GB weights) plus context, KV cache, and EXO overhead with headroom. Drop to 6 nodes (384GB pooled) and you're left with only ~44GB over the weights, which is too tight for full context.
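The same arithmetic, spelled out. The weight figure is the ~340GB from above; the KV-cache-plus-overhead number is a placeholder, since the real figure depends on context length and how EXO shards the cache.

```python
# Memory headroom for Q4 DeepSeek V3 (~340GB weights) on 6 vs 8 Mac Mini nodes.
# kv_and_overhead_gb is a placeholder; actual KV size depends on context length.
weights_gb = 340
kv_and_overhead_gb = 100

for nodes in (6, 8):
    pooled_gb = nodes * 64
    headroom_gb = pooled_gb - weights_gb
    verdict = "fits" if headroom_gb >= kv_and_overhead_gb else "too tight"
    print(f"{nodes} nodes: {pooled_gb}GB pooled, {headroom_gb}GB over weights -> {verdict}")
# 6 nodes: 384GB pooled, 44GB over weights -> too tight
# 8 nodes: 512GB pooled, 172GB over weights -> fits
```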
Cabling at this tier: Direct point-to-point TB5 mesh stops being practical past 4 nodes — you run out of TB5 ports per machine. Either daisy-chain through TB5 hubs or fall back to 10GbE for the longer links. EXO's routing algorithm prefers TB5 RDMA but tolerates mixed topologies.
Premium tier — Mac Studio head node + 4× Mac Mini workers, ~$14,000
| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Studio M4 Max 192GB | 1 | $1,999 – $5,999 | Head node; the big-memory single device |
| Mac Mini M4 Pro 64GB | 4 | $1,399 – $1,599 each | Worker nodes |
| Thunderbolt 5 cables / hub | 1 hub | ~$400 | Star topology with Mac Studio at center |
| Samsung 990 Pro 4TB NVMe per node | 1 per node (5) | $289 – $339 each | Per-node weight storage |
What it runs: 192GB on the head plus 256GB on workers = 448GB total, enough for 671B at slightly tighter quant or 405B with luxurious context. The Mac Studio's M4 Max chip pulls more compute per watt than the M4 Pro, so it makes a great head node — handles prefill (which is compute-bound), then hands off decode (memory-bandwidth-bound) to the worker mesh.
Why this tier exists: If you already own a Mac Studio M4 Max and have hit its 192GB ceiling, this is the upgrade path. You don't throw away your existing investment — you extend it. See our Mac Mini M4 Pro vs Mac Studio M4 Max comparison for the head-node-vs-worker decision logic.
Hybrid tier — RTX 5090 box + Mac Studio M4 Max (NVIDIA + Apple in one cluster)
| Component | Qty | Price | Notes |
|---|---|---|---|
| NVIDIA GeForce RTX 5090 + host PC | 1 | $1,999 – $2,199 (GPU) | Prefill/compute node; runs Linux or Windows + EXO |
| Mac Studio M4 Max 192GB | 1 | $1,999 – $5,999 | Decode/memory node; large unified memory pool |
| 10GbE link or USB4/TB shared bridge | 1 | ~$200 | EXO doesn't require TB5 for cross-platform links |
What it runs: This is the configuration EXO Labs benchmarked as the 2.8× hybrid: a single NVIDIA DGX Spark paired with a Mac Studio M4 Max delivered ~2.8× the throughput of either device running standalone. The pattern works because prefill (parallelizable, compute-heavy) lives on the GPU, while decode (sequential, memory-bandwidth-bound) lives on Apple Silicon's unified memory. EXO calls this disaggregated inference.
The RTX 5090 substitutes cleanly for DGX Spark in this topology — the pattern is GPU-for-prefill + Mac-for-decode, and any modern compute-rich GPU paired with a memory-rich Mac sees the same shape of speedup. // NEEDS VERIFICATION — 2.8× figure is for the published DGX Spark + Mac Studio configuration; RTX 5090 substitution is the same topology but the exact multiplier depends on RTX 5090's prefill compute vs DGX Spark's.
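For readers who want the shape of disaggregated inference in code, here's an illustrative sketch. The stub functions stand in for the work EXO schedules on each node (this is not EXO's actual API): the full prompt is prefilled once on the compute-rich node, which hands the KV cache to the memory-rich node for the sequential decode loop.

```python
# Illustrative shape of disaggregated inference; the two stubs stand in for
# work EXO would schedule on each node. This is not EXO's internal API.

def prefill_on_gpu(prompt_tokens: list[int]) -> list[tuple[int, int]]:
    """Compute-heavy: process the whole prompt in parallel, build the KV cache."""
    return [(t, t) for t in prompt_tokens]           # stub KV cache

def decode_step_on_mac(kv_cache: list[tuple[int, int]], last_token: int) -> int:
    """Bandwidth-heavy: one sequential step that reads all weights plus the cache."""
    kv_cache.append((last_token, last_token))        # stub cache update
    return (last_token + 1) % 50_000                 # stub next-token choice

def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    kv_cache = prefill_on_gpu(prompt_tokens)         # runs on the RTX 5090 / DGX Spark
    out, token = [], prompt_tokens[-1]
    for _ in range(max_new_tokens):                  # runs on the Mac Studio
        token = decode_step_on_mac(kv_cache, token)
        out.append(token)
    return out

print(generate([101, 102, 103], max_new_tokens=5))
```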
Why this tier exists: If you already own an RTX 5090 PC (see our RTX 5090 vs Mac Studio M4 Max breakdown), adding a single Mac Studio M4 Max gives you MoE capacity without buying another GPU. The Mac Studio's 192GB unified memory is cheaper per-GB than NVIDIA VRAM by an order of magnitude, and EXO routes around the bandwidth gap.
Networking — Thunderbolt 5 Mesh vs 10GbE vs Wi-Fi
The interconnect choice dominates cluster performance once you have working nodes. Here's the decision matrix.
| Transport | Bandwidth | RDMA? | Best for | Notes |
|---|---|---|---|---|
| Thunderbolt 5 mesh (direct) | ~80 Gb/s per link | Yes (macOS 26.2+) | 2–4 nodes | Cheapest, fastest, RDMA-eligible |
| Thunderbolt 5 hub | ~80 Gb/s shared | Yes | 5–8 nodes | Adds ~$300–$500 in switching |
| 10GbE switched | 10 Gb/s | No (TCP/IP) | Mixed-platform | Use when mixing Mac + non-TB5 hardware |
| Wi-Fi 7 / 6E | 1–4 Gb/s | No | Demos only | Tanks tok/s by 50–80% in EXO benchmarks |
For 2–4 nodes, direct TB5 cabling (Mac↔Mac without a switch) is unambiguously the right call: highest bandwidth, RDMA-eligible, and roughly $80 per cable. For 5–8 nodes, the math shifts — direct mesh runs out of TB5 ports per machine, so you need either a TB5 hub or a fallback to switched 10GbE.
The 10GbE fallback is real and useful. Pair a MikroTik CRS326-24G-2S+RM ($149 – $199) with each node's onboard 10GbE (the M4 Pro Mac Mini ships with 10GbE as a build-to-order option) and you have a working cluster — slower than TB5 RDMA but still well above Wi-Fi. A UniFi Dream Machine Pro ($379 – $449) on top gives you VLAN isolation so your cluster traffic doesn't collide with household network traffic.
Avoid Wi-Fi for inference traffic. EXO Labs benchmarks show 50–80% tok/s degradation versus TB5 RDMA — at that point you've spent $11,000 on hardware and capped its useful throughput.
For x86 mini PCs as cluster nodes: Beelink SER8 ($449 – $599) and GMKtec M6 Ultra ($429 – $549) look attractive on price-per-GB-of-RAM but they don't have Thunderbolt 5 RDMA — you're stuck on 10GbE or USB4 fallback transports, which compound badly across multi-hop EXO routing. See our Mac Mini M4 Pro vs Beelink SER8 comparison for the full breakdown of why x86 mini PCs lose this specific match. Strix Halo readers: same conclusion, see the Strix Halo deep-dive.
Performance — What You Actually Get (with Numbers)
The numbers below combine the EXO Labs Transparent Benchmarks series, the Akshat Rai Laddha self-hosted-Llama-70B Medium piece, the JewelsHovan GitHub gist, and r/LocalLLaMA April 2026 community submissions. Treat tier-1 (EXO Labs official) figures as authoritative; tier-2 (community) figures with mild healthy skepticism.
| Cluster | Model | Quant | tok/s | TTFT (sec) | Source |
|---|---|---|---|---|---|
| 4× Mac Mini M4 Pro 64GB | Llama 4 Maverick | Q4 | ~12–15 | ~3–5 | Community / r/LocalLLaMA |
| 4× Mac Mini M4 Pro 64GB | Llama 3 70B (dense) | Q4 | ~8–10 | ~5–8 | Community / Akshat Laddha |
| 8× Mac Mini M4 Pro 64GB | DeepSeek V3 671B | Q4 | 5.37 | ~12–18 | EXO Labs (published) |
| 8× Mac Mini M4 Pro 64GB | Llama 4 Behemoth 405B | Q4 | ~6–8 | ~10–14 | Community estimate |
| 1× Mac Studio M4 Max + 4× Mac Mini | DeepSeek V3 671B | Q4 | ~7–9 | ~8–12 | Community estimate |
| 1× DGX Spark + 1× Mac Studio M4 Max | ~70B class | Q4 | 2.8× single-device | — | EXO Labs (published) |
| Single Mac Studio M4 Max 192GB | DeepSeek V3 671B | Q4 | Doesn't fit | — | — |
| Single RTX 5090 (32GB) | DeepSeek V3 671B | Q4 | Doesn't fit | — | — |
Sources: EXO Labs Transparent Benchmarks (blog.exolabs.net/day-1/) and DGX Spark hybrid benchmark (blog.exolabs.net/nvidia-dgx-spark/); EXO GitHub README (github.com/exo-explore/exo); JewelsHovan Mac Studio M3 Ultra + Mac Mini M4 Pro cluster gist; Akshat Rai Laddha "Self Hosting Llama-70B on Apple Silicon hardware with Exo and MLX"; r/LocalLLaMA April 2026 cluster threads.
The number that matters: 5.37 tok/s on DeepSeek V3 671B for an $11,000 cluster. Compare that to renting an H100 instance ($2–$4/hour, ~30 tok/s) — at 8 hours/day of use, the cluster pays for itself in 12–18 months and you keep the hardware. For builders running inference 24/7, the math is even better.
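The break-even arithmetic behind that claim, using the article's own numbers (the hourly rate and daily usage are assumptions you should swap for your own):

```python
# Break-even for an ~$11,000 cluster vs renting an H100 at $2-$4/hour, 8 hours/day.
cluster_cost_usd = 11_000
hours_per_day = 8

for hourly_rate in (2, 3, 4):
    daily_rental = hourly_rate * hours_per_day
    months = cluster_cost_usd / daily_rental / 30
    print(f"${hourly_rate}/hr -> ${daily_rental}/day -> break-even in ~{months:.0f} months")
# $2/hr -> $16/day -> break-even in ~23 months
# $3/hr -> $24/day -> break-even in ~15 months
# $4/hr -> $32/day -> break-even in ~11 months
```

The 12–18-month bracket above sits in the middle of that rental-rate range. The rented H100 is still roughly five times faster per stream on this model, but the capital converts to hardware you keep.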
For model-specific deep-dives that pair with this cluster guide, see our Llama 4 hardware guide, Qwen 3.6 hardware guide, and DeepSeek V4 Flash hardware guide for the natural follow-up workload.
Setup Walkthrough — From Boxes to OpenAI-Compatible Endpoint
This is the rough sequence. Plan a Saturday for a 4-node cluster, a long weekend for 8 nodes — most of the time is initial macOS updates and weight downloads, not EXO config.
- Update every node to macOS 26.2 or later. RDMA over TB5 is gated on this version. Check System Settings → General → About. Without 26.2, EXO works but RDMA does not, and your tok/s collapses to roughly 30–40% of the benchmark numbers.
- Install EXO on every node: `brew install exo-explore/tap/exo`. The Homebrew tap is published by EXO Labs; the GitHub repo at github.com/exo-explore/exo is the source of truth.
- Wire the Thunderbolt 5 mesh. For 2–4 nodes, direct TB5 cables in a ring or square. For 5–8 nodes, a TB5 hub or 10GbE switch. Verify TB5 — not TB4 — on every cable; a single TB4 link drops the cluster to TB4 speeds end-to-end.
- Pre-stage model weights on each node's local SSD. A Samsung 990 Pro 4TB ($289 – $339) per node holds DeepSeek V3 quantized plus 2–3 backup models. Don't skip this — loading weights over the network on first inference adds 10+ minutes and is the #1 source of "my cluster is broken" Reddit posts.
- Run `exo` on every node. EXO auto-discovers peers on the same TB5/network segment. The first node started becomes the bootstrap; the others advertise themselves and EXO partitions the model based on advertised memory.
- Sanity-check on a small model first. Point a curl (or the request sketch after this list) at `http://<head>:8000/v1/chat/completions` with a 7B model loaded. You should see ~30+ tok/s — clusters always run small models faster than they need to. If you see <5 tok/s, you have a transport problem (TB4 cable, RDMA disabled, network firewall).
- Scale up to your target model. Issue an EXO model-load command for DeepSeek V3 / Llama 4 Maverick / your choice. Watch `exo status` — every node should report a model shard before first inference.
- Point your client at the head node. Any OpenAI-API-compatible client works: Continue.dev, Open WebUI, your own scripts. The endpoint is `http://<head>:8000/v1`.
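A minimal throughput check for the sanity-check step. It assumes the head node answers on port 8000, that a small model is already loaded, and that the server reports OpenAI-style usage counts; the host and model id are placeholders.

```python
import time
import requests

# Rough single-request throughput check against the EXO head node's
# OpenAI-compatible endpoint. Host and model id are placeholders.
URL = "http://exo-head.local:8000/v1/chat/completions"
payload = {
    "model": "llama-3-8b",   # whatever small model the cluster has loaded
    "messages": [{"role": "user", "content": "Count from 1 to 50, comma-separated."}],
    "max_tokens": 200,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
elapsed = time.time() - start

tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
print(f"{tokens} completion tokens in {elapsed:.1f}s ≈ {tokens / elapsed:.1f} tok/s")
```

The timing includes prompt prefill, so it slightly understates steady-state decode speed; treat it as a go/no-go check, not a benchmark.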
For deeper EXO setup walkthroughs, the JewelsHovan GitHub gist ("Mac Studio M3 Ultra + Mac Mini M4 Pro Cluster Deep Dive") and the EXO repo README are the two best ground-truth references as of this writing.
When the Cluster Beats a Single GPU — and When It Doesn't
The cluster is not a universal answer. It's a specific answer to a specific question. Here's the decision rule.
| Workload | Cluster wins | Single GPU wins | Why |
|---|---|---|---|
| MoE models >100GB weights | ✅ | — | Sparse activation = low inter-node bandwidth need |
| Dense models ≤32B | — | ✅ | Single GPU has the VRAM and the compute density |
| Dense models 70B–405B | It's complicated | It's complicated | Bandwidth-bound; cluster works but loses to multi-GPU |
| Batch serving / many concurrent users | — | ✅ | GPU tensor parallelism scales with batch size; cluster doesn't |
| Training / fine-tuning | — | ✅ | Backprop needs much more bandwidth than inference |
| Image generation (SDXL, Flux) | — | ✅ | Compute-bound; CUDA ecosystem dominates |
| Video generation (Sora-class) | — | ✅ | Same as image, more so |
| Latency-tolerant batch-1 chat | ✅ | — | Cluster's single-stream tok/s is "good enough" |
| No-budget-for-H100 frontier-MoE serving | ✅✅✅ | — | This is the cluster's reason to exist |
If your workload lives in the "single GPU wins" column, head to one of these guides instead:
- Image / video generation, fine-tuning: RTX Pro 6000 96GB review or RTX 5090 vs Mac Studio.
- Single-machine multi-GPU for dense 70B: Multi-GPU local LLM setup guide.
- Just one Mac, just LLMs: Mac Mini alternatives or the mini PC for AI hub.
- Building a single-box AI server: Home AI server build guide.
The Bottom Line — Should You Build One?
The decision rule is simple: if your target model is MoE and weights exceed 100GB, the cluster math beats single-GPU math today. Otherwise, single-GPU is cheaper, faster, simpler.
Per-persona verdict:
- The MoE-curious hobbyist (priced out of an H100): Build the 8× Mac Mini M4 Pro sweet-spot tier. ~$11,000, runs DeepSeek V3 671B and any future MoE in the 400B–700B class. This is the use case the cluster pattern was invented for.
- The Mac power user with one Mac Studio: Build the premium tier — keep your existing Mac Studio M4 Max as the head node, add 4× Mac Minis as workers. Net-new cost ~$5,600–$6,500. You extend, not replace.
- The hybrid builder with an existing RTX 5090 PC: Build the hybrid tier — add one Mac Studio M4 Max to your existing rig, run EXO disaggregated inference. ~$2,000–$6,000 net-new for a 2.8×-class throughput gain on MoE workloads. See our Mac Studio M4 Max vs RTX 5090 comparison for the standalone case before you decide to combine them.
- The dense-model-only builder: Skip the cluster. Buy an RTX 5090 ($1,999 – $2,199) or build a dual-GPU rig per our multi-GPU guide. The cluster's bandwidth math doesn't favor you.
One caveat worth restating: the headline 5.37 tok/s number is for a specific model (DeepSeek V3 671B, Q4) on a specific topology (8× M4 Pro, TB5 RDMA mesh, macOS 26.2). Your tok/s on Llama 4 Maverick will be different. Your tok/s on a future MoE released next quarter will be different again. The cluster pattern is durable; the exact numbers move.
For builders ready to start: read the EXO repo, order four Mac Minis, and you'll be serving 70B inference from your home network within a week. For builders not ready: bookmark this guide, watch EXO Labs' Transparent Benchmarks for the next set of numbers, and revisit when your target model becomes a Mixture-of-Experts. The architecture trend — sparse activation, large total weights — points strongly in this direction. The cluster pattern is going to keep mattering.