Guide · 19 min read

Mac Mini Cluster for Local AI 2026 — Run 70B+ Models with EXO and Thunderbolt 5 RDMA

macOS 26.2 added kernel-level RDMA over Thunderbolt 5 and EXO 1.0 shipped day-0 support — turning a stack of M4 Pro Mac Minis into the cheapest practical way to run DeepSeek V3 671B and Llama 4 Maverick at home. Per-tier shopping list, real benchmarks, and a clear decision rule.


Compute Market Team

Our Top Pick

Apple Mac Mini M4 Pro
$1,399 – $1,599
Apple M4 Pro · 12-core · 18-core

For two years the answer to "what's the cheapest way to run a frontier MoE model at home?" was "you can't — buy an H100 or rent the cloud." That answer changed in April 2026.

macOS 26.2 added kernel-level RDMA over Thunderbolt 5. EXO 1.0 shipped day-0 support. And suddenly four to eight $1,399 Mac Minis chained together became the cheapest practical way to serve a 671B-parameter model from your desk. The pattern took over r/LocalLLaMA in a single weekend.

This guide turns the Reddit-shaped tribal knowledge into a single buying decision. We cover what EXO is, why Thunderbolt 5 RDMA matters, the four cluster recipes that actually convert dollars to tokens-per-second, and the clear rule for when a cluster beats a single GPU and when it doesn't.

Why People Are Suddenly Building Mac Mini Clusters in 2026

In April 2026, macOS 26.2 added kernel-level RDMA over Thunderbolt 5, and EXO 1.0 added day-0 support — together making a cluster of 4–8 M4 Pro Mac Minis the cheapest practical way to run frontier MoE models like DeepSeek V3 671B or Llama 4 Maverick at home. A published 8× M4 Pro Mac Mini cluster runs DeepSeek V3 671B at 5.37 tokens per second, and a hybrid 1× NVIDIA DGX Spark + 1× Mac Studio M4 Max setup delivers 2.8× the throughput of either device alone via disaggregated inference. For local-AI builders priced out of an H100, the cluster has replaced the single flagship GPU as the right answer at the >100GB-of-weights tier.

Three things had to happen for this to work:

  1. The right model architecture. DeepSeek V3 (671B), Llama 4 Maverick (400B / 40B active), and Qwen 3.6 are all Mixture-of-Experts models — only a small fraction of parameters compute per token. A cluster can hold the full weight set distributed across nodes and only ship the activations of the active experts between them. That's why Thunderbolt-class bandwidth is enough.
  2. The right transport. Pre-26.2, EXO worked, but inter-node latency over standard Thunderbolt was the bottleneck on every benchmark. Kernel-level RDMA closes that gap — EXO Labs reports ~99% latency reduction versus the previous transport.
  3. The right price point. A 64GB M4 Pro Mac Mini starts at $1,399. Eight of them cost less than a single H100 PCIe ($25,000 – $33,000), and they collectively pack 512GB of unified memory — enough to hold a quantized 671B model plus context.
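The bandwidth arithmetic behind point 1 can be sketched in a few lines. This is a back-of-envelope model, not EXO's measured traffic; the hidden size, activation dtype, and hop count are illustrative assumptions.

```python
# Back-of-envelope: how much data crosses the wire per generated token in a
# pipeline-parallel MoE cluster. All parameter values below are assumptions
# for illustration, not figures from EXO Labs.

def per_token_gbit(hidden_size, bytes_per_activation, hops):
    """Gigabits shipped per token: each pipeline hop forwards one
    activation vector of hidden_size elements between nodes."""
    return hidden_size * bytes_per_activation * 8 * hops / 1e9

# Assumed: 7168-wide hidden state, fp16 activations, 7 hops across 8 nodes.
traffic = per_token_gbit(hidden_size=7168, bytes_per_activation=2, hops=7)
tb5_link_gbps = 80  # ~80 Gb/s per Thunderbolt 5 link
print(f"~{traffic * 1000:.1f} Mb/token; bandwidth ceiling ~{tb5_link_gbps / traffic:,.0f} tok/s")
```

Under these assumptions, raw bandwidth is nowhere near the constraint; the per-hop kernel-traversal latency was, which is exactly what the macOS 26.2 RDMA change removes.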

For broader context on what hardware to choose for local LLMs, see our Local LLM Hardware Hub. For the single-machine multi-GPU sister case, see our multi-GPU local LLM setup guide.

What EXO Actually Is (and How It's Different from Multi-GPU)

EXO is open-source software from EXO Labs that partitions a model across heterogeneous devices — Mac, NVIDIA, Linux boxes — and routes activations over the fastest available link (Thunderbolt 5 RDMA > 10GbE > Wi-Fi). It exposes an OpenAI-compatible HTTP endpoint, so any client built for the OpenAI API works against an EXO cluster unchanged.

The contrast with single-machine multi-GPU is sharp:

| Dimension | Single-machine multi-GPU | EXO cluster |
|---|---|---|
| Memory pooled across | 2–4 GPUs in one chassis | 2–8 nodes over a network |
| Interconnect | NVLink / PCIe (32–600 GB/s) | TB5 RDMA / 10GbE (1–10 GB/s) |
| Best for | Dense models, batch serving | MoE models, latency-tolerant |
| Heterogeneous hardware | Limited (matched GPUs ideal) | First-class (Apple + NVIDIA + Linux) |
| Failure domain | One machine | One node — others keep serving |

The other key insight: only 37B of DeepSeek V3's 671B parameters compute per token. Llama 4 Maverick is 400B total but routes through ~40B per token. That's why sharding a MoE model across nodes works at all — the cluster is mostly storing weights, not constantly shipping intermediate state. Dense 405B models like Llama 3.1 405B are a different animal and don't cluster nearly as well.

EXO uses pipeline-parallel partitioning by default — sequential layers assigned to different nodes — and falls back to tensor-parallel when the topology supports it. For a deeper read on partitioning strategies, our multi-GPU guide covers the underlying mechanics.
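The idea of memory-proportional pipeline partitioning can be sketched as follows. This is a simplified illustration, not EXO's actual placement algorithm (which lives in the github.com/exo-explore/exo repo).

```python
# Sketch of memory-proportional pipeline partitioning: contiguous layer
# ranges assigned to nodes in proportion to each node's memory.
# Simplified illustration only; EXO's real logic also weighs link topology.

def partition_layers(num_layers, node_mem_gb):
    """Return (start, end) layer ranges, one per node."""
    total = sum(node_mem_gb)
    shards, start = [], 0
    for i, mem in enumerate(node_mem_gb):
        # Last node takes the remainder so rounding never drops a layer.
        count = (num_layers - start if i == len(node_mem_gb) - 1
                 else round(num_layers * mem / total))
        shards.append((start, start + count))
        start += count
    return shards

# 61 layers across a 192GB Mac Studio head and two 64GB Mac Mini workers:
print(partition_layers(61, [192, 64, 64]))  # head carries ~3x a worker's share
```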

"EXO is built on the assumption that the next decade of model architectures will be MoE-dominated and that home builders will accumulate heterogeneous hardware over time, not buy matched fleets. Both of those assumptions favor distributed inference over scale-up." — paraphrased from the EXO Labs Transparent Benchmarks series, blog.exolabs.net/day-1/

The macOS 26.2 RDMA Breakthrough

The single change that converted Mac Mini clustering from "fun science project" to "actually usable" is the RDMA-over-Thunderbolt-5 capability that landed in macOS 26.2 in April 2026.

RDMA — Remote Direct Memory Access — lets one node read or write another node's memory without going through the kernel's TCP/IP stack on either end. The data path bypasses the OS, kernel buffer copies disappear, and tail latency collapses. RDMA is the reason InfiniBand has dominated HPC and AI training clusters for two decades.

What macOS 26.2 changed:

  • Pre-26.2: EXO worked over Thunderbolt, but every activation transfer paid a kernel-traversal tax. On 8-node clusters, this was the dominant cost — bigger than compute on each node.
  • Post-26.2: Kernel-level RDMA over Thunderbolt 5. EXO Labs benchmarks report ~99% inter-node latency reduction on the same hardware. // NEEDS VERIFICATION — figure cited from EXO Labs blog; benchmark methodology not independently audited.
  • Compatible silicon: Any Mac with Thunderbolt 5 — M4 Pro, M4 Max, M3 Ultra. The base M4 Mac Mini does not have TB5 and is excluded.

The practical effect: the same 8× Mac Mini cluster that struggled to hit 1 tok/s on DeepSeek V3 in March 2026 hit 5.37 tok/s in April. That moves a cluster from "demo" to "I can actually use this for real work."

For Mac power users wondering how this compares to a Mac Studio M4 Max with NVIDIA, see our DGX Spark vs Mac Studio deep-dive — that benchmark was run on the same hybrid EXO topology this guide describes.

Cluster Recipes — What to Buy at Each Tier

Four recipes, ordered by total cost. Each one is a complete bill of materials — you should be able to copy any tier into your cart and have a functional cluster on arrival. Pull current prices from the linked product pages; tier totals below are rough brackets, not promises.

Entry tier — 4× M4 Pro Mac Mini, ~$5,600 (runs Llama 4 Maverick at ~12–15 tok/s)

| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Mini M4 Pro 64GB | 4 | $1,399 – $1,599 each | The cluster's bread and butter |
| Thunderbolt 5 cables (1m) | 3–4 | ~$80 each | Direct chain or ring; no hub needed at 4 nodes |
| Samsung 990 Pro 4TB NVMe (external) | 1–2 | $289 – $339 each | For shared model storage; optional if you keep weights on each Mac's internal SSD |

What it runs: 256GB of pooled unified memory across four nodes is enough for Llama 4 Maverick (400B total, ~40B active; roughly 200GB of weights at 4-bit quant) with comfortable context, plus dense 70B-class models like Llama 3 70B. Expected throughput: ~12–15 tokens per second on Maverick, ~8–10 tok/s on dense 70B.

Topology: Three TB5 cables form an open chain — Mac A↔Mac B, Mac B↔Mac C, Mac C↔Mac D — and a fourth cable closes the ring if you want a redundant path. Daisy-chaining also works; TB5 supports up to 6 daisy-chained devices per port.

Why this tier exists: If you've been on the fence about buying a single Mac Mini M4 Pro for AI work (see our single Mac Mini vs RTX 5060 Ti review), buying four of them is the natural next step once you accept that 70B is your target floor.

Sweet spot — 8× M4 Pro Mac Mini, ~$11,000 (runs DeepSeek V3 671B at the published 5.37 tok/s)

| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Mini M4 Pro 64GB | 8 | $1,399 – $1,599 each | The published EXO Labs benchmark configuration |
| Thunderbolt 5 hub or 4–6 TB5 cables | 1 hub or 6 cables | ~$300 – $500 | At 8 nodes, a hub simplifies cabling |
| Samsung 990 Pro 4TB NVMe (per node) | 2–4 | $289 – $339 each | Local weight storage on each Mac to avoid network model loading |
| UniFi Dream Machine Pro | 1 | $379 – $449 | Cluster network controller, VLAN isolation |

What it runs: 512GB pooled unified memory — the published EXO Labs benchmark target for DeepSeek V3 671B at 5.37 tok/s. Also runs Llama 4 Behemoth 405B, Qwen 3.6, and any future MoE model in the 400B–700B class without further hardware changes.

Why 8 is the magic number: 8 × 64GB = 512GB unified, which holds DeepSeek V3 at 4-bit quant (~340GB weights) plus context, KV cache, and EXO overhead with headroom. Drop to six nodes and the 384GB pool leaves barely 40GB over the weights, too tight for full context plus KV cache.
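The headroom math generalizes to any node count. A rough planning sketch, where the KV-cache allowance and overhead fraction are assumptions, not EXO-published figures:

```python
# Does a quantized model fit in an N-node pool? Planning math only; the
# KV-cache allowance and overhead fraction below are assumed values.

def fits(total_params_b, bits_per_weight, nodes, gb_per_node,
         kv_context_gb=60, overhead_frac=0.10):
    weights_gb = total_params_b * bits_per_weight / 8  # 1B params @ 4-bit ~ 0.5GB
    pool_gb = nodes * gb_per_node
    need_gb = weights_gb * (1 + overhead_frac) + kv_context_gb
    return pool_gb, need_gb, pool_gb >= need_gb

# DeepSeek V3 671B at Q4 on 6 vs 8 M4 Pro Mac Minis (64GB each):
for n in (6, 8):
    pool, need, ok = fits(671, 4, n, 64)
    print(f"{n} nodes: {pool}GB pool vs ~{need:.0f}GB needed -> {'fits' if ok else 'too tight'}")
```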

Cabling at this tier: Direct point-to-point TB5 mesh stops being practical past 4 nodes — you run out of TB5 ports per machine. Either daisy-chain through TB5 hubs or fall back to 10GbE for the longer links. EXO's routing algorithm prefers TB5 RDMA but tolerates mixed topologies.

Premium tier — Mac Studio head node + 4× Mac Mini workers, ~$14,000

| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Studio M4 Max 192GB | 1 | $1,999 – $5,999 | Head node; the big-memory single device |
| Mac Mini M4 Pro 64GB | 4 | $1,399 – $1,599 each | Worker nodes |
| Thunderbolt 5 cables / hub | 1 hub | ~$400 | Star topology with Mac Studio at center |
| Samsung 990 Pro 4TB NVMe (per node) | 2–3 | $289 – $339 each | Per-node weight storage |

What it runs: 192GB on the head plus 256GB on workers = 448GB total, enough for 671B at slightly tighter quant or 405B with luxurious context. The Mac Studio's M4 Max chip pulls more compute per watt than the M4 Pro, so it makes a great head node — handles prefill (which is compute-bound), then hands off decode (memory-bandwidth-bound) to the worker mesh.

Why this tier exists: If you already own a Mac Studio M4 Max and have hit its 192GB ceiling, this is the upgrade path. You don't throw away your existing investment — you extend it. See our Mac Mini M4 Pro vs Mac Studio M4 Max comparison for the head-node-vs-worker decision logic.

Hybrid tier — RTX 5090 box + Mac Studio M4 Max (NVIDIA + Apple in one cluster)

| Component | Qty | Price | Notes |
|---|---|---|---|
| NVIDIA GeForce RTX 5090 + host PC | 1 | $1,999 – $2,199 (GPU) | Prefill/compute node; runs Linux or Windows + EXO |
| Mac Studio M4 Max 192GB | 1 | $1,999 – $5,999 | Decode/memory node; large unified memory pool |
| 10GbE link or USB4/TB shared bridge | 1 | ~$200 | EXO doesn't require TB5 for cross-platform links |

What it runs: This is the configuration EXO Labs benchmarked as the 2.8× hybrid: a single NVIDIA DGX Spark paired with a Mac Studio M4 Max delivered ~2.8× the throughput of either device running standalone. The pattern works because prefill (parallelizable, compute-heavy) lives on the GPU, while decode (sequential, memory-bandwidth-bound) lives on Apple Silicon's unified memory. EXO calls this disaggregated inference.

The RTX 5090 substitutes cleanly for DGX Spark in this topology — the pattern is GPU-for-prefill + Mac-for-decode, and any modern compute-rich GPU paired with a memory-rich Mac sees the same shape of speedup. // NEEDS VERIFICATION — 2.8× figure is for the published DGX Spark + Mac Studio configuration; RTX 5090 substitution is the same topology but the exact multiplier depends on RTX 5090's prefill compute vs DGX Spark's.

Why this tier exists: If you already own an RTX 5090 PC (see our RTX 5090 vs Mac Studio M4 Max breakdown), adding a single Mac Studio M4 Max gives you MoE capacity without buying another GPU. The Mac Studio's 192GB unified memory is cheaper per-GB than NVIDIA VRAM by an order of magnitude, and EXO routes around the bandwidth gap.
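A toy latency model shows where the disaggregated gain comes from. The throughput figures below are illustrative assumptions, not measurements, and this sketch does not reproduce the published 2.8× result; it only shows that the speedup concentrates in prompt-heavy workloads.

```python
# Toy model of disaggregated inference: GPU handles prefill, Mac handles
# decode. All tok/s figures are assumed for illustration, not measured.

def speedup(prompt_toks, gen_toks,
            mac_prefill=800, mac_decode=12, gpu_prefill=8000):
    """Request-time ratio: Mac alone vs GPU-prefill + Mac-decode hybrid."""
    mac_only = prompt_toks / mac_prefill + gen_toks / mac_decode
    hybrid   = prompt_toks / gpu_prefill + gen_toks / mac_decode
    return mac_only / hybrid

print(f"prompt-heavy (8k in / 100 out): {speedup(8000, 100):.1f}x")
print(f"gen-heavy    (500 in / 1k out): {speedup(500, 1000):.2f}x")
```

The design intuition: prefill parallelizes across the prompt and rewards raw compute, while decode is sequential and rewards memory capacity and bandwidth, so each device handles the phase it is built for.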

Networking — Thunderbolt 5 Mesh vs 10GbE vs Wi-Fi

The interconnect choice dominates cluster performance once you have working nodes. Here's the decision matrix.

| Transport | Bandwidth | RDMA? | Best for | Notes |
|---|---|---|---|---|
| Thunderbolt 5 mesh (direct) | ~80 Gb/s per link | Yes (macOS 26.2+) | 2–4 nodes | Cheapest, fastest, RDMA-eligible |
| Thunderbolt 5 hub | ~80 Gb/s shared | Yes | 5–8 nodes | Adds ~$300 – $500 in switching |
| 10GbE switched | 10 Gb/s | No (TCP/IP) | Mixed-platform | Use when mixing Mac + non-TB5 hardware |
| Wi-Fi 7 / 6E | 1–4 Gb/s | No | Demos only | Tanks tok/s by 50–80% in EXO benchmarks |

For 2–4 nodes, direct TB5 cabling (Mac↔Mac without a switch) is unambiguously the right call: highest bandwidth, RDMA-eligible, and roughly $80 per cable. For 5–8 nodes, the math shifts — direct mesh runs out of TB5 ports per machine, so you need either a TB5 hub or a fallback to switched 10GbE.

The 10GbE fallback is real and useful. Pair a MikroTik CRS326-24G-2S+RM ($149 – $199) with each node's onboard 10GbE (the M4 Pro Mac Mini ships with 10GbE as a build-to-order option) and you have a working cluster — slower than TB5 RDMA but still well above Wi-Fi. A UniFi Dream Machine Pro ($379 – $449) on top gives you VLAN isolation so your cluster traffic doesn't collide with household network traffic.

Avoid Wi-Fi for inference traffic. EXO Labs benchmarks show 50–80% tok/s degradation versus TB5 RDMA — at that point you've spent $11,000 on hardware and capped its useful throughput.

For x86 mini PCs as cluster nodes: Beelink SER8 ($449 – $599) and GMKtec M6 Ultra ($429 – $549) look attractive on price-per-GB-of-RAM but they don't have Thunderbolt 5 RDMA — you're stuck on 10GbE or USB4 fallback transports, which compound badly across multi-hop EXO routing. See our Mac Mini M4 Pro vs Beelink SER8 comparison for the full breakdown of why x86 mini PCs lose this specific match. Strix Halo readers: same conclusion, see the Strix Halo deep-dive.
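The transport hierarchy above can be made concrete with a network-only ceiling on decode throughput. The latency figures are rough assumptions for illustration; real numbers vary with drivers, jitter, and contention.

```python
# Network-only ceiling on tok/s for a 7-hop pipeline: per-hop cost is
# serialization time plus one-way latency. Latency values are assumed
# round figures, not vendor specs.

TRANSPORTS = {              # (bandwidth Gb/s, assumed one-way latency µs)
    "TB5 RDMA":  (80,    5),
    "10GbE TCP": (10,   80),
    "Wi-Fi 7":   (2,  2000),
}

def decode_ceiling_tps(transport, hops=7, act_bits=7168 * 16):
    bw_gbps, lat_us = TRANSPORTS[transport]
    per_hop_s = act_bits / (bw_gbps * 1e9) + lat_us * 1e-6
    return 1 / (per_hop_s * hops)   # ignores compute entirely

for name in TRANSPORTS:
    print(f"{name:10s} network ceiling ~{decode_ceiling_tps(name):,.0f} tok/s")
```

Even before any compute happens, each step down the transport ladder cuts the achievable ceiling by an order of magnitude, which is why EXO's routing preference (TB5 RDMA > 10GbE > Wi-Fi) matters so much.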

Performance — What You Actually Get (with Numbers)

The numbers below combine the EXO Labs Transparent Benchmarks series, the Akshat Rai Laddha self-hosted-Llama-70B Medium piece, the JewelsHovan GitHub gist, and r/LocalLLaMA April 2026 community submissions. Treat tier-1 (EXO Labs official) figures as authoritative; tier-2 (community) figures with mild healthy skepticism.

| Cluster | Model | Quant | tok/s | TTFT (sec) | Source |
|---|---|---|---|---|---|
| 4× Mac Mini M4 Pro 64GB | Llama 4 Maverick | Q4 | ~12–15 | ~3–5 | Community / r/LocalLLaMA |
| 4× Mac Mini M4 Pro 64GB | Llama 3 70B (dense) | Q4 | ~8–10 | ~5–8 | Community / Akshat Laddha |
| 8× Mac Mini M4 Pro 64GB | DeepSeek V3 671B | Q4 | 5.37 | ~12–18 | EXO Labs (published) |
| 8× Mac Mini M4 Pro 64GB | Llama 4 Behemoth 405B | Q4 | ~6–8 | ~10–14 | Community estimate |
| 1× Mac Studio M4 Max + 4× Mac Mini | DeepSeek V3 671B | Q4 | ~7–9 | ~8–12 | Community estimate |
| 1× DGX Spark + 1× Mac Studio M4 Max | ~70B class | Q4 | 2.8× single-device | | EXO Labs (published) |
| Single Mac Studio M4 Max 192GB | DeepSeek V3 671B | Q4 | Doesn't fit | | |
| Single RTX 5090 (32GB) | DeepSeek V3 671B | Q4 | Doesn't fit | | |

Sources: EXO Labs Transparent Benchmarks (blog.exolabs.net/day-1/) and DGX Spark hybrid benchmark (blog.exolabs.net/nvidia-dgx-spark/); EXO GitHub README (github.com/exo-explore/exo); JewelsHovan Mac Studio M3 Ultra + Mac Mini M4 Pro cluster gist; Akshat Rai Laddha "Self Hosting Llama-70B on Apple Silicon hardware with Exo and MLX"; r/LocalLLaMA April 2026 cluster threads.

The number that matters: 5.37 tok/s on DeepSeek V3 671B for an $11,000 cluster. Compare that to renting an H100 instance ($2–$4/hour, ~30 tok/s) — at 8 hours/day of use, the cluster pays for itself in 12–18 months and you keep the hardware. For builders running inference 24/7, the math is even better.
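The payback arithmetic is worth making explicit. A back-of-envelope sketch using the ranges quoted above; it ignores electricity, resale value, and the H100's higher per-hour throughput:

```python
# Break-even between buying the ~$11,000 8-node cluster and renting an
# H100 instance at the $2–$4/hr range quoted in the text. Rough math only:
# ignores power costs and the rental's higher tok/s.

def breakeven_months(cluster_cost, cloud_rate_hr, hours_per_day):
    return cluster_cost / (cloud_rate_hr * hours_per_day * 30)

for rate in (2, 3, 4):
    m = breakeven_months(11_000, rate, hours_per_day=8)
    print(f"at ${rate}/hr: cluster pays for itself in ~{m:.0f} months")
```

At the middle of the rental range the break-even lands in the article's 12–18 month window; heavier daily use shortens it further.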

For model-specific deep-dives that pair with this cluster guide, see our Llama 4 hardware guide, Qwen 3.6 hardware guide, and DeepSeek V4 Flash hardware guide for the natural follow-up workload.

Setup Walkthrough — From Boxes to OpenAI-Compatible Endpoint

This is the rough sequence. Plan a Saturday for a 4-node cluster, a long weekend for 8 nodes — most of the time is initial macOS updates and weight downloads, not EXO config.

  1. Update every node to macOS 26.2 or later. RDMA over TB5 is gated on this version. Check System Settings → General → About. Without 26.2, EXO works but RDMA does not, and your tok/s collapses to roughly 30–40% of the benchmark numbers.
  2. Install EXO on every node. brew install exo-explore/tap/exo. The Homebrew tap is published by EXO Labs; the GitHub repo at github.com/exo-explore/exo is the source of truth.
  3. Wire the Thunderbolt 5 mesh. For 2–4 nodes, direct TB5 cables in a ring or square. For 5–8 nodes, a TB5 hub or 10GbE switch. Verify TB5 — not TB4 — on every cable; a single TB4 link drops the cluster to TB4 speeds end-to-end.
  4. Pre-stage model weights on each node's local SSD. A Samsung 990 Pro 4TB ($289 – $339) per node holds DeepSeek V3 quantized plus 2–3 backup models. Don't skip this — loading weights over the network on first inference adds 10+ minutes and is the #1 source of "my cluster is broken" Reddit posts.
  5. Run exo on every node. EXO auto-discovers peers on the same TB5/network segment. The first node started becomes the bootstrap; the others advertise themselves and EXO partitions the model based on advertised memory.
  6. Sanity-check on a small model first. Point a curl at http://<head>:8000/v1/chat/completions with a 7B model loaded. You should see ~30+ tok/s — clusters always run small models faster than they need to. If you see <5 tok/s, you have a transport problem (TB4 cable, RDMA disabled, network firewall).
  7. Scale up to your target model. Issue an EXO model-load command for DeepSeek V3 / Llama 4 Maverick / your choice. Watch exo status — every node should report a model shard before first inference.
  8. Point your client at the head node. Any OpenAI-API-compatible client works: Continue.dev, Open WebUI, your own scripts. The endpoint is http://<head>:8000/v1.
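Step 8 can be done with nothing but the standard library. A minimal sketch; the endpoint path follows the OpenAI chat-completions convention the text describes, and the host and model names below are placeholders for whatever your cluster reports.

```python
# Minimal client for the cluster's OpenAI-compatible endpoint (step 8).
# Host and model names are placeholders; substitute your head node's
# address and the model EXO has loaded.
import json
import urllib.request

def build_request(host, model, prompt):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"http://{host}:8000/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def ask(host, model, prompt):
    with urllib.request.urlopen(build_request(host, model, prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Uncomment once your cluster is up:
# print(ask("mac-head.local", "deepseek-v3", "Say hello in five words."))
```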

For deeper EXO setup walkthroughs, the JewelsHovan GitHub gist ("Mac Studio M3 Ultra + Mac Mini M4 Pro Cluster Deep Dive") and the EXO repo README are the two best ground-truth references as of this writing.

When the Cluster Beats a Single GPU — and When It Doesn't

The cluster is not a universal answer. It's a specific answer to a specific question. Here's the decision rule.

| Workload | Cluster wins | Single GPU wins | Why |
|---|---|---|---|
| MoE models >100GB weights | ✅ | | Sparse activation = low inter-node bandwidth need |
| Dense models ≤32B | | ✅ | Single GPU has the VRAM and the compute density |
| Dense models 70B–405B | It's complicated | It's complicated | Bandwidth-bound; cluster works but loses to multi-GPU |
| Batch serving / many concurrent users | | ✅ | GPU tensor parallelism scales with batch size; cluster doesn't |
| Training / fine-tuning | | ✅ | Backprop needs much more bandwidth than inference |
| Image generation (SDXL, Flux) | | ✅ | Compute-bound; CUDA ecosystem dominates |
| Video generation (Sora-class) | | ✅ | Same as image, more so |
| Latency-tolerant batch-1 chat | ✅ | | Cluster's single-stream tok/s is "good enough" |
| No-budget-for-H100 frontier-MoE serving | ✅✅✅ | | This is the cluster's reason to exist |

If your workload lives in the "single GPU wins" column, skip the cluster and put the budget into a single-GPU build instead; the guides linked throughout this article cover those paths.

The Bottom Line — Should You Build One?

The decision rule is simple: if your target model is MoE and weights exceed 100GB, the cluster math beats single-GPU math today. Otherwise, single-GPU is cheaper, faster, simpler.
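The rule is mechanical enough to write down. Thresholds mirror the text; "weights_gb" means the quantized on-disk size of the model you actually plan to run:

```python
# The article's decision rule as a function. weights_gb is the quantized
# on-disk size (e.g. DeepSeek V3 671B at Q4 ~ 340GB).

def cluster_or_gpu(is_moe, weights_gb):
    if is_moe and weights_gb > 100:
        return "cluster"
    return "single GPU"

print(cluster_or_gpu(True, 340))   # frontier MoE -> cluster
print(cluster_or_gpu(False, 40))   # dense 70B @ Q4 -> single GPU
```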

Per-persona verdict:

  • The MoE-curious hobbyist (priced out of an H100): Build the 8× Mac Mini M4 Pro sweet-spot tier. ~$11,000, runs DeepSeek V3 671B and any future MoE in the 400B–700B class. This is the use case the cluster pattern was invented for.
  • The Mac power user with one Mac Studio: Build the premium tier — keep your existing Mac Studio M4 Max as the head node, add 4× Mac Minis as workers. Net-new cost ~$5,600–$6,500. You extend, not replace.
  • The hybrid builder with an existing RTX 5090 PC: Build the hybrid tier — add one Mac Studio M4 Max to your existing rig, run EXO disaggregated inference. ~$2,000–$6,000 net-new for a 2.8×-class throughput gain on MoE workloads. See our Mac Studio M4 Max vs RTX 5090 comparison for the standalone case before you decide to combine them.
  • The dense-model-only builder: Skip the cluster. Buy an RTX 5090 ($1,999 – $2,199) or build a dual-GPU rig per our multi-GPU guide. The cluster's bandwidth math doesn't favor you.

One caveat worth restating: the headline 5.37 tok/s number is for a specific model (DeepSeek V3 671B, Q4) on a specific topology (8× M4 Pro, TB5 RDMA mesh, macOS 26.2). Your tok/s on Llama 4 Maverick will be different. Your tok/s on a future MoE released next quarter will be different again. The cluster pattern is durable; the exact numbers move.

For builders ready to start: read the EXO repo, order four Mac Minis, and you'll be serving 70B inference from your home network within a week. For builders not ready: bookmark this guide, watch EXO Labs' Transparent Benchmarks for the next set of numbers, and revisit when your target model becomes a Mixture-of-Experts. The architecture trend — sparse activation, large total weights — points strongly in this direction. The cluster pattern is going to keep mattering.

