Mac Mini Cluster for Local AI 2026 — Run 70B+ Models with EXO and Thunderbolt 5 RDMA
macOS 26.2 added kernel-level RDMA over Thunderbolt 5 and EXO 1.0 shipped day-0 support — turning a stack of M4 Pro Mac Minis into the cheapest practical way to run DeepSeek V3 671B and Llama 4 Maverick at home. Per-tier shopping list, real benchmarks, and a clear decision rule.
Compute Market Team
For two years the answer to "what's the cheapest way to run a frontier MoE model at home?" was "you can't — buy an H100 or rent the cloud." That answer changed in April 2026.
macOS 26.2 added kernel-level RDMA over Thunderbolt 5. EXO 1.0 shipped day-0 support. And suddenly four to eight $1,399 Mac Minis chained together became the cheapest practical way to serve a 671B-parameter model from your desk. The pattern took over r/LocalLLaMA in a single weekend.
This guide turns the Reddit-shaped tribal knowledge into a single buying decision. We cover what EXO is, why Thunderbolt 5 RDMA matters, the four cluster recipes that actually convert dollars to tokens-per-second, and the clear rule for when a cluster beats a single GPU and when it doesn't.
Why People Are Suddenly Building Mac Mini Clusters in 2026
In April 2026, macOS 26.2 added kernel-level RDMA over Thunderbolt 5, and EXO 1.0 added day-0 support — together making a cluster of 4–8 M4 Pro Mac Minis the cheapest practical way to run frontier MoE models like DeepSeek V3 671B or Llama 4 Maverick at home. A published 8× M4 Pro Mac Mini cluster runs DeepSeek V3 671B at 5.37 tokens per second, and a hybrid 1× NVIDIA DGX Spark + 1× Mac Studio M4 Max setup delivers 2.8× the throughput of either device alone via disaggregated inference. For local-AI builders priced out of an H100, the cluster has replaced the single flagship GPU as the right answer at the >100GB-of-weights tier.
Three things had to happen for this to work:
- The right model architecture. DeepSeek V3 (671B), Llama 4 Maverick (400B / 40B active), and Qwen 3.6 are all Mixture-of-Experts models — only a small fraction of parameters compute per token. A cluster can hold the full weight set distributed across nodes and only ship the activations of the active experts between them. That's why Thunderbolt-class bandwidth is enough (the back-of-envelope sketch after this list shows the arithmetic).
- The right transport. Pre-26.2, EXO worked, but inter-node latency over standard Thunderbolt was the bottleneck on every benchmark. Kernel-level RDMA closes that gap — EXO Labs reports ~99% latency reduction versus the previous transport.
- The right price point. A 64GB M4 Pro Mac Mini starts at $1,399. Eight of them cost less than a single H100 PCIe ($25,000 – $33,000), and they collectively pack 512GB of unified memory — enough to hold a quantized 671B model plus context.
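To put numbers on that first point, here's a back-of-envelope sketch. The hidden-state size and precision are illustrative assumptions (roughly DeepSeek V3-class, not EXO-published figures); the point is the order of magnitude of data crossing the links per generated token.

```python
# Back-of-envelope: per-token inter-node traffic for pipeline-parallel MoE inference.
# hidden_dim and precision are illustrative assumptions, not EXO-published numbers.
hidden_dim = 7_168           # assumed hidden size, roughly DeepSeek V3-class
bytes_per_value = 2          # bf16 activations
hops = 7                     # activations cross 7 node boundaries on an 8-node pipeline

per_hop_bytes = hidden_dim * bytes_per_value      # ~14 KB per token per hop
per_token_bytes = per_hop_bytes * hops            # ~98 KB per generated token

tb5_bytes_per_sec = 80e9 / 8                      # 80 Gb/s ≈ 10 GB/s per TB5 link
transfer_ms = per_token_bytes / tb5_bytes_per_sec * 1_000
decode_budget_ms = 1_000 / 5.37                   # ~186 ms per token at the published 5.37 tok/s

print(f"~{per_token_bytes / 1024:.0f} KB moved per token, "
      f"{transfer_ms:.3f} ms on the wire vs a ~{decode_budget_ms:.0f} ms decode budget")
```

The raw bytes are trivial next to the decode budget. What hurt before macOS 26.2 was the per-message latency of crossing the kernel on every hop, which is exactly what RDMA removes.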
For broader context on what hardware to choose for local LLMs, see our Local LLM Hardware Hub. For the single-machine multi-GPU sister case, see our multi-GPU local LLM setup guide.
What EXO Actually Is (and How It's Different from Multi-GPU)
EXO is open-source software from EXO Labs that partitions a model across heterogeneous devices — Mac, NVIDIA, Linux boxes — and routes activations over the fastest available link (Thunderbolt 5 RDMA > 10GbE > Wi-Fi). It exposes an OpenAI-compatible HTTP endpoint, so any client built for the OpenAI API works against an EXO cluster unchanged.
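Because the endpoint speaks the OpenAI wire format, pointing an existing client at the cluster is a one-line change. A minimal sketch with the official openai Python package follows; the host name and model id are placeholders for whatever your head node and loaded model actually are, and the API key is a dummy since a local endpoint typically ignores it.

```python
from openai import OpenAI

# Point a standard OpenAI client at the EXO head node instead of api.openai.com.
# "exo-head.local" and "deepseek-v3" are placeholders for your own setup.
client = OpenAI(base_url="http://exo-head.local:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-v3",
    messages=[{"role": "user", "content": "Explain RDMA over Thunderbolt 5 in two sentences."}],
)
print(response.choices[0].message.content)
```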
The contrast with single-machine multi-GPU is sharp:
| Dimension | Single-machine multi-GPU | EXO cluster |
|---|---|---|
| Memory pooled across | 2–4 GPUs in one chassis | 2–8 nodes over a network |
| Interconnect | NVLink / PCIe (32–600 GB/s) | TB5 RDMA / 10GbE (1–10 GB/s) |
| Best for | Dense models, batch serving | MoE models, latency-tolerant |
| Heterogeneous hardware | Limited (matched GPUs ideal) | First-class (Apple + NVIDIA + Linux) |
| Failure domain | One machine | One node — others keep serving |
The other key insight: only 37B of DeepSeek V3's 671B parameters compute per token. Llama 4 Maverick is 400B total but routes through ~40B per token. That's why sharding a MoE model across nodes works at all — the cluster is mostly storing weights, not constantly shipping intermediate state. Dense 405B models like Llama 3.1 405B are a different animal and don't cluster nearly as well.
EXO uses pipeline-parallel partitioning by default — sequential layers assigned to different nodes — and falls back to tensor-parallel when the topology supports it. For a deeper read on partitioning strategies, our multi-GPU guide covers the underlying mechanics.
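As a rough illustration of what memory-weighted pipeline partitioning looks like (a simplified sketch, not EXO's actual placement algorithm), contiguous layer ranges get assigned in proportion to each node's advertised memory:

```python
def partition_layers(num_layers: int, node_memory_gb: list[float]) -> list[range]:
    """Assign contiguous layer ranges to nodes in proportion to each node's memory.

    Simplified sketch of memory-weighted pipeline parallelism; EXO's real
    partitioner also weighs link topology and can switch to tensor parallelism.
    """
    total_mem = sum(node_memory_gb)
    shares, start = [], 0
    for i, mem in enumerate(node_memory_gb):
        # The last node absorbs rounding so no layer is dropped or duplicated.
        if i == len(node_memory_gb) - 1:
            count = num_layers - start
        else:
            count = round(num_layers * mem / total_mem)
        shares.append(range(start, start + count))
        start += count
    return shares

# 61 transformer layers (DeepSeek V3-sized) across a Mac Studio head node and four Mac Minis:
print(partition_layers(61, [192, 64, 64, 64, 64]))
# [range(0, 26), range(26, 35), range(35, 44), range(44, 53), range(53, 61)]
```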
"EXO is built on the assumption that the next decade of model architectures will be MoE-dominated and that home builders will accumulate heterogeneous hardware over time, not buy matched fleets. Both of those assumptions favor distributed inference over scale-up." — paraphrased from the EXO Labs Transparent Benchmarks series, blog.exolabs.net/day-1/
The macOS 26.2 RDMA Breakthrough
The single change that converted Mac Mini clustering from "fun science project" to "actually usable" is the RDMA-over-Thunderbolt-5 capability that landed in macOS 26.2 in April 2026.
RDMA — Remote Direct Memory Access — lets one node read or write another node's memory without going through the kernel's TCP/IP stack on either end. The data path bypasses the OS, kernel buffer copies disappear, and tail latency collapses. RDMA is the reason InfiniBand has dominated HPC and AI training clusters for two decades.
What macOS 26.2 changed:
- Pre-26.2: EXO worked over Thunderbolt, but every activation transfer paid a kernel-traversal tax. On 8-node clusters, this was the dominant cost — bigger than compute on each node.
- Post-26.2: Kernel-level RDMA over Thunderbolt 5. EXO Labs benchmarks report ~99% inter-node latency reduction on the same hardware. // NEEDS VERIFICATION — figure cited from EXO Labs blog; benchmark methodology not independently audited.
- Compatible silicon: Any Mac with Thunderbolt 5 — M4 Pro, M4 Max, M3 Ultra. The base M4 Mac Mini does not have TB5 and is excluded.
The practical effect: the same 8× Mac Mini cluster that struggled to hit 1 tok/s on DeepSeek V3 in March 2026 hit 5.37 tok/s in April. That moves a cluster from "demo" to "I can actually use this for real work."
For Mac power users wondering how this compares to a Mac Studio M4 Max with NVIDIA, see our DGX Spark vs Mac Studio deep-dive — that benchmark was run on the same hybrid EXO topology this guide describes.
Cluster Recipes — What to Buy at Each Tier
Four recipes, ordered by total cost. Each one is a complete bill of materials — you should be able to copy any tier into your cart and have a functional cluster on arrival. Pull current prices from the linked product pages; tier totals below are rough brackets, not promises.
Entry tier — 4× M4 Pro Mac Mini, ~$5,600 (runs Llama 4 Maverick at ~12–15 tok/s)
| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Mini M4 Pro 64GB | 4 | $1,399 – $1,599 each | The cluster's bread and butter |
| Thunderbolt 5 cables (1m) | 3 | ~$80 each | Direct mesh, no hub needed at 4 nodes |
| Samsung 990 Pro 4TB NVMe (external) | 1–2 | $289 – $339 each | For shared model storage; optional if you keep weights on each Mac's internal SSD |
What it runs: 256GB of pooled unified memory across four nodes is enough for Llama 4 Maverick at 4-bit quant with comfortable context, plus dense 70B-class models like Llama 3 70B. Expected throughput: ~12–15 tokens per second on Maverick, ~8–10 tok/s on dense 70B.
Topology: Three TB5 cables form the mesh — Mac A↔Mac B, Mac C↔Mac D, Mac A↔Mac C — with the fourth edge (Mac B↔Mac D) filled by daisy-chained TB5 (TB5 supports up to 6 daisy-chained devices per port).
Why this tier exists: If you've been on the fence about buying a single Mac Mini M4 Pro for AI work (see our single Mac Mini vs RTX 5060 Ti review), buying four of them is the natural next step once you accept that 70B is your target floor.
Sweet spot — 8× M4 Pro Mac Mini, ~$11,000 (runs DeepSeek V3 671B at the published 5.37 tok/s)
| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Mini M4 Pro 64GB | 8 | $1,399 – $1,599 each | The published EXO Labs benchmark configuration |
| Thunderbolt 5 hub or 4–6 TB5 cables | 1 hub or 6 cables | ~$300–$500 | At 8 nodes, a hub simplifies cabling |
| Samsung 990 Pro 4TB NVMe per node | 1 per node (8) | $289 – $339 each | Local weight storage on each Mac to avoid network model loading |
| UniFi Dream Machine Pro | 1 | $379 – $449 | Cluster network controller, VLAN isolation |
What it runs: 512GB pooled unified memory — the published EXO Labs benchmark target for DeepSeek V3 671B at 5.37 tok/s. Also runs Llama 4 Behemoth 405B, Qwen 3.6, and any future MoE model in the 400B–700B class without further hardware changes.
Why 8 is the magic number: 8 × 64GB = 512GB unified, which holds DeepSeek V3 at 4-bit quant (~340GB weights) plus context, KV cache, and EXO overhead with headroom. Drop to 6 nodes (384GB pooled) and you're left with only ~44GB over the weights, which is too tight for full context.
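The same arithmetic, spelled out. The weight figure is the ~340GB from above; the KV-cache-plus-overhead number is a placeholder, since the real figure depends on context length and how EXO shards the cache.

```python
# Memory headroom for Q4 DeepSeek V3 (~340GB weights) on 6 vs 8 Mac Mini nodes.
# kv_and_overhead_gb is a placeholder; actual KV size depends on context length.
weights_gb = 340
kv_and_overhead_gb = 100

for nodes in (6, 8):
    pooled_gb = nodes * 64
    headroom_gb = pooled_gb - weights_gb
    verdict = "fits" if headroom_gb >= kv_and_overhead_gb else "too tight"
    print(f"{nodes} nodes: {pooled_gb}GB pooled, {headroom_gb}GB over weights -> {verdict}")
# 6 nodes: 384GB pooled, 44GB over weights -> too tight
# 8 nodes: 512GB pooled, 172GB over weights -> fits
```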
Cabling at this tier: Direct point-to-point TB5 mesh stops being practical past 4 nodes — you run out of TB5 ports per machine. Either daisy-chain through TB5 hubs or fall back to 10GbE for the longer links. EXO's routing algorithm prefers TB5 RDMA but tolerates mixed topologies.
Premium tier — Mac Studio head node + 4× Mac Mini workers, ~$14,000
| Component | Qty | Price | Notes |
|---|---|---|---|
| Mac Studio M4 Max 192GB | 1 | $1,999 – $5,999 | Head node; the big-memory single device |
| Mac Mini M4 Pro 64GB | 4 | $1,399 – $1,599 each | Worker nodes |
| Thunderbolt 5 cables / hub | 1 hub | ~$400 | Star topology with Mac Studio at center |
| Samsung 990 Pro 4TB NVMe per node | 1 per node (5) | $289 – $339 each | Per-node weight storage |
What it runs: 192GB on the head plus 256GB on workers = 448GB total, enough for 671B at slightly tighter quant or 405B with luxurious context. The Mac Studio's M4 Max chip pulls more compute per watt than the M4 Pro, so it makes a great head node — handles prefill (which is compute-bound), then hands off decode (memory-bandwidth-bound) to the worker mesh.
Why this tier exists: If you already own a Mac Studio M4 Max and have hit its 192GB ceiling, this is the upgrade path. You don't throw away your existing investment — you extend it. See our Mac Mini M4 Pro vs Mac Studio M4 Max comparison for the head-node-vs-worker decision logic.
Hybrid tier — RTX 5090 box + Mac Studio M4 Max (NVIDIA + Apple in one cluster)
| Component | Qty | Price | Notes |
|---|---|---|---|
| NVIDIA GeForce RTX 5090 + host PC | 1 | $1,999 – $2,199 (GPU) | Prefill/compute node; runs Linux or Windows + EXO |
| Mac Studio M4 Max 192GB | 1 | $1,999 – $5,999 | Decode/memory node; large unified memory pool |
| 10GbE link or USB4/TB shared bridge | 1 | ~$200 | EXO doesn't require TB5 for cross-platform links |
What it runs: This is the configuration EXO Labs benchmarked as the 2.8× hybrid: a single NVIDIA DGX Spark paired with a Mac Studio M4 Max delivered ~2.8× the throughput of either device running standalone. The pattern works because prefill (parallelizable, compute-heavy) lives on the GPU, while decode (sequential, memory-bandwidth-bound) lives on Apple Silicon's unified memory. EXO calls this disaggregated inference.
The RTX 5090 substitutes cleanly for DGX Spark in this topology — the pattern is GPU-for-prefill + Mac-for-decode, and any modern compute-rich GPU paired with a memory-rich Mac sees the same shape of speedup. // NEEDS VERIFICATION — 2.8× figure is for the published DGX Spark + Mac Studio configuration; RTX 5090 substitution is the same topology but the exact multiplier depends on RTX 5090's prefill compute vs DGX Spark's.
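For readers who want the shape of disaggregated inference in code, here's an illustrative sketch. The stub functions stand in for the work EXO schedules on each node (this is not EXO's actual API): the full prompt is prefilled once on the compute-rich node, which hands the KV cache to the memory-rich node for the sequential decode loop.

```python
# Illustrative shape of disaggregated inference; the two stubs stand in for
# work EXO would schedule on each node. This is not EXO's internal API.

def prefill_on_gpu(prompt_tokens: list[int]) -> list[tuple[int, int]]:
    """Compute-heavy: process the whole prompt in parallel, build the KV cache."""
    return [(t, t) for t in prompt_tokens]           # stub KV cache

def decode_step_on_mac(kv_cache: list[tuple[int, int]], last_token: int) -> int:
    """Bandwidth-heavy: one sequential step that reads all weights plus the cache."""
    kv_cache.append((last_token, last_token))        # stub cache update
    return (last_token + 1) % 50_000                 # stub next-token choice

def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    kv_cache = prefill_on_gpu(prompt_tokens)         # runs on the RTX 5090 / DGX Spark
    out, token = [], prompt_tokens[-1]
    for _ in range(max_new_tokens):                  # runs on the Mac Studio
        token = decode_step_on_mac(kv_cache, token)
        out.append(token)
    return out

print(generate([101, 102, 103], max_new_tokens=5))
```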
Why this tier exists: If you already own an RTX 5090 PC (see our RTX 5090 vs Mac Studio M4 Max breakdown), adding a single Mac Studio M4 Max gives you MoE capacity without buying another GPU. The Mac Studio's 192GB unified memory is cheaper per-GB than NVIDIA VRAM by an order of magnitude, and EXO routes around the bandwidth gap.
Networking — Thunderbolt 5 Mesh vs 10GbE vs Wi-Fi
The interconnect choice dominates cluster performance once you have working nodes. Here's the decision matrix.
| Transport | Bandwidth | RDMA? | Best for | Notes |
|---|---|---|---|---|
| Thunderbolt 5 mesh (direct) | ~80 Gb/s per link | Yes (macOS 26.2+) | 2–4 nodes | Cheapest, fastest, RDMA-eligible |
| Thunderbolt 5 hub | ~80 Gb/s shared | Yes | 5–8 nodes | Adds ~$300–$500 in switching |
| 10GbE switched | 10 Gb/s | No (TCP/IP) | Mixed-platform | Use when mixing Mac + non-TB5 hardware |
| Wi-Fi 7 / 6E | 1–4 Gb/s | No | Demos only | Tanks tok/s by 50–80% in EXO benchmarks |
For 2–4 nodes, direct TB5 cabling (Mac↔Mac without a switch) is unambiguously the right call: highest bandwidth, RDMA-eligible, and roughly $80 per cable. For 5–8 nodes, the math shifts — direct mesh runs out of TB5 ports per machine, so you need either a TB5 hub or a fallback to switched 10GbE.
The 10GbE fallback is real and useful. Pair a MikroTik CRS326-24G-2S+RM ($149 – $199) with each node's onboard 10GbE (the M4 Pro Mac Mini ships with 10GbE as a build-to-order option) and you have a working cluster — slower than TB5 RDMA but still well above Wi-Fi. A UniFi Dream Machine Pro ($379 – $449) on top gives you VLAN isolation so your cluster traffic doesn't collide with household network traffic.
Avoid Wi-Fi for inference traffic. EXO Labs benchmarks show 50–80% tok/s degradation versus TB5 RDMA — at that point you've spent $11,000 on hardware and capped its useful throughput.
For x86 mini PCs as cluster nodes: Beelink SER8 ($449 – $599) and GMKtec M6 Ultra ($429 – $549) look attractive on price-per-GB-of-RAM but they don't have Thunderbolt 5 RDMA — you're stuck on 10GbE or USB4 fallback transports, which compound badly across multi-hop EXO routing. See our Mac Mini M4 Pro vs Beelink SER8 comparison for the full breakdown of why x86 mini PCs lose this specific match. Strix Halo readers: same conclusion, see the Strix Halo deep-dive.
Performance — What You Actually Get (with Numbers)
The numbers below combine the EXO Labs Transparent Benchmarks series, the Akshat Rai Laddha self-hosted-Llama-70B Medium piece, the JewelsHovan GitHub gist, and r/LocalLLaMA April 2026 community submissions. Treat tier-1 (EXO Labs official) figures as authoritative; tier-2 (community) figures with mild healthy skepticism.
| Cluster | Model | Quant | tok/s | TTFT (sec) | Source |
|---|---|---|---|---|---|
| 4× Mac Mini M4 Pro 64GB | Llama 4 Maverick | Q4 | ~12–15 | ~3–5 | Community / r/LocalLLaMA |
| 4× Mac Mini M4 Pro 64GB | Llama 3 70B (dense) | Q4 | ~8–10 | ~5–8 | Community / Akshat Laddha |
| 8× Mac Mini M4 Pro 64GB | DeepSeek V3 671B | Q4 | 5.37 | ~12–18 | EXO Labs (published) |
| 8× Mac Mini M4 Pro 64GB | Llama 4 Behemoth 405B | Q4 | ~6–8 | ~10–14 | Community estimate |
| 1× Mac Studio M4 Max + 4× Mac Mini | DeepSeek V3 671B | Q4 | ~7–9 | ~8–12 | Community estimate |
| 1× DGX Spark + 1× Mac Studio M4 Max | ~70B class | Q4 | 2.8× single-device | — | EXO Labs (published) |
| Single Mac Studio M4 Max 192GB | DeepSeek V3 671B | Q4 | Doesn't fit | — | — |
| Single RTX 5090 (32GB) | DeepSeek V3 671B | Q4 | Doesn't fit | — | — |
Sources: EXO Labs Transparent Benchmarks (blog.exolabs.net/day-1/) and DGX Spark hybrid benchmark (blog.exolabs.net/nvidia-dgx-spark/); EXO GitHub README (github.com/exo-explore/exo); JewelsHovan Mac Studio M3 Ultra + Mac Mini M4 Pro cluster gist; Akshat Rai Laddha "Self Hosting Llama-70B on Apple Silicon hardware with Exo and MLX"; r/LocalLLaMA April 2026 cluster threads.
The number that matters: 5.37 tok/s on DeepSeek V3 671B for an $11,000 cluster. Compare that to renting an H100 instance ($2–$4/hour, ~30 tok/s) — at 8 hours/day of use, the cluster pays for itself in 12–18 months and you keep the hardware. For builders running inference 24/7, the math is even better.
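The break-even arithmetic behind that claim, using the article's own numbers (the hourly rate and daily usage are assumptions you should swap for your own):

```python
# Break-even for an ~$11,000 cluster vs renting an H100 at $2-$4/hour, 8 hours/day.
cluster_cost_usd = 11_000
hours_per_day = 8

for hourly_rate in (2, 3, 4):
    daily_rental = hourly_rate * hours_per_day
    months = cluster_cost_usd / daily_rental / 30
    print(f"${hourly_rate}/hr -> ${daily_rental}/day -> break-even in ~{months:.0f} months")
# $2/hr -> $16/day -> break-even in ~23 months
# $3/hr -> $24/day -> break-even in ~15 months
# $4/hr -> $32/day -> break-even in ~11 months
```

The 12–18-month bracket above sits in the middle of that rental-rate range. The rented H100 is still roughly five times faster per stream on this model, but the capital converts to hardware you keep.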
For model-specific deep-dives that pair with this cluster guide, see our Llama 4 hardware guide, Qwen 3.6 hardware guide, and DeepSeek V4 Flash hardware guide for the natural follow-up workload.
Setup Walkthrough — From Boxes to OpenAI-Compatible Endpoint
This is the rough sequence. Plan a Saturday for a 4-node cluster, a long weekend for 8 nodes — most of the time is initial macOS updates and weight downloads, not EXO config.
- Update every node to macOS 26.2 or later. RDMA over TB5 is gated on this version. Check System Settings → General → About. Without 26.2, EXO works but RDMA does not, and your tok/s collapses to roughly 30–40% of the benchmark numbers.
- Install EXO on every node: `brew install exo-explore/tap/exo`. The Homebrew tap is published by EXO Labs; the GitHub repo at github.com/exo-explore/exo is the source of truth.
- Wire the Thunderbolt 5 mesh. For 2–4 nodes, direct TB5 cables in a ring or square. For 5–8 nodes, a TB5 hub or 10GbE switch. Verify TB5 — not TB4 — on every cable; a single TB4 link drops the cluster to TB4 speeds end-to-end.
- Pre-stage model weights on each node's local SSD. A Samsung 990 Pro 4TB ($289 – $339) per node holds DeepSeek V3 quantized plus 2–3 backup models. Don't skip this — loading weights over the network on first inference adds 10+ minutes and is the #1 source of "my cluster is broken" Reddit posts.
- Run `exo` on every node. EXO auto-discovers peers on the same TB5/network segment. The first node started becomes the bootstrap; the others advertise themselves and EXO partitions the model based on advertised memory.
- Sanity-check on a small model first. Point a curl (or the request sketch after this list) at `http://<head>:8000/v1/chat/completions` with a 7B model loaded. You should see ~30+ tok/s — clusters always run small models faster than they need to. If you see <5 tok/s, you have a transport problem (TB4 cable, RDMA disabled, network firewall).
- Scale up to your target model. Issue an EXO model-load command for DeepSeek V3 / Llama 4 Maverick / your choice. Watch `exo status` — every node should report a model shard before first inference.
- Point your client at the head node. Any OpenAI-API-compatible client works: Continue.dev, Open WebUI, your own scripts. The endpoint is `http://<head>:8000/v1`.
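A minimal throughput check for the sanity-check step. It assumes the head node answers on port 8000, that a small model is already loaded, and that the server reports OpenAI-style usage counts; the host and model id are placeholders.

```python
import time
import requests

# Rough single-request throughput check against the EXO head node's
# OpenAI-compatible endpoint. Host and model id are placeholders.
URL = "http://exo-head.local:8000/v1/chat/completions"
payload = {
    "model": "llama-3-8b",   # whatever small model the cluster has loaded
    "messages": [{"role": "user", "content": "Count from 1 to 50, comma-separated."}],
    "max_tokens": 200,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
elapsed = time.time() - start

tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
print(f"{tokens} completion tokens in {elapsed:.1f}s ≈ {tokens / elapsed:.1f} tok/s")
```

The timing includes prompt prefill, so it slightly understates steady-state decode speed; treat it as a go/no-go check, not a benchmark.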
For deeper EXO setup walkthroughs, the JewelsHovan GitHub gist ("Mac Studio M3 Ultra + Mac Mini M4 Pro Cluster Deep Dive") and the EXO repo README are the two best ground-truth references as of this writing.
When the Cluster Beats a Single GPU — and When It Doesn't
The cluster is not a universal answer. It's a specific answer to a specific question. Here's the decision rule.
| Workload | Cluster wins | Single GPU wins | Why |
|---|---|---|---|
| MoE models >100GB weights | ✅ | — | Sparse activation = low inter-node bandwidth need |
| Dense models ≤32B | — | ✅ | Single GPU has the VRAM and the compute density |
| Dense models 70B–405B | It's complicated | It's complicated | Bandwidth-bound; cluster works but loses to multi-GPU |
| Batch serving / many concurrent users | — | ✅ | GPU tensor parallelism scales with batch size; cluster doesn't |
| Training / fine-tuning | — | ✅ | Backprop needs much more bandwidth than inference |
| Image generation (SDXL, Flux) | — | ✅ | Compute-bound; CUDA ecosystem dominates |
| Video generation (Sora-class) | — | ✅ | Same as image, more so |
| Latency-tolerant batch-1 chat | ✅ | — | Cluster's single-stream tok/s is "good enough" |
| No-budget-for-H100 frontier-MoE serving | ✅✅✅ | — | This is the cluster's reason to exist |
If your workload lives in the "single GPU wins" column, head to one of these guides instead:
- Image / video generation, fine-tuning: RTX Pro 6000 96GB review or RTX 5090 vs Mac Studio.
- Single-machine multi-GPU for dense 70B: Multi-GPU local LLM setup guide.
- Just one Mac, just LLMs: Mac Mini alternatives or the mini PC for AI hub.
- Building a single-box AI server: Home AI server build guide.
The Bottom Line — Should You Build One?
The decision rule is simple: if your target model is MoE and weights exceed 100GB, the cluster math beats single-GPU math today. Otherwise, single-GPU is cheaper, faster, simpler.
Per-persona verdict:
- The MoE-curious hobbyist (priced out of an H100): Build the 8× Mac Mini M4 Pro sweet-spot tier. ~$11,000, runs DeepSeek V3 671B and any future MoE in the 400B–700B class. This is the use case the cluster pattern was invented for.
- The Mac power user with one Mac Studio: Build the premium tier — keep your existing Mac Studio M4 Max as the head node, add 4× Mac Minis as workers. Net-new cost ~$5,600–$6,500. You extend, not replace.
- The hybrid builder with an existing RTX 5090 PC: Build the hybrid tier — add one Mac Studio M4 Max to your existing rig, run EXO disaggregated inference. ~$2,000–$6,000 net-new for a 2.8×-class throughput gain on MoE workloads. See our Mac Studio M4 Max vs RTX 5090 comparison for the standalone case before you decide to combine them.
- The dense-model-only builder: Skip the cluster. Buy an RTX 5090 ($1,999 – $2,199) or build a dual-GPU rig per our multi-GPU guide. The cluster's bandwidth math doesn't favor you.
One caveat worth restating: the headline 5.37 tok/s number is for a specific model (DeepSeek V3 671B, Q4) on a specific topology (8× M4 Pro, TB5 RDMA mesh, macOS 26.2). Your tok/s on Llama 4 Maverick will be different. Your tok/s on a future MoE released next quarter will be different again. The cluster pattern is durable; the exact numbers move.
For builders ready to start: read the EXO repo, order four Mac Minis, and you'll be serving 70B inference from your home network within a week. For builders not ready: bookmark this guide, watch EXO Labs' Transparent Benchmarks for the next set of numbers, and revisit when your target model becomes a Mixture-of-Experts. The architecture trend — sparse activation, large total weights — points strongly in this direction. The cluster pattern is going to keep mattering.