AMD Strix Halo Mini PCs: The Best 128 GB Machines for Running Local AI in 2026
Strix Halo mini PCs pack 128 GB of unified memory into a sub-3-liter chassis — running 70B+ parameter models that no 16 GB discrete GPU can touch. Here's every model compared, with LLM benchmarks, a Mac Studio head-to-head, and a practical setup guide.
Compute Market Team
In March 2026, a new class of hardware is reshaping local AI: AMD Strix Halo mini PCs. These sub-3-liter machines pack up to 128 GB of unified LPDDR5X memory — with up to 96 GB directly addressable by the integrated GPU. That means you can run 70B+ parameter LLMs that no 16 GB or even 32 GB discrete GPU can touch, in a box that sits quietly on your desk and draws under 120W.
The key insight is simple: for large language model inference, memory capacity matters more than raw GPU speed. A model that doesn't fit in VRAM doesn't run — period. Strix Halo solves this problem at a price point ($1,499–$2,500) that undercuts a comparable Mac Studio M4 Max by $1,500–$2,000.
This guide compares every Strix Halo mini PC you can buy right now, benchmarks them on real LLM workloads, and tells you exactly who should buy one — and who's better served by a discrete GPU build or Mac.
What Is AMD Strix Halo and Why Does It Matter for Local AI?
AMD's Ryzen AI Max+ 395 (codenamed "Strix Halo") is the most memory-rich consumer processor ever built. It's not a GPU. It's not a CPU. It's an APU — a single package with everything integrated:
- 16 Zen 5 CPU cores (32 threads) — competitive with desktop Ryzen 9 chips
- 40 RDNA 3.5 compute units — more shader hardware than a desktop Radeon RX 7600 (32 CUs), though short of an RX 7800 XT (60 CUs)
- XDNA 2 NPU — 50 TOPS for Windows AI features (less relevant for LLM inference)
- Up to 128 GB LPDDR5X unified memory — shared between CPU and GPU, with up to 96 GB allocatable to the GPU partition
- 256-bit memory bus — delivering approximately 218 GB/s of memory bandwidth
The architecture that makes this revolutionary for local AI is unified memory. On a traditional PC, the CPU has system RAM and the GPU has its own VRAM — and these are separate pools. An RTX 5090 has 32 GB of GDDR7 VRAM. If your model is 40 GB, it doesn't fit. End of story.
Strix Halo eliminates this wall. The CPU and GPU share the same physical memory pool, and you can allocate most of it to the GPU. A 128 GB Strix Halo system with 96 GB allocated to the GPU has 3× the effective VRAM of an RTX 5090 and 4× the VRAM of an RTX 4090.
As Tom's Hardware noted in their Ryzen AI Max+ 395 review: "For LLM inference, memory capacity is king — and Strix Halo delivers more usable memory than any consumer GPU on the market."
If you're new to why VRAM matters so much, our VRAM guide breaks down the math in detail. The short version: a 70B parameter model at Q4 quantization needs roughly 40 GB of VRAM. That rules out every consumer GPU except the RTX 5090 (32 GB — still too small without heavy quantization) and the A100 80 GB ($12,000+). A $1,499 Strix Halo mini PC handles it with room to spare.
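That back-of-envelope math is easy to reproduce. Here's a minimal shell sketch — the 1.2 overhead multiplier is our own rough allowance for KV cache and runtime buffers, not an official formula:

```shell
# Rough VRAM estimate for dense models:
# params (billions) x bits-per-weight / 8 = weight size in GB;
# the 1.2 multiplier is a rough allowance for KV cache and runtime overhead.
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b / 8 * 1.2 }'
}

estimate_vram_gb 70 4   # 70B at Q4 (~4 bits/weight) -> ~42 GB
echo
estimate_vram_gb 8 4    # 8B at Q4 -> ~4.8 GB
echo
```

Plug in any model size and quantization level to see which hardware tier it lands in.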
Best Strix Halo Mini PCs You Can Buy Right Now
As of March 2026, at least six manufacturers have shipped or announced Strix Halo mini PCs. Here's every model worth considering, ranked by value for AI workloads:
| Mini PC | Chip | Max RAM | Volume | Price (128 GB) | Status |
|---|---|---|---|---|---|
| GMKtec EVO X2 AI | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~2.5L | ~$1,499 | Shipping |
| Zotac Magnus EAMAX | Ryzen AI Max+ 395 | 128 GB LPDDR5X | 2.65L | ~$1,799 | Shipping |
| ASRock AI BOX-A395 | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~3L | ~$1,699 | Shipping |
| Corsair AI Workstation 300 | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~4L | ~$2,299 | Shipping |
| Framework Desktop | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~4L | ~$2,499 | Shipping |
| Sapphire Strix Halo PC | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~3L | ~$1,899 | Pre-order |
GMKtec EVO X2 AI — Best Value
At roughly $1,499 for the 128 GB configuration, the GMKtec EVO X2 AI is the cheapest way to get 128 GB of unified memory in a mini PC. Tom's Hardware's review praised its thermal design — the dual-fan cooler keeps the Ryzen AI Max+ 395 under 85°C even during sustained LLM inference. Build quality is solid for the price, with a full aluminum chassis and dual USB4 ports.
The EVO X2 is the default recommendation for most buyers. If you're coming from a Beelink SER8 ($449 – $599), the leap in AI capability is staggering — from running 7B models slowly to running 70B models comfortably.
Corsair AI Workstation 300 — Best Build Quality
Corsair's entry is more expensive at ~$2,299, but you get premium build quality, better cooling, and Corsair's warranty and support infrastructure. Tom's Hardware's review highlighted the near-silent operation under load and the clean internal layout. If you're deploying this as an always-on AI server for a small business, the extra reliability is worth the premium.
Framework Desktop — Most Repairable
Framework's modular approach means every component is user-replaceable, and the system uses standard expansion cards. ServeTheHome's review noted that the Framework Desktop is the only Strix Halo system designed with enterprise repairability in mind. At ~$2,499, it's the most expensive option, but it's also the most future-proof — if AMD releases a Strix Halo successor, Framework will likely offer an upgrade path.
Sapphire Strix Halo PC — Multi-Unit Linking
The wildcard. Sapphire's system supports linking multiple units together for distributed inference — as documented by VideoCardz, this enables running models that exceed a single unit's memory capacity. AMD themselves have demonstrated trillion-parameter models on a cluster of 8 Strix Halo units. This is bleeding-edge but fascinating for labs and AI startups.
Strix Halo LLM Benchmarks — What Can You Actually Run?
Benchmarks matter more than specs. Here's what the community and reviewers have measured on Strix Halo systems with 128 GB memory (96 GB allocated to GPU):
| Model | Quantization | VRAM Used | Tok/s (Generate) | Source |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~6 GB | ~45 tok/s | Level1Techs Forums |
| Llama 3.3 70B | Q4_K_M | ~40 GB | ~12 tok/s | Level1Techs Forums |
| Llama 3.3 70B | Q8_0 | ~70 GB | ~8 tok/s | Framework Community |
| DeepSeek R1 Distill Llama 70B | Q4_K_M | ~40 GB | ~11 tok/s | TweakTown |
| Llama 4 Scout (109B MoE) | Q4_K_M | ~60 GB | ~9 tok/s | llm-tracker.info |
| Mistral 7B | Q4_K_M | ~5 GB | ~50 tok/s | Level1Techs Forums |
| Qwen 2.5 32B | Q4_K_M | ~20 GB | ~22 tok/s | Framework Community |
The critical number from TweakTown's benchmarking: on DeepSeek R1 at large model sizes, Strix Halo delivered approximately 3× the inference performance of an RTX 5080. Not because the GPU is faster — it isn't. Because the RTX 5080's 16 GB VRAM forces aggressive quantization or CPU offloading, while Strix Halo loads the entire model into GPU-addressable memory.
As the team at Starry Hope documented in their practical Strix Halo LLM guide: "The throughput per-token isn't going to match an RTX 4090 on models that fit in 24 GB. But for anything above 24 GB — which includes every serious production model — Strix Halo is in a class of its own at this price point."
For context on how these models perform on our recommended products, see our DeepSeek R1 local setup guide and Llama 4 hardware guide.
What the Numbers Mean in Practice
- 8–12 tok/s on 70B models: Usable for interactive chat. You won't notice the speed difference vs. a cloud API for single-turn conversations. Multi-turn or long-context gets slow.
- 40–50 tok/s on 7B–8B models: Instant-feeling responses. More than fast enough for AI coding assistants, agents, and RAG pipelines.
- ~9 tok/s on Llama 4 Scout (109B MoE): Functional for interactive use. The MoE architecture means the model is smarter than 70B dense models despite similar tok/s.
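To translate those throughput figures into wall-clock feel, here's a quick sketch — the 500-token response length is an illustrative assumption, and prompt-processing time (which grows with context) is ignored:

```shell
# Seconds to generate an N-token response at a given tok/s.
# Ignores prompt-processing time, which adds more on long contexts.
response_seconds() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.0f", t / s }'
}

response_seconds 500 12   # 500-token answer at 12 tok/s (70B Q4): ~42 s
echo
response_seconds 500 45   # same answer at 45 tok/s (8B Q4): ~11 s
echo
```

Forty seconds is fine for a chat reply you read as it streams; it's painful inside an agent loop that makes dozens of calls.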
Strix Halo vs Mac Studio M4 Max for Local AI
This is the comparison everyone wants. Both platforms offer 128 GB of unified memory for large model inference. But the similarities end at the memory spec.
| Spec | Strix Halo Mini PC (128 GB) | Mac Studio M4 Max (128 GB) |
|---|---|---|
| Price | $1,499 – $2,500 | $3,999 – $4,499 |
| GPU Compute | 40 RDNA 3.5 CUs | 40-core Apple GPU |
| Memory Bandwidth | ~218 GB/s | 546 GB/s |
| GPU-Addressable Memory | Up to 96 GB | ~96 GB by default (limit adjustable) |
| CPU | 16× Zen 5 cores | 16× Apple P/E cores |
| NPU | 50 TOPS (XDNA 2) | 38 TOPS |
| OS | Linux / Windows | macOS only |
| AI Software | ROCm, Vulkan, llama.cpp, Ollama | Metal, mlx, llama.cpp, Ollama |
| Noise | Low (fan-cooled) | Silent (passive under most loads) |
| Expandability | USB4, NVMe (model-dependent) | Thunderbolt 5, NVMe |
Where Strix Halo Wins
Price. A 128 GB GMKtec EVO X2 at $1,499 costs less than half of a 128 GB Mac Studio M4 Max at $3,999. For the same memory capacity, you save $2,000–$2,500. If you're buying multiple units (for a small team or a distributed inference cluster), the savings are enormous.
Linux native. If your AI workflow runs on Linux — and most serious production AI does — Strix Halo gives you first-class support. ROCm, Docker, CUDA translation layers, and the full Python ML ecosystem work natively. The Mac Studio requires macOS, which means Metal-only GPU access and no ROCm.
Where Mac Studio Wins
Memory bandwidth. At 546 GB/s vs. ~218 GB/s, the M4 Max has 2.5× the memory bandwidth. For LLM inference, memory bandwidth is the primary bottleneck after capacity — it directly determines tok/s. This means the Mac Studio will be noticeably faster on the same model at the same quantization level.
Software maturity. Apple's mlx framework and Metal backend for llama.cpp are well-optimized and Just Work. ROCm on Strix Halo is improving but still requires more manual configuration. For a "download Ollama, run a model, done" experience, the Mac wins.
Silence. The Mac Studio is essentially silent under all workloads. Strix Halo mini PCs are quiet but not silent — fans spin up during sustained inference.
The Verdict
- Buy Strix Halo if: you're budget-conscious, you prefer Linux, you want multiple units, or you need 128 GB of memory capacity without paying the Apple tax.
- Buy Mac Studio if: you value silence, want the best out-of-box experience, need maximum tok/s per dollar of memory bandwidth, or your workflow is already macOS-based.

See our Mac mini AI guide and Mac mini alternatives for more Apple vs. AMD comparisons.
Strix Halo vs Discrete GPU Builds for Local AI
The other big question: should you buy a Strix Halo mini PC, or just build a desktop with an RTX 4090 ($1,599 – $1,999)?
| Factor | Strix Halo Mini PC (128 GB) | RTX 4090 Desktop Build | RTX 5090 Desktop Build |
|---|---|---|---|
| Total Cost | $1,499 – $2,500 | ~$2,200 – $2,800 | ~$2,800 – $3,500 |
| VRAM / GPU Memory | 96 GB (unified) | 24 GB GDDR6X | 32 GB GDDR7 |
| Max Model Size (Q4) | ~150B+ params | ~30B params | ~45B params |
| 7B Model Speed | ~45 tok/s | ~62 tok/s | ~95 tok/s |
| 70B Model Speed | ~12 tok/s | Doesn't fit (offload: ~3 tok/s) | Doesn't fit (offload: ~5 tok/s) |
| Power Draw | ~80–120W system | ~450W GPU + ~150W system | ~575W GPU + ~200W system |
| Noise | Low | Moderate to loud | Loud |
| Size | ~2.5–4 liters | ~30+ liters (ATX case) | ~30+ liters (ATX case) |
| CUDA Support | No (ROCm/Vulkan) | Yes | Yes |
When Strix Halo Wins
- You need to run models larger than 32 GB. Llama 3.3 70B, DeepSeek R1, Llama 4 Scout — these require more VRAM than any consumer GPU offers. Strix Halo runs them natively.
- You want a small, quiet, low-power system. At 80–120W total system power in a 2.5-liter chassis, Strix Halo is 5× more power-efficient and 10× smaller than an RTX 4090 build.
- You're hosting AI agents or always-on services. The power savings compound — running 24/7, a Strix Halo system costs roughly $8/month in electricity vs. $40+/month for an RTX 4090 build. See our AI agent hardware guide for more on always-on deployments.
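The electricity figures above are easy to reproduce. This sketch assumes a US-average rate of roughly $0.12/kWh and round average draws — your rate and duty cycle will vary:

```shell
# Monthly cost of a machine running 24/7:
# watts -> kWh per 30-day month x rate ($/kWh).
monthly_power_cost() {
  awk -v w="$1" -v rate="$2" 'BEGIN { printf "%.2f", w / 1000 * 24 * 30 * rate }'
}

monthly_power_cost 100 0.12   # Strix Halo at ~100W average: ~$8.64/mo
echo
monthly_power_cost 500 0.12   # RTX 4090 build at ~500W under load: ~$43.20/mo
echo
```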
When Discrete GPUs Win
- You need maximum speed on models that fit in VRAM. An RTX 4090 runs 7B–13B models 40–50% faster than Strix Halo. An RTX 5090 is roughly 2× faster on small models.
- You need CUDA. Training, fine-tuning, and many ML frameworks still require CUDA. ROCm is catching up but isn't at parity yet. See our budget GPU guide for the best value CUDA cards.
- You're doing batch inference or training. Raw FP16/BF16 throughput on NVIDIA tensor cores vastly outperforms RDNA 3.5 compute units.
The simplest decision rule: if your model fits in 24–32 GB, buy a discrete GPU. If it doesn't, buy Strix Halo. For budget VRAM options on the NVIDIA side, an RTX 3090 ($699 – $999) gives you 24 GB of VRAM at a fraction of the RTX 4090 price.
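That decision rule is mechanical enough to write down. A sketch — the thresholds are the rough tiers from this guide, not hard limits, and the tier labels are our own shorthand:

```shell
# Map a model's size (GB, at your target quantization) to the
# hardware tier discussed in this guide. Thresholds are rough guides.
recommend_hardware() {
  local gb="$1"
  if   [ "$gb" -le 16 ]; then echo "RTX 5060 Ti 16GB (budget build)"
  elif [ "$gb" -le 24 ]; then echo "used RTX 3090 (24 GB)"
  elif [ "$gb" -le 32 ]; then echo "RTX 5090 (32 GB)"
  elif [ "$gb" -le 90 ]; then echo "Strix Halo (96 GB GPU partition)"
  else echo "Mac Studio 128 GB, multi-unit cluster, or datacenter GPUs"
  fi
}

recommend_hardware 40   # Llama 3.3 70B at Q4 lands in the Strix Halo tier
```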
Software Setup — Running LLMs on Strix Halo
Getting LLMs running on Strix Halo is straightforward but requires choosing the right software stack. Here's the current state as of March 2026:
Option 1: Ollama (Easiest)
Ollama is the fastest path to running models. Install it on Linux or Windows, and it automatically detects Strix Halo's GPU via the Vulkan backend:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a 70B model
ollama run llama3.3:70b-instruct-q4_K_M

# Verify GPU usage
ollama ps
```
Ollama handles model downloading, quantization selection, and GPU memory allocation automatically. For most users, this is all you need. See our full Ollama setup guide for detailed instructions.
Option 2: llama.cpp with Vulkan (Best Performance)
For maximum tok/s, compile llama.cpp with the Vulkan backend. This gives you direct GPU access and fine-grained control over memory allocation:
```bash
# Clone and build with the Vulkan backend
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run with all layers offloaded to the GPU
./build/bin/llama-cli -m models/llama-3.3-70b-q4_K_M.gguf \
  -ngl 99 --ctx-size 8192
```
The `-ngl 99` flag offloads all layers to the GPU. On a 128 GB system with 96 GB allocated to the GPU, this fits any model up to ~90 GB comfortably.
Option 3: ROCm (For Advanced Users)
ROCm support for Strix Halo uses the gfx1151 GPU target. As of March 2026, ROCm 6.x supports this target in recent builds, but you may need to set environment variables:
```bash
# Set the GPU target for ROCm (gfx1151)
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HIP_VISIBLE_DEVICES=0

# Verify detection
rocminfo | grep gfx
```
ROCm gives you access to PyTorch and other ML frameworks on the GPU, but the Vulkan path in llama.cpp is currently more stable for pure inference workloads. As noted by the Starry Hope practical guide, "Vulkan is the safer bet for day-one Strix Halo users; ROCm is where you go when you need PyTorch."
Storage Recommendation
Large models eat disk space fast — Llama 3.3 70B at Q4 is roughly 40 GB per file. A fast NVMe drive dramatically improves model loading times. We recommend the Samsung 990 Pro 4TB ($289 – $339) for its 7,450 MB/s sequential reads, which loads a 40 GB model in under 6 seconds.
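The load-time claim checks out with simple arithmetic — note that spec-sheet sequential reads are best-case peaks, so real-world loads will be somewhat slower:

```shell
# Best-case model load time: size (GB) / sequential read speed (MB/s).
load_seconds() {
  awk -v gb="$1" -v mbs="$2" 'BEGIN { printf "%.1f", gb * 1000 / mbs }'
}

load_seconds 40 7450   # 40 GB model on a 990 Pro: ~5.4 s best case
echo
```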
Who Should Buy a Strix Halo Mini PC?
Ideal For
- Developers running 30B–70B+ models locally. If you're building with Llama 3.3 70B, DeepSeek R1, Llama 4 Scout, or any model that exceeds 24 GB of VRAM, Strix Halo is the most affordable path. Period.
- Small businesses wanting on-premises AI. A $1,499 mini PC running a 70B model replaces API costs that can easily exceed $500/month. See our local AI for small business guide for the ROI math.
- AI agent hosting. Always-on AI agents need low power, small footprint, and enough memory to run capable models. Strix Halo checks every box. Our agent hardware guide covers deployment patterns in detail.
- Anyone who wants a capable general-purpose mini PC that also happens to be the best local AI machine in its price class.
Not Ideal For
- Training and fine-tuning. RDNA 3.5 compute units lack the tensor core throughput of NVIDIA GPUs. If you're training models, an RTX 4090 ($1,599 – $1,999) or RTX 5090 ($1,999 – $2,199) is still the right choice.
- Batch inference at scale. If you're serving hundreds of concurrent requests, you need the raw throughput of NVIDIA data-center GPUs, not a mini PC.
- Users who need a mature CUDA ecosystem today. ROCm and Vulkan are improving rapidly, but if your workflow depends on CUDA-only tools (certain PyTorch extensions, TensorRT, etc.), Strix Halo will cause friction.
- Budget under $1,000. If you're spending under $1,000, an RTX 5060 Ti 16GB ($429 – $479) in a budget build or a used RTX 3090 ($699 – $999) gets you into the local AI game. See our budget GPU guide.
Decision Flowchart
- Do you need to run models larger than 32 GB? → Yes: Strix Halo or Mac Studio M4 Max ($1,999 – $4,499)
- Budget under $2,500 and prefer Linux? → Strix Halo mini PC (start with GMKtec EVO X2 at ~$1,499)
- Want silence and macOS ecosystem? → Mac Studio M4 Max
- Models fit in 24 GB and need CUDA? → RTX 4090 desktop build
- Models fit in 16 GB and budget is tight? → RTX 5060 Ti 16GB ($429 – $479)
- Want the absolute cheapest 24 GB option? → Used RTX 3090 ($699 – $999)
The Bottom Line
AMD Strix Halo represents a genuine paradigm shift for local AI. For the first time, you can run frontier-class 70B+ parameter models on a $1,499 machine that fits in your palm and draws under 120W. The Mac Studio M4 Max does the same thing, but costs $2,500 more.
The tradeoffs are real: Strix Halo is slower per-token than discrete GPUs on small models, ROCm is less mature than CUDA, and the memory bandwidth gap vs. Apple Silicon means you're leaving some performance on the table. But for the target use case — running large local LLMs at the lowest possible cost — nothing else comes close right now.
If you're running models that fit in 24–32 GB of VRAM, you're still better served by an RTX 4090 ($1,599 – $1,999) or RTX 5090 ($1,999 – $2,199). But if you've been waiting for a small, affordable machine that can handle the models that actually matter in 2026 — the 70B+ class — Strix Halo is what you've been waiting for.
For a broader look at mini PCs for AI, see our mini PC for LLM guide. For prebuilt options across all form factors, check our best prebuilt AI workstation roundup. And for a deeper dive into the software side, our guide to running LLMs locally covers everything from installation to optimization.