AMD Strix Halo Mini PCs: The Best 128 GB Machines for Running Local AI in 2026
Strix Halo mini PCs pack 128 GB of unified memory into a sub-3-liter chassis — running 70B+ parameter models that no 16 GB discrete GPU can touch. Here's every model compared, with LLM benchmarks, a Mac Studio head-to-head, and a practical setup guide.
Compute Market Team
In March 2026, a new class of hardware is reshaping local AI: AMD Strix Halo mini PCs. These sub-3-liter machines pack up to 128 GB of unified LPDDR5X memory — with up to 96 GB directly addressable by the integrated GPU. That means you can run 70B+ parameter LLMs that no 16 GB or even 32 GB discrete GPU can touch, in a box that sits quietly on your desk and draws under 120W.
The key insight is simple: for large language model inference, memory capacity matters more than raw GPU speed. A model that doesn't fit in VRAM doesn't run — period. Strix Halo solves this problem at a price point ($1,499–$2,500) that undercuts a comparable Mac Studio M4 Max by $1,500–$2,000.
This guide compares every Strix Halo mini PC you can buy right now, benchmarks them on real LLM workloads, and tells you exactly who should buy one — and who's better served by a discrete GPU build or Mac.
What Is AMD Strix Halo and Why Does It Matter for Local AI?
AMD's Ryzen AI Max+ 395 (codenamed "Strix Halo") is the most memory-rich consumer processor ever built. It's not a GPU. It's not a CPU. It's an APU — a single package with everything integrated:
- 16 Zen 5 CPU cores (32 threads) — competitive with desktop Ryzen 9 chips
- 40 RDNA 3.5 compute units — more shader hardware than a desktop Radeon RX 7600 (32 CUs), though short of an RX 7800 XT (60 CUs)
- XDNA 2 NPU — 50 TOPS for Windows AI features (less relevant for LLM inference)
- Up to 128 GB LPDDR5X unified memory — shared between CPU and GPU, with up to 96 GB allocatable to the GPU partition
- 256-bit memory bus — delivering approximately 218 GB/s of memory bandwidth
The architecture that makes this revolutionary for local AI is unified memory. On a traditional PC, the CPU has system RAM and the GPU has its own VRAM — and these are separate pools. An RTX 5090 has 32 GB of GDDR7 VRAM. If your model is 40 GB, it doesn't fit. End of story.
Strix Halo eliminates this wall. The CPU and GPU share the same physical memory pool, and you can allocate most of it to the GPU. A 128 GB Strix Halo system with 96 GB allocated to the GPU has 3× the effective VRAM of an RTX 5090 and 4× the VRAM of an RTX 4090.
As Tom's Hardware noted in their Ryzen AI Max+ 395 review: "For LLM inference, memory capacity is king — and Strix Halo delivers more usable memory than any consumer GPU on the market."
If you're new to why VRAM matters so much, our VRAM guide breaks down the math in detail. The short version: a 70B parameter model at Q4 quantization needs roughly 40 GB of VRAM. That rules out every consumer GPU except the RTX 5090 (32 GB — still too small without heavy quantization) and the A100 80 GB ($12,000+). A $1,499 Strix Halo mini PC handles it with room to spare.
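That back-of-envelope math is easy to reproduce. Here's a minimal shell sketch — the 1.2 overhead multiplier is our own rough allowance for KV cache and runtime buffers, not an official formula:

```shell
# Rough VRAM estimate for dense models:
# params (billions) x bits-per-weight / 8 = weight size in GB;
# the 1.2 multiplier is a rough allowance for KV cache and runtime overhead.
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b / 8 * 1.2 }'
}

estimate_vram_gb 70 4   # 70B at Q4 (~4 bits/weight) -> ~42 GB
echo
estimate_vram_gb 8 4    # 8B at Q4 -> ~4.8 GB
echo
```

Plug in any model size and quantization level to see which hardware tier it lands in.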
Best Strix Halo Mini PCs You Can Buy Right Now
As of March 2026, at least six manufacturers have shipped or announced Strix Halo mini PCs. Here's every model worth considering, ranked by value for AI workloads:
| Mini PC | Chip | Max RAM | Volume | Price (128 GB) | Status |
|---|---|---|---|---|---|
| GMKtec EVO X2 AI | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~2.5L | ~$1,499 | Shipping |
| Zotac Magnus EAMAX | Ryzen AI Max+ 395 | 128 GB LPDDR5X | 2.65L | ~$1,799 | Shipping |
| ASRock AI BOX-A395 | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~3L | ~$1,699 | Shipping |
| Corsair AI Workstation 300 | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~4L | ~$2,299 | Shipping |
| Framework Desktop | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~4L | ~$2,499 | Shipping |
| Sapphire Strix Halo PC | Ryzen AI Max+ 395 | 128 GB LPDDR5X | ~3L | ~$1,899 | Pre-order |
GMKtec EVO X2 AI — Best Value
At roughly $1,499 for the 128 GB configuration, the GMKtec EVO X2 AI is the cheapest way to get 128 GB of unified memory in a mini PC. Tom's Hardware's review praised its thermal design — the dual-fan cooler keeps the Ryzen AI Max+ 395 under 85°C even during sustained LLM inference. Build quality is solid for the price, with a full aluminum chassis and dual USB4 ports.
The EVO X2 is the default recommendation for most buyers. If you're coming from a Beelink SER8 ($449 – $599), the leap in AI capability is staggering — from running 7B models slowly to running 70B models comfortably.
Corsair AI Workstation 300 — Best Build Quality
Corsair's entry is more expensive at ~$2,299, but you get premium build quality, better cooling, and Corsair's warranty and support infrastructure. Tom's Hardware's review highlighted the near-silent operation under load and the clean internal layout. If you're deploying this as an always-on AI server for a small business, the extra reliability is worth the premium.
Framework Desktop — Most Repairable
Framework's modular approach means every component is user-replaceable, and the system uses standard expansion cards. ServeTheHome's review noted that the Framework Desktop is the only Strix Halo system designed with enterprise repairability in mind. At ~$2,499, it's the most expensive option, but it's also the most future-proof — if AMD releases a Strix Halo successor, Framework will likely offer an upgrade path.
Sapphire Strix Halo PC — Multi-Unit Linking
The wildcard. Sapphire's system supports linking multiple units together for distributed inference — as documented by VideoCardz, this enables running models that exceed a single unit's memory capacity. AMD themselves have demonstrated trillion-parameter models on a cluster of 8 Strix Halo units. This is bleeding-edge but fascinating for labs and AI startups.
Strix Halo LLM Benchmarks — What Can You Actually Run?
Benchmarks matter more than specs. Here's what the community and reviewers have measured on Strix Halo systems with 128 GB memory (96 GB allocated to GPU):
| Model | Quantization | VRAM Used | Tok/s (Generate) | Source |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~6 GB | ~45 tok/s | Level1Techs Forums |
| Llama 3.3 70B | Q4_K_M | ~40 GB | ~12 tok/s | Level1Techs Forums |
| Llama 3.3 70B | Q8_0 | ~70 GB | ~8 tok/s | Framework Community |
| DeepSeek R1 Distill Llama 70B | Q4_K_M | ~40 GB | ~11 tok/s | TweakTown |
| Llama 4 Scout (109B MoE) | Q4_K_M | ~60 GB | ~9 tok/s | llm-tracker.info |
| Mistral 7B | Q4_K_M | ~5 GB | ~50 tok/s | Level1Techs Forums |
| Qwen 2.5 32B | Q4_K_M | ~20 GB | ~22 tok/s | Framework Community |
The critical number from TweakTown's benchmarking: on DeepSeek R1 at large model sizes, Strix Halo delivered approximately 3× the inference performance of an RTX 5080. Not because the GPU is faster — it isn't. Because the RTX 5080's 16 GB VRAM forces aggressive quantization or CPU offloading, while Strix Halo loads the entire model into GPU-addressable memory.
As the team at Starry Hope documented in their practical Strix Halo LLM guide: "The throughput per-token isn't going to match an RTX 4090 on models that fit in 24 GB. But for anything above 24 GB — which includes every serious production model — Strix Halo is in a class of its own at this price point."
For context on how these models perform on our recommended products, see our DeepSeek R1 local setup guide and Llama 4 hardware guide.
What the Numbers Mean in Practice
- 8–12 tok/s on 70B models: Usable for interactive chat. You won't notice the speed difference vs. a cloud API for single-turn conversations. Multi-turn or long-context gets slow.
- 40–50 tok/s on 7B–8B models: Instant-feeling responses. More than fast enough for AI coding assistants, agents, and RAG pipelines.
- ~9 tok/s on Llama 4 Scout (109B MoE): Functional for interactive use. The MoE architecture means the model is smarter than 70B dense models despite similar tok/s.
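To translate those throughput figures into wall-clock feel, here's a quick sketch — the 500-token response length is an illustrative assumption, and prompt-processing time (which grows with context) is ignored:

```shell
# Seconds to generate an N-token response at a given tok/s.
# Ignores prompt-processing time, which adds more on long contexts.
response_seconds() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.0f", t / s }'
}

response_seconds 500 12   # 500-token answer at 12 tok/s (70B Q4): ~42 s
echo
response_seconds 500 45   # same answer at 45 tok/s (8B Q4): ~11 s
echo
```

Forty seconds is fine for a chat reply you read as it streams; it's painful inside an agent loop that makes dozens of calls.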
Strix Halo vs Mac Studio M4 Max for Local AI
This is the comparison everyone wants. Both platforms offer 128 GB of unified memory for large model inference. But the similarities end at the memory spec.
| Spec | Strix Halo Mini PC (128 GB) | Mac Studio M4 Max (128 GB) |
|---|---|---|
| Price | $1,499 – $2,500 | $3,999 – $4,499 |
| GPU Compute | 40 RDNA 3.5 CUs | 40-core Apple GPU |
| Memory Bandwidth | ~218 GB/s | 546 GB/s |
| GPU-Addressable Memory | Up to 96 GB | ~96 GB by default (limit adjustable) |
| CPU | 16× Zen 5 cores | 16× Apple P/E cores |
| NPU | 50 TOPS (XDNA 2) | 38 TOPS |
| OS | Linux / Windows | macOS only |
| AI Software | ROCm, Vulkan, llama.cpp, Ollama | Metal, mlx, llama.cpp, Ollama |
| Noise | Low (fan-cooled) | Silent (passive under most loads) |
| Expandability | USB4, NVMe (model-dependent) | Thunderbolt 5, NVMe |
Where Strix Halo Wins
Price. A 128 GB GMKtec EVO X2 at $1,499 costs less than half of a 128 GB Mac Studio M4 Max at $3,999. For the same memory capacity, you save $2,000–$2,500. If you're buying multiple units (for a small team or a distributed inference cluster), the savings are enormous.
Linux native. If your AI workflow runs on Linux — and most serious production AI does — Strix Halo gives you first-class support. ROCm, Docker, CUDA translation layers, and the full Python ML ecosystem work natively. The Mac Studio requires macOS, which means Metal-only GPU access and no ROCm.
Where Mac Studio Wins
Memory bandwidth. At 546 GB/s vs. ~218 GB/s, the M4 Max has 2.5× the memory bandwidth. For LLM inference, memory bandwidth is the primary bottleneck after capacity — it directly determines tok/s. This means the Mac Studio will be noticeably faster on the same model at the same quantization level.
Software maturity. Apple's mlx framework and Metal backend for llama.cpp are well-optimized and Just Work. ROCm on Strix Halo is improving but still requires more manual configuration. For a "download Ollama, run a model, done" experience, the Mac wins.
Silence. The Mac Studio is essentially silent under all workloads. Strix Halo mini PCs are quiet but not silent — fans spin up during sustained inference.
The Verdict
- Buy Strix Halo if: you're budget-conscious, you prefer Linux, you want multiple units, or you need 128 GB of memory capacity without paying the Apple tax.
- Buy Mac Studio if: you value silence, want the best out-of-box experience, need maximum tok/s per dollar of memory bandwidth, or your workflow is already macOS-based.

See our Mac mini AI guide and Mac mini alternatives for more Apple vs. AMD comparisons.
Strix Halo vs Discrete GPU Builds for Local AI
The other big question: should you buy a Strix Halo mini PC, or just build a desktop with an RTX 4090 ($1,599 – $1,999)?
| Factor | Strix Halo Mini PC (128 GB) | RTX 4090 Desktop Build | RTX 5090 Desktop Build |
|---|---|---|---|
| Total Cost | $1,499 – $2,500 | ~$2,200 – $2,800 | ~$2,800 – $3,500 |
| VRAM / GPU Memory | 96 GB (unified) | 24 GB GDDR6X | 32 GB GDDR7 |
| Max Model Size (Q4) | ~150B+ params | ~30B params | ~45B params |
| 7B Model Speed | ~45 tok/s | ~62 tok/s | ~95 tok/s |
| 70B Model Speed | ~12 tok/s | Doesn't fit (offload: ~3 tok/s) | Doesn't fit (offload: ~5 tok/s) |
| Power Draw | ~80–120W system | ~450W GPU + ~150W system | ~575W GPU + ~200W system |
| Noise | Low | Moderate to loud | Loud |
| Size | ~2.5–4 liters | ~30+ liters (ATX case) | ~30+ liters (ATX case) |
| CUDA Support | No (ROCm/Vulkan) | Yes | Yes |
When Strix Halo Wins
- You need to run models larger than 32 GB. Llama 3.3 70B, DeepSeek R1, Llama 4 Scout — these require more VRAM than any consumer GPU offers. Strix Halo runs them natively.
- You want a small, quiet, low-power system. At 80–120W total system power in a 2.5-liter chassis, Strix Halo is 5× more power-efficient and 10× smaller than an RTX 4090 build.
- You're hosting AI agents or always-on services. The power savings compound — running 24/7, a Strix Halo system costs roughly $8/month in electricity vs. $40+/month for an RTX 4090 build. See our AI agent hardware guide for more on always-on deployments.
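The electricity figures above are easy to reproduce. This sketch assumes a US-average rate of roughly $0.12/kWh and round average draws — your rate and duty cycle will vary:

```shell
# Monthly cost of a machine running 24/7:
# watts -> kWh per 30-day month x rate ($/kWh).
monthly_power_cost() {
  awk -v w="$1" -v rate="$2" 'BEGIN { printf "%.2f", w / 1000 * 24 * 30 * rate }'
}

monthly_power_cost 100 0.12   # Strix Halo at ~100W average: ~$8.64/mo
echo
monthly_power_cost 500 0.12   # RTX 4090 build at ~500W under load: ~$43.20/mo
echo
```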
When Discrete GPUs Win
- You need maximum speed on models that fit in VRAM. An RTX 4090 runs 7B–13B models 40–50% faster than Strix Halo. An RTX 5090 is roughly 2× faster on small models.
- You need CUDA. Training, fine-tuning, and many ML frameworks still require CUDA. ROCm is catching up but isn't at parity yet. See our budget GPU guide for the best value CUDA cards.
- You're doing batch inference or training. Raw FP16/BF16 throughput on NVIDIA tensor cores vastly outperforms RDNA 3.5 compute units.
The simplest decision rule: if your model fits in 24–32 GB, buy a discrete GPU. If it doesn't, buy Strix Halo. For budget VRAM options on the NVIDIA side, an RTX 3090 ($699 – $999) gives you 24 GB of VRAM at a fraction of the RTX 4090 price.
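That decision rule is mechanical enough to write down. A sketch — the thresholds are the rough tiers from this guide, not hard limits, and the tier labels are our own shorthand:

```shell
# Map a model's size (GB, at your target quantization) to the
# hardware tier discussed in this guide. Thresholds are rough guides.
recommend_hardware() {
  local gb="$1"
  if   [ "$gb" -le 16 ]; then echo "RTX 5060 Ti 16GB (budget build)"
  elif [ "$gb" -le 24 ]; then echo "used RTX 3090 (24 GB)"
  elif [ "$gb" -le 32 ]; then echo "RTX 5090 (32 GB)"
  elif [ "$gb" -le 90 ]; then echo "Strix Halo (96 GB GPU partition)"
  else echo "Mac Studio 128 GB, multi-unit cluster, or datacenter GPUs"
  fi
}

recommend_hardware 40   # Llama 3.3 70B at Q4 lands in the Strix Halo tier
```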
Software Setup — Running LLMs on Strix Halo
Getting LLMs running on Strix Halo is straightforward but requires choosing the right software stack. Here's the current state as of March 2026:
Option 1: Ollama (Easiest)
Ollama is the fastest path to running models. Install it on Linux or Windows, and it automatically detects Strix Halo's GPU via the Vulkan backend:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a 70B model
ollama run llama3.3:70b-instruct-q4_K_M

# Verify GPU usage
ollama ps
```
Ollama handles model downloading, quantization selection, and GPU memory allocation automatically. For most users, this is all you need. See our full Ollama setup guide for detailed instructions.
Option 2: llama.cpp with Vulkan (Best Performance)
For maximum tok/s, compile llama.cpp with the Vulkan backend. This gives you direct GPU access and fine-grained control over memory allocation:
```bash
# Clone and build with the Vulkan backend
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run with all layers offloaded to the GPU
./build/bin/llama-cli -m models/llama-3.3-70b-q4_K_M.gguf \
  -ngl 99 --ctx-size 8192
```
The `-ngl 99` flag offloads all layers to the GPU. On a 128 GB system with 96 GB allocated to the GPU, this fits any model up to ~90 GB comfortably.
Option 3: ROCm (For Advanced Users)
ROCm support for Strix Halo uses the gfx1151 GPU target. As of March 2026, ROCm 6.x supports this target in recent builds, but you may need to set environment variables:
```bash
# Set the GPU target for ROCm (gfx1151)
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HIP_VISIBLE_DEVICES=0

# Verify detection
rocminfo | grep gfx
```
ROCm gives you access to PyTorch and other ML frameworks on the GPU, but the Vulkan path in llama.cpp is currently more stable for pure inference workloads. As noted by the Starry Hope practical guide, "Vulkan is the safer bet for day-one Strix Halo users; ROCm is where you go when you need PyTorch."
Storage Recommendation
Large models eat disk space fast — Llama 3.3 70B at Q4 is roughly 40 GB per file. A fast NVMe drive dramatically improves model loading times. We recommend the Samsung 990 Pro 4TB ($289 – $339) for its 7,450 MB/s sequential reads, which loads a 40 GB model in under 6 seconds.
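The load-time claim checks out with simple arithmetic — note that spec-sheet sequential reads are best-case peaks, so real-world loads will be somewhat slower:

```shell
# Best-case model load time: size (GB) / sequential read speed (MB/s).
load_seconds() {
  awk -v gb="$1" -v mbs="$2" 'BEGIN { printf "%.1f", gb * 1000 / mbs }'
}

load_seconds 40 7450   # 40 GB model on a 990 Pro: ~5.4 s best case
echo
```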
Who Should Buy a Strix Halo Mini PC?
Ideal For
- Developers running 30B–70B+ models locally. If you're building with Llama 3.3 70B, DeepSeek R1, Llama 4 Scout, or any model that exceeds 24 GB of VRAM, Strix Halo is the most affordable path. Period.
- Small businesses wanting on-premises AI. A $1,499 mini PC running a 70B model replaces API costs that can easily exceed $500/month. See our local AI for small business guide for the ROI math.
- AI agent hosting. Always-on AI agents need low power, small footprint, and enough memory to run capable models. Strix Halo checks every box. Our agent hardware guide covers deployment patterns in detail.
- Anyone who wants a capable general-purpose mini PC that also happens to be the best local AI machine in its price class.
Not Ideal For
- Training and fine-tuning. RDNA 3.5 compute units lack the tensor core throughput of NVIDIA GPUs. If you're training models, an RTX 4090 ($1,599 – $1,999) or RTX 5090 ($1,999 – $2,199) is still the right choice.
- Batch inference at scale. If you're serving hundreds of concurrent requests, you need the raw throughput of NVIDIA data-center GPUs, not a mini PC.
- Users who need a mature CUDA ecosystem today. ROCm and Vulkan are improving rapidly, but if your workflow depends on CUDA-only tools (certain PyTorch extensions, TensorRT, etc.), Strix Halo will cause friction.
- Budget under $1,000. If you're spending under $1,000, an RTX 5060 Ti 16GB ($429 – $479) in a budget build or a used RTX 3090 ($699 – $999) gets you into the local AI game. See our budget GPU guide.
Decision Flowchart
- Do you need to run models larger than 32 GB? → Yes: Strix Halo or Mac Studio M4 Max ($1,999 – $4,499)
- Budget under $2,500 and prefer Linux? → Strix Halo mini PC (start with GMKtec EVO X2 at ~$1,499)
- Want silence and macOS ecosystem? → Mac Studio M4 Max
- Models fit in 24 GB and need CUDA? → RTX 4090 desktop build
- Models fit in 16 GB and budget is tight? → RTX 5060 Ti 16GB ($429 – $479)
- Want the absolute cheapest 24 GB option? → Used RTX 3090 ($699 – $999)
The Bottom Line
AMD Strix Halo represents a genuine paradigm shift for local AI. For the first time, you can run frontier-class 70B+ parameter models on a $1,499 machine that fits in your palm and draws under 120W. The Mac Studio M4 Max does the same thing, but costs $2,500 more.
The tradeoffs are real: Strix Halo is slower per-token than discrete GPUs on small models, ROCm is less mature than CUDA, and the memory bandwidth gap vs. Apple Silicon means you're leaving some performance on the table. But for the target use case — running large local LLMs at the lowest possible cost — nothing else comes close right now.
If you're running models that fit in 24–32 GB of VRAM, you're still better served by an RTX 4090 ($1,599 – $1,999) or RTX 5090 ($1,999 – $2,199). But if you've been waiting for a small, affordable machine that can handle the models that actually matter in 2026 — the 70B+ class — Strix Halo is what you've been waiting for.
For a broader look at mini PCs for AI, see our mini PC for LLM guide. For prebuilt options across all form factors, check our best prebuilt AI workstation roundup. And for a deeper dive into the software side, our guide to running LLMs locally covers everything from installation to optimization.