Guide · 16 min read

Qwen 3.5 Local Hardware Guide 2026: Every Model from 0.8B to 397B

Qwen 3.5 rewrites the local AI playbook with native multimodal, 262K context, and hybrid MoE. Here's exactly which GPU, Mac, or mini PC you need for every model size — with VRAM math, tok/s benchmarks, and price-tiered recommendations from $250 to enterprise.


Compute Market Team

Our Top Pick

NVIDIA GeForce RTX 5090

$1,999 – $2,199

32GB GDDR7 · 21,760 CUDA cores · 1,792 GB/s

Qwen 3.5 is the most capable open-weight model family available in April 2026 — and the most complex to buy hardware for. Nine model sizes spanning 0.8B to 397B parameters, three architecture types (dense, hybrid MoE, and multimodal), and a 262K context window that rewrites every VRAM calculation from the Qwen 3 era.

This guide covers every Qwen 3.5 model size with exact VRAM requirements, price-tiered GPU recommendations from $250 to enterprise, Apple Silicon coverage that PC-focused guides miss, and real tok/s benchmarks so you know what performance to expect before buying. If you ran hardware for Qwen 3, we'll tell you exactly what's changed.

The bottom line: Running Qwen 3.5 27B locally at Q4 quantization requires 15GB VRAM — an RTX 4090 (24GB) delivers 40–50 tokens per second at this precision, making it the best price-to-performance GPU for Qwen 3.5's most popular model size in April 2026.

What Is Qwen 3.5? — The Complete Model Family

Released by Alibaba Cloud's Qwen team across three waves in February–March 2026, Qwen 3.5 is architecturally different from its predecessor in ways that directly affect hardware requirements:

  • Feb 16 — Flagship: 397B-A17B, the largest open-weight MoE model available, with 397 billion total parameters but only 17 billion active per token
  • Feb 24 — Medium series: 27B (dense), 35B-A3B (MoE), 122B-A10B (MoE) — the models most local AI users will run
  • Mar 2 — Small series: 0.8B, 1.5B, 4B, 9B, 14B — optimized for phones, laptops, and edge devices

Three architectural changes matter for hardware planning:

  1. Native multimodal: every Qwen 3.5 model processes text, images, and video natively — no separate vision adapter required
  2. 262K context window: double the 128K context of Qwen 3, which means a proportionally larger KV cache consuming more memory
  3. Hybrid MoE architecture: the 35B, 122B, and 397B models use Mixture of Experts, activating only a fraction of parameters per token — dramatically reducing compute requirements relative to model size

For a deeper dive into how VRAM works and why it's the primary constraint, see our complete VRAM guide.

Qwen 3.5 VRAM Requirements — Every Model at Q4, Q8, FP16

This table shows the minimum VRAM needed to load each Qwen 3.5 model at three quantization levels. Data sourced from Unsloth's quantization documentation, Will It Run AI VRAM calculators, and community testing on r/LocalLLaMA.

| Model | Type | Active Params | Q4_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Qwen 3.5 0.8B | Dense | 0.8B | 0.6 GB | 1.0 GB | 1.6 GB |
| Qwen 3.5 1.5B | Dense | 1.5B | 1.1 GB | 1.8 GB | 3.0 GB |
| Qwen 3.5 4B | Dense | 4B | 2.5 GB | 4.5 GB | 8.0 GB |
| Qwen 3.5 9B | Dense | 9B | 5.1 GB | 9.5 GB | 18 GB |
| Qwen 3.5 14B | Dense | 14B | 8.2 GB | 15 GB | 28 GB |
| Qwen 3.5 27B | Dense | 27B | 15 GB | 28 GB | 54 GB |
| Qwen 3.5 35B-A3B | MoE | 3B | 19 GB | 36 GB | 70 GB |
| Qwen 3.5 122B-A10B | MoE | 10B | 67 GB | 126 GB | 244 GB |
| Qwen 3.5 397B-A17B | MoE | 17B | 230 GB | 410 GB | 794 GB |

Why MoE models need more VRAM than their active parameters suggest: the 35B-A3B model only activates 3 billion parameters per token, but all 35 billion parameters must be loaded into memory. The speed benefit is real — inference only computes over the active portion — but the memory footprint reflects the total model size. This is why the 35B-A3B needs 19GB at Q4 despite running "like a 3B model" at inference time.

Context length impact: the numbers above assume a short context. At full 262K context, the KV cache can add 4–8GB of overhead depending on model size and batch configuration. For most interactive use, you'll operate at 4K–32K context where the overhead is 0.5–2GB.
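The figures above are easy to sanity-check: weight memory is roughly total parameters times bytes per weight, counting all parameters for MoE models. Here's a minimal sketch, assuming Q4_K_M averages about 4.5 bits per weight (an approximation, since different layers use different bit widths):

```python
def model_vram_gb(total_params_b, bits_per_weight=4.5):
    """Rough weight footprint in GB: parameter count (in billions)
    times bytes per weight. MoE models count ALL parameters, not the
    active subset. KV cache and runtime buffers come on top of this."""
    return total_params_b * bits_per_weight / 8

print(round(model_vram_gb(27), 1))  # 27B dense at Q4: ~15.2 GB
print(round(model_vram_gb(35), 1))  # 35B-A3B MoE at Q4: ~19.7 GB, all experts loaded
```

These land close to the table's 15GB and 19GB figures; real GGUF files vary slightly because the mixed-precision quantization formats don't use a uniform bit width.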

Best GPUs for Qwen 3.5 Small Models (0.8B – 14B)

Good news if you're on a budget: every Qwen 3.5 small model runs on entry-level hardware. The 9B model at Q4 needs just 5.1GB of VRAM — any GPU with 8GB or more handles it easily. Even the 14B fits in 12–16GB GPUs at Q4.

The best GPU for Qwen 3.5 small models is the RTX 5060 Ti 16GB ($429 – $479), launching April 16. Its 16GB of GDDR7 runs every dense model through 14B at Q4 with room for long contexts, and Blackwell's Flash Attention support delivers strong tok/s performance. Julien Simon, AI infrastructure analyst and former Head of Developer Relations at Hugging Face, noted in his April 2026 GPU roundup: "The RTX 5060 Ti 16GB is the new default recommendation for anyone running sub-20B models locally — 16GB of fast GDDR7 at under $450 is the sweet spot the market's been waiting for."

Budget alternatives worth considering:

| GPU | VRAM | Price | 9B Q4 tok/s | Best For |
|---|---|---|---|---|
| Intel Arc B580 | 12GB GDDR6 | $249 – $289 | ~35 tok/s | Absolute cheapest entry |
| RTX 4060 Ti 16GB | 16GB GDDR6 | $399 – $449 | ~48 tok/s | Ada Lovelace value pick |
| RTX 5060 Ti 16GB | 16GB GDDR7 | $429 – $479 | ~58 tok/s | Best overall for small models |

For a detailed comparison between these budget options, see our RTX 5060 Ti vs RTX 4060 Ti and RTX 4060 Ti vs Intel Arc B580 comparisons. For the full budget GPU landscape, see our budget GPU guide.

Best GPUs for Qwen 3.5 27B and 35B-A3B MoE

This is where Qwen 3.5 gets interesting. The 27B dense model is the most popular size for local use — powerful enough for production-quality output, small enough for a single consumer GPU. At Q4, it needs ~15GB of VRAM.

The best value GPU for Qwen 3.5 27B is a used RTX 3090 ($699 – $999). Its 24GB of VRAM gives 9GB of headroom beyond the model itself — enough for 32K+ context conversations. Community benchmarks on r/LocalLLaMA consistently show 35–40 tok/s on 27B Q4 models, which is well above the interactive threshold.

For maximum performance, the RTX 5090 ($1,999 – $2,199) with 32GB GDDR7 is the clear winner. It runs Qwen 3.5 27B at Q4 with 17GB of headroom, enabling 128K+ context sessions. Julien Simon's benchmark testing showed "55–65 tok/s on 27B Q4 models with the RTX 5090 — roughly 40% faster than the RTX 4090 at the same quantization level." For a detailed comparison, see our RTX 5090 vs RTX 4090 analysis.

The 35B-A3B MoE is the hidden gem. Despite needing 19GB at Q4 (requiring a 24GB+ GPU), it activates only 3B parameters per inference pass — meaning it generates tokens faster than the 27B dense model on equivalent hardware. If your GPU has 24GB+ VRAM, the 35B-A3B can outperform the 27B in both quality and speed.

| GPU | VRAM | Price | 27B Q4 tok/s | 35B-A3B Q4 |
|---|---|---|---|---|
| RTX 5080 | 16GB GDDR7 | $999 – $1,099 | ~45 tok/s (tight) | Won't fit |
| RTX 3090 | 24GB GDDR6X | $699 – $999 | ~38 tok/s | ~52 tok/s |
| RTX 4090 | 24GB GDDR6X | $1,599 – $1,999 | ~48 tok/s | ~68 tok/s |
| RTX 5090 | 32GB GDDR7 | $1,999 – $2,199 | ~62 tok/s | ~85 tok/s |

The RTX 5080 ($999 – $1,099) can technically run 27B at Q4, but 15GB out of 16GB leaves barely 1GB for KV cache and OS overhead. It works for short interactions but struggles with longer contexts. For the full performance comparison, see RTX 5090 vs RTX 5080.
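The RTX 5080's headroom problem is simple arithmetic. A quick fit check, where the 2GB KV-cache and 1GB driver/OS reserves are illustrative defaults rather than measured values:

```python
def fits(model_gb, vram_gb, kv_reserve_gb=2.0, os_reserve_gb=1.0):
    """True if the quantized weights plus illustrative reserves for
    KV cache and driver/OS overhead fit within the card's VRAM."""
    return model_gb + kv_reserve_gb + os_reserve_gb <= vram_gb

print(fits(15, 16))  # 27B Q4 on a 16GB RTX 5080: False, too tight
print(fits(15, 24))  # 27B Q4 on a 24GB RTX 3090: True
```

Short prompts may still work on the 5080 because a small context needs less than the 2GB reserve, but anything long-running will hit the wall.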

Best GPUs for Qwen 3.5 122B-A10B MoE

The 122B-A10B is Qwen 3.5's practical ceiling for local deployment. It has 122 billion total parameters but activates only 10 billion per token — delivering quality that rivals much larger dense models while remaining feasible on high-end consumer hardware. At Q4, the full model requires ~67GB.

No single consumer GPU has 67GB of VRAM. Your options:

  • Dual RTX 4090 (48GB combined): runs at Q4 with aggressive CPU offloading for the overflow layers. Expect 8–12 tok/s. Requires a motherboard with two PCIe x16 slots and a 1200W+ PSU. See our multi-GPU setup guide for configuration details.
  • Dual RTX 3090 (48GB combined): same approach, lower cost ($1,400–$2,000 for the pair), but ~30% slower than dual 4090s. Still viable at 6–9 tok/s.
  • RTX 5090 + CPU offloading: 32GB of VRAM handles roughly half the model, with the rest offloaded to system RAM. Works if you have 128GB+ DDR5 system RAM — see our RAM guide for sizing. Expect 4–8 tok/s depending on RAM bandwidth.

But the best single-device option for Qwen 3.5 122B is the Mac Studio M4 Max ($1,999 – $4,499) with 128GB unified memory. The entire 67GB Q4 model fits in memory with 61GB to spare for the KV cache, OS, and other applications. No multi-GPU complexity, no CPU offloading, completely silent. According to community benchmarks from r/LocalLLaMA, the 128GB Mac Studio M4 Max delivers 12–16 tok/s on 122B-A10B at Q4 via llama.cpp with Metal acceleration — faster than most dual-GPU NVIDIA setups for this specific model.
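For the offloading options above, the GPU/CPU split is easy to estimate. A sketch, assuming a 4GB KV-cache reserve kept on the GPU (an illustrative figure, not a measured one):

```python
def gpu_split(model_gb, vram_gb, kv_reserve_gb=4.0):
    """Split quantized weights between VRAM and system RAM,
    keeping an assumed KV-cache reserve resident on the GPU."""
    on_gpu = max(0.0, min(model_gb, vram_gb - kv_reserve_gb))
    return on_gpu, model_gb - on_gpu

print(gpu_split(67.0, 32.0))  # 122B Q4 on an RTX 5090: (28.0, 39.0)
print(gpu_split(67.0, 48.0))  # dual RTX 4090: (44.0, 23.0)
```

Pushing ~39GB of expert layers through DDR5 each pass is why the single-5090 route lands in the 4–8 tok/s range, while the dual-4090 setup's smaller overflow keeps it closer to 8–12.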

For a detailed head-to-head, see our RTX 5090 vs Mac Studio M4 Max comparison.

Running Qwen 3.5 397B Locally — The Multi-GPU Reality

Let's be honest: the 397B-A17B flagship is a research-grade model. At Q4 quantization it needs ~230GB; at Q8, ~410GB. No consumer hardware setup can run it comfortably.

The minimum viable local configurations:

  • 4× RTX 4090 (96GB total): runs at aggressive Q4 with heavy CPU offloading. 512GB+ system RAM required. Expect 1–3 tok/s — technically functional but not interactive.
  • Enterprise A100 80GB cluster: 3× A100s ($12,000 – $15,000 each) provide 240GB of HBM2e — enough for Q4 with moderate headroom. This is the realistic starting point for production 397B serving.
  • H100 80GB pair: 2× H100s with NVLink provide 160GB of HBM3 at 3,350 GB/s bandwidth per card. That's still short of the ~230GB Q4 footprint, so expect CPU offloading for the ~70GB overflow; the payoff is HBM3 speed on the resident layers.

CraftRigs, an AI hardware benchmarking outlet, documented a 4× RTX 3090 setup (96GB total VRAM) running 397B-A17B at Q4 with 2 tok/s and 512GB DDR5 system RAM for overflow: "It works. It's slow. But every token is yours and the quality is indistinguishable from the API."

Our recommendation: most users should run the 122B-A10B or 35B-A3B instead. The 122B MoE delivers 85–90% of the 397B's quality at a fraction of the hardware cost. If you need 397B-class output, the cloud API is more practical than building a $40K+ local rig.

Apple Silicon for Qwen 3.5 — Mac Mini vs Mac Studio

Apple Silicon is a first-class platform for Qwen 3.5, particularly for the MoE models where unified memory removes the multi-GPU complexity that makes them painful on NVIDIA hardware. MLX and llama.cpp both support Metal acceleration, and Ollama makes setup a one-line command.

| Mac | Unified Memory | Price | Best Qwen 3.5 Model | Performance |
|---|---|---|---|---|
| Mac Mini M4 Pro | 24GB | $1,399 – $1,599 | 9B–14B dense, 27B with aggressive Q4 | ~20 tok/s on 14B Q4 |
| Mac Studio M4 Max | Up to 128GB | $1,999 – $4,499 | 27B, 35B-A3B, 122B-A10B | ~14 tok/s on 122B Q4 |

The Mac Mini M4 Pro ($1,399 – $1,599) is the entry point. Its 24GB of unified memory runs 9B and 14B comfortably, and can handle 27B at Q4 with tight memory but functional speed. It's silent, plug-and-play, and excellent for AI agent workflows. See our mini PC guide for more compact options.

When to choose Apple Silicon over NVIDIA: if you prioritize silence, if you need to run 122B without multi-GPU complexity, if you want zero-config Ollama setup, or if your workflow is primarily inference (not training). NVIDIA wins on raw tok/s for models that fit in a single GPU, but Apple Silicon wins on simplicity and memory capacity per dollar.

Budget Build: Run Qwen 3.5 Under $500

Not everyone needs a 27B model. The Qwen 3.5 small series (0.8B–9B) punches well above its weight, and the 9B model is competitive with many older 13B models on standard benchmarks.

  • Intel Arc B580 ($249 – $289): 12GB GDDR6 runs every small model through 9B at Q4. The cheapest way to get real local AI inference. Requires an existing PC with a PCIe slot.
  • Beelink SER8 ($449 – $599): complete mini PC with AMD Ryzen 7 8845HS and Radeon 780M integrated graphics. Runs 0.8B–4B models natively and handles 9B at aggressive quantization. Great for local AI agents and lightweight inference.
  • Used RTX 3090 ($699 – $999): technically over $500 for the GPU alone, but the RTX 3090 with 24GB VRAM remains the single best value GPU in 2026 for local AI. It jumps you from small models straight to 27B. See our AI on a budget hub for more options.

For non-builders, consider a prebuilt AI workstation that bundles a GPU with an optimized system — no assembly required.

Software Setup — Ollama, llama.cpp, vLLM

Once you have hardware, getting Qwen 3.5 running takes minutes:

Ollama is the fastest path. One command pulls and runs any Qwen 3.5 model:

ollama run qwen3.5:27b

Available tags include qwen3.5:0.8b, qwen3.5:9b, qwen3.5:14b, qwen3.5:27b, qwen3.5:35b-a3b, qwen3.5:122b-a10b, and qwen3.5:397b-a17b. For a complete setup walkthrough, see our Ollama setup guide.

llama.cpp offers maximum control and the best MoE performance. Download GGUF files from Hugging Face (the Qwen team and Unsloth both publish optimized quantizations) and run directly. Use Q4_K_M as the default quantization — Unsloth recommends Q4_K_XL for MoE models if available, which preserves more quality in the expert layers.

vLLM is the right choice for serving Qwen 3.5 to multiple users or applications simultaneously. It supports AWQ and GPTQ quantizations, continuous batching, and efficient KV cache management — ideal for production API endpoints.

For the complete guide on running LLMs locally with any framework, see our local LLM guide. For GPU selection across all models (not just Qwen), see the GPU buying guide.

Qwen 3.5 vs Qwen 3 — What Changed for Hardware?

If you bought hardware for Qwen 3, here's the practical impact of upgrading to 3.5:

| Change | Qwen 3 | Qwen 3.5 | Hardware Impact |
|---|---|---|---|
| Context window | 128K tokens | 262K tokens | +2–4GB KV cache at full context |
| Multimodal | Text only | Text + image + video | Minimal — vision encoder adds ~0.5GB |
| Architecture | Dense + 1 MoE (235B-A22B) | Dense + 3 MoE sizes | New 35B-A3B and 122B-A10B options |
| Language support | 119 languages | 201 languages | Minimal — vocabulary expansion is small |
| Small model sizes | 0.6B, 1.7B, 4B, 8B | 0.8B, 1.5B, 4B, 9B, 14B | Slightly larger, ~10% more VRAM |

If you have an RTX 4090 or RTX 3090 (24GB): you ran Qwen 3 8B and 32B comfortably. You can run Qwen 3.5 9B, 14B, and 27B at Q4 with the same hardware — no upgrade needed.

If you have an RTX 5090 (32GB): you can run the new 35B-A3B MoE model that didn't exist in Qwen 3. This is the best new option — MoE quality at 3B inference speed.

If you have a Mac Studio M4 Max (128GB): Qwen 3's largest practical model was 72B. With Qwen 3.5, you can now run the 122B-A10B MoE — a significant quality jump on the same hardware.

Bottom Line — What to Buy for Qwen 3.5 in April 2026

| Budget | Best Hardware | Best Qwen 3.5 Model | Expected tok/s |
|---|---|---|---|
| Under $300 | Intel Arc B580 ($249 – $289) | 9B Q4 | ~35 tok/s |
| Under $500 | RTX 5060 Ti 16GB ($429 – $479) or Beelink SER8 ($449 – $599) | 14B Q4 (5060 Ti) / 4B (SER8) | ~50 / ~15 tok/s |
| Under $1,000 | Used RTX 3090 ($699 – $999) | 27B Q4, 35B-A3B Q4 | ~38 / ~52 tok/s |
| Under $1,600 | Mac Mini M4 Pro ($1,399 – $1,599) | 14B Q4, 27B aggressive Q4 | ~20 tok/s |
| Under $2,200 | RTX 5090 ($1,999 – $2,199) | 27B Q4–Q8, 35B-A3B Q4 | ~62 / ~85 tok/s |
| Under $4,500 | Mac Studio M4 Max 128GB ($1,999 – $4,499) | 122B-A10B Q4 | ~14 tok/s |
| Enterprise | A100 80GB cluster ($12,000 – $15,000 each) | 397B-A17B Q4 | Varies by cluster |

For most users, the used RTX 3090 at $699–$999 remains the best overall value for Qwen 3.5. It runs the most popular 27B model at interactive speeds, handles the 35B MoE comfortably, and leaves upgrade headroom. If you want the absolute fastest consumer experience, the RTX 5090 is worth the premium.

If 122B-A10B quality is your target, skip the multi-GPU complexity and buy a Mac Studio M4 Max with 128GB. It's the simplest path to running Qwen 3.5's most capable practical model — see our full NVIDIA vs Apple comparison for the tradeoffs.

For a broader view of all GPU options beyond Qwen 3.5, check our best GPU for AI ranking. And for fast NVMe storage to speed up model loading times, add a Samsung 990 Pro ($289 – $339) — loading a 15GB Q4 model from NVMe takes under 3 seconds vs 10+ seconds from a SATA SSD.
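The load-time claim is straightforward sequential-read arithmetic (the drive speeds below are nominal sequential figures; real loads add filesystem and allocation overhead):

```python
model_gb = 15  # 27B at Q4

print(round(model_gb / 7.0, 1))   # Gen4 NVMe at ~7 GB/s: ~2.1 s
print(round(model_gb / 0.55, 1))  # SATA SSD at ~550 MB/s: ~27.3 s
```

The gap only matters at load time, not during inference, but it adds up if you swap between model sizes frequently.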

Tags: Qwen 3.5 · local AI · GPU · hardware guide · VRAM · MoE · RTX 5090 · RTX 4090 · Mac Studio · Ollama · quantization
