Qwen 3.5 Local Hardware Guide 2026: Every Model from 0.8B to 397B
Qwen 3.5 rewrites the local AI playbook with native multimodal, 262K context, and hybrid MoE. Here's exactly which GPU, Mac, or mini PC you need for every model size — with VRAM math, tok/s benchmarks, and price-tiered recommendations from $250 to enterprise.
Compute Market Team
Qwen 3.5 is the most capable open-weight model family available in April 2026 — and the most complex to buy hardware for. Nine model sizes spanning 0.8B to 397B parameters, two architecture families (dense and hybrid MoE, all natively multimodal), and a 262K context window that rewrites every VRAM calculation from the Qwen 3 era.
This guide covers every Qwen 3.5 model size with exact VRAM requirements, price-tiered GPU recommendations from $250 to enterprise, Apple Silicon coverage that PC-focused guides miss, and real tok/s benchmarks so you know what performance to expect before buying. If you ran hardware for Qwen 3, we'll tell you exactly what's changed.
The bottom line: Running Qwen 3.5 27B locally at Q4 quantization requires 15GB VRAM — an RTX 4090 (24GB) delivers 40–50 tokens per second at this precision, making it the best price-to-performance GPU for Qwen 3.5's most popular model size in April 2026.
What Is Qwen 3.5? — The Complete Model Family
Released by Alibaba Cloud's Qwen team across three waves in February–March 2026, Qwen 3.5 is architecturally different from its predecessor in ways that directly affect hardware requirements:
- Feb 16 — Flagship: 397B-A17B, the largest open-weight MoE model available, with 397 billion total parameters but only 17 billion active per token
- Feb 24 — Medium series: 27B (dense), 35B-A3B (MoE), 122B-A10B (MoE) — the models most local AI users will run
- Mar 2 — Small series: 0.8B, 1.5B, 4B, 9B, 14B — optimized for phones, laptops, and edge devices
Three architectural changes matter for hardware planning:
- Native multimodal: every Qwen 3.5 model processes text, images, and video natively — no separate vision adapter required
- 262K context window: double Qwen 3's 128K; the KV cache grows linearly with context length, so its worst-case footprint doubles too
- Hybrid MoE architecture: the 35B, 122B, and 397B models use Mixture of Experts, activating only a fraction of parameters per token — dramatically reducing compute requirements relative to model size
For a deeper dive into how VRAM works and why it's the primary constraint, see our complete VRAM guide.
Qwen 3.5 VRAM Requirements — Every Model at Q4, Q8, FP16
This table shows the minimum VRAM needed to load each Qwen 3.5 model at three quantization levels. Data sourced from Unsloth's quantization documentation, Will It Run AI VRAM calculators, and community testing on r/LocalLLaMA.
| Model | Type | Active Params | Q4_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Qwen 3.5 0.8B | Dense | 0.8B | 0.6 GB | 1.0 GB | 1.6 GB |
| Qwen 3.5 1.5B | Dense | 1.5B | 1.1 GB | 1.8 GB | 3.0 GB |
| Qwen 3.5 4B | Dense | 4B | 2.5 GB | 4.5 GB | 8.0 GB |
| Qwen 3.5 9B | Dense | 9B | 5.1 GB | 9.5 GB | 18 GB |
| Qwen 3.5 14B | Dense | 14B | 8.2 GB | 15 GB | 28 GB |
| Qwen 3.5 27B | Dense | 27B | 15 GB | 28 GB | 54 GB |
| Qwen 3.5 35B-A3B | MoE | 3B | 19 GB | 36 GB | 70 GB |
| Qwen 3.5 122B-A10B | MoE | 10B | 67 GB | 126 GB | 244 GB |
| Qwen 3.5 397B-A17B | MoE | 17B | 230 GB | 410 GB | 794 GB |
Why MoE models need more VRAM than their active parameters suggest: the 35B-A3B model only activates 3 billion parameters per token, but all 35 billion parameters must be loaded into memory. The speed benefit is real — inference only computes over the active portion — but the memory footprint reflects the total model size. This is why the 35B-A3B needs 19GB at Q4 despite running "like a 3B model" at inference time.
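A back-of-envelope sketch in Python makes the memory-versus-compute split concrete. The ~0.56 bytes per parameter for Q4_K_M is a rough community rule of thumb, not an official figure:

```python
# MoE rule of thumb: memory scales with TOTAL parameters,
# per-token compute scales with ACTIVE parameters.
Q4_BYTES_PER_PARAM = 0.56  # rough assumption for Q4_K_M

def q4_gb(params_billions: float) -> float:
    """Approximate Q4_K_M footprint in GB for a parameter count in billions."""
    return params_billions * 1e9 * Q4_BYTES_PER_PARAM / 1e9

for name, total_b, active_b in [("27B dense", 27, 27),
                                ("35B-A3B", 35, 3),
                                ("122B-A10B", 122, 10)]:
    print(f"{name:>10}: ~{q4_gb(total_b):.0f} GB to load, "
          f"~{q4_gb(active_b):.0f} GB of weights read per token")
```

The estimates land within a gigabyte or two of the table above, which is as close as a single bytes-per-parameter constant can get.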
Context length impact: the numbers above assume a short context. At full 262K context, the KV cache adds substantially more: typically 4–8GB with a quantized cache, and potentially double that with a full-precision f16 cache, depending on model size and batch configuration. For most interactive use, you'll operate at 4K–32K context, where the overhead is 0.5–2GB.
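To estimate the cache yourself, the standard formula for grouped-query attention is sketched below. The layer and head counts are placeholder assumptions, not Qwen 3.5's published configuration (read the real values from the model's config.json); the output also shows why cache quantization, such as llama.cpp's --cache-type-k q8_0, matters at long context:

```python
# KV cache size: K and V stored per layer, per KV head, per token.
# layers / kv_heads / head_dim are PLACEHOLDERS, not Qwen 3.5's real config.
def kv_cache_gb(context_len: int, layers: int = 32, kv_heads: int = 4,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * context_len / 1e9

for ctx in (4_096, 32_768, 262_144):
    f16 = kv_cache_gb(ctx)                      # default f16 cache
    q8 = kv_cache_gb(ctx, bytes_per_elem=1.0)   # quantized cache, e.g. q8_0
    print(f"{ctx:>7} tokens: ~{f16:.1f} GB f16 cache, ~{q8:.1f} GB q8_0 cache")
```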
Best GPUs for Qwen 3.5 Small Models (0.8B – 14B)
Good news if you're on a budget: every Qwen 3.5 small model runs on entry-level hardware. The 9B model at Q4 needs just 5.1GB of VRAM — any GPU with 8GB or more handles it easily. Even the 14B fits in 12–16GB GPUs at Q4.
The best GPU for Qwen 3.5 small models is the RTX 5060 Ti 16GB ($429 – $479), launching April 16. Its 16GB of GDDR7 runs every dense model through 14B at Q4 with room for long contexts, and Blackwell's Flash Attention support delivers strong tok/s performance. Julien Simon, AI infrastructure analyst and former Head of Developer Relations at Hugging Face, noted in his April 2026 GPU roundup: "The RTX 5060 Ti 16GB is the new default recommendation for anyone running sub-20B models locally — 16GB of fast GDDR7 at under $450 is the sweet spot the market's been waiting for."
Budget alternatives worth considering:
| GPU | VRAM | Price | 9B Q4 tok/s | Best For |
|---|---|---|---|---|
| Intel Arc B580 | 12GB GDDR6 | $249 – $289 | ~35 tok/s | Absolute cheapest entry |
| RTX 4060 Ti 16GB | 16GB GDDR6 | $399 – $449 | ~48 tok/s | Ada Lovelace value pick |
| RTX 5060 Ti 16GB | 16GB GDDR7 | $429 – $479 | ~58 tok/s | Best overall for small models |
For a detailed comparison between these budget options, see our RTX 5060 Ti vs RTX 4060 Ti and RTX 4060 Ti vs Intel Arc B580 comparisons. For the full budget GPU landscape, see our budget GPU guide.
Best GPUs for Qwen 3.5 27B and 35B-A3B MoE
This is where Qwen 3.5 gets interesting. The 27B dense model is the most popular size for local use — powerful enough for production-quality output, small enough for a single consumer GPU. At Q4, it needs ~15GB of VRAM.
The best value GPU for Qwen 3.5 27B is a used RTX 3090 ($699 – $999). Its 24GB of VRAM gives 9GB of headroom beyond the model itself — enough for 32K+ context conversations. Community benchmarks on r/LocalLLaMA consistently show 35–40 tok/s on 27B Q4 models, which is well above the interactive threshold.
For maximum performance, the RTX 5090 ($1,999 – $2,199) with 32GB GDDR7 is the clear winner. It runs Qwen 3.5 27B at Q4 with 17GB of headroom, enabling 128K+ context sessions. Julien Simon's benchmark testing showed "55–65 tok/s on 27B Q4 models with the RTX 5090 — roughly 40% faster than the RTX 4090 at the same quantization level." For a detailed comparison, see our RTX 5090 vs RTX 4090 analysis.
The 35B-A3B MoE is the hidden gem. Despite needing 19GB at Q4 (requiring a 24GB+ GPU), it activates only 3B parameters per token — meaning it generates tokens faster than the 27B dense model on equivalent hardware. If your GPU has 24GB+ VRAM, the 35B-A3B can outperform the 27B in both quality and speed.
| GPU | VRAM | Price | 27B Q4 tok/s | 35B-A3B Q4 |
|---|---|---|---|---|
| RTX 5080 | 16GB GDDR7 | $999 – $1,099 | ~45 tok/s (tight) | Won't fit |
| RTX 3090 | 24GB GDDR6X | $699 – $999 | ~38 tok/s | ~52 tok/s |
| RTX 4090 | 24GB GDDR6X | $1,599 – $1,999 | ~48 tok/s | ~68 tok/s |
| RTX 5090 | 32GB GDDR7 | $1,999 – $2,199 | ~62 tok/s | ~85 tok/s |
The RTX 5080 ($999 – $1,099) can technically run 27B at Q4, but 15GB out of 16GB leaves barely 1GB for KV cache and OS overhead. It works for short interactions but struggles with longer contexts. For the full performance comparison, see RTX 5090 vs RTX 5080.
Best GPUs for Qwen 3.5 122B-A10B MoE
The 122B-A10B is Qwen 3.5's practical ceiling for local deployment. It has 122 billion total parameters but activates only 10 billion per token — delivering quality that rivals much larger dense models while remaining feasible on high-end consumer hardware. At Q4, the full model requires ~67GB.
No single consumer GPU has 67GB of VRAM. Your options:
- Dual RTX 4090 (48GB combined): runs at Q4 with aggressive CPU offloading for the overflow layers. Expect 8–12 tok/s. Requires a motherboard with two PCIe x16 slots and a 1200W+ PSU. See our multi-GPU setup guide for configuration details.
- Dual RTX 3090 (48GB combined): same approach, lower cost ($1,400–$2,000 for the pair), but ~30% slower than dual 4090s. Still viable at 6–9 tok/s.
- RTX 5090 + CPU offloading: 32GB of VRAM handles roughly half the model, with the rest offloaded to system RAM. Works if you have 128GB+ DDR5 system RAM — see our RAM guide for sizing. Expect 4–8 tok/s depending on RAM bandwidth; a configuration sketch follows this list.
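For any of the offload configurations above, llama.cpp exposes the VRAM/RAM split through a single knob: the number of layers kept on the GPU. A minimal llama-cpp-python sketch, assuming a hypothetical 122B GGUF filename; tune n_gpu_layers until VRAM is nearly full:

```python
# Partial GPU offload with llama-cpp-python: layers set via n_gpu_layers
# stay in VRAM, the rest are served from system RAM.
# The GGUF filename and the layer split are hypothetical placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-122b-a10b-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=40,    # raise until you hit out-of-memory, then back off
    n_ctx=16384,        # a smaller context keeps the KV cache manageable
)
out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```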
But the best single-device option for Qwen 3.5 122B is the Mac Studio M4 Max ($1,999 – $4,499) with 128GB unified memory. The entire 67GB Q4 model fits in memory with 61GB to spare for the KV cache, OS, and other applications. No multi-GPU complexity, no CPU offloading, completely silent. According to community benchmarks from r/LocalLLaMA, the 128GB Mac Studio M4 Max delivers 12–16 tok/s on 122B-A10B at Q4 via llama.cpp with Metal acceleration — faster than most dual-GPU NVIDIA setups for this specific model.
For a detailed head-to-head, see our RTX 5090 vs Mac Studio M4 Max comparison.
Running Qwen 3.5 397B Locally — The Multi-GPU Reality
Let's be honest: the 397B-A17B flagship is a research-grade model. At Q4 quantization it needs ~230GB; at Q8, ~410GB. No consumer hardware setup can run it comfortably.
The minimum viable local configurations:
- 4× RTX 4090 (96GB total): runs at aggressive Q4 with heavy CPU offloading. 512GB+ system RAM required. Expect 1–3 tok/s — technically functional but not interactive.
- Enterprise A100 80GB cluster: 3× A100s ($12,000 – $15,000 each) provide 240GB of HBM2e — enough for the ~230GB Q4 model, though with only slim headroom for the KV cache. This is the realistic starting point for production 397B serving.
- H100 80GB pair: 2× H100s with NVLink provide 160GB of HBM3 at 3,350 GB/s of bandwidth per card. Even at Q4 the ~230GB footprint exceeds the combined VRAM, so roughly 70GB must still be offloaded to system RAM; the fast HBM3 keeps the resident layers quick, but the offloaded share caps throughput.
CraftRigs, an AI hardware benchmarking outlet, documented a 4× RTX 3090 setup (96GB total VRAM) running 397B-A17B at Q4 with 2 tok/s and 512GB DDR5 system RAM for overflow: "It works. It's slow. But every token is yours and the quality is indistinguishable from the API."
Our recommendation: most users should run the 122B-A10B or 35B-A3B instead. The 122B MoE delivers 85–90% of the 397B's quality at a fraction of the hardware cost. If you need 397B-class output, the cloud API is more practical than building a $40K+ local rig.
Apple Silicon for Qwen 3.5 — Mac Mini vs Mac Studio
Apple Silicon is a first-class platform for Qwen 3.5, particularly for the MoE models where unified memory removes the multi-GPU complexity that makes them painful on NVIDIA hardware. MLX and llama.cpp both support Metal acceleration, and Ollama makes setup a one-line command.
| Mac | Unified Memory | Price | Best Qwen 3.5 Model | Performance |
|---|---|---|---|---|
| Mac Mini M4 Pro | 24GB | $1,399 – $1,599 | 9B–14B dense, 27B with aggressive Q4 | ~20 tok/s on 14B Q4 |
| Mac Studio M4 Max | Up to 128GB | $1,999 – $4,499 | 27B, 35B-A3B, 122B-A10B | ~14 tok/s on 122B Q4 |
The Mac Mini M4 Pro ($1,399 – $1,599) is the entry point. Its 24GB of unified memory runs 9B and 14B comfortably, and can handle 27B at Q4 with tight memory but functional speed. It's silent, plug-and-play, and excellent for AI agent workflows. See our mini PC guide for more compact options.
When to choose Apple Silicon over NVIDIA: if you prioritize silence, if you need to run 122B without multi-GPU complexity, if you want zero-config Ollama setup, or if your workflow is primarily inference (not training). NVIDIA wins on raw tok/s for models that fit in a single GPU, but Apple Silicon wins on simplicity and memory capacity per dollar.
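For scripted inference on Apple Silicon, the mlx-lm package mentioned above is the idiomatic route. A minimal sketch, assuming a hypothetical Hugging Face repo id (check the mlx-community organization for actual Qwen 3.5 conversions once they're published):

```python
# Minimal MLX inference on a Mac via mlx-lm (pip install mlx-lm).
# The repo id below is a hypothetical placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-27B-4bit")  # hypothetical repo id
text = generate(model, tokenizer,
                prompt="Why does unified memory help MoE models?",
                max_tokens=200)
print(text)
```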
Budget Build: Run Qwen 3.5 Under $500
Not everyone needs a 27B model. The Qwen 3.5 small series (0.8B–9B) punches well above its weight, and the 9B model is competitive with many older 13B models on standard benchmarks.
- Intel Arc B580 ($249 – $289): 12GB GDDR6 runs every small model through 9B at Q4. The cheapest way to get real local AI inference. Requires an existing PC with a PCIe slot.
- Beelink SER8 ($449 – $599): complete mini PC with AMD Ryzen 7 8845HS and Radeon 780M integrated graphics. Runs 0.8B–4B models natively and handles 9B at aggressive quantization. Great for local AI agents and lightweight inference.
- Used RTX 3090 ($699 – $999): technically over $500 for the GPU alone, but the RTX 3090 with 24GB VRAM remains the single best value GPU in 2026 for local AI. It jumps you from small models straight to 27B. See our AI on a budget hub for more options.
For non-builders, consider a prebuilt AI workstation that bundles a GPU with an optimized system — no assembly required.
Software Setup — Ollama, llama.cpp, vLLM
Once you have hardware, getting Qwen 3.5 running takes minutes:
Ollama is the fastest path. One command pulls and runs any Qwen 3.5 model:
```bash
ollama run qwen3.5:27b
```
Available tags include qwen3.5:0.8b, qwen3.5:9b, qwen3.5:14b, qwen3.5:27b, qwen3.5:35b-a3b, qwen3.5:122b-a10b, and qwen3.5:397b-a17b. For a complete setup walkthrough, see our Ollama setup guide.
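Ollama also runs a local HTTP server (port 11434 by default), so you can script against any pulled model. A minimal example with the official Python client (pip install ollama), using the 27b tag above:

```python
# Chatting with a local Ollama model from Python.
import ollama

response = ollama.chat(
    model="qwen3.5:27b",  # any tag you've pulled with `ollama run` or `ollama pull`
    messages=[{"role": "user", "content": "What fits in 24GB of VRAM at Q4?"}],
)
print(response["message"]["content"])
```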
llama.cpp offers maximum control and the best MoE performance. Download GGUF files from Hugging Face (the Qwen team and Unsloth both publish optimized quantizations) and run directly. Use Q4_K_M as the default quantization — Unsloth recommends Q4_K_XL for MoE models if available, which preserves more quality in the expert layers.
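A minimal sketch of that workflow in Python, downloading a GGUF with huggingface_hub and loading it fully on-GPU through llama-cpp-python. The repo id and filename are hypothetical placeholders; browse the Qwen and Unsloth accounts for the actual uploads:

```python
# Fetch a Q4_K_M GGUF and load every layer onto the GPU.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

path = hf_hub_download(
    repo_id="unsloth/Qwen3.5-27B-GGUF",   # hypothetical repo id
    filename="Qwen3.5-27B-Q4_K_M.gguf",   # hypothetical filename
)
llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=32768)  # -1 = all layers on GPU
print(llm("List three uses for a 262K context window.",
          max_tokens=128)["choices"][0]["text"])
```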
vLLM is the right choice for serving Qwen 3.5 to multiple users or applications simultaneously. It supports AWQ and GPTQ quantizations, continuous batching, and efficient KV cache management — ideal for production API endpoints.
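A minimal sketch of vLLM's offline batch API; the quantized model id is a hypothetical placeholder, and a production endpoint would typically run vllm serve for an OpenAI-compatible HTTP server instead:

```python
# Batch inference with vLLM. AWQ keeps the weights 4-bit on the GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3.5-27B-Instruct-AWQ",  # hypothetical model id
          quantization="awq", max_model_len=32768)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Draft a VRAM checklist for a 24GB GPU."], params)
print(outputs[0].outputs[0].text)
```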
For the complete guide on running LLMs locally with any framework, see our local LLM guide. For GPU selection across all models (not just Qwen), see the GPU buying guide.
Qwen 3.5 vs Qwen 3 — What Changed for Hardware?
If you bought hardware for Qwen 3, here's the practical impact of upgrading to 3.5:
| Change | Qwen 3 | Qwen 3.5 | Hardware Impact |
|---|---|---|---|
| Context window | 128K tokens | 262K tokens | +2–4GB KV cache at full context |
| Multimodal | Text only | Text + image + video | Minimal — vision encoder adds ~0.5GB |
| Architecture | Dense + 2 MoE (30B-A3B, 235B-A22B) | Dense + 3 MoE sizes | New 35B-A3B and 122B-A10B options |
| Language support | 119 languages | 201 languages | Minimal — vocabulary expansion is small |
| Small model sizes | 0.6B, 1.7B, 4B, 8B | 0.8B, 1.5B, 4B, 9B, 14B | Slightly larger, ~10% more VRAM |
If you have an RTX 4090 or RTX 3090 (24GB): you ran Qwen 3 8B and 32B comfortably. You can run Qwen 3.5 9B, 14B, and 27B at Q4 with the same hardware — no upgrade needed.
If you have an RTX 5090 (32GB): you can run the new 35B-A3B MoE model that didn't exist in Qwen 3. This is the best new option — MoE quality at 3B inference speed.
If you have a Mac Studio M4 Max (128GB): Qwen 3 topped out at the 32B dense for you, since its 235B-A22B MoE needed roughly 130GB at Q4 and didn't fit. With Qwen 3.5, you can now run the 122B-A10B MoE — a significant quality jump on the same hardware.
Bottom Line — What to Buy for Qwen 3.5 in April 2026
| Budget | Best Hardware | Best Qwen 3.5 Model | Expected tok/s |
|---|---|---|---|
| Under $300 | Intel Arc B580 ($249 – $289) | 9B Q4 | ~35 tok/s |
| Under $500 | RTX 5060 Ti 16GB ($429 – $479) or Beelink SER8 ($449 – $599) | 14B Q4 (5060 Ti) / 4B (SER8) | ~50 / ~15 tok/s |
| Under $1,000 | Used RTX 3090 ($699 – $999) | 27B Q4, 35B-A3B Q4 | ~38 / ~52 tok/s |
| Under $1,600 | Mac Mini M4 Pro ($1,399 – $1,599) | 14B Q4, 27B aggressive Q4 | ~20 tok/s |
| Under $2,200 | RTX 5090 ($1,999 – $2,199) | 27B Q4–Q8, 35B-A3B Q4 | ~62 / ~85 tok/s |
| Under $4,500 | Mac Studio M4 Max 128GB ($1,999 – $4,499) | 122B-A10B Q4 | ~14 tok/s |
| Enterprise | A100 80GB cluster ($12,000 – $15,000 each) | 397B-A17B Q4 | Varies by cluster |
For most users, the used RTX 3090 at $699–$999 remains the best overall value for Qwen 3.5. It runs the most popular 27B model at interactive speeds, handles the 35B MoE comfortably, and leaves upgrade headroom. If you want the absolute fastest consumer experience, the RTX 5090 is worth the premium.
If 122B-A10B quality is your target, skip the multi-GPU complexity and buy a Mac Studio M4 Max with 128GB. It's the simplest path to running Qwen 3.5's most capable practical model — see our full NVIDIA vs Apple comparison for the tradeoffs.
For a broader view of all GPU options beyond Qwen 3.5, check our best GPU for AI ranking. And for fast NVMe storage to speed up model loading times, add a Samsung 990 Pro ($289 – $339) — at ~7GB/s sequential reads, loading a 15GB Q4 model takes roughly 2 seconds, versus close to 30 seconds from a ~550MB/s SATA SSD.