How Much VRAM Do You Need for AI in 2026?
A practical guide to GPU memory requirements for every AI workload — LLM inference, training, image generation, and video. Includes a complete VRAM lookup table by model and quantization level, plus hardware recommendations.
Compute Market Team
Our Top Pick
NVIDIA GeForce RTX 4090
$1,599 – $1,999 | 24GB GDDR6X | 16,384 CUDA cores | 1,008 GB/s
VRAM Is the Bottleneck. Here's How Much You Actually Need.
Every AI workload lives or dies by one spec: VRAM (Video Random Access Memory) — the dedicated memory on your GPU. It determines which models you can load, how fast they run, and whether you can train or just run inference. Get the VRAM wrong, and your $2,000 GPU becomes a space heater that throws out-of-memory errors.
This guide gives you the exact numbers. We'll cover VRAM requirements for LLM inference, fine-tuning, image generation, and video — then map those requirements to real GPUs you can buy today.
Note
All VRAM numbers in this guide are measured values at default context lengths (typically 2,048–4,096 tokens). Extending context to 8K, 32K, or 128K tokens increases memory usage significantly — we cover that in the KV cache section below.
The Quick Formula
Before the tables, understand the math behind every number. The baseline VRAM required to load any model is:
VRAM (GB) = Parameters (billions) × Bytes per Parameter × 1.2 overhead
The "bytes per parameter" depends on precision:
| Precision | Bytes per Parameter | Relative Size | Quality Impact |
|---|---|---|---|
| FP32 (full) | 4 bytes | 100% (baseline) | Maximum precision — rarely used for inference |
| FP16 / BF16 | 2 bytes | 50% | Standard inference precision — negligible quality loss |
| INT8 (Q8) | 1 byte | 25% | Near-lossless — most benchmarks within 1% of FP16 |
| INT4 (Q4_K_M) | 0.5 bytes | 12.5% | 95–98% quality retention — the sweet spot for consumer GPUs |
The 1.2x overhead accounts for runtime buffers, CUDA kernels, and a minimal KV cache. Real-world usage typically falls within 1.1x–1.5x depending on context length and batch size.
For example: a 7B parameter model at INT4 quantization requires roughly 7 × 0.5 × 1.2 = 4.2 GB. In practice, tools like Ollama report ~5 GB for Llama 3.1 8B at Q4_K_M — right in line with the formula[1].
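The formula is simple enough to script. Here is a minimal Python sketch (function and dictionary names are our own; the bytes-per-parameter values mirror the precision table above):

```python
# Estimate baseline VRAM to load a model:
# params (billions) x bytes per parameter x 1.2 runtime overhead.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,   # also bf16
    "int8": 1.0,   # Q8
    "int4": 0.5,   # Q4_K_M
}

def estimate_vram_gb(params_billion: float, precision: str,
                     overhead: float = 1.2) -> float:
    """Rough VRAM (GB) needed to load model weights plus runtime buffers."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead

# A 7B model at INT4: 7 x 0.5 x 1.2 = 4.2 GB
print(round(estimate_vram_gb(7, "int4"), 1))  # 4.2
# Llama 3.1 8B at Q4_K_M: 8 x 0.5 x 1.2 = 4.8 GB, close to Ollama's ~5 GB
print(round(estimate_vram_gb(8, "int4"), 1))  # 4.8
```

Treat the output as a floor, not a guarantee: real overhead drifts between 1.1x and 1.5x with context length and batch size, as noted above.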
The Master VRAM Lookup Table
This is the table to bookmark. It covers the most popular open-source models at three quantization levels; figures are approximate real-world footprints with runtime overhead included, so they may differ slightly from the raw formula.
| Model | Parameters | FP16 VRAM | INT8 (Q8) VRAM | INT4 (Q4_K_M) VRAM | Min. GPU |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~16 GB | ~8.5 GB | ~5 GB | RTX 3060 12GB |
| Mistral 7B | 7.2B | ~15 GB | ~8 GB | ~4.5 GB | RTX 3060 12GB |
| DeepSeek-R1 14B | 14B | ~29 GB | ~15 GB | ~9 GB | RTX 3060 12GB (INT4) |
| Qwen 2.5 32B | 32B | ~66 GB | ~34 GB | ~20 GB | RTX 4090 24GB (INT4) |
| Llama 3.1 70B | 70B | ~140 GB | ~75 GB | ~38 GB | A100 80GB (INT8) or 2×RTX 4090 (INT4) |
| Llama 3.1 405B | 405B | ~810 GB | ~420 GB | ~220 GB | Multi-GPU cluster |
| Mixtral 8x7B | 46.7B (MoE) | ~93 GB | ~49 GB | ~26 GB | RTX 5090 32GB (INT4) |
| Phi-3 Mini 3.8B | 3.8B | ~8 GB | ~4.5 GB | ~2.8 GB | Any 4GB+ GPU |
| Gemma 2 9B | 9B | ~18 GB | ~10 GB | ~6 GB | RTX 3060 12GB |
How to read this table: Find your model, pick the quantization level that fits your GPU's VRAM, and ensure you have at least that amount of free memory. If you're between sizes, the INT4 (Q4_K_M) column is where most people land — it's the best tradeoff of quality vs. memory.
Pro Tip
Q4_K_M quantization retains 95–98% of full-precision model quality while using roughly 75% less VRAM. For most chat, coding, and creative writing tasks, users cannot distinguish Q4_K_M outputs from FP16[2]. This is the quantization level to default to on consumer hardware.
The Hidden VRAM Cost: KV Cache and Context Length
The lookup table above shows VRAM for loading the model weights. But there's a second cost that catches people off guard: the KV (Key-Value) cache — the memory the model uses to "remember" your conversation.
The KV cache grows linearly with context length. The longer your prompt or conversation, the more VRAM it consumes. For a 7B–8B model, each 1,000 tokens of context adds approximately 0.1–0.2 GB of VRAM. That sounds small, but it adds up fast at longer context windows:
| Context Length | KV Cache Overhead (8B model) | KV Cache Overhead (70B model) |
|---|---|---|
| 2,048 tokens | ~0.2 GB | ~1.6 GB |
| 4,096 tokens | ~0.4 GB | ~3.3 GB |
| 8,192 tokens | ~0.8 GB | ~6.5 GB |
| 32,768 tokens | ~3.2 GB | ~26 GB |
| 128,000 tokens | ~12 GB | ~100 GB+ |
At 128K context, the KV cache for a 70B model can exceed the weight memory itself — which is why running long-context models locally requires serious hardware[3].
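The KV cache figures above follow from a standard formula: two tensors (keys and values) per layer, per token. A hedged sketch — the config values are Llama-3.1-8B-style assumptions (32 layers, 8 KV heads via grouped-query attention, head dimension 128), and measured figures vary by inference engine and cache precision:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: K and V tensors per layer, per cached token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 1e9

# Llama-3.1-8B-style config (assumed), FP16 cache, 8K context:
print(round(kv_cache_gb(32, 8, 128, 8192), 2))   # ~1.07 GB
# Same config at 128K context:
print(round(kv_cache_gb(32, 8, 128, 128000), 1))  # ~16.8 GB
```

Older models that use full multi-head attention (one KV head per attention head) multiply this by 4–8x, which is why grouped-query attention has become standard for long-context models.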
Warning
If you're running out of VRAM mid-conversation but the model loaded fine initially, the KV cache is the culprit. Reduce context length with --num-ctx 2048 in Ollama or -c 2048 in llama.cpp to reclaim memory.
Inference vs. Training: Completely Different VRAM Budgets
Running a model (inference) and training a model are two very different workloads. Training requires dramatically more memory because of optimizer states, gradients, and activations stored during backpropagation.
| Workload | VRAM Multiplier vs. Model Size | Example: 7B Model at FP16 |
|---|---|---|
| Inference | 1.2x model weights | ~16 GB |
| LoRA fine-tuning (INT4) | 1.5–2x quantized weights | ~8–10 GB |
| Full fine-tuning (FP16) | 3–4x model weights | ~48–64 GB |
| Full pre-training (FP16) | 4–6x model weights | ~56–96 GB |
The 3–4x multiplier for full fine-tuning comes from storing: (1) the model parameters, (2) a full copy of gradients (same size as parameters), (3) Adam optimizer states (2x parameter size), and (4) forward-pass activations[4].
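That four-part breakdown can be summed directly. A rough sketch under stated assumptions — Adam's two moment estimates are counted here at weight precision (2x weight memory), and the 10 GB activation figure is a placeholder that in reality swings widely with batch size and sequence length:

```python
def full_finetune_gb(params_billion: float, bytes_per_param: float = 2,
                     activations_gb: float = 10.0) -> float:
    """Rough full fine-tuning budget: weights + gradients + Adam + activations."""
    weights = params_billion * bytes_per_param  # (1) model parameters
    grads = weights                             # (2) one gradient per parameter
    optimizer = 2 * weights                     # (3) Adam first + second moments
    return weights + grads + optimizer + activations_gb  # (4) activations

# 7B at FP16: 14 + 14 + 28 + 10 = 66 GB, near the top of the table's range
print(round(full_finetune_gb(7)))  # 66
```

Real training frameworks often keep FP32 master weights and optimizer states, pushing the multiplier even higher, which is why the table's range is a floor rather than a ceiling.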
This is why LoRA (Low-Rank Adaptation) has become the default fine-tuning method for consumer hardware. By freezing the base weights and training only small low-rank adapter matrices (typically adding the equivalent of 1–5% of the model's parameters), LoRA keeps VRAM usage close to inference levels. You can LoRA fine-tune a 7B model on a single RTX 4080 SUPER with 16 GB VRAM — something that would be impossible with full fine-tuning.
Pro Tip
For fine-tuning on consumer GPUs, use QLoRA (quantized LoRA) with tools like Unsloth or Axolotl. QLoRA loads the base model in 4-bit and only trains the LoRA adapters in FP16, cutting VRAM usage by 70%+ compared to full fine-tuning while achieving comparable results.
VRAM for Image Generation (Stable Diffusion, SDXL, Flux)
Image generation models have different memory profiles than LLMs. They load model weights once, then generate images in passes — with VRAM usage spiking during the denoising process.
| Model | Resolution | VRAM Usage | Recommended GPU |
|---|---|---|---|
| Stable Diffusion 1.5 | 512×512 | ~4 GB | Any 6GB GPU |
| SDXL | 1024×1024 | ~8 GB base, ~12 GB with refiner | 12GB+ GPU |
| SDXL + ControlNet | 1024×1024 | ~14–16 GB | RTX 4080 SUPER 16GB |
| Flux | 1024×1024 | ~16 GB minimum, 24 GB comfortable | RTX 4090 24GB |
| Flux + upscaling | 2048×2048 | ~22–28 GB | RTX 5090 32GB |
SDXL occupies 8 GB for the base model alone. Loading the refiner model simultaneously (recommended for maximum quality) pushes usage to 12–16 GB[5]. Flux, the newer generation model from Black Forest Labs, requires about 50% more VRAM than SDXL at the same resolution — making 16 GB the practical minimum and 24 GB the comfortable target.
VRAM for AI Video Generation
Local video generation is the most VRAM-hungry AI workload in 2026. These models process hundreds of frames, and memory usage scales with both resolution and video length.
| Model | VRAM (Standard 49 frames) | Practical Minimum GPU |
|---|---|---|
| CogVideoX 2B | ~8 GB | RTX 3060 12GB |
| Wan 2.1 T2V-1.3B | ~10–12 GB | RTX 3060 12GB |
| Wan 2.1 T2V-14B | ~18 GB (optimized) | RTX 4090 24GB |
| Mochi 1 | ~12 GB (ComfyUI optimized), ~60 GB (standard) | RTX 4080 SUPER 16GB (optimized) |
| HunyuanVideo | ~16 GB (optimized), 24 GB comfortable | RTX 4090 24GB |
Video generation benefits enormously from frameworks like ComfyUI that manage model loading and unloading intelligently. Without these optimizations, many video models expect 48–80 GB of VRAM. With them, you can generate decent video on a 16–24 GB consumer GPU.
Special Case: Apple Silicon Unified Memory
Apple Silicon Macs (M1, M2, M3, M4 series) use unified memory — the CPU and GPU share the same RAM pool. This changes the VRAM equation fundamentally: a Mac Studio M4 Max with 128 GB unified memory can load models that would require a $25,000 enterprise GPU on the NVIDIA side.
The tradeoff is bandwidth. Apple's unified memory bandwidth (400–800 GB/s depending on chip) is lower than HBM3 (3,350 GB/s on the H100), which means token generation is slower. But for loading and running models that simply won't fit on a 24 GB GPU, Apple Silicon offers an accessible path.
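The bandwidth tradeoff can be estimated with a common back-of-envelope rule: single-stream token generation is memory-bandwidth-bound, so every generated token requires reading roughly all of the model weights once. A hedged upper-bound sketch (real throughput lands below this ceiling due to KV cache reads and kernel overhead):

```python
def est_tokens_per_sec(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound decode speed: each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

# Llama 3.1 70B at Q4 (~38 GB of weights, per the lookup table):
print(round(est_tokens_per_sec(38, 800), 1))   # M4 Max-class, ~800 GB/s
print(round(est_tokens_per_sec(38, 3350), 1))  # H100 HBM3, 3,350 GB/s
```

The ratio is the point: the H100's roughly 4x bandwidth advantage translates directly into roughly 4x the ceiling on tokens per second for the same model.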
| Apple Device | Unified Memory | Usable for AI (after OS) | Largest Model (Q4) |
|---|---|---|---|
| Mac Mini M4 Pro | 24 GB | ~18–20 GB | Qwen 32B, Mistral 22B |
| MacBook Pro M4 Max | 36–128 GB | ~28–110 GB | Llama 3.1 70B (at 48 GB+) |
| Mac Studio M4 Max | 64–128 GB | ~50–110 GB | Llama 3.1 70B+ comfortably |
Note
Apple Silicon runs LLMs via Metal/MLX acceleration — not CUDA. Most inference tools (Ollama, llama.cpp, LM Studio) fully support Metal. Training and fine-tuning support is more limited. If your workflow is primarily inference, Apple Silicon is excellent. For training, NVIDIA is still the safer choice.
What GPU for What Task: Buying Recommendations
Here's the practical mapping. Find your use case, get the right GPU.
| Use Case | VRAM Needed | Recommended GPU | Price Range |
|---|---|---|---|
| Small LLMs (3B–8B), SD 1.5 | 6–8 GB | RTX 3060 12GB (used) | $200–$300 |
| Mid-size LLMs (8B–14B), SDXL | 10–16 GB | RTX 4080 SUPER 16GB | $949–$1,099 |
| Large LLMs (32B Q4), Flux, LoRA fine-tuning | 20–24 GB | RTX 4090 24GB or RTX 3090 24GB | $699–$1,999 |
| 70B Q3 models, Flux upscaling, video gen | 28–32 GB | RTX 5090 32GB | $1,999–$2,199 |
| 70B FP16, multi-model serving, full fine-tuning | 80 GB | NVIDIA A100 80GB | $12,000–$15,000 |
| Production inference, 70B+ training | 80 GB (fast) | NVIDIA H100 80GB | $25,000–$33,000 |
| Massive models, research | 128 GB | AMD MI250X 128GB | $8,000–$11,000 |
| Silent home AI, 7B–32B (Q4), Ollama | 24–128 GB unified | Mac Studio M4 Max | $1,999–$4,499 |
| Edge inference, small models | 8 GB shared | Jetson Orin Nano | $199–$249 |
Expert Perspectives on VRAM
Tim Dettmers, researcher at the University of Washington and author of the widely cited Which GPU for Deep Learning guide, has consistently argued that VRAM is the single most important GPU spec for deep learning. In his hardware analysis, he notes that while raw compute (FLOPS) determines training speed, VRAM determines what you can run at all — making it the hard constraint rather than a performance variable[2].
George Hotz, founder of tinygrad and Tiny Corp, has taken this argument further in the consumer space. When evaluating AMD's 24 GB RX 7900 XTX against NVIDIA's 16 GB options, Hotz emphasized that the memory advantage matters more than raw compute for AI workloads: "If workloads need memory, you'd need three RTX 4060 cards to get the same 24 GB" — making the case that VRAM-per-dollar should be the primary buying metric for AI practitioners.
Five Common VRAM Mistakes to Avoid
- Ignoring the KV cache. Your model loaded? Great. Now add 2–8 GB for the conversation context. Always leave a 2–4 GB VRAM buffer above model weight size.
- Running at FP16 when INT4 would suffice. Unless you're doing research that demands maximum precision, Q4_K_M gives you 4x the model capacity for a barely measurable quality difference.
- Confusing system RAM with VRAM. 64 GB of DDR5 system RAM does not help your GPU. The model must fit in GPU VRAM (exception: Apple Silicon unified memory and CPU offloading, both slower).
- Buying 8 GB GPUs for AI. In 2026, 8 GB VRAM limits you to 3B–7B models at INT4 with minimal context. Given that 24 GB cards are available from $699 (used RTX 3090), 8 GB is false economy for AI.
- Sizing for inference when you plan to fine-tune. Fine-tuning uses 2–4x more memory than inference. If training is on your roadmap, size your GPU for training — not just inference.
The Decision Tree: How Much VRAM Do You Need?
Answer these three questions:
- What models will you run? Check the lookup table. Find the model size and quantization that matches your quality needs.
- What context length do you need? Add the KV cache overhead for your target context window to the model weight requirement.
- Will you fine-tune or just run inference? If fine-tuning, multiply your inference VRAM by 1.5x (LoRA) or 3–4x (full).
Sum those numbers. Then buy the GPU with the next tier up in VRAM — you'll always find a use for the headroom.
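The three questions reduce to one sum. A minimal sketch, with illustrative multipliers (1.75x as the midpoint of the LoRA range, 3.5x for full fine-tuning; function and parameter names are our own):

```python
def required_vram_gb(params_billion: float, bytes_per_param: float,
                     kv_cache_gb: float, mode: str = "inference") -> float:
    """Sum the decision tree: weights (with load overhead), training
    multiplier, and KV cache for the target context window."""
    weights = params_billion * bytes_per_param * 1.2  # load overhead
    multiplier = {"inference": 1.0, "lora": 1.75, "full": 3.5}[mode]
    return weights * multiplier + kv_cache_gb

# 8B model at Q4 with an 8K context, inference only:
print(round(required_vram_gb(8, 0.5, 0.8), 1))  # ~5.6 GB -> a 12 GB card is plenty
```

Then apply the headroom rule: buy the next VRAM tier above whatever this returns.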
Our Recommendations by Budget
- Under $300: Used RTX 3060 12GB. Runs 7B–8B models in INT4. Enough for learning and lightweight inference.
- $700–$1,000: Used RTX 3090 24GB. The best VRAM-per-dollar in the market. Runs 32B models, handles SDXL and Flux, enables LoRA fine-tuning of 7B–13B models.
- $1,000–$1,100: RTX 4080 SUPER 16GB. Best for 7B–14B inference with excellent power efficiency. Not enough VRAM for 32B+ models.
- $1,500–$2,000: RTX 4090 24GB. The proven workhorse. 24 GB handles 32B Q4, Flux, and LoRA fine-tuning of larger models. Strong used-market availability.
- $2,000–$2,200: RTX 5090 32GB. The new consumer king. 32 GB opens up 70B models at aggressive quantization and full-resolution Flux without memory pressure. 78% more bandwidth than the 4090.
- $2,000–$4,500 (silent): Mac Studio M4 Max. Up to 128 GB unified memory for the largest open-source models. Zero noise. No driver hassles. Best for inference-focused workflows.
The Bottom Line
In 2026, 24 GB is the new baseline for serious AI work. It runs 90% of the models most practitioners need. 32 GB (the new RTX 5090) future-proofs you for the next generation of 20B–70B models. And if you need 70B+ at full precision or plan to fine-tune large models, you're looking at 80 GB enterprise cards or multi-GPU setups.
The single best value in the market right now is a used RTX 3090 for $700–$900 — it gives you 24 GB of VRAM at a price point that makes VRAM-per-dollar math absurd. If you're buying new, the RTX 5090 at its $1,999 list price offers the best VRAM-per-dollar of any current-gen card, at roughly $62/GB.
Don't overthink it. Check the lookup table, find the model you want to run, and buy the GPU that covers it with 20% headroom. VRAM is the constraint that matters — everything else is optimization.
Further Reading
- Best GPU for AI in 2026: Complete Buyer's Guide
- RTX 5090 vs RTX 4090 for AI: Is the Upgrade Worth It?
- How to Run LLMs Locally: Complete Beginner's Guide
- How Much Does an AI Workstation Cost in 2026?
Sources
- LocalLLM.in, "Ollama VRAM Requirements: Complete Guide to GPU Memory for Local LLMs" (2026). Measured VRAM usage for popular models across quantization levels. localllm.in
- Tim Dettmers, "Which GPU(s) to Get for Deep Learning: My Experience and Advice" (2023). University of Washington. Comprehensive GPU hardware analysis including VRAM prioritization for deep learning. timdettmers.com
- Lyx, "Context Kills VRAM: How to Run LLMs on Consumer GPUs," Medium (2024). Analysis of KV cache scaling and context length impact on memory. medium.com
- Frank Denneman, "Training vs Inference — Memory Consumption by Neural Networks" (2022). Detailed breakdown of optimizer state, gradient, and activation memory during training. frankdenneman.nl
- Prompting Pixels, "System Requirements for Stable Diffusion" (2024). Benchmarked VRAM consumption for SD 1.5, SDXL, and Flux across GPU tiers. medium.com