Guide · 14 min read

How Much VRAM Do You Need for AI in 2026?

A practical guide to GPU memory requirements for every AI workload — LLM inference, training, image generation, and video. Includes a complete VRAM lookup table by model and quantization level, plus hardware recommendations.


Compute Market Team

Our Top Pick

NVIDIA GeForce RTX 4090

$1,599 – $1,999

24GB GDDR6X | 16,384 CUDA cores | 1,008 GB/s memory bandwidth


VRAM Is the Bottleneck. Here's How Much You Actually Need.

Every AI workload lives or dies by one spec: VRAM (Video Random Access Memory) — the dedicated memory on your GPU. It determines which models you can load, how fast they run, and whether you can train or just run inference. Get the VRAM wrong, and your $2,000 GPU becomes a space heater that throws out-of-memory errors.

This guide gives you the exact numbers. We'll cover VRAM requirements for LLM inference, fine-tuning, image generation, and video — then map those requirements to real GPUs you can buy today.

Note

All VRAM numbers in this guide are measured values at default context lengths (typically 2,048–4,096 tokens). Extending context to 8K, 32K, or 128K tokens increases memory usage significantly — we cover that in the KV cache section below.

The Quick Formula

Before the tables, understand the math behind every number. The baseline VRAM required to load any model is:

VRAM (GB) = Parameters (billions) × Bytes per Parameter × 1.2 overhead

The "bytes per parameter" depends on precision:

| Precision | Bytes per Parameter | Relative Size | Quality Impact |
| --- | --- | --- | --- |
| FP32 (full) | 4 bytes | 100% (baseline) | Maximum precision — rarely used for inference |
| FP16 / BF16 | 2 bytes | 50% | Standard inference precision — negligible quality loss |
| INT8 (Q8) | 1 byte | 25% | Near-lossless — most benchmarks within 1% of FP16 |
| INT4 (Q4_K_M) | 0.5 bytes | 12.5% | 95–98% quality retention — the sweet spot for consumer GPUs |

The 1.2x overhead accounts for runtime buffers, CUDA kernels, and a minimal KV cache. Real-world usage typically falls within 1.1x–1.5x depending on context length and batch size.

For example: a 7B parameter model at INT4 quantization requires roughly 7 × 0.5 × 1.2 = 4.2 GB. In practice, tools like Ollama report ~5 GB for Llama 3.1 8B at Q4_K_M — right in line with the formula[1].
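The formula is simple enough to keep as a few lines of Python. The helper below is a minimal sketch (the function name and defaults are ours, not from any library), using the 1.2x overhead described above:

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """Baseline VRAM (GB) to load a model: parameters x bytes/param x runtime overhead."""
    return params_billions * bytes_per_param * overhead

# The article's worked example: 7B parameters at INT4 (0.5 bytes/param)
print(round(estimate_vram_gb(7, 0.5), 1))   # 4.2
# Llama 3.1 8B at Q4_K_M lands near Ollama's reported ~5 GB
print(round(estimate_vram_gb(8, 0.5), 1))   # 4.8
```

Swap in 2.0 for FP16 or 1.0 for INT8 to reproduce the other columns of the table below; real tools will vary by a gigabyte or so depending on context length and runtime.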

The Master VRAM Lookup Table

This is the table to bookmark. It covers the most popular open-source models at three quantization levels, with the 1.2x overhead already factored in.

| Model | Parameters | FP16 VRAM | INT8 (Q8) VRAM | INT4 (Q4_K_M) VRAM | Min. GPU |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8B | ~16 GB | ~8.5 GB | ~5 GB | RTX 3060 12GB |
| Mistral 7B | 7.2B | ~15 GB | ~8 GB | ~4.5 GB | RTX 3060 12GB |
| DeepSeek-R1 14B | 14B | ~29 GB | ~15 GB | ~9 GB | RTX 3060 12GB (INT4) |
| Qwen 2.5 32B | 32B | ~66 GB | ~34 GB | ~20 GB | RTX 4090 24GB (INT4) |
| Llama 3.1 70B | 70B | ~140 GB | ~75 GB | ~38 GB | A100 80GB (INT8) or 2×RTX 4090 (INT4) |
| Llama 3.1 405B | 405B | ~810 GB | ~420 GB | ~220 GB | Multi-GPU cluster |
| Mixtral 8x7B | 46.7B (MoE) | ~93 GB | ~49 GB | ~26 GB | RTX 5090 32GB (INT4) |
| Phi-3 Mini 3.8B | 3.8B | ~8 GB | ~4.5 GB | ~2.8 GB | Any 4GB+ GPU |
| Gemma 2 9B | 9B | ~18 GB | ~10 GB | ~6 GB | RTX 3060 12GB |

How to read this table: Find your model, pick the quantization level that fits your GPU's VRAM, and ensure you have at least that amount of free memory. If you're between sizes, the INT4 (Q4_K_M) column is where most people land — it's the best tradeoff of quality vs. memory.

Pro Tip

Q4_K_M quantization retains 95–98% of full-precision model quality while using roughly 75% less VRAM. For most chat, coding, and creative writing tasks, users cannot distinguish Q4_K_M outputs from FP16[2]. This is the quantization level to default to on consumer hardware.

The Hidden VRAM Cost: KV Cache and Context Length

The lookup table above shows VRAM for loading the model weights. But there's a second cost that catches people off guard: the KV (Key-Value) cache — the memory the model uses to "remember" your conversation.

The KV cache grows linearly with context length. The longer your prompt or conversation, the more VRAM it consumes. For a 7B–8B model, each 1,000 tokens of context adds approximately 0.1–0.2 GB of VRAM. That sounds small, but it adds up fast at longer context windows:

| Context Length | KV Cache Overhead (8B model) | KV Cache Overhead (70B model) |
| --- | --- | --- |
| 2,048 tokens | ~0.2 GB | ~1.6 GB |
| 4,096 tokens | ~0.4 GB | ~3.3 GB |
| 8,192 tokens | ~0.8 GB | ~6.5 GB |
| 32,768 tokens | ~3.2 GB | ~26 GB |
| 128,000 tokens | ~12 GB | ~100 GB+ |

At 128K context, the KV cache for a 70B model can exceed the weight memory itself — which is why running long-context models locally requires serious hardware[3].
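Because the growth is linear, a rough budget is easy to compute. The sketch below (function name and per-1k rates are our reading of the table above, not exact values for any specific model) shows the arithmetic:

```python
def kv_cache_gb(context_tokens: int, gb_per_1k_tokens: float) -> float:
    """KV cache grows linearly with context; the rate depends on model size and cache precision."""
    return context_tokens / 1000 * gb_per_1k_tokens

# Approximate rates read off the table: ~0.1 GB per 1k tokens for an 8B model,
# ~0.8 GB per 1k tokens for a 70B model
print(round(kv_cache_gb(32_768, 0.1), 1))   # 3.3  (8B model at 32K context)
print(round(kv_cache_gb(32_768, 0.8), 1))   # 26.2 (70B model at 32K context)
```

Add this number to the weight VRAM from the lookup table to get your real memory budget at a given context window.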

Warning

If you're running out of VRAM mid-conversation but the model loaded fine initially, the KV cache is the culprit. Reduce the context length (set the num_ctx parameter to 2048 in Ollama, or pass -c 2048 to llama.cpp) to reclaim memory.

Inference vs. Training: Completely Different VRAM Budgets

Running a model (inference) and training a model are two very different workloads. Training requires dramatically more memory because of optimizer states, gradients, and activations stored during backpropagation.

| Workload | VRAM Multiplier vs. Model Size | Example: 7B Model at FP16 |
| --- | --- | --- |
| Inference | 1.2x model weights | ~16 GB |
| LoRA fine-tuning (INT4) | 1.5–2x quantized weights | ~8–10 GB |
| Full fine-tuning (FP16) | 3–4x model weights | ~48–64 GB |
| Full pre-training (FP16) | 4–6x model weights | ~56–96 GB |

The 3–4x multiplier for full fine-tuning comes from storing: (1) the model parameters, (2) a full copy of gradients (same size as parameters), (3) Adam optimizer states (2x parameter size), and (4) forward-pass activations[4].

This is why LoRA (Low-Rank Adaptation) has become the default fine-tuning method for consumer hardware. By only training a small subset of parameters (typically 1–5%), LoRA keeps VRAM usage close to inference levels. You can LoRA fine-tune a 7B model on a single RTX 4080 SUPER with 16 GB VRAM — something that would be impossible with full fine-tuning.
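The four-component breakdown above can be checked with simple arithmetic. This sketch (our own simplification — in practice Adam's moments are often kept in FP32, pushing the total higher) reproduces the 3–4x rule:

```python
def full_finetune_gb(params_billions: float, bytes_per_param: float = 2.0,
                     activation_gb: float = 0.0) -> float:
    """Full fine-tuning memory: weights + gradients + Adam states + activations."""
    weights = params_billions * bytes_per_param
    gradients = weights        # one gradient per parameter, same precision as weights
    optimizer = 2 * weights    # Adam keeps two moment estimates per parameter
    return weights + gradients + optimizer + activation_gb

# 7B at FP16: 14 + 14 + 28 = 56 GB before activations — squarely in the 48-64 GB range
print(full_finetune_gb(7))   # 56.0
```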

Pro Tip

For fine-tuning on consumer GPUs, use QLoRA (quantized LoRA) with tools like Unsloth or Axolotl. QLoRA loads the base model in 4-bit and only trains the LoRA adapters in FP16, cutting VRAM usage by 70%+ compared to full fine-tuning while achieving comparable results.
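As a sketch of what that setup looks like in code — assuming the Hugging Face transformers, peft, and bitsandbytes libraries, with an illustrative model name and LoRA hyperparameters you would tune for your task — a minimal QLoRA configuration might read:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Base model weights loaded in 4-bit NF4; compute runs in BF16 for stability
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",   # illustrative; any causal LM repo works
    quantization_config=bnb,
    device_map="auto",
)

# Only the low-rank adapter matrices are trained, in full precision
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the base model
```

Tools like Unsloth and Axolotl wrap this same pattern with additional memory and speed optimizations.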

VRAM for Image Generation (Stable Diffusion, SDXL, Flux)

Image generation models have different memory profiles than LLMs. They load model weights once, then generate images in passes — with VRAM usage spiking during the denoising process.

| Model | Resolution | VRAM Usage | Recommended GPU |
| --- | --- | --- | --- |
| Stable Diffusion 1.5 | 512×512 | ~4 GB | Any 6GB GPU |
| SDXL | 1024×1024 | ~8 GB base, ~12 GB with refiner | 12GB+ GPU |
| SDXL + ControlNet | 1024×1024 | ~14–16 GB | RTX 4080 SUPER 16GB |
| Flux | 1024×1024 | ~16 GB minimum, 24 GB comfortable | RTX 4090 24GB |
| Flux + upscaling | 2048×2048 | ~22–28 GB | RTX 5090 32GB |

SDXL occupies 8 GB for the base model alone. Loading the refiner model simultaneously (recommended for maximum quality) pushes usage to 12–16 GB[5]. Flux, the newer generation model from Black Forest Labs, requires about 50% more VRAM than SDXL at the same resolution — making 16 GB the practical minimum and 24 GB the comfortable target.

VRAM for AI Video Generation

Local video generation is the most VRAM-hungry AI workload in 2026. These models process hundreds of frames, and memory usage scales with both resolution and video length.

| Model | VRAM (Standard 49 frames) | Practical Minimum GPU |
| --- | --- | --- |
| CogVideoX 2B | ~8 GB | RTX 3060 12GB |
| Wan 2.1 T2V-1.3B | ~10–12 GB | RTX 3060 12GB |
| Wan 2.1 T2V-14B | ~18 GB (optimized) | RTX 4090 24GB |
| Mochi 1 | ~12 GB (ComfyUI optimized), ~60 GB (standard) | RTX 4080 SUPER 16GB (optimized) |
| HunyuanVideo | ~16 GB (optimized), 24 GB comfortable | RTX 4090 24GB |

Video generation benefits enormously from frameworks like ComfyUI that manage model loading and unloading intelligently. Without these optimizations, many video models expect 48–80 GB of VRAM. With them, you can generate decent video on a 16–24 GB consumer GPU.

Special Case: Apple Silicon Unified Memory

Apple Silicon Macs (M1, M2, M3, M4 series) use unified memory — the CPU and GPU share the same RAM pool. This changes the VRAM equation fundamentally: a Mac Studio M4 Max with 128 GB unified memory can load models that would require a $25,000 enterprise GPU on the NVIDIA side.

The tradeoff is bandwidth. Apple's unified memory bandwidth (400–800 GB/s depending on chip) is lower than HBM3 (3,350 GB/s on the H100), which means token generation is slower. But for loading and running models that simply won't fit on a 24 GB GPU, Apple Silicon offers an accessible path.

| Apple Device | Unified Memory | Usable for AI (after OS) | Largest Model (Q4) |
| --- | --- | --- | --- |
| Mac Mini M4 Pro | 24 GB | ~18–20 GB | Qwen 32B, Mistral 22B |
| MacBook Pro M4 Max | 36–128 GB | ~28–110 GB | Llama 3.1 70B (at 48 GB+) |
| Mac Studio M4 Max | 64–128 GB | ~50–110 GB | Llama 3.1 70B+ comfortably |

Note

Apple Silicon runs LLMs via Metal/MLX acceleration — not CUDA. Most inference tools (Ollama, llama.cpp, LM Studio) fully support Metal. Training and fine-tuning support is more limited. If your workflow is primarily inference, Apple Silicon is excellent. For training, NVIDIA is still the safer choice.

What GPU for What Task: Buying Recommendations

Here's the practical mapping. Find your use case, get the right GPU.

| Use Case | VRAM Needed | Recommended GPU | Price Range |
| --- | --- | --- | --- |
| Small LLMs (3B–8B), SD 1.5 | 6–8 GB | RTX 3060 12GB (used) | $200–$300 |
| Mid-size LLMs (8B–14B), SDXL | 10–16 GB | RTX 4080 SUPER 16GB | $949–$1,099 |
| Large LLMs (32B Q4), Flux, LoRA fine-tuning | 20–24 GB | RTX 4090 24GB or RTX 3090 24GB | $699–$1,999 |
| 70B Q3 models, Flux upscaling, video gen | 28–32 GB | RTX 5090 32GB | $1,999–$2,199 |
| 70B FP16, multi-model serving, full fine-tuning | 80 GB | NVIDIA A100 80GB | $12,000–$15,000 |
| Production inference, 70B+ training | 80 GB (fast) | NVIDIA H100 80GB | $25,000–$33,000 |
| Massive models, research | 128 GB | AMD MI250X 128GB | $8,000–$11,000 |
| Silent home AI, 7B–32B (Q4), Ollama | 24–128 GB unified | Mac Studio M4 Max | $1,999–$4,499 |
| Edge inference, small models | 8 GB shared | Jetson Orin Nano | $199–$249 |

Expert Perspectives on VRAM

Tim Dettmers, researcher at the University of Washington and author of the widely cited Which GPU for Deep Learning guide, has consistently argued that VRAM is the single most important GPU spec for deep learning. In his hardware analysis, he notes that while raw compute (FLOPS) determines training speed, VRAM determines what you can run at all — making it the hard constraint rather than a performance variable[2].

George Hotz, founder of tinygrad and Tiny Corp, has taken this argument further in the consumer space. When evaluating AMD's 24 GB RX 7900 XTX against NVIDIA's 16 GB options, Hotz emphasized that the memory advantage matters more than raw compute for AI workloads: "If workloads need memory, you'd need three RTX 4060 cards to get the same 24 GB" — making the case that VRAM-per-dollar should be the primary buying metric for AI practitioners.

Five Common VRAM Mistakes to Avoid

  1. Ignoring the KV cache. Your model loaded? Great. Now add 2–8 GB for the conversation context. Always leave a 2–4 GB VRAM buffer above model weight size.
  2. Running at FP16 when INT4 would suffice. Unless you're doing research that demands maximum precision, Q4_K_M gives you 4x the model capacity for a barely measurable quality difference.
  3. Confusing system RAM with VRAM. 64 GB of DDR5 system RAM does not help your GPU. The model must fit in GPU VRAM (exception: Apple Silicon unified memory and CPU offloading, both slower).
  4. Buying 8 GB GPUs for AI. In 2026, 8 GB VRAM limits you to 3B–7B models at INT4 with minimal context. Given that 24 GB cards are available from $699 (used RTX 3090), 8 GB is false economy for AI.
  5. Sizing for inference when you plan to fine-tune. Fine-tuning uses 2–4x more memory than inference. If training is on your roadmap, size your GPU for training — not just inference.

The Decision Tree: How Much VRAM Do You Need?

Answer these three questions:

  1. What models will you run? Check the lookup table. Find the model size and quantization that matches your quality needs.
  2. What context length do you need? Add the KV cache overhead for your target context window to the model weight requirement.
  3. Will you fine-tune or just run inference? If fine-tuning, multiply your inference VRAM by 1.5x (LoRA) or 3–4x (full).

Sum those numbers. Then buy the GPU with the next tier up in VRAM — you'll always find a use for the headroom.
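The three questions reduce to one sum. The sketch below (function name, the per-1k KV rates, and the 3.5x midpoint for full fine-tuning are our simplifications of the figures in this guide) puts the whole sizing exercise in one place:

```python
def required_vram_gb(params_billions: float, bytes_per_param: float,
                     context_tokens: int, kv_gb_per_1k: float,
                     training: str = "none") -> float:
    """Total VRAM budget: weights (with 1.2x overhead) x training multiplier + KV cache."""
    weights = params_billions * bytes_per_param * 1.2
    kv = context_tokens / 1000 * kv_gb_per_1k
    multiplier = {"none": 1.0, "lora": 1.5, "full": 3.5}[training]
    return weights * multiplier + kv

# 8B model at Q4_K_M, 8K context, inference only (~0.1 GB KV per 1k tokens):
print(round(required_vram_gb(8, 0.5, 8192, 0.1), 1))   # 5.6 -> a 12 GB card is plenty
```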

Our Recommendations by Budget

  • Under $300: Used RTX 3060 12GB. Runs 7B–8B models in INT4. Enough for learning and lightweight inference.
  • $700–$1,000: Used RTX 3090 24GB. The best VRAM-per-dollar in the market. Runs 32B models, handles SDXL and Flux, enables LoRA fine-tuning of 7B–13B models.
  • $1,000–$1,100: RTX 4080 SUPER 16GB. Best for 7B–14B inference with excellent power efficiency. Not enough VRAM for 32B+ models.
  • $1,500–$2,000: RTX 4090 24GB. The proven workhorse. 24 GB handles 32B Q4, Flux, and LoRA fine-tuning of larger models. Strong used-market availability.
  • $2,000–$2,200: RTX 5090 32GB. The new consumer king. 32 GB opens up 70B models at aggressive quantization and full-resolution Flux without memory pressure. 78% more bandwidth than the 4090.
  • $2,000–$4,500 (silent): Mac Studio M4 Max. Up to 128 GB unified memory for the largest open-source models. Zero noise. No driver hassles. Best for inference-focused workflows.

The Bottom Line

In 2026, 24 GB is the new baseline for serious AI work. It runs 90% of the models most practitioners need. 32 GB (the new RTX 5090) future-proofs you for the next generation of 20B–70B models. And if you need 70B+ at full precision or plan to fine-tune large models, you're looking at 80 GB enterprise cards or multi-GPU setups.

The single best value in the market right now is a used RTX 3090 for $700–$900 — it delivers 24 GB of VRAM at a price that makes its VRAM-per-dollar hard to beat. If you're buying new, the RTX 5090 offers the best VRAM-per-dollar of any current-gen card, at roughly $62–69 per GB across its $1,999–$2,199 price range.

Don't overthink it. Check the lookup table, find the model you want to run, and buy the GPU that covers it with 20% headroom. VRAM is the constraint that matters — everything else is optimization.


Sources

  1. LocalLLM.in, "Ollama VRAM Requirements: Complete Guide to GPU Memory for Local LLMs" (2026). Measured VRAM usage for popular models across quantization levels. localllm.in
  2. Tim Dettmers, "Which GPU(s) to Get for Deep Learning: My Experience and Advice" (2023). University of Washington. Comprehensive GPU hardware analysis including VRAM prioritization for deep learning. timdettmers.com
  3. Lyx, "Context Kills VRAM: How to Run LLMs on Consumer GPUs," Medium (2024). Analysis of KV cache scaling and context length impact on memory. medium.com
  4. Frank Denneman, "Training vs Inference — Memory Consumption by Neural Networks" (2022). Detailed breakdown of optimizer state, gradient, and activation memory during training. frankdenneman.nl
  5. Prompting Pixels, "System Requirements for Stable Diffusion" (2024). Benchmarked VRAM consumption for SD 1.5, SDXL, and Flux across GPU tiers. medium.com