Guide · 14 min read

How Much VRAM Do You Need for AI in 2026?

A practical guide to GPU memory requirements for every AI workload — LLM inference, training, image generation, and video. Includes a complete VRAM lookup table by model and quantization level, plus hardware recommendations.


Compute Market Team

Our Top Pick

NVIDIA GeForce RTX 4090

$1,599 – $1,999

24GB GDDR6X | 16,384 CUDA cores | 1,008 GB/s memory bandwidth


VRAM Is the Bottleneck. Here's How Much You Actually Need.

Every AI workload lives or dies by one spec: VRAM (Video Random Access Memory) — the dedicated memory on your GPU. It determines which models you can load, how fast they run, and whether you can train or just run inference. Get the VRAM wrong, and your $2,000 GPU becomes a space heater that throws out-of-memory errors.

This guide gives you the exact numbers. We'll cover VRAM requirements for LLM inference, fine-tuning, image generation, and video — then map those requirements to real GPUs you can buy today.

Note

All VRAM numbers in this guide are measured values at default context lengths (typically 2,048–4,096 tokens). Extending context to 8K, 32K, or 128K tokens increases memory usage significantly — we cover that in the KV cache section below.

The Quick Formula

Before the tables, understand the math behind every number. The baseline VRAM required to load any model is:

VRAM (GB) = Parameters (billions) × Bytes per Parameter × 1.2 overhead

The "bytes per parameter" depends on precision:

| Precision | Bytes per Parameter | Relative Size | Quality Impact |
| --- | --- | --- | --- |
| FP32 (full) | 4 bytes | 100% (baseline) | Maximum precision — rarely used for inference |
| FP16 / BF16 | 2 bytes | 50% | Standard inference precision — negligible quality loss |
| INT8 (Q8) | 1 byte | 25% | Near-lossless — most benchmarks within 1% of FP16 |
| INT4 (Q4_K_M) | 0.5 bytes | 12.5% | 95–98% quality retention — the sweet spot for consumer GPUs |

The 1.2x overhead accounts for runtime buffers, CUDA kernels, and a minimal KV cache. Real-world usage typically falls within 1.1x–1.5x depending on context length and batch size.

For example: a 7B parameter model at INT4 quantization requires roughly 7 × 0.5 × 1.2 = 4.2 GB. In practice, tools like Ollama report ~5 GB for Llama 3.1 8B at Q4_K_M — right in line with the formula[1].
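The formula is simple enough to keep as a few lines of Python. The helper below is a minimal sketch (the function name and defaults are ours, not from any library), using the 1.2x overhead described above:

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """Baseline VRAM (GB) to load a model: parameters x bytes/param x runtime overhead."""
    return params_billions * bytes_per_param * overhead

# The article's worked example: 7B parameters at INT4 (0.5 bytes/param)
print(round(estimate_vram_gb(7, 0.5), 1))   # 4.2
# Llama 3.1 8B at Q4_K_M lands near Ollama's reported ~5 GB
print(round(estimate_vram_gb(8, 0.5), 1))   # 4.8
```

Swap in 2.0 for FP16 or 1.0 for INT8 to reproduce the other columns of the table below; real tools will vary by a gigabyte or so depending on context length and runtime.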

The Master VRAM Lookup Table

This is the table to bookmark. It covers the most popular open-source models at three quantization levels, with the 1.2x overhead already factored in.

| Model | Parameters | FP16 VRAM | INT8 (Q8) VRAM | INT4 (Q4_K_M) VRAM | Min. GPU |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8B | ~16 GB | ~8.5 GB | ~5 GB | RTX 3060 12GB |
| Mistral 7B | 7.2B | ~15 GB | ~8 GB | ~4.5 GB | RTX 3060 12GB |
| DeepSeek-R1 14B | 14B | ~29 GB | ~15 GB | ~9 GB | RTX 3060 12GB (INT4) |
| Qwen 2.5 32B | 32B | ~66 GB | ~34 GB | ~20 GB | RTX 4090 24GB (INT4) |
| Llama 3.1 70B | 70B | ~140 GB | ~75 GB | ~38 GB | A100 80GB (INT8) or 2×RTX 4090 (INT4) |
| Llama 3.1 405B | 405B | ~810 GB | ~420 GB | ~220 GB | Multi-GPU cluster |
| Mixtral 8x7B | 46.7B (MoE) | ~93 GB | ~49 GB | ~26 GB | RTX 5090 32GB (INT4) |
| Phi-3 Mini 3.8B | 3.8B | ~8 GB | ~4.5 GB | ~2.8 GB | Any 4GB+ GPU |
| Gemma 2 9B | 9B | ~18 GB | ~10 GB | ~6 GB | RTX 3060 12GB |

How to read this table: Find your model, pick the quantization level that fits your GPU's VRAM, and ensure you have at least that amount of free memory. If you're between sizes, the INT4 (Q4_K_M) column is where most people land — it's the best tradeoff of quality vs. memory.

Pro Tip

Q4_K_M quantization retains 95–98% of full-precision model quality while using roughly 75% less VRAM. For most chat, coding, and creative writing tasks, users cannot distinguish Q4_K_M outputs from FP16[2]. This is the quantization level to default to on consumer hardware.

The Hidden VRAM Cost: KV Cache and Context Length

The lookup table above shows VRAM for loading the model weights. But there's a second cost that catches people off guard: the KV (Key-Value) cache — the memory the model uses to "remember" your conversation.

The KV cache grows linearly with context length. The longer your prompt or conversation, the more VRAM it consumes. For a 7B–8B model, each 1,000 tokens of context adds approximately 0.1–0.2 GB of VRAM. That sounds small, but it adds up fast at longer context windows:

| Context Length | KV Cache Overhead (8B model) | KV Cache Overhead (70B model) |
| --- | --- | --- |
| 2,048 tokens | ~0.2 GB | ~1.6 GB |
| 4,096 tokens | ~0.4 GB | ~3.3 GB |
| 8,192 tokens | ~0.8 GB | ~6.5 GB |
| 32,768 tokens | ~3.2 GB | ~26 GB |
| 128,000 tokens | ~12 GB | ~100 GB+ |

At 128K context, the KV cache for a 70B model can exceed the weight memory itself — which is why running long-context models locally requires serious hardware[3].
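Because the growth is linear, a rough budget is easy to compute. The sketch below (function name and per-1k rates are our reading of the table above, not exact values for any specific model) shows the arithmetic:

```python
def kv_cache_gb(context_tokens: int, gb_per_1k_tokens: float) -> float:
    """KV cache grows linearly with context; the rate depends on model size and cache precision."""
    return context_tokens / 1000 * gb_per_1k_tokens

# Approximate rates read off the table: ~0.1 GB per 1k tokens for an 8B model,
# ~0.8 GB per 1k tokens for a 70B model
print(round(kv_cache_gb(32_768, 0.1), 1))   # 3.3  (8B model at 32K context)
print(round(kv_cache_gb(32_768, 0.8), 1))   # 26.2 (70B model at 32K context)
```

Add this number to the weight VRAM from the lookup table to get your real memory budget at a given context window.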

Warning

If you're running out of VRAM mid-conversation but the model loaded fine initially, the KV cache is the culprit. Reduce the context length (set the num_ctx parameter to 2048 in Ollama, or pass -c 2048 to llama.cpp) to reclaim memory.

Inference vs. Training: Completely Different VRAM Budgets

Running a model (inference) and training a model are two very different workloads. Training requires dramatically more memory because of optimizer states, gradients, and activations stored during backpropagation.

| Workload | VRAM Multiplier vs. Model Size | Example: 7B Model at FP16 |
| --- | --- | --- |
| Inference | 1.2x model weights | ~16 GB |
| LoRA fine-tuning (INT4) | 1.5–2x quantized weights | ~8–10 GB |
| Full fine-tuning (FP16) | 3–4x model weights | ~48–64 GB |
| Full pre-training (FP16) | 4–6x model weights | ~56–96 GB |

The 3–4x multiplier for full fine-tuning comes from storing: (1) the model parameters, (2) a full copy of gradients (same size as parameters), (3) Adam optimizer states (2x parameter size), and (4) forward-pass activations[4].

This is why LoRA (Low-Rank Adaptation) has become the default fine-tuning method for consumer hardware. By only training a small subset of parameters (typically 1–5%), LoRA keeps VRAM usage close to inference levels. You can LoRA fine-tune a 7B model on a single RTX 4080 SUPER with 16 GB VRAM — something that would be impossible with full fine-tuning.
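The four-component breakdown above can be checked with simple arithmetic. This sketch (our own simplification — in practice Adam's moments are often kept in FP32, pushing the total higher) reproduces the 3–4x rule:

```python
def full_finetune_gb(params_billions: float, bytes_per_param: float = 2.0,
                     activation_gb: float = 0.0) -> float:
    """Full fine-tuning memory: weights + gradients + Adam states + activations."""
    weights = params_billions * bytes_per_param
    gradients = weights        # one gradient per parameter, same precision as weights
    optimizer = 2 * weights    # Adam keeps two moment estimates per parameter
    return weights + gradients + optimizer + activation_gb

# 7B at FP16: 14 + 14 + 28 = 56 GB before activations — squarely in the 48-64 GB range
print(full_finetune_gb(7))   # 56.0
```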

Pro Tip

For fine-tuning on consumer GPUs, use QLoRA (quantized LoRA) with tools like Unsloth or Axolotl. QLoRA loads the base model in 4-bit and only trains the LoRA adapters in FP16, cutting VRAM usage by 70%+ compared to full fine-tuning while achieving comparable results.
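As a sketch of what that setup looks like in code — assuming the Hugging Face transformers, peft, and bitsandbytes libraries, with an illustrative model name and LoRA hyperparameters you would tune for your task — a minimal QLoRA configuration might read:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Base model weights loaded in 4-bit NF4; compute runs in BF16 for stability
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",   # illustrative; any causal LM repo works
    quantization_config=bnb,
    device_map="auto",
)

# Only the low-rank adapter matrices are trained, in full precision
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the base model
```

Tools like Unsloth and Axolotl wrap this same pattern with additional memory and speed optimizations.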

VRAM for Image Generation (Stable Diffusion, SDXL, Flux)

Image generation models have different memory profiles than LLMs. They load model weights once, then generate images in passes — with VRAM usage spiking during the denoising process.

| Model | Resolution | VRAM Usage | Recommended GPU |
| --- | --- | --- | --- |
| Stable Diffusion 1.5 | 512×512 | ~4 GB | Any 6GB GPU |
| SDXL | 1024×1024 | ~8 GB base, ~12 GB with refiner | 12GB+ GPU |
| SDXL + ControlNet | 1024×1024 | ~14–16 GB | RTX 4080 SUPER 16GB |
| Flux | 1024×1024 | ~16 GB minimum, 24 GB comfortable | RTX 4090 24GB |
| Flux + upscaling | 2048×2048 | ~22–28 GB | RTX 5090 32GB |

SDXL occupies 8 GB for the base model alone. Loading the refiner model simultaneously (recommended for maximum quality) pushes usage to 12–16 GB[5]. Flux, the newer generation model from Black Forest Labs, requires about 50% more VRAM than SDXL at the same resolution — making 16 GB the practical minimum and 24 GB the comfortable target.

VRAM for AI Video Generation

Local video generation is the most VRAM-hungry AI workload in 2026. These models process hundreds of frames, and memory usage scales with both resolution and video length.

| Model | VRAM (Standard 49 frames) | Practical Minimum GPU |
| --- | --- | --- |
| CogVideoX 2B | ~8 GB | RTX 3060 12GB |
| Wan 2.1 T2V-1.3B | ~10–12 GB | RTX 3060 12GB |
| Wan 2.1 T2V-14B | ~18 GB (optimized) | RTX 4090 24GB |
| Mochi 1 | ~12 GB (ComfyUI optimized), ~60 GB (standard) | RTX 4080 SUPER 16GB (optimized) |
| HunyuanVideo | ~16 GB (optimized), 24 GB comfortable | RTX 4090 24GB |

Video generation benefits enormously from frameworks like ComfyUI that manage model loading and unloading intelligently. Without these optimizations, many video models expect 48–80 GB of VRAM. With them, you can generate decent video on a 16–24 GB consumer GPU.

Special Case: Apple Silicon Unified Memory

Apple Silicon Macs (M1, M2, M3, M4 series) use unified memory — the CPU and GPU share the same RAM pool. This changes the VRAM equation fundamentally: a Mac Studio M4 Max with 128 GB unified memory can load models that would require a $25,000 enterprise GPU on the NVIDIA side.

The tradeoff is bandwidth. Apple's unified memory bandwidth (400–800 GB/s depending on chip) is lower than HBM3 (3,350 GB/s on the H100), which means token generation is slower. But for loading and running models that simply won't fit on a 24 GB GPU, Apple Silicon offers an accessible path.

| Apple Device | Unified Memory | Usable for AI (after OS) | Largest Model (Q4) |
| --- | --- | --- | --- |
| Mac Mini M4 Pro | 24 GB | ~18–20 GB | Qwen 32B, Mistral 22B |
| MacBook Pro M4 Max | 36–128 GB | ~28–110 GB | Llama 3.1 70B (at 48 GB+) |
| Mac Studio M4 Max | 64–128 GB | ~50–110 GB | Llama 3.1 70B+ comfortably |

Note

Apple Silicon runs LLMs via Metal/MLX acceleration — not CUDA. Most inference tools (Ollama, llama.cpp, LM Studio) fully support Metal. Training and fine-tuning support is more limited. If your workflow is primarily inference, Apple Silicon is excellent. For training, NVIDIA is still the safer choice.

What GPU for What Task: Buying Recommendations

Here's the practical mapping. Find your use case, get the right GPU.

| Use Case | VRAM Needed | Recommended GPU | Price Range |
| --- | --- | --- | --- |
| Small LLMs (3B–8B), SD 1.5 | 6–8 GB | RTX 3060 12GB (used) | $200–$300 |
| Mid-size LLMs (8B–14B), SDXL | 10–16 GB | RTX 4080 SUPER 16GB | $949–$1,099 |
| Large LLMs (32B Q4), Flux, LoRA fine-tuning | 20–24 GB | RTX 4090 24GB or RTX 3090 24GB | $699–$1,999 |
| 70B Q3 models, Flux upscaling, video gen | 28–32 GB | RTX 5090 32GB | $1,999–$2,199 |
| 70B FP16, multi-model serving, full fine-tuning | 80 GB | NVIDIA A100 80GB | $12,000–$15,000 |
| Production inference, 70B+ training | 80 GB (fast) | NVIDIA H100 80GB | $25,000–$33,000 |
| Massive models, research | 128 GB | AMD MI250X 128GB | $8,000–$11,000 |
| Silent home AI, 7B–32B (Q4), Ollama | 24–128 GB unified | Mac Studio M4 Max | $1,999–$4,499 |
| Edge inference, small models | 8 GB shared | Jetson Orin Nano | $199–$249 |

Expert Perspectives on VRAM

Tim Dettmers, researcher at the University of Washington and author of the widely cited Which GPU for Deep Learning guide, has consistently argued that VRAM is the single most important GPU spec for deep learning. In his hardware analysis, he notes that while raw compute (FLOPS) determines training speed, VRAM determines what you can run at all — making it the hard constraint rather than a performance variable[2].

George Hotz, founder of tinygrad and Tiny Corp, has taken this argument further in the consumer space. When evaluating AMD's 24 GB RX 7900 XTX against NVIDIA's 16 GB options, Hotz emphasized that the memory advantage matters more than raw compute for AI workloads: "If workloads need memory, you'd need three RTX 4060 cards to get the same 24 GB" — making the case that VRAM-per-dollar should be the primary buying metric for AI practitioners.

Five Common VRAM Mistakes to Avoid

  1. Ignoring the KV cache. Your model loaded? Great. Now add 2–8 GB for the conversation context. Always leave a 2–4 GB VRAM buffer above model weight size.
  2. Running at FP16 when INT4 would suffice. Unless you're doing research that demands maximum precision, Q4_K_M gives you 4x the model capacity for a barely measurable quality difference.
  3. Confusing system RAM with VRAM. 64 GB of DDR5 system RAM does not help your GPU. The model must fit in GPU VRAM (exception: Apple Silicon unified memory and CPU offloading, both slower).
  4. Buying 8 GB GPUs for AI. In 2026, 8 GB VRAM limits you to 3B–7B models at INT4 with minimal context. Given that 24 GB cards are available from $699 (used RTX 3090), 8 GB is false economy for AI.
  5. Sizing for inference when you plan to fine-tune. Fine-tuning uses 2–4x more memory than inference. If training is on your roadmap, size your GPU for training — not just inference.

The Decision Tree: How Much VRAM Do You Need?

Answer these three questions:

  1. What models will you run? Check the lookup table. Find the model size and quantization that matches your quality needs.
  2. What context length do you need? Add the KV cache overhead for your target context window to the model weight requirement.
  3. Will you fine-tune or just run inference? If fine-tuning, multiply your inference VRAM by 1.5x (LoRA) or 3–4x (full).

Sum those numbers. Then buy the GPU with the next tier up in VRAM — you'll always find a use for the headroom.
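The three questions reduce to one sum. The sketch below (function name, the per-1k KV rates, and the 3.5x midpoint for full fine-tuning are our simplifications of the figures in this guide) puts the whole sizing exercise in one place:

```python
def required_vram_gb(params_billions: float, bytes_per_param: float,
                     context_tokens: int, kv_gb_per_1k: float,
                     training: str = "none") -> float:
    """Total VRAM budget: weights (with 1.2x overhead) x training multiplier + KV cache."""
    weights = params_billions * bytes_per_param * 1.2
    kv = context_tokens / 1000 * kv_gb_per_1k
    multiplier = {"none": 1.0, "lora": 1.5, "full": 3.5}[training]
    return weights * multiplier + kv

# 8B model at Q4_K_M, 8K context, inference only (~0.1 GB KV per 1k tokens):
print(round(required_vram_gb(8, 0.5, 8192, 0.1), 1))   # 5.6 -> a 12 GB card is plenty
```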

Our Recommendations by Budget

  • Under $300: Used RTX 3060 12GB. Runs 7B–8B models in INT4. Enough for learning and lightweight inference.
  • $700–$1,000: Used RTX 3090 24GB. The best VRAM-per-dollar in the market. Runs 32B models, handles SDXL and Flux, enables LoRA fine-tuning of 7B–13B models.
  • $1,000–$1,100: RTX 4080 SUPER 16GB. Best for 7B–14B inference with excellent power efficiency. Not enough VRAM for 32B+ models.
  • $1,500–$2,000: RTX 4090 24GB. The proven workhorse. 24 GB handles 32B Q4, Flux, and LoRA fine-tuning of larger models. Strong used-market availability.
  • $2,000–$2,200: RTX 5090 32GB. The new consumer king. 32 GB opens up 70B models at aggressive quantization and full-resolution Flux without memory pressure. 78% more bandwidth than the 4090.
  • $2,000–$4,500 (silent): Mac Studio M4 Max. Up to 128 GB unified memory for the largest open-source models. Zero noise. No driver hassles. Best for inference-focused workflows.

The Bottom Line

In 2026, 24 GB is the new baseline for serious AI work. It runs 90% of the models most practitioners need. 32 GB (the new RTX 5090) future-proofs you for the next generation of 20B–70B models. And if you need 70B+ at full precision or plan to fine-tune large models, you're looking at 80 GB enterprise cards or multi-GPU setups.

The single best value in the market right now is a used RTX 3090 for $700–$900 — it delivers 24 GB of VRAM at a price that makes its VRAM-per-dollar hard to beat. If you're buying new, the RTX 5090 offers the best VRAM-per-dollar of any current-gen card, at roughly $62–69 per GB across its $1,999–$2,199 price range.

Don't overthink it. Check the lookup table, find the model you want to run, and buy the GPU that covers it with 20% headroom. VRAM is the constraint that matters — everything else is optimization.


Sources

  1. LocalLLM.in, "Ollama VRAM Requirements: Complete Guide to GPU Memory for Local LLMs" (2026). Measured VRAM usage for popular models across quantization levels. localllm.in
  2. Tim Dettmers, "Which GPU(s) to Get for Deep Learning: My Experience and Advice" (2023). University of Washington. Comprehensive GPU hardware analysis including VRAM prioritization for deep learning. timdettmers.com
  3. Lyx, "Context Kills VRAM: How to Run LLMs on Consumer GPUs," Medium (2024). Analysis of KV cache scaling and context length impact on memory. medium.com
  4. Frank Denneman, "Training vs Inference — Memory Consumption by Neural Networks" (2022). Detailed breakdown of optimizer state, gradient, and activation memory during training. frankdenneman.nl
  5. Prompting Pixels, "System Requirements for Stable Diffusion" (2024). Benchmarked VRAM consumption for SD 1.5, SDXL, and Flux across GPU tiers. medium.com