Reference
AI Hardware Glossary
Every term you need to understand before buying AI hardware.
A
AWQ (Activation-aware Weight Quantization)
A quantization method that identifies the most important weights in a model by analyzing activation patterns, then preserves their precision while aggressively quantizing the rest. AWQ typically delivers better accuracy than naive round-to-nearest quantization at INT4, making it a popular choice for deploying large models on consumer GPUs. If you see an AWQ model variant on Hugging Face, it’s been optimized to run well on hardware with limited VRAM.
B
Batch Size
The number of input samples processed simultaneously during training or inference. Larger batch sizes increase GPU utilization and throughput but require more VRAM. During training, batch size directly affects how much memory you need — doubling the batch size roughly doubles the activation memory. For inference servers handling multiple users, batch size determines how many concurrent requests your GPU can handle.
BF16 (Brain Floating Point)
A 16-bit floating-point format developed by Google Brain that keeps the same exponent range as FP32 but with reduced mantissa precision (7 bits instead of 23). BF16 is the default training format for most modern LLMs because it avoids the overflow issues of FP16 while using half the memory of FP32. NVIDIA GPUs from Ampere onward (RTX 30-series and newer) and recent Apple Silicon M-series chips support BF16 natively in their tensor/matrix units.
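The relationship between BF16 and FP32 can be illustrated with a small sketch. It assumes the simplest possible conversion, which is truncating an FP32 value to its top 16 bits; real hardware usually rounds to nearest instead. The function names are made up for this example.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return a 16-bit BF16 pattern for x by truncating its FP32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    # Keep the sign bit, all 8 exponent bits, and the top 7 mantissa bits.
    return bits32 >> 16

def bf16_bits_to_float(bits16: int) -> float:
    """Expand a BF16 bit pattern back to a Python float via FP32."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return x

value = 3.14159
roundtrip = bf16_bits_to_float(fp32_to_bf16_bits(value))
print(value, "->", roundtrip)  # precision drops to roughly 2-3 significant digits

# Large magnitudes survive because the exponent field is unchanged.
# FP16 tops out at 65504, so this value would overflow there.
big = bf16_bits_to_float(fp32_to_bf16_bits(1e38))
print(big)
```

The round trip loses mantissa detail but keeps the magnitude, which is the trade-off the definition describes.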
C
CUDA
NVIDIA’s parallel computing platform that lets GPUs run general-purpose code, including AI workloads. Almost every major AI framework (PyTorch, TensorFlow) relies on CUDA, which is why NVIDIA GPUs dominate the AI hardware market. If you’re buying a GPU for AI, CUDA support is the single biggest reason NVIDIA cards command a premium over AMD alternatives.
Context Window
The maximum number of tokens an LLM can process in a single conversation or prompt, including both input and output. Context windows range from 4K tokens in older models to 128K+ in modern ones like Llama 3.1 and GPT-4 Turbo. Larger context windows require significantly more VRAM because the KV cache grows linearly with sequence length. If you need to process long documents locally, plan for extra VRAM beyond what the model weights alone require.
CUDA Cores
The individual parallel processors inside an NVIDIA GPU that handle general-purpose computation. CUDA cores are distinct from tensor cores: they execute standard floating-point and integer operations, while tensor cores specialize in matrix math. Higher CUDA core counts improve gaming and general compute, but for AI workloads, tensor core count and VRAM matter more. Don’t compare raw CUDA core numbers across GPU generations — architecture improvements mean newer cores do more work per clock.
D
DeepSpeed
Microsoft’s open-source deep learning optimization library for efficient distributed training and inference. DeepSpeed’s ZeRO (Zero Redundancy Optimizer) stages let you train models that exceed a single GPU’s VRAM by sharding optimizer states, gradients, and parameters across multiple GPUs. For multi-GPU AI workstations, DeepSpeed is the go-to tool for training models that don’t fit on one card.
Diffusion Model
A type of generative AI model that creates images (or video, audio) by gradually removing noise from a random starting point. Stable Diffusion, FLUX, and DALL-E are all diffusion models. Running diffusion models locally requires a decent GPU but less VRAM than LLMs — most image generators work well with 8–12 GB VRAM. Speed scales with GPU compute power, so faster cards produce images in seconds rather than minutes.
Distillation
Training a smaller “student” model to replicate the behavior of a larger “teacher” model. Distillation is how many efficient open-source models are created — a 70B model’s knowledge gets compressed into a 7B model that runs on consumer hardware. The result is a model that punches above its parameter count, making distilled models some of the best choices for local AI on limited VRAM.
DLSS (Deep Learning Super Sampling)
NVIDIA’s AI-powered upscaling technology that uses tensor cores to render games at a lower resolution and intelligently upscale the output. While DLSS is primarily a gaming feature, it demonstrates the dual-use value of NVIDIA GPUs with tensor cores: you get AI compute performance and gaming upscaling from the same hardware. Relevant if you want a single GPU that handles both AI workloads and gaming.
DPO (Direct Preference Optimization)
A simpler alternative to RLHF for aligning language models with human preferences. DPO skips the separate reward model step and directly optimizes the language model using pairs of preferred and rejected outputs. It’s more memory-efficient than RLHF, making it feasible to run alignment fine-tuning on consumer GPUs with 24 GB+ VRAM. Many popular open-source chat models are DPO-tuned.
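As a rough illustration of how DPO works without a reward model, the sketch below computes the DPO loss for a single preference pair from four log-probabilities. The function name, argument names, and the `beta=0.1` default are illustrative assumptions, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

Only the policy and a frozen reference copy are involved, which is where the memory saving over RLHF comes from.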
E
Embedding
A numerical vector representation of text, images, or other data that captures semantic meaning. Embeddings are the foundation of RAG (retrieval-augmented generation), semantic search, and recommendation systems. Embedding models are small and fast — they run comfortably on almost any modern hardware, including CPUs. The hardware bottleneck in embedding workflows is usually the LLM that processes retrieved results, not the embedding step itself.
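To make "captures semantic meaning" concrete, here is a minimal sketch of semantic search over embeddings using cosine similarity. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real system would get these from an embedding model.
documents = {
    "gpu buying guide": [0.9, 0.1, 0.0],
    "pasta recipe": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this encodes "which graphics card should I buy?"

best_match = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best_match)
```

The arithmetic is a few multiplications per dimension, which is why this step runs comfortably on a CPU.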
F
FP16 / FP32 / FP8
Floating-point precision formats that determine how many bits represent each number in a model’s weights. FP32 (32-bit) is full precision, FP16 (16-bit) halves memory usage with minimal accuracy loss, and FP8 (8-bit) cuts it further for inference. Lower precision means you can fit larger models in less VRAM and run them faster — modern GPUs like the RTX 5090 have dedicated FP8 tensor cores specifically for this reason.
Fine-tuning
The process of taking a pre-trained AI model and adapting it to a specific task or dataset. Fine-tuning requires significantly more VRAM and compute than inference because the GPU must store model weights, gradients, and optimizer states simultaneously. For consumer hardware, 24 GB VRAM is the practical minimum for fine-tuning 7B-parameter models, and in practice that means parameter-efficient methods such as LoRA rather than full fine-tuning; larger models need multi-GPU setups or cloud compute.
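A back-of-the-envelope sketch shows why storing weights, gradients, and optimizer states adds up so quickly. It assumes full fine-tuning with mixed-precision Adam at a commonly cited 16 bytes per parameter (2 for 16-bit weights, 2 for gradients, 12 for FP32 master weights and the two Adam moments) and ignores activations entirely. Parameter-efficient methods such as LoRA need far less.

```python
def full_finetune_memory_gb(params_billion,
                            bytes_weights=2,     # 16-bit working weights
                            bytes_gradients=2,   # 16-bit gradients
                            bytes_optimizer=12): # FP32 master copy + 2 Adam moments
    """Rough VRAM for weights, gradients, and optimizer state (no activations)."""
    bytes_per_param = bytes_weights + bytes_gradients + bytes_optimizer
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_param

print(full_finetune_memory_gb(7))   # 112 (GB) for a 7B model under these assumptions
print(full_finetune_memory_gb(7, bytes_gradients=0, bytes_optimizer=0))  # 14 (GB): weights only
```

The gap between the two numbers is the extra cost of training over simply loading the model.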
Flash Attention
An optimized attention algorithm that dramatically reduces the memory usage and increases the speed of transformer models by restructuring how attention is computed in GPU memory. Flash Attention avoids materializing the full attention matrix, cutting VRAM usage from quadratic to linear in sequence length. It’s now built into most inference engines and lets you run longer context windows on the same hardware. NVIDIA GPUs benefit most; support on AMD and Apple Silicon is improving.
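The key idea that lets attention skip the full score matrix is a running ("online") softmax. The sketch below shows it for a single query, with scalar values for brevity. It illustrates the numerical trick only and is not the actual GPU kernel, which also tiles queries and keeps each block in on-chip memory. Function names are illustrative.

```python
import math

def naive_attention_row(scores, values):
    """Softmax-weighted average that needs every score in memory at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def streaming_attention_row(scores, values, block_size=2):
    """Same result, but processes scores block by block with O(1) extra state."""
    running_max = -math.inf
    denom = 0.0   # running softmax denominator
    acc = 0.0     # running weighted sum of values
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial sums to the new maximum (exp(-inf) is 0.0).
        scale = math.exp(running_max - new_max)
        denom = denom * scale + sum(math.exp(s - new_max) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - new_max) * v for s, v in zip(s_blk, v_blk))
        running_max = new_max
    return acc / denom

scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(naive_attention_row(scores, values), streaming_attention_row(scores, values))
```

Because only a running maximum, denominator, and accumulator are kept, memory no longer grows with the square of the sequence length.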
FP4
An extremely low-precision 4-bit floating-point format used for aggressive model quantization. FP4 reduces model size by 8x compared to FP32, enabling very large models to fit on consumer hardware at the cost of some accuracy. NVIDIA’s Blackwell architecture (RTX 50-series) adds native FP4 tensor core support, which makes FP4 a practical inference option on those GPUs because the math runs directly in hardware instead of being emulated at a higher precision.
G
GDDR6X / GDDR7 / HBM
The three main types of GPU memory. GDDR6X is used on high-end consumer GPUs of the RTX 40 generation (RTX 4090), GDDR7 debuts on the RTX 50-series with higher bandwidth, and HBM (High Bandwidth Memory) is a stacked design used in data-center GPUs like the A100 and H100. For AI workloads, memory bandwidth directly affects tokens-per-second performance, so HBM cards are fastest but cost 10–20x more than consumer alternatives.
GGUF
The file format used by llama.cpp and compatible tools (Ollama, LM Studio) for storing quantized LLMs. GGUF replaced the older GGML format and includes metadata about the model’s architecture, quantization level, and tokenizer. When downloading a model to run locally, GGUF files are typically what you want. They come in variants like Q4_K_M or Q5_K_S, where lower numbers mean more compression (less VRAM, slightly lower quality).
GPTQ
A post-training quantization method that compresses LLMs to 4-bit or 3-bit precision with minimal accuracy loss by using second-order optimization (approximate Hessian information). GPTQ models are optimized for GPU inference and typically run faster than GGUF on NVIDIA cards with enough VRAM. If you have a dedicated NVIDIA GPU and want the fastest quantized inference, GPTQ is often the best format.
Gradient Checkpointing
A memory optimization technique for training that trades compute time for VRAM savings. Instead of storing all intermediate activations during the forward pass, gradient checkpointing recomputes them during the backward pass. This can reduce training memory usage by 60–80% at the cost of ~20–30% slower training. It’s essential for fine-tuning larger models on consumer GPUs with limited VRAM.
I
Inference
Running a trained AI model to generate predictions, text, or images — the primary workload for anyone running AI locally. Inference is far less demanding than training: a single consumer GPU with enough VRAM can run most open-source LLMs. The key bottleneck is VRAM (to fit the model) and memory bandwidth (to generate tokens quickly).
INT4 / INT8 (Quantization)
Quantization reduces model precision from floating-point (FP16/FP32) to lower-bit integers (INT8 or INT4), shrinking the model so it fits in less VRAM. A 70B-parameter model that needs 140 GB in FP16 can fit in roughly 35–40 GB at INT4. The trade-off is a small accuracy loss, but for most local AI use cases the difference is negligible. Quantization is why consumer GPUs with 24 GB VRAM can run surprisingly large models.
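The arithmetic behind those figures is simple enough to sketch. The helper below counts weight storage only; real quantized files come out somewhat larger because they also store scaling factors and often keep a few layers at higher precision, which is why the text gives a 35–40 GB range rather than a single number.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GB."""
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bits_per_weight / 8

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(label, weight_memory_gb(70, bits), "GB")
# FP16 140.0 GB, INT8 70.0 GB, INT4 35.0 GB
```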
K
Knowledge Distillation
A model compression technique where a large, high-performance “teacher” model transfers its learned knowledge to a smaller “student” model through soft label training. The student learns not just the correct answers but the teacher’s confidence distribution across all possible outputs. This is how many of the best small models (1B–7B parameters) achieve impressive performance while remaining runnable on budget hardware with 8–16 GB VRAM.
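"Soft label training" can be sketched as a loss that compares the teacher's and student's full output distributions rather than only the top answer. The example below assumes a plain KL divergence over temperature-softened probabilities; practical recipes often scale this term and mix it with an ordinary cross-entropy loss. Names and the temperature value are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher's soft labels
    q = softmax(student_logits, temperature)  # student's current guess
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher))  # 0.0: student matches teacher
print(distillation_loss([1.0, 4.0, 0.5], teacher))  # positive: student disagrees
```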
KV Cache
A memory buffer that stores previously computed key-value attention pairs during LLM text generation, so the model doesn’t have to recompute them for every new token. The KV cache grows with context length and batch size, and can consume significant VRAM on top of the model weights. For a 70B model at 128K context, the KV cache alone can require 20+ GB. This is why long conversations use more VRAM than short ones on the same model.
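A rough sizing formula makes the growth visible. The configuration below (80 layers, 8 key-value heads, head dimension 128) is an assumed 70B-class model using grouped-query attention, not the spec of any specific checkpoint; models without grouped-query attention store many more heads and need correspondingly more.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, context_tokens,
                bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in GB."""
    # One key vector and one value vector per layer, per KV head, per token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * batch_size / 1e9

# Assumed 70B-class configuration at a 128K (131,072-token) context.
print(kv_cache_gb(80, 8, 128, 131072))                     # about 43 GB with 16-bit values
print(kv_cache_gb(80, 8, 128, 131072, bytes_per_value=1))  # about 21 GB with an 8-bit cache
print(kv_cache_gb(80, 8, 128, 4096))                       # about 1.3 GB at a 4K context
```

Context length and batch size both enter as plain multipliers, so doubling either doubles the cache.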
L
llama.cpp
An efficient open-source C/C++ runtime for running LLMs on consumer hardware, including CPUs and GPUs. It pioneered practical quantization for local inference and is the engine behind tools like Ollama. If you’re running models locally on a Mac, mini PC, or budget GPU, llama.cpp (or a tool built on it) is almost certainly what’s doing the work.
LLM (Large Language Model)
The neural networks behind ChatGPT, Llama, Mistral, and other AI chatbots. LLMs are measured by parameter count (7B, 70B, 405B) which directly correlates with how much VRAM you need to run them. Choosing AI hardware is largely about figuring out which LLMs you want to run and buying enough VRAM to fit them.
LoRA
Low-Rank Adaptation — a technique for fine-tuning large models by training only a small set of additional parameters instead of the full model. LoRA dramatically reduces the VRAM required for fine-tuning, making it possible to customize 7B–13B models on a single consumer GPU with 24 GB VRAM. It’s the most practical fine-tuning method for anyone without data-center hardware.
LoRA Adapter
The small set of trained weight matrices produced by LoRA fine-tuning. A LoRA adapter is typically just 10–100 MB compared to the multi-gigabyte base model, and can be swapped in and out at inference time to customize model behavior for different tasks. You can collect multiple adapters for different use cases and apply them to the same base model, making it a flexible system for local AI customization.
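The adapter's small size follows from the low-rank structure: each adapted weight matrix of shape d_out x d_in gains two thin matrices with rank r, adding r * (d_in + d_out) parameters. The sketch below uses an assumed 7B-class layout (32 layers, hidden size 4096, four attention projections adapted per layer, rank 16); actual adapters vary with the rank and which matrices are targeted.

```python
def lora_params_per_matrix(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def adapter_size_mb(num_layers, matrices_per_layer, d_model, rank, bytes_per_param=2):
    """Approximate adapter size, assuming square d_model x d_model target matrices."""
    total_params = num_layers * matrices_per_layer * lora_params_per_matrix(d_model, d_model, rank)
    return total_params * bytes_per_param / 1e6

# Assumed configuration, stored in 16-bit precision.
print(adapter_size_mb(num_layers=32, matrices_per_layer=4, d_model=4096, rank=16))  # about 33.6 MB
```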
M
Metal Performance Shaders (MPS)
Apple’s GPU compute framework, essentially the macOS equivalent of CUDA. MPS enables PyTorch and other frameworks to use Apple Silicon GPUs for AI acceleration. Support has improved rapidly, but the ecosystem is still smaller than CUDA’s, which means some models and tools work better on NVIDIA hardware. If you’re running AI on a Mac, MPS is what makes GPU acceleration possible.
MLX
Apple’s machine learning framework optimized specifically for Apple Silicon. MLX is designed to take full advantage of unified memory and the GPU on M-series chips. It’s increasingly the best way to run AI models on Mac hardware, with a growing library of optimized models and a NumPy-like API that makes it easy for developers to adopt.
Model Sharding
Splitting a model’s weights across multiple GPUs (or across GPU and CPU/RAM) when it’s too large to fit on a single device. Model sharding enables running 70B+ parameter models on consumer hardware by distributing layers across available memory. The trade-off is slower inference due to inter-device communication. Tools like llama.cpp support CPU/GPU sharding out of the box, while multi-GPU sharding requires frameworks like DeepSpeed or vLLM.
MoE (Mixture of Experts)
An architecture where a model contains multiple specialized sub-networks (“experts”) and a router that activates only a subset for each input token. This means a 140B-parameter MoE model might only use 40B parameters per token, running much faster than a dense 140B model while needing the full parameter count in memory. For hardware buyers, MoE models are deceptive: they’re fast for their apparent size but still need substantial VRAM to load all experts.
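The routing step can be sketched in a few lines. This toy version assumes top-2 routing with a softmax over only the selected experts, and uses plain Python functions as stand-ins for expert networks; real MoE layers differ in how they normalize gates and balance load.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in ranked)
    exps = {i: math.exp(router_logits[i] - m) for i in ranked}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    return sum(gate * experts[i](x) for i, gate in gates.items())

# Four hypothetical experts; all must be held in memory, but only two run per token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
gates = route_top_k([0.1, 2.0, -1.0, 2.0])
print(gates)
print(moe_forward(3.0, experts, [0.1, 2.0, -1.0, 2.0]))
```

Every expert in the list must be loaded even though only two are executed, which mirrors the VRAM-versus-speed point above.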
N
NPU (Neural Processing Unit)
A dedicated AI accelerator chip built into modern CPUs and SoCs, designed for efficient low-power inference tasks like background noise removal, image enhancement, and on-device AI assistants. Intel’s Meteor Lake, AMD’s Ryzen AI, Qualcomm’s Snapdragon X, and Apple’s Neural Engine all include NPUs. For heavy AI workloads like running LLMs, NPUs are currently too slow compared to GPUs — but they handle lightweight AI features with minimal battery impact.
NVLink
NVIDIA’s high-speed GPU-to-GPU interconnect that provides much higher bandwidth than PCIe for multi-GPU communication. NVLink is critical for multi-GPU training and large-model inference where GPUs need to exchange data constantly. Current consumer GPUs (RTX 40-series and newer) do not support NVLink; the RTX 3090 was the last consumer card to offer it, and today it is otherwise limited to data-center cards like the A100 and H100. If you’re building a multi-GPU consumer rig with current cards, you’ll be using PCIe, which is slower but still workable for many setups.
O
Ollama
A popular open-source tool that makes running LLMs locally as easy as a single terminal command. Ollama handles model downloading, quantization, and serving, built on top of llama.cpp. It’s the fastest way to go from zero to running a local chatbot and supports macOS, Linux, and Windows. If you’re buying hardware for local AI, Ollama compatibility is a good baseline test.
P
PCIe
The physical interface connecting a GPU to the motherboard. PCIe Gen 4 is the current standard; Gen 5 doubles bandwidth. For single-GPU AI setups the generation rarely matters (the bottleneck is VRAM, not the bus). It becomes important for multi-GPU rigs where cards need to communicate quickly, and for workstations where NVMe storage speed also depends on available PCIe lanes.
Pipeline Parallelism
A multi-GPU strategy that splits a model’s layers across GPUs in sequence — GPU 1 handles layers 1–20, GPU 2 handles layers 21–40, and so on. Each GPU processes a different micro-batch simultaneously, creating a pipeline. This is simpler than tensor parallelism but introduces pipeline “bubbles” (idle time). For consumer multi-GPU setups, pipeline parallelism over PCIe is the most practical approach since it requires less inter-GPU bandwidth than tensor parallelism.
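The layer split and the cost of pipeline bubbles can both be sketched with simple arithmetic. The bubble estimate assumes a basic schedule in which every stage takes equal time and all micro-batches flow through before the next step begins; function names are illustrative.

```python
def split_layers(num_layers, num_gpus):
    """Assign contiguous layer ranges to GPUs as evenly as possible."""
    base, extra = divmod(num_layers, num_gpus)
    ranges, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

def bubble_fraction(num_stages, num_micro_batches):
    """Share of time each GPU sits idle under the simple schedule assumed above."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)

# Zero-based layer indices: GPU 0 gets layers 0-19, GPU 1 gets layers 20-39.
print(split_layers(40, 2))
print(bubble_fraction(num_stages=2, num_micro_batches=1))  # 0.5: half the time is idle
print(bubble_fraction(num_stages=2, num_micro_batches=8))  # idle share shrinks with more micro-batches
```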
Pruning
A model compression technique that removes unnecessary weights (setting them to zero) to create a smaller, faster model. Structured pruning removes entire neurons or attention heads, while unstructured pruning zeroes out individual weights. Pruning can reduce model size by 50–90% with careful calibration. Combined with quantization, pruning enables very large models to run on consumer GPUs that couldn’t otherwise fit them.
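Unstructured pruning in its simplest form, magnitude pruning, can be sketched as follows. This is a toy version on a flat list of weights; it leaves out the calibration and retraining steps that real pruning pipelines rely on to recover accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    num_to_drop = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(by_magnitude[:num_to_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

layer = [0.5, -0.01, 0.3, 0.02]
print(magnitude_prune(layer, sparsity=0.5))  # [0.5, 0.0, 0.3, 0.0]
```

Zeroed weights only save memory or time when the storage format or kernel can skip them, which is one reason structured pruning is often easier to exploit on GPUs.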
Q
QLoRA
Quantized LoRA: a technique that combines 4-bit quantization of the base model with LoRA fine-tuning on top. QLoRA keeps the frozen base weights in a 4-bit format (NF4 in the original paper) while training the LoRA adapters in 16-bit precision. The original paper fine-tuned a 65B-parameter model on a single 48 GB GPU; on a 24 GB consumer card, models up to roughly 30B parameters are a realistic target, since the 4-bit weights of a 70B model alone take about 35 GB. It’s one of the most VRAM-efficient fine-tuning methods available and has made consumer-GPU fine-tuning of large models genuinely practical.
R
ROCm
AMD’s open-source GPU computing platform, the alternative to NVIDIA’s CUDA. ROCm support has improved significantly, with PyTorch offering official ROCm builds, but the ecosystem is still less mature. AMD GPUs can be 20–40% cheaper than equivalent NVIDIA cards, making ROCm attractive if your specific workload is supported. Always check framework and model compatibility before buying AMD for AI.
RAG (Retrieval-Augmented Generation)
A technique that enhances LLM responses by first retrieving relevant documents from a knowledge base and including them in the prompt. RAG lets you give a model access to private data, current information, or domain-specific knowledge without fine-tuning. Running RAG locally requires enough hardware for both the embedding model (lightweight) and the LLM (the real bottleneck). The main hardware consideration is VRAM for the LLM plus storage for the vector database.
RLHF (Reinforcement Learning from Human Feedback)
A training technique that aligns LLMs with human preferences by training a reward model on human-rated outputs, then using reinforcement learning to optimize the LLM against that reward model. RLHF is extremely resource-intensive: it requires keeping several models in memory at once (the policy, a frozen reference copy, and the reward model, plus a value model in PPO-based setups), demanding 2–3x the VRAM of standard fine-tuning. This is primarily a data-center workload, though simplified alternatives like DPO make alignment feasible on consumer hardware.
S
Speculative Decoding
An inference acceleration technique where a small, fast “draft” model generates multiple candidate tokens that a larger “verifier” model checks in parallel. Since the verifier can validate a batch of tokens faster than generating them one by one, this speeds up overall generation by 2–3x without changing the output quality. Speculative decoding is especially effective on consumer hardware where memory bandwidth is the bottleneck — it gets more useful work per memory read.
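The draft-then-verify loop can be sketched with toy models. This version assumes greedy decoding, where a drafted token is accepted only if it matches what the verifier would have picked; sampling-based implementations use a probabilistic acceptance rule instead. The "models" here are simple stand-in functions, and the verifier is called in a loop where a real system would use one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=4):
    """Return the tokens produced by one draft-and-verify round (greedy variant)."""
    # 1. The cheap draft model proposes several tokens in sequence.
    context = list(prefix)
    proposal = []
    for _ in range(num_draft):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # 2. The verifier checks each proposed position.
    context = list(prefix)
    accepted = []
    for token in proposal:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # all drafts accepted: one extra token
    return accepted

# Toy models: the target counts upward mod 10; the draft is wrong right after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step([1], draft, target))  # [2, 3, 4]
```

The output is exactly what the target model would have generated on its own, but three tokens were produced in a single verification round.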
T
TDP (Thermal Design Power)
The maximum heat a GPU is designed to dissipate under load, measured in watts. TDP directly determines your power supply requirements and cooling needs. A 450 W GPU like the RTX 4090 needs at least an 850 W PSU and good case airflow. For quiet builds or small-form-factor AI PCs, TDP is a critical constraint that may push you toward more efficient options like Apple Silicon or lower-power GPUs.
Tensor Cores
Specialized processing units inside NVIDIA GPUs designed for matrix multiplication, the fundamental math operation in neural networks. Tensor cores perform mixed-precision operations (FP16, INT8, FP8) several times faster than standard CUDA cores can. They’re the reason an RTX 4090 is dramatically faster at AI workloads than a GPU with a similar CUDA core count but no tensor cores.
Tokens per second (tok/s)
The standard speed metric for LLM inference, measuring how many text tokens the model generates each second. A comfortable conversational speed is around 30–40 tok/s; below 10 tok/s feels sluggish. Your tok/s depends on VRAM bandwidth, model size, quantization level, and GPU architecture. When comparing hardware for local AI, tok/s benchmarks on the models you care about matter more than raw TFLOPS.
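A rough ceiling on generation speed follows from memory bandwidth: for single-user decoding, producing each token requires reading essentially all of the model's weights once. The sketch below assumes that simplification and ignores KV cache reads, compute time, and software overhead, so real numbers land below it. The bandwidth and model-size figures are round hypothetical values.

```python
def max_tokens_per_second(memory_bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream decode speed for a bandwidth-bound model."""
    return memory_bandwidth_gb_s / model_size_gb

# Hypothetical hardware: 1000 GB/s versus 400 GB/s of memory bandwidth.
for bandwidth in (1000, 400):
    for size_gb in (4, 35):
        ceiling = max_tokens_per_second(bandwidth, size_gb)
        print(f"{bandwidth} GB/s, {size_gb} GB model: up to {ceiling:.1f} tok/s")
```

This is why quantization helps speed as well as fit: a smaller model means fewer bytes to read per token.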
Training
Teaching an AI model from scratch (pre-training) or adapting it to new data (fine-tuning). Training is the most demanding AI workload: it requires massive VRAM, fast interconnects for multi-GPU setups, and sustained high power draw for days or weeks. Pre-training large models is impractical on consumer hardware, but fine-tuning smaller models (7B–13B) is achievable with a single high-end GPU.
TGI (Text Generation Inference)
Hugging Face’s production-ready inference server optimized for serving LLMs with features like continuous batching, tensor parallelism, and quantization support. TGI is designed for deploying models at scale rather than personal use, but it’s relevant for anyone running a local AI API server that needs to handle multiple concurrent requests. Requires an NVIDIA GPU with substantial VRAM for best performance.
Tokens
The basic units that LLMs process — roughly corresponding to word fragments. A token is typically 3–4 characters of English text, so 1,000 tokens equals about 750 words. Token count matters for two hardware-related reasons: (1) the context window size determines the maximum tokens per conversation, which affects KV cache VRAM usage, and (2) tokens per second (tok/s) is the primary speed benchmark for comparing AI hardware.
TOPS / TFLOPS
TOPS (Tera Operations Per Second) and TFLOPS (Tera Floating-Point Operations Per Second) measure a chip’s theoretical peak compute throughput. NPUs are typically rated in TOPS, while GPUs use TFLOPS. These numbers are useful for rough comparisons within the same architecture but misleading across different hardware — an Apple M4’s 38 TOPS NPU and an NVIDIA RTX 4090’s 83 TFLOPS aren’t directly comparable. For AI workloads, real-world tok/s benchmarks matter far more than theoretical TOPS.
Transformer
The neural network architecture underlying virtually all modern LLMs, image generators, and multimodal AI models. Transformers use self-attention mechanisms to process input sequences in parallel, making them highly efficient on GPU hardware. The key hardware implication: transformer workloads are dominated by matrix multiplications, which is why GPUs with tensor cores (NVIDIA) or specialized matrix engines (Apple Silicon) vastly outperform general-purpose CPUs for AI.
U
Unified Memory
Apple Silicon’s shared memory pool used by both CPU and GPU, eliminating the need to copy data between separate memory banks. This is why a Mac with 128 GB unified memory can load models that would require a multi-GPU setup on discrete NVIDIA hardware. The trade-off is lower memory bandwidth than the dedicated VRAM on high-end GPUs, so token generation is slower for a model of the same size. The ability to fit huge models in a single machine is the main advantage.
V
VRAM
Video RAM — the dedicated memory on a GPU. VRAM is the single most important spec for AI hardware because the entire model (or a quantized version of it) must fit in VRAM for GPU-accelerated inference. 8 GB runs small models (up to ~7B parameters), 16 GB handles mid-range, 24 GB is the sweet spot for serious local AI, and 48 GB+ opens up 70B-parameter models. When in doubt, buy more VRAM.
vLLM
A high-throughput, memory-efficient inference engine for LLMs that introduced PagedAttention — a technique that manages KV cache memory like an operating system manages virtual memory. vLLM can serve 2–4x more concurrent requests than naive implementations on the same hardware. It’s the go-to tool for running a local LLM API server on NVIDIA GPUs, especially if you need to handle multiple users or applications simultaneously.