Reference
AI Hardware Glossary
Every term you need to understand before buying AI hardware.
C
CUDA
NVIDIA’s parallel computing platform that lets GPUs run general-purpose code, including AI workloads. Almost every major AI framework (PyTorch, TensorFlow) relies on CUDA for GPU acceleration on NVIDIA hardware, which is why NVIDIA GPUs dominate the AI hardware market. If you’re buying a GPU for AI, CUDA support is the single biggest reason NVIDIA cards command a premium over AMD alternatives.
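In practice, frameworks check for CUDA at runtime and fall back to other backends. A minimal sketch of that device-selection logic (the `pick_device` helper is ours, not a PyTorch API; the real availability checks are shown in the comment):

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, fall back to Apple's MPS backend, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch installed, you would call:
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
#   model.to(device)
```

The same pattern is why CUDA-less hardware still works with most tools, just on a slower code path.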
F
FP16 / FP32 / FP8
Floating-point precision formats that determine how many bits represent each number in a model’s weights. FP32 (32-bit) is full precision, FP16 (16-bit) halves memory usage with minimal accuracy loss, and FP8 (8-bit) cuts it further for inference. Lower precision means you can fit larger models in less VRAM and run them faster — modern GPUs like the RTX 5090 have dedicated FP8 tensor cores specifically for this reason.
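The memory impact is simple arithmetic: weights alone need parameters × bytes per parameter. A back-of-the-envelope sketch (ignores activation and KV-cache overhead, so real usage is somewhat higher):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for model weights alone, in gigabytes."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

# A 7B model: 28 GB in FP32, 14 GB in FP16, 7 GB in FP8.
```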
Fine-tuning
The process of taking a pre-trained AI model and adapting it to a specific task or dataset. Fine-tuning requires significantly more VRAM and compute than inference because the GPU must store model weights, gradients, and optimizer states simultaneously. For consumer hardware, 24 GB VRAM is the practical minimum for fine-tuning 7B-parameter models; larger models need multi-GPU setups or cloud compute.
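The reason for the VRAM gap is that full fine-tuning with an Adam-style optimizer stores the weights, one gradient per weight, and two optimizer-state values per weight. A rough sketch of that math (it assumes a uniform precision for simplicity and ignores activations, which also scale with batch size and sequence length):

```python
def full_finetune_memory_gb(params_billions: float,
                            bytes_per_param: int = 2,
                            optimizer_states: int = 2) -> float:
    """Weights + gradients + optimizer states, ignoring activations."""
    copies = 1 + 1 + optimizer_states  # weights, grads, two Adam moments
    return params_billions * bytes_per_param * copies

# 7B at 2 bytes/param: ~56 GB before activations, which is why
# memory-efficient methods like LoRA matter on consumer GPUs.
```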
G
GDDR6X / GDDR7 / HBM
The three main types of GPU memory. GDDR6X is the standard on recent high-end consumer GPUs (RTX 4090), GDDR7 succeeds it on RTX 50-series cards with higher bandwidth, and HBM (High Bandwidth Memory) is a stacked design used in data-center GPUs like the A100 and H100. For AI workloads, memory bandwidth directly affects tokens-per-second performance, so HBM cards are fastest but cost 10–20x more than consumer alternatives.
I
Inference
Running a trained AI model to generate predictions, text, or images — the primary workload for anyone running AI locally. Inference is far less demanding than training: a single consumer GPU with enough VRAM can run most open-source LLMs. The key bottleneck is VRAM (to fit the model) and memory bandwidth (to generate tokens quickly).
INT4 / INT8 (Quantization)
Quantization reduces model precision from floating-point (FP16/FP32) to lower-bit integers (INT8 or INT4), shrinking the model so it fits in less VRAM. A 70B-parameter model that needs 140 GB in FP16 can fit in roughly 35–40 GB at INT4. The trade-off is a small accuracy loss, but for most local AI use cases the difference is negligible. Quantization is why consumer GPUs with 24 GB VRAM can run surprisingly large models.
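The 70B figure falls straight out of the bit math. A sketch (quantized formats also store small per-group scale factors, which is why real model files come out a bit larger than this raw estimate):

```python
def quantized_size_gb(params_billions: float, bits: int) -> float:
    """Raw weight storage at a given bit width, ignoring scale metadata."""
    return params_billions * 1e9 * bits / 8 / 1e9

# 70B: 140 GB at 16-bit, 70 GB at INT8, 35 GB at INT4;
# real files add a few GB of quantization metadata on top.
```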
L
llama.cpp
An efficient open-source C/C++ runtime for running LLMs on consumer hardware, including CPUs and GPUs. It pioneered practical quantization for local inference and is the engine behind tools like Ollama. If you’re running models locally on a Mac, mini PC, or budget GPU, llama.cpp (or a tool built on it) is almost certainly what’s doing the work.
LLM (Large Language Model)
The neural networks behind ChatGPT, Llama, Mistral, and other AI chatbots. LLMs are measured by parameter count (7B, 70B, 405B), which directly correlates with how much VRAM you need to run them. Choosing AI hardware is largely about figuring out which LLMs you want to run and buying enough VRAM to fit them.
LoRA
Low-Rank Adaptation — a technique for fine-tuning large models by training only a small set of additional parameters instead of the full model. LoRA dramatically reduces the VRAM required for fine-tuning, making it possible to customize 7B–13B models on a single consumer GPU with 24 GB VRAM. It’s the most practical fine-tuning method for anyone without data-center hardware.
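LoRA's savings come from training two small matrices, A (d×r) and B (r×k), in place of each full d×k weight matrix. A sketch of the parameter-count math (the 4096×4096 layer shape below is illustrative, not taken from any specific model):

```python
def lora_trainable_fraction(d: int, k: int, rank: int) -> float:
    """Fraction of a d×k weight matrix's parameters LoRA actually trains."""
    full = d * k
    lora = rank * (d + k)  # A is d×rank, B is rank×k
    return lora / full

# A 4096×4096 projection at rank 8: trains ~0.4% of the
# original parameters, which is where the VRAM savings come from.
```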
M
Metal Performance Shaders (MPS)
Apple’s GPU compute framework, essentially the macOS equivalent of CUDA. MPS enables PyTorch and other frameworks to use Apple Silicon GPUs for AI acceleration. Support has improved rapidly, but the ecosystem is still smaller than CUDA’s, which means some models and tools work better on NVIDIA hardware. If you’re running AI on a Mac, MPS is what makes GPU acceleration possible.
MLX
Apple’s machine learning framework optimized specifically for Apple Silicon. MLX is designed to take full advantage of unified memory and the GPU/Neural Engine on M-series chips. It’s increasingly the best way to run AI models on Mac hardware, with a growing library of optimized models and a NumPy-like API that makes it easy for developers to adopt.
O
Ollama
A popular open-source tool that makes running LLMs locally as easy as a single terminal command. Ollama handles model downloading, quantized model management, and serving, and is built on top of llama.cpp. It’s the fastest way to go from zero to running a local chatbot and supports macOS, Linux, and Windows. If you’re buying hardware for local AI, Ollama compatibility is a good baseline test.
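Beyond the terminal, Ollama exposes a local HTTP API (port 11434 by default), which is how other apps integrate with it. A minimal sketch of the request body for its `/api/generate` endpoint (field names follow Ollama's published API; the helper function is ours):

```python
import json

def generate_request(model: str, prompt: str) -> str:
    """JSON body for POST http://localhost:11434/api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

# e.g. generate_request("llama3", "Why is the sky blue?")
# POST it with any HTTP client while the Ollama server is running.
```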
P
PCIe
The physical interface connecting a GPU to the motherboard. PCIe Gen 4 is the current standard; Gen 5 doubles bandwidth. For single-GPU AI setups the generation rarely matters (the bottleneck is VRAM, not the bus). It becomes important for multi-GPU rigs where cards need to communicate quickly, and for workstations where NVMe storage speed also depends on available PCIe lanes.
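The doubling per generation is easy to quantify: usable per-lane throughput (after encoding overhead) is roughly 1 GB/s for Gen 3, 2 GB/s for Gen 4, and 4 GB/s for Gen 5. A sketch:

```python
# Approximate usable per-lane throughput in GB/s, after encoding overhead.
PER_LANE_GBPS = {3: 0.985, 4: 1.969, 5: 3.938}

def pcie_bandwidth_gbps(gen: int, lanes: int = 16) -> float:
    """One-direction bandwidth for a PCIe link."""
    return PER_LANE_GBPS[gen] * lanes

# An x16 Gen 4 slot: ~31.5 GB/s; Gen 5 doubles that to ~63 GB/s.
# Both are far below the ~1 TB/s of on-card VRAM, which is why the
# bus rarely bottlenecks single-GPU inference.
```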
R
ROCm
AMD’s open-source GPU computing platform, the alternative to NVIDIA’s CUDA. ROCm support has improved significantly, with PyTorch offering official ROCm builds, but the ecosystem is still less mature. AMD GPUs can be 20–40% cheaper than equivalent NVIDIA cards, making ROCm attractive if your specific workload is supported. Always check framework and model compatibility before buying AMD for AI.
T
TDP (Thermal Design Power)
The maximum heat a GPU is designed to dissipate under load, measured in watts. TDP directly determines your power supply requirements and cooling needs. A 450 W GPU like the RTX 4090 needs at least an 850 W PSU and good case airflow. For quiet builds or small-form-factor AI PCs, TDP is a critical constraint that may push you toward more efficient options like Apple Silicon or lower-power GPUs.
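A common sizing approach is to add GPU TDP to CPU and platform draw, then pad for transient power spikes. A sketch of that estimate (the default CPU/platform figures and the 1.25× headroom factor are rules of thumb we chose, not a standard):

```python
def recommended_psu_watts(gpu_tdp: int, cpu_tdp: int = 150,
                          platform: int = 100, headroom: float = 1.25) -> int:
    """Headroom-padded system estimate, rounded up to the next 50 W."""
    total = (gpu_tdp + cpu_tdp + platform) * headroom
    return int(-(-total // 50) * 50)  # ceiling to a 50 W step

# An RTX 4090 at 450 W with a 150 W CPU lands at ~900 W,
# consistent with the 850 W+ guidance above.
```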
Tensor Cores
Specialized processing units inside NVIDIA GPUs designed for matrix multiplication — the fundamental math operation in neural networks. Tensor cores perform mixed-precision operations (FP16, INT8, FP8) several times faster than standard CUDA cores can. They’re a big part of why a modern RTX card is dramatically faster at AI workloads than an older GPU with similar CUDA core counts but no tensor cores.
Tokens per second (tok/s)
The standard speed metric for LLM inference, measuring how many text tokens the model generates each second. A comfortable conversational speed is around 30–40 tok/s; below 10 tok/s feels sluggish. Your tok/s depends on VRAM bandwidth, model size, quantization level, and GPU architecture. When comparing hardware for local AI, tok/s benchmarks on the models you care about matter more than raw TFLOPS.
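For single-user generation, a useful ceiling is the memory-bandwidth bound: each token requires reading roughly all of the model's weights once, so tok/s ≈ bandwidth ÷ model size. A sketch of that upper bound (real throughput is lower due to compute, KV-cache reads, and runtime overhead; the GPU figures in the comment are published specs):

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Bandwidth-bound ceiling: weights read once per generated token."""
    return bandwidth_gb_s / model_gb

# RTX 4090 (~1008 GB/s) with a 7B model quantized to ~4 GB:
# ceiling of ~252 tok/s. A 35 GB 70B quant caps near 29 tok/s,
# which is why bandwidth matters as much as raw compute.
```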
Training
Teaching an AI model from scratch (pre-training) or adapting it to new data (fine-tuning). Training is the most demanding AI workload: it requires massive VRAM, fast interconnects for multi-GPU setups, and sustained high power draw for days or weeks. Pre-training large models is impractical on consumer hardware, but fine-tuning smaller models (7B–13B) is achievable with a single high-end GPU.
U
Unified Memory
Apple Silicon’s shared memory pool used by both CPU and GPU, eliminating the need to copy data between separate memory banks. This is why a Mac with 128 GB unified memory can load models that would require a multi-GPU setup on discrete NVIDIA hardware. The trade-off is lower memory bandwidth than dedicated VRAM, so token generation is slower than on a high-end discrete GPU — but the ability to fit huge models in a single machine is a unique advantage.
V
VRAM
Video RAM — the dedicated memory on a GPU. VRAM is the single most important spec for AI hardware because the entire model (or a quantized version of it) must fit in VRAM for GPU-accelerated inference. 8 GB runs small models (up to ~7B parameters), 16 GB handles mid-range, 24 GB is the sweet spot for serious local AI, and 48 GB+ opens up 70B-parameter models. When in doubt, buy more VRAM.
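Those tiers follow from the same arithmetic used throughout this glossary: parameters × bits ÷ 8, plus working room for the KV cache and runtime. A sketch of the fit check (the 1.2× overhead allowance is a rough figure we chose, not a fixed constant):

```python
def fits_in_vram(params_billions: float, bits: int,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    """Can a model at a given quantization level fit on this GPU?"""
    needed_gb = params_billions * bits / 8 * overhead
    return needed_gb <= vram_gb

# 7B at 4-bit needs ~4.2 GB and fits in 8 GB; 70B at 4-bit
# needs ~42 GB: out of reach at 24 GB, comfortable at 48 GB.
```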
Ready to Buy?
Now that you know the terms, find the right hardware for your AI workload.