Inference

Running a trained AI model to generate predictions, text, or images — the primary workload for anyone running AI locally. Inference is far less demanding than training: a single consumer GPU with enough VRAM can run most open-source LLMs. The key bottlenecks are VRAM capacity (to hold the model's weights) and memory bandwidth (to generate tokens quickly, since each decoded token requires reading the full set of weights).
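As a rough illustration of those two bottlenecks, here is a back-of-the-envelope sketch. The `overhead` factor, the 1.2 multiplier, and the idea of estimating decode speed as bandwidth divided by model size are simplifying assumptions (real throughput also depends on KV cache size, batch size, and kernel efficiency), not exact formulas:

```python
def vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Rough VRAM needed to run a model, in GB.

    overhead (assumed ~1.2x) loosely accounts for the KV cache and
    activations on top of the raw weights.
    """
    return params_billion * bytes_per_param * overhead


def tokens_per_sec_upper_bound(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Crude upper bound on decode speed.

    Token-by-token generation is memory-bandwidth-bound: each new token
    requires streaming all the weights through the GPU once, so the
    ceiling is roughly bandwidth / model size.
    """
    return bandwidth_gb_s / weights_gb


# A 7B-parameter model at fp16 (2 bytes/param) vs. 4-bit (0.5 bytes/param):
print(vram_gb(7, 2.0))   # ~16.8 GB — won't fit on a 12 GB card
print(vram_gb(7, 0.5))   # ~4.2 GB — fits comfortably

# With ~1000 GB/s of memory bandwidth and ~3.5 GB of 4-bit weights,
# the theoretical ceiling is a few hundred tokens per second.
print(tokens_per_sec_upper_bound(1000, 3.5))
```

This is why quantization (running weights at 4 or 8 bits instead of 16) helps twice: it shrinks the model to fit in VRAM *and* reduces the bytes read per token, raising the bandwidth-bound speed ceiling.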
