vLLM

A high-throughput, memory-efficient inference engine for LLMs that introduced PagedAttention — a technique that splits the KV cache into fixed-size blocks and manages them the way an operating system pages virtual memory, nearly eliminating memory fragmentation. By packing more requests into the same GPU memory, vLLM reportedly delivers 2–4x higher serving throughput than naive implementations on the same hardware. It’s the go-to tool for running a local LLM API server on NVIDIA GPUs, especially if you need to handle multiple users or applications simultaneously.
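A minimal sketch of what serving with vLLM looks like in practice: the `vllm serve` command starts an OpenAI-compatible HTTP server, which any OpenAI-style client can then query. The model name and port below are illustrative placeholders, not recommendations.

```shell
# Install vLLM (requires an NVIDIA GPU with a recent CUDA driver).
pip install vllm

# Launch an OpenAI-compatible API server for a model.
# The model ID here is an example; substitute any model you have access to.
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Query it from another terminal with the standard OpenAI chat endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

Because the server speaks the OpenAI API, existing applications can point at it by changing only the base URL — the PagedAttention scheduling that batches concurrent requests happens transparently behind the endpoint.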
