Speculative Decoding
An inference acceleration technique where a small, fast "draft" model generates multiple candidate tokens that a larger "verifier" model then checks in a single parallel pass. Because the verifier can score a batch of candidate tokens in roughly the time it takes to generate one, overall generation speeds up by 2–3x in typical setups. Crucially, the output is unchanged: the verification step accepts only tokens the large model would have produced itself, so the result matches what the verifier alone would generate. Speculative decoding is especially effective when memory bandwidth is the bottleneck, as on consumer hardware, because each pass over the model's weights now produces several tokens' worth of useful work instead of one.
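The loop above can be sketched with toy deterministic "models". This is a minimal illustration of the greedy variant, not a real implementation: `target_model` and `draft_model` are hypothetical stand-ins for neural LMs, and the "parallel" verification is written as a plain loop for clarity (on a GPU it would be one batched forward pass).

```python
def target_model(token):
    # Toy "large" model: deterministically maps a token to the next one.
    return (token * 3 + 1) % 10

def draft_model(token):
    # Cheap draft model: agrees with the target most of the time,
    # but diverges on multiples of 4 to simulate imperfection.
    if token % 4 == 0:
        return (token + 1) % 10  # deliberately wrong sometimes
    return (token * 3 + 1) % 10

def speculative_decode(start, length, k=4):
    """Greedy speculative decoding: the draft proposes k tokens,
    the target verifies all k positions at once, and the longest
    matching prefix is accepted. On a mismatch, the target's own
    token is substituted, so the output equals pure target decoding."""
    out = [start]
    while len(out) < length + 1:
        # 1. Draft k candidate tokens autoregressively (cheap).
        cand, last = [], out[-1]
        for _ in range(k):
            last = draft_model(last)
            cand.append(last)
        # 2. Target scores every candidate position; in a real system
        #    these k evaluations happen in one parallel batch.
        prev = out[-1]
        for c in cand:
            t = target_model(prev)
            if t == c:
                out.append(c)       # draft token accepted
                prev = c
            else:
                out.append(t)       # mismatch: keep the target's token
                break               # discard the rest of the draft
    return out[: length + 1]
```

Note the lossless property: because every accepted token was checked against the target and every rejected one is replaced by the target's choice, `speculative_decode` produces exactly the sequence the target model would have generated on its own, only faster when the draft's acceptance rate is high.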