Knowledge Distillation

A model compression technique where a large, high-performance “teacher” model transfers its learned knowledge to a smaller “student” model by training the student on soft labels — the teacher’s full output probability distribution rather than one-hot targets. The student learns not just the correct answers but the teacher’s relative confidence across all possible outputs. This is how many of the best small models (1B–7B parameters) achieve impressive performance while remaining runnable on budget hardware with 8–16 GB of VRAM.
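A minimal sketch of a standard distillation loss in PyTorch, assuming the classic soft-label formulation (Hinton et al., 2015). The function name, the `temperature` and `alpha` parameters, and the example tensors are illustrative, not taken from any particular library:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the soft-label (teacher) loss with the hard-label (ground truth) loss."""
    # Soften both distributions with a temperature so the student sees the
    # teacher's relative confidence across all classes, not just the argmax.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions; the T^2 factor keeps
    # its gradient magnitude comparable to the hard-label term.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * (temperature ** 2)

    # Standard cross-entropy against the correct answers.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example usage: a batch of 4 examples over a 10-class output space.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

The `alpha` weight trades off imitating the teacher against fitting the ground-truth labels; a higher temperature spreads the teacher’s probability mass and exposes more of its confidence structure to the student.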
