RLHF (Reinforcement Learning from Human Feedback)

A training technique that aligns LLMs with human preferences: a reward model is first trained on human-rated outputs, then reinforcement learning optimizes the LLM against that reward model. RLHF is extremely resource-intensive. It requires running several models simultaneously (the policy being trained, a frozen reference copy, the reward model, and, with PPO, a separate value model), demanding 2–3x the VRAM of standard fine-tuning. This is primarily a data-center workload, though simplified alternatives such as DPO (Direct Preference Optimization), which dispenses with the reward model and RL loop entirely, make alignment feasible on consumer hardware.
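The two objectives above can be sketched in a few lines. This is a minimal illustration, not a training loop: it assumes you already have summed token log-probabilities from the policy and reference models, and a scalar score from the reward model. The function names and the `beta` coefficient are illustrative choices, not a specific library's API.

```python
import math

def rlhf_reward(rm_score, policy_logprob, ref_logprob, beta=0.1):
    """Shaped reward used in RLHF: the reward model's score minus a
    KL penalty that discourages the policy from drifting away from
    the reference model. Uses the sample-based KL estimate
    log pi(y|x) - log pi_ref(y|x)."""
    kl = policy_logprob - ref_logprob
    return rm_score - beta * kl

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair, given summed
    log-probabilities under the policy (pi_*) and reference (ref_*).
    No reward model is needed: the implicit reward is the log-ratio
    between policy and reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): low loss when the policy prefers the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note the trade-off the prose describes: `rlhf_reward` presumes three live models (policy, reference, reward model) plus an RL optimizer, while `dpo_loss` needs only log-probabilities from the policy and reference, which is why DPO fits on far less hardware.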
