DPO (Direct Preference Optimization)

A simpler alternative to RLHF for aligning language models with human preferences. DPO skips the separate reward-model step and instead optimizes the language model directly on pairs of preferred and rejected outputs, treating preference alignment as a classification-style loss on the policy itself. Because it needs no reward model and no PPO sampling loop, it is more memory-efficient than RLHF, making alignment fine-tuning feasible on consumer GPUs with 24 GB+ VRAM. Many popular open-source chat models are DPO-tuned.
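To make the idea concrete, here is a minimal sketch of the per-pair DPO loss, -log σ(β[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]). The function below takes scalar sequence log-probabilities; the names and the standalone-function form are illustrative, not from any particular library.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    logp_*      : log-prob of the sequence under the model being trained
    ref_logp_*  : log-prob of the same sequence under the frozen reference model
    beta        : strength of the implicit KL constraint toward the reference
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin), computed as the numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the model assigns relatively more probability to the preferred output than the reference model does; when the two implicit rewards are equal, it sits at log 2, the value of an uninformed binary classifier.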
