A technique to fine-tune AI models using human preferences, making outputs more helpful, harmless, and aligned with human values.
Reinforcement Learning from Human Feedback (RLHF) is a training technique that aligns AI models with human preferences. Rather than optimising for next-token prediction alone, RLHF optimises for human-judged quality and helpfulness.
How RLHF works:
1. Supervised fine-tuning: the base model is first fine-tuned on example responses written or approved by humans.
2. Reward model training: human labellers compare pairs of model responses, and a separate reward model learns to predict which response people prefer (see the sketch after this list).
3. Reinforcement learning: the model is then optimised, commonly with PPO, to produce responses the reward model scores highly, with a penalty that keeps its behaviour close to the original model.
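The reward model in step 2 is the part most specific to RLHF. Below is a minimal, hypothetical sketch in PyTorch of a pairwise preference loss; it assumes pre-computed response embeddings and uses invented names (`RewardModel`, `preference_loss`) purely for illustration, not any particular library's API.

```python
# Minimal sketch of reward-model training, assuming PyTorch and
# pre-computed response embeddings instead of a full language model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Maps a response representation to a single scalar preference score."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)


def preference_loss(reward_model, chosen, rejected):
    """Pairwise (Bradley-Terry style) loss: push the human-preferred response
    to score higher than the rejected one."""
    score_chosen = reward_model(chosen)
    score_rejected = reward_model(rejected)
    return -F.logsigmoid(score_chosen - score_rejected).mean()


# Toy training step on random stand-in embeddings.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

chosen_batch = torch.randn(8, 768)    # embeddings of human-preferred responses
rejected_batch = torch.randn(8, 768)  # embeddings of less-preferred responses

loss = preference_loss(model, chosen_batch, rejected_batch)
loss.backward()
optimizer.step()
print(f"preference loss: {loss.item():.4f}")
```

The key point is that the loss only depends on the ranking between two responses, which is exactly the signal human labellers provide. In a real pipeline, the trained reward model's scores then drive the reinforcement-learning update in step 3.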
Why RLHF matters:
This technique is why ChatGPT feels helpful, safe, and conversational rather than simply completing text. RLHF is complex to implement, but crucial for public-facing AI applications.
Most businesses use models that have already been through RLHF, but understanding the technique helps Australian businesses evaluate model choices and explains why different models behave differently.
"Training a model to prefer helpful responses over technically correct but unhelpful ones - this is what makes ChatGPT conversational."