See how Reinforcement Learning from Human Feedback (RLHF) transforms language models
The RLHF Process
1. Generate: create multiple responses
2. Compare: collect human preferences
3. Learn: train a reward model
4. Improve: update the policy
User Question:
"How can I improve my mental health during stressful times?"
Step 1: Model Generates Multiple Responses
🤖 Response A
Just tough it out. Everyone gets stressed sometimes, and you need to learn to deal with it on your own. Don't be weak.
🤖 Response B
There are several evidence-based strategies that can help: regular exercise, mindfulness meditation, maintaining social connections, and ensuring adequate sleep. Consider speaking with a mental health professional if stress becomes overwhelming.
Model generated multiple response options...
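This generation step can be sketched in code. The snippet below is a minimal illustration assuming a Hugging Face transformers causal language model; the model name, sampling temperature, and token budget are assumptions, not part of the original demo.

```python
# Sketch: sample several candidate responses to the same prompt.
# Model name and sampling settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; a real system would use an instruction-tuned LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "How can I improve my mental health during stressful times?"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling (do_sample=True) with a nonzero temperature yields diverse
# candidates, like Response A and Response B above.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    max_new_tokens=100,
    num_return_sequences=2,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]
```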
Step 2: Human Evaluators Choose Preferences
🤖 Response A and 🤖 Response B (shown in Step 1) are presented side by side.
Human evaluators compare and choose the better response...
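In this example the evaluator prefers Response B: it is empathetic, evidence-based, and points toward professional support, while Response A is dismissive. Below is a hedged sketch of how one such judgment might be recorded; the field names (prompt, chosen, rejected) mirror common preference datasets but are an assumption here.

```python
# Sketch: one human preference record. Field names are illustrative,
# mirroring the chosen/rejected convention used by many RLHF datasets.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response the evaluator preferred (Response B)
    rejected: str  # the response the evaluator rejected (Response A)

pair = PreferencePair(
    prompt="How can I improve my mental health during stressful times?",
    chosen="There are several evidence-based strategies that can help: ...",
    rejected="Just tough it out. ...",
)
```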
Step 3: Reward Model Learns Human Preferences
The reward model analyzes the preference data and assigns scores:
Response A score: 2.1
Response B score: 8.7
💡 The reward model learns that helpful, empathetic, and evidence-based responses get higher scores!
Reward model learns to predict human preferences...
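The demo does not name a training objective, but reward models are commonly trained with a Bradley-Terry style pairwise loss that pushes the chosen response's score above the rejected one's. A minimal sketch under that assumption:

```python
# Sketch: pairwise (Bradley-Terry style) loss for training a reward model.
# The original demo does not specify a loss; this common choice is an assumption.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected): low when chosen responses score higher."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy check with the scores from the demo (B: 8.7, A: 2.1):
loss = pairwise_reward_loss(torch.tensor([8.7]), torch.tensor([2.1]))
# loss is close to zero because this preference is already strongly satisfied
```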
Step 4: Policy Optimization
🔄 Policy Update in Progress
The model adjusts its parameters to generate responses more like the high-scoring ones and less like the low-scoring ones.
Update rule: ↑ increase the probability of helpful responses
Update rule: ↓ decrease the probability of unhelpful responses
Model becomes more aligned with human values...
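Production RLHF systems usually perform this update with PPO against the reward model, plus a KL penalty that keeps the policy close to the original model. The sketch below is a heavily simplified, REINFORCE-style illustration of that idea; all names and numbers are assumptions.

```python
# Sketch: simplified reward-weighted policy update with a KL-style penalty.
# Real systems typically use PPO; this only illustrates the core idea.
import torch

def policy_update_loss(policy_logprob: torch.Tensor,     # log pi(response | prompt)
                       reference_logprob: torch.Tensor,  # log pi_ref(response | prompt)
                       reward: torch.Tensor,             # reward model score
                       kl_coeff: float = 0.1) -> torch.Tensor:
    """Loss whose gradient raises the probability of high-reward responses."""
    # Penalize drifting too far from the frozen, pre-RLHF reference model.
    shaped_reward = reward - kl_coeff * (policy_logprob - reference_logprob)
    # Minimizing -logprob * reward increases logprob where the reward is high.
    return -(policy_logprob * shaped_reward.detach()).mean()

# Toy usage with the demo's score for Response B (8.7):
logp = torch.tensor([-12.0], requires_grad=True)  # log-prob under the current policy
logp_ref = torch.tensor([-12.5])                  # log-prob under the reference model
loss = policy_update_loss(logp, logp_ref, reward=torch.tensor([8.7]))
loss.backward()  # the gradient pushes the policy to make Response B more likely
```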
The Transformation
❌ Before RLHF (Raw Model)
• Generates statistically likely text, whether or not it is helpful
• May produce unhelpful or harmful content
• Optimizes for next-token prediction, not human intent
• Has no explicit signal about human values
✅ After RLHF (Aligned Assistant)
• Generates helpful and safe responses
• Follows learned human preferences
• Optimizes for responses that people actually prefer
• More closely aligned with human values