Reinforcement learning from human feedback (RLHF) is a key technique for aligning language models with human preferences and values.
Process
- Collect human preference data: annotators compare or rank pairs of model responses to the same prompt.
- Train a reward model to predict which response humans prefer (see the sketch after this list).
- Optimize the policy with reinforcement learning (commonly PPO) against the learned reward.
- Iterate: gather new preferences on the updated policy's outputs and refine the reward model and policy.
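Below is a minimal sketch of the reward-model step, assuming pairwise preference data and a PyTorch setup. The TinyRewardModel, toy vocabulary size, and random token ids are illustrative stand-ins for a pretrained LM backbone and real labeled comparisons; only the Bradley-Terry pairwise loss reflects the standard recipe.

```python
# Minimal sketch: train a reward model on pairwise preferences with the
# Bradley-Terry loss. The tiny bag-of-tokens model and toy data below are
# illustrative placeholders, not a production setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1000  # assumed toy vocabulary size


class TinyRewardModel(nn.Module):
    """Scores a (prompt + response) token sequence with a single scalar."""

    def __init__(self, vocab_size: int = VOCAB_SIZE, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer ids; mean-pool then project to a scalar
        pooled = self.embed(tokens).mean(dim=1)
        return self.head(pooled).squeeze(-1)


def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


if __name__ == "__main__":
    model = TinyRewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy batch: token ids for the preferred and dispreferred responses
    chosen = torch.randint(0, VOCAB_SIZE, (8, 32))
    rejected = torch.randint(0, VOCAB_SIZE, (8, 32))

    for step in range(100):
        loss = preference_loss(model(chosen), model(rejected))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final pairwise loss: {loss.item():.4f}")
```

With real data, the chosen/rejected tensors would come from tokenized human comparisons, and the backbone would be the pretrained language model rather than a toy embedding layer.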
Challenges
- Expensive human labeling: collecting high-quality preference comparisons at scale is slow and costly.
- Reward hacking: the policy can exploit flaws in the learned reward model, scoring highly without genuinely improving; a KL penalty against a reference model is a common mitigation (see the sketch after this list).
- Preference aggregation: human raters disagree, and a single reward model must reconcile conflicting preferences.
- Scalability: each iteration requires fresh labels, reward-model retraining, and another round of RL fine-tuning.
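As a concrete example of the reward-hacking mitigation mentioned above, here is a minimal sketch of the KL-shaped reward used during the RL step: the reward-model score is reduced in proportion to how far the policy's per-token log-probabilities drift from a frozen reference model. The function name, tensor shapes, and beta value are illustrative assumptions, not a specific library's API.

```python
# Minimal sketch of a KL-shaped reward for the RL step. Penalizing divergence
# from a frozen reference model discourages the policy from exploiting quirks
# of the learned reward model. All names and values here are illustrative.
import torch


def shaped_rewards(reward_model_scores: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   reference_logprobs: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Combine a sequence-level reward with a per-sequence KL penalty.

    reward_model_scores: (batch,) scalar reward for each sampled response
    policy_logprobs / reference_logprobs: (batch, seq_len) per-token log-probs
    of the generated tokens under the current policy and the frozen reference.
    """
    # Per-sequence KL estimate: sum over tokens of log pi(a|s) - log pi_ref(a|s)
    kl_per_sequence = (policy_logprobs - reference_logprobs).sum(dim=-1)
    return reward_model_scores - beta * kl_per_sequence


if __name__ == "__main__":
    scores = torch.tensor([1.2, -0.3])
    pol = torch.randn(2, 16) - 2.0   # toy per-token log-probs for the policy
    ref = torch.randn(2, 16) - 2.0   # toy per-token log-probs for the reference
    print(shaped_rewards(scores, pol, ref))
```

In a full pipeline these log-probabilities come from scoring the sampled tokens under both models, and the shaped reward feeds the policy-gradient update; beta trades off reward maximization against staying close to the reference.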