Trust

RLHF

1 min read

In Short

Reinforcement Learning from Human Feedback: training AI models with a reward signal learned from human preference judgments.

RLHF is a key technique for aligning language models with human preferences and values.

Process

  1. Collect human preference data (e.g., annotators compare or rank model responses to the same prompt)
  2. Train a reward model to predict those preferences (a minimal sketch follows this list)
  3. Optimize the policy with RL against the reward model
  4. Iterate: gather preferences on the updated model and refine
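
The sketch below illustrates step 2 only, under simplifying assumptions: it uses PyTorch, a toy MLP scoring head, and random tensors in place of real (prompt, response) representations. The pairwise Bradley-Terry loss is the standard choice for RLHF reward models; everything else here is a placeholder, not a full pipeline.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy scoring head: maps a response representation to a scalar reward."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one scalar reward per example

def preference_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected): pushes the model to score the
    # human-preferred response above the rejected one (Bradley-Terry).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch: in practice these are representations of
# (prompt, response) pairs labelled by human annotators.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```

The trained reward model then replaces human labels as the reward signal in step 3, which is what makes the RL stage affordable.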

Challenges

  • Expensive human labeling
  • Reward hacking: the policy learns to exploit flaws in the reward model (see the objective below)
  • Preference aggregation: combining judgments from annotators who disagree
  • Scalability of human oversight as tasks grow harder
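
A common mitigation for reward hacking during the RL step is to penalize divergence from a frozen reference (pre-RLHF) model. A typical form of the objective, with learned reward $r_\phi$ and a tunable KL coefficient $\beta$:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)$$

The KL term keeps the policy close to the reference model, limiting how far it can drift toward outputs that merely exploit the reward model.
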
trust · training · alignment