Trust

RLHF

1 min read

In Short

Reinforcement Learning from Human Feedback: training AI models with a reward signal learned from human preference judgments.

RLHF is a key technique for aligning language models with human preferences and values.

Process

  1. Collect human preference data (e.g., annotators compare or rank model responses to the same prompt)
  2. Train a reward model to predict those preferences (a minimal sketch follows this list)
  3. Optimize the policy with RL against the reward model
  4. Iterate: gather preferences on the updated model and refine
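
The sketch below illustrates step 2 only, under simplifying assumptions: it uses PyTorch, a toy MLP scoring head, and random tensors in place of real (prompt, response) representations. The pairwise Bradley-Terry loss is the standard choice for RLHF reward models; everything else here is a placeholder, not a full pipeline.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy scoring head: maps a response representation to a scalar reward."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one scalar reward per example

def preference_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected): pushes the model to score the
    # human-preferred response above the rejected one (Bradley-Terry).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch: in practice these are representations of
# (prompt, response) pairs labelled by human annotators.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```

The trained reward model then replaces human labels as the reward signal in step 3, which is what makes the RL stage affordable.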

Challenges

  • Expensive human labeling
  • Reward hacking: the policy learns to exploit flaws in the reward model (see the objective below)
  • Preference aggregation: combining judgments from annotators who disagree
  • Scalability of human oversight as tasks grow harder
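
A common mitigation for reward hacking during the RL step is to penalize divergence from a frozen reference (pre-RLHF) model. A typical form of the objective, with learned reward $r_\phi$ and a tunable KL coefficient $\beta$:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)$$

The KL term keeps the policy close to the reference model, limiting how far it can drift toward outputs that merely exploit the reward model.
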
trust · training · alignment