Reward models convert subjective human preferences into scalar rewards for optimization.
Training
- Trained on human comparison data
- Learns from pairwise preferences (chosen vs. rejected responses)
- Predicts a scalar reward for each response
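The pairwise objective is commonly a Bradley-Terry style loss that pushes the chosen response's reward above the rejected one's. A minimal sketch (the function name and sample values are illustrative, not from any specific library):

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response clearly outscores the rejected one;
    large when the model's rewards violate the human preference."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the chosen response's reward pulls ahead.
print(pairwise_preference_loss(2.0, 0.0))  # clear margin: small loss
print(pairwise_preference_loss(0.0, 2.0))  # preference violated: large loss
```

In practice the scalars come from a learned model head, and the loss is averaged over a batch of preference pairs before backpropagation.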
Limitations
- Imperfect proxy for human values
- Can be gamed (reward hacking)
- Degrades under distribution shift
In short: a model trained to predict human preferences, used to guide AI training via reinforcement learning (as in RLHF).