Reward Model

In Short

A model trained to predict which outputs humans prefer. Its scores serve as the reward signal when an AI system is trained with reinforcement learning.

Reward models convert subjective human preferences into scalar rewards for optimization.

Training

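  • Human comparison data: annotators are shown candidate outputs for the same prompt and judge between them
  • Pairwise preferences: each judgment records which of two outputs was preferred
  • Scalar reward prediction: the model learns to assign one number per output, scoring preferred outputs higher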
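
The training steps above can be sketched with a Bradley-Terry style pairwise loss, one common way to turn pairwise preferences into a scalar reward. This is a minimal illustration and not a production recipe: the linear reward function, the two-dimensional feature vectors, and the names `pairwise_loss`, `reward`, and `train` are all assumptions made for the example.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def reward(weights, features):
    """Hypothetical linear reward model: one scalar per output."""
    return sum(w * x for w, x in zip(weights, features))

def train(pairs, dim, lr=0.1, epochs=200):
    """Fit weights by gradient descent so chosen outputs score above rejected ones."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin: -sigmoid(-margin).
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Made-up comparison data: each pair is (features of preferred, features of rejected).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.9, 0.2], [0.1, 0.8]),
]
weights = train(pairs, dim=2)
```

After training on these toy pairs, `reward(weights, chosen)` exceeds `reward(weights, rejected)` for each pair. Real reward models typically replace the linear function with a neural network, but the loss has the same shape.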

Limitations

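  • Imperfect proxy for values: the model captures only what the collected comparisons reveal about human preferences
  • Can be gamed: a policy optimized against the reward can find outputs that score highly without actually being preferred
  • Distribution shift: predictions become less reliable on outputs unlike those in the training comparisons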
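
A toy sketch of the "can be gamed" limitation, under loudly stated assumptions: `proxy_reward` is a hypothetical stand-in for a learned reward model that has latched onto a spurious cue (word count), and the "true" quality scores are assigned by hand purely for illustration.

```python
def proxy_reward(text: str) -> float:
    """Hypothetical flawed proxy: longer answers score higher."""
    return float(len(text.split()))

# Candidate answers mapped to a hand-assigned "true" quality (illustrative only).
candidates = {
    "Paris.": 1.0,
    "That is a great question, and there are many ways to think about it indeed.": 0.0,
}

# Optimizing against the proxy selects the padded answer,
# while the hand-assigned quality favors the short, correct one.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=candidates.get)
```

Here the two selections disagree: the proxy prefers the padded answer even though its assigned quality is lower, which is the kind of gap that optimization pressure tends to exploit.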