Reward Model

In Short

A model trained to predict human preferences, used to guide AI training via reinforcement learning.

Reward models convert subjective human preferences into scalar rewards for optimization.

Training

Human comparison data
Pairwise preferences
Scalar reward prediction
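
The training steps above can be sketched in a few lines. This is a minimal illustration, not a production recipe: it assumes a linear reward over a hypothetical two-feature description of each response (helpfulness, rudeness) and fits it with a Bradley-Terry style pairwise loss, one common choice for preference data. The feature names, toy pairs, and hyperparameters are all invented for the example.

```python
import math

# Hypothetical toy data: each pair is (chosen_features, rejected_features),
# where features are [helpfulness, rudeness] scores invented for illustration.
PAIRS = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
    ([0.9, 0.2], [0.9, 0.8]),
]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reward(weights, features):
    """Scalar reward: a linear score over a response's feature vector."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """Bradley-Terry loss: -log sigmoid(reward(chosen) - reward(rejected))."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    # Equivalent to log(1 + exp(-margin)), written to avoid overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def train(pairs, dim, lr=0.5, epochs=200):
    """Fit the reward weights by stochastic gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin is -sigmoid(-margin).
            grad_scale = -sigmoid(-margin)
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

weights = train(PAIRS, dim=2)
```

After training, `reward(weights, features)` returns a single number per response, and the preferred response in each training pair scores higher than the rejected one. In practice the linear scorer would typically be replaced by a neural network, but the loss has the same shape.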

Limitations

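Imperfect proxy for human values
Can be gamed by the policy being optimized (reward hacking)
Less reliable under distribution shift, when outputs drift from the training comparisons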

trust, training, alignment
