Preference Learning Demo

How It Works

Instead of labeling each response as "good" or "bad", humans simply compare pairs of responses and choose which one is better. This comparative approach captures subtle human preferences that would be hard to specify explicitly.
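To make the comparison idea concrete, here is a minimal sketch of how one pairwise judgment could be stored and turned into a probability. The demo does not say which preference model it uses; this assumes the Bradley-Terry form, a common choice, and the names `Comparison`, `preference_probability`, and `pairwise_loss` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Comparison:
    """One human judgment: which of two responses to a query was preferred."""
    query: str
    chosen: str    # the response the human picked
    rejected: str  # the response the human passed over


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry model: probability that the chosen response wins.

    Only the difference between the two scalar scores matters.
    """
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the observed human choice."""
    return -math.log(preference_probability(score_chosen, score_rejected))
```

Under this assumption, two equal scores give a probability of 0.5, and the loss shrinks as the chosen response's score rises above the rejected one's. No absolute "good" or "bad" label is ever needed.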

User Query:

"What's the best way to learn a new programming language?"

🤖 Response A

Just pick any language and start coding immediately. Don't waste time with tutorials or books - you'll figure it out as you go.

🤖 Response B

Start with fundamentals through structured tutorials, practice with small projects, read others' code, and gradually tackle more complex challenges. Set aside regular practice time and don't hesitate to ask questions in programming communities.

👆 Choose which response you think is more helpful!

Reward Model Learning

As you make more comparisons, the AI learns to predict human preferences:

[Score scale: from "Poor Response" to "Great Response"]

The model learns to assign higher scores to responses that humans prefer.
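The following is a rough sketch of how repeated comparisons could push scores apart, not the demo's actual implementation. It assumes a linear reward model trained by gradient descent on a pairwise logistic loss. The two-element feature vectors and the function names `reward` and `train_reward_model` are hypothetical stand-ins for whatever representation a real system would use.

```python
import math


def reward(weights, features):
    """Scalar score of a response: a dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """Fit weights so that chosen responses score higher than rejected ones.

    `pairs` is a list of (chosen_features, rejected_features) tuples.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            p_chosen = 1.0 / (1.0 + math.exp(-margin))
            # Gradient step on -log(p_chosen): move weights toward the
            # chosen features and away from the rejected ones.
            step = lr * (1.0 - p_chosen)
            for i in range(n_features):
                weights[i] += step * (chosen[i] - rejected[i])
    return weights


# Hypothetical features: [recommends structured practice, dismisses tutorials]
features_a = [0.0, 1.0]  # stands in for Response A
features_b = [1.0, 0.0]  # stands in for Response B

# Five simulated comparisons in which the human picked Response B.
pairs = [(features_b, features_a)] * 5
weights = train_reward_model(pairs, n_features=2)

score_a = reward(weights, features_a)
score_b = reward(weights, features_b)
print(f"Response A: {score_a:.2f}  Response B: {score_b:.2f}")
```

With these simulated choices, Response B ends up with the higher score. Each update is scaled by how surprised the model was by the human's choice, so the updates shrink as its predictions come to agree with the comparisons.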
