
Refactor reward/verifier setup #594

Closed
wants to merge 73 commits into from

Conversation

hamishivi (Collaborator) commented Mar 6, 2025

A bit of a revamp of how the apply_verifiable_rewards function works:

  • We now accept a list of ground truths and answers, instead of just one.
  • When there is more than one, the rewards are added together.
  • All verifiers inherit from a basic class that defines the core API.
  • We return per-verifier scores along with the summed total score, and log verification scores and rates separately. Note that some verifiers return a continuous value (e.g. the max len sample), so the rate just counts how often the reward is non-zero.
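
For readers without the diff in front of them, here is a minimal sketch of the shape described in the list above. It is not the code from this PR: every name in it (`Verifier`, `ExactMatchVerifier`, `LengthBudgetVerifier`, `score_sample`, `nonzero_rate`) is hypothetical, and it assumes each sample carries a list of ground truths paired with the name of the verifier that should check it.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Verifier(ABC):
    """Hypothetical base class: the shared API every verifier implements."""

    name: str

    @abstractmethod
    def __call__(self, answer: str, ground_truth: str) -> float:
        """Return a reward for `answer` given `ground_truth`."""


class ExactMatchVerifier(Verifier):
    """Binary verifier: 1.0 on an exact (whitespace-stripped) match, else 0.0."""

    name = "exact_match"

    def __call__(self, answer: str, ground_truth: str) -> float:
        return 1.0 if answer.strip() == ground_truth.strip() else 0.0


class LengthBudgetVerifier(Verifier):
    """Illustrative stand-in for a continuous-valued verifier.

    Assumption: the ground truth is a character budget; the reward decays
    linearly from 1.0 to 0.0 as the answer overshoots that budget.
    """

    name = "length_budget"

    def __call__(self, answer: str, ground_truth: str) -> float:
        budget = max(int(ground_truth), 1)  # guard against a zero budget
        overshoot = len(answer) - budget
        if overshoot <= 0:
            return 1.0
        return max(0.0, 1.0 - overshoot / budget)


def score_sample(answer, ground_truths, verifier_names, registry):
    """Score one answer against several ground truths.

    Returns (total, per_verifier), where total is the sum of all rewards.
    """
    per_verifier = defaultdict(float)
    for ground_truth, verifier_name in zip(ground_truths, verifier_names):
        per_verifier[verifier_name] += registry[verifier_name](answer, ground_truth)
    return sum(per_verifier.values()), dict(per_verifier)


def nonzero_rate(scores):
    """Fraction of scores that are non-zero (used as the 'verification rate')."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s != 0) / len(scores)


registry = {v.name: v for v in (ExactMatchVerifier(), LengthBudgetVerifier())}
total, per_verifier = score_sample(
    "42", ["42", "4"], ["exact_match", "length_budget"], registry
)
```

Counting non-zero rewards keeps a single rate definition that works for both binary and continuous verifiers, at the cost of treating a small partial reward the same as a full pass.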

Edit: there still seems to be a bug in how the rate is logged. I will work it out.

@vwxyzjn, let me know if you want me to make this PR into main, or merge into GRPO-fast and also edit that code for you. I don't mind either way. I need this code for my token control experiments, though!

@hamishivi hamishivi closed this Mar 6, 2025
@hamishivi hamishivi reopened this Mar 6, 2025
vwxyzjn (Collaborator) commented Mar 6, 2025

Looks good. Let's probably merge this into the grpo packing branch.

hamishivi (Collaborator, Author)

Closing to re-target packing branch.

@hamishivi hamishivi closed this Mar 6, 2025
2 participants