Refactor reward/verifier setup #594

hamishivi · 2025-03-06T05:04:59Z

Bit of a revamp of the way the apply_verifiable_rewards function works:

We now allow a list of ground truths and answers per sample, instead of just one.
In this case, the rewards are summed.
All verifiers inherit from a basic class that defines the core API.
We return per-verifier scores along with the summed total score, and log verification scores and rates separately. Note that some verifiers return a continuous value (e.g. the max len one), so the rate simply counts how often the reward is non-zero.
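For readers skimming the thread, here is a minimal sketch of the shape described above. It is not the code from this PR: the class names, the `apply_rewards_sketch` function, its signature, and the max-length scoring formula are all assumptions made up for illustration. It only shows the general idea of a shared base class, one score per verifier, a summed total, and a rate that counts non-zero rewards.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Verifier(ABC):
    """Hypothetical base class: every verifier exposes the same call signature."""

    name: str = "base"

    @abstractmethod
    def __call__(self, prediction: str, ground_truth: str) -> float:
        """Return a reward; 0.0 means the check failed."""


class ExactMatchVerifier(Verifier):
    """Binary reward: 1.0 if the stripped strings are identical."""

    name = "exact_match"

    def __call__(self, prediction: str, ground_truth: str) -> float:
        return 1.0 if prediction.strip() == ground_truth.strip() else 0.0


class MaxLenVerifier(Verifier):
    """Continuous reward (assumed formula): 1.0 within the word budget,
    decreasing linearly to 0.0 at twice the budget."""

    name = "max_len"

    def __call__(self, prediction: str, ground_truth: str) -> float:
        budget = int(ground_truth)  # here the "ground truth" is a word budget
        n_words = len(prediction.split())
        if n_words <= budget:
            return 1.0
        return max(0.0, 1.0 - (n_words - budget) / budget)


# Hypothetical registry mapping verifier names to instances.
VERIFIERS = {v.name: v for v in (ExactMatchVerifier(), MaxLenVerifier())}


def apply_rewards_sketch(predictions, ground_truths, verifier_names):
    """Score each prediction against a list of (ground truth, verifier) pairs.

    Returns the summed reward per sample and the raw scores per verifier.
    """
    totals = []
    per_verifier = defaultdict(list)
    for pred, gts, names in zip(predictions, ground_truths, verifier_names):
        total = 0.0
        for gt, name in zip(gts, names):
            score = VERIFIERS[name](pred, gt)
            per_verifier[name].append(score)
            total += score  # rewards from multiple verifiers are added
        totals.append(total)
    return totals, dict(per_verifier)


def verification_rate(scores):
    """Fraction of samples whose reward is non-zero."""
    return sum(s != 0 for s in scores) / len(scores) if scores else 0.0
```

Because the max-length score is continuous, averaging it would not give a pass rate, which is why the rate helper counts non-zero rewards instead.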

Edit: there still seems to be a bug in how the rate is logged; I will work it out.

@vwxyzjn, let me know if you want me to make this PR into main, or merge into GRPO-fast and also edit that code for you. I don't mind either way. I need this code for my token control experiments, though!


vwxyzjn · 2025-03-06T15:40:24Z

Looks good. Let's merge this to grpo packing branch probably.

hamishivi · 2025-03-06T17:55:07Z

Closing to re-target packing branch.

hamishivi and others added 30 commits February 21, 2025 13:20

first pass at mult verifies + max length check

100ddc3

update

42973b9

minor tweak

9881cc1

fix bug

e9c6882

fix

77b0587

bug fixes

e5da4a9

push changes

4a9eb41

quick change

afc072e

add sequence length eval

e4d1f7b

Merge branch 'main' into mult-verify-max-len

59753b0

trying a new reward function

a0c60bf

fix

3134196

fix

16c21d8

better logging

c6467c5

add tokens per second metric

8901287

allow training with mini batches

c0252ca

fix index out of bound issues

89dfbe3

return to previous setting

9f4af92

change it back, but per_device_train_batch_size > 1 does not work.

78a27f1

ok now pdbs>1 should work, accumulation steps was wrong

b1c9c3b

update tokens per second calculation based on iteration instead

7aea2cc

add data thread

95d61f0

graceful shutdown

42c5df7

making the save logic works

bae1a64

refactor

f4618d4

Fixes here

8cdbd1f

remove unused

864deee

add better traceback

2056a8c

pin collatoed tensors

d67c83b

send the queries data early, so as not to block the data preparation thread.

3d0166a

hamishivi and others added 23 commits March 5, 2025 16:48

fix

2a9bfd3

fix

3eb7cd3

fix

2f0e562

fix

3b1d355

fix

ddf0896

fix

c05ace1

fix

c47a27e

fix

1ab9622

fix

a8caac6

fix

50ff527

fix

7386492

i am silly

7d1c5d9

fix max len func

61f4c05

fix bug

1eeaaba

fix bug

7331f6d

fix bug

b5a7672

fix bug

a3c918f

fix bug

41d7501

fix logging?

ac8ae43

fix logging?

8d63939

fix logging?

b4ba794

fix logging?

b1f7758

lint

f37de8f

hamishivi closed this Mar 6, 2025

hamishivi reopened this Mar 6, 2025

hamishivi added 2 commits March 6, 2025 09:51

Merge branch 'grpo-fast-pro' into mult-verify-max-len

c364a0e

edits for grpo fast

7118c2b

hamishivi closed this Mar 6, 2025
