
[NOT MEANT TO MERGE!] GRPO reward func for coding dataset #105

Open · wants to merge 2 commits into main
Conversation

August-murr

refer to #28

This is an example of how to use the OpenCoders dataset to create a reward function for the GRPOTrainer. The reward function parses the generated code, wraps it in an evaluation script, and executes that script in an E2B sandbox, scoring accuracy as the fraction of test cases that pass and also measuring execution time.
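To make the idea concrete, here's a minimal sketch of such a reward function, assuming plain-text completions and a `test_cases` dataset column that holds a list of assert-style expressions (e.g. `"solution([1, 2]) == 3"`) per sample. The `extract_code` helper, the column name, and the test format are illustrative assumptions rather than the actual OpenCoders schema, and the execution-time component is omitted for brevity. It uses the `e2b_code_interpreter` SDK and expects an `E2B_API_KEY` in the environment:

```python
import re

from e2b_code_interpreter import Sandbox  # pip install e2b-code-interpreter


def extract_code(completion: str) -> str:
    """Return the first fenced Python block in a completion, else the raw text."""
    match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    return match.group(1) if match else completion


def code_reward(prompts, completions, test_cases, **kwargs):
    """Reward each completion with the fraction of its test cases that pass."""
    rewards = []
    for completion, tests in zip(completions, test_cases):
        code = extract_code(completion)
        # Build an evaluation script: run the candidate code, then execute
        # every test in its own try/except so one failure doesn't hide the
        # rest, and finally print the pass rate.
        harness = code + "\npassed = 0\n"
        for test in tests:
            harness += (
                "try:\n"
                f"    assert {test}\n"
                "    passed += 1\n"
                "except Exception:\n"
                "    pass\n"
            )
        harness += f"print(passed / {max(len(tests), 1)})\n"

        # A fresh sandbox per completion avoids state leaking between
        # candidates, at the cost of startup time.
        with Sandbox() as sandbox:
            execution = sandbox.run_code(harness)
        stdout = "".join(execution.logs.stdout)
        try:
            rewards.append(float(stdout.strip()))
        except ValueError:
            rewards.append(0.0)  # syntax error, crash, or no output
    return rewards
```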

It's not perfect since there are edge cases and potential failures that have not been addressed.

I haven't been able to test it end-to-end with GRPO due to unrelated issues, so I'd appreciate it if anyone could try it out and report whether it works properly or hits other problems.
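For anyone willing to test it, a hypothetical wiring into the GRPOTrainer could look like the sketch below. The dataset path and model name are placeholders; the dataset is assumed to expose `prompt` and `test_cases` columns, and TRL forwards extra dataset columns like `test_cases` to the reward function as keyword arguments:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset: assumed to provide "prompt" and "test_cases" columns.
dataset = load_dataset("path/to/opencoders-style-dataset", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any causal LM; placeholder choice
    reward_funcs=code_reward,            # the function sketched above
    args=GRPOConfig(output_dir="grpo-code-rewards"),
    train_dataset=dataset,
)
trainer.train()
```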

A similar approach was used to train R1, although it likely relied solely on a LeetCode dataset.

kalogyu commented Jan 31, 2025

Thanks for sharing the approach. I understand that this implementation is still in the testing phase and has some potential edge cases. I'll be happy to help test it out and look for any issues.

