Solution for Potential Inflation of Reward Metrics for Unparseable Go… #87

Open
wants to merge 1 commit into main

Conversation

agulati18

Solution Pathway #2 in Issue #86

Recap:

In the accuracy_reward function, the current implementation assigns a reward of 1.0 when the gold solution cannot be parsed. This artificially inflates the reported reward metric, as it rewards the model regardless of its actual performance.

Link to Code

if gold_parsed is not None:
    reward = float(verify(answer_parsed, gold_parsed))
else:
    reward = 1.0  # Artificially inflates metrics
    print("Failed to parse gold solution: ", sol)
rewards.append(reward)

Proposed Fix in This PR:

Assign a neutral reward of 0.5 for unparseable gold solutions instead of 1.0. This adjustment provides a more balanced evaluation of model performance while still acknowledging the ambiguity of such cases.

if gold_parsed is not None:
    reward = float(verify(answer_parsed, gold_parsed))
else:
    reward = 0.5  # Neutral reward
    print("Failed to parse gold solution: ", sol)
rewards.append(reward)
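
For context, here is a minimal sketch of how this branch might sit inside the full reward function. The function name, loop structure, and the parse_answer helper are assumptions filled in around the snippet above rather than the exact upstream code; verify is the same checker already used in the snippet.

def accuracy_reward(completions, solutions):
    # Sketch only: score each completion against its gold solution.
    # parse_answer is a hypothetical stand-in for whatever produces
    # answer_parsed / gold_parsed above; verify is used as in the snippet.
    rewards = []
    for completion, sol in zip(completions, solutions):
        gold_parsed = parse_answer(sol)
        answer_parsed = parse_answer(completion)
        if gold_parsed is not None:
            reward = float(verify(answer_parsed, gold_parsed))
        else:
            reward = 0.5  # Neutral reward for unparseable gold solutions
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)
    return rewards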

@mradityagoyal

Can you elaborate on why this should be a neutral 0.5 and not something lower?

@agulati18
Author

Assigning a score of 0.5 for unparseable gold solutions was intended to be neutral, ensuring the model is neither penalised nor rewarded for examples we cannot evaluate.

Alternatively, we could exclude such examples entirely, omitting them from the evaluation and averaging performance only over the subset of cases where cross-referencing is possible.

@mradityagoyal

Thanks.
I am not sure if 0.5 is neutral either. My vote is to skip those examples altogether... But I am not sure how to do that.

@agulati18
Author

Here’s how we might skip unparseable examples, as discussed in the issue I raised:

if gold_parsed is not None:
    reward = float(verify(answer_parsed, gold_parsed))
else:
    print("Failed to parse gold solution: ", sol)
    continue  # Skip this example entirely
rewards.append(reward)

Even so, I would still favour a neutral reward (e.g. 0.5) over skipping unparseable gold solutions, to keep the evaluation fair, accurate, and diagnostically useful.

Simply skipping these cases introduces bias, distorts model accuracy, and makes comparisons between models unreliable.

When unparseable examples are removed from both the numerator and denominator, the accuracy metric no longer reflects the model's real-world performance. If these cases correlate with complexity, skipping them artificially inflates accuracy by removing difficult examples that the model might systematically fail on. This creates a misleading picture of performance, hiding failure patterns rather than addressing them.

Skipping also breaks comparability between models. If two models encounter different numbers of unparseable cases, their accuracy scores are computed over different denominators, making direct comparison unreliable. A model with fewer parsing failures may appear stronger, even if its actual performance on valid cases is the same or worse.

By using a neutral reward, we ensure that all models are evaluated consistently, preventing unfair comparisons.
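
To make the denominator point concrete, here is a toy calculation with invented counts; the two models, their correct counts, and their parse-failure counts are purely hypothetical.

# Toy numbers: both models answer 80 of their parseable examples correctly
# (out of 100 total), but hit different numbers of unparseable gold solutions.
total = 100
models = {"A": {"correct": 80, "unparseable": 10},
          "B": {"correct": 80, "unparseable": 5}}

for name, m in models.items():
    parseable = total - m["unparseable"]
    skip_acc = m["correct"] / parseable                            # denominator differs per model
    neutral_acc = (m["correct"] + 0.5 * m["unparseable"]) / total  # fixed denominator of 100
    print(name, round(skip_acc, 3), round(neutral_acc, 3))
# A 0.889 0.85
# B 0.842 0.825

Under skipping, the two scores are averages over 90 and 95 examples respectively, so they are not directly comparable; with the neutral reward, both models are averaged over the same 100 examples.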
