Solution for Potential Inflation of Reward Metrics for Unparseable Go… #87

Open
wants to merge 1 commit into main

Conversation

agulati18

Solution Pathway #2 in Issue #86

Recap:

In the accuracy_reward function, the current implementation assigns a reward of 1.0 when the gold solution cannot be parsed. This artificially inflates the reported reward metric, as it rewards the model regardless of its actual performance.

Link to Code

if gold_parsed is not None:
    reward = float(verify(answer_parsed, gold_parsed))
else:
    reward = 1.0  # Artificially inflates metrics
    print("Failed to parse gold solution: ", sol)
rewards.append(reward)

Proposed Fix in This PR:

Assign a neutral reward of 0.5 for unparseable gold solutions instead of 1.0. This adjustment provides a more balanced evaluation of model performance while still acknowledging the ambiguity of such cases.

if gold_parsed is not None:
    reward = float(verify(answer_parsed, gold_parsed))
else:
    reward = 0.5  # Neutral reward
    print("Failed to parse gold solution: ", sol)
rewards.append(reward)
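
For context, here is a minimal sketch of how this branch might sit inside the full reward function. The function name, loop structure, and the parse_answer helper are assumptions filled in around the snippet above rather than the exact upstream code; verify is the same checker already used in the snippet.

def accuracy_reward(completions, solutions):
    # Sketch only: score each completion against its gold solution.
    # parse_answer is a hypothetical stand-in for whatever produces
    # answer_parsed / gold_parsed above; verify is used as in the snippet.
    rewards = []
    for completion, sol in zip(completions, solutions):
        gold_parsed = parse_answer(sol)
        answer_parsed = parse_answer(completion)
        if gold_parsed is not None:
            reward = float(verify(answer_parsed, gold_parsed))
        else:
            reward = 0.5  # Neutral reward for unparseable gold solutions
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)
    return rewards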

@mradityagoyal

Can you elaborate on why this should be a neutral 0.5 and not something lower?

@agulati18
Author

Assigning a score of 0.5 for unparseable gold solutions was intended to be neutral, ensuring the model is neither penalised nor rewarded for examples we cannot evaluate.

Alternatively, we could exclude such examples entirely, omitting them from the evaluation and averaging performance only over the subset of cases where cross-referencing is possible.

@mradityagoyal

Thanks.
I am not sure if 0.5 is neutral either. My vote is to skip those examples altogether... But I am not sure how to do that.

@agulati18
Author

Here’s how we might skip unparseable examples, as discussed in the issue I raised:

if gold_parsed is not None:
    reward = float(verify(answer_parsed, gold_parsed))
else:
    print("Failed to parse gold solution: ", sol)
    continue  # Skip this example entirely
rewards.append(reward)

Even so, I would still favour a neutral reward (e.g. 0.5) over skipping unparseable gold solutions, to keep the evaluation fair, accurate, and diagnostically useful.

Simply skipping these cases introduces bias, distorts model accuracy, and makes comparisons between models unreliable.

When unparseable examples are removed from both the numerator and denominator, the accuracy metric no longer reflects the model's real-world performance. If these cases correlate with complexity, skipping them artificially inflates accuracy by removing difficult examples that the model might systematically fail on. This creates a misleading picture of performance, hiding failure patterns rather than addressing them.

Skipping also breaks comparability between models. If two models encounter different numbers of unparseable cases, their accuracy scores are computed over different denominators, making direct comparison unreliable. A model with fewer parsing failures may appear stronger, even if its actual performance on valid cases is the same or worse.

By using a neutral reward, we ensure that all models are evaluated consistently, preventing unfair comparisons.
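
To make the denominator point concrete, here is a toy calculation with invented counts; the two models, their correct counts, and their parse-failure counts are purely hypothetical.

# Toy numbers: both models answer 80 of their parseable examples correctly
# (out of 100 total), but hit different numbers of unparseable gold solutions.
total = 100
models = {"A": {"correct": 80, "unparseable": 10},
          "B": {"correct": 80, "unparseable": 5}}

for name, m in models.items():
    parseable = total - m["unparseable"]
    skip_acc = m["correct"] / parseable                            # denominator differs per model
    neutral_acc = (m["correct"] + 0.5 * m["unparseable"]) / total  # fixed denominator of 100
    print(name, round(skip_acc, 3), round(neutral_acc, 3))
# A 0.889 0.85
# B 0.842 0.825

Under skipping, the two scores are averages over 90 and 95 examples respectively, so they are not directly comparable; with the neutral reward, both models are averaged over the same 100 examples.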
