Medhelm judges #3484

Open · wants to merge 7 commits into main
Conversation

@MiguelAFH (Collaborator) commented Mar 27, 2025

MedHELM contains several open-ended benchmarks where the current main metric is BERTScore. While this metric is widely accepted in the literature, we want to make our evaluation metrics more robust.

To do so, we propose using an LLM as a judge for the open-ended benchmarks, with standardized evaluation criteria for each scenario and judges from multiple vendors to reduce bias. A configuration sketch follows the list of default judges below.

Default judges:

  • GPT-4o
    • model_name="openai/gpt-4o-2024-05-13"
    • model_deployment="stanfordhealthcare/gpt-4o-2024-05-13"
  • Llama 3.3 70B Instruct
    • model_name="meta/llama-3.3-70b-instruct"
    • model_deployment="stanfordhealthcare/llama-3.3-70b-instruct"
  • Claude 3.7 Sonnet
    • model_name="anthropic/claude-3-7-sonnet-20250219"
    • model_deployment="stanfordhealthcare/claude-3-7-sonnet-20250219"

Adding @suhana13, @aunell, and @HennyJie FYI.

@MiguelAFH MiguelAFH requested a review from yifanmai March 27, 2025 18:41
@MiguelAFH MiguelAFH self-assigned this Mar 27, 2025
@yifanmai (Collaborator) left a comment

Looks good at a high level. Left some optional suggestions; feel free to address in this pull request, or merge and open a new request.

hlog("WARNING: Annotator skipped sending requests because the model response was empty")
return {
"prompt_text": None,
"empty_output_equivalence_judgement": False,

Probably want to rename this key.

prompt_template: str,
annotation_criteria: Dict[str, Set[str]],
annotator_models: Dict[str, AnnotatorModelInfo],
preprocessor: Optional[Callable[[str], str]] = None,

nit: output_preprocessor

# Attempt to fix incomplete JSON by adding a closing brace
annotator_output = annotator_output + "}"
try:
    annotator_criteria = json.loads(annotator_output)

annotation_criteria -> annotation?

"""
self._auto_client = auto_client
self._prompt_template = prompt_template
self._annotation_criteria = annotation_criteria

nit: annotation_schema or annotation_format

metric_args = {
    "task": "mtsamples_replicate",
    "device": get_torch_device_name(),
    "bertscore_model": "distilbert-base-uncased",
    "rescale_with_baseline": False,
}

metric_specs = get_summarization_metric_specs(metric_args) + [
    MetricSpec(class_name="helm.benchmark.metrics.mtsamples_replicate_metrics.MTSamplesReplicateMetric", args={})

optional nit: you can omit args={} in most specs.
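
For example, relying on the note above that args can be omitted, the spec could be shortened to:

metric_specs = get_summarization_metric_specs(metric_args) + [
    MetricSpec(class_name="helm.benchmark.metrics.mtsamples_replicate_metrics.MTSamplesReplicateMetric")
]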

    metric_service: MetricService,
    eval_cache_path: str,
) -> List[Stat]:
    assert request_state.annotations

Is there some way that the commonality between these metrics could be refactored out? See AnnotationLikertScaleMetric here for an example that you could follow.
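
As a rough sketch of the kind of shared structure being suggested (this is not the actual AnnotationLikertScaleMetric API; the class name, constructor parameters, and import paths below are assumptions), the per-criterion score extraction could live in one place that each scenario metric parameterizes:

from typing import Any, Dict, List

from helm.benchmark.metrics.metric_name import MetricName  # import path assumed
from helm.benchmark.metrics.statistic import Stat  # import path assumed


class AnnotationCriteriaMetric:
    """Hypothetical shared helper: turns per-criterion judge annotations into Stats."""

    def __init__(self, annotator_name: str, criteria: List[str]):
        self._annotator_name = annotator_name  # e.g. "mtsamples_replicate"
        self._criteria = criteria  # e.g. ["accuracy", "completeness", "clarity"]

    def stats_from_annotations(self, annotations: Dict[str, Any]) -> List[Stat]:
        # Each judge is assumed to emit {"score": ..., "explanation": ...} per criterion,
        # matching the {"score", "explanation"} sets in the annotation criteria in this PR.
        annotation = annotations[self._annotator_name]
        return [
            Stat(MetricName(criterion)).add(annotation[criterion]["score"])
            for criterion in self._criteria
        ]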

"clarity": {"score", "explanation"},
}

ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {

If you need to switch between different predefined deployments in different environments, you could make an environment enum (e.g. "shc", "medical") and then do one of the below:

  1. Add a parameter to the run spec function that specifies an environment enum value
  2. Read in a shell environment variable to get the environment enum value
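
A possible shape for option 2 (the enum, mapping, and environment variable name below are illustrative, not part of this PR):

import os
from enum import Enum
from typing import Dict


class JudgeEnvironment(Enum):  # hypothetical enum; values taken from the examples above
    SHC = "shc"
    MEDICAL = "medical"


# Hypothetical mapping from environment to predefined judge deployments; each value
# would be a Dict[str, AnnotatorModelInfo] like ANNOTATOR_MODELS in this PR.
ANNOTATOR_MODELS_BY_ENVIRONMENT: Dict[JudgeEnvironment, dict] = {
    JudgeEnvironment.SHC: {},      # stanfordhealthcare/... deployments
    JudgeEnvironment.MEDICAL: {},  # deployments available in the "medical" environment
}

# Option 2: read the environment from a shell variable (variable name is an assumption).
environment = JudgeEnvironment(os.environ.get("MEDHELM_JUDGE_ENV", JudgeEnvironment.SHC.value))
ANNOTATOR_MODELS = ANNOTATOR_MODELS_BY_ENVIRONMENT[environment]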
