Medhelm judges #3484
base: main
Conversation
Looks good at a high level. Left some optional suggestions; feel free to address in this pull request, or merge and open a new request.
hlog("WARNING: Annotator skipped sending requests because the model response was empty") | ||
return { | ||
"prompt_text": None, | ||
"empty_output_equivalence_judgement": False, |
Probably want to rename this key.
prompt_template: str,
annotation_criteria: Dict[str, Set[str]],
annotator_models: Dict[str, AnnotatorModelInfo],
preprocessor: Optional[Callable[[str], str]] = None,
nit: output_preprocessor
# Attempt to fix incomplete JSON by adding a closing brace
annotator_output = annotator_output + "}"
try:
    annotator_criteria = json.loads(annotator_output)
annotation_criteria -> annotation?
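For reference, a fuller version of the truncation-repair pattern quoted above might look like the sketch below; the helper name and the fall-back-to-None convention are illustrative assumptions, not the code in this PR.

```python
import json
from typing import Any, Dict, Optional


def parse_judge_json(annotator_output: str) -> Optional[Dict[str, Any]]:
    """Parse a judge's JSON output, retrying once after appending a closing brace.

    Hypothetical helper sketching the repair step in the excerpt above.
    """
    try:
        return json.loads(annotator_output)
    except json.JSONDecodeError:
        pass
    # Judge responses are sometimes truncated mid-object; appending "}" is a
    # cheap attempt to recover a parseable payload before giving up.
    try:
        return json.loads(annotator_output + "}")
    except json.JSONDecodeError:
        return None
```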
""" | ||
self._auto_client = auto_client | ||
self._prompt_template = prompt_template | ||
self._annotation_criteria = annotation_criteria |
nit: annotation_schema or annotation_format
metric_args = {
    "task": "mtsamples_replicate",
    "device": get_torch_device_name(),
    "bertscore_model": "distilbert-base-uncased",
    "rescale_with_baseline": False,
}

metric_specs = get_summarization_metric_specs(metric_args) + [
    MetricSpec(class_name="helm.benchmark.metrics.mtsamples_replicate_metrics.MTSamplesReplicateMetric", args={})
optional nit: you can omit args={} in most specs.
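For example, the spec from the excerpt above with the empty args dropped (assuming args defaults to an empty dict, as the comment suggests):

```python
from helm.benchmark.metrics.metric import MetricSpec

metric_spec = MetricSpec(
    class_name="helm.benchmark.metrics.mtsamples_replicate_metrics.MTSamplesReplicateMetric"
)
```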
    metric_service: MetricService,
    eval_cache_path: str,
) -> List[Stat]:
    assert request_state.annotations
Is there some way that the commonality between these metrics could be refactored out? See AnnotationLikertScaleMetric here for an example that you could follow.
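In the spirit of that suggestion, the shared piece might look roughly like the sketch below. The class name, constructor arguments, and the assumed annotation layout (criterion -> {"score", "explanation"}) are guesses based on the excerpts in this PR, not the existing AnnotationLikertScaleMetric API.

```python
from typing import Dict, List

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.metrics.metric import Metric
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat


class AnnotationCriteriaMetric(Metric):
    """Hypothetical shared metric: one Stat per judged criterion.

    Assumes each scenario's annotator stores, under its own name, a dict of
    criterion -> {"score": ..., "explanation": ...}.
    """

    def __init__(self, annotator_name: str, criteria: List[str]):
        self._annotator_name = annotator_name
        self._criteria = criteria

    def evaluate_generation(
        self,
        adapter_spec: AdapterSpec,
        request_state: RequestState,
        metric_service: MetricService,
        eval_cache_path: str,
    ) -> List[Stat]:
        assert request_state.annotations
        annotation: Dict = request_state.annotations[self._annotator_name]
        stats: List[Stat] = []
        for criterion in self._criteria:
            # e.g. emits "mtsamples_replicate_clarity" with the judge's numeric score.
            stats.append(
                Stat(MetricName(f"{self._annotator_name}_{criterion}")).add(annotation[criterion]["score"])
            )
        return stats
```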
"clarity": {"score", "explanation"}, | ||
} | ||
|
||
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = { |
If you need to switch between different predefined deployments in different environments, you could make an environment enum (e.g. "shc", "medical") and then do one of the below:
- Add a parameter to the run spec function that specifies an environment enum value
- Read in a shell environment variable to get the environment enum value
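A minimal sketch of the second option, using the enum values from the comment above; the MEDHELM_ENV variable name is a placeholder, and the import path for AnnotatorModelInfo is assumed from the diff excerpts.

```python
import os
from enum import Enum
from typing import Dict

from helm.benchmark.annotation.model_as_judge import AnnotatorModelInfo


class JudgeEnvironment(Enum):
    SHC = "shc"
    MEDICAL = "medical"


# One predefined set of judge deployments per environment (left empty here;
# each would be populated with that environment's AnnotatorModelInfo entries).
_ANNOTATOR_MODELS_BY_ENVIRONMENT: Dict[JudgeEnvironment, Dict[str, AnnotatorModelInfo]] = {
    JudgeEnvironment.SHC: {},
    JudgeEnvironment.MEDICAL: {},
}


def get_annotator_models() -> Dict[str, AnnotatorModelInfo]:
    # Pick the environment from a shell variable, falling back to "shc".
    environment = JudgeEnvironment(os.environ.get("MEDHELM_ENV", JudgeEnvironment.SHC.value))
    return _ANNOTATOR_MODELS_BY_ENVIRONMENT[environment]
```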
MedHELM contains several open-ended benchmarks where the current main metric is BERTScore. While this metric is widely accepted in the literature, we want to make our evaluation metrics more robust.
To do so, we propose using an LLM as a judge for the open-ended benchmarks, with standardized evaluation criteria for each scenario and judges from multiple vendors to reduce bias.
Default judges:
* model_deployment="stanfordhealthcare/gpt-4o-2024-05-13"
Adding @suhana13 @aunell @HennyJie FYI.
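As a rough illustration of the setup described above, a scenario's judge configuration might look something like the sketch below. Only the GPT-4o deployment is named in this description; the "accuracy" criterion, the import path, and the AnnotatorModelInfo field names are assumptions based on the excerpts in this conversation.

```python
from typing import Dict, Set

from helm.benchmark.annotation.model_as_judge import AnnotatorModelInfo

# Standardized evaluation criteria for one scenario: each criterion must come
# back from the judge with a numeric score and a free-text explanation.
ANNOTATION_CRITERIA: Dict[str, Set[str]] = {
    "accuracy": {"score", "explanation"},  # placeholder criterion
    "clarity": {"score", "explanation"},
}

# Judges from multiple vendors to reduce bias; only the default GPT-4o
# deployment listed above is shown here, others would be added per scenario.
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="stanfordhealthcare/gpt-4o-2024-05-13",
    ),
}
```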