Medhelm judges #3484
base: main
Conversation
Looks good at a high level. Left some optional suggestions; feel free to address in this pull request, or merge and open a new request.
hlog("WARNING: Annotator skipped sending requests because the model response was empty") | ||
return { | ||
"prompt_text": None, | ||
"empty_output_equivalence_judgement": False, |
Probably want to rename this key.
prompt_template: str,
annotation_criteria: Dict[str, Set[str]],
annotator_models: Dict[str, AnnotatorModelInfo],
preprocessor: Optional[Callable[[str], str]] = None,
nit: output_preprocessor
# Attempt to fix incomplete JSON by adding a closing brace
annotator_output = annotator_output + "}"
try:
    annotator_criteria = json.loads(annotator_output)
annotation_criteria -> annotation?
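For reference, a fuller version of the truncation-repair pattern quoted above might look like the sketch below; the helper name and the fall-back-to-None convention are illustrative assumptions, not the code in this PR.

```python
import json
from typing import Any, Dict, Optional


def parse_judge_json(annotator_output: str) -> Optional[Dict[str, Any]]:
    """Parse a judge's JSON output, retrying once after appending a closing brace.

    Hypothetical helper sketching the repair step in the excerpt above.
    """
    try:
        return json.loads(annotator_output)
    except json.JSONDecodeError:
        pass
    # Judge responses are sometimes truncated mid-object; appending "}" is a
    # cheap attempt to recover a parseable payload before giving up.
    try:
        return json.loads(annotator_output + "}")
    except json.JSONDecodeError:
        return None
```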
""" | ||
self._auto_client = auto_client | ||
self._prompt_template = prompt_template | ||
self._annotation_criteria = annotation_criteria |
nit: annotation_schema or annotation_format
metric_args = {
    "task": "mtsamples_replicate",
    "device": get_torch_device_name(),
    "bertscore_model": "distilbert-base-uncased",
    "rescale_with_baseline": False,
}

metric_specs = get_summarization_metric_specs(metric_args) + [
    MetricSpec(class_name="helm.benchmark.metrics.mtsamples_replicate_metrics.MTSamplesReplicateMetric", args={})
optional nit: you can omit args={} in most specs.
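For example, the spec from the excerpt above with the empty args dropped (assuming args defaults to an empty dict, as the comment suggests):

```python
from helm.benchmark.metrics.metric import MetricSpec

metric_spec = MetricSpec(
    class_name="helm.benchmark.metrics.mtsamples_replicate_metrics.MTSamplesReplicateMetric"
)
```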
    metric_service: MetricService,
    eval_cache_path: str,
) -> List[Stat]:
    assert request_state.annotations
Is there some way that the commonality between these metrics could be refactored out? See AnnotationLikertScaleMetric here for an example that you could follow.
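In the spirit of that suggestion, the shared piece might look roughly like the sketch below. The class name, constructor arguments, and the assumed annotation layout (criterion -> {"score", "explanation"}) are guesses based on the excerpts in this PR, not the existing AnnotationLikertScaleMetric API.

```python
from typing import Dict, List

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.metrics.metric import Metric
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat


class AnnotationCriteriaMetric(Metric):
    """Hypothetical shared metric: one Stat per judged criterion.

    Assumes each scenario's annotator stores, under its own name, a dict of
    criterion -> {"score": ..., "explanation": ...}.
    """

    def __init__(self, annotator_name: str, criteria: List[str]):
        self._annotator_name = annotator_name
        self._criteria = criteria

    def evaluate_generation(
        self,
        adapter_spec: AdapterSpec,
        request_state: RequestState,
        metric_service: MetricService,
        eval_cache_path: str,
    ) -> List[Stat]:
        assert request_state.annotations
        annotation: Dict = request_state.annotations[self._annotator_name]
        stats: List[Stat] = []
        for criterion in self._criteria:
            # e.g. emits "mtsamples_replicate_clarity" with the judge's numeric score.
            stats.append(
                Stat(MetricName(f"{self._annotator_name}_{criterion}")).add(annotation[criterion]["score"])
            )
        return stats
```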
"clarity": {"score", "explanation"}, | ||
} | ||
|
||
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = { |
If you need to switch between different predefined deployments in different environments, you could make an environment enum (e.g. "shc", "medical") and then do one of the below:
- Add a parameter to the run spec function that specifies an environment enum value
- Read in a shell environment variable to get the environment enum value
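A minimal sketch of the second option, using the enum values from the comment above; the MEDHELM_ENV variable name is a placeholder, and the import path for AnnotatorModelInfo is assumed from the diff excerpts.

```python
import os
from enum import Enum
from typing import Dict

from helm.benchmark.annotation.model_as_judge import AnnotatorModelInfo


class JudgeEnvironment(Enum):
    SHC = "shc"
    MEDICAL = "medical"


# One predefined set of judge deployments per environment (left empty here;
# each would be populated with that environment's AnnotatorModelInfo entries).
_ANNOTATOR_MODELS_BY_ENVIRONMENT: Dict[JudgeEnvironment, Dict[str, AnnotatorModelInfo]] = {
    JudgeEnvironment.SHC: {},
    JudgeEnvironment.MEDICAL: {},
}


def get_annotator_models() -> Dict[str, AnnotatorModelInfo]:
    # Pick the environment from a shell variable, falling back to "shc".
    environment = JudgeEnvironment(os.environ.get("MEDHELM_ENV", JudgeEnvironment.SHC.value))
    return _ANNOTATOR_MODELS_BY_ENVIRONMENT[environment]
```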
MedHELM contains several open-ended benchmarks where the current main metric is BERTScore. While this metric is widely accepted in the literature, we want to make our evaluation metrics more robust.
To do so, we propose using an LLM as a judge for the open-ended benchmarks, with standardized evaluation criteria for each scenario and judges from multiple vendors to reduce bias.
Default judges:
* model_deployment="stanfordhealthcare/gpt-4o-2024-05-13"
Adding @suhana13 @aunell @HennyJie FYI.
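As a rough illustration of the setup described above, a scenario's judge configuration might look something like the sketch below. Only the GPT-4o deployment is named in this description; the "accuracy" criterion, the import path, and the AnnotatorModelInfo field names are assumptions based on the excerpts in this conversation.

```python
from typing import Dict, Set

from helm.benchmark.annotation.model_as_judge import AnnotatorModelInfo

# Standardized evaluation criteria for one scenario: each criterion must come
# back from the judge with a numeric score and a free-text explanation.
ANNOTATION_CRITERIA: Dict[str, Set[str]] = {
    "accuracy": {"score", "explanation"},  # placeholder criterion
    "clarity": {"score", "explanation"},
}

# Judges from multiple vendors to reduce bias; only the default GPT-4o
# deployment listed above is shown here, others would be added per scenario.
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="stanfordhealthcare/gpt-4o-2024-05-13",
    ),
}
```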