MedHELM V1 #3403

MiguelAFH · 2025-03-04T05:44:11Z

This PR adds the 31 scenarios that make up the first release of MedHELM, along with the model deployments used to run all benchmarks. Changes checklist:

31 scenarios under src/helm/benchmark/scenarios
11 model deployments added
31 run specs added to the new file medhelm_run_specs.py

yifanmai · 2025-03-17T21:35:24Z

src/helm/benchmark/scenarios/mtsamples_replicate_scenario.py

+    GITHUB_DIR_URL = "https://github.com/raulista1997/benchmarkdata/tree/main/mtsamples_processed"
+    RAW_BASE_URL = "https://raw.githubusercontent.com/raulista1997/benchmarkdata/refs/heads/main/mtsamples_processed/"


Pin githash.

yifanmai · 2025-03-17T21:35:53Z

src/helm/benchmark/scenarios/mtsamples_replicate_scenario.py

+        soup = BeautifulSoup(response.text, "html.parser")
+        file_links = [
+            link.text
+            for link in soup.find_all(
+                "a", {"href": re.compile(r"/raulista1997/benchmarkdata/blob/main/mtsamples_processed/.*\.txt$")}
+            )
+        ]
+        return file_links


Use the GitHub API and pin githash (same comments as for mtsamples_procedures_scenario)
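A minimal sketch of what this suggestion could look like, assuming the file listing comes from GitHub's repository contents endpoint rather than from scraped HTML. The constant `PINNED_COMMIT` and the three helper names are hypothetical and are not part of the PR; the commit value is a placeholder that would need to be replaced with the real hash.

```python
from typing import Any, Dict, List

OWNER = "raulista1997"
REPO = "benchmarkdata"
DATA_DIR = "mtsamples_processed"
# Placeholder only (assumption): replace with the full commit SHA to pin.
PINNED_COMMIT = "<commit-sha>"


def contents_api_url(commit: str = PINNED_COMMIT) -> str:
    """URL of the GitHub contents listing for DATA_DIR at a fixed commit."""
    return f"https://api.github.com/repos/{OWNER}/{REPO}/contents/{DATA_DIR}?ref={commit}"


def raw_file_url(file_name: str, commit: str = PINNED_COMMIT) -> str:
    """Raw download URL for one file, addressed by commit instead of branch."""
    return f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{commit}/{DATA_DIR}/{file_name}"


def txt_file_names(listing: List[Dict[str, Any]]) -> List[str]:
    """Keep only the .txt files from a parsed contents-API JSON response."""
    return [
        entry["name"]
        for entry in listing
        if entry.get("type") == "file" and entry["name"].endswith(".txt")
    ]
```

The listing would typically be obtained with something like `requests.get(contents_api_url()).json()`. Addressing files by commit means later pushes to `main` cannot silently change the benchmark data.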

yifanmai · 2025-03-17T21:41:13Z

src/helm/config/model_metadata.yaml

@@ -1958,6 +1976,15 @@ models:
    num_parameters: 14000000000
    release_date: 2024-05-21
    tags: [TEXT_MODEL_TAG, LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG, INSTRUCTION_FOLLOWING_MODEL_TAG]
+
+  - name: microsoft/phi-3.5-mini-instruct


microsoft/phi-3.5-mini-instruct already exists; remove.

yifanmai · 2025-03-17T21:46:17Z

src/helm/config/model_metadata.yaml

+    release_date: 2024-09-25
+    tags: [TEXT_MODEL_TAG, LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG, INSTRUCTION_FOLLOWING_MODEL_TAG]
+
+  - name: meta/llama-3.1-8b-instruct


meta/llama-3.1-8b-instruct already exists; delete

yifanmai · 2025-03-17T21:46:34Z

src/helm/config/model_metadata.yaml

@@ -1530,6 +1530,24 @@ models:
    release_date: 2022-12-22
    tags: [] # TODO: add tags

+  - name: meta/llama-3.2-1b-instruct


Move this to right before the entry for meta/llama-3.2-3b-instruct-turbo.

src/helm/config/model_deployments.yaml

yifanmai · 2025-03-18T21:16:52Z

src/helm/benchmark/scenarios/medalign_scenario.py

+    get_instructions,
+    extract_patient_id_from_fname,
+    get_ehrs,
+    get_tokenizer,
+    tag_rgx_expression,
+    fetch_nodes_with_tag,
+    cast_dtype,
+    check_condition,
+    check_all_conditions,
+    remove_node,
+    query_xml_str,
+    filter_events,
+    retrieve_most_relevant_visits,
+    get_prompt_template,
+    pack_and_trim_prompts,
+    preprocess_prompts,
+    add_reference_responses,
+    return_dataset_dataframe,


I think you only need to import return_dataset_dataframe.

yifanmai · 2025-03-18T21:17:40Z

src/helm/benchmark/scenarios/medalign_scenario.py

-            # get the patient EHR selected for this instruction
-            pt_id: Union[str, int] = instruction_dict["patient_id"]
-            relevant_ehr = ehrs[pt_id]  # type: ignore
+            prompt = PassageQuestionInput(passage="", question=question)


yifanmai · 2025-03-18T21:24:18Z

src/helm/benchmark/scenarios/n2c2_ct_matching_scenario.py

+    @staticmethod
+    def get_date_of_note(patient: Dict[str, Any], note_idx: int) -> str:
+        """Get date of note for patient"""
+        if not isinstance(note_idx, int):


eval() is insecure code. Please either use int() instead, or remove this if block.
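For illustration, a small sketch of the `int()` alternative. The helper name `coerce_note_idx` is hypothetical and does not appear in the PR; it only shows how a string index could be converted without evaluating arbitrary code.

```python
from typing import Union


def coerce_note_idx(note_idx: Union[int, str]) -> int:
    """Return note_idx as an int without ever evaluating it as code."""
    if isinstance(note_idx, int):
        return note_idx
    # int() parses only integer literals; any other string raises ValueError.
    return int(note_idx)
```

Unlike `eval()`, a malformed or malicious string fails with `ValueError` instead of being executed.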

src/helm/benchmark/scenarios/n2c2_ct_matching_scenario.py

yifanmai · 2025-03-18T21:31:38Z

src/helm/benchmark/annotation/ehr_sql_annotator.py

+                        cursor.execute(ground_truth_sql)
+                        fetched_result = cursor.fetchone()
+                        if fetched_result:
+                            # Convert extra_values to match SQLite's expected types
+                            converted_values = [
+                                type(fetched_result[i])(extra_values[i]) for i in range(len(extra_values))
+                            ]
+                            ground_truth_result = converted_values


Does this actually work? If we're in this block, then it means that cursor.fetchall() returned a falsy value or that the query failed, so re-running the query should also result in failure. I'm fine with just using extra_values as is (i.e. the original version).
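A rough sketch of the simpler behavior described here, assuming a SQLite connection and a precomputed `extra_values` list. The function name `result_or_fallback` and the shape of the return value are assumptions for illustration, not the annotator's actual code.

```python
import sqlite3
from typing import Any, List, Sequence


def result_or_fallback(
    conn: sqlite3.Connection, ground_truth_sql: str, extra_values: Sequence[Any]
) -> List[Any]:
    """Run the query once; if it fails or returns no rows, use extra_values as is."""
    try:
        rows = conn.execute(ground_truth_sql).fetchall()
    except sqlite3.Error:
        # A failed query is treated the same as an empty result.
        rows = []
    # No second execution: re-running the same SQL would fail the same way.
    return rows if rows else list(extra_values)
```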

src/helm/clients/azure_openai_client.py

src/helm/benchmark/scenarios/n2c2_ct_matching_scenario.py

yifanmai

LGTM. Thank you all!

MiguelAFH and others added 30 commits November 23, 2024 03:23

Added medcalc bench scenario

e4b6e84

Merge branch 'med-helm' of https://github.com/stanford-crfm/helm into med-helm

c8219a9

Rollback removal of medical scenarios

c2ca7e5

UNTESTED implementation of Medalign with new setup, pushing to test on Carina

a2aa50f

updated medalign, functional

0679d3d

add max tokens for medalign run spec

f3b53e1

Added llama 3.1 instruct and medalign to schema_medical

a087633

Added display name for Llama 3.2 1B Instruct

501fe33

Update summarization metrics and MedAlign spec to bring bertscore online, change bertscore backbone model to fit on 40GB GPU

7bf79a2

Added MedDialog to MEDHELM

e9c8e47

Added MIMIC-RRS scenario

aa41939

Reduced max tokens for MIMIC-RRS

676d4a5

Fix device for medical scenarios

ed2429c

Added groups for each medical task category

3e5bd96

dischargeMe scenario + schema update

78b2a91

Added medi_qa scenario

3700016

feat: add mimic billing codes

acb902b

Added MIMICIV Billing Code scenario

5c3eea0

add mtsamples benchmark

2f96439

Merge branch 'med-helm' of https://github.com/stanford-crfm/helm into med-helm

b48d219

feat: resolve merge commits

513987c

Modified medication_qa metrics

73ed7da

Merge branch 'med-helm' of https://github.com/stanford-crfm/helm into med-helm

7bf8b11

initial ehrshot commit

c52313b

Merge branch 'med-helm' of https://github.com/stanford-crfm/helm into med-helm

ad8c71e

ehrshot

53c9e56

token stats for ehrshot

57241ec

Race based medicine detection benchmark for Ensuring Clinical Research Quality

7715c16

updated multiple choice adapter

8921886

fix: mimiciv duplicate instantiation

04fd2bb

MiguelAFH added 4 commits March 16, 2025 23:03

Update run entries

5b37259

Update medhelm run entries for SHC models

ece4190

Fix head_qa scenario

d410dbe

Merge branch

cb47085

yifanmai reviewed Mar 17, 2025

MiguelAFH and others added 4 commits March 18, 2025 03:48

Added medalgin helper file

5663820

Removed init from mental_health_scenario

afd6c4e

fix n2c2_ct_matching

34282b8

Merge main

8f0bfc9

yifanmai reviewed Mar 18, 2025

MiguelAFH and others added 17 commits March 18, 2025 21:56

Fix scenarios

c818d52

Fix lint

3a3e80f

Merge branch 'main' into med-helm

1bd80dd

update n2c2

28bf270

Merge branch 'med-helm' of github.com:stanford-crfm/helm into med-helm

ead96a3

Merge conflict

4b6f69b

Merge conflict

b9f4d90

Merge conflict

ce12aa4

Merge conflict

88dc673

Merge main

1e36813

Fix requirements order

29a432b

Fix requirements

19b9a7a

Fix lint

52bfc23

Fix lint

ac78180

Fix lint

957f8a9

Fix lint

26307ca

Fix lint

f08fe40

yifanmai approved these changes Mar 19, 2025

yifanmai merged commit 87cd4d8 into main Mar 19, 2025
8 checks passed

yifanmai deleted the med-helm branch March 19, 2025 01:01
