
support arbitrary metric logging from torchmetrics #677

Open
sichu2023 wants to merge 21 commits into main from sichu/metric-config

Conversation

@sichu2023 (Collaborator) commented Feb 5, 2025

Description

Training and validation torchmetrics.Metric objects are organized by TorchmetricsConfig, which encapsulates metric class instantiation and naming through get_metric_name instead of using field_factory.

Currently model parallelism is not supported and will raise NotImplementedError.
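
A minimal sketch of the intended usage, pieced together from the discussion in this PR (get_instance, get_metric_name, and the field names come from the thread below; the concrete values are placeholders):

train_metric = TorchmetricsConfig(
    class_path="Accuracy",
    task="classification",
    kwargs={"task": "multiclass", "num_classes": 3},  # placeholder values
    metric_name="train_acc",
)
# The Lightning module builds the metric via train_metric.get_instance()
# and logs it under the name derived from get_metric_name.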

Type of changes

  • New feature (non-breaking change which adds functionality)

@sichu2023 (Collaborator Author)

/build-ci

@sichu2023 self-assigned this Feb 5, 2025
@sichu2023 force-pushed the sichu/metric-config branch 2 times, most recently from 44e7217 to a79fc4e on February 5, 2025 at 08:49
@codecov-commenter commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 90.16393% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@9cec09f). Learn more about missing BASE report.

✅ All tests successful. No failed tests found.

Files with missing lines                                 Patch %   Lines
...-packages/bionemo-llm/src/bionemo/llm/lightning.py    84.00%    4 Missing ⚠️
...emo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py    85.71%    1 Missing ⚠️
...ionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py    85.71%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #677   +/-   ##
=======================================
  Coverage        ?   86.27%           
=======================================
  Files           ?      119           
  Lines           ?     7249           
  Branches        ?        0           
=======================================
  Hits            ?     6254           
  Misses          ?      995           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@farhadrgh (Collaborator) left a comment


I tentatively approve to unblock, but the tests are failing, and I wasn’t able to experiment with it.

# configure metrics
self.train_metric = self.config.train_metric.get_instance() if self.config.train_metric else None
self.valid_metric = self.config.valid_metric.get_instance() if self.config.valid_metric else None
if (self.train_metric or self.valid_metric) and not self.is_on_logging_device:
Collaborator

OK, so the logic is that at least one rank will throw this error if model parallelism is initialized? Does this apply to any metric logging, or only torchmetrics?

Collaborator Author

Yes, and this is an imperfect solution which might lock up the other devices.

We could condition on tensor/pipeline model parallelism if parallel_state had the functionality to inspect the tensor/pipeline model parallel size even when tp=pp=1; get_tensor_model_parallel_world_size and get_pipeline_model_parallel_world_size will throw an error in that case.
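
A rough sketch of such a guarded check, assuming Megatron-Core's parallel_state API (model_parallel_is_initialized plus the two world-size getters); this is not the code in the PR:

from megatron.core import parallel_state

def uses_model_parallelism() -> bool:
    # Avoid the error mentioned above: only query world sizes once model parallel state exists.
    if not parallel_state.model_parallel_is_initialized():
        return False
    return (
        parallel_state.get_tensor_model_parallel_world_size() > 1
        or parallel_state.get_pipeline_model_parallel_world_size() > 1
    )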

Collaborator Author

It would be great to have suggestions on alternatives.

Collaborator

IMHO, this error should be thrown when the metric config is initialized. In pydantic it could be:

from typing import Any, Literal, Optional

from pydantic import BaseModel, Field

class TorchmetricsConfig(BaseModel):
    class_path: str
    task: Literal["lm", "classification", "regression"]
    metric_name: str
    kwargs: Optional[dict[str, Any]] = Field(default_factory=dict)
    model_parallelism: bool = False
....
    def validate_parallel_logging(self) -> None:
        if self.model_parallelism:
            raise NotImplementedError(f"{self.__class__.__name__} logging is not implemented with model parallelism yet.")

and then you would need to initialize the metrics with:

valid_metric = TorchmetricsConfig(
            class_path="Accuracy",
            task="classification",
            kwargs={
                "task": "multiclass",
                "threshold": 0.5,
                "num_classes": data_module.train_dataset.label_tokenizer.vocab_size,
            },
            metric_name="val_acc",
            model_parallelism=any(p > 1 for p in [tensor_model_parallel, pipeline_model_parallel]),
        )

Alternatively, you could add the parallelism check to the config and pass the pp and tp settings, as sketched below.
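
A quick sketch of that alternative, with the validator taking the parallel sizes explicitly (hypothetical signature, not code from this PR):

    def validate_parallel_logging(self, tensor_model_parallel_size: int = 1, pipeline_model_parallel_size: int = 1) -> None:
        # Reject torchmetrics logging whenever tensor or pipeline model parallelism is enabled.
        if tensor_model_parallel_size > 1 or pipeline_model_parallel_size > 1:
            raise NotImplementedError(f"{self.__class__.__name__} logging is not implemented with model parallelism yet.")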

Collaborator Author

Since model parallelism is supported at the BionemoLightningModule level, I prefer to keep the model parallelism check outside of TorchmetricsConfig. I have moved it into the training and finetuning scripts.
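
A rough sketch of what such a script-level guard might look like (variable names here are hypothetical, not the exact code in train_esm2.py / finetune_esm2.py):

if (train_metric is not None or valid_metric is not None) and (
    tensor_model_parallel_size > 1 or pipeline_model_parallel_size > 1
):
    raise NotImplementedError("Torchmetrics logging is not supported with model parallelism yet.")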

@pstjohn (Collaborator) left a comment

Could we get some tests for these new classes? Would be great to have unit tests around MetricConfig that show how it's used, as well as a short training run with some simple model that ensures it gets serialized correctly, produces the right results, etc.
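
For example, a unit test along these lines (a sketch only, assuming the config fields used elsewhere in this PR):

def test_torchmetrics_config_get_instance():
    config = TorchmetricsConfig(
        class_path="Accuracy",
        task="classification",
        kwargs={"task": "multiclass", "num_classes": 3},
        metric_name="val_acc",
    )
    metric = config.get_instance()
    assert isinstance(metric, torchmetrics.Metric)
    assert config.metric_name == "val_acc"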

@@ -84,6 +84,7 @@ def test_esm2_finetune_token_classifier(
assert weights_ckpt.is_dir()
assert io.is_distributed_ckpt(weights_ckpt)
assert simple_ft_metrics.collection_train["loss"][0] > simple_ft_metrics.collection_train["loss"][-1]
# assert trainer.logged_metrics["val_acc"].item() <= 0.5 # TODO @farhad for a reasonable value
Collaborator Author

@farhadrgh I will have to pick your brain on this.

Collaborator

The tests for fine-tuning run on a dummy dataset which overfits. I can replace that with a more reasonable dataset and uncomment these lines later.

Collaborator Author

Thanks! Feel free to push your commits directly.

@@ -136,6 +137,7 @@ def test_esm2_finetune_regressor(
assert weights_ckpt.is_dir()
assert io.is_distributed_ckpt(weights_ckpt)
assert simple_ft_metrics.collection_train["loss"][0] > simple_ft_metrics.collection_train["loss"][-1]
# assert trainer.logged_metrics["val_mse"].item() <= 0.5 # TODO @farhad for a reasonable value
Collaborator Author

@farhadrgh I will have to pick your brain on this.

@@ -189,6 +191,7 @@ def test_esm2_finetune_classifier(
assert weights_ckpt.is_dir()
assert io.is_distributed_ckpt(weights_ckpt)
assert simple_ft_metrics.collection_train["loss"][0] > simple_ft_metrics.collection_train["loss"][-1]
# assert trainer.logged_metrics["val_acc"].item() <= 0.5 # TODO @farhad for a reasonable value
Collaborator Author

@farhadrgh I will have to pick your brain on this.

@sichu2023 requested a review from farhadrgh on February 5, 2025 at 22:06


@dataclass
class TorchmetricsConfig:
Collaborator

Why not use pydantic?

Collaborator

I would reimplement it in pydantic and add some validations, e.g.:

import importlib
from typing import Any, Literal, Optional

import torchmetrics
from pydantic import BaseModel, Field, model_validator

class TorchmetricsConfig(BaseModel):
    class_path: str
    task: Literal["lm", "classification", "regression"]
    metric_name: str
    kwargs: Optional[dict[str, Any]] = Field(default_factory=dict)
    
    @model_validator(mode='after')
    def validate_class_path(self) -> 'TorchmetricsConfig':
        try:
            self.get_instance()
        except (ImportError, AttributeError) as e:
            raise ValueError(f"Invalid class_path: {self.class_path}. Error: {str(e)}")
        return self
    
    def get_instance(self) -> torchmetrics.Metric:
        """Dynamically imports and instantiates the metric class."""
        if "." in self.class_path:
            module_path, class_name = self.class_path.rsplit(".", 1)
            module = importlib.import_module(f"torchmetrics.{module_path}")
        else:
            class_name = self.class_path
            module = importlib.import_module("torchmetrics")
            
        cls = getattr(module, class_name)
        return cls(**self.kwargs)
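
Under this sketch, usage with a nested class path would look something like the following (hypothetical values; torchmetrics.text.Perplexity is just an example):

config = TorchmetricsConfig(
    class_path="text.Perplexity",
    task="lm",
    metric_name="val_ppl",
    kwargs={"ignore_index": -100},
)
metric = config.get_instance()  # imports torchmetrics.text and returns Perplexity(ignore_index=-100)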

Collaborator Author

Because we still have implementation beyond pydantic, e.g. in train_esm2.py and finetune_esm2.py, where most of the unit testing and dev work happens. That alone will be another discussion.

sub-packages/bionemo-llm/src/bionemo/llm/model/config.py (review thread outdated and resolved)

@sichu2023 requested a review from dorotat-nv on February 6, 2025 at 17:36
@sichu2023 force-pushed the sichu/metric-config branch from ecc9f67 to 4d61dc6 on February 6, 2025 at 21:19
@sichu2023 enabled auto-merge on February 6, 2025 at 23:10