
The NLG results can be reproduced, but not the CE metric results #9

Open
mengweiwang opened this issue Apr 1, 2023 · 3 comments


@mengweiwang

I used the same data format (findings section only) as R2Gen and R2GenCMN (Chen et al.), as followed in this work, but I was unable to obtain the CE metric results reported in the paper.

I used the provided epoch=8-val_chen_cider=0.425092.ckpt checkpoint for the cvt_21_to_distilgpt2 task and also tested the epoch=0-val_chen_cider=0.410965.ckpt checkpoint for the cvt_21_to_distilgpt2_scst task, but neither achieved the CE metric results reported in the paper.
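For context, this is roughly how I evaluate a released checkpoint; a minimal, self-contained sketch assuming the PyTorch Lightning setup that the checkpoint names suggest. ToyModule and the random data are placeholders, not the repository's model or dataset.

```python
# Minimal sketch of evaluating a released .ckpt with PyTorch Lightning.
# ToyModule and the random tensors are hypothetical placeholders; the real
# repository defines its own LightningModule and data pipeline.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def test_step(self, batch, batch_idx):
        x, y = batch
        # Log a dummy test metric, analogous to the CE/NLG scores below.
        self.log("test_loss", torch.nn.functional.mse_loss(self.layer(x), y))

model = ToyModule()
# With a real checkpoint one would instead do:
# model = ToyModule.load_from_checkpoint("epoch=8-val_chen_cider=0.425092.ckpt")
loader = DataLoader(TensorDataset(torch.randn(8, 4), torch.randn(8, 1)), batch_size=4)
pl.Trainer(logger=False, enable_checkpointing=False).test(model, dataloaders=loader)
```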

Among the CE metrics, precision_macro reaches the value reported in the paper, but recall_macro and f1_macro do not, and the gap is significant.

When calculating the CE metrics here, only the findings text is considered; do I need to perform any other processing?
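For reference, this is how I understand the CE computation; a minimal sketch (not this repository's code), assuming CheXbert-style binary observation labels, with random placeholder matrices standing in for the labeller output on reference and generated findings:

```python
# Minimal sketch of CE metrics for chest X-ray report generation: a labeller
# such as CheXbert extracts 14 binary observation labels per report, and
# precision/recall/F1 are averaged over classes (macro), pooled over all
# decisions (micro), or averaged per report ("example", sklearn's "samples").
# The label matrices are random placeholders, not real labeller output.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(3858, 14))  # labels from reference reports
y_pred = rng.integers(0, 2, size=(3858, 14))  # labels from generated reports

for average in ("macro", "micro", "samples"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    print(f"{average}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```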

The results obtained from the cvt_21_to_distilgpt2 task are as follows:
{'test_ce_f1_example': 0.36598095297813416,
'test_ce_f1_macro': 0.2593880891799927,
'test_ce_f1_micro': 0.4408090114593506,
'test_ce_num_examples': 3858.0,
'test_ce_precision_example': 0.4171517491340637,
'test_ce_precision_macro': 0.3600466549396515,
'test_ce_precision_micro': 0.4919118881225586,
'test_ce_recall_example': 0.3665845990180969,
'test_ce_recall_macro': 0.25423887372016907,
'test_ce_recall_micro': 0.3993246555328369,
'test_chen_bleu_1': 0.39292487502098083,
'test_chen_bleu_2': 0.24805393815040588,
'test_chen_bleu_3': 0.17164887487888336,
'test_chen_bleu_4': 0.1269991397857666,
'test_chen_cider': 0.3902686834335327,
'test_chen_meteor': 0.15456412732601166,
'test_chen_num_examples': 3858.0,
'test_chen_rouge': 0.286588191986084}

The results in the paper are as follows:
precision_macro: 0.3597
recall_macro: 0.4122
f1_macro: 0.3842

The results obtained from the cvt_21_to_distilgpt2_scst task are as follows:
{'test_ce_f1_example': 0.36484676599502563,
'test_ce_f1_macro': 0.26361414790153503,
'test_ce_f1_micro': 0.4410783648490906,
'test_ce_num_examples': 3858.0,
'test_ce_precision_example': 0.4175392985343933,
'test_ce_precision_macro': 0.3873042166233063,
'test_ce_precision_micro': 0.49624764919281006,
'test_ce_recall_example': 0.3643813729286194,
'test_ce_recall_macro': 0.2558453679084778,
'test_ce_recall_micro': 0.3969484865665436,
'test_chen_bleu_1': 0.39466917514801025,
'test_chen_bleu_2': 0.248764768242836,
'test_chen_bleu_3': 0.1718045324087143,
'test_chen_bleu_4': 0.1269892156124115,
'test_chen_cider': 0.37993040680885315,
'test_chen_meteor': 0.15499255061149597,
'test_chen_num_examples': 3858.0,
'test_chen_rouge': 0.28760746121406555}

I reproduced the above with only the task parameters in task/mimic_cxr_jpg_chen/jobs.yaml modified.

@anicolson
Member

Hi,

Please see the updated README.md for the labels from Chen et al.
https://github.com/aehrc/cvt2distilgpt2

I will look into the discrepancy with the results.

@mengweiwang
Author


@anicolson

Yes, I am using this dataset, and the precision reaches the level reported in the paper. However, the recall is low and does not reach the reported level.

Also, I have checked and tested the updated source code. The CE metric results did not change much, and there is a bug in the latest source code when running it:

While executing the cvt_21_to_distilgpt2 task, line 281 of transmodal/model.py, which reads if not getattr(self, metric).compute_on_step:, raises an error indicating that the compute_on_step attribute does not exist.
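As a workaround, reading the attribute via getattr with a default avoids the crash; a minimal sketch, assuming the failure comes from a newer torchmetrics version, where compute_on_step was deprecated in v0.8 and later removed (this is my guess, not an official fix):

```python
# Version-safe guard for the line above (an assumption, not the repository's
# fix). On older torchmetrics, compute_on_step defaulted to True; on newer
# versions the attribute no longer exists, so direct access raises
# AttributeError. getattr with the old default keeps the check working.
import torchmetrics

metric = torchmetrics.MeanMetric()

# Old code: `if not metric.compute_on_step:` -> AttributeError on new versions.
if not getattr(metric, "compute_on_step", True):
    print("metric computes only at epoch end")
else:
    print("metric also returns a value on every forward/update step")
```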

@anicolson
Member

Hi, there are some errors in the preprint; the correct results are reported in the updated repository.
