Failed to reproduce BabelReFT's experimental results

Hello, 

I tried to reproduce your experimental results on the BabelEdits dataset with 500 sequential edits, using Llama3.1-8b-Instruct.

However, I noticed two things:
- The per-language reliability scores appear to be mixed into the portability results (`xlt-prompts_mt-en`, `xlt-prompts_mt-de`, ...).
- According to the output `summary.json`, BabelReFT fails to propagate the English edits to other languages as the number of sequential edits grows. Reliability (`post.portability.xlt-prompts_mt-xx.rewrite_score` and `post.portability.xlt-prompts_mt_marked-xx.rewrite_score`), Generality (`post.rephrase_acc.prompts_gen_mt-xx.rewrite_score` and `post.rephrase_acc.prompts_gen_mt_marked-xx.rewrite_score`), and Portability (`post.portability.multi-hop_prompts_port_mt-xx.rewrite_score`, etc.) are all close to zero.
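To make the comparison easier to check, here is a minimal sketch of how the scores above can be grouped per language. It assumes that `summary.json` is a nested JSON object whose dotted paths match the keys listed above, and that the language code is the part of the metric name after the last `-`. The helper names `flatten` and `rewrite_scores_by_language` are my own and are not part of the repository.

```python
import json
from collections import defaultdict


def flatten(node, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf of a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, node


def rewrite_scores_by_language(summary):
    """Group post-edit rewrite_score values by language suffix.

    Assumption: keys look like `post.<group>.<metric>-<lang>.rewrite_score`,
    as in the paths quoted in this issue.
    """
    scores = defaultdict(dict)
    for key, value in flatten(summary):
        parts = key.split(".")
        if len(parts) < 3 or parts[0] != "post" or parts[-1] != "rewrite_score":
            continue
        # Split e.g. "xlt-prompts_mt-de" into ("xlt-prompts_mt", "de").
        name, sep, lang = parts[-2].rpartition("-")
        if not sep:
            continue  # no language suffix on this metric
        scores[lang][".".join(parts[1:-2] + [name])] = value
    return dict(scores)


# Usage (the path is whatever location your run wrote summary.json to):
#   with open("summary.json") as f:
#       per_lang = rewrite_scores_by_language(json.load(f))
#   for lang, metrics in sorted(per_lang.items()):
#       print(lang, metrics)
```

If the real file nests the metrics differently, only the key parsing in `rewrite_scores_by_language` should need to change.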

I used your default `config.yaml` and ran `python edit.py method="babelreft" max_edits=500 sequential=True return_edited_weights_at_end=True` to get the results.

Could you please tell me why these results differ from the ones you reported? If the default configuration cannot reproduce your results, what should I change to get the correct ones?

Thank you for your time.

<img width="530" height="636" alt="Image" src="https://github.com/user-attachments/assets/9efdddcc-965d-4bfe-9bf0-5089a3009538" />

<img width="481" height="607" alt="Image" src="https://github.com/user-attachments/assets/601bc4f9-7234-499c-8dc7-0507033b68b6" />
