Hello,
I tried to reproduce your experimental results using the BabelEdits dataset, with 500 sequential edits, on Llama3.1-8b-Instruct.
However, it seems that:
- the reliability scores for each language are mixed into the portability results (`xlt-prompts_mt-en`, `xlt-prompts_mt-de`, ...)
- the output `summary.json` shows that BabelReFT fails to propagate the English edits to other languages as the number of sequential edits grows, with Reliability (`post.portability.xlt-prompts_mt-xx.rewrite_score` and `post.portability.xlt-prompts_mt_marked-xx.rewrite_score`), Generality (`post.rephrase_acc.prompts_gen_mt-xx.rewrite_score` and `post.rephrase_acc.prompts_gen_mt_marked-xx.rewrite_score`), and Portability (`post.portability.multi-hop_prompts_port_mt-xx.rewrite_score`, etc.) all close to zero
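
To make the grouping above concrete, here is a minimal sketch of how the scores could be separated by metric and language. It assumes `summary.json` is either a nested dictionary or a flat dictionary with dotted keys, that each `rewrite_score` is a single number, and that the key names follow the patterns quoted above. The `_marked` variant of the multi-hop key and the helper names `flatten` and `group_scores` are hypothetical and not taken from the repository.

```python
import re
from collections import defaultdict

# Assumed key patterns, based only on the key names quoted in this issue.
# The real summary.json layout may differ.
LANG = r"(?P<lang>[a-z]{2,3})"
PATTERNS = {
    "reliability": re.compile(
        rf"^post\.portability\.xlt-prompts_mt(?:_marked)?-{LANG}\.rewrite_score$"
    ),
    "generality": re.compile(
        rf"^post\.rephrase_acc\.prompts_gen_mt(?:_marked)?-{LANG}\.rewrite_score$"
    ),
    # The "_marked" variant of the multi-hop key is an assumption.
    "portability": re.compile(
        rf"^post\.portability\.multi-hop_prompts_port_mt(?:_marked)?-{LANG}\.rewrite_score$"
    ),
}


def flatten(node, prefix=""):
    """Flatten nested dicts into a {dotted.key: value} mapping."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def group_scores(summary):
    """Average the rewrite scores per metric and language."""
    collected = defaultdict(lambda: defaultdict(list))
    for key, value in flatten(summary).items():
        for metric, pattern in PATTERNS.items():
            match = pattern.match(key)
            if match:
                collected[metric][match.group("lang")].append(value)
    return {
        metric: {lang: sum(vals) / len(vals) for lang, vals in langs.items()}
        for metric, langs in collected.items()
    }
```

Loading the file with `json.load` and passing the result to `group_scores` would then give one average per metric and language, which makes it easier to see whether the near-zero scores affect every language or only the non-English ones.
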
I used your default `config.yaml` and ran `python edit.py method="babelreft" max_edits=500 sequential=True return_edited_weights_at_end=True` to get the results.
Could you please tell me why the results differ from your reported ones? If the default configuration cannot reproduce your results, what should I do to get the correct ones?
Thank you for your time.
