(1) That part of the data is too large, so it has not been open-sourced. Currently only the fine-tuning datasets have been released: https://huggingface.co/datasets/Shitao/bge-m3-data/tree/main
(2) You can either keep the duplicated portion, or deduplicate it, e.g. by removing datasets that come from the same source.
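The deduplication suggested above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the record layout (a `source` tag plus a `text` field) and the function name are assumptions made for the example.

```python
# Hypothetical sketch of "remove datasets from the same source":
# keep one corpus whole, and drop from the other corpus every record
# whose source tag already appears in the first.
# Field names ("source", "text") are illustrative assumptions.

def dedup_by_source(primary, secondary):
    """Return primary plus only those secondary records whose
    source does not already occur in primary."""
    seen_sources = {rec["source"] for rec in primary}
    return primary + [rec for rec in secondary
                      if rec["source"] not in seen_sources]

# Toy example: two overlapping collections sharing the "dureader" source.
table9 = [{"source": "dureader", "text": "q1 || p1"},
          {"source": "msmarco",  "text": "q2 || p2"}]
mtp    = [{"source": "dureader", "text": "q3 || p3"},
          {"source": "mnli",     "text": "q4 || p4"}]

merged = dedup_by_source(table9, mtp)
# table9 is kept intact; the duplicated "dureader" subset of MTP is dropped.
```

In practice one would deduplicate at the source-dataset level like this (cheap and coarse), or additionally hash individual texts for exact-duplicate removal; either is consistent with the reply above.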
Hello author, thank you very much for open-sourcing the model. I am trying to reproduce it and have hit a bottleneck in the pre-training stage; I would appreciate your help with the following:
(1) For pre-training xlm-roberta-large with RetroMAE, I collected the C4, Wudao, and Pile datasets. The paper also mentions using a large amount of unsupervised data to pre-train the dense retriever. Has that part of the data been made public?
(2) I found the MTP dataset online, which contains 300 million examples combined from many datasets, but it overlaps with Table 9 in your paper. How was this data processed?
MTP:
Table 9:
If you see this question, I would be very grateful for an answer. Many thanks!