Questions about the pre-training datasets #1154

Open
WangCC-77 opened this issue Oct 28, 2024 · 1 comment

Comments

@WangCC-77

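Hello, and thank you very much for open-sourcing the model. I am trying to reproduce it and have hit a bottleneck at the pre-training stage. Could you please take a look?
(1) For pre-training xlm-roberta-large with RetroMAE, I collected the C4, wudao, and pile datasets. The paper also says that a large amount of unsupervised data was used to pre-train for dense retrieval. Has that data been publicly released?
(2) I found the MTP dataset online. It contains 300 million records combined from many datasets, but it overlaps with Table 9 in the paper. How was the overlapping data handled?
MTP:
image
Table 9:
image
If you see this question, I would really appreciate your help. Thank you very much!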

@hanhainebula
Contributor

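(1) That data is too large, so it has not been open-sourced. So far only the fine-tuning datasets have been released: https://huggingface.co/datasets/Shitao/bge-m3-data/tree/main
(2) You can keep the duplicated portion, or you can deduplicate it, for example by removing datasets that come from the same source.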
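
For anyone who wants to try the deduplication option in (2), here is a minimal sketch. It is not the authors' pipeline. It assumes each record is a dict with hypothetical `source`, `query`, and `passage` fields, and the source names are placeholders. The first function drops records whose source is already covered by another collection; the second is an optional exact-match pass over normalized text.

```python
import hashlib


def _norm(text):
    # Lowercase and collapse runs of whitespace so trivial variants match.
    return " ".join(text.lower().split())


def drop_overlapping_sources(records, known_sources):
    """Keep only records whose 'source' is not in known_sources.

    The 'source' field name is an assumption made for this sketch.
    """
    known = {_norm(s) for s in known_sources}
    return [r for r in records if _norm(r["source"]) not in known]


def dedup_exact(records):
    """Drop records whose normalized (query, passage) pair was already seen."""
    seen = set()
    kept = []
    for r in records:
        key = _norm(r["query"]) + "\x00" + _norm(r["passage"])
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(r)
    return kept


if __name__ == "__main__":
    # Placeholder data; real source names would come from your own manifests.
    mtp = [
        {"source": "source_a", "query": "q1", "passage": "p1"},
        {"source": "source_b", "query": "Hello  World", "passage": "x"},
        {"source": "Source_B", "query": "hello world", "passage": "X"},
    ]
    already_covered = ["source_a"]  # e.g. sources that also appear in the other list
    filtered = drop_overlapping_sources(mtp, already_covered)
    print(len(filtered), len(dedup_exact(filtered)))
```

Source-level filtering is cheap and matches the suggestion above. The hash pass only catches exact duplicates after normalization, so near-duplicates would need something like MinHash instead.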
