I read in your paper that "when packing multiple documents into one sequence, we ensure each sequence starts with a new document rather than in the middle of one." Have you done any ablation studies on this? What is the impact on downstream performance of simply packing the documents in the standard fashion instead? Thanks!
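To make the comparison concrete, here is a minimal sketch of the two strategies as I understand them. It is not taken from the paper: the function names, the `EOS` token id, and in particular the choice to truncate a document that overflows the current sequence are my own assumptions, and the actual implementation may handle overflow differently.

```python
from typing import List

EOS = 0  # hypothetical end-of-document token id (assumption)


def pack_standard(docs: List[List[int]], seq_len: int) -> List[List[int]]:
    """Standard packing: concatenate all documents into one stream and
    cut it into fixed-length chunks, so a chunk may start mid-document."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc + [EOS])
    # The trailing partial chunk is dropped for simplicity.
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]


def pack_doc_aligned(docs: List[List[int]], seq_len: int) -> List[List[int]]:
    """Document-aligned packing: every sequence begins at a document start.
    Assumption: tokens that do not fit in the current sequence are discarded."""
    sequences: List[List[int]] = []
    current: List[int] = []
    for doc in docs:
        tokens = doc + [EOS]
        room = seq_len - len(current)
        current.extend(tokens[:room])  # truncate whatever does not fit
        if len(current) == seq_len:
            sequences.append(current)
            current = []  # the next sequence starts with the next document
    # The trailing partial sequence is dropped for simplicity.
    return sequences


docs = [[1, 1, 1], [2, 2, 2, 2, 2], [3, 3], [4, 4, 4, 4]]
print(pack_standard(docs, 4))
print(pack_doc_aligned(docs, 4))
```

With these toy documents, the third standard chunk begins with the tail of the second document, while every document-aligned sequence begins with the first token of some document. My question is whether that difference matters downstream.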