Why add EOS token after every sentence? #1

@lucacampanella

Hi,
Thanks a lot for sharing the code with us, interesting work!
I have a question regarding tokenization for GPT-2.
I've seen that you add an EOS token at the end of every sentence in each text example. Here:

def add_eos_tokens(self, text):
    eos_token = " " + self.transformer_tokenizer.eos_token + " "
    sentences = self.sentence_detector.tokenize(text)
    eos_added_text = (
        eos_token.join(sentences) + " " + self.transformer_tokenizer.eos_token
    )
    return eos_added_text

Why do you do this? Wouldn't a single EOS token at the end of the whole text be sufficient?
Thanks a lot for your input :)
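For context, here is a self-contained sketch of what that method produces. It substitutes GPT-2's literal `<|endoftext|>` string for `self.transformer_tokenizer.eos_token` and a naive regex splitter for `self.sentence_detector` (the repo presumably uses a trained sentence tokenizer, so sentence boundaries may differ on real text):

```python
import re

# GPT-2's EOS string, standing in for self.transformer_tokenizer.eos_token.
EOS = "<|endoftext|>"

def split_sentences(text):
    # Naive stand-in for self.sentence_detector.tokenize:
    # break after '.', '!' or '?' followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def add_eos_tokens(text):
    # Same logic as the quoted method: join sentences with a padded EOS
    # token, then append one final EOS at the end of the text.
    eos_token = " " + EOS + " "
    sentences = split_sentences(text)
    return eos_token.join(sentences) + " " + EOS

print(add_eos_tokens("First sentence. Second sentence."))
# → First sentence. <|endoftext|> Second sentence. <|endoftext|>
```

So every sentence boundary gets an EOS, not just the end of the example, which is exactly what prompted my question.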
