
About eos_token_id in config file (20M, 1B) #757

Open
lllabmaster opened this issue Nov 29, 2024 · 5 comments
@lllabmaster

❓ The question

In the 20M configuration file (OLMo-20M.yaml), the settings specify:
eos_token_id: 0, pad_token_id: 1, and tokenizer: tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json.

However, in the tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json file, I noticed that id 0 corresponds to "|||IP_ADDRESS|||", while <|endoftext|> is assigned id 50279. This seems to contradict the configuration, especially when compared with the tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json file.
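The mismatch can be confirmed directly by scanning the `added_tokens` section of a tokenizer.json. A minimal sketch, using an inline sample that mirrors the entries quoted in this thread (for a real file, replace the list with `json.load(open(path))["added_tokens"]`):

```python
import json

# Sample mimicking the "added_tokens" entries of the tokenizer discussed here;
# in practice this list would come from json.load() on the tokenizer.json file.
added_tokens = [
    {"id": 50279, "content": "|||IP_ADDRESS|||", "special": True},
    {"id": 0, "content": "<|endoftext|>", "special": True},
    {"id": 1, "content": "<|padding|>", "special": True},
]

def find_token_id(tokens, content):
    """Return the id assigned to a given special-token string, or None."""
    for tok in tokens:
        if tok["content"] == content:
            return tok["id"]
    return None

print(find_token_id(added_tokens, "<|endoftext|>"))  # 0 for this sample
print(find_token_id(added_tokens, "<|padding|>"))    # 1
```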

Additionally, I inspected the preprocessed data file (part-1-00000.npy) and ran the following analysis:

```python
list((np.where(data == 50279)[0][1:] - np.where(data == 50279)[0][:-1]) / 4)
```

This assumes an average of 4 tokens per English word. The results were:
[27639.0, 16304.25, 23344.5, 26183.25, 35961.75, 6302.0, 42492.0, 4867.0, 7313.5, ...] (length = 7386).
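The gap analysis above can be reproduced on a small synthetic array: document lengths fall out as the differences between consecutive positions of the EOS id (50279 is assumed here, matching the thread).

```python
import numpy as np

# Toy token stream with EOS markers at positions 2, 6, and 8.
eos_id = 50279
data = np.array([5, 7, eos_id, 1, 2, 3, eos_id, 9, eos_id])

eos_pos = np.where(data == eos_id)[0]
doc_lengths = eos_pos[1:] - eos_pos[:-1]  # tokens per document, incl. the EOS
print(doc_lengths.tolist())  # [4, 2]
```

If the data really was encoded with a different EOS id, `eos_pos` comes back empty and the gap list is empty too, which is itself a useful diagnostic.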

I have two questions:

  1. What should I set as the eos_token_id in the configuration file when using the allenai_eleuther-ai-gpt-neox-20b-pii-special.json tokenizer?
  2. Was the preprocessed data (gpt-neox-olmo-dolma-v1_5/part-X-00000.npy) encoded with eos_token_id=50279?

Thank you for your assistance!

@lllabmaster lllabmaster added the type/question An issue that's a question label Nov 29, 2024
@Harry-Miral

Have you solved your problem? I have the same confusion. Thank you.

@lllabmaster
Author

> Have you solved your problem? I have the same confusion. Thank you.

I changed the config file (yaml) to the correct token id, setting eos_token_id: 50279. It worked well.

@Harry-Miral

Harry-Miral commented Mar 4, 2025

Thank you. I just found that the tokenizer.json file (https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct/raw/main/tokenizer.json) is defined like this:

```json
{
  "id": 50279,
  "content": "|||IP_ADDRESS|||",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": true,
  "special": true
},
{
  "id": 0,
  "content": "<|endoftext|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
{
  "id": 1,
  "content": "<|padding|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
```

I am a little confused: I don't know whether I should change the ids in the yaml file or in the tokenizer.

@lllabmaster
Author

> Thank you. I just found that the tokenizer.json file (https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct/raw/main/tokenizer.json) is defined like this: […] I am a little confused: I don't know whether I should change the ids in the yaml file or in the tokenizer.

I suggest you change the model config file (yaml) to match your tokenizer file's settings (token ids, etc.). I trained my own tokenizer on 100B tokens and updated the model config file to match my tokenizer file. (The model config files in the repo sometimes contain mistakes, so we should edit them with the right settings.)
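A quick sketch of that cross-check: load both files and assert that the config's special-token ids agree with what the tokenizer actually assigns. The inline dicts below stand in for `yaml.safe_load(...)` on the model config and a lookup into the tokenizer.json's `added_tokens` list; the values shown are the ones this thread converges on for the 0924 tokenizer.

```python
# Stand-ins for the parsed model config (yaml) and tokenizer.json contents.
config = {"eos_token_id": 50279, "pad_token_id": 1}
tokenizer_special_ids = {"<|endoftext|>": 50279, "<|padding|>": 1}

# The config must point at the ids the tokenizer actually uses, or the
# training data's document boundaries will be misread (as in this issue).
assert config["eos_token_id"] == tokenizer_special_ids["<|endoftext|>"]
assert config["pad_token_id"] == tokenizer_special_ids["<|padding|>"]
print("config matches tokenizer")
```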

@Harry-Miral

Harry-Miral commented Mar 4, 2025

Thank you, your solution is very good. I think I found the real cause of the problem: in the tokenizer.json of the 0924 version (https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct/blob/main/tokenizer.json), it is defined as follows:

```json
{
  "id": 50279,
  "content": "<|endoftext|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
```

In the tokenizer.json of the 0125 version (https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct/raw/main/tokenizer.json), it is defined as follows:

```json
{
  "id": 0,
  "content": "<|endoftext|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
```

The two tokenizers differ. When using the official https://github.com/allenai/OLMoE, note that the official readme uses the 0924 version. Therefore, the yml file for the 0924 version should set eos_token_id: 50279.
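Putting that conclusion in config form, the relevant yaml lines for the 0924 tokenizer would read as follows (a sketch: pad_token_id: 1 is carried over from the OLMo-20M.yaml quoted at the top of the thread, and any other keys follow that same config):

```yaml
# Special-token ids matching the 0924 tokenizer.json,
# where <|endoftext|> = 50279 and <|padding|> = 1.
eos_token_id: 50279
pad_token_id: 1
```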
