
About eos_token_id in config file (20M, 1B) #757

Open
lllabmaster opened this issue Nov 29, 2024 · 5 comments
@lllabmaster

❓ The question

In the 20M configuration file (OLMo-20M.yaml), the settings specify:
eos_token_id: 0, pad_token_id: 1, and tokenizer: tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json.

However, in the tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json file, I noticed that id 0 corresponds to "|||IP_ADDRESS|||", while <|endoftext|> is assigned id 50279. This seems to contradict the configuration, especially when compared with the tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json file.
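The mismatch can be confirmed directly by scanning the `added_tokens` section of a tokenizer.json. A minimal sketch, using an inline sample that mirrors the entries quoted in this thread (for a real file, replace the list with `json.load(open(path))["added_tokens"]`):

```python
import json

# Sample mimicking the "added_tokens" entries of the tokenizer discussed here;
# in practice this list would come from json.load() on the tokenizer.json file.
added_tokens = [
    {"id": 50279, "content": "|||IP_ADDRESS|||", "special": True},
    {"id": 0, "content": "<|endoftext|>", "special": True},
    {"id": 1, "content": "<|padding|>", "special": True},
]

def find_token_id(tokens, content):
    """Return the id assigned to a given special-token string, or None."""
    for tok in tokens:
        if tok["content"] == content:
            return tok["id"]
    return None

print(find_token_id(added_tokens, "<|endoftext|>"))  # 0 for this sample
print(find_token_id(added_tokens, "<|padding|>"))    # 1
```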

Additionally, I inspected the preprocessed data file (part-1-00000.npy) and ran the following analysis:

```python
list((np.where(data == 50279)[0][1:] - np.where(data == 50279)[0][:-1]) / 4)
```

This assumes an average of 4 tokens per English word. The results were:
[27639.0, 16304.25, 23344.5, 26183.25, 35961.75, 6302.0, 42492.0, 4867.0, 7313.5, ...] (length = 7386).
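The gap analysis above can be reproduced on a small synthetic array: document lengths fall out as the differences between consecutive positions of the EOS id (50279 is assumed here, matching the thread).

```python
import numpy as np

# Toy token stream with EOS markers at positions 2, 6, and 8.
eos_id = 50279
data = np.array([5, 7, eos_id, 1, 2, 3, eos_id, 9, eos_id])

eos_pos = np.where(data == eos_id)[0]
doc_lengths = eos_pos[1:] - eos_pos[:-1]  # tokens per document, incl. the EOS
print(doc_lengths.tolist())  # [4, 2]
```

If the data really was encoded with a different EOS id, `eos_pos` comes back empty and the gap list is empty too, which is itself a useful diagnostic.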

I have two questions:

  1. What should I set as the eos_token_id in the configuration file when using the allenai_eleuther-ai-gpt-neox-20b-pii-special.json tokenizer?
  2. Was the preprocessed data (gpt-neox-olmo-dolma-v1_5/part-X-00000.npy) encoded with eos_token_id=50279?

Thank you for your assistance!

@lllabmaster lllabmaster added the type/question An issue that's a question label Nov 29, 2024
@Harry-Miral

Have you solved your problem? I have the same confusion. Thank you.

@lllabmaster
Author

> Have you solved your problem? I have the same confusion. Thank you.

I changed the config file (yaml) to the correct token id, setting eos_token_id: 50279. It worked well.

@Harry-Miral

Harry-Miral commented Mar 4, 2025

Thank you. I just found that the tokenizer.json file (https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct/raw/main/tokenizer.json) is defined like this:

```json
{
  "id": 50279,
  "content": "|||IP_ADDRESS|||",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": true,
  "special": true
},
{
  "id": 0,
  "content": "<|endoftext|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
{
  "id": 1,
  "content": "<|padding|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
```

I am a little confused: I don't know whether I should change the ids in the yaml file or in the tokenizer.

@lllabmaster
Author

> Thank you. I just found that the tokenizer.json file (https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct/raw/main/tokenizer.json) is defined like this: […] I am a little confused: I don't know whether I should change the ids in the yaml file or in the tokenizer.

I suggest you change the model config file (yaml) to match your tokenizer file's settings (token ids, etc.). I trained my own tokenizer on 100B tokens and updated the model config file to match my tokenizer file. (The model config files in the repo sometimes contain mistakes, so we should edit them with the right settings.)
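A quick sketch of that cross-check: load both files and assert that the config's special-token ids agree with what the tokenizer actually assigns. The inline dicts below stand in for `yaml.safe_load(...)` on the model config and a lookup into the tokenizer.json's `added_tokens` list; the values shown are the ones this thread converges on for the 0924 tokenizer.

```python
# Stand-ins for the parsed model config (yaml) and tokenizer.json contents.
config = {"eos_token_id": 50279, "pad_token_id": 1}
tokenizer_special_ids = {"<|endoftext|>": 50279, "<|padding|>": 1}

# The config must point at the ids the tokenizer actually uses, or the
# training data's document boundaries will be misread (as in this issue).
assert config["eos_token_id"] == tokenizer_special_ids["<|endoftext|>"]
assert config["pad_token_id"] == tokenizer_special_ids["<|padding|>"]
print("config matches tokenizer")
```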

@Harry-Miral

Harry-Miral commented Mar 4, 2025

Thank you, your solution is very good. I think I found the real cause of the problem: in the tokenizer.json of the 0924 version (https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct/blob/main/tokenizer.json), it is defined as follows:

```json
{
  "id": 50279,
  "content": "<|endoftext|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
```

In the tokenizer.json of the 0125 version (https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct/raw/main/tokenizer.json), it is defined as follows:

```json
{
  "id": 0,
  "content": "<|endoftext|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
```

The two tokenizers differ. When using the official https://github.com/allenai/OLMoE, note that the official readme uses the 0924 version. Therefore, the yml file for the 0924 version should set eos_token_id: 50279.
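Putting that conclusion in config form, the relevant yaml lines for the 0924 tokenizer would read as follows (a sketch: pad_token_id: 1 is carried over from the OLMo-20M.yaml quoted at the top of the thread, and any other keys follow that same config):

```yaml
# Special-token ids matching the 0924 tokenizer.json,
# where <|endoftext|> = 50279 and <|padding|> = 1.
eos_token_id: 50279
pad_token_id: 1
```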
