About eos_token_id in config file (20M, 1B) #757
Comments
Have you solved this problem? I have the same confusion. Thank you.
I changed the config file (YAML) to the correct token id, setting `eos_token_id: 50279`. It worked well.
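For reference, the relevant lines after that fix would look roughly like this. This is a sketch based on the field layout of OLMo YAML configs; check the exact structure of your own file before editing:

```yaml
# Sketch of the corrected settings in OLMo-20M.yaml.
# 50279 is the id of <|endoftext|> in the allenai_gpt-neox-olmo-dolma-v1_5
# tokenizer; verify against your own tokenizer.json before using it.
model:
  eos_token_id: 50279
  pad_token_id: 1
tokenizer:
  identifier: tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json
```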
Thank you. I just found that the tokenizer.json file is defined like this: (https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct/raw/main/tokenizer.json)
I suggest you update the model config file (YAML) to match your tokenizer file's settings (token ids, etc.). I trained my own tokenizer on 100B tokens and changed the model config file to match my tokenizer file. (The model config files in the repo may sometimes contain mistakes, so we should edit them with the right settings.)
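One way to do that consistency check programmatically: read the id that the tokenizer.json actually assigns to `<|endoftext|>` and compare it with the value in the YAML config. A minimal sketch, assuming a Hugging Face-style tokenizer.json where the vocabulary lives under `model.vocab`; the mini-vocab written below is only a stand-in for the real file:

```python
import json

def eos_id_from_tokenizer(path, eos_token="<|endoftext|>"):
    """Return the id a HF-style tokenizer.json assigns to eos_token.

    In a Hugging Face tokenizer.json, the vocabulary is stored under
    model.vocab as a {token: id} mapping.
    """
    with open(path) as f:
        tok = json.load(f)
    return tok["model"]["vocab"][eos_token]

# Stand-in tokenizer file so the sketch is runnable on its own
# (ids mirror the ones discussed in this issue):
with open("mini_tokenizer.json", "w") as f:
    json.dump({"model": {"vocab": {"|||IP_ADDRESS|||": 0,
                                   "<|endoftext|>": 50279}}}, f)

eos_id = eos_id_from_tokenizer("mini_tokenizer.json")
print(eos_id)  # compare this against eos_token_id in the YAML config
```

If the printed id disagrees with `eos_token_id` in the YAML, the config is the thing to fix.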
Thank you, your solution is very good. I think I found the real reason for the problem: in the tokenizer.json of version 09.24 (https://huggingface.co/allenai/OLMoE-1B-7B-0924-
❓ The question
In the 20M configuration file (OLMo-20M.yaml), the settings specify `eos_token_id: 0`, `pad_token_id: 1`, and `tokenizer: tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json`. However, in the `tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json` file, I noticed that id 0 corresponds to `"|||IP_ADDRESS|||"`, while `<|endoftext|>` is assigned id 50279. This seems to contradict the configuration, especially when compared with the `tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json` file.

Additionally, I inspected the preprocessed data file (part-1-00000.npy) and ran the following analysis:
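(The analysis snippet itself was lost in the page extraction. A hedged reconstruction of that kind of check, counting how often each candidate EOS id occurs in a shard, could look like this; the tiny array written below is only a stand-in for the real part-1-00000.npy, and the uint16 dtype is an assumption based on the vocabulary size fitting in 16 bits:)

```python
import numpy as np

# OLMo preprocessed shards are flat arrays of token ids read back with
# np.memmap. The real file would be e.g. part-1-00000.npy; a small
# stand-in array is written first so the sketch is runnable.
tokens = np.array([27639, 50279, 16304, 50279, 0], dtype=np.uint16)
tokens.tofile("part-demo.npy")

data = np.memmap("part-demo.npy", dtype=np.uint16, mode="r")
for candidate in (0, 50279):
    count = int((data == candidate).sum())
    print(f"token id {candidate} appears {count} times")
```

Whichever candidate id actually delimits documents should dominate; if id 0 almost never appears but 50279 does, the data was encoded with `eos_token_id=50279`.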
This assumes an average of 4 tokens per English word. The results were:

```
[27639.0, 16304.25, 23344.5, 26183.25, 35961.75, 6302.0, 42492.0, 4867.0, 7313.5, ...]
```

(length = 7386).

I have two questions:

1. What should the `eos_token_id` be in the configuration file when using the `allenai_eleuther-ai-gpt-neox-20b-pii-special.json` tokenizer?
2. Is the preprocessed data (`gpt-neox-olmo-dolma-v1_5/part-X-00000.npy`) encoded with `eos_token_id=50279`?

Thank you for your assistance!