Ubuntu IRC broken encoding, impacting generative models downstream #102
Comments
I don't believe you can specify character encoding in HTTP requests. I'll try to contact the author of the bot that scrapes for irclogs.ubuntu.com to get some insight, or report a bug (no way the data has been encoded wrong for over a decade, right?...)
Found the solution. The .txt files are mixed encoding, line-by-line. This dataset must be properly decoded before use. This can be done fairly simply:
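The snippet originally posted with this comment is not preserved here. As a rough sketch of what line-by-line decoding could look like, assuming each line is either valid UTF-8 or text in a single-byte legacy encoding (the `cp1252` fallback and the function names are assumptions for illustration, not taken from the thread):

```python
def decode_mixed_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode one line: strict UTF-8 first, then a single-byte fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback codec; errors="replace" avoids a crash on the
        # few byte values that cp1252 leaves undefined.
        return raw.decode(fallback, errors="replace")


def decode_mixed(data: bytes, fallback: str = "cp1252") -> str:
    """Decode a whole log file whose lines may use different encodings."""
    lines = data.splitlines(keepends=True)  # keep the original line endings
    return "".join(decode_mixed_line(line, fallback) for line in lines)


# Hypothetical two-line sample: one UTF-8 line, one Latin-1 line.
sample = "¯\\_(ツ)_/¯\n".encode("utf-8") + "café\n".encode("latin-1")
decoded = decode_mixed(sample)
```

The file has to be read in binary mode (`open(path, "rb")`) for this to work, since decoding the whole file with one codec is exactly what produces the garbled text.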
@briansemrau do you know if
I would not expect it to. This dataset has strange encoding to work around a specific technical problem with IRC compatibility.
i see. thank you very much!
The Ubuntu IRC dataset appears to contain broken character encoding, which noticeably impacts generated output from models trained on The Pile in certain situations.
For example, from https://irclogs.ubuntu.com/2020/08/23/%23ubuntu.txt
This file contains ¯\_(ツ)_/¯ which should instead show as ¯\_(ツ)_/¯, if it were properly encoded.
I can't currently inspect the data directly in The Pile, because the-eye.eu and eaidata.bmk.sh are both inaccessible right now.
However, I have seen lots of garbled output from GPT-J that looks remarkably similar to this broken encoding, e.g.
¯_(�)_/¯
It looks like this dataset could be cleaned by using the ftfy Python library: https://ftfy.readthedocs.io/en/latest/
In my very brief testing, this appears to fix the broken encoding from the file linked above.
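To illustrate the kind of damage being described, here is a small sketch. It assumes the garbling comes from UTF-8 bytes being read as Windows-1252, which is one common cause and has not been confirmed for this dataset. The helper names `garble`, `reverse_mojibake`, and `repair` are hypothetical; `ftfy.fix_text` is the library's documented entry point and is only used if ftfy is installed.

```python
SHRUG = "¯\\_(ツ)_/¯"


def garble(text: str) -> str:
    """Simulate the damage: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def reverse_mojibake(text: str) -> str:
    """Undo one round of that misreading; leave other text untouched."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def repair(text: str) -> str:
    """Prefer ftfy when available, else fall back to the manual round trip."""
    try:
        import ftfy  # third-party: pip install ftfy
    except ImportError:
        return reverse_mojibake(text)
    return ftfy.fix_text(text)


broken = garble(SHRUG)               # 'Â¯\\_(ãƒ„)_/Â¯'
restored = reverse_mojibake(broken)  # back to the original shrug
```

The manual round trip only handles this single failure mode. ftfy applies heuristics to decide whether and how to repair each string, which matters for a corpus where only some lines are damaged.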