Ubuntu IRC broken encoding, impacting generative models downstream #102

briansemrau · 2023-01-19T05:43:59Z

The Ubuntu IRC dataset appears to contain broken character encoding, which noticeably impacts generated output from models trained on The Pile in certain situations.

For example, https://irclogs.ubuntu.com/2020/08/23/%23ubuntu.txt contains Â¯\_(ãƒ„)_/Â¯, which should instead read ¯\_(ツ)_/¯ if it were properly decoded.

I can't currently inspect the data directly in The Pile, because the-eye.eu and eaidata.bmk.sh are both inaccessible right now.
However, I have seen lots of garbled output from GPT-J that looks remarkably similar to this broken encoding, e.g. Â¯_(ã��)_/Â¯

It looks like this dataset could be cleaned using the ftfy Python library: https://ftfy.readthedocs.io/en/latest/
In my very brief testing, this appears to fix the broken encoding from the file linked above.
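
As a rough illustration of what such a repair involves (this is not ftfy's actual implementation), the garbled string above is what you get when UTF-8 bytes are decoded as Windows-1252. The `cp1252` codec and the helper name `undo_mojibake` are assumptions made for this sketch; ftfy detects and handles many more cases.

```python
def undo_mojibake(text: str, wrong_codec: str = "cp1252") -> str:
    """Reverse one round of 'UTF-8 bytes decoded with the wrong codec'.

    Hypothetical helper for illustration; cp1252 is an assumed codec.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct): leave unchanged.
        return text


correct = "¯\\_(ツ)_/¯"
# Simulate the breakage: UTF-8 bytes misread as cp1252.
garbled = correct.encode("utf-8").decode("cp1252")
repaired = undo_mojibake(garbled)

print(garbled)
print(repaired)
```

Text that is already correct passes through unchanged, because re-encoding it with the wrong codec fails and the helper falls back to the input.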


Mistobaan · 2023-01-19T06:03:14Z

~~Could we download them again without errors, or are they gone?~~
My guess is that this is a UTF-8-to-ASCII error. Maybe the server is messing with the encoding?

Try requesting UTF-8 when doing the GET request.
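
One way to try this is the `Accept-Charset` request header. It is only a hint: servers commonly ignore it, and it cannot help if the stored file itself mixes encodings. The sketch below uses only the standard library and builds the request without sending it; `build_utf8_request` is a hypothetical helper name.

```python
import urllib.request


def build_utf8_request(url: str) -> urllib.request.Request:
    """Build a GET request that states a preference for UTF-8.

    Hypothetical helper; the server is free to ignore the header.
    """
    return urllib.request.Request(url, headers={"Accept-Charset": "utf-8"})


url = "https://irclogs.ubuntu.com/2020/08/23/%23ubuntu.txt"
req = build_utf8_request(url)

# To actually fetch the raw bytes (network access required):
# raw = urllib.request.urlopen(req).read()
```

Reading the response as raw bytes, rather than letting a client guess the charset, keeps the decoding decision in your own code.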

briansemrau · 2023-01-19T07:23:48Z

I don't believe you can specify character encoding in HTTP requests. I'll try to contact the author of the bot that scrapes for irclogs.ubuntu.com to get some insight, or report a bug (no way the data has been encoded wrong for over a decade, right?...)

briansemrau · 2023-01-19T18:26:23Z

Found the solution. The .txt files use mixed encodings, line by line.

This dataset must be properly decoded before use. This can be done fairly simply:

https://github.com/mgedmin/irclog2html/blob/ab7759e4b54f146f9c585d2c71d321fbda5c1e1c/src/irclog2html/irclog2html.py#L199-L208

https://github.com/mgedmin/irclog2html/blob/ab7759e4b54f146f9c585d2c71d321fbda5c1e1c/src/irclog2html/irclog2html.py#L141-L154
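
The linked code is not reproduced here. As a minimal sketch of the general idea, assuming each line is either valid UTF-8 or a single-byte legacy encoding, one could decode line by line with a fallback. The `cp1252` fallback and the names `decode_line` and `decode_log` are assumptions made for illustration.

```python
def decode_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode one log line: try UTF-8 first, then an assumed legacy codec."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes the fallback codec cannot map become U+FFFD.
        return raw.decode(fallback, errors="replace")


def decode_log(data: bytes) -> str:
    """Decode a whole log file that may mix encodings between lines."""
    return "".join(decode_line(line) for line in data.splitlines(keepends=True))


# Two lines with the same text, stored in two different encodings.
mixed = "caf\u00e9\n".encode("utf-8") + "caf\u00e9\n".encode("cp1252")
decoded = decode_log(mixed)
print(decoded)
```

The key point is to read the file as bytes and decide the encoding per line, instead of decoding the whole file with a single codec.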

keunwoochoi · 2023-04-10T19:21:46Z

@briansemrau do you know if huggingface would decode this properly? i'm not sure where to look in https://github.com/huggingface/datasets/tree/main/src/datasets/utils

briansemrau · 2023-04-10T19:33:30Z

> do you know if huggingface would decode this properly?

I would not expect it to. This dataset has strange encoding to work around a specific technical problem with IRC compatibility.
You should use the code from the links I posted above to make sure the data is being properly decoded.

keunwoochoi · 2023-04-10T19:47:14Z

i see. thank you very much!
