
[Feature Request] UTF-8 text filter #19

Open · ljdarj opened this issue Sep 30, 2024 · 8 comments

ljdarj commented Sep 30, 2024

I'm currently doing a Java prototype for tukaani-project/xz#50 and so far the results look pretty good. My choice was to convert the text to SCSU, because that way I'm sure the conversion is reversible and it doesn't require bringing in something like ICU. Here are the results of the tests I did, using Українська кухня. Підручник from the C library's issue, after first removing the HTML prologue and epilogue the Internet Archive stuck in there:

  • UTF-8 file size: 2729604 bytes
  • SCSU file size: 1566123 bytes
  • KOI8-U file size: 1540610 bytes (non-reversible)
  • UTF-8 compressed file size: 425436 bytes
  • SCSU compressed file size: 399020 bytes
  • KOI8-U compressed file size: 394820 bytes (non-reversible)

I still need to polish the code before even considering a draft pull request, but so far so good.
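
A minimal sketch of the reversibility check, for anyone wanting to reproduce it. This is not the prototype's code: it leans on ICU4J's charset provider for the SCSU codec (exactly the dependency the prototype avoids), and it checks losslessness at the code-point level only; byte-level reversibility additionally requires the input to be valid UTF-8.

```java
import com.ibm.icu.charset.CharsetProviderICU;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ScsuRoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] utf8 = Files.readAllBytes(Path.of(args[0]));
        String text = new String(utf8, StandardCharsets.UTF_8);

        // ICU4J's provider exposes SCSU as an ordinary java.nio Charset.
        Charset scsu = new CharsetProviderICU().charsetForName("SCSU");
        byte[] encoded = text.getBytes(scsu);
        String decoded = new String(encoded, scsu);

        // The filter is only usable if the conversion round-trips losslessly.
        if (!text.equals(decoded))
            throw new IllegalStateException("SCSU round trip is not lossless");

        System.out.printf("UTF-8: %d bytes, SCSU: %d bytes%n",
                          utf8.length, encoded.length);
    }
}
```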

Larhzu (Member) commented Oct 3, 2024 via email

ljdarj (Author) commented Oct 5, 2024

I'll test UTF-16 and UTF-32 to see whether they improve things, but the issue I see with not forcing the user to specify the charset is that the only encodings that can be identified without trying to decode everything are the Unicode ones carrying a BOM. And much of the time even that wouldn't help, because UTF-8 tends to be written without a BOM.

And that's assuming decoding would even help: the ISO-8859 series or KOI8-B have unassigned bytes, which would let us declare that a text isn't encoded in them, and the ISO/IEC 2022 series is stateful, so invalid states would let us rule those out; but technically any octet stream is valid, say, KOI8-R or VSCII-1. It may be full of enough control characters for us to suspect it isn't text at all, but that wouldn't make it invalid.

So personally, I would require that either the user specifies the charset, the file carries a BOM, or the input is assumed to be UTF-8.
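
To illustrate how little a BOM check actually covers, here is a minimal sniffer sketch over the first four bytes of a file; it is an illustration, not proposed filter code. Note that the UTF-32LE pattern has to be tested before UTF-16LE, since FF FE is a prefix of FF FE 00 00.

```java
import java.util.Optional;

// Minimal BOM sniffer: only the Unicode encodings can be recognized this
// way, which is exactly the limitation discussed above.
public final class BomSniffer {
    public static Optional<String> detect(byte[] head) {
        // UTF-32 first: its LE BOM (FF FE 00 00) starts with the UTF-16LE BOM.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0xEF, 0xBB, 0xBF))       return Optional.of("UTF-8");
        if (startsWith(head, 0xFE, 0xFF))             return Optional.of("UTF-16BE");
        if (startsWith(head, 0xFF, 0xFE))             return Optional.of("UTF-16LE");
        return Optional.empty(); // no BOM: most UTF-8 files land here
    }

    private static boolean startsWith(byte[] head, int... bom) {
        if (head.length < bom.length) return false;
        for (int i = 0; i < bom.length; i++)
            if ((head[i] & 0xFF) != bom[i]) return false;
        return true;
    }
}
```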

Larhzu (Member) commented Oct 8, 2024 via email

ljdarj (Author) commented Oct 11, 2024

For an automatic conversion, then, I think the target ought to be SCSU, BOCU-1, and their ilk: pinning the text, for all intents and purposes, to a given Unicode block and escaping to UTF-16 for whatever falls outside is basically what a good encoder for them does... granting, of course, that converting to UTF-16 or UTF-32 before setting LZMA2 on it doesn't give bigger wins when all is said and done.

ljdarj (Author) commented Oct 19, 2024

I have a question before trying the UTF-16/UTF-32 conversions: for data which is 2 bytes (resp. 4 bytes) wide, we set lc to 0 and lp to 1 (resp. 2), right?
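
For reference, a sketch of how that tuning could be expressed with XZ for Java's LZMA2Options. The lc/lp values are the ones asked about above; the pb values are my assumption, matching pb to the same alignment.

```java
import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.UnsupportedOptionsException;

public class AlignmentTuning {
    // Returns LZMA2 options tuned for fixed-width code units:
    // lc=0 drops the literal context (the previous byte is usually just
    // the other half of a code unit), and lp/pb follow the unit width.
    public static LZMA2Options forUnitWidth(int bytesPerUnit)
            throws UnsupportedOptionsException {
        LZMA2Options opts = new LZMA2Options();
        if (bytesPerUnit == 2) {        // UTF-16, Big5, ...
            opts.setLcLp(0, 1);         // literals keyed on position mod 2
            opts.setPb(1);              // match positions with period 2 (assumption)
        } else if (bytesPerUnit == 4) { // UTF-32
            opts.setLcLp(0, 2);         // literals keyed on position mod 4
            opts.setPb(2);              // period 4 (also the default)
        }
        return opts;
    }
}
```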

Larhzu (Member) commented Oct 19, 2024 via email

ljdarj (Author) commented Feb 27, 2025

OK, so I ran a few more tests with other encodings and source files in the mix, just to make sure, but this time (except for the legacy encodings) recording only the final compressed sizes; after all, the point is to use them as filters. Here are all the results, including those from my earlier tests.
First, the Ukrainian text from the C library issue, with the Archive.org header and footer removed:

  • UTF-8 file size: 2 729 604 bytes
  • KOI8-U file size: 1 540 610 bytes (non-reversible)
  • UTF-8 compressed file size: 425 436 bytes
  • UTF-7 compressed file size: 456 672 bytes
  • UTF-16BE compressed file size: 403 040 bytes
  • UTF-16LE compressed file size: 405 216 bytes
  • UTF-32BE compressed file size: 426 908 bytes
  • UTF-32LE compressed file size: 429 824 bytes
  • BOCU-1 compressed file size: 416 192 bytes
  • SCSU compressed file size: 399 020 bytes
  • KOI8-U compressed file size: 394 820 bytes (non-reversible)

Second, a French text, as a case with relatively little difference in size between the legacy encoding and UTF-8:

  • UTF-8 file size: 35 585 bytes
  • ISO-8859-15 file size: 33 423 bytes (non-reversible)
  • UTF-8 compressed file size: 11 932 bytes
  • UTF-7 compressed file size: 12 132 bytes
  • UTF-16BE compressed file size: 11 864 bytes
  • UTF-16LE compressed file size: 11 884 bytes
  • UTF-32BE compressed file size: 12 268 bytes
  • UTF-32LE compressed file size: 12 292 bytes
  • BOCU-1 compressed file size: 12 440 bytes
  • SCSU compressed file size: 11 836 bytes
  • ISO-8859-15 compressed file size: 11 720 bytes (non-reversible)

Third, a Chinese novel with the Project Gutenberg header and footer removed to test for multi-byte legacy encodings:

  • UTF-8 file size: 2 048 248 bytes
  • Big5 file size: 1 378 613 bytes (non-reversible)
  • UTF-8 compressed file size: 694 784 bytes
  • UTF-7 compressed file size: 865 504 bytes
  • UTF-16BE compressed file size: 631 416 bytes
  • UTF-16LE compressed file size: 630 796 bytes
  • UTF-32BE compressed file size: 636 924 bytes
  • UTF-32LE compressed file size: 637 488 bytes
  • BOCU-1 compressed file size: 740 392 bytes
  • SCSU compressed file size: 694 460 bytes
  • Big5 compressed file size: 647 892 bytes (non-reversible)

Fourth, a plain-text copy of Thailand's constitution, to check what happens when there's a bigger difference in size between the legacy encoding and UTF-8:

  • UTF-8 file size: 596 578 bytes
  • TIS-620 file size: 204 762 bytes (non-reversible)
  • UTF-8 compressed file size: 53 532 bytes
  • UTF-7 compressed file size: 66 760 bytes
  • UTF-16BE compressed file size: 47 264 bytes
  • UTF-16LE compressed file size: 47 728 bytes
  • UTF-32BE compressed file size: 50 400 bytes
  • UTF-32LE compressed file size: 50 948 bytes
  • BOCU-1 compressed file size: 47 704 bytes
  • SCSU compressed file size: 46 648 bytes
  • TIS-620 compressed file size: 45 812 bytes (non-reversible)

Lastly, a Greek-Bulgarian parallel corpus from CLARIN, to check a case where no legacy encoding is applicable:

  • UTF-8 file size: 151 118 982 bytes
  • UTF-8 compressed file size: 17 720 280 bytes
  • UTF-7 compressed file size: 19 954 544 bytes
  • UTF-16BE compressed file size: 17 530 864 bytes
  • UTF-16LE compressed file size: 17 614 460 bytes
  • UTF-32BE compressed file size: 19 894 660 bytes
  • UTF-32LE compressed file size: 19 956 188 bytes
  • BOCU-1 compressed file size: 16 779 164 bytes
  • SCSU compressed file size: 16 547 028 bytes

So, before my opinion, a few notes. First, the encodings I tuned the pb/lp/lc values for are UTF-16 and Big5 (both getting the 2-byte tuning) and UTF-32 (getting the 4-byte one); for all the others I left the default values, which means that, except for the Chinese novel, the comparison between BOCU-1, SCSU, UTF-8, and the legacy encoding is with LZMA2 equally tuned, that is to say not particularly tuned at all. Second, the SCSU implementation used is the reference Java implementation from UTS #6, and the tool used for converting the test files to BOCU-1 is the compiled reference tool from UTN #6.

Now, my opinion. First, as could be expected from tukaani-project/xz#50, UTF-8 is the worst performer… except for UTF-7, which I expected nothing from but which still managed to disappoint. Second, where applicable, the legacy encoding performs best, except for the Chinese text, which is an outlier in more than one way. Third, of the others, SCSU is the best performer, except for the Chinese text, where not only does UTF-16LE unusually beat its big-endian self, it also beats SCSU and even the legacy encoding (incidentally, UTF-7 is particularly bad there). Fourth, UTF-32 performs worse than UTF-16, and the only text on which it shines is the Chinese one. Lastly, BOCU-1 generally lands between SCSU and UTF-16BE, except for the French text, where it's especially bad: even UTF-7 beats it. But then, that was the only case where the reference tool told me the generated BOCU-1 text was bigger than its UTF-8 self.
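
For anyone who wants to re-run this kind of comparison, here is a minimal sketch of the measurement loop, assuming XZ for Java and default LZMA2 options (the alignment tuning from earlier would be applied per encoding). It covers only the charsets the JDK ships with; SCSU, BOCU-1, and UTF-7 need external codecs such as ICU4J's charset provider.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.XZOutputStream;

public class TranscodeBench {
    // Encode the text in the given charset, compress it, return the .xz size.
    static long xzSize(String text, Charset cs, LZMA2Options opts) throws Exception {
        byte[] data = text.getBytes(cs);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (XZOutputStream xz = new XZOutputStream(buf, opts)) {
            xz.write(data);
        }
        return buf.size();
    }

    public static void main(String[] args) throws Exception {
        String text = Files.readString(Path.of(args[0]), StandardCharsets.UTF_8);
        for (String name : new String[] {"UTF-8", "UTF-16BE", "UTF-16LE",
                                         "UTF-32BE", "UTF-32LE"}) {
            System.out.printf("%-9s %d bytes%n", name,
                    xzSize(text, Charset.forName(name), new LZMA2Options()));
        }
    }
}
```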

Larhzu (Member) commented Mar 8, 2025 via email
