
[Feature Request] UTF-8 text filter #19

Open · ljdarj opened this issue Sep 30, 2024 · 8 comments

ljdarj commented Sep 30, 2024

I'm currently doing a Java prototype for tukaani-project/xz#50 and so far the results look pretty good. My choice was to convert the text to SCSU, because that way I'm sure the conversion is reversible and it doesn't require bringing in something like ICU. Here are the results of the tests I did, using Українська кухня. Підручник from the C library's issue, after first removing the HTML prologue and epilogue the Internet Archive stuck in there:

  • UTF-8 file size: 2729604 bytes
  • SCSU file size: 1566123 bytes
  • KOI8-U file size: 1540610 bytes (non-reversible)
  • UTF-8 compressed file size: 425436 bytes
  • SCSU compressed file size: 399020 bytes
  • KOI8-U compressed file size: 394820 bytes (non-reversible)

I still need to polish the code before even considering a draft pull request, but so far so good.
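
A minimal sketch of the reversibility check, for anyone wanting to reproduce it. This is not the prototype's code: it leans on ICU4J's charset provider for the SCSU codec (exactly the dependency the prototype avoids), and it checks losslessness at the code-point level only; byte-level reversibility additionally requires the input to be valid UTF-8.

```java
import com.ibm.icu.charset.CharsetProviderICU;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ScsuRoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] utf8 = Files.readAllBytes(Path.of(args[0]));
        String text = new String(utf8, StandardCharsets.UTF_8);

        // ICU4J's provider exposes SCSU as an ordinary java.nio Charset.
        Charset scsu = new CharsetProviderICU().charsetForName("SCSU");
        byte[] encoded = text.getBytes(scsu);
        String decoded = new String(encoded, scsu);

        // The filter is only usable if the conversion round-trips losslessly.
        if (!text.equals(decoded))
            throw new IllegalStateException("SCSU round trip is not lossless");

        System.out.printf("UTF-8: %d bytes, SCSU: %d bytes%n",
                          utf8.length, encoded.length);
    }
}
```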

Larhzu (Member) commented Oct 3, 2024 via email

ljdarj (Author) commented Oct 5, 2024

I'll test UTF-16 and UTF-32 to see whether they improve things, but the issue I see with not forcing the user to specify the charset is that the only encodings that can be identified without trying to decode everything are the Unicode ones carrying a BOM. And much of the time even that wouldn't help, because UTF-8 tends to be written without a BOM.

And that's assuming decoding would even help: the ISO-8859 series or KOI8-B have unassigned bytes, which would let us declare that a text isn't encoded in them, and the ISO/IEC 2022 series is stateful, so invalid states would let us rule those out; but technically any octet stream is valid, say, KOI8-R or VSCII-1. It may be full of enough control characters for us to suspect it isn't text at all, but that wouldn't make it invalid.

So personally, I would require that either the user specifies the charset, the file carries a BOM, or the input is assumed to be UTF-8.
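
To illustrate how little a BOM check actually covers, here is a minimal sniffer sketch over the first four bytes of a file; it is an illustration, not proposed filter code. Note that the UTF-32LE pattern has to be tested before UTF-16LE, since FF FE is a prefix of FF FE 00 00.

```java
import java.util.Optional;

// Minimal BOM sniffer: only the Unicode encodings can be recognized this
// way, which is exactly the limitation discussed above.
public final class BomSniffer {
    public static Optional<String> detect(byte[] head) {
        // UTF-32 first: its LE BOM (FF FE 00 00) starts with the UTF-16LE BOM.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0xEF, 0xBB, 0xBF))       return Optional.of("UTF-8");
        if (startsWith(head, 0xFE, 0xFF))             return Optional.of("UTF-16BE");
        if (startsWith(head, 0xFF, 0xFE))             return Optional.of("UTF-16LE");
        return Optional.empty(); // no BOM: most UTF-8 files land here
    }

    private static boolean startsWith(byte[] head, int... bom) {
        if (head.length < bom.length) return false;
        for (int i = 0; i < bom.length; i++)
            if ((head[i] & 0xFF) != bom[i]) return false;
        return true;
    }
}
```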

Larhzu (Member) commented Oct 8, 2024 via email

ljdarj (Author) commented Oct 11, 2024

For an automatic conversion, then, I think the target ought to be SCSU, BOCU-1, and their ilk: pinning the text, for all intents and purposes, to a given Unicode block and escaping to UTF-16 for whatever falls outside is basically what a good encoder for them does... granting, of course, that converting to UTF-16 or UTF-32 before setting LZMA2 on it doesn't give bigger wins when all is said and done.

ljdarj (Author) commented Oct 19, 2024

I have a question before trying the UTF-16/UTF-32 conversions: for data which is 2 bytes (resp. 4 bytes) wide, we set lc to 0 and lp to 1 (resp. 2), right?
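
For reference, a sketch of how that tuning could be expressed with XZ for Java's LZMA2Options. The lc/lp values are the ones asked about above; the pb values are my assumption, matching pb to the same alignment.

```java
import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.UnsupportedOptionsException;

public class AlignmentTuning {
    // Returns LZMA2 options tuned for fixed-width code units:
    // lc=0 drops the literal context (the previous byte is usually just
    // the other half of a code unit), and lp/pb follow the unit width.
    public static LZMA2Options forUnitWidth(int bytesPerUnit)
            throws UnsupportedOptionsException {
        LZMA2Options opts = new LZMA2Options();
        if (bytesPerUnit == 2) {        // UTF-16, Big5, ...
            opts.setLcLp(0, 1);         // literals keyed on position mod 2
            opts.setPb(1);              // match positions with period 2 (assumption)
        } else if (bytesPerUnit == 4) { // UTF-32
            opts.setLcLp(0, 2);         // literals keyed on position mod 4
            opts.setPb(2);              // period 4 (also the default)
        }
        return opts;
    }
}
```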

Larhzu (Member) commented Oct 19, 2024 via email

ljdarj (Author) commented Feb 27, 2025

OK, so I ran a few more tests with other encodings and source files in the mix, just to make sure, but this time (except for the legacy encodings) recording only the final compressed sizes; after all, the point is to use them as filters. Here are all the results, including those from my earlier tests.
First, the Ukrainian text from the C library issue, with the Archive.org header and footer removed:

  • UTF-8 file size: 2 729 604 bytes
  • KOI8-U file size: 1 540 610 bytes (non-reversible)
  • UTF-8 compressed file size: 425 436 bytes
  • UTF-7 compressed file size: 456 672 bytes
  • UTF-16BE compressed file size: 403 040 bytes
  • UTF-16LE compressed file size: 405 216 bytes
  • UTF-32BE compressed file size: 426 908 bytes
  • UTF-32LE compressed file size: 429 824 bytes
  • BOCU-1 compressed file size: 416 192 bytes
  • SCSU compressed file size: 399 020 bytes
  • KOI8-U compressed file size: 394 820 bytes (non-reversible)

Second, a French text, as a case with relatively little difference in size between the legacy encoding and UTF-8:

  • UTF-8 file size: 35 585 bytes
  • ISO-8859-15 file size: 33 423 bytes (non-reversible)
  • UTF-8 compressed file size: 11 932 bytes
  • UTF-7 compressed file size: 12 132 bytes
  • UTF-16BE compressed file size: 11 864 bytes
  • UTF-16LE compressed file size: 11 884 bytes
  • UTF-32BE compressed file size: 12 268 bytes
  • UTF-32LE compressed file size: 12 292 bytes
  • BOCU-1 compressed file size: 12 440 bytes
  • SCSU compressed file size: 11 836 bytes
  • ISO-8859-15 compressed file size: 11 720 bytes (non-reversible)

Third, a Chinese novel with the Project Gutenberg header and footer removed to test for multi-byte legacy encodings:

  • UTF-8 file size: 2 048 248 bytes
  • Big5 file size: 1 378 613 bytes (non-reversible)
  • UTF-8 compressed file size: 694 784 bytes
  • UTF-7 compressed file size: 865 504 bytes
  • UTF-16BE compressed file size: 631 416 bytes
  • UTF-16LE compressed file size: 630 796 bytes
  • UTF-32BE compressed file size: 636 924 bytes
  • UTF-32LE compressed file size: 637 488 bytes
  • BOCU-1 compressed file size: 740 392 bytes
  • SCSU compressed file size: 694 460 bytes
  • Big5 compressed file size: 647 892 bytes (non-reversible)

Fourth, a plain-text copy of Thailand's constitution, to check what happens when there's a bigger difference in size between the legacy encoding and UTF-8:

  • UTF-8 file size: 596 578 bytes
  • TIS-620 file size: 204 762 bytes (non-reversible)
  • UTF-8 compressed file size: 53 532 bytes
  • UTF-7 compressed file size: 66 760 bytes
  • UTF-16BE compressed file size: 47 264 bytes
  • UTF-16LE compressed file size: 47 728 bytes
  • UTF-32BE compressed file size: 50 400 bytes
  • UTF-32LE compressed file size: 50 948 bytes
  • BOCU-1 compressed file size: 47 704 bytes
  • SCSU compressed file size: 46 648 bytes
  • TIS-620 compressed file size: 45 812 bytes (non-reversible)

Lastly, a Greek-Bulgarian parallel corpus from CLARIN, to check a case where no legacy encoding is applicable:

  • UTF-8 file size: 151 118 982 bytes
  • UTF-8 compressed file size: 17 720 280 bytes
  • UTF-7 compressed file size: 19 954 544 bytes
  • UTF-16BE compressed file size: 17 530 864 bytes
  • UTF-16LE compressed file size: 17 614 460 bytes
  • UTF-32BE compressed file size: 19 894 660 bytes
  • UTF-32LE compressed file size: 19 956 188 bytes
  • BOCU-1 compressed file size: 16 779 164 bytes
  • SCSU compressed file size: 16 547 028 bytes

So, before my opinion, a few notes. First, the encodings I tuned the pb/lp/lc values for are UTF-16 and Big5 (both getting the 2-byte tuning) and UTF-32 (getting the 4-byte one); for all the others I left the default values, which means that, except for the Chinese novel, the comparison between BOCU-1, SCSU, UTF-8, and the legacy encoding is with LZMA2 equally tuned, that is to say not particularly tuned at all. Second, the SCSU implementation used is the reference Java implementation from UTS #6, and the tool used for converting the test files to BOCU-1 is the compiled reference tool from UTN #6.

Now, my opinion. First, as could be expected from tukaani-project/xz#50, UTF-8 is the worst performer… except for UTF-7, which I expected nothing from but which still managed to disappoint. Second, where applicable, the legacy encoding performs best, except for the Chinese text, which is an outlier in more than one way. Third, of the others, SCSU is the best performer, except for the Chinese text, where not only does UTF-16LE unusually beat its big-endian self, it also beats SCSU and even the legacy encoding (incidentally, UTF-7 is particularly bad there). Fourth, UTF-32 performs worse than UTF-16, and the only text on which it shines is the Chinese one. Lastly, BOCU-1 generally lands between SCSU and UTF-16BE, except for the French text, where it's especially bad: even UTF-7 beats it. But then, that was the only case where the reference tool told me the generated BOCU-1 text was bigger than its UTF-8 self.
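
For anyone who wants to re-run this kind of comparison, here is a minimal sketch of the measurement loop, assuming XZ for Java and default LZMA2 options (the alignment tuning from earlier would be applied per encoding). It covers only the charsets the JDK ships with; SCSU, BOCU-1, and UTF-7 need external codecs such as ICU4J's charset provider.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.XZOutputStream;

public class TranscodeBench {
    // Encode the text in the given charset, compress it, return the .xz size.
    static long xzSize(String text, Charset cs, LZMA2Options opts) throws Exception {
        byte[] data = text.getBytes(cs);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (XZOutputStream xz = new XZOutputStream(buf, opts)) {
            xz.write(data);
        }
        return buf.size();
    }

    public static void main(String[] args) throws Exception {
        String text = Files.readString(Path.of(args[0]), StandardCharsets.UTF_8);
        for (String name : new String[] {"UTF-8", "UTF-16BE", "UTF-16LE",
                                         "UTF-32BE", "UTF-32LE"}) {
            System.out.printf("%-9s %d bytes%n", name,
                    xzSize(text, Charset.forName(name), new LZMA2Options()));
        }
    }
}
```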

Larhzu (Member) commented Mar 8, 2025 via email
