[Feature Request] UTF-8 text filter #19
Excluding the headers and footers, the file has 125 different Unicode code points. Thus one byte per code point is possible. One needs something to tell which bytes match which code points, ideally without requiring the user to specify the charset. That information will take some space.
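To make that concrete, here is a minimal sketch (in Java, since a Java prototype is discussed later in this thread) of how an encoder could check whether a one-byte-per-code-point mapping is even feasible; the file name and the handful of byte values reserved for escapes are illustrative assumptions, not part of any existing filter:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.TreeSet;

public class CodePointCount {
    public static void main(String[] args) throws IOException {
        // Read the whole file as UTF-8 text (file name is just an example).
        String text = new String(
                Files.readAllBytes(Paths.get("ukrainskakuhnya1998_djvu.txt")),
                StandardCharsets.UTF_8);

        // Collect the distinct code points actually used in the text.
        TreeSet<Integer> used = new TreeSet<>();
        text.codePoints().forEach(used::add);

        // If the count fits into 256 byte values, minus a few reserved as
        // escapes, every code point can be represented by a single byte.
        System.out.println("Distinct code points: " + used.size());
        System.out.println("One byte per code point possible: "
                + (used.size() <= 256 - 4));
    }
}
```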
Also try UTF-16BE for comparison. I got 403156 bytes, but my input was two bytes bigger than yours.
iconv -f utf8 -t utf16be < ukrainskakuhnya1998_djvu.txt | xz --lzma2=pb=1,lp=1 | wc -c
SCSU sounds more complex than some of the other ideas, which means it would also need to be clearly better to be worth it. Still, it's nice to see it tested. :-)
SCSU itself is already a compression method. A filter doesn't necessarily need to make the file smaller; it needs to make the file easier to compress. For example, UTF-16 makes the file 13 % bigger, but it then compresses better than UTF-8 in the case of this test file.
I'll test UTF-16 and UTF-32 to see whether that improves things, but the issue I see with not forcing the user to specify the charset is that the only encodings I see as identifiable without trying to decode everything are the Unicode ones with a BOM. Much of the time that wouldn't help, because UTF-8 tends to be written without a BOM. And that's assuming decoding would even help: while the ISO-8859 series or KOI8-B has unassigned bytes, which would allow us to declare that a text isn't encoded in them, and the ISO/IEC 2022 series is stateful, so invalid states would let us rule those out, technically any octet stream is valid, say, KOI8-R or VSCII-1. It might be full of enough control characters for us to think it's not even text at all, but that wouldn't make it invalid. So personally I would be for either the user specifies the charset, there's a BOM, or it has to be UTF-8.
Sorry, I think you misunderstood me. Input is UTF-8, no need to guess that or ask from the user.
If you knew the language, you could manually tell the filter to convert to slightly modified KOI8-U. One could reserve a few control bytes as escapes so that Unicode codepoints not in KOI8-U and also invalid UTF-8 could be encoded too.
My point was that it's nicer for users if the encoder can determine the UTF-8-to-8-bit mapping automatically. That is, if the 8-bit mapping is a good method at all; it's not the only way to go, as your SCSU result shows. It's about figuring out what is simple, small code and still gives a good result.
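As a rough illustration of that idea, and only that (the class, the escape byte, and the three-byte escaped form below are made up for the example and are not any existing xz filter format): the encoder assigns a byte value to each code point it has seen and escapes the rest. In practice the mapping table itself would also have to be stored in the filtered stream so the decoder can undo it, which is the space cost mentioned earlier.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: an automatically built code-point-to-byte mapping with a
// reserved escape byte. This is not the format of any existing xz filter.
public class ByteMapper {
    private static final int ESCAPE = 0xFF;               // assumption: one escape byte
    private final Map<Integer, Integer> toByte = new HashMap<>();

    // First pass: assign one byte value to each distinct code point seen,
    // leaving the escape byte unused.
    public ByteMapper(String text) {
        text.codePoints().distinct().sorted().forEach(cp -> {
            if (toByte.size() < ESCAPE)
                toByte.put(cp, toByte.size());
        });
    }

    // Second pass: mapped code points become one byte each; anything else is
    // written as ESCAPE followed by the code point in three raw bytes.
    public byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        text.codePoints().forEach(cp -> {
            Integer b = toByte.get(cp);
            if (b != null) {
                out.write(b);
            } else {
                out.write(ESCAPE);
                out.write((cp >>> 16) & 0xFF);
                out.write((cp >>> 8) & 0xFF);
                out.write(cp & 0xFF);
            }
        });
        return out.toByteArray();
    }
}
```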
For an automatic conversion, then, I think the target ought to be SCSU, BOCU-1 and their ilk: pinning the text, for all intents and purposes, to a given Unicode block and escaping to UTF-16 for whatever falls outside it is basically what a good encoder for them does... admitting, of course, that converting to UTF-16 or UTF-32 before setting LZMA2 on it doesn't give bigger wins when all is said and done.
I have a question before trying the UTF-16/UTF-32 conversions: for data which is 2 bytes (resp. 4 bytes) wide, we set lc to 0 and lp to 1 (resp. 2), right?
UTF-16BE: pb=1, lp=1, lc=3
UTF-32BE: pb=2, lp=2, lc=2
pb and lp are about alignment. lp + lc must not exceed 4, thus one has to use lc=2 with lp=2. Also:
UTF-8: pb=0, lp=0, lc=3 (or sometimes lc=4)
With UTF-16 and UTF-32, big endian should compress better than little endian.
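For what it's worth, the same parameter choices can be expressed through XZ for Java's LZMA2Options; a small sketch, assuming the data has already been converted to the encoding in question before compression:

```java
import java.io.IOException;
import org.tukaani.xz.LZMA2Options;

public class TextLzma2Options {
    // UTF-8 input: no alignment to exploit.
    static LZMA2Options forUtf8() throws IOException {
        LZMA2Options o = new LZMA2Options();
        o.setPb(0);
        o.setLcLp(3, 0);   // lc=4, lp=0 is sometimes slightly better
        return o;
    }

    // UTF-16BE input: 2-byte alignment; lc + lp may not exceed 4.
    static LZMA2Options forUtf16be() throws IOException {
        LZMA2Options o = new LZMA2Options();
        o.setPb(1);
        o.setLcLp(3, 1);
        return o;
    }

    // UTF-32BE input: 4-byte alignment, so lc has to drop to 2.
    static LZMA2Options forUtf32be() throws IOException {
        LZMA2Options o = new LZMA2Options();
        o.setPb(2);
        o.setLcLp(2, 2);
        return o;
    }
}
```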
OK, so I made a few more tests with other encodings and source files in the mix just to make sure, but this time (except for the legacy encoding) recording only the final compressed sizes since, after all, the point is to use them as filters. Here are all the results, including those I made previously:
Second, a French text, for something with a relatively small difference in size between the legacy encoding and UTF-8:
Third, a Chinese novel with the Project Gutenberg header and footer removed to test for multi-byte legacy encodings:
Fourth, a plain-text copy of Thailand's constitution, to check what happens when there's a bigger difference in size between the legacy encoding and UTF-8:
Lastly, a Greek-Bulgarian parallel corpus in CLARIN, to check a case where there's no legacy encoding applicable:
So, before my opinion, a few notes. First, the encodings I tuned the pb/lp/lc values for are UTF-16 and Big5 (both getting the 2-byte tuning) and UTF-32 (getting the 4-byte one); for all the others I left the default values, which means that, except for the Chinese novel, the comparison between BOCU-1, SCSU, UTF-8 and the legacy encoding is for LZMA2 equally tuned, that is to say not particularly. Second, the SCSU implementation used is the reference Java implementation from UTS #6, and the tool used for converting the test files to BOCU-1 is the compiled reference tool from UTN #6.

Now, my opinion. First, as could be expected from tukaani-project/xz#50, UTF-8 is the worst performer… except for UTF-7, which I expected nothing from but which still manages to disappoint me. Second, when applicable, the legacy encoding is the best performer, except for the Chinese text, which is an outlier in more than one way. Third, of the others, SCSU is the best performer, except for the Chinese text, where not only does UTF-16LE unusually beat its big-endian self, it also beats SCSU and even the legacy encoding (incidentally, UTF-7 is particularly bad there). Fourth, UTF-32 performs worse than UTF-16, and the only text on which it shines is the Chinese one. Lastly, BOCU-1 generally lands between SCSU's and UTF-16BE's performance, except for the French text, where it's especially bad: even UTF-7 beats it. But then, it was the only case where the reference tool told me the generated BOCU-1 text was bigger than its UTF-8 self.
Thanks! We discussed the results a bit on IRC a few days ago, but I realized I should reply here too so that other people see that this thread hasn't been ignored.
I wrote an early prototype of my idea. There are a few things that should be done better, but even in the current form it should be useful for testing if the idea is promising or not. This is the same code I linked on IRC.
https://tukaani.org/xz/utf8-filter15.c
I'm currently working on a Java prototype for tukaani-project/xz#50 and so far the results look pretty good. My choice was to convert the text to SCSU because that way I'm sure the conversion is reversible and it doesn't require bringing in something like ICU. Here are the results of the tests I did, using Українська кухня. Підручник from the C library's issue, but first getting rid of the HTML prologue and epilogue the Internet Archive stuck in there:
I still need to polish the code before even considering a draft pull request but so far so good.
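For anyone who wants to reproduce this kind of comparison with XZ for Java directly, here is a rough sketch of a measurement harness (UTF-8 vs. UTF-16BE only; a hypothetical SCSU byte array from a prototype like the one above would go through the same xzSize helper):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.XZOutputStream;

public class CompareEncodings {
    public static void main(String[] args) throws Exception {
        // Read the test file as UTF-8 text.
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                                 StandardCharsets.UTF_8);

        LZMA2Options utf8Opts = new LZMA2Options();
        utf8Opts.setPb(0);
        utf8Opts.setLcLp(3, 0);

        LZMA2Options utf16Opts = new LZMA2Options();
        utf16Opts.setPb(1);
        utf16Opts.setLcLp(3, 1);

        System.out.println("UTF-8:    "
                + xzSize(text.getBytes(StandardCharsets.UTF_8), utf8Opts));
        System.out.println("UTF-16BE: "
                + xzSize(text.getBytes(StandardCharsets.UTF_16BE), utf16Opts));
    }

    // Compress in memory and return the size of the resulting .xz stream.
    static int xzSize(byte[] data, LZMA2Options opts) throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (XZOutputStream xz = new XZOutputStream(baos, opts)) {
            xz.write(data);
        }
        return baos.size();
    }
}
```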