[Feature Request]: BOCU-1 for unicode multibyte chars text compression #2551

@HDDen

Hello! How about adding BOCU-1 compression for multilingual Unicode messages?
BOCU-1 is described in Unicode Technical Note #6: https://www.unicode.org/notes/tn6/

Here are a few examples:

Это было моё первое принятое сообщение... 👩👨🧑👧🧒👨‍🦱👸🤶👮‍♀️🕵️‍♀️
(English: "This was my first received message...")
input: 143 UTF-8 bytes
output: 90 BOCU-1 bytes
BOCU-1/UTF-8: 0.629371
Заменил антенну на 5-ягу. Если кто через нее ходит, дайте обратную связь
(English: "Swapped the antenna for a 5-element Yagi. If anyone is routing through it, please give feedback")
input: 129 UTF-8 bytes
output: 79 BOCU-1 bytes
BOCU-1/UTF-8: 0.612403
Доброго утра всем! 17,5 ° C и солнце 📡 )))
(English: "Good morning, everyone! 17.5 °C and sunny")
input: 68 UTF-8 bytes
output: 51 BOCU-1 bytes
BOCU-1/UTF-8: 0.750000
Погодка КАЙФ
(English: "The weather is AWESOME")
input: 23 UTF-8 bytes
output: 13 BOCU-1 bytes
BOCU-1/UTF-8: 0.565217
Первый рабочий день после длинных выходных
(English: "First working day after the long weekend")
input: 79 UTF-8 bytes
output: 43 BOCU-1 bytes
BOCU-1/UTF-8: 0.544304
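
For anyone who wants to reproduce the numbers above without the Win32 tool, here is a rough Python sketch that only counts how many bytes BOCU-1 would emit; it is not an encoder. The difference ranges and the "previous code point" rule are written down as I understand them from UTN #6, so treat them as assumptions to check against the note. The names `bocu1_length`, `next_prev` and `diff_length` are made up for this example.

```python
# Sketch only: counts BOCU-1 output bytes, does not produce them.
# Constants are assumed from UTN #6; verify against the note before relying on them.
ASCII_PREV = 0x40  # initial / reset value of the "previous" state


def next_prev(cp: int) -> int:
    """State used as the base for the next difference (assumed per UTN #6)."""
    if 0x3040 <= cp <= 0x309F:          # Hiragana: middle of the block
        return 0x3070
    if 0x4E00 <= cp <= 0x9FA5:          # CJK ideographs
        return 0x4E00 + 10513
    if 0xAC00 <= cp <= 0xD7A3:          # Hangul syllables: middle of the range
        return (0xD7A3 + 0xAC00) // 2
    return (cp & ~0x7F) + ASCII_PREV    # middle of the 128-code-point block


def diff_length(diff: int) -> int:
    """Bytes needed for one signed difference (ranges assumed per UTN #6)."""
    if -64 <= diff <= 63:
        return 1
    if -10513 <= diff <= 10512:
        return 2
    if -187660 <= diff <= 187659:
        return 3
    return 4


def bocu1_length(text: str) -> int:
    prev = ASCII_PREV
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp <= 0x20:
            # Space and C0 controls are emitted as themselves (one byte).
            total += 1
            if cp != 0x20:              # space keeps the state, controls reset it
                prev = ASCII_PREV
        else:
            total += diff_length(cp - prev)
            prev = next_prev(cp)
    return total


for sample in ("Погодка КАЙФ", "Первый рабочий день после длинных выходных"):
    utf8_len = len(sample.encode("utf-8"))
    bocu_len = bocu1_length(sample)
    print(f"{sample}: {utf8_len} UTF-8, {bocu_len} BOCU-1, ratio {bocu_len / utf8_len:.6f}")
# Погодка КАЙФ: 23 UTF-8, 13 BOCU-1, ratio 0.565217
# Первый рабочий день после длинных выходных: 79 UTF-8, 43 BOCU-1, ratio 0.544304
```

The saving comes from same-script text: after the first Cyrillic letter costs two bytes, every following letter in the same 128-code-point block fits in a one-byte difference, and spaces cost one byte without disturbing the state.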

You can download and test the Win32 console implementation here: https://www.unicode.org/notes/tn6/bocu1.exe

Alternatively, there is the "UCF" encoding, which also reduces the size overhead of text containing characters outside the a-z range: https://github.com/hyoo-ru/mam_mol/tree/master/charset/ucf
