Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use TextEncoder and TextDecoder for utf8 strings #4513

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

seia-soto
Copy link
Member

fixes #4424

This PR replaces punycode encoder and decoder with TextEncoder and TextDecoder for utf8 strings.

  • BOM character \ufeff should be skipped when decoding to ensure the original form
  • Uint8Array.subarray doesn't copy the array but provides a direct interface to subarray
  • TextEncoder.encodeInto doesn't produce EOL character ( NULL, U+0000 ) but we don't care because TextDecoder can stop nicely when provided buffer ends

To be safe:

  • The performance should be evaluated carefully
  • The output binary size needs to be evaluated carefully

@seia-soto seia-soto added the PR: Internal 🏠 Changes only affect internals label Dec 10, 2024
@seia-soto seia-soto self-assigned this Dec 10, 2024
@seia-soto seia-soto requested a review from remusao as a code owner December 10, 2024 08:29
@seia-soto
Copy link
Member Author

seia-soto commented Dec 11, 2024

> (6635933 / 6634657) * 100 // raw > ads + trackers + annoyances (104550 network + 69076 hide)
100.01923234313395

master, af1a9c9

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49534 network + 38171 hide)
 + raw 3358745 bytes
 + gzip 1783830 bytes
 + brotli 1467652 bytes
> ads (0 network + 38171 hide)
 + raw 1825421 bytes
 + gzip 880805 bytes
 + brotli 708974 bytes
> ads (49534 network + 0 hide)
 + raw 1533629 bytes
 + gzip 905016 bytes
 + brotli 765458 bytes
> ads + trackers (102405 network + 38324 hide)
 + raw 4958193 bytes
 + gzip 2635039 bytes
 + brotli 2172546 bytes
> ads + trackers (0 network + 38324 hide)
 + raw 1847893 bytes
 + gzip 890432 bytes
 + brotli 715577 bytes
> ads + trackers (102405 network + 0 hide)
 + raw 3110601 bytes
 + gzip 1745248 bytes
 + brotli 1459959 bytes
> ads + trackers + annoyances (104585 network + 69043 hide)
 + raw 6635933 bytes
 + gzip 3468626 bytes
 + brotli 2862312 bytes
> ads + trackers + annoyances (0 network + 69043 hide)
 + raw 3455897 bytes
 + gzip 1674036 bytes
 + brotli 1363518 bytes
> ads + trackers + annoyances (104585 network + 0 hide)
 + raw 3180345 bytes
 + gzip 1793551 bytes
 + brotli 1501761 bytes

seia-soto:textencoder, 1de005e

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49501 network + 38172 hide)
 + raw 3357509 bytes
 + gzip 1782885 bytes
 + brotli 1468373 bytes
> ads (0 network + 38172 hide)
 + raw 1824997 bytes
 + gzip 880743 bytes
 + brotli 706714 bytes
> ads (49501 network + 0 hide)
 + raw 1532817 bytes
 + gzip 904485 bytes
 + brotli 764578 bytes
> ads + trackers (102370 network + 38325 hide)
 + raw 4957369 bytes
 + gzip 2634088 bytes
 + brotli 2171825 bytes
> ads + trackers (0 network + 38325 hide)
 + raw 1847925 bytes
 + gzip 890324 bytes
 + brotli 719134 bytes
> ads + trackers (102370 network + 0 hide)
 + raw 3109745 bytes
 + gzip 1744659 bytes
 + brotli 1459817 bytes
> ads + trackers + annoyances (104550 network + 69076 hide)
 + raw 6634657 bytes
 + gzip 3468231 bytes
 + brotli 2859315 bytes
> ads + trackers + annoyances (0 network + 69076 hide)
 + raw 3455473 bytes
 + gzip 1674098 bytes
 + brotli 1362984 bytes
> ads + trackers + annoyances (104550 network + 0 hide)
 + raw 3179489 bytes
 + gzip 1793145 bytes
 + brotli 1501288 bytes

@seia-soto seia-soto requested a review from chrmod December 11, 2024 07:45
@seia-soto
Copy link
Member Author

seia-soto commented Dec 12, 2024

> 147.50420889870574 / 156.1296767089117 // benchEngineDeserialization
0.944754463135876
> 147.50420889870574 / 148.7394726802865 // benchEngineSerialization
0.9916951179177841

seia-soto:textencoder, 65764e9

  • benchEngineDeserialization: 147.50420889870574 op/s
  • benchEngineSerialization: 147.50420889870574 op/s

master, de7bfb5

  • benchEngineDeserialization: 156.1296767089117 op/s
  • benchEngineSerialization: 148.7394726802865 op/s

@chrmod chrmod added PR: Bug Fix 🐛 Increment patch version when merged and removed PR: Internal 🏠 Changes only affect internals labels Dec 23, 2024
@seia-soto
Copy link
Member Author

This PR is awaiting for the final review. Further changes are expected to be categorized as a performance improvement.

Copy link
Member

@chrmod chrmod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other uses of encode in this file?

@@ -20,6 +20,9 @@ export const EMPTY_UINT32_ARRAY = new Uint32Array(0);
// Check if current architecture is little endian
const LITTLE_ENDIAN: boolean = new Int8Array(new Int16Array([1]).buffer)[0] === 1;

// TextEncoder doesn't need to be recreated every time unlike TextDecoder
const TEXT_ENCODER = new TextEncoder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively you could do:

const encoder = new TextEncoder();
const encode = encoder.encode.bind(encoder);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encode is already exported by punycode. Would you like to have another variable name?

Copy link
Member

@chrmod chrmod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a test for the exact filter that was a problem in #4424

Also, would be great to have a fuzz test setup to test serialize/deserialize on random data

@seia-soto seia-soto force-pushed the textencoder branch 6 times, most recently from f7d75c0 to 3bfc064 Compare March 6, 2025 15:01
chore: add description to smaz structure

fix: out of range error

fix: ASCII and Smaz is not compatible

chore: remove unused padding
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR: Bug Fix 🐛 Increment patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Overflow error when loading uBO quick fix list
4 participants