Adopt GB18030-2022 #54996

Open
annevk opened this issue Sep 18, 2024 · 4 comments
Labels
icu Issues and PRs related to the ICU dependency. web-standards Issues and PRs related to Web APIs

Comments

annevk commented Sep 18, 2024

If you implement gb18030/GBK, see whatwg/encoding#336.

@RedYetiDev RedYetiDev added the web-standards Issues and PRs related to Web APIs label Sep 18, 2024
@RedYetiDev
Member

CC @nodejs/web-standards

RedYetiDev commented Sep 18, 2024

GB18030 has been implemented in ICU 73.2, and we are using ICU 74.2 (as of Node.js 22.9.0). Is this issue already resolved?
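For anyone who wants to confirm which ICU a given binary ships with, a minimal sketch follows. It only reads `process.versions`; the `icuInfo` helper name is made up for illustration, and the `icu` and `unicode` fields are assumed to be present, which should hold for builds with Intl support.

```javascript
// Hypothetical helper: collect the version fields relevant to this issue.
// `icu` and `unicode` are undefined on builds compiled without Intl support.
function icuInfo() {
  const { node, icu, unicode } = process.versions;
  return { node, icu, unicode };
}

const info = icuInfo();
console.log(
  `Node.js ${info.node}: ICU ${info.icu ?? 'not available'}, ` +
    `Unicode ${info.unicode ?? 'not available'}`,
);
```

Note that the ICU version alone does not say which converter each encoding label maps to, so it is only a first check.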

@RedYetiDev RedYetiDev added the icu Issues and PRs related to the ICU dependency. label Sep 18, 2024
RedYetiDev commented Sep 18, 2024

After testing this, Node.js passes all of the gb18030 tests but fails the following gbk tests:

✖ gbk: initial byte out of accepted ranges 
✖ gbk: two bytes 0x81 0xFF 
✖ gbk: two bytes 0xFE 0xFF 
✖ gbk: two bytes 0x81 0x30 
✖ gbk: three bytes 0x81 0x30 0xFE 
✖ gbk: three bytes 0x81 0x30 0xFF 
✖ gbk: four bytes 0xFE 0x39 0xFE 0x39 
✖ gbk: pointer 7458 
✖ gbk: pointer 7457 
✖ gbk: pointer 7459 
✖ gbk: pointer 39419 
✖ gbk: pointer 39420 
✖ gbk: pointer 189999 
✖ gbk: pointer 189000 
✖ gbk: pointer 1237575 
✖ gbk: pointer 1237576 
✖ gbk: legacy ICU special case 1 
✖ gbk: GB18030-2022 1 
✖ gbk: GB18030-2022 2 
✖ gbk: GB18030-2022 3 
✖ gbk: GB18030-2022 4 
✖ gbk: GB18030-2022 5 
✖ gbk: GB18030-2022 6 
✖ gbk: GB18030-2022 7 
✖ gbk: GB18030-2022 8 
✖ gbk: GB18030-2022 9 
✖ gbk: GB18030-2022 10 
✖ gbk: GB18030-2022 11 
✖ gbk: GB18030-2022 12 
✖ gbk: GB18030-2022 13 
✖ gbk: GB18030-2022 14 
✖ gbk: GB18030-2022 15 
✖ gbk: GB18030-2022 16 
✖ gbk: GB18030-2022 17 
✖ gbk: GB18030-2022 18 
✖ gbk: GB18030-2022 19 
✖ gbk: GB18030-2022 20 
✖ gbk: GB18030-2022 21 
✖ gbk: GB18030-2022 22 
✖ gbk: GB18030-2022 23 
✖ gbk: GB18030-2022 24 
✖ gbk: GB18030-2022 25 
✖ gbk: GB18030-2022 26 
✖ gbk: GB18030-2022 27 
✖ gbk: GB18030-2022 28 
✖ gbk: GB18030-2022 29 
✖ gbk: GB18030-2022 30 
✖ gbk: GB18030-2022 31 
✖ gbk: GB18030-2022 32 
✖ gbk: GB18030-2022 33 
✖ gbk: GB18030-2022 34 
✖ gbk: GB18030-2022 35 
✖ gbk: GB18030-2022 36 
✖ gbk: range 0 
✖ gbk: range 1 
✖ gbk: range 2 
✖ gbk: range 3 
✖ gbk: range 4 
✖ gbk: range 5 
✖ gbk: range 6 
✖ gbk: range 7 
✖ gbk: range 8 
✖ gbk: range 9 
✖ gbk: range 10 
✖ gbk: range 11 
✖ gbk: range 12 
✖ gbk: range 13 
✖ gbk: range 14 
✖ gbk: range 15 
✖ gbk: range 16 
✖ gbk: range 17 
✖ gbk: range 18 
✖ gbk: range 19 
✖ gbk: range 20 
✖ gbk: range 21 
✖ gbk: range 22 
✖ gbk: range 23 
✖ gbk: range 24 
✖ gbk: range 25 
✖ gbk: range 26 
✖ gbk: range 27 
✖ gbk: range 28 
✖ gbk: range 29 
✖ gbk: range 30 
✖ gbk: range 31 
✖ gbk: range 32 
✖ gbk: range 33 
✖ gbk: range 34 
✖ gbk: range 35 
✖ gbk: range 36 
✖ gbk: range 37 
✖ gbk: range 38 
✖ gbk: range 39 
✖ gbk: range 40 
✖ gbk: range 41 
✖ gbk: range 42 
✖ gbk: range 43 
✖ gbk: range 44 
✖ gbk: range 45 
✖ gbk: range 46 
✖ gbk: range 47 
✖ gbk: range 48 
✖ gbk: range 49 
✖ gbk: range 50 
✖ gbk: range 51 
✖ gbk: range 52 
✖ gbk: range 53 
✖ gbk: range 54 
✖ gbk: range 55 
✖ gbk: range 56 
✖ gbk: range 57 
✖ gbk: range 58 
✖ gbk: range 59 
✖ gbk: range 60 
✖ gbk: range 61 
✖ gbk: range 62 
✖ gbk: range 63 
✖ gbk: range 64 
✖ gbk: range 65 
✖ gbk: range 66 
✖ gbk: range 67 
✖ gbk: range 68 
✖ gbk: range 69 
✖ gbk: range 70 
✖ gbk: range 71 
✖ gbk: range 72 
✖ gbk: range 73 
✖ gbk: range 74 
✖ gbk: range 75 
✖ gbk: range 76 
✖ gbk: range 77 
✖ gbk: range 78 
✖ gbk: range 79 
✖ gbk: range 80 
✖ gbk: range 81 
✖ gbk: range 82 
✖ gbk: range 83 
✖ gbk: range 84 
✖ gbk: range 85 
✖ gbk: range 86 
✖ gbk: range 87 
✖ gbk: range 88 
✖ gbk: range 89 
✖ gbk: range 90 
✖ gbk: range 91 
✖ gbk: range 92 
✖ gbk: range 93 
✖ gbk: range 94 
✖ gbk: range 95 
✖ gbk: range 96 
✖ gbk: range 97 
✖ gbk: range 98 
✖ gbk: range 99 
✖ gbk: range 100 
✖ gbk: range 101 
✖ gbk: range 102 
✖ gbk: range 103 
✖ gbk: range 104 
✖ gbk: range 105 
✖ gbk: range 106 
✖ gbk: range 107 
✖ gbk: range 108 
✖ gbk: range 109 
✖ gbk: range 110 
✖ gbk: range 111 
✖ gbk: range 112 
✖ gbk: range 113 
✖ gbk: range 114 
✖ gbk: range 115 
✖ gbk: range 116 
✖ gbk: range 117 
✖ gbk: range 118 
✖ gbk: range 119 
✖ gbk: range 120 
✖ gbk: range 121 
✖ gbk: range 122 
✖ gbk: range 123 
✖ gbk: range 124 
✖ gbk: range 125 
✖ gbk: range 126 
✖ gbk: range 127 
✖ gbk: range 128 
✖ gbk: range 129 
✖ gbk: range 130 
✖ gbk: range 131 
✖ gbk: range 132 
✖ gbk: range 133 
✖ gbk: range 134 
✖ gbk: range 135 
✖ gbk: range 136 
✖ gbk: range 137 
✖ gbk: range 138 
✖ gbk: range 139 
✖ gbk: range 140 
✖ gbk: range 141 
✖ gbk: range 142 
✖ gbk: range 143 
✖ gbk: range 144 
✖ gbk: range 145 
✖ gbk: range 146 
✖ gbk: range 147 
✖ gbk: range 148 
✖ gbk: range 149 
✖ gbk: range 150 
✖ gbk: range 151 
✖ gbk: range 152 
✖ gbk: range 153 
✖ gbk: range 154 
✖ gbk: range 155 
✖ gbk: range 156 
✖ gbk: range 157 
✖ gbk: range 158 
✖ gbk: range 159 
✖ gbk: range 160 
✖ gbk: range 161 
✖ gbk: range 162 
✖ gbk: range 163 
✖ gbk: range 164 
✖ gbk: range 165 
✖ gbk: range 166 
✖ gbk: range 167 
✖ gbk: range 168 
✖ gbk: range 169 
✖ gbk: range 170 
✖ gbk: range 171 
✖ gbk: range 172 
✖ gbk: range 173 
✖ gbk: range 174 
✖ gbk: range 175 
✖ gbk: range 176 
✖ gbk: range 177 
✖ gbk: range 178 
✖ gbk: range 179 
✖ gbk: range 180 
✖ gbk: range 181 
✖ gbk: range 182 
✖ gbk: range 183 
✖ gbk: range 184 
✖ gbk: range 185 
✖ gbk: range 186 
✖ gbk: range 187 
✖ gbk: range 188 
✖ gbk: range 189 
✖ gbk: range 190 
✖ gbk: range 191 
✖ gbk: range 192 
✖ gbk: range 193 
✖ gbk: range 194 
✖ gbk: range 195 
✖ gbk: range 196 
✖ gbk: range 197 
✖ gbk: range 198 
✖ gbk: range 199 
✖ gbk: range 200 
✖ gbk: range 201 
✖ gbk: range 202 
✖ gbk: range 203 
✖ gbk: range 204 
✖ gbk: range 205 
✖ gbk: range 206 
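The Encoding Standard defines gbk's decoder as gb18030's decoder, so both labels are expected to produce the same string for the same bytes. A minimal sketch for checking a single byte sequence in isolation is below; `compareLabels` is a hypothetical helper (not part of Node.js or WPT), and it assumes a build with full ICU so that both labels are accepted by `TextDecoder`.

```javascript
// Hypothetical helper: decode the same bytes under both labels and
// report whether the two results agree.
function compareLabels(bytes) {
  const input = new Uint8Array(bytes);
  const gb18030 = new TextDecoder('gb18030').decode(input);
  const gbk = new TextDecoder('gbk').decode(input);
  return { gb18030, gbk, same: gb18030 === gbk };
}

// Byte sequences taken from the test cases in this comment.
const samples = [
  [115], // "ASCII"
  [0xA6, 0xD9], // "GB18030-2022 1"
  [0x81, 0x35, 0xF4, 0x37], // "pointer 7457"
];

for (const bytes of samples) {
  const hex = bytes.map((b) => b.toString(16).padStart(2, '0')).join(' ');
  console.log(hex, compareLabels(bytes));
}
```

Any sample where `same` is `false` suggests that the `gbk` label is being served by a different converter than `gb18030`.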

// Modified from WPT
import assert from 'node:assert';
import test from 'node:test';
import ranges from './ranges.mjs';

const decode = (input, output, desc) => {
  for (const encoding of ["gb18030", "gbk"]) {
    test(`${encoding}: ${desc}`, () => {
      assert.strictEqual(
        new TextDecoder(encoding).decode(new Uint8Array(input)),
        output,
      );
    });
  }
};

decode([115], "s", "ASCII");
decode([0x80], "\u20AC", "euro");
decode([0xFF], "\uFFFD", "initial byte out of accepted ranges");
decode([0x81], "\uFFFD", "end of queue, gb18030 first not 0");
decode([0x81, 0x28], "\ufffd(", "two bytes 0x81 0x28");
decode([0x81, 0x40], "\u4E02", "two bytes 0x81 0x40");
decode([0x81, 0x7E], "\u4E8A", "two bytes 0x81 0x7e");
decode([0x81, 0x7F], "\ufffd\u007f", "two bytes 0x81 0x7f");
decode([0x81, 0x80], "\u4E90", "two bytes 0x81 0x80");
decode([0x81, 0xFE], "\u4FA2", "two bytes 0x81 0xFE");
decode([0x81, 0xFF], "\ufffd", "two bytes 0x81 0xFF");
decode([0xFE, 0x40], "\uFA0C", "two bytes 0xFE 0x40");
decode([0xFE, 0xFE], "\uE4C5", "two bytes 0xFE 0xFE");
decode([0xFE, 0xFF], "\ufffd", "two bytes 0xFE 0xFF");
decode([0x81, 0x30], "\ufffd", "two bytes 0x81 0x30");
decode([0x81, 0x30, 0xFE], "\ufffd", "three bytes 0x81 0x30 0xFE");
decode([0x81, 0x30, 0xFF], "\ufffd0\ufffd", "three bytes 0x81 0x30 0xFF");
decode(
  [0x81, 0x30, 0xFE, 0x29],
  "\ufffd0\ufffd)",
  "four bytes 0x81 0x30 0xFE 0x29",
);
decode([0xFE, 0x39, 0xFE, 0x39], "\ufffd", "four bytes 0xFE 0x39 0xFE 0x39");
decode([0x81, 0x35, 0xF4, 0x36], "\u1E3E", "pointer 7458");
decode([0x81, 0x35, 0xF4, 0x37], "\ue7c7", "pointer 7457");
decode([0x81, 0x35, 0xF4, 0x38], "\u1E40", "pointer 7459");
decode([0x84, 0x31, 0xA4, 0x39], "\uffff", "pointer 39419");
decode([0x84, 0x31, 0xA5, 0x30], "\ufffd", "pointer 39420");
decode([0x8F, 0x39, 0xFE, 0x39], "\ufffd", "pointer 189999");
decode([0x90, 0x30, 0x81, 0x30], "\u{10000}", "pointer 189000");
decode([0xE3, 0x32, 0x9A, 0x35], "\u{10FFFF}", "pointer 1237575");
decode([0xE3, 0x32, 0x9A, 0x36], "\ufffd", "pointer 1237576");
decode([0x83, 0x36, 0xC8, 0x30], "\uE7C8", "legacy ICU special case 1");
decode([0xA1, 0xAD], "\u2026", "legacy ICU special case 2");
decode([0xA1, 0xAB], "\uFF5E", "legacy ICU special case 3");

// GB18030-2022
decode([0xA6, 0xD9], "\uFE10", "GB18030-2022 1");
decode([0xA6, 0xDA], "\uFE12", "GB18030-2022 2");
decode([0xA6, 0xDB], "\uFE11", "GB18030-2022 3");
decode([0xA6, 0xDC], "\uFE13", "GB18030-2022 4");
decode([0xA6, 0xDD], "\uFE14", "GB18030-2022 5");
decode([0xA6, 0xDE], "\uFE15", "GB18030-2022 6");
decode([0xA6, 0xDF], "\uFE16", "GB18030-2022 7");
decode([0xA6, 0xEC], "\uFE17", "GB18030-2022 8");
decode([0xA6, 0xED], "\uFE18", "GB18030-2022 9");
decode([0xA6, 0xF3], "\uFE19", "GB18030-2022 10");
decode([0xFE, 0x59], "\u9FB4", "GB18030-2022 11");
decode([0xFE, 0x61], "\u9FB5", "GB18030-2022 12");
decode([0xFE, 0x66], "\u9FB6", "GB18030-2022 13");
decode([0xFE, 0x67], "\u9FB7", "GB18030-2022 14");
decode([0xFE, 0x6D], "\u9FB8", "GB18030-2022 15");
decode([0xFE, 0x7E], "\u9FB9", "GB18030-2022 16");
decode([0xFE, 0x90], "\u9FBA", "GB18030-2022 17");
decode([0xFE, 0xA0], "\u9FBB", "GB18030-2022 18");
decode([0x82, 0x35, 0x90, 0x37], "\u9FB4", "GB18030-2022 19");
decode([0x82, 0x35, 0x90, 0x38], "\u9FB5", "GB18030-2022 20");
decode([0x82, 0x35, 0x90, 0x39], "\u9FB6", "GB18030-2022 21");
decode([0x82, 0x35, 0x91, 0x30], "\u9FB7", "GB18030-2022 22");
decode([0x82, 0x35, 0x91, 0x31], "\u9FB8", "GB18030-2022 23");
decode([0x82, 0x35, 0x91, 0x32], "\u9FB9", "GB18030-2022 24");
decode([0x82, 0x35, 0x91, 0x33], "\u9FBA", "GB18030-2022 25");
decode([0x82, 0x35, 0x91, 0x34], "\u9FBB", "GB18030-2022 26");
decode([0x84, 0x31, 0x82, 0x36], "\uFE10", "GB18030-2022 27");
decode([0x84, 0x31, 0x82, 0x37], "\uFE11", "GB18030-2022 28");
decode([0x84, 0x31, 0x82, 0x38], "\uFE12", "GB18030-2022 29");
decode([0x84, 0x31, 0x82, 0x39], "\uFE13", "GB18030-2022 30");
decode([0x84, 0x31, 0x83, 0x30], "\uFE14", "GB18030-2022 31");
decode([0x84, 0x31, 0x83, 0x31], "\uFE15", "GB18030-2022 32");
decode([0x84, 0x31, 0x83, 0x32], "\uFE16", "GB18030-2022 33");
decode([0x84, 0x31, 0x83, 0x33], "\uFE17", "GB18030-2022 34");
decode([0x84, 0x31, 0x83, 0x34], "\uFE18", "GB18030-2022 35");
decode([0x84, 0x31, 0x83, 0x35], "\uFE19", "GB18030-2022 36");

let i = 0;
for (const range of ranges) {
  const pointer = range[0];
  decode(
    [
      Math.floor(pointer / 12600) + 0x81,
      Math.floor((pointer % 12600) / 1260) + 0x30,
      Math.floor((pointer % 1260) / 10) + 0x81,
      pointer % 10 + 0x30,
    ],
    range[1],
    "range " + i++,
  );
}
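The arithmetic in the loop above turns a gb18030 ranges pointer into a four-byte sequence. The bytes fall in the ranges 0x81 to 0xFE, 0x30 to 0x39, 0x81 to 0xFE and 0x30 to 0x39, which is where the factors 12600 (10 × 126 × 10), 1260 (126 × 10) and 10 come from. A minimal sketch of the mapping and its inverse is below; `pointerToBytes` and `bytesToPointer` are illustrative names, not functions from WPT or Node.js.

```javascript
// Hypothetical helper mirroring the arithmetic used in the range loop.
function pointerToBytes(pointer) {
  return [
    Math.floor(pointer / 12600) + 0x81,
    Math.floor((pointer % 12600) / 1260) + 0x30,
    Math.floor((pointer % 1260) / 10) + 0x81,
    (pointer % 10) + 0x30,
  ];
}

// Inverse mapping: recover the pointer from a four-byte sequence.
function bytesToPointer([b1, b2, b3, b4]) {
  return (b1 - 0x81) * 12600 + (b2 - 0x30) * 1260 + (b3 - 0x81) * 10 + (b4 - 0x30);
}

const bytes = pointerToBytes(7457);
console.log(bytes.map((b) => '0x' + b.toString(16).toUpperCase()).join(' '));
// prints "0x81 0x35 0xF4 0x37"
console.log(bytesToPointer(bytes)); // prints 7457
```

The printed sequence matches the input of the "pointer 7457" test case above, which can be handy when mapping a failing "range N" test back to concrete bytes.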
