Indicating which character collection is used #84

murata2makoto · 2021-12-09T02:07:32Z

Unicode has 92,865 CJK ideographic characters, but each language uses only a small subset. Annex A of ISO/IEC 10646 lists character collections relevant to Japanese text. (Note: Annex A also provides collections for other languages.) Each of the listed character collections contains fewer than 10,000 characters.

Assistive technologies (e.g., Japanese TTS) are unlikely to handle all 92,865 CJK ideographic characters. According to a 2015 report from a Japanese ministry, most TTS engines support only the 6,355 characters in JIS X 0208. I have not heard of significant improvements since then.

Moreover, authors of textbooks and children's books use even smaller subsets for pedagogical reasons. For example, 1,006 CJK ideographic characters are taught in Japanese compulsory education.

I thus think that accessibility metadata should be able to indicate (1) which character collection is used as a basis and (2) which characters beyond the specified collection are used as exceptions, which are sometimes necessary. I believe that this would also benefit other CJK countries. Moreover, since no language and no TTS engine supports all Unicode characters, I expect that this would benefit everybody.
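To make the two pieces of information concrete, here is a minimal Python sketch of how a tool might derive them from a text. Everything named here is an assumption for illustration: the `baseCollection` and `exceptions` keys are hypothetical and not part of any existing metadata vocabulary, and `BASE_COLLECTION` is a toy stand-in for a real collection from Annex A of ISO/IEC 10646.

```python
import unicodedata


def is_cjk_ideograph(ch: str) -> bool:
    """Return True if ch is named as a CJK unified ideograph in Unicode."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def exceptions_beyond(text: str, collection: set) -> list:
    """List CJK ideographs in text that fall outside collection.

    Each character appears once, in order of first occurrence.
    Non-ideographic characters (kana, punctuation, etc.) are ignored.
    """
    found = []
    for ch in text:
        if is_cjk_ideograph(ch) and ch not in collection and ch not in found:
            found.append(ch)
    return found


# Toy stand-in for a real character collection (assumption, not real data).
BASE_COLLECTION = set("山川日本語")

text = "日本の山と川と鬱"

# Hypothetical metadata shape: a named base collection plus explicit exceptions.
metadata = {
    "baseCollection": "toy-collection",
    "exceptions": exceptions_beyond(text, BASE_COLLECTION),
}
print(metadata)
```

With this shape, a TTS engine that supports the base collection would only need to check the short exceptions list to know whether it can read the publication.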

xfq added the i18n-tracker label Dec 12, 2021

w3cbot mentioned this issue Dec 12, 2021

Indicating which character collection is used w3c/i18n-activity#1443

Open
