Alick Zhao edited this page Jan 19, 2016 · 4 revisions

Unicode clarification

Basics

Range: 0x000000–0x10FFFF (0x110000 code points in total)

BMP (Basic Multilingual Plane): 0x0000–0xFFFF
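As a rough illustration of the numbers above, the sketch below computes which plane a character falls in and whether it sits inside the BMP. The helper names `plane_of` and `is_bmp` are made up for this example; only `ord` is a Python built-in.

```python
MAX_CODE_POINT = 0x10FFFF  # highest valid Unicode code point


def plane_of(ch):
    """Return the plane number (0-16) of a single character."""
    return ord(ch) >> 16  # each plane holds 0x10000 code points


def is_bmp(ch):
    """True if the character lies in the Basic Multilingual Plane (plane 0)."""
    return plane_of(ch) == 0


for ch in ["A", "☑", "😷"]:
    print(f"U+{ord(ch):04X}", "plane", plane_of(ch), "BMP" if is_bmp(ch) else "non-BMP")
```

The total of 0x110000 code points corresponds to 17 planes of 0x10000 code points each.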

Encodings

  • UTF-8 variable length encoding (1–4 bytes)
  • UTF-16 variable length encoding (2 or 4 bytes, i.e. one or two 16-bit code units), descendant of deprecated UCS-2. Harmful?
  • UTF-32 constant length encoding (4 bytes)
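A quick way to see the difference between the three encodings is to encode a few characters and count the bytes. This is only a sketch: `encoded_lengths` is a hypothetical helper, and the big-endian codec names (`utf-16-be`, `utf-32-be`) are chosen so that Python does not prepend a byte order mark to the output.

```python
def encoded_lengths(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-be" variants skip the BOM, so only the character itself is counted.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))


for ch in ["A", "·", "☑", "😷"]:
    utf8, utf16, utf32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

Characters inside the BMP take 2 bytes in UTF-16, while anything above U+FFFF takes 4 bytes; UTF-32 is always 4.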

UTF-8

History of UTF-8

Specification: RFC 3629 (obsoletes RFC 2279)

Emoji

E.g.: 😷

Characters

  • ☐ U+2610 BALLOT BOX
  • ☑ U+2611 BALLOT BOX WITH CHECK
  • ☒ U+2612 BALLOT BOX WITH X
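The code point and official name of a character can be looked up with Python's standard `unicodedata` module. The `describe` helper below is just an illustrative wrapper, not part of any library.

```python
import unicodedata


def describe(ch):
    """Format a character as 'U+XXXX NAME' using the Unicode Character Database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in "☐☑☒":
    print(ch, describe(ch))
```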

My avatar: 🈚 U+1F21A
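Because U+1F21A lies outside the BMP, UTF-16 has to represent it as a surrogate pair. The following is a minimal sketch of that arithmetic; `to_surrogates` is a hypothetical helper written for this page, and the result is cross-checked against Python's own `utf-16-be` codec.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates(0x1F21A)
print(f"{high:04X} {low:04X}")  # D83C DE1A

# Cross-check against the built-in codec.
print("🈚".encode("utf-16-be").hex())  # d83cde1a
```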

Punctuation

  • · U+00B7 MIDDLE DOT, used as the Chinese interpunct (e.g. between the parts of transliterated foreign names)

Pinyin Tone Marks

See Test charts for tonal pinyin in Unicode Web pages.

Notes

Good reading material: Tom Christiansen's gbu slides from OSCON.
