Character Encodings and Unicode: The Definitive Guide to UTF-8, UTF-16, and Beyond
In the early days of computing, representing text was simple: every character fit into 7 bits (ASCII), or 8 bits with vendor extensions. But as computing went global, we needed a way to represent every character from every language in the world. This is the story of Unicode and the encodings that power it.
1. The Unicode Standard
Unicode is not an encoding itself, but a universal character set. It assigns a unique number, called a code point, to every character (e.g., U+0041 for 'A'). However, we still need a way to store these numbers in binary.
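The distinction between a character's code point and its stored bytes can be seen directly in Python, where `ord()` and `chr()` convert between characters and their Unicode code points:

```python
# A code point is just a number assigned to a character.
print(hex(ord("A")))   # 'A' is code point U+0041
print(chr(0x4E2D))     # U+4E2D is the CJK character 中
```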
UTF-8 (The King of the Web)
UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character. It is backward compatible with ASCII.
- Pros: Efficient for Western languages, robust against data corruption, and the universal standard for the web.
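The variable width is easy to observe: encoding a few characters from different Unicode ranges shows UTF-8 spending anywhere from 1 to 4 bytes each.

```python
# UTF-8 byte length grows with the code point's range:
# ASCII -> 1 byte, Latin accents -> 2, CJK -> 3, emoji -> 4.
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} byte(s), hex {encoded.hex()}")
```

Note that the single-byte case is bit-for-bit identical to ASCII, which is why any valid ASCII file is also a valid UTF-8 file.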
UTF-16
Uses either 2 or 4 bytes per character: characters outside the Basic Multilingual Plane are encoded as two 2-byte surrogate pairs. It is the native encoding of the Windows API and NTFS filenames, and of many programming environments like Java and JavaScript.
UTF-32
Uses exactly 4 bytes for every character. While simple for indexing, it is very memory-inefficient.
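The size trade-off between the three UTF encodings is concrete: for mixed ASCII and CJK text, the same string costs very different amounts of storage.

```python
# Compare the storage cost of the same string in each UTF encoding.
text = "Hello, 世界"  # 7 ASCII characters + 2 CJK characters
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{enc}: {len(text.encode(enc))} bytes")
```

UTF-8 wins for ASCII-heavy text (7×1 + 2×3 = 13 bytes here), UTF-16 for pure CJK text, and UTF-32 pays 4 bytes per character regardless.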
2. Legacy and Regional Encodings
Before Unicode became dominant, different regions used their own standards, many of which are still encountered today.
Chinese Encodings (GB Family)
- GB2312: The early standard for Simplified Chinese.
- GBK: An extension of GB2312 to support more characters.
- GB18030: The current mandatory standard in China, supporting both Simplified and Traditional Chinese and mapping fully to Unicode.
Japanese Encodings
- Shift-JIS: Historically the most popular encoding for Japanese, used extensively in Windows and older websites.
Korean and Traditional Chinese
- EUC-KR: A common legacy encoding for Korean.
- Big5: The standard for Traditional Chinese characters used primarily in Taiwan and Hong Kong.
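The practical danger with these legacy encodings is mojibake: bytes written in one encoding but decoded in another. A quick round-trip through GBK shows why the codec must be known, not guessed:

```python
# The same bytes read very differently under different codecs.
data = "你好".encode("gbk")       # b'\xc4\xe3\xba\xc3'
print(data.decode("gbk"))         # correct: 你好
print(data.decode("latin-1"))     # mojibake: four accented Latin letters
```

Latin-1 happily "decodes" any byte sequence, so the error is silent; this is why misidentified Chinese or Japanese text so often surfaces as strings of accented Western characters.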
3. Western Legacy Encodings
ISO-8859-1 (Latin-1)
An 8-bit encoding that covers most Western European languages.
Windows-1252
A superset of ISO-8859-1 that replaces the C1 control range (0x80–0x9F) with printable characters such as curly quotes, the em dash, and the euro sign. It was the default in older versions of Windows, and web browsers still treat content labeled ISO-8859-1 as Windows-1252.
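The difference between the two is confined to that 0x80–0x9F range, which is easy to demonstrate:

```python
# Byte 0x93: a curly quote in Windows-1252, an invisible C1 control in Latin-1.
data = b"\x93quoted\x94"
print(data.decode("cp1252"))   # renders with curly double quotes
print(repr(data.decode("latin-1")))  # same bytes become control characters
```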
4. Technical Nuances: BOM and Endianness
Byte Order Mark (BOM)
The BOM is a special sequence of bytes at the beginning of a text file (e.g., EF BB BF for UTF-8). It tells the software which encoding and byte order are in use. While essential for distinguishing UTF-16LE from UTF-16BE, it is often discouraged for UTF-8 in web environments, where it adds nothing and can confuse tools that do not expect it.
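BOM detection is simple prefix matching. Here is a minimal sketch using the BOM constants from Python's standard `codecs` module; `sniff_bom` is a hypothetical helper name, not a standard API:

```python
import codecs

def sniff_bom(data: bytes):
    """Return a codec name guessed from a leading BOM, or None."""
    # Check the 4-byte UTF-32 BOMs before UTF-16: the UTF-32-LE BOM
    # (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))  # utf-8-sig
```

A file with no BOM yields `None`, which is the common case for UTF-8 on the web; absence of a BOM proves nothing about the encoding.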
Little Endian vs. Big Endian
Relevant for UTF-16 and UTF-32, this refers to the order in which the bytes of each multi-byte code unit are stored: little-endian puts the least significant byte first, big-endian the most significant.
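Encoding a single character both ways makes the difference visible: the same 16-bit code unit 0x0041 is laid out as two bytes in opposite orders.

```python
# The code point U+0041 ('A') as a 16-bit unit, in both byte orders.
print("A".encode("utf-16-le").hex())  # 4100 (least significant byte first)
print("A".encode("utf-16-be").hex())  # 0041 (most significant byte first)
```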
Comparison Summary
| Encoding | Bytes per Char | Compatibility | Best Use Case |
|---|---|---|---|
| UTF-8 | 1-4 | ASCII | Web, Linux, Mac |
| UTF-16 | 2 or 4 | UCS-2 | Windows, Java, JS |
| GB18030 | 1, 2, or 4 | GBK | Chinese Government Compliance |
| ASCII | 1 | Universal | Legacy Systems, English-only |
Conclusion
The world has largely settled on UTF-8 as the standard for data exchange. However, understanding legacy encodings like GBK or Shift-JIS is still vital when dealing with legacy systems or specific regional software. When in doubt, always use UTF-8 without a BOM for modern applications.