Character Encodings and Unicode: The Definitive Guide to UTF-8, UTF-16, and Beyond
In the early days of computing, representing text was simple: every character fit into 7 bits (ASCII), or 8 bits with vendor extensions. But as computing went global, we needed a way to represent every character from every language in the world. This is the story of Unicode and the encodings that power it.
1. The Unicode Standard
Unicode is not an encoding itself, but a universal character set. It assigns a unique number, called a code point, to every character (e.g., U+0041 for 'A'). However, we still need a way to store these numbers in binary.
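The distinction between a character's code point and its stored bytes can be seen directly in Python, where `ord()` and `chr()` convert between characters and their Unicode code points:

```python
# A code point is just a number assigned to a character.
print(hex(ord("A")))   # 'A' is code point U+0041
print(chr(0x4E2D))     # U+4E2D is the CJK character 中
```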
UTF-8 (The King of the Web)
UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character. It is backward compatible with ASCII.
- Pros: Efficient for Western languages, robust against data corruption, and the universal standard for the web.
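The variable width is easy to observe: encoding a few characters from different Unicode ranges shows UTF-8 spending anywhere from 1 to 4 bytes each.

```python
# UTF-8 byte length grows with the code point's range:
# ASCII -> 1 byte, Latin accents -> 2, CJK -> 3, emoji -> 4.
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} byte(s), hex {encoded.hex()}")
```

Note that the single-byte case is bit-for-bit identical to ASCII, which is why any valid ASCII file is also a valid UTF-8 file.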
UTF-16
Uses either 2 or 4 bytes per character: characters outside the Basic Multilingual Plane are encoded as two 2-byte surrogate pairs. It is the native encoding of the Windows API and NTFS filenames, and of many programming environments like Java and JavaScript.
UTF-32
Uses exactly 4 bytes for every character. While simple for indexing, it is very memory-inefficient.
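The size trade-off between the three UTF encodings is concrete: for mixed ASCII and CJK text, the same string costs very different amounts of storage.

```python
# Compare the storage cost of the same string in each UTF encoding.
text = "Hello, 世界"  # 7 ASCII characters + 2 CJK characters
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{enc}: {len(text.encode(enc))} bytes")
```

UTF-8 wins for ASCII-heavy text (7×1 + 2×3 = 13 bytes here), UTF-16 for pure CJK text, and UTF-32 pays 4 bytes per character regardless.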
2. Legacy and Regional Encodings
Before Unicode became dominant, different regions used their own standards, many of which are still encountered today.
Chinese Encodings (GB Family)
- GB2312: The early standard for Simplified Chinese.
- GBK: An extension of GB2312 to support more characters.
- GB18030: The current mandatory standard in China, supporting both Simplified and Traditional Chinese and mapping fully to Unicode.
Japanese Encodings
- Shift-JIS: Historically the most popular encoding for Japanese, used extensively in Windows and older websites.
Korean and Traditional Chinese
- EUC-KR: A common legacy encoding for Korean.
- Big5: The standard for Traditional Chinese characters used primarily in Taiwan and Hong Kong.
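The practical danger with these legacy encodings is mojibake: bytes written in one encoding but decoded in another. A quick round-trip through GBK shows why the codec must be known, not guessed:

```python
# The same bytes read very differently under different codecs.
data = "你好".encode("gbk")       # b'\xc4\xe3\xba\xc3'
print(data.decode("gbk"))         # correct: 你好
print(data.decode("latin-1"))     # mojibake: four accented Latin letters
```

Latin-1 happily "decodes" any byte sequence, so the error is silent; this is why misidentified Chinese or Japanese text so often surfaces as strings of accented Western characters.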
3. Western Legacy Encodings
ISO-8859-1 (Latin-1)
An 8-bit encoding that covers most Western European languages.
Windows-1252
A superset of ISO-8859-1 that replaces the C1 control range (0x80–0x9F) with printable characters such as curly quotes, the em dash, and the euro sign. It was the default in older versions of Windows, and web browsers still treat content labeled ISO-8859-1 as Windows-1252.
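The difference between the two is confined to that 0x80–0x9F range, which is easy to demonstrate:

```python
# Byte 0x93: a curly quote in Windows-1252, an invisible C1 control in Latin-1.
data = b"\x93quoted\x94"
print(data.decode("cp1252"))   # renders with curly double quotes
print(repr(data.decode("latin-1")))  # same bytes become control characters
```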
4. Technical Nuances: BOM and Endianness
Byte Order Mark (BOM)
The BOM is a special sequence of bytes at the beginning of a text file (e.g., EF BB BF for UTF-8). It tells the software which encoding and byte order are in use. While essential for distinguishing UTF-16LE from UTF-16BE, it is often discouraged for UTF-8 in web environments, where it adds nothing and can confuse tools that do not expect it.
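BOM detection is simple prefix matching. Here is a minimal sketch using the BOM constants from Python's standard `codecs` module; `sniff_bom` is a hypothetical helper name, not a standard API:

```python
import codecs

def sniff_bom(data: bytes):
    """Return a codec name guessed from a leading BOM, or None."""
    # Check the 4-byte UTF-32 BOMs before UTF-16: the UTF-32-LE BOM
    # (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))  # utf-8-sig
```

A file with no BOM yields `None`, which is the common case for UTF-8 on the web; absence of a BOM proves nothing about the encoding.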
Little Endian vs. Big Endian
Relevant for UTF-16 and UTF-32, this refers to the order in which the bytes of each multi-byte code unit are stored: little-endian puts the least significant byte first, big-endian the most significant.
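Encoding a single character both ways makes the difference visible: the same 16-bit code unit 0x0041 is laid out as two bytes in opposite orders.

```python
# The code point U+0041 ('A') as a 16-bit unit, in both byte orders.
print("A".encode("utf-16-le").hex())  # 4100 (least significant byte first)
print("A".encode("utf-16-be").hex())  # 0041 (most significant byte first)
```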
Comparison Summary
| Encoding | Bytes per Char | Compatibility | Best Use Case |
|---|---|---|---|
| UTF-8 | 1-4 | ASCII | Web, Linux, Mac |
| UTF-16 | 2 or 4 | UCS-2 | Windows, Java, JS |
| GB18030 | 1, 2, or 4 | GBK | Chinese Government Compliance |
| ASCII | 1 | Universal | Legacy Systems, English-only |
Conclusion
The world has largely settled on UTF-8 as the standard for data exchange. However, understanding legacy encodings like GBK or Shift-JIS is still vital when dealing with legacy systems or specific regional software. When in doubt, always use UTF-8 without a BOM for modern applications.