2016 December 29 —— improvement; java; unicode



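I recently ran into several character-set problems. Because I had never organized what I knew, my understanding was patchy, so this post summarizes what you need to understand about how Java handles Unicode: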


Basic Multilingual Plane (BMP)

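supplementary character: the Java platform's term for a character outside the BMP. A Java char is a 16-bit UTF-16 code unit. Unicode was originally designed as a fixed-width 16-bit coded character set, but 16 bits can represent at most 65,536 characters, which proved insufficient for all the characters in the world. Characters that fall outside the original 16-bit range are called supplementary characters. In UTF-16 each one is encoded as a pair of code units taken from two reserved ranges: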


  • high-surrogates range (U+D800 to U+DBFF)

  • low-surrogates range (U+DC00 to U+DFFF)
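The two ranges above can be checked from Java with the standard `Character` methods. The following is a minimal sketch: the class name `SurrogateRanges` and the helper `classify` are made up for illustration, and U+1F600 is just one convenient supplementary code point.

```java
final class SurrogateRanges {
    /** Classifies a single UTF-16 code unit (a Java char). */
    static String classify(char unit) {
        if (Character.isHighSurrogate(unit)) {
            return "high surrogate"; // U+D800 to U+DBFF
        }
        if (Character.isLowSurrogate(unit)) {
            return "low surrogate";  // U+DC00 to U+DFFF
        }
        return "BMP character";
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so Java stores it as two chars.
        char[] units = Character.toChars(0x1F600);
        for (char unit : units) {
            System.out.printf("U+%04X -> %s%n", (int) unit, classify(unit));
        }
        System.out.printf("U+%04X -> %s%n", (int) 'A', classify('A'));
    }
}
```

Running `main` should report the two code units U+D83D and U+DE00 as a high and a low surrogate, and U+0041 as an ordinary BMP character.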

The following passage, quoted from [Supplementary Characters in the Java Platform], explains encodings, character sets, and related concepts concisely and precisely:

Code Points, Character Encoding Schemes, UTF-16: What’s All This?

The introduction of supplementary characters unfortunately makes the character model quite a bit more complicated. Where in the past we could simply talk about “characters” and, in a Unicode based environment such as the Java platform, assume that a character has 16 bits, we now need more terminology. We’ll try to keep it relatively simple – for a full-blown discussion with all details you can read Chapter 2 of The Unicode Standard or Unicode Technical Report 17 “Character Encoding Model.” Unicode experts may skip all but the last definition in this section.

A character is just an abstract minimal unit of text. It doesn’t have a fixed shape (that would be a glyph), and it doesn’t have a value. “A” is a character, and so is “€”, the symbol for the common currency of Germany, France, and numerous other European countries.

A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.

A coded character set is a character set where each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter “A” the number 0041₁₆ and the letter “€” the number 20AC₁₆. The Unicode standard always uses hexadecimal numbers, and writes them with the prefix “U+”, so the number for “A” is written as “U+0041”.

Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn’t necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points.
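(Not part of the quoted article.) The distinction between a valid code point and an assigned one can be probed from Java. This is only a sketch: `CodePointRange` and `describe` are hypothetical names, and whether a given code point counts as defined depends on the Unicode version bundled with the JDK in use, so the count of 96,382 quoted above will not match a modern runtime.

```java
final class CodePointRange {
    /** Describes how the running JDK sees an arbitrary int value. */
    static String describe(int value) {
        if (!Character.isValidCodePoint(value)) {
            return "not a code point"; // outside U+0000 to U+10FFFF
        }
        // isDefined consults the Unicode data shipped with this JDK.
        return Character.isDefined(value)
                ? "defined in this JDK's Unicode data"
                : "valid but unassigned";
    }

    public static void main(String[] args) {
        System.out.printf("MAX_CODE_POINT = U+%X%n", Character.MAX_CODE_POINT);
        System.out.println("U+0041   -> " + describe(0x41));
        System.out.println("U+20AC   -> " + describe(0x20AC));
        System.out.println("0x110000 -> " + describe(0x110000));
    }
}
```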

Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.
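(Not part of the quoted article.) In Java this split shows up as the difference between `String.length()`, which counts 16-bit code units, and `codePointCount`, which counts code points. A small sketch, with the hypothetical class name `CodePointCount` and a sample string chosen only for illustration:

```java
final class CodePointCount {
    /** Number of Unicode code points in the text (not chars). */
    static int countCodePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    /** Number of code points in U+10000 to U+10FFFF. */
    static int countSupplementary(String text) {
        return (int) text.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();
    }

    public static void main(String[] args) {
        // "A", the euro sign (both BMP), and U+1F600 (supplementary).
        String text = "A\u20AC" + new String(Character.toChars(0x1F600));
        System.out.println(text.length());            // 4 code units
        System.out.println(countCodePoints(text));    // 3 code points
        System.out.println(countSupplementary(text)); // 1 supplementary character
    }
}
```

Code that indexes a string with `charAt` sees the four code units; iterating with `codePoints()` sees the three characters.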

A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.

UTF-32 simply represents each Unicode code point as the 32-bit integer of the same value. It’s clearly the most convenient representation for internal processing, but uses significantly more memory than necessary if used as a general string representation.

UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter “A” or be the second byte of a two-byte character.
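(Not part of the quoted article.) The passage does not spell out how a supplementary code point is split into its two code units. The arithmetic below is the standard UTF-16 mapping, written out by hand as a sketch; `SurrogatePairMath`, `encode`, and `decode` are illustrative names, and in real code `Character.toChars` and `Character.toCodePoint` do the same job.

```java
final class SurrogatePairMath {
    /** Splits a supplementary code point into {high, low} surrogates. */
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    /** Recombines a surrogate pair into the original code point. */
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);
        System.out.printf("U+1F600 -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        System.out.printf("back    -> U+%X%n", decode(pair[0], pair[1]));
    }
}
```

Because the high and low ranges do not overlap with each other or with any assigned character, a decoder can tell from a single `char` which of the three roles it plays.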

UTF-8 uses sequences of one to four bytes to encode Unicode code points. U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in four bytes. UTF-8 is designed so that the byte values 0x00 to 0x7F always represent code points U+0000 to U+007F (the Basic Latin block, which corresponds to the ASCII character set). These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient to use in software that assigns special meanings to certain ASCII characters.
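(Not part of the quoted article.) The byte-length boundaries listed above can be observed by encoding individual code points with the JDK's UTF-8 charset. A minimal sketch; `Utf8Lengths` and `utf8Length` are names invented here, and the helper is only meant for non-surrogate code points, since an unpaired surrogate cannot be encoded.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    /** Number of bytes UTF-8 needs for one (non-surrogate) code point. */
    static int utf8Length(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return single.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 };
        for (int codePoint : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", codePoint, utf8Length(codePoint));
        }
    }
}
```

The four samples fall in the one-, two-, three-, and four-byte ranges respectively.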


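Unicode was created to solve the problem of garbled text caused by incompatible character encodings. It collects the world's characters and symbols and assigns each one a unique code point, so that under a given Unicode encoding scheme every character maps to exactly one binary sequence. Once everyone uses Unicode and agrees on the encoding scheme, there is no need to guess which legacy character set a file was written in, and the garbled-text problem goes away.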
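To make the garbled-text problem concrete, here is a small sketch that encodes a string with one charset and decodes it with another. `DecodeMismatch` and `roundTrip` are hypothetical names, and ISO-8859-1 stands in for any mismatched legacy encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeMismatch {
    /** Encodes text with one charset, then decodes the bytes with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(euro, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Mismatched charsets: three UTF-8 bytes become three unrelated chars.
        String garbled = roundTrip(euro, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 3
        // One code point, different byte sequences per encoding scheme.
        System.out.println(euro.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(euro.getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}
```

The code point stays U+20AC throughout; only the byte sequence depends on the encoding scheme, which is why writer and reader must agree on it.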




Lexical Structure

Supplementary Characters in the Java Platform

Class Character

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)



