Unicode: computing UTF-16 from a code point

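Introduction:
UTF-16 is what people commonly call "Unicode". Calling UTF-16 "Unicode" is not really appropriate, though, and it easily causes confusion: Unicode is a character set, not an actual storage encoding scheme.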

UTF-16 is a variable-length encoding scheme: each character takes either one or two 16-bit code units.

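Take the character whose Unicode code point is 2F92B. Saved as UTF-16 (which is what Windows XP Notepad calls "Unicode"), it becomes 7E D8 2B DD; in big-endian byte order it would be D8 7E DD 2B. Where do these values come from?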
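As a quick check of the two byte orders, here is a minimal sketch that lets the .NET encoders produce the bytes for U+2F92B. The class name ByteOrderDemo and the helper Hex are made up for this illustration; char.ConvertFromUtf32, Encoding.Unicode (little-endian UTF-16) and Encoding.BigEndianUnicode are standard .NET APIs.

```csharp
using System;
using System.Text;

static class ByteOrderDemo
{
    // Hypothetical helper: returns the encoded bytes of one code point
    // as a hex string such as "7E-D8-2B-DD".
    public static string Hex(int codePoint, Encoding encoding)
    {
        string s = char.ConvertFromUtf32(codePoint); // one or two UTF-16 code units
        return BitConverter.ToString(encoding.GetBytes(s));
    }

    static void Main()
    {
        // GetBytes does not emit a byte order mark, so only the character's bytes appear.
        Console.WriteLine(Hex(0x2F92B, Encoding.Unicode));          // 7E-D8-2B-DD
        Console.WriteLine(Hex(0x2F92B, Encoding.BigEndianUnicode)); // D8-7E-DD-2B
    }
}
```

The two outputs contain the same code units, D87E and DD2B; only the order of the two bytes inside each unit differs.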

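A Unicode character in the range 0-FFFF is represented directly in UTF-16 by a single two-byte code unit whose value equals the code point. A Unicode character in the range 10000-10FFFF is represented by a surrogate pair, that is, by two code units. The conversion works as follows: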

Below, the Unicode character with code point U+64321 (hexadecimal) is encoded as UTF-16. Because it is greater than U+FFFF, it has to be encoded as a surrogate pair:
v  = 0x64321
v′ = v - 0x10000
   = 0x54321
   = 0101 0100 0011 0010 0001

vh = 0101010000 // higher 10 bits of v′
vl = 1100100001 // lower 10 bits of v′
w1 = 0xD800 // the resulting 1st word is initialized with the high bits
w2 = 0xDC00 // the resulting 2nd word is initialized with the low bits

w1 = w1 | vh
   = 1101 1000 0000 0000 |
            01 0101 0000
   = 1101 1001 0101 0000
   = 0xD950

w2 = w2 | vl
   = 1101 1100 0000 0000 |
            11 0010 0001
   = 1101 1111 0010 0001
   = 0xDF21
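The hand calculation can be cross-checked against the .NET runtime. This is only a sketch: the class WorkedExampleCheck and its method Utf16Hex are names invented here, while char.ConvertFromUtf32 and char.ConvertToUtf32 are standard .NET methods.

```csharp
using System;

static class WorkedExampleCheck
{
    // Hypothetical helper: formats the UTF-16 code units of a code point
    // as space-separated 4-digit hex values.
    public static string Utf16Hex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        string[] units = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            units[i] = ((int)s[i]).ToString("X4");
        return string.Join(" ", units);
    }

    static void Main()
    {
        Console.WriteLine(Utf16Hex(0x64321)); // D950 DF21

        // Going back from the pair to the code point.
        int cp = char.ConvertToUtf32('\uD950', '\uDF21');
        Console.WriteLine(cp.ToString("X"));  // 64321
    }
}
```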

Detailed description:
The improvement that UTF-16 made over UCS-2 is its ability to encode characters in planes 1–16, not just those in plane 0 (BMP).
UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair. First 0x10000 is subtracted from the code point to give a 20-bit value. This is then split into two separate 10-bit values, each of which is represented as a surrogate, with the most significant half placed in the first surrogate. To allow safe use of simple word-oriented string processing, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate and 0xDC00–0xDFFF for the second, least significant surrogate.
For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, and the character at U+10FFFD, close to the upper limit of Unicode (U+10FFFF), becomes the sequence 0xDBFF 0xDFFD. Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair does not ever represent a character.
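The description above can be sketched in both directions. The class SurrogateMath and its methods Encode and Decode are illustrative names, not part of any library, and the range checks are an assumption about how invalid input should be handled.

```csharp
using System;

static class SurrogateMath
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
    public static (char High, char Low) Encode(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        int v = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 | (v >> 10));   // most significant 10 bits
        char low = (char)(0xDC00 | (v & 0x3FF));  // least significant 10 bits
        return (high, low);
    }

    // Recombines a surrogate pair into the original code point.
    public static int Decode(char high, char low)
    {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
            throw new ArgumentException("Not a valid surrogate pair.");
        return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00));
    }

    static void Main()
    {
        var pair = Encode(0x10FFFD);
        Console.WriteLine(((int)pair.High).ToString("X4") + " " + ((int)pair.Low).ToString("X4")); // DBFF DFFD

        int cp = Decode('\uD800', '\uDC00');
        Console.WriteLine(cp.ToString("X")); // 10000
    }
}
```

Because the two surrogate ranges do not overlap, Decode can tell from a single code unit whether it is the first or the second half of a pair, which is the property the text refers to as safe word-oriented processing.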



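The calculation above can be done with the scientific mode of the calculator that ships with Windows, or of course with a small program of your own :)
To type characters in the range 10000-10FFFF, you can use the Microsoft Pinyin IME, which has a feature for entering a character by its Unicode code.
To display the Chinese characters among them, you can install Unifont; see the 海峰五笔 website.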

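For more background on encodings, google the article series "Java中的字符集编码入门" (an introduction to character set encodings in Java); it is very well written.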


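One caveat:
On the .NET Framework, if a string variable name contains two characters, one in the range 0-FFFF and the other in the range 10000-10FFFF, then name.Length is 3 rather than 2. Length counts 16-bit code units, and name occupies 6 bytes, that is, three code units.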
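A small sketch of this behaviour, and of one possible way to count code points instead of code units. LengthDemo and CountCodePoints are hypothetical names; char.IsHighSurrogate and char.IsLowSurrogate are standard .NET methods.

```csharp
using System;

static class LengthDemo
{
    // Counts Unicode code points by treating each well-formed
    // surrogate pair as a single character.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                i++; // consume the low surrogate together with the high one
            count++;
        }
        return count;
    }

    static void Main()
    {
        // "A" is in 0-FFFF; U+2F92B is in 10000-10FFFF.
        string name = "A" + char.ConvertFromUtf32(0x2F92B);
        Console.WriteLine(name.Length);           // 3
        Console.WriteLine(CountCodePoints(name)); // 2
    }
}
```

An unpaired surrogate is counted as one character here; whether to reject such input instead is a design choice left open in this sketch.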




using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

namespace CodePoint2UTF16
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        // Converts the hexadecimal code point typed into tbUnicodeCodePoint
        // to its UTF-16 code unit(s) and shows the result in tbUTF16Code.
        private void btnConvert_Click(object sender, EventArgs e)
        {
            string cp = tbUnicodeCodePoint.Text.Trim();

            try
            {
                int n = Convert.ToInt32(cp, 16);
                if (n < 0 || n > 0x10FFFF)
                {
                    MessageBox.Show(cp + " is not in 0x0 - 0x10FFFF");
                    return;
                }
                if (n >= 0xD800 && n <= 0xDFFF)
                {
                    // Surrogate code points never represent a character.
                    MessageBox.Show(cp + " is a surrogate code point (0xD800 - 0xDFFF)");
                    return;
                }
                if (n < 0x10000)
                {
                    // BMP: the single code unit equals the code point.
                    tbUTF16Code.Text = Convert.ToString(n, 16);
                    return;
                }

                // Supplementary planes: build the surrogate pair.
                n -= 0x10000;        // 20-bit value
                int h = n >> 10;     // upper 10 bits
                int l = n & 0x3FF;   // lower 10 bits
                h |= 0xD800;
                l |= 0xDC00;
                tbUTF16Code.Text = Convert.ToString(h, 16) + " " + Convert.ToString(l, 16);
            }
            catch (Exception ex)
            {
                MessageBox.Show("Invalid text: " + cp + Environment.NewLine + ex.Message);
            }
        }
    }
}









This article is reposted from h2appy's 51CTO blog. Original link: http://blog.51cto.com/h2appy/144639. For permission to repost, please contact the original author.