标签
PostgreSQL , 按拼音排序 , collate
背景
数据库为了支持国际化,通常会涉及到collate, ctype的概念。
初始化数据库集群时,可以设置如下参数,用于设置数据库的字符串排序、字符归类方法、数值\日期\时间\货币的格式等。
LC_COLLATE |
String sort order |
LC_CTYPE |
Character classification (What is a letter? Its upper-case equivalent?) |
LC_MESSAGES |
Language of messages |
LC_MONETARY |
Formatting of currency amounts |
LC_NUMERIC |
Formatting of numbers |
LC_TIME |
Formatting of dates and times |
用户可以利用这些特性,按本土化需求,输出对应的顺序或者格式。
按中文的拼音为顺序排序就是一个常见的需求。
PostgreSQL支持哪些字符集(encoding)
用户可以参考PostgreSQL的官方文档,有对应的字符集支持列表
https://www.postgresql.org/docs/9.6/static/multibyte.html
Server=Yes表示该字符集支持用于create database。否则只支持作为客户端字符集。
Name |
Description |
Language |
Server? |
Bytes/Char |
Aliases |
BIG5 |
Big Five |
Traditional Chinese |
No |
1-2 |
WIN950, Windows950 |
EUC_CN |
Extended UNIX Code-CN |
Simplified Chinese |
Yes |
1-3 |
- |
EUC_JP |
Extended UNIX Code-JP |
Japanese |
Yes |
1-3 |
- |
EUC_JIS_2004 |
Extended UNIX Code-JP, JIS X 0213 |
Japanese |
Yes |
1-3 |
- |
EUC_KR |
Extended UNIX Code-KR |
Korean |
Yes |
1-3 |
- |
EUC_TW |
Extended UNIX Code-TW |
Traditional Chinese, Taiwanese |
Yes |
1-3 |
- |
GB18030 |
National Standard |
Chinese |
No |
1-4 |
- |
GBK |
Extended National Standard |
Simplified Chinese |
No |
1-2 |
WIN936, Windows936 |
ISO_8859_5 |
ISO 8859-5, ECMA 113 |
Latin/Cyrillic |
Yes |
1 |
- |
ISO_8859_6 |
ISO 8859-6, ECMA 114 |
Latin/Arabic |
Yes |
1 |
- |
ISO_8859_7 |
ISO 8859-7, ECMA 118 |
Latin/Greek |
Yes |
1 |
- |
ISO_8859_8 |
ISO 8859-8, ECMA 121 |
Latin/Hebrew |
Yes |
1 |
- |
JOHAB |
JOHAB |
Korean (Hangul) |
No |
1-3 |
- |
KOI8R |
KOI8-R |
Cyrillic (Russian) |
Yes |
1 |
KOI8 |
KOI8U |
KOI8-U |
Cyrillic (Ukrainian) |
Yes |
1 |
- |
LATIN1 |
ISO 8859-1, ECMA 94 |
Western European |
Yes |
1 |
ISO88591 |
LATIN2 |
ISO 8859-2, ECMA 94 |
Central European |
Yes |
1 |
ISO88592 |
LATIN3 |
ISO 8859-3, ECMA 94 |
South European |
Yes |
1 |
ISO88593 |
LATIN4 |
ISO 8859-4, ECMA 94 |
North European |
Yes |
1 |
ISO88594 |
LATIN5 |
ISO 8859-9, ECMA 128 |
Turkish |
Yes |
1 |
ISO88599 |
LATIN6 |
ISO 8859-10, ECMA 144 |
Nordic |
Yes |
1 |
ISO885910 |
LATIN7 |
ISO 8859-13 |
Baltic |
Yes |
1 |
ISO885913 |
LATIN8 |
ISO 8859-14 |
Celtic |
Yes |
1 |
ISO885914 |
LATIN9 |
ISO 8859-15 |
LATIN1 with Euro and accents |
Yes |
1 |
ISO885915 |
LATIN10 |
ISO 8859-16, ASRO SR 14111 |
Romanian |
Yes |
1 |
ISO885916 |
MULE_INTERNAL |
Mule internal code |
Multilingual Emacs |
Yes |
1-4 |
- |
SJIS |
Shift JIS |
Japanese |
No |
1-2 |
Mskanji, ShiftJIS, WIN932, Windows932 |
SHIFT_JIS_2004 |
Shift JIS, JIS X 0213 |
Japanese |
No |
1-2 |
- |
SQL_ASCII |
unspecified (see text) |
any |
Yes |
1 |
- |
UHC |
Unified Hangul Code |
Korean |
No |
1-2 |
WIN949, Windows949 |
UTF8 |
Unicode, 8-bit |
all |
Yes |
1-4 |
Unicode |
WIN866 |
Windows CP866 |
Cyrillic |
Yes |
1 |
ALT |
WIN874 |
Windows CP874 |
Thai |
Yes |
1 |
- |
WIN1250 |
Windows CP1250 |
Central European |
Yes |
1 |
- |
WIN1251 |
Windows CP1251 |
Cyrillic |
Yes |
1 |
WIN |
WIN1252 |
Windows CP1252 |
Western European |
Yes |
1 |
- |
WIN1253 |
Windows CP1253 |
Greek |
Yes |
1 |
- |
WIN1254 |
Windows CP1254 |
Turkish |
Yes |
1 |
- |
WIN1255 |
Windows CP1255 |
Hebrew |
Yes |
1 |
- |
WIN1256 |
Windows CP1256 |
Arabic |
Yes |
1 |
- |
WIN1257 |
Windows CP1257 |
Baltic |
Yes |
1 |
- |
WIN1258 |
Windows CP1258 |
Vietnamese |
Yes |
1 |
ABC, TCVN, TCVN5712, VSCII |
如何获取字符集支持的LC_COLLATE, LC_CTYPE信息
使用如下SQL可以查询系统表pg_collation得到字符集支持的lc_collate和lc_ctype。
其中encoding为空时,表示这个collation支持所有的字符集。
test=> select pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype from pg_collation ;
encoding | collname | collcollate | collctype
------------+-----------------------+-----------------------+-----------------------
| default | |
| C | C | C
| POSIX | POSIX | POSIX
UTF8 | aa_DJ | aa_DJ.utf8 | aa_DJ.utf8
LATIN1 | aa_DJ | aa_DJ | aa_DJ
LATIN1 | aa_DJ.iso88591 | aa_DJ.iso88591 | aa_DJ.iso88591
UTF8 | aa_DJ.utf8 | aa_DJ.utf8 | aa_DJ.utf8
UTF8 | aa_ER | aa_ER | aa_ER
UTF8 | aa_ER.utf8 | aa_ER.utf8 | aa_ER.utf8
.......
EUC_CN | zh_CN | zh_CN | zh_CN
UTF8 | zh_CN | zh_CN.utf8 | zh_CN.utf8
EUC_CN | zh_CN.gb2312 | zh_CN.gb2312 | zh_CN.gb2312
UTF8 | zh_CN.utf8 | zh_CN.utf8 | zh_CN.utf8
UTF8 | zh_HK | zh_HK.utf8 | zh_HK.utf8
UTF8 | zh_HK.utf8 | zh_HK.utf8 | zh_HK.utf8
EUC_CN | zh_SG | zh_SG | zh_SG
UTF8 | zh_SG | zh_SG.utf8 | zh_SG.utf8
EUC_CN | zh_SG.gb2312 | zh_SG.gb2312 | zh_SG.gb2312
UTF8 | zh_SG.utf8 | zh_SG.utf8 | zh_SG.utf8
EUC_TW | zh_TW | zh_TW.euctw | zh_TW.euctw
UTF8 | zh_TW | zh_TW.utf8 | zh_TW.utf8
EUC_TW | zh_TW.euctw | zh_TW.euctw | zh_TW.euctw
UTF8 | zh_TW.utf8 | zh_TW.utf8 | zh_TW.utf8
UTF8 | zu_ZA | zu_ZA.utf8 | zu_ZA.utf8
LATIN1 | zu_ZA | zu_ZA | zu_ZA
LATIN1 | zu_ZA.iso88591 | zu_ZA.iso88591 | zu_ZA.iso88591
UTF8 | zu_ZA.utf8 | zu_ZA.utf8 | zu_ZA.utf8
(869 rows)
如何设置数据库的本土化(collate)信息
《如何设置数据库的LC_COLLATE, LC_CTYPE, ENCODING, TEMPLATE》
如何设置字段的本土化(collate)
在操作前,请了解清楚与您当前数据库字符集(encoding)兼容的collate,使用如下SQL可以得到当前数据库的encoding
postgres=# select datname,pg_encoding_to_char(encoding) as encoding from pg_database;
datname | encoding
--------------------+-----------
template1 | UTF8
template0 | UTF8
db | SQL_ASCII
db1 | EUC_CN
contrib_regression | UTF8
test01 | UTF8
test02 | UTF8
postgres | UTF8
(8 rows)
1. 在创建表时,指定兼容当前字符集的collate
CREATE TABLE test1 (
a text COLLATE "de_DE",
b text COLLATE "es_ES",
...
);
2. 修改列collate(会导致rewrite table),大表请谨慎操作
alter table a alter c1 type text COLLATE "zh_CN";
如何在SQL用使用本土化(collate)
1. 使用本土化, 改变order by输出排序
test=# select * from a order by c1 collate "C";
c1
--------
刘少奇
刘德华
(2 rows)
test=# select * from a order by c1 collate "zh_CN";
c1
--------
刘德华
刘少奇
(2 rows)
2. 使用本土化, 改变操作符的结果
test=# select * from a where c1 > '刘少奇' collate "C";
c1
--------
刘德华
(1 row)
test=# select * from a where c1 > '刘少奇' collate "zh_CN";
c1
----
(0 rows)
如何使用本土化索引, 按拼音排序
注意排序语句中的collate与索引的collate保持一致,才能使用这个索引进行排序。
postgres=# create index idxa on a(c1 collate "zh_CN");
CREATE INDEX
postgres=# explain select * from a order by c1 collate "zh_CN";
QUERY PLAN
------------------------------------------------------------------------
Index Only Scan using idxa on a (cost=0.15..31.55 rows=1360 width=64)
(1 row)
如何按拼音排序
1. 方法1,使用本土化SQL(不修改原有数据)
test=# select * from a order by c1 collate "zh_CN";
c1
--------
刘德华
刘少奇
(2 rows)
2. 方法2,使用本土化字段(如果已有数据,则需要调整原有数据)
alter table a alter c1 type text COLLATE "zh_CN";
3. 方法3,使用本土化索引以及本土化SQL(不修改原有数据)
postgres=# create index idxa on a(c1 collate "zh_CN");
CREATE INDEX
postgres=# explain select * from a order by c1 collate "zh_CN";
QUERY PLAN
------------------------------------------------------------------------
Index Only Scan using idxa on a (cost=0.15..31.55 rows=1360 width=64)
(1 row)
4. 设置数据库的collate为zh_CN,将默认使用这个collate,按拼音排序
test02=# create database test03 encoding 'UTF8' lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8' template template0;
CREATE DATABASE
test02=# \c test03
You are now connected to database "test03" as user "postgres".
test03=# select * from (values ('刘德华'),('刘少奇')) as a(c1) order by c1 ;
c1
--------
刘德华
刘少奇
(2 rows)
注意多音字
有些多音字,例如重庆(chongqing), 编码时"重"可能是按zhong编码,影响输出。
test03=# select * from (values ('中山'),('重庆')) as a(c1) order by c1 collate "zh_CN";
c1
------
中山
重庆
(2 rows)
Greenplum按拼音排序
greenplum不支持单列设置collate,按拼音排序有些许不同。
在greenplum中,可以使用字符集转换,按对应二进制排序,得到拼音排序的效果。
postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by byteain(textout(convert(id,'UTF8','EUC_CN')));
id
--------
刘德华
刘少奇
(2 rows)
参考
《PostgreSQL 按拼音排序 - convert to GBK/EUC_CN coding》
《如何设置数据库的LC_COLLATE, LC_CTYPE, ENCODING, TEMPLATE》
https://www.postgresql.org/docs/9.6/static/charset.html