背景
在Oracle中有一个函数asciistr,可以将非ASCII字符转换成UTF-16编码的字符,因此转换后的字符串可以存储在只支持ASCII编码的数据库中。
ASCIISTR takes as its argument a string, or an expression that resolves to a string, in any character set and returns an ASCII version of the string in the database character set.
Non-ASCII characters are converted to the form \xxxx, where xxxx represents a UTF-16 code unit.
例子
SELECT ASCIISTR('ABÄCDE') FROM DUAL;
ASCIISTR('
----------
AB\00C4CDE
\00C4即转换后的UTF-16编码,这样整个字符串就可以存储在ASCII编码的数据库中了。
PostgreSQL ASCIISTR
了解了ASCIISTR的目的之后,我们就可以在PostgreSQL中实现对非ASCII编码的字符转换了。
PostgreSQL数据库没有UTF-16字符集,但是它支持UTF-8,UTF-8是变长字符集,支持全球所有的语言字符编码。
PostgreSQL提供了几个转换函数如下
postgres=# \df convert*
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+--------------+------------------+---------------------+--------
pg_catalog | convert | bytea | bytea, name, name | normal
pg_catalog | convert_from | text | bytea, name | normal
pg_catalog | convert_to | bytea | text, name | normal
(3 rows)
例子,支持来回转换
postgres=# select convert_to('abc你好中国cde','UTF8');
convert_to
----------------------------------------
\x616263e4bda0e5a5bde4b8ade59bbd636465
(1 row)
postgres=# select convert_from(convert_to('abc你好中国cde','UTF8'),'UTF8');
convert_from
----------------
abc你好中国cde
(1 row)
postgres=# select convert_from('\x616263e4bda0e5a5bde4b8ade59bbd636465', 'UTF8');
convert_from
----------------
abc你好中国cde
(1 row)
方法2,使用textsend函数,这个函数没有编码输入,所以需要设置client_encoding。
postgres=# set client_encoding='UTF8';
SET
postgres=# select textsend('abc你好中国cde');
textsend
----------------------------------------
\x616263e4bda0e5a5bde4b8ade59bbd636465
(1 row)
postgres=# set client_encoding='GBK';
SET
postgres=# select textsend('abc你好中国cde');
ERROR: character with byte sequence 0xad 0xe5 in encoding "GBK" has no equivalent in encoding "UTF8"
LOCATION: report_untranslatable_char, wchar.c:2051
postgres=# select textsend('abc你好cde');
textsend
----------------------------
\x616263e4bda0e5a5bd636465
(1 row)
PostgreSQL支持的字符编码转换函数在这里
postgres=# select * from pg_conversion ;
conname | connamespace | conowner | conforencoding | contoencoding | conproc | condefault
--------------------------------+--------------+----------+----------------+---------------+--------------------------------+------------
ascii_to_mic | 11 | 10 | 0 | 7 | ascii_to_mic | t
mic_to_ascii | 11 | 10 | 7 | 0 | mic_to_ascii | t
......
utf8_to_shift_jis_2004 | 11 | 10 | 6 | 41 | utf8_to_shift_jis_2004 | t
euc_jis_2004_to_shift_jis_2004 | 11 | 10 | 5 | 41 | euc_jis_2004_to_shift_jis_2004 | t
shift_jis_2004_to_euc_jis_2004 | 11 | 10 | 41 | 5 | shift_jis_2004_to_euc_jis_2004 | t
(132 rows)
以上例子把整个字符串都进行了转换,包括一些已经是ASCII的字符也被转换了,转换后长度变长了,如果你要尽量缩小长度怎么办呢?
如果你要对一个字符串中非ASCII字符转换为bytea,而ASCII字符保持不变,则需要自定义PostgreSQL函数,如下demo
create or replace function asciistr(str text, encoding text) returns text[] as $$
declare
mid text;
res text[] := (array[])::text[];
begin
foreach mid in array regexp_split_to_array(str, '')
loop
if ascii(mid)<256 then
res := array_append(res, mid);
else
res := array_append(res, (convert_to(mid,encoding))::text);
end if;
end loop;
return res;
end;
$$
language plpgsql strict;
测试
postgres=# select asciistr('abc中国你好ced', 'UTF8');
asciistr
---------------------------------------------------------------
{a,b,c,"\\xe4b8ad","\\xe59bbd","\\xe4bda0","\\xe5a5bd",c,e,d}
(1 row)
你还可以输出字符串,也可以直接输出字符串
postgres=# select array_to_string(asciistr('abc中国你好ced', 'UTF8'), '');
array_to_string
----------------------------------------
abc\xe4b8ad\xe59bbd\xe4bda0\xe5a5bdced
(1 row)
使用这种方法,在反转换时,也要通过函数来处理。
create or replace function reverse_asciistr(str text[], encoding text) returns text as $$
declare
mid text;
res text := '';
begin
foreach mid in array str
loop
if mid ~ '^\\x' then
res := concat(res, convert_from(mid::bytea, encoding));
else
res := concat(res, mid);
end if;
end loop;
return res;
end;
$$
language plpgsql strict;
反转转换举例
postgres=# select reverse_asciistr(asciistr('abc_12\ab你好\ade中国_1jjr3', 'UTF8'), 'UTF8');
reverse_asciistr
-----------------------------
abc_12\ab你好\ade中国_1jjr3
(1 row)