标签
PostgreSQL , json , 字符串转义 , unicode , SQL注入 , backslash_quote , escape_string_warning , standard_conforming_strings
背景
通过本文,你可以了解:
1. 如何在字符串中输入特殊字符,例如回车。
2. 如何在字符串中输入单引号。
3. 转义字符是什么?用法。
4. 在SQL标准中,反斜线是转义符吗?PostgreSQL如何使用反斜线转义?书写格式如何?
5. 为什么转义单引号存在SQL注入的风险?数据库有什么参数可以控制这个风险?有哪些参数可以发出警告?
6. 如何以UNICODE的形式输入字符串。
7. 以UNICODE的格式输入,支持所有字符集吗?目前支持哪些字符集,支持的CODE范围是多少?
8. 在Greenplum的JSON中输入UNICODE value有什么问题吗?PostgreSQL中有没有同样的问题?
转义配置
有三个配置控制转义如下:
1. (与SQL注入有关)是否允许使用反斜线\转义单引号? \'
backslash_quote = on 允许, off 不允许, safe_encoding(仅当client_encoding不允许反斜线\
出现在多字节字符中时(大多数字符集\
都是单字节表示),那么才允许使用\'
转义单引号.)
为什么要控制,不让转义单引号呢?
因为这样可能引入SQL注入的风险,比如最终客户将\
放在末尾,就可以将单引号转义掉,本来应该正常输入的字符串结束符,就不存在了。
backslash_quote (enum)
This controls whether a quote mark can be represented by \' in a string literal.
The preferred, SQL-standard way to represent a quote mark is by doubling it ('') but PostgreSQL has historically also accepted \'.
However, use of \' creates security risks because in some client character set encodings, there are multibyte characters in which the last byte is numerically equivalent to ASCII \.
If client-side code does escaping incorrectly then a SQL-injection attack is possible.
This risk can be prevented by making the server reject queries in which a quote mark appears to be escaped by a backslash.
The allowed values of backslash_quote are on (allow \' always), off (reject always), and safe_encoding (allow only if client encoding does not allow ASCII \ within a multibyte character).
safe_encoding is the default setting.
Note that in a standard-conforming string literal, \ just means \ anyway. This parameter only affects the handling of non-standard-conforming literals, including escape string syntax (E'...').
注意,仅仅当standard_conforming_strings=off或者使用E''的写法时,backslash_quote的配置才起作用。
否则当standard_conforming_strings=on时,\
会被作为普通的字符串处理。
例子1, 当standard_conforming_strings=on时,\
会被作为普通的字符串处理
postgres=# set standard_conforming_strings=on;
SET
postgres=# select '\';
?column?
----------
\
(1 row)
例子2, 当standard_conforming_strings=off 或者 使用E''的写法时,backslash_quote的配置才起作用。
postgres=# set backslash_quote = off; -- 不允许转义'
SET
postgres=# select E'\';
';
ERROR: 22P06: unsafe use of \' in a string literal
LINE 1: select E'\';
^
HINT: Use '' to write quotes in strings. \' is insecure in client-only encodings.
LOCATION: core_yylex, scan.l:641
postgres=# set backslash_quote = off; -- 不允许转义'
SET
postgres=# set standard_conforming_strings=off;
SET
postgres=# select '\';
';
ERROR: 22P06: unsafe use of \' in a string literal
LINE 1: select '\';
^
HINT: Use '' to write quotes in strings. \' is insecure in client-only encodings.
LOCATION: core_yylex, scan.l:641
postgres=# set backslash_quote = on; -- 允许转义'
SET
postgres=# select '\';
';
WARNING: 22P06: nonstandard use of \' in a string literal
LINE 1: select '\';
^
HINT: Use '' to write quotes in strings, or use the escape string syntax (E'...').
LOCATION: check_string_escape_warning, scan.l:1483
?column?
----------
'; +
(1 row)
postgres=# set backslash_quote = safe_encoding; -- 当client_encoding不允许反斜线```\```出现在多字节字符中时(大多数字符集```\```都是单字节表示),允许使用```\'```转义单引号.
SET
postgres=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)
postgres=# select '\';
';
WARNING: 22P06: nonstandard use of \' in a string literal
LINE 1: select '\';
^
HINT: Use '' to write quotes in strings, or use the escape string syntax (E'...').
LOCATION: check_string_escape_warning, scan.l:1483
?column?
----------
'; +
(1 row)
2. 当standard_conforming_strings=off,同时在''中的字符串中包含反斜线时,是否输出警告。
因为SQL标准中,字符串中的反斜线不是转义字符,而是普通字符。
如果你要转义,那么请使用E''的写法。
escape_string_warning (boolean)
When on, a warning is issued if a backslash (\) appears in an ordinary string literal ('...' syntax) and standard_conforming_strings is off.
The default is on.
Applications that wish to use backslash as escape should be modified to use escape string syntax (E'...'),
because the default behavior of ordinary strings is now to treat backslash as an ordinary character, per SQL standard.
This variable can be enabled to help locate code that needs to be changed.
例子1
postgres=# set escape_string_warning=on;
SET
postgres=# set standard_conforming_strings=off;
SET
postgres=# select '\';
';
WARNING: 22P06: nonstandard use of \' in a string literal -- 收到警告消息
LINE 1: select '\';
^
HINT: Use '' to write quotes in strings, or use the escape string syntax (E'...').
LOCATION: check_string_escape_warning, scan.l:1483
?column?
----------
'; +
(1 row)
例子2
postgres=# set escape_string_warning=on;
SET
postgres=# set standard_conforming_strings=off;
SET
postgres=# select E'\'; -- 使用E''写法,\作为转义字符
';
?column?
----------
'; +
(1 row)
3. 告诉数据库,在''字符串中的反斜线,是否作为普通字符。
standard_conforming_strings = on 反斜线作为普通字符(SQL标准用法), off 反斜线作为转义字符。
standard_conforming_strings (boolean)
This controls whether ordinary string literals ('...') treat backslashes literally, as specified in the SQL standard.
Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off).
Applications can check this parameter to determine how string literals will be processed.
The presence of this parameter can also be taken as an indication that the escape string syntax (E'...') is supported.
Escape string syntax (Section 4.1.2.2) should be used if an application desires backslashes to be treated as escape characters.
如果standard_conforming_strings=on,要使用转义怎么办?采样E''写法即可
postgres=# set standard_conforming_strings=on;
SET
postgres=# select '\'; -- SQL标准写法
?column?
----------
\
(1 row)
postgres=# select E'\''; -- 转义写法,此时\作为转义字符
?column?
----------
'
(1 row)
如何转义
1. standard_conforming_strings=on(SQL标准),那么字符串''中的反斜线是普通字符。 使用E''写法,反斜线才是转义字符。
2. standard_conforming_strings=off(非SQL标准), 那么字符串''中的反斜线都是转义字符。
C语言风格字符串
Backslash Escape Sequence | Interpretation |
---|---|
\b | backspace |
\f | form feed |
\n | newline |
\r | carriage return |
\t | tab |
\o, \oo, \ooo (o = 0 - 7) | octal byte value |
\xh, \xhh (h = 0 - 9, A - F) | hexadecimal byte value |
\uxxxx, \Uxxxxxxxx (x = 0 - 9, A - F) | 16 or 32-bit hexadecimal Unicode character value |
\\ | 反斜线 |
' | 单引号 |
unicode输入
如何输入unicode的字符串呢?
用户可以直接向数据库输入UNICODE编码,省去编码转换的过程,但是用户必须保证编码与数据库服务端编码一致,否则可能出现编码溢出,或者乱码的问题。存入后,就和直接存储字符串是一样的。
unicode输入格式1
当反斜线为转义字符配置时,这样输入UNICODE
(注意这种输入格式,要求反斜线必须为转义字符,参考前面的章节,如果让反斜线变成转义字符)
\uxxxx, \Uxxxxxxxx (x = 0 - 9, A - F)
小写
16 or 32-bit hexadecimal Unicode character value
例子
postgres=# set standard_conforming_strings =on;
SET
postgres=# select E'\u00f7'; -- \为转义字符
?column?
----------
÷
(1 row)
postgres=# select '\u00f7'; -- \为普通字符, 没有输入unicode, 而是字符串
?column?
----------
\u00f7
(1 row)
postgres=# set standard_conforming_strings =off; -- \为转义字符
SET
postgres=# select '\u00f7';
WARNING: 22P06: nonstandard use of escape in a string literal
LINE 1: select '\u00f7';
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
LOCATION: check_escape_warning, scan.l:1508
?column?
----------
÷
(1 row)
postgres=# create table u(id int, info text);
CREATE TABLE
postgres=# select E'\u00f0';
?column?
----------
e
(1 row)
postgres=# insert into u values (1, 'e');
INSERT 0 1
postgres=# insert into u values (2, E'\u00f0');
INSERT 0 1
postgres=# select * from u;
id | info
----+------
1 | e
2 | e
(2 rows)
目前只有UTF8字符集允许输入大于007f的UNICODE,其他字符集只能输入ascii范围的unicode.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
--------------------+----------+-----------+------------+------------+-----------------------
contrib_regression | postgres | UTF8 | C | C |
db | postgres | SQL_ASCII | C | C |
db1 | postgres | EUC_CN | C | C |
postgres | postgres | UTF8 | C | C |
template0 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
test | postgres | UTF8 | en_US.UTF8 | en_US.UTF8 |
test01 | postgres | UTF8 | C | C |
test02 | postgres | UTF8 | zh_CN.UTF8 | zh_CN.UTF8 |
(9 rows)
postgres=# \c db1 -- EUC_CN编码
You are now connected to database "db1" as user "postgres".
db1=# select U&'\00fe';
ERROR: 42601: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at or near "\00fe'"
LINE 1: select U&'\00fe';
^
LOCATION: scanner_yyerror, scan.l:1086
db1=# \c postgres -- UTF8编码,允许输入比007F更大的unicode
You are now connected to database "postgres" as user "postgres".
postgres=# select U&'\00fe';
?column?
----------
t
(1 row)
详情
It is your responsibility that the byte sequences you create, especially when using the octal or hexadecimal escapes, compose valid characters in the server character set encoding.
When the server encoding is UTF-8, then the Unicode escapes or the alternative Unicode escape syntax, explained in Section 4.1.2.3, should be used instead.
(The alternative would be doing the UTF-8 encoding by hand and writing out the bytes, which would be very cumbersome.)
The Unicode escape syntax works fully only when the server encoding is UTF8.
When other server encodings are used, only code points in the ASCII range (up to \u007F) can be specified.
Both the 4-digit and the 8-digit form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF, although the availability of the 8-digit form technically makes this unnecessary.
(When surrogate pairs are used when the server encoding is UTF8, they are first combined into a single code point that is then encoded in UTF-8.)
Caution
If the configuration parameter standard_conforming_strings is off, then PostgreSQL recognizes backslash escapes in both regular and escape string constants.
However, as of PostgreSQL 9.1, the default is on, meaning that backslash escapes are recognized only in escape string constants.
This behavior is more standards-compliant, but might break applications which rely on the historical behavior, where backslash escapes were always recognized.
As a workaround, you can set this parameter to off, but it is better to migrate away from using backslash escapes.
If you need to use a backslash escape to represent a special character, write the string constant with an E.
In addition to standard_conforming_strings, the configuration parameters escape_string_warning and backslash_quote govern treatment of backslashes in string constants.
The character with the code zero cannot be in a string constant.
unicode输入格式2
格式2,刚好相反,反斜线不是转义字符时的unicode写法。
这种格式支持2、3个字节的UNICODE。
格式如下,
u&'\4个[0-F]'
u&'\+6个[0-F]'
u不区分大小写
字符串里面的其他位置的字符都被识别为标准字符, 而\则作为UNICODE的转义字符,\后面必须是4个[0-F]或者+6个[0-F], 如果要输入\则输入\\即可。
如果想更换转义字符,使用UESCAPE语法。
例子1
U&'d\0061t\+000061'
U&'\0441\043B\043E\043D'
U&'d!0061t!+000061' UESCAPE '!' -- 更改转义字符为!
U&'\\' -- 输入两个转义字符,输出转义字符本身
例子2, 不能使用转义格式
postgres=# set standard_conforming_strings =off; -- 转义字符模式, ''号中的\为转义字符,而非普通字符
SET
postgres=# select U&'d\0061t\+000061';
ERROR: 0A000: unsafe use of string constant with Unicode escapes
LINE 1: select U&'d\0061t\+000061';
^
DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
LOCATION: core_yylex, scan.l:534
postgres=# set standard_conforming_strings =on; -- 非转义字符模式,''中的\为普通字符,此时使用u&''就正常了
SET
postgres=# select U&'d\0061t\+000061';
?column?
----------
data
(1 row)
例子3
postgres=# select U&'d\0061t\+000061';
?column?
----------
data
(1 row)
postgres=# select U&'\0441\043B\043E\043D';
?column?
----------
слон
(1 row)
postgres=# select U&'d!0061t!+000061' UESCAPE '!';
?column?
----------
data
(1 row)
postgres=# select U&'\\';
?column?
----------
\
(1 row)
postgres=# select U&'\\\+000061';
?column?
----------
\a
(1 row)
postgres=# select U&'\\!+000061' UESCAPE '!';
?column?
----------
\\a
(1 row)
详情
PostgreSQL also supports another type of escape syntax for strings that allows specifying arbitrary Unicode characters by code point.
A Unicode escape string constant starts with U& (upper or lower case letter U followed by ampersand) immediately before the opening quote, without any spaces in between, for example U&'foo'.
(Note that this creates an ambiguity with the operator &. Use spaces around the operator to avoid this problem.)
Inside the quotes, Unicode characters can be specified in escaped form by writing a backslash followed by the four-digit hexadecimal code point number
or alternatively a backslash followed by a plus sign followed by a six-digit hexadecimal code point number.
For example, the string 'data' could be written as
U&'d\0061t\+000061'
The following less trivial example writes the Russian word "slon" (elephant) in Cyrillic letters:
U&'\0441\043B\043E\043D'
If a different escape character than backslash is desired, it can be specified using the UESCAPE clause after the string, for example:
U&'d!0061t!+000061' UESCAPE '!'
The escape character can be any single character other than a hexadecimal digit, the plus sign, a single quote, a double quote, or a whitespace character.
The Unicode escape syntax works only when the server encoding is UTF8. When other server encodings are used, only code points in the ASCII range (up to \007F) can be specified.
Both the 4-digit and the 6-digit form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF,
although the availability of the 6-digit form technically makes this unnecessary.
(When surrogate pairs are used when the server encoding is UTF8, they are first combined into a single code point that is then encoded in UTF-8.)
Also, the Unicode escape syntax for string constants only works when the configuration parameter standard_conforming_strings is turned on.
This is because otherwise this syntax could confuse clients that parse the SQL statements to the point that it could lead to SQL injections and similar security issues.
If the parameter is set to off, this syntax will be rejected with an error message.
To include the escape character in the string literally, write it twice.
字符串书写
1. 单引号
例子
'字符串内容'
2. 美元符号
例子
$$字符串内容$$
greenplum JSON中的unicode问题
当greenplum的json中存储的是unicode原生的字符串时,使用->提取,可能会报错。(估计是转换过程导致的异常)
创建测试表
postgres=# create table tbl(id int, info json);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
当前为转义模式,也就是说,UNICODE会被提前转换,并存入
postgres=# set standard_conforming_strings =off; -- 设置为转义模式,即 \ 被作为转义字符使用,UNICODE格式也可以被正常识别并转换
插入,UNICODE转换为最终字符串被插入了,提取也正常
postgres=# insert into tbl(id,info) values (1,'{"activitytitle":"你好中华人民共和国","giftname":"\u6447\u4E00\u6447\u7EA2\u5305","ruledesc":null}');
WARNING: nonstandard use of escape in a string literal
LINE 1: insert into tbl(id,info) values (1,'{"activitytitle":"你好中...
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
INSERT 0 1
postgres=# select info->'activitytitle' from tbl;
?column?
----------------------
"你好中华人民共和国"
(1 row)
postgres=# select * from tbl;
id | info
----+--------------------------------------------------------------------------------
1 | {"activitytitle":"你好中华人民共和国","giftname":"摇一摇红包","ruledesc":null}
(1 row)
使用非转义模式,也就是说UNICODE的字符串还是原样被插入
此时提取会发生异常。
postgres=# set standard_conforming_strings =on; -- 非转义模式,\会当成普通字符插入,所以不会被识别为unicode模式,插入的是原样的字符
SET
postgres=# insert into tbl(id,info) values (1,'{"activitytitle":"你好中华人民共和国","giftname":"\u6447\u4E00\u6447\u7EA2\u5305","ruledesc":null}');
INSERT 0 1
postgres=# select * from tbl;
id | info
----+----------------------------------------------------------------------------------------------------
1 | {"activitytitle":"你好中华人民共和国","giftname":"摇一摇红包","ruledesc":null}
1 | {"activitytitle":"你好中华人民共和国","giftname":"\u6447\u4E00\u6447\u7EA2\u5305","ruledesc":null}
(2 rows)
-- 提取unicode对应的字段失败
postgres=# \set VERBOSITY verbose
postgres=# select info->'activitytitle' from tbl;
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xe6 (seg0 slice1 digoal:39999 pid=22528)
LOCATION: cdbdisp_finishCommand, cdbdisp.c:1326
postgres=# select info->'giftname' from tbl;
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xe6 (seg0 slice1 digoal:39999 pid=27053)
LOCATION: cdbdisp_finishCommand, cdbdisp.c:1326
解决办法
使用转义模式,将UNICODE转义后存入JSON中。
普通字符串中不会有这样的问题,仅仅是Greenplum的JSON类型有这样的问题,
PG 9.4的JSON也没有这样的问题,如下。
postgres=# create table tbl(id int, info json);
CREATE TABLE
postgres=# set standard_conforming_strings =on;
SET
postgres=# insert into tbl(id,info) values (1,'{"activitytitle":"你好中华人民共和国","giftname":"\u6447\u4E00\u6447\u7EA2\u5305","ruledesc":null}');
INSERT 0 1
postgres=# select * from tbl;
id | info
----+----------------------------------------------------------------------------------------------------
1 | {"activitytitle":"你好中华人民共和国","giftname":"\u6447\u4E00\u6447\u7EA2\u5305","ruledesc":null}
(1 row)
postgres=# select info->'activitytitle' from tbl;
?column?
----------------------
"你好中华人民共和国"
(1 row)
postgres=# select info->'giftname' from tbl;
?column?
----------------------------------
"\u6447\u4E00\u6447\u7EA2\u5305"
(1 row)
小结
1. 如何在字符串中输入特殊字符,例如回车。
1. 使用转义的方法输入特殊字符
E'\?'
2. 使用UNIDOCE的方法输入特殊字符,standard_conforming_strings=off, escape_string_warning=off
U&'\xxxx'
2. 如何在字符串中输入单引号。
4种方式在字符串中输入单引号
U&'\????'
E'\''
$$'$$
''''
3. 转义字符是什么?PostgreSQL如何使用反斜线转义?书写格式如何?
反斜线作为转义字符,与C语言风格类似,可以输入特殊字符。
E'\?'
或者
standard_conforming_strings=off, escape_string_warning=off
'\?'
4. 在SQL标准中,反斜线是转义符吗?
SQL标准中,反斜线是普通字符
standard_conforming_strings=on
'\' 中的\是普通字符
5. 为什么转义单引号存在SQL注入的风险?数据库有什么参数可以控制这个风险?有哪些参数可以发出警告?
转义字符可以将单引号转换为普通字符,这么做可能导致正常的字符串结束变成未结束。
通过参数可以控制是否允许转义单引号
backslash_quote = on 允许, off 不允许, safe_encoding(仅当client_encoding不允许反斜线```\```出现在多字节字符中时(大多数字符集```\```都是单字节表示),那么才允许使用```\'```转义单引号.)
6. 如何以UNICODE的形式输入字符串。
6种写法
U&'\xxxx'
U&'\+xxxxxx'
E'\uxxxx'
E'\Uxxxxxxxx'
standard_conforming_strings=off
'\uxxxx'
'\Uxxxxxxxx'
7. 以UNICODE的格式输入,支持所有字符集吗?目前支持哪些字符集,支持的CODE范围是多少?
当数据库encoding 为 UTF8时,支持除zero字符以外的所有合法字符
当数据库encoding <> UTF8时,仅支持ASCII字符,即CODE<007F的.
8. 在Greenplum的JSON中输入UNICODE value有什么问题吗?PostgreSQL中有没有同样的问题?
目前如果你在greenplum中使用json, 不建议在value中存储UNICODE,请将目标字符存入,否则会报错。
存入目标字符,即将UNICODE转成字符集对应的字符串。
4种方法实现
当输入为 '\????' 时,设置standard_conforming_strings=off
或者使用如下几种格式
E'\u????'
或者
U&'\xxxx'
或者
U&'\+xxxxxx'
PostgreSQL没有问题。
参考
https://www.postgresql.org/docs/9.6/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE