http://blog.csdn.net/mycwq/article/details/18838151#comments
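Using C++ regex Regular Expressions
Copyright notice: This is an original article by "没有开花的树". Do not repost it without the author's permission.
In C++ there are three regular expression libraries to choose from: C++ regex (the standard <regex> header, available since C++11), C regex, and Boost regex. When developing C++ on Windows, the latter two are not supported by default, so if you want to get started quickly, C++ regex is clearly the most convenient choice. This article discusses how to use C++ regex.
C++ regex provides three main functions: regex_match, regex_search, and regex_replace.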
regex_match
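regex_match checks whether a regular expression matches an entire target sequence. The example below shows how it is used. For a systematic treatment, see the reference documentation for regex_match.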
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Match a string literal against a temporary regex object.
    if (std::regex_match("subject", std::regex("(sub).(.*)"))) {
        std::cout << "string literal matched\n";
    }

    // Match a std::string against a named regex object.
    std::string s("subject");
    std::regex e("(sub)(.*)");
    if (std::regex_match(s, e)) {
        std::cout << "string object matched\n";
    }

    // std::cmatch holds the results for C strings (const char*).
    std::cmatch cm;
    std::regex_match("subject", cm, e);
    std::cout << "string literal with " << cm.size() << " matches\n";

    // std::smatch holds the results for std::string.
    std::smatch sm;
    std::regex_match(s, sm, e);
    std::cout << "string object with " << sm.size() << " matches\n";

    // Overload taking an iterator range.
    std::regex_match(s.cbegin(), s.cend(), sm, e);
    std::cout << "range with " << sm.size() << " matches\n";

    // Overload taking explicit match flags.
    std::regex_match("subject", cm, e, std::regex_constants::match_default);

    // Sub-match 0 is the whole match; 1 and 2 are the two groups.
    std::cout << "the matches were: ";
    for (unsigned i = 0; i < sm.size(); ++i) {
        std::cout << "[" << sm.str(i) << "]";
    }
    std::cout << '\n';

    // sm[i] can be streamed directly and prints the same text as sm.str(i).
    for (unsigned i = 0; i < sm.size(); ++i) {
        std::cout << "[" << sm[i] << "]";
    }
    std::cout << '\n';
}
```

Output:

```
string literal matched
string object matched
string literal with 3 matches
string object with 3 matches
range with 3 matches
the matches were: [subject][sub][ject]
[subject][sub][ject]
```
regex_search
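regex_search is the other regular expression matching function; an example follows. The main difference between the two is that regex_match must match the entire target sequence, whereas regex_search looks for a matching substring anywhere within it. For a systematic treatment, see the reference documentation for regex_search.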
```cpp
// regex_search example
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string s("this subject has a submarine as a subsequence");
    std::smatch m;
    std::regex e("\\b(sub)([^ ]*)");  // matches words beginning with "sub"

    std::cout << "Target sequence: " << s << std::endl;
    std::cout << "Regular expression: /\\b(sub)([^ ]*)/" << std::endl;
    std::cout << "The following matches and submatches were found:" << std::endl;

    while (std::regex_search(s, m, e)) {
        // Print the whole match followed by each sub-match.
        for (auto x = m.begin(); x != m.end(); x++)
            std::cout << x->str() << " ";
        std::cout << "--> ([^ ]*) match " << m.format("$2") << std::endl;
        // Continue searching in the text after the current match.
        s = m.suffix().str();
    }
}
```
Output:

```
Target sequence: this subject has a submarine as a subsequence
Regular expression: /\b(sub)([^ ]*)/
The following matches and submatches were found:
subject sub ject --> ([^ ]*) match ject
submarine sub marine --> ([^ ]*) match marine
subsequence sub sequence --> ([^ ]*) match sequence
```
regex_replace
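regex_replace replaces the parts of a sequence that match a regular expression. An example follows. For a systematic treatment, see the reference documentation for regex_replace.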
```cpp
#include <cstring>
#include <iostream>
#include <regex>
#include <string>

int main() {
    char buf[20];
    const char* first = "axayaz";
    const char* last = first + std::strlen(first);
    std::regex rx("a");
    std::string fmt("A");
    // format_first_only: replace only the first match.
    std::regex_constants::match_flag_type fonly =
        std::regex_constants::format_first_only;

    // Iterator overload: writes into buf and returns the position just past
    // the last character written, which is where the terminating '\0' goes.
    *std::regex_replace(&buf[0], first, last, rx, fmt) = '\0';
    std::cout << &buf[0] << std::endl;

    *std::regex_replace(&buf[0], first, last, rx, fmt, fonly) = '\0';
    std::cout << &buf[0] << std::endl;

    // String overload: returns a new std::string.
    std::string str("adaeaf");
    std::cout << std::regex_replace(str, rx, fmt) << std::endl;
    std::cout << std::regex_replace(str, rx, fmt, fonly) << std::endl;

    return 0;
}
```
Output:

```
AxAyAz
Axayaz
AdAeAf
Adaeaf
```
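The pattern syntax of C++ regex (the ECMAScript grammar by default) is much the same as in other programming languages. It is summarized below.
Special characters (for matching characters that are hard to write literally):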
characters | description | matches |
---|---|---|
. | not newline | any character except line terminators (LF, CR, LS, PS). |
\t | tab (HT) | a horizontal tab character (same as \u0009). |
\n | newline (LF) | a newline (line feed) character (same as \u000A). |
\v | vertical tab (VT) | a vertical tab character (same as \u000B). |
\f | form feed (FF) | a form feed character (same as \u000C). |
\r | carriage return (CR) | a carriage return character (same as \u000D). |
\cletter | control code | a control code character whose code unit value is the same as the remainder of dividing the code unit value of letter by 32. For example: \ca is the same as \u0001, \cb the same as \u0002, and so on... |
\xhh | ASCII character | a character whose code unit value has an hex value equivalent to the two hex digits hh. For example: \x4c is the same as L, or \x23 the same as #. |
\uhhhh | unicode character | a character whose code unit value has an hex value equivalent to the four hex digits hhhh. |
\0 | null | a null character (same as \u0000). |
\int | backreference | the result of the submatch whose opening parenthesis is the int-th (int shall begin by a digit other than 0). See groups below for more info. |
\d | digit | a decimal digit character |
\D | not digit | any character that is not a decimal digit character |
\s | whitespace | a whitespace character |
\S | not whitespace | any character that is not a whitespace character |
\w | word | an alphanumeric or underscore character |
\W | not word | any character that is not an alphanumeric or underscore character |
\character | character | the character character as it is, without interpreting its special meaning within a regex expression. Any character can be escaped except those which form any of the special character sequences above. Needed for: ^ $ \ . * + ? ( ) [ ] { } \| |
[class] | character class | the target character is part of the class |
[^class] | negated character class | the target character is not part of the class |
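Note that the backslash (\) is itself an escape character in C++ string literals, so every backslash in a pattern has to be doubled:

```cpp
std::regex e1("\\d");   // pattern \d  -> matches a digit character
std::regex e2("\\\\");  // pattern \\  -> matches a backslash character
```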
Quantifiers:
characters | times | effects |
---|---|---|
* | 0 or more | The preceding atom is matched 0 or more times. |
+ | 1 or more | The preceding atom is matched 1 or more times. |
? | 0 or 1 | The preceding atom is optional (matched either 0 times or once). |
{int} | int | The preceding atom is matched exactly int times. |
{int,} | int or more | The preceding atom is matched int or more times. |
{min,max} | between min and max | The preceding atom is matched at least min times, but not more than max. |
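Note that quantifiers are greedy by default: matching the pattern "(a+).*" against "aardvark" captures aa in the group, whereas the non-greedy pattern "(a+?).*" captures only a.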
Groups (for matching a sequence of several characters as a unit):
characters | description | effects |
---|---|---|
(subpattern) | Group | Creates a backreference. |
(?:subpattern) | Passive group | Does not create a backreference. |
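Note that the first form creates a backreference, which can be used to extract the matched text. The second form does not, and so avoids that overhead.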
characters | description | condition for match |
---|---|---|
^ | Beginning of line | Either it is the beginning of the target sequence, or follows a line terminator. |
$ | End of line | Either it is the end of the target sequence, or precedes a line terminator. |
\| | Separator | Separates two alternative patterns or subpatterns. |
Single characters
[abc] matches a, b, or c.
[^xyz] matches any character other than x, y, or z.
Ranges
[a-z] matches any lowercase letter (a, b, c, ..., z).
[abc1-5] matches a, b, c, or a digit from 1 to 5.
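C++ regex also supports POSIX-like character class names. They are written inside a bracket expression, for example [[:digit:]]: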
class | description | equivalent (with regex_traits, default locale) |
---|---|---|
[:alnum:] | alpha-numerical character | isalnum |
[:alpha:] | alphabetic character | isalpha |
[:blank:] | blank character | isblank |
[:cntrl:] | control character | iscntrl |
[:digit:] | decimal digit character | isdigit |
[:graph:] | character with graphical representation | isgraph |
[:lower:] | lowercase letter | islower |
[:print:] | printable character | isprint |
[:punct:] | punctuation mark character | ispunct |
[:space:] | whitespace character | isspace |
[:upper:] | uppercase letter | isupper |
[:xdigit:] | hexadecimal digit character | isxdigit |
[:d:] | decimal digit character | isdigit |
[:w:] | word character | isalnum (plus the underscore) |
[:s:] | whitespace character | isspace |
This article is reposted from 神ge's 51CTO blog. Original link: http://blog.51cto.com/12218412/1872058