Common Regular Expressions - Alibaba Cloud Developer Community

Common Regular Expressions

2017-11-03

Summary:

The examples below use Python by default and rely on Python's re module.

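0. Regular expression references.

(1) 30-Minute Introduction to Regular Expressions

(2) Another good introductory tutorial.

(3) Unveiling the Mystery of Regular Expressions. I personally find this article's explanation of Multiline especially clear.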
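
As a quick illustration of what the Multiline flag changes, here is a minimal sketch. The sample string is made up for this example and does not come from the referenced article.

```python
import re

# Hypothetical three-line sample text.
content = "first line\nsecond line\nthird line"

# Without re.M, ^ matches only at the very start of the string.
without_flag = re.findall(r'^\w+', content)

# With re.M (re.MULTILINE), ^ also matches right after each newline.
with_flag = re.findall(r'^\w+', content, re.M)

# Likewise, $ matches before each newline as well as at the end of the string.
line_ends = re.findall(r'\w+$', content, re.M)

print(without_flag)  # ['first']
print(with_flag)     # ['first', 'second', 'third']
print(line_ends)     # ['line', 'line', 'line']
```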

1. Extract the double quotes together with the content between them.

(1) Using re.findall.

    import re
    text = '''abc"def"ghi'''
    re.findall(r'".+"', text)
    # Result
    ['"def"']

(2) Using re.search.

    >>> text = '''abc"def"ghi'''
    >>> re.search(r'"(.+)"', text).group(0)
    '"def"'
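
One thing the single-quote-pair sample does not show is that `.+` is greedy. The sketch below uses a made-up string with two quoted parts to show the difference between the greedy pattern, a lazy one, and a negated character class.

```python
import re

# Hypothetical sample with two quoted parts.
text = 'a "b" c "d" e'

# Greedy: .+ runs from the first quote to the last one.
greedy = re.findall(r'".+"', text)

# Lazy: .+? stops at the nearest closing quote.
lazy = re.findall(r'".+?"', text)

# Negated class: never crosses a quote, so no backtracking past it.
negated = re.findall(r'"[^"]+"', text)

print(greedy)   # ['"b" c "d"']
print(lazy)     # ['"b"', '"d"']
print(negated)  # ['"b"', '"d"']
```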

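2. Extract only the content between the double quotes. Rule: (pattern)

(1) Using re.findall.

    text = '''abc"def"ghi'''
    re.findall(r'"(.+)"', text)
    # Result
    ['def']

The difference from 1 is that the part to be returned is wrapped in parentheses.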

(2) Using re.search.

    >>> text = '''abc"def"ghi'''
    >>> re.search(r'"(.+)"', text).group(1)
    'def'

3. Same effect as 2. Rule: (?<=pattern), (?=pattern)

    text = '''abc"def"ghi'''
    re.findall(r'(?<=").+(?=")', text)
    # Result
    ['def']
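
The two approaches agree on this sample, but they can diverge once the text holds several quoted parts, because lookarounds do not consume the quotes. A small sketch with a made-up string (lazy quantifiers are used here so each match stays short):

```python
import re

# Hypothetical sample with two quoted parts back to back.
text = 'a"b"c"d"e'

# Capturing group: each match consumes both of its quotes.
by_group = re.findall(r'"(.+?)"', text)

# Lookarounds: the quotes are only checked, never consumed, so the
# closing quote of one match can serve as the opening quote of the next.
by_lookaround = re.findall(r'(?<=").+?(?=")', text)

print(by_group)       # ['b', 'd']
print(by_lookaround)  # ['b', 'c', 'd']
```

If the text between two quoted parts must not be reported, the capturing-group form is probably the safer choice.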

4. A comparison of three regular expression libraries in C++ (C regex, C++ regex, boost regex).

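5. Find lines that start with certain strings, for example lines starting with +++, ---, or index:

    # Method 1: match line by line
    for i in lst:
        if re.match(r"(---|\+\+\+|index).*", i):
            print(i)
    # Method 2: match everything in one pass
    re.findall(r'^(?:\+\+\+|---|index).*$', content, re.M)
    # Compact version of method 2 (note: [-\+]{3} also matches mixed runs such as "+-+")
    re.findall(r'^(?:[-\+]{3}|index).*$', content, re.M)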
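
To make the snippets above concrete, here is a self-contained sketch run against a made-up diff fragment (the sample text and the variable names are assumptions for illustration). It also shows where the compact pattern is slightly more permissive than the explicit one.

```python
import re

# Hypothetical diff-like sample; the last line is there to expose the
# difference between the explicit and the compact pattern.
content = """diff --git a/a.txt b/a.txt
index 83db48f..bf269f4 100644
--- a/a.txt
+++ b/a.txt
@@ -1 +1 @@
-old
+new
+-+ mixed"""

explicit = re.findall(r'^(?:\+\+\+|---|index).*$', content, re.M)
compact = re.findall(r'^(?:[-+]{3}|index).*$', content, re.M)

# Line-by-line equivalent of the explicit pattern.
per_line = [line for line in content.splitlines()
            if re.match(r'(?:---|\+\+\+|index)', line)]

print(explicit)  # the index, --- and +++ lines
print(compact)   # the same three lines plus '+-+ mixed'
```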

6. Contains / does not contain (reference: Using regular expressions to exclude specific strings)

(0) Text content

    >>> print(text)
    www.sina.com.cn
    www.educ.org
    www.hao.cc
    www.baidu.com
    www.123.com

    sina.com.cn
    educ.org
    hao.cc
    baidu.com
    123.com

(1) Match lines that start with www

    >>> re.findall(r'^www.*$', text, re.M)
    ['www.sina.com.cn', 'www.educ.org', 'www.hao.cc', 'www.baidu.com', 'www.123.com']

(2) Match lines that do not start with www

    >>> re.findall(r'^(?!www).*$', text, re.M)
    ['', 'sina.com.cn', 'educ.org', 'hao.cc', 'baidu.com', '123.com']

(3) Match lines that end with cn

    >>> re.findall(r'^.*?cn$', text, re.M)
    ['www.sina.com.cn', 'sina.com.cn']

(4) Match lines that do not end with com

    >>> re.findall(r'^.*?(?<!com)$', text, re.M)
    ['www.sina.com.cn', 'www.educ.org', 'www.hao.cc', '', 'sina.com.cn', 'educ.org', 'hao.cc']

(5) Match lines that contain com

    >>> re.findall(r'^.*?com.*?$', text, re.M)
    ['www.sina.com.cn', 'www.baidu.com', 'www.123.com', 'sina.com.cn', 'baidu.com', '123.com']

(6) Match lines that do not contain com

    >>> re.findall(r'^(?!.*com).*$', text, re.M)
    ['www.educ.org', 'www.hao.cc', '', 'educ.org', 'hao.cc']
    >>> re.findall(r'^(?:(?!com).)*?$', text, re.M)
    ['www.educ.org', 'www.hao.cc', '', 'educ.org', 'hao.cc']
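
The contains / does-not-contain patterns above can be wrapped in small helpers. This is only a sketch: the helper names are made up, `re.escape` is added so that the search word is treated literally, and the sample text assumes an empty line between the two groups of hosts, which is what the `''` entries in the results above suggest.

```python
import re

# Sample text as used in this section, assuming one empty line in the middle.
text = ("www.sina.com.cn\nwww.educ.org\nwww.hao.cc\nwww.baidu.com\nwww.123.com\n"
        "\n"
        "sina.com.cn\neduc.org\nhao.cc\nbaidu.com\n123.com")

def lines_containing(word, text):
    """Return every line that contains word (hypothetical helper)."""
    return re.findall(r'^.*' + re.escape(word) + r'.*$', text, re.M)

def lines_not_containing(word, text, keep_empty=True):
    """Return every line that does not contain word (hypothetical helper)."""
    # .* also matches the empty line; .+ requires at least one character.
    tail = '.*' if keep_empty else '.+'
    return re.findall(r'^(?!.*' + re.escape(word) + r')' + tail + r'$', text, re.M)

print(lines_containing('com', text))
print(lines_not_containing('com', text))
print(lines_not_containing('com', text, keep_empty=False))
```

Switching the final `.*` to `.+` is one way to drop the empty-string entry that appears in the results of (2), (4) and (6).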

7. Use a group to get the first level of a URL, that is, strip the levels that follow it. (Match everything, drop part of it.)

Method 1:

    >>> strr = 'http://www.baidu.com/abc/d.html'
    >>> re.findall(r'(http://.+?)/.*', strr)
    ['http://www.baidu.com']

Method 2:

    >>> re.sub(r'(http://.+?)/.*', r'\1', strr)
    'http://www.baidu.com'
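
Both methods require a `/` after the host, so a URL with no path does not match the pattern at all. A possible variant, sketched here with a made-up helper name, anchors on "everything up to the first slash" and also accepts https:

```python
import re

def site_root(url):
    """Return scheme plus host of url, or None if it does not look like a URL.

    Hypothetical helper; it is not part of the original article.
    """
    m = re.match(r'https?://[^/]+', url)
    return m.group(0) if m else None

print(site_root('http://www.baidu.com/abc/d.html'))  # http://www.baidu.com
print(site_root('https://www.baidu.com'))            # https://www.baidu.com

# The original pattern finds nothing when there is no path:
print(re.findall(r'(http://.+?)/.*', 'http://www.baidu.com'))  # []
```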

8. Two examples that help in understanding regex groups.

(1)

    >>> strr = 'A/B/C'
    >>> re.sub(r'(.)/(.)/(.)', r'xx', strr)
    'xx'
    >>> re.sub(r'(.)/(.)/(.)', r'\1xx', strr)
    'Axx'
    >>> re.sub(r'(.)/(.)/(.)', r'\2xx', strr)
    'Bxx'
    >>> re.sub(r'(.)/(.)/(.)', r'\3xx', strr)
    'Cxx'

(2)

    >>> text = 'AA,BB:222'
    >>> re.search(r'(.+),(.+):(\d+)', text).group(0)
    'AA,BB:222'
    >>> re.search(r'(.+),(.+):(\d+)', text).group(1)
    'AA'
    >>> re.search(r'(.+),(.+):(\d+)', text).group(2)
    'BB'
    >>> re.search(r'(.+),(.+):(\d+)', text).group(3)
    '222'
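
As a follow-up to the two examples, the sketch below shows a few related ways of reading groups: all of them at once, by name, and several in one replacement string. The group names `left`, `right` and `num` are made up for this illustration.

```python
import re

text = 'AA,BB:222'

# Named groups: (?P<name>...) behaves like a numbered group with a label.
m = re.search(r'(?P<left>.+),(?P<right>.+):(?P<num>\d+)', text)

print(m.groups())      # ('AA', 'BB', '222')
print(m.groupdict())   # {'left': 'AA', 'right': 'BB', 'num': '222'}
print(m.group('num'))  # 222

# Several backreferences can be combined in one replacement.
swapped = re.sub(r'(.)/(.)/(.)', r'\3/\2/\1', 'A/B/C')
print(swapped)  # C/B/A
```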

9. Extract the divs that contain the string hello.

    >>> content
    '<div id="abc"><div id="hello1"><div id="def"><div id="hello2"><div id="hij">'
    >>>
    >>> p = r'<div((?!div).)+hello.+?>'
    >>> re.search(p, content).group()
    '<div id="hello1">'
    >>> re.findall(p, content)
    ['"', '"']
    >>> for m in re.finditer(p, content):
    ...     print(m.group())
    ...
    <div id="hello1">
    <div id="hello2">
    >>>
    >>> p = r'<div[^>]+hello.+?>'
    >>> re.search(p, content).group()
    '<div id="hello1">'
    >>> re.findall(p, content)
    ['<div id="hello1">', '<div id="hello2">']
    >>> for m in re.finditer(p, content):
    ...     print(m.group())
    ...
    <div id="hello1">
    <div id="hello2">
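
The first pattern makes re.findall return `['"', '"']` because, when a pattern contains a capturing group, findall returns the group rather than the whole match, and a repeated group keeps only its last iteration (here the quote just before hello). A minimal sketch of one way around this, using a non-capturing group:

```python
import re

content = ('<div id="abc"><div id="hello1"><div id="def">'
           '<div id="hello2"><div id="hij">')

capturing = r'<div((?!div).)+hello.+?>'
non_capturing = r'<div(?:(?!div).)+hello.+?>'

# With a capturing group, findall yields the last character the group matched.
print(re.findall(capturing, content))      # ['"', '"']

# With (?:...), findall yields the whole match.
print(re.findall(non_capturing, content))  # ['<div id="hello1">', '<div id="hello2">']
```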

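10. walker's guess: with Python 3's Unicode character set, \s matches \f, \n, \r, \t and \v plus the full-width and half-width space, 7 characters in total. (In practice, for str patterns \s matches every Unicode whitespace character, which also includes others such as the no-break space \xa0.)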
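
The guess can be checked empirically. The sketch below lists the characters that `\s` matches, limited to the Basic Multilingual Plane to keep the scan short (that limit is an arbitrary choice for this example), and compares them with the re.ASCII behaviour. The exact Unicode list may vary with the Python version, so no count is claimed here.

```python
import re

ws = re.compile(r'\s')

# Scan the Basic Multilingual Plane only (an arbitrary limit to keep this fast).
unicode_ws = [chr(cp) for cp in range(0x10000) if ws.fullmatch(chr(cp))]

# With re.ASCII, \s is restricted to the six ASCII whitespace characters.
ascii_ws = re.findall(r'\s', ''.join(chr(cp) for cp in range(128)), re.ASCII)

print(len(unicode_ws), [hex(ord(c)) for c in unicode_ws])
print(ascii_ws)  # ['\t', '\n', '\x0b', '\x0c', '\r', ' ']
```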

12. If the tool you are using supports positive lookahead and allows capturing parentheses inside the lookahead, you can emulate atomic grouping and possessive quantifiers.
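
A minimal sketch of that emulation in Python: the lookahead captures what the atomic group would match, and a backreference then consumes exactly that text. Because the regex engine does not backtrack into a lookahead that has already succeeded, the captured text cannot be shortened afterwards. The pattern `(?=(a+))\1a` below plays the role of `(?>a+)a` or `a++a`; as far as I know, Python 3.11 and later also accept those two forms directly, but the code here does not rely on that.

```python
import re

text = 'aaaa'

# Ordinary greedy quantifier: a+ gives back one 'a' so the final 'a' can match.
normal = re.search(r'a+a', text)

# Emulated atomic group: the lookahead captures the longest run of 'a' and
# \1 consumes exactly that run, so nothing is left for the final 'a'.
atomic = re.search(r'(?=(a+))\1a', text)

print(normal.group())  # aaaa
print(atomic)          # None
```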

13. Thousands separators.

(1) Python

    >>> format(23456789, ',')
    '23,456,789'
    # Using a positive lookbehind together with a positive lookahead
    >>> re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', '2345678')
    '2,345,678'
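
The pattern above anchors on `$`, so it formats a number only when the number ends the string. A possible variant, sketched here with a made-up helper name, swaps `$` for `\b` so that every integer inside a longer text is formatted. It is meant for plain integers: digits after a decimal point would be grouped too.

```python
import re

def add_commas(text):
    """Insert thousands separators into every run of digits in text.

    Hypothetical helper for illustration; it handles plain integers only.
    """
    # \b instead of $ lets the lookahead stop at the end of each number,
    # not only at the end of the whole string.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+\b)', ',', text)

print(add_commas('1234567 and 89012'))  # 1,234,567 and 89,012
print(add_commas('2345678'))            # 2,345,678
```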

(2) JavaScript

    // Uses a positive lookahead only (JavaScript engines before ES2018 do not support lookbehind)
    // Result: "23,456,789"
    "23456789".replace(/(\d)(?=(?:\d{3})+$)/g, "$1,")
