Regular Expressions

The re module must be imported first: import re

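1. Pattern characters

Common shorthand character classes are listed in the table below.

Shorthand	Meaning
\W	Matches any character that is not a digit, letter, underscore, or Chinese character
\w	Matches a digit, letter, underscore, or Chinese character
\S	Matches any non-whitespace character
\s	Matches any whitespace character
\D	Matches any non-digit character
\d	Matches a digit

Common metacharacters are listed in the table below.

Metacharacter	Meaning
.	Matches any character except a newline (\n)
^	Matches the start of the string
$	Matches the end of the string
*	Matches the preceding character or group any number of times (including zero)
+	Matches the preceding character or group one or more times
?	Matches the preceding character or group zero or one time
\	Escape character: the metacharacter that follows loses its special meaning and matches itself
()	The expression inside () is a group; the text matched by a group can be extracted
[]	Character set: any single character in the set can be matched
|	Logical OR between alternative patterns
[abc]	Matches any one of the characters inside the brackets
[^abc]	Matches any character not inside the brackets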

2. findall()

The findall() function in the re module finds every non-overlapping match and returns the matches as a list.

2.1 \w and \W

import re
str1 = '123Qwe!_@#你我他'
print(re.findall(r'\w', str1))    # matches digits, letters, underscores, and Chinese characters
print(re.findall(r'\W', str1))    # matches everything that \w does not

['1', '2', '3', 'Q', 'w', 'e', '_', '你', '我', '他']
['!', '@', '#']
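\w matches the Chinese characters above because, for str patterns in Python 3, the shorthand classes are Unicode-aware by default. As a minimal sketch (the variable names are only illustrative), passing the re.ASCII flag restricts \w to [a-zA-Z0-9_]:

```python
import re

text = '123Qwe!_@#你我他'

# Default (Unicode) behaviour: Chinese characters count as word characters.
unicode_words = re.findall(r'\w', text)

# With re.ASCII, \w only matches [a-zA-Z0-9_].
ascii_words = re.findall(r'\w', text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)   # ['1', '2', '3', 'Q', 'w', 'e', '_']
```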

2.2 \s and \S

import re
str2 = "123Qwe!_@#你我他\t \n\r"
print(re.findall(r'\s', str2))   # matches any whitespace character, e.g. space, tab, \n, \r
print(re.findall(r'\S', str2))   # matches any non-whitespace character

['\t', ' ', '\n', '\r']
['1', '2', '3', 'Q', 'w', 'e', '!', '_', '@', '#', '你', '我', '他']

2.3 \d and \D

import re
str3 = "123Qwe!_@#你我他\t \n\r"
print(re.findall(r'\d', str3))  # matches digits
print(re.findall(r'\D', str3))  # matches non-digits

['1', '2', '3']
['Q', 'w', 'e', '!', '_', '@', '#', '你', '我', '他', '\t', ' ', '\n', '\r']

2.4 ^ and $

import re
str4 = '你好吗，我很好'
print(re.findall('^你好', str4))  # matches 你好 only at the start of the string
str5 = '我很好，你好'
print(re.findall('你好$', str5))  # matches 你好 only at the end of the string

['你好']
['你好']
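By default ^ and $ anchor to the whole string, not to each line. The following sketch, using a made-up two-line string, shows how the re.MULTILINE flag changes that so ^ also matches right after each newline:

```python
import re

text = 'hello world\nhello python'

# Without a flag, ^ only matches at the very start of the string.
single = re.findall(r'^hello', text)

# With re.MULTILINE, ^ also matches immediately after each newline.
multi = re.findall(r'^hello', text, flags=re.MULTILINE)

print(single)  # ['hello']
print(multi)   # ['hello', 'hello']
```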

2.5 . * ?

import re
str6 = 'abcaaabb'
print(re.findall('a.b', str6))  # . matches any single character except a newline (\n)
print(re.findall('a?b', str6))  # a? matches a zero or one time
print(re.findall('a*b', str6))  # a* matches a any number of times (including zero)
print(re.findall('a.*b', str6)) # .* is greedy: it matches as much as possible
print(re.findall('a.*?b', str6))# .*? is non-greedy: it matches as little as possible

['aab']
['ab', 'ab', 'b']
['ab', 'aaab', 'b']
['abcaaabb']
['ab', 'aaab']
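The + quantifier, which appears later in this article as \w+, requires at least one occurrence, whereas * also accepts zero. A small comparison on the same string as above (variable names are illustrative):

```python
import re

text = 'abcaaabb'

# a*b accepts zero a's, so the lone trailing b is also a match.
star = re.findall(r'a*b', text)

# a+b needs at least one a before the b, so the trailing b is skipped.
plus = re.findall(r'a+b', text)

print(star)  # ['ab', 'aaab', 'b']
print(plus)  # ['ab', 'aaab']
```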

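2.6 \

import re
str7 = '\t123456'
print(re.findall('t', str7))   # no match: \t is a single tab character, so the string contains no letter t
str8 = '\\t123456'
print(re.findall('t', str8))   # \\ is an escaped backslash, so the string holds a literal \ followed by t
str9 = r'\t123456'
print(re.findall('t', str9))   # the r prefix makes a raw string, so \t stays as a backslash followed by t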

[]
['t']
['t']
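The example above escapes characters in the string being searched. The backslash plays the same role inside the pattern itself, where it turns a metacharacter into a literal. The sketch below uses made-up sample strings; re.escape() is the standard-library helper that escapes a whole string for literal matching:

```python
import re

text = 'a.b (approx.)'

# An unescaped dot is a wildcard and matches every character here.
wildcard = re.findall(r'.', 'a.b')

# A backslash makes the dot literal.
literal = re.findall(r'\.', 'a.b')

# re.escape() builds a pattern that matches its argument literally,
# so the parentheses and the dot lose their special meaning.
escaped = re.findall(re.escape('(approx.)'), text)

print(wildcard)  # ['a', '.', 'b']
print(literal)   # ['.']
print(escaped)   # ['(approx.)']
```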

2.7 []

import re
str10 = 'aab abb acb azb a1b'
print(re.findall('a[a-z]b', str10))   # matches when the middle character is a letter from a to z
print(re.findall('a[0-9]b', str10))   # matches when the middle character is a digit from 0 to 9
print(re.findall('a[ac1]b', str10))   # matches when the middle character is one of a, c, or 1

['aab', 'abb', 'acb', 'azb']
['a1b']
['aab', 'acb', 'a1b']

2.8 ()

import re
str11 = '123qwer'
print(re.findall(r'(\w+)q(\w+)', str11))  # \w+ matches one or more word characters; each () group is returned separately

[('123', 'wer')]

2.9 |

import re
str12 = '你好，女士们先生们，大家好好学习呀'
print(re.findall('女士|先生', str12))  # matches either 女士 or 先生

['女士', '先生']

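3. search()

3.1 Matching phone numbers

Example: finding phone numbers without regular expressions

def isPhoneNumber(text):
    r"""Check whether text has the form \d\d\d-\d\d\d-\d\d\d\d (non-regex version)."""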
    if len(text) != 12:
        return False
    for i in range(0,3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4,7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8,12):
        if not text[i].isdecimal():
            return False
    return True
message = "Call me at 415-555-1011 tomorrow. 415-555-9999 is my office"
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print("Phone number found: " + chunk)
print("Done")

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done

Example: finding a phone number with a regular expression

import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search("My number is 415-555-4242.")
print("Phone number found: " + mo.group())

Phone number found: 415-555-4242

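Summary of the steps for using regular expressions:

Import the regular expression module with import re.

Create a Regex object with re.compile() (remember to use a raw string).

Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if nothing matches.

Call the Match object's group() method to get the matched text as a string.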
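The steps above can be sketched as a small helper. Because search() returns None when nothing matches, calling group() directly on the result would raise an AttributeError, so the sketch checks for None first. The function name find_phone is hypothetical and not part of the original article:

```python
import re

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')  # step 2: build the Regex object
    mo = phone_regex.search(text)                         # step 3: search the string
    if mo is None:                                        # nothing matched
        return None
    return mo.group()                                     # step 4: extract the matched text

print(find_phone('My number is 415-555-4242.'))  # 415-555-4242
print(find_phone('No number here.'))             # None
```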

3.2 Grouping with parentheses

import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search("My number is 415-555-4242.")
print(mo.group())
print(mo.group(1)) 
print(mo.group(2))
print(mo.groups())   # get all the groups at once as a tuple
areaCode, mainNumber = mo.groups()
print(areaCode, mainNumber)

415-555-4242
415
555-4242
('415', '555-4242')
415 555-4242
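Groups can also be given names with the (?P<name>...) syntax, which may be easier to read than numeric indexes once a pattern has several groups. In this sketch the group names area and main are chosen only for illustration:

```python
import re

# Same pattern as above, but each group carries a name.
phone_regex = re.compile(r'(?P<area>\d{3})-(?P<main>\d{3}-\d{4})')
mo = phone_regex.search('My number is 415-555-4242.')

print(mo.group('area'))  # 415
print(mo.group('main'))  # 555-4242
print(mo.groupdict())    # {'area': '415', 'main': '555-4242'}
```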

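3.3 Matching multiple alternatives with the pipe

The | character is called a "pipe". Use it when you want to match any one of several expressions.

The first occurrence of matching text is returned as the Match object.

heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group())   # search() returns only the first match; findall() returns all of them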
mo2 = heroRegex.search("Tina Fey and Batman.")
print(mo2.group())

Batman
Tina Fey

Example

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Batmobile
mobile

3.4 Optional matching with the question mark

batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())

Batman
Batwoman

Example 2

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())
mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())

415-555-4242
555-4242
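One detail worth knowing about optional groups: when the group does not take part in the match, group(1) returns None rather than an empty string. The sketch below reuses the pattern from Example 2; the variable names are illustrative:

```python
import re

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
with_area = phoneRegex.search('My number is 415-555-4242')
without_area = phoneRegex.search('My number is 555-4242')

print(with_area.group(1))               # 415-
print(without_area.group(1))            # None (the optional group was skipped)
print(without_area.groups(default=''))  # ('',) - substitute a default for skipped groups
```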

3.5 Matching zero or more times with the star

batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batwowowowoman')
print(mo3.group())

Batman
Batwoman
Batwowowowoman

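3.6 Matching a specific number of repetitions with braces

(Ha){3} matches the string HaHaHa
(Ha){3,5} matches HaHaHa, HaHaHaHa, or HaHaHaHaHa
(Ha){3,} matches three or more repetitions
(Ha){,5} matches zero to five repetitions

Example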

haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())
mo2 = haRegex.search('Ha')
print(mo2 is None)  # (Ha){3} matches HaHaHa but not Ha, so search() returns None

HaHaHa
True
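Braces also shorten patterns that repeat a character class. As a sketch, the phone-number pattern used throughout this article can be written with \d{3} and \d{4} and should behave identically (the names long_form and short_form are illustrative):

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')   # same pattern, written with braces

text = 'Cell: 415-555-9999 Work: 212-555-0000'
print(short_form.findall(text))                             # ['415-555-9999', '212-555-0000']
print(long_form.findall(text) == short_form.findall(text))  # True
```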

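3.7 Greedy and non-greedy matching

# greedy: match as many repetitions as possible
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1)
# non-greedy: a trailing ? makes the quantifier match as few repetitions as possible
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2)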

<re.Match object; span=(0, 10), match='HaHaHaHaHa'>
<re.Match object; span=(0, 6), match='HaHaHa'>

3.8 Exercises

Example: the difference between search and findall

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
print(mo.group())
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))

415-555-9999
['415-555-9999', '212-555-0000']
[('415', '555', '9999'), ('212', '555', '0000')]
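findall() returns plain strings or tuples, so position information is lost. When every match is needed as a full Match object, re's finditer() method can be used instead. A minimal sketch, with an illustrative results list:

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
text = 'Cell: 415-555-9999 Work: 212-555-0000'

# finditer() yields one Match object per match, in order of appearance.
results = [(mo.group(), mo.group(1), mo.span()) for mo in phoneNumRegex.finditer(text)]

for whole, area, span in results:
    print(whole, area, span)
# 415-555-9999 415 (6, 18)
# 212-555-0000 212 (25, 37)
```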

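Example: matching vowels

[abc]: matches any one of the characters inside the brackets

[^abc]: matches any character not inside the brackets

# match all vowels
vowelRegex = re.compile(r'[aeiouAEIOU]')
print(vowelRegex.findall('RoboCop eats baby food. BABY FOOD.'))
# match all non-vowel characters (including spaces and punctuation)
consonantRegex = re.compile(r'[^aeiouAEIOU]')
print(consonantRegex.findall('RoboCop eats baby food. BABY FOOD.'))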

Example: the caret character

beginWithHello = re.compile(r'^Hello')
print(beginWithHello.search('Hello world!'))
print(beginWithHello.search('He said hello.') is None)

<re.Match object; span=(0, 5), match='Hello'>
True

Example: the dollar sign

endWithNumber = re.compile(r'\d$')
print(endWithNumber.search('Your number is 42'))
print(endWithNumber.search("Your number is forty two") is None)

<re.Match object; span=(16, 17), match='2'>
True

Example: matching a whole string with ^ and $

wholeStringIsNum = re.compile(r'^\d+$')
print(wholeStringIsNum.search('123456789'))
print(wholeStringIsNum.search('12345xyz678') is None)
print(wholeStringIsNum.search('123 456789') is None)

<re.Match object; span=(0, 9), match='123456789'>
True
True
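One subtlety of the ^...$ approach is that $ also matches just before a newline at the very end of the string. If a stricter whole-string check is wanted, re.fullmatch() is an alternative. A short sketch with illustrative variable names:

```python
import re

# $ matches at the end of the string or just before a trailing newline,
# so this search succeeds even though the string ends with '\n'.
loose = re.search(r'^\d+$', '123\n')

# fullmatch() requires the pattern to cover the entire string.
strict = re.fullmatch(r'\d+', '123\n')
exact = re.fullmatch(r'\d+', '123456789')

print(loose is not None)  # True
print(strict is None)     # True
print(exact.group())      # 123456789
```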

Example: the wildcard character

atRegex = re.compile(r'.at')
print(atRegex.findall('The cat in the hat sat on the flat mat.'))

['cat', 'hat', 'sat', 'lat', 'mat']

Example: matching everything with dot-star

nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search("First Name: A1 Last Name: Sweigart")
print(mo.group(1))
print(mo.group(2))

A1
Sweigart

Example: greedy and non-greedy dot-star

# non-greedy: stops at the first >
nongreedyRegex = re.compile(r'<.*?>')
print(nongreedyRegex.search('<To serve man> for dinner.>'))
# greedy: runs to the last >
greedyRegex = re.compile(r'<.*>')
print(greedyRegex.search('<To serve man> for dinner.>'))

<re.Match object; span=(0, 14), match='<To serve man>'>
<re.Match object; span=(0, 27), match='<To serve man> for dinner.>'>

Example: matching newlines with the dot character

# by default, the dot does not match a newline
noNewlineRegex = re.compile('.*')
print(noNewlineRegex.search("Serve the public trust.\nProtect the innocent.\nUphold the law.").group())
# pass re.DOTALL as the second argument so that the dot also matches newlines
newlineRegex = re.compile('.*', re.DOTALL)
print(newlineRegex.search("aaa.\nbbb.").group())

Serve the public trust.
aaa.
bbb.

Example: case-insensitive matching

# pass re.I (short for re.IGNORECASE) as the second argument to ignore case
robocop = re.compile(r'robocop', re.I)
print(robocop.search('RoboCop is part machine, all cop.').group())
print(robocop.search('ROBOcop is part machine, all cop.').group())

RoboCop
ROBOcop
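re.compile() takes only one flags argument, so several flags are combined with the bitwise OR operator |. The sketch below, using a made-up two-line string, needs both re.IGNORECASE and re.DOTALL to succeed:

```python
import re

text = 'RoboCop protects the innocent.\nHe upholds the LAW.'

# Case-insensitive only: .* cannot cross the newline, so there is no match.
case_only = re.search(r'robocop.*law', text, re.IGNORECASE)

# Both flags combined with |: the match now spans the two lines.
both = re.compile(r'robocop.*law', re.IGNORECASE | re.DOTALL).search(text)

print(case_only is None)   # True
print(repr(both.group()))  # 'RoboCop protects the innocent.\nHe upholds the LAW'
```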

Example: replacing strings with the sub() method

namesRegex = re.compile(r'Agent \w+')
print(namesRegex.sub("CENSORED", "Agent Alice gave the secret documents to Agent Bob."))

CENSORED gave the secret documents to CENSORED.

agentNamesRegex = re.compile(r'Agent (\w)\w*')
print(agentNamesRegex.sub(r'\1****', "Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent."))

A**** told C**** that E**** knew B**** was a double agent.
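Besides a replacement string, sub() also accepts a function that receives each Match object and returns the replacement text. As a sketch, the hypothetical helper mask_digits below keeps only the last four digits of each phone number:

```python
import re

def mask_digits(mo):
    """Keep the last four digits of a matched phone number and mask the rest."""
    return 'XXX-XXX-' + mo.group(1)

phoneRegex = re.compile(r'\d{3}-\d{3}-(\d{4})')
masked = phoneRegex.sub(mask_digits, 'Cell: 415-555-9999 Work: 212-555-0000')

print(masked)  # Cell: XXX-XXX-9999 Work: XXX-XXX-0000
```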

3.9 Practice: matching phone numbers and e-mail addresses

import pyperclip, re   # pyperclip is a third-party package
phoneRegex = re.compile(r'''(
        (\d{3}|\(\d{3}\))?               # area code (optional)
        (\s|-|\.)?                       # separator (optional)
        (\d{3})                          # first three digits
        (\s|-|\.)                        # separator
        (\d{4})                          # last four digits
        (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension (optional)
        )''', re.VERBOSE)
emailRegex = re.compile(r'''(
        [a-zA-Z0-9._%+-]+                # user name
        @
        [a-zA-Z0-9.-]+                   # domain name
        (\.[a-zA-Z]{2,4})                # dot-something
        )''', re.VERBOSE)
text = str(pyperclip.paste())   # read the clipboard
matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])  # join the parts of the phone number
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Found the following phone numbers and e-mail addresses:')
    print('\n'.join(matches))
else:
    print("No phone numbers or e-mail addresses found!")
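The script above depends on the clipboard, which makes it awkward to try out. The following is a clipboard-free sketch of the same extraction logic wrapped in a function; the name extract_contacts and the sample text are assumptions made for illustration, and the character class [a-zA-Z0-9._%+-] is assumed for the user name part of the address:

```python
import re

PHONE_REGEX = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # area code (optional)
    (\s|-|\.)?                       # separator (optional)
    (\d{3})                          # first three digits
    (\s|-|\.)                        # separator
    (\d{4})                          # last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension (optional)
    )''', re.VERBOSE)

EMAIL_REGEX = re.compile(r'''(
    [a-zA-Z0-9._%+-]+                # user name
    @
    [a-zA-Z0-9.-]+                   # domain name
    (\.[a-zA-Z]{2,4})                # dot-something
    )''', re.VERBOSE)

def extract_contacts(text):
    """Return phone numbers, then e-mail addresses, found in text."""
    matches = []
    for groups in PHONE_REGEX.findall(text):
        # groups[1], [3], [5] are area code, first three, last four digits.
        phone = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':               # groups[8] holds the extension digits
            phone += ' x' + groups[8]
        matches.append(phone)
    for groups in EMAIL_REGEX.findall(text):
        matches.append(groups[0])         # groups[0] is the whole address
    return matches

sample = 'Call 415-555-4242 ext 123 or write to alice@example.com.'
print(extract_contacts(sample))  # ['415-555-4242 x123', 'alice@example.com']
```

Returning a list instead of printing keeps the function easy to test; the clipboard handling from the original script can be layered on top of it.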

Python Advanced Notes (1): Regular Expressions
