Regular Expressions
Requires the import: import re
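1. Matching characters
The common character classes are listed in the table below.
Character | Meaning |
\W | Matches any character that is not a digit, letter, underscore, or Chinese character |
\w | Matches a digit, letter, underscore, or Chinese character |
\S | Matches any non-whitespace character |
\s | Matches any whitespace character |
\D | Matches any non-digit |
\d | Matches a digit |
The common metacharacters are listed in the table below.
Metacharacter | Meaning |
. | Matches any character except the newline \n |
^ | Matches the start of the string |
$ | Matches the end of the string |
* | Matches the preceding character any number of times (including zero) |
+ | Matches the preceding character one or more times |
? | Matches the preceding character zero or one time |
\ | Escape character: the metacharacter that follows loses its special meaning and matches itself |
() | The expression inside () is a group; the text matched by a group can be extracted |
[] | Character set: any character in the set can be matched |
| | Logical OR between the alternatives |
[abc] | Matches any one character inside the brackets |
[^abc] | Matches any one character not inside the brackets |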
2. findall()
The findall() function in the re module searches for every match of a pattern and returns the matches as a list.
2.1 \w and \W

import re

str1 = '123Qwe!_@#你我他'
print(re.findall(r'\w', str1))  # digits, letters, underscore, Chinese characters
print(re.findall(r'\W', str1))  # everything that \w does not match

Output:
['1', '2', '3', 'Q', 'w', 'e', '_', '你', '我', '他']
['!', '@', '#']
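\w matches Chinese characters because Python 3 string patterns use Unicode matching by default. As a small sketch (the variable names are illustrative only), passing the re.ASCII flag restricts \w to [a-zA-Z0-9_]:

```python
import re

text = '123Qwe!_@#你我他'

# Default (Unicode) matching: Chinese characters count as word characters.
unicode_words = re.findall(r'\w', text)

# With re.ASCII, \w only matches [a-zA-Z0-9_].
ascii_words = re.findall(r'\w', text, flags=re.ASCII)

print(unicode_words)  # ['1', '2', '3', 'Q', 'w', 'e', '_', '你', '我', '他']
print(ascii_words)    # ['1', '2', '3', 'Q', 'w', 'e', '_']
```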
2.2 \s and \S

import re

str2 = "123Qwe!_@#你我他\t \n\r"
print(re.findall(r'\s', str2))  # any whitespace: tab, space, newline \n, carriage return \r
print(re.findall(r'\S', str2))  # any non-whitespace character

Output:
['\t', ' ', '\n', '\r']
['1', '2', '3', 'Q', 'w', 'e', '!', '_', '@', '#', '你', '我', '他']
2.3 \d and \D

import re

str3 = "123Qwe!_@#你我他\t \n\r"
print(re.findall(r'\d', str3))  # digits
print(re.findall(r'\D', str3))  # non-digits

Output:
['1', '2', '3']
['Q', 'w', 'e', '!', '_', '@', '#', '你', '我', '他', '\t', ' ', '\n', '\r']
2.4 ^ and $

import re

str4 = '你好吗,我很好'
print(re.findall('^你好', str4))  # matches 你好 at the start of the string
str5 = '我很好,你好'
print(re.findall('你好$', str5))  # matches 你好 at the end of the string

Output:
['你好']
['你好']
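2.5 . * ?

import re

str6 = 'abcaaabb'
print(re.findall('a.b', str6))    # . matches any single character (except the newline \n)
print(re.findall('a?b', str6))    # a? matches the character a zero or one time
print(re.findall('a*b', str6))    # a* matches the character a any number of times (including zero)
print(re.findall('a.*b', str6))   # .* matches any characters, greedily: as long as possible
print(re.findall('a.*?b', str6))  # .*? matches any characters, non-greedily: as short as possible

Output:
['aab']
['ab', 'ab', 'b']
['ab', 'aaab', 'b']
['abcaaabb']
['ab', 'aaab']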
2.6 \

import re

str7 = '\t123456'
print(re.findall('t', str7))  # no match: \t is a single tab character, not a backslash plus t
str8 = '\\t123456'
print(re.findall('t', str8))  # the backslash is escaped, so the string holds a literal \ followed by t
str9 = r'\t123456'
print(re.findall('t', str9))  # a raw string (r prefix) has the same effect

Output:
[]
['t']
['t']
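The backslash also works inside the pattern itself, where it removes the special meaning of a metacharacter. The following is a minimal sketch with a made-up sample string; re.escape() is the standard-library helper that escapes a whole literal string for use in a pattern.

```python
import re

price = 'Total: 3.50 (approx.) or 3x50?'

# Unescaped, '.' matches any character, so '3x50' is found as well.
loose = re.findall(r'3.50', price)
print(loose)   # ['3.50', '3x50']

# Escaped, '\.' matches only a literal dot.
strict = re.findall(r'3\.50', price)
print(strict)  # ['3.50']

# re.escape() escapes every metacharacter in a literal string.
escaped = re.findall(re.escape('(approx.)'), price)
print(escaped)  # ['(approx.)']
```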
2.7 []

import re

str10 = 'aab abb acb azb a1b'
print(re.findall('a[a-z]b', str10))  # matches when the middle character is a letter from a to z
print(re.findall('a[0-9]b', str10))  # matches when the middle character is a digit from 0 to 9
print(re.findall('a[ac1]b', str10))  # matches when the middle character is one of a, c, 1

Output:
['aab', 'abb', 'acb', 'azb']
['a1b']
['aab', 'acb', 'a1b']
2.8 ()

import re

str11 = '123qwer'
print(re.findall(r'(\w+)q(\w+)', str11))  # \w+ matches one or more digits, letters, underscores, or Chinese characters

Output:
[('123', 'wer')]
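What findall() returns depends on how many groups the pattern has. A short sketch reusing the same sample string; the (?:...) syntax is a non-capturing group, which groups without extracting:

```python
import re

text = '123qwer'

# No groups: findall returns the full matches.
full = re.findall(r'\w+q\w+', text)
print(full)    # ['123qwer']

# One group: findall returns only that group's text.
first = re.findall(r'(\w+)q\w+', text)
print(first)   # ['123']

# (?:...) groups without capturing, so only the second part is returned.
second = re.findall(r'(?:\w+)q(\w+)', text)
print(second)  # ['wer']
```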
2.9 |

import re

str12 = '你好,女士们先生们,大家好好学习呀'
print(re.findall('女士|先生', str12))  # matches 女士 or 先生

Output:
['女士', '先生']
3. search()
3.1 Matching phone numbers
Example: finding phone numbers without regular expressions

def isPhoneNumber(text):
    """Check for the ddd-ddd-dddd phone format without using regex."""
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

message = "Call me at 415-555-1011 tomorrow. 415-555-9999 is my office"
for i in range(len(message)):
    chunk = message[i:i+12]  # check every 12-character window
    if isPhoneNumber(chunk):
        print("Phone number found: " + chunk)
print("Done")

Output:
Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done
Example: finding a phone number with a regular expression

import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search("My number is 415-555-4242.")
print("Phone number found: " + mo.group())

Output:
Phone number found: 415-555-4242
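Summary of the steps for using a regular expression:
- Import the regex module with import re.
- Create a Regex object with re.compile() (remember to use a raw string).
- Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if nothing matches.
- Call the Match object's group() method to get the matched text as a string.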
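The steps above can be sketched as follows. The helper name find_phone is hypothetical; the None check guards against calling group() when search() finds nothing.

```python
import re  # step 1: import the module

# Step 2: compile the pattern from a raw string.
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_regex.search(text)  # step 3: search() gives a Match or None
    if mo is None:
        return None
    return mo.group()              # step 4: the matched text

print(find_phone('My number is 415-555-4242.'))  # 415-555-4242
print(find_phone('No number here.'))             # None
```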
3.2 Grouping with parentheses

import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search("My number is 415-555-4242.")
print(mo.group())   # the whole match
print(mo.group(1))  # first group
print(mo.group(2))  # second group
print(mo.groups())  # all groups as a tuple
areaCode, mainNumber = mo.groups()
print(areaCode, mainNumber)

Output:
415-555-4242
415
555-4242
('415', '555-4242')
415 555-4242
3.3 Matching multiple groups with the pipe
The | character is called a "pipe". Use it when you want to match any one of several expressions.
search() returns the first occurrence of matching text as the Match object.

heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group())  # search() finds the first occurrence; findall() finds all of them
mo2 = heroRegex.search("Tina Fey and Batman.")
print(mo2.group())

Output:
Batman
Tina Fey

Example

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Output:
Batmobile
mobile
3.4 Optional matching with the question mark

batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())

Output:
Batman
Batwoman

Example 2

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())
mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())

Output:
415-555-4242
555-4242
3.5 Matching zero or more times with the star

batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batwowowowoman')
print(mo3.group())

Output:
Batman
Batwoman
Batwowowowoman
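3.6 Matching a specific number of repetitions with braces
(Ha){3} matches the string HaHaHa
(Ha){3,5} matches HaHaHa, HaHaHaHa, or HaHaHaHaHa
(Ha){3,} matches three or more repetitions
(Ha){,5} matches zero to five repetitions
Example

haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())
mo2 = haRegex.search('Ha')
print(mo2 == None)  # (Ha){3} matches HaHaHa but not Ha, so search() returns None

Output:
HaHaHa
True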
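To compare the four brace forms side by side, here is a small sketch. The helper full_matches is hypothetical; it uses fullmatch(), which only succeeds when the entire string matches, so each result lists the repetition counts a pattern accepts among the samples of one to six repeats.

```python
import re

# 'Ha' repeated 1 to 6 times.
samples = ['Ha' * n for n in range(1, 7)]

def full_matches(pattern):
    """Return the repeat counts (1..6) whose string fully matches pattern."""
    regex = re.compile(pattern)
    return [len(s) // 2 for s in samples if regex.fullmatch(s)]

print(full_matches(r'(Ha){3}'))    # [3]
print(full_matches(r'(Ha){3,5}'))  # [3, 4, 5]
print(full_matches(r'(Ha){3,}'))   # [3, 4, 5, 6]
print(full_matches(r'(Ha){,5}'))   # [1, 2, 3, 4, 5]
```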
3.7 Greedy and non-greedy matching

# Greedy: match as many repetitions as possible
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1)
# Non-greedy: a trailing ? matches as few repetitions as possible
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2)

Output:
<re.Match object; span=(0, 10), match='HaHaHaHaHa'>
<re.Match object; span=(0, 6), match='HaHaHa'>
3.8 Exercises
Example: the difference between search() and findall()

# search() returns only the first match
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
print(mo.group())
# findall() without groups returns a list of strings
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))
# findall() with groups returns a list of tuples
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))

Output:
415-555-9999
['415-555-9999', '212-555-0000']
[('415', '555', '9999'), ('212', '555', '0000')]
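When both the full match and its position are needed for every occurrence, finditer() may be a useful middle ground: it yields one Match object per match. A minimal sketch on the same sample text:

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Each item is a Match object, so group() and start() are both available.
found = [(mo.group(), mo.start()) for mo in phone_regex.finditer(text)]
print(found)  # [('415-555-9999', 6), ('212-555-0000', 25)]
```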
Example: matching vowels
[abc]: matches any character inside the brackets
[^abc]: matches any character not inside the brackets

# Match every vowel
vowelRegex = re.compile(r'[aeiouAEIOU]')
print(vowelRegex.findall('RoboCop eats baby food. BABY FOOD.'))
# Match every character that is not a vowel
consonantRegex = re.compile(r'[^aeiouAEIOU]')
print(consonantRegex.findall('RoboCop eats baby food. BABY FOOD.'))
Example: the caret character

beginWithHello = re.compile(r'^Hello')
print(beginWithHello.search('Hello world!'))
print(beginWithHello.search('He said hello.') == None)

Output:
<re.Match object; span=(0, 5), match='Hello'>
True
Example: the dollar sign character

endWithNumber = re.compile(r'\d$')
print(endWithNumber.search('Your number is 42'))
print(endWithNumber.search("Your number is forty two") == None)

Output:
<re.Match object; span=(16, 17), match='2'>
True
Example: matching the whole string, from start to end

wholeStringIsNum = re.compile(r'^\d+$')
print(wholeStringIsNum.search('123456789'))
print(wholeStringIsNum.search('12345xyz678') == None)
print(wholeStringIsNum.search('123 456789') == None)

Output:
<re.Match object; span=(0, 9), match='123456789'>
True
True
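An alternative to wrapping a pattern in ^ and $ is the fullmatch() method, which succeeds only when the entire string matches. The sketch below also shows one difference worth knowing: $ still matches just before a trailing newline, whereas fullmatch() does not accept one.

```python
import re

digits = re.compile(r'\d+')

all_digits = bool(digits.fullmatch('123456789'))
mixed = bool(digits.fullmatch('12345xyz678'))
print(all_digits)  # True
print(mixed)       # False

# '$' also matches just before a trailing newline; fullmatch() does not.
dollar_newline = bool(re.search(r'^\d+$', '123\n'))
fullmatch_newline = bool(digits.fullmatch('123\n'))
print(dollar_newline)     # True
print(fullmatch_newline)  # False
```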
Example: the wildcard character

atRegex = re.compile(r'.at')
print(atRegex.findall('The cat in the hat sat on the flat mat.'))

Output:
['cat', 'hat', 'sat', 'lat', 'mat']
Example: matching everything with dot-star

nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search("First Name: A1 Last Name: Sweigart")
print(mo.group(1))
print(mo.group(2))

Output:
A1
Sweigart
Example: greedy and non-greedy dot-star

# Non-greedy: stops at the first >
nongreedyRegex = re.compile(r'<.*?>')
print(nongreedyRegex.search('<To serve man> for dinner.>'))
# Greedy: runs to the last >
greedyRegex = re.compile(r'<.*>')
print(greedyRegex.search('<To serve man> for dinner.>'))

Output:
<re.Match object; span=(0, 14), match='<To serve man>'>
<re.Match object; span=(0, 27), match='<To serve man> for dinner.>'>
Example: matching newlines with the dot character

# By default the dot does not match a newline
noNewlineRegex = re.compile('.*')
print(noNewlineRegex.search("Serve the public trust.\nProtect the innocent.\nUphold the law.").group())
# Passing re.DOTALL as the second argument makes the dot match newlines too
newlineRegex = re.compile('.*', re.DOTALL)
print(newlineRegex.search("aaa.\nbbb.").group())

Output:
Serve the public trust.
aaa.
bbb.
Example: case-insensitive matching

# Passing re.I as the second argument makes matching case-insensitive
robocop = re.compile(r'robocop', re.I)
print(robocop.search('RoboCop is part machine, all cop.').group())
print(robocop.search('ROBOcop is part machine, all cop.').group())

Output:
RoboCop
ROBOcop
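re.compile() takes a single flags argument, so several flags are combined with the bitwise OR operator |. A minimal sketch combining case-insensitive matching with re.DOTALL, on a made-up two-line string:

```python
import re

text = 'RoboCop is part machine.\nUphold the LAW.'

# IGNORECASE lets 'law' match 'LAW'; DOTALL lets '.*' cross the newline.
both = re.compile(r'robocop.*law', re.IGNORECASE | re.DOTALL)
with_flags = both.search(text).group()
print(repr(with_flags))  # 'RoboCop is part machine.\nUphold the LAW'

# Without DOTALL the '.*' stops at the newline, so nothing matches.
only_case = re.compile(r'robocop.*law', re.IGNORECASE)
without_dotall = only_case.search(text)
print(without_dotall)    # None
```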
Example: replacing text with the sub() method

namesRegex = re.compile(r'Agent \w+')
print(namesRegex.sub("CENSORED", "Agent Alice gave the secret documents to Agent Bob."))

Output:
CENSORED gave the secret documents to CENSORED.

# \1 in the replacement refers to the text matched by group 1
agentNamesRegex = re.compile(r'Agent (\w)\w*')
print(agentNamesRegex.sub(r'\1****', "Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent."))

Output:
A**** told C**** that E**** knew B**** was a double agent.
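sub() also accepts a function as the replacement: it is called with each Match object and its return value is substituted. As a sketch, the hypothetical helper mask below keeps the first letter and pads with asterisks to the original name length, instead of a fixed ****.

```python
import re

agent_regex = re.compile(r'Agent (\w+)')

def mask(mo):
    """Keep the first letter of the name and star out the rest."""
    name = mo.group(1)
    return name[0] + '*' * (len(name) - 1)

masked = agent_regex.sub(mask, 'Agent Alice told Agent Bob.')
print(masked)  # A**** told B**.
```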
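3.9 Practice: matching phone numbers and e-mail addresses

import re

import pyperclip  # third-party clipboard package

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code (optional)
    (\s|-|\.)?                      # separator (optional)
    (\d{3})                         # first 3 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension (optional)
    )''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+               # user name
    @
    [a-zA-Z0-9.-]+                  # domain name
    (\.[a-zA-Z]{2,4})               # dot-something
    )''', re.VERBOSE)

text = str(pyperclip.paste())  # read the clipboard

matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])  # join the phone number parts
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])

if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))  # write the results back to the clipboard
    print('Found the following phone numbers and e-mail addresses:')
    print('\n'.join(matches))
else:
    print('No phone numbers or e-mail addresses found.')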
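To try the extraction logic without a clipboard, the same patterns can be wrapped in a plain function. This is a sketch, not part of the original script: the function name extract_contacts and the sample text are made up, and the e-mail character class assumes + rather than = is intended.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code (optional)
    (\s|-|\.)?                      # separator (optional)
    (\d{3})                         # first 3 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension (optional)
    )''', re.VERBOSE)

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+               # user name
    @
    [a-zA-Z0-9.-]+                  # domain name
    (\.[a-zA-Z]{2,4})               # dot-something
    )''', re.VERBOSE)

def extract_contacts(text):
    """Return phone numbers, then e-mail addresses, found in text."""
    matches = []
    for groups in phone_regex.findall(text):
        phone = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phone += ' x' + groups[8]
        matches.append(phone)
    for groups in email_regex.findall(text):
        matches.append(groups[0])
    return matches

sample = 'Call 415-555-1011 or 212-555-0000 ext 123. Mail bob@example.com.'
print(extract_contacts(sample))
# ['415-555-1011', '212-555-0000 x123', 'bob@example.com']
```

Keeping the matching in a function that takes a string makes it easy to check against known inputs before wiring it to the clipboard.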