Regular Expressions
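This article works from simple to complex, using examples to explain the most common regular expression techniques. It is divided into the following parts:
- Python library functions
- Matching a single character
- Matching one of several characters
- Metacharacters
- Repeating matches
- Position matching
- Using subexpressions
- Using backreferences
- Lookahead and lookbehind
- Embedding conditions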
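0. Python library functions
The standard regular expression package in Python is re. Its most frequently used functions are:
re.match tries to match a pattern at the very start of the string. If the pattern does not match at the start, match() returns None.
re.search scans the whole string and returns the first successful match.
re.findall finds every substring matched by the pattern and returns them as a list. If nothing matches, it returns an empty list.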
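To make the difference between the three functions concrete, here is a minimal sketch. The sample string and variable names are invented for illustration and are not part of the original examples.

```python
import re

sample = "order 66 shipped, order 67 pending"

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match(r"\d+", sample)   # the string starts with "order", so no match
# re.search scans forward and returns the first match anywhere in the string.
first_hit = re.search(r"\d+", sample)
# re.findall returns every non-overlapping match as a list of strings.
all_hits = re.findall(r"\d+", sample)

print(at_start)           # None
print(first_hit.group())  # 66
print(first_hit.span())   # (6, 8)
print(all_hits)           # ['66', '67']
```

Note that re.match and re.search return a match object (or None), while re.findall returns plain strings.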
This article wraps two of these functions in a small helper that prints their results:
import re  # Python's standard regular expression package

def get_res(regex, text):
    res = re.findall(regex, text)
    print('findall res:', res)
    res = re.search(regex, text)
    print('search res:', res)
1. Matching a single character
example 1: literal characters
text="Hello, my name is Van. Please visit my website at https://www.zhihu.com/people/le-yifan-35."
get_res(r'Van',text)
findall res: ['Van']
search res: <re.Match object; span=(18, 21), match='Van'>
example 2: matching any character
text="sales1.xls\
orders3.xls\
sales2.xls\
sales3.xls\
apac1.xls\
europe2.xls"
get_res(r'sales.',text)

findall res: ['sales1', 'sales2', 'sales3']
search res: <re.Match object; span=(0, 6), match='sales1'>
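Two details about the dot are worth seeing side by side: by default it does not match a newline, and it must be escaped to match a literal period. The following is a small sketch with a made-up string; re.DOTALL is the standard re flag that lets the dot match newlines as well.

```python
import re

names = "sales1.xls\nsales.\nsalesX"

# "." matches any single character except a newline by default.
any_char = re.findall(r"sales.", names)
# Escaping the dot makes it match only a literal period.
literal_dot = re.findall(r"sales\.", names)
# The dot cannot bridge the line break between "xls" and "sales" ...
no_dotall = re.findall(r"xls.sales", names)
# ... unless re.DOTALL is passed.
with_dotall = re.findall(r"xls.sales", names, re.DOTALL)

print(any_char)     # ['sales1', 'sales.', 'salesX']
print(literal_dot)  # ['sales.']
print(no_dotall)    # []
print(with_dotall)  # ['xls\nsales']
```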
2. Matching one of several characters
example 1: matching one of several characters
text="sales1.xls\
orders3.xls\
sales2.xls\
sales3.xls\
apac1.xls\
europe2.xls\
na1.xls\
na2.xls\
sa1.xls\
ca1.xls"
get_res(r'[ns]a.\.xls',text)

findall res: ['na1.xls', 'na2.xls', 'sa1.xls']
search res: <re.Match object; span=(61, 68), match='na1.xls'>
example 2: matching one of several characters
text="The phrase “regular expression” is often abbreviated as RegEx or regex."
get_res(r'[Rr]eg[Ee]x',text)
findall res: ['RegEx', 'regex']
search res: <re.Match object; span=(56, 61), match='RegEx'>
example 3: character ranges
text="sales1.xls\
orders3.xls\
sales2.xls\
sales3.xls\
apac1.xls\
europe2.xls\
na1.xls\
na2.xls\
sa1.xls\
ca1.xls"
get_res(r'[ns]a[0123456789]\.xls',text)
get_res(r'[ns]a[0-9]\.xls',text)

findall res: ['na1.xls', 'na2.xls', 'sa1.xls']
search res: <re.Match object; span=(61, 68), match='na1.xls'>
findall res: ['na1.xls', 'na2.xls', 'sa1.xls']
search res: <re.Match object; span=(61, 68), match='na1.xls'>
example 4: character ranges
text="<BODY BGCOLOR=”#336633” TEXT=”#FFFFFF” MARGINWIDTH=”0” MARGINHEIGHT=”0” TOPMARGIN=”0” LEFTMARGIN=”0”>"
get_res(r'[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]',text)

findall res: ['336633', 'FFFFFF']
search res: <re.Match object; span=(16, 22), match='336633'>
example 5: negated sets
text="sales1.xls\
orders3.xls\
sales2.xls\
sales3.xls\
apac1.xls\
europe2.xls\
sam.xls\
na1.xls\
na2.xls\
sa1.xls\
ca1.xls"
get_res(r'[ns]a[^0-9]\.xls',text)

findall res: ['sam.xls']
search res: <re.Match object; span=(61, 68), match='sam.xls'>
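One detail that is easy to miss: the caret negates a set only when it is the first character inside the brackets. Anywhere else it is an ordinary member of the set. A small sketch, using invented file names:

```python
import re

files = "na1.xls nab.xls na^.xls"

# A range: one digit between "na" and ".xls".
digits = re.findall(r"na[0-9]\.xls", files)
# Leading "^" negates the set: any single character that is NOT a digit.
non_digits = re.findall(r"na[^0-9]\.xls", files)
# "^" in a later position is a literal caret, so this is "a digit or ^".
digit_or_caret = re.findall(r"na[0-9^]\.xls", files)

print(digits)          # ['na1.xls']
print(non_digits)      # ['nab.xls', 'na^.xls']
print(digit_or_caret)  # ['na1.xls', 'na^.xls']
```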
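3. Metacharacters
A metacharacter is a character that carries a special meaning inside a pattern instead of standing for itself. To match it literally, escape it with a backslash.
example 1: the metacharacters '[]'
text="var myArray = new Array();\
...\
if (myArray[0] == 0) {\
...\
}"
get_res(r'myArray[0]',text)
get_res(r'myArray\[0\]',text)

findall res: []
search res: None
findall res: ['myArray[0]']
search res: <re.Match object; span=(33, 43), match='myArray[0]'>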
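When the text to be matched literally comes from a variable, escaping each metacharacter by hand is error-prone. The standard library's re.escape does it automatically. This is a minimal sketch; the variable names are illustrative.

```python
import re

code = "if (myArray[0] == 0) {"
literal = "myArray[0]"

# re.escape puts a backslash in front of every regex metacharacter.
pattern = re.escape(literal)
escaped_hits = re.findall(pattern, code)

# Unescaped, "[0]" is a set containing only "0", so the pattern means "myArray0".
unescaped_hits = re.findall(literal, "myArray0 myArray[0]")

print(pattern)         # myArray\[0\]
print(escaped_hits)    # ['myArray[0]']
print(unescaped_hits)  # ['myArray0']
```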
example 2: the metacharacter '\'
text="\\home\\ben\\sales\\"
get_res(r'\\',text)

findall res: ['\\', '\\', '\\', '\\']
search res: <re.Match object; span=(0, 1), match='\\'>
example 3: whitespace characters
text="“101”,”Ben”,”Forta”\
“102”,”Jim”,”James”\r\n\
\r\n\
“103”,”Roberta”,”Robertson”\
“104”,”Bob”,”Bobson”"
get_res(r'\r\n\r\n',text)

findall res: ['\r\n\r\n']
search res: <re.Match object; span=(38, 42), match='\r\n\r\n'>
example 4: digit metacharacters
text="var myArray = new Array();\
...\
if (myArray[0] == 0) {\
...\
}"
get_res(r'myArray\[\d\]',text)

findall res: ['myArray[0]']
search res: <re.Match object; span=(33, 43), match='myArray[0]'>
example 5: alphanumeric metacharacters
text='11213 \
A1C2E3 \
48075 \
48237 \
M1B4F2 \
90046 \
H1H2H2'
get_res(r'\w\d\w\d\w\d',text)

findall res: ['A1C2E3', 'M1B4F2', 'H1H2H2']
search res: <re.Match object; span=(6, 12), match='A1C2E3'>
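The pattern above only rejects the numeric codes because they are five characters long. \w also matches digits and the underscore, so a six-character numeric string would slip through. The sketch below, with invented sample data, contrasts \w with an explicit letter range:

```python
import re

codes = "A1C2E3 112233 _1_2_3 M1B4F2"

# \w matches letters, digits and "_", so all four tokens qualify.
loose = re.findall(r"\w\d\w\d\w\d", codes)
# An explicit range restricts the odd positions to uppercase letters.
strict = re.findall(r"[A-Z]\d[A-Z]\d[A-Z]\d", codes)

print(loose)   # ['A1C2E3', '112233', '_1_2_3', 'M1B4F2']
print(strict)  # ['A1C2E3', 'M1B4F2']
```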
4. Repeating matches
example 1: matching one or more (+), zero or more (*), or zero or one (?) characters
text='Hello .ben@forta.com is my email address.'
get_res(r'\w+[\w.]*@[\w.]+\.\w+',text)

findall res: ['ben@forta.com']
search res: <re.Match object; span=(7, 20), match='ben@forta.com'>

text="The URL is http://www.forta.com/, to connect securely use https://www.forta.com/ instead."
get_res(r"https?://[\w./]+",text)

findall res: ['http://www.forta.com/', 'https://www.forta.com/']
search res: <re.Match object; span=(11, 32), match='http://www.forta.com/'>
example 2: matching an exact number of repetitions
text="<BODY BGCOLOR=”#336633” TEXT=”#FFFFFF” MARGINWIDTH=”0” MARGINHEIGHT=”0” TOPMARGIN=”0” LEFTMARGIN=”0”>"
get_res(r"\d{6}",text)

findall res: ['336633']
search res: <re.Match object; span=(16, 22), match='336633'>
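Note that \d{6} finds only 336633, because FFFFFF contains no digits. Combining the hexadecimal set from section 2 with an exact count picks up both colour values. A brief sketch, using a shortened tag with straight quotes for convenience:

```python
import re

html = '<BODY BGCOLOR="#336633" TEXT="#FFFFFF">'

# Exactly six digits: misses the all-letter colour value.
digits_only = re.findall(r"\d{6}", html)
# Exactly six hexadecimal characters: same as writing the set six times.
hex_runs = re.findall(r"[0-9A-Fa-f]{6}", html)

print(digits_only)  # ['336633']
print(hex_runs)     # ['336633', 'FFFFFF']
```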
example 3: matching a range of repetitions
text="4/8/03 \
10-6-2004 \
2/2/2 \
01-01-01"
get_res(r"\d{1,2}[-\/]\d{1,2}[-\/]\d{2,4}",text)

findall res: ['4/8/03', '10-6-2004', '01-01-01']
search res: <re.Match object; span=(0, 6), match='4/8/03'>
example 4: matching at least a given number of repetitions
text="1001: $496.80 1002: $1290.69 1003: $26.43 1004: $613.42 1005: $7.61 1006: $414.90 1007: $25.00"
get_res(r"\d+: \$\d{3,}\.\d{2}",text)

findall res: ['1001: $496.80', '1002: $1290.69', '1004: $613.42', '1006: $414.90']
search res: <re.Match object; span=(0, 13), match='1001: $496.80'>
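The shorthand quantifiers from example 1 can all be written in interval form: + is {1,}, * is {0,} and ? is {0,1}. The sketch below checks this on a made-up string:

```python
import re

s = "a ab abb abbb"

plus = re.findall(r"ab+", s)      # "a" followed by one or more "b"
star = re.findall(r"ab*", s)      # "a" followed by zero or more "b"
optional = re.findall(r"ab?", s)  # "a" followed by at most one "b"

print(plus)      # ['ab', 'abb', 'abbb']
print(star)      # ['a', 'ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']

# The interval forms produce identical results.
print(plus == re.findall(r"ab{1,}", s))       # True
print(star == re.findall(r"ab{0,}", s))       # True
print(optional == re.findall(r"ab{0,1}", s))  # True
```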
example 5: preventing over-matching
text="This offer is not available to customers living in <B>AK</B> and <B>HI</B>."
get_res(r"<[Bb]>.*</[Bb]>",text)
get_res(r"<[Bb]>.*?</[Bb]>",text)

findall res: ['<B>AK</B> and <B>HI</B>']
search res: <re.Match object; span=(51, 74), match='<B>AK</B> and <B>HI</B>'>
findall res: ['<B>AK</B>', '<B>HI</B>']
search res: <re.Match object; span=(51, 60), match='<B>AK</B>'>
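Besides the lazy quantifier, a common alternative is to replace the dot with a negated set so the match cannot run past the next "<". This is only a sketch of that idea, not a general HTML parser: it assumes the tag content contains no nested tags.

```python
import re

text = "living in <B>AK</B> and <B>HI</B>."

# Greedy: ".*" runs to the last closing tag in the string.
greedy = re.findall(r"<[Bb]>.*</[Bb]>", text)
# Lazy: ".*?" stops at the first closing tag it can reach.
lazy = re.findall(r"<[Bb]>.*?</[Bb]>", text)
# Negated set: "[^<]*" cannot cross a "<", so it stops in the same place.
negated = re.findall(r"<[Bb]>[^<]*</[Bb]>", text)

print(greedy)   # ['<B>AK</B> and <B>HI</B>']
print(lazy)     # ['<B>AK</B>', '<B>HI</B>']
print(negated)  # ['<B>AK</B>', '<B>HI</B>']
```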