5. Position matching
example 1: word boundaries
text="The cat scattered his food all over the room."
get_res(r"cat",text)
get_res(r"\bcat\b",text)

findall res: ['cat', 'cat']
search res: <re.Match object; span=(4, 7), match='cat'>
findall res: ['cat']
search res: <re.Match object; span=(4, 7), match='cat'>
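The same comparison can be reproduced with Python's `re` module directly, without the `get_res` helper. This is only a minimal sketch; the variable names `loose` and `strict` are made up for illustration.

```python
import re

text = "The cat scattered his food all over the room."

# Without boundaries, "cat" is also found inside "scattered".
loose = re.findall(r"cat", text)

# \b matches the empty position between a word character (\w) and a
# non-word character, so only the standalone word is found.
strict = re.findall(r"\bcat\b", text)

print(loose)   # ['cat', 'cat']
print(strict)  # ['cat']
```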
example 2: the "\B" non-boundary
text="Please enter the nine-digit id as it appears on your color - coded pass-key."
get_res(r'\B-\B',text)
print(text[55:63])
get_res(r'\b-\b',text)

findall res: ['-']
search res: <re.Match object; span=(59, 60), match='-'>
lor - co
findall res: ['-', '-']
search res: <re.Match object; span=(21, 22), match='-'>
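To make the difference between `\B` and `\b` easier to see, the sketch below runs both patterns on a shortened string (an assumption made for readability, not the original text) and records where each hyphen was found.

```python
import re

text = "nine-digit color - coded pass-key"

# \B-\B: no word boundary on either side of the hyphen. Here that means
# the hyphen is surrounded by spaces rather than by letters.
spaced = [m.span() for m in re.finditer(r"\B-\B", text)]

# \b-\b: a word boundary on both sides, i.e. the hyphen joins two words.
joined = [m.span() for m in re.finditer(r"\b-\b", text)]

print(spaced)  # [(17, 18)]
print(joined)  # [(4, 5), (29, 30)]
```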
example 3: string boundaries
text="<?xml version=”1.0” encoding=”UTF-8” ?> <wsdl:definitions targetNamespace=”http://tips.cf” xmlns:impl=”http://tips.cf” xmlns:intf=”http://tips.cf” xmlns:apachesoap=”http://xml.apache.org/xml-soap”"
get_res(r'^\s*<\?xml.*\?>',text)

findall res: ['<?xml version=”1.0” encoding=”UTF-8” ?>']
search res: <re.Match object; span=(0, 39), match='<?xml version=”1.0” encoding=”UTF-8” ?>'>
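What `^` buys you is easiest to see on a string where the XML declaration is not at the very start. The sketch below uses two made-up inputs (`good` and `bad` are assumptions, with straight quotes) and also shows the standard `re.MULTILINE` flag, which is not used in the example above but changes `^` to match at the start of every line.

```python
import re

pattern = r"^\s*<\?xml.*\?>"

good = '  <?xml version="1.0" encoding="UTF-8" ?>\n<root/>'
bad = 'This is bad, real bad!\n<?xml version="1.0" encoding="UTF-8" ?>\n<root/>'

# By default ^ only matches at the very start of the string.
at_start = re.search(pattern, good) is not None
not_at_start = re.search(pattern, bad) is not None

# With re.MULTILINE, ^ also matches right after each newline.
multi = re.search(pattern, bad, flags=re.MULTILINE)

print(at_start)       # True
print(not_at_start)   # False
print(multi.group())  # <?xml version="1.0" encoding="UTF-8" ?>
```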
6. Using subexpressions
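example 1: subexpressions
text="Hello, my name is Ben&nbsp;Forta, and I am the author of books on SQL, ColdFusion, WAP,\
Windows&nbsp;&nbsp;2000, and other subjects."
get_res(r'&nbsp;{2,}',text)
get_res(r'(&nbsp;){2,}',text)

findall res: []
search res: None
findall res: ['&nbsp;']
search res: <re.Match object; span=(94, 106), match='&nbsp;&nbsp;'>

Without parentheses, {2,} applies only to the ";" right before it, so nothing matches. The parentheses turn the whole "&nbsp;" into the unit being repeated.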
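The rule that a quantifier binds only to the item immediately before it can be checked on a tiny string. This sketch uses a made-up input, `"abab abb"`, chosen so that the two patterns match different parts of it.

```python
import re

text = "abab abb"

# {2} applies only to the "b" right before it: "a" followed by "bb".
single = re.search(r"ab{2}", text).group()

# The parentheses make "ab" one unit, so {2} repeats the whole unit.
grouped = re.search(r"(ab){2}", text).group()

print(single)   # abb
print(grouped)  # abab
```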
example 2: subexpressions
text="Pinging hog.forta.com [12.159.46.200] with [12.159.89.200] 32 bytes of data:"
# When the pattern contains "()", findall returns only the text captured by the subexpression (here, its last repetition); see the result below
get_res(r"(\d{1,3}\.){3}\d{1,3}",text)

findall res: ['46.', '89.']
search res: <re.Match object; span=(23, 36), match='12.159.46.200'>
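If you want the full IP addresses rather than the last captured group, there are two common ways around the `findall` behaviour. The sketch below shows both; the variable names are illustrative only.

```python
import re

text = "Pinging hog.forta.com [12.159.46.200] with [12.159.89.200] 32 bytes of data:"

# A capturing group makes findall return the group, not the whole match.
captured = re.findall(r"(\d{1,3}\.){3}\d{1,3}", text)

# Option 1: a non-capturing group (?:...) still groups for the quantifier,
# but findall goes back to returning whole matches.
full = re.findall(r"(?:\d{1,3}\.){3}\d{1,3}", text)

# Option 2: keep the capturing group and read group(0) from finditer.
via_iter = [m.group(0) for m in re.finditer(r"(\d{1,3}\.){3}\d{1,3}", text)]

print(captured)  # ['46.', '89.']
print(full)      # ['12.159.46.200', '12.159.89.200']
print(via_iter)  # ['12.159.46.200', '12.159.89.200']
```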
example 3: subexpressions
text='ID: 042 \
SEX: M \
DOB: 1967-08-17 \
Status: Active'
get_res(r"19|20\d{2}",text)
get_res(r"(19|20)\d{2}",text)

findall res: ['19']
search res: <re.Match object; span=(20, 22), match='19'>
findall res: ['19']
search res: <re.Match object; span=(20, 24), match='1967'>
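The first pattern fails because `|` has the lowest precedence: `19|20\d{2}` means "either `19`, or `20` followed by two digits". A minimal sketch on a made-up string containing one year from each century makes this visible. It uses a non-capturing group `(?:...)` so that `findall` returns whole matches.

```python
import re

text = "born 1967, renewed 2004"

# Read as: "19"  OR  "20\d{2}"
ungrouped = re.findall(r"19|20\d{2}", text)

# Read as: ("19" OR "20") followed by two digits
grouped = re.findall(r"(?:19|20)\d{2}", text)

print(ungrouped)  # ['19', '2004']
print(grouped)    # ['1967', '2004']
```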
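7. Using backreferences
"Backreferences allow a regular expression pattern to refer back to what an earlier subexpression matched. You can think of a backreference as a variable."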
example 1: backreferences
text='This is a block of of text, several words here are are repeated, and and they should not be.'
get_res(r"[ ]+(\w+)[ ]+\1",text)  # \1 refers to whatever (\w+), the first subexpression, matched

findall res: ['of', 'are', 'and']
search res: <re.Match object; span=(15, 21), match=' of of'>
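Backreferences also work in replacements, which makes it possible to remove the repeated words rather than just find them. The sketch below is one possible approach, not the original author's code. It swaps the leading `[ ]+` for `\b` and adds a trailing `\b` so that only whole words are compared.

```python
import re

text = ('This is a block of of text, several words here are are repeated, '
        'and and they should not be.')

# (\w+) captures a word; (\s+\1\b)+ matches one or more repeats of that
# same word. In the replacement string, \1 puts the word back once.
fixed = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text)

print(fixed)
# This is a block of text, several words here are repeated, and they should not be.
```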
example 2: backreferences
text="<BODY> \
<H1>Welcome to my Homepage</H1> \
Content is divided into two sections:<BR> <H2>ColdFusion</H2> \
Information about Macromedia ColdFusion. <H2>Wireless</H2> \
Information about Bluetooth, 802.11, and more. <H2>This is not valid HTML</H3> \
</BODY>"
get_res(r"<[hH]([1-6])>.*?</[hH]\1>",text)

findall res: ['1', '2', '2']
search res: <re.Match object; span=(7, 38), match='<H1>Welcome to my Homepage</H1>'>
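To see what the `\1` contributes here, compare it with a pattern that simply repeats `[1-6]` in the closing tag. The sketch below uses a shortened, made-up HTML string and collects whole matches with `finditer`.

```python
import re

html = "<H1>Welcome</H1> <H2>Wireless</H2> <H2>This is not valid HTML</H3>"

# Any opening level can pair with any closing level: <H2>...</H3> slips through.
loose = [m.group(0) for m in re.finditer(r"<[hH][1-6]>.*?</[hH][1-6]>", html)]

# \1 forces the closing tag to use the same digit that ([1-6]) captured.
strict = [m.group(0) for m in re.finditer(r"<[hH]([1-6])>.*?</[hH]\1>", html)]

print(loose)
# ['<H1>Welcome</H1>', '<H2>Wireless</H2>', '<H2>This is not valid HTML</H3>']
print(strict)
# ['<H1>Welcome</H1>', '<H2>Wireless</H2>']
```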
8. Lookaround
Lookaround checks what comes immediately before or after a given position; the text it checks is not included in the match.
example 1: lookahead
text='http://www.forta.com/\n\
https://mail.forta.com/\n\
ftp://ftp.forta.com/\n'
get_res(r".+(?=:)",text)

findall res: ['http', 'https', 'ftp']
search res: <re.Match object; span=(0, 4), match='http'>
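The point of `(?=:)` is that the colon has to be there but is not consumed. A quick way to check this is to run the same text through a pattern that matches the colon normally, as in this minimal sketch.

```python
import re

text = "http://www.forta.com/\nhttps://mail.forta.com/\nftp://ftp.forta.com/\n"

# The colon is part of the pattern, so it is part of the match.
consumed = re.findall(r".+:", text)

# The lookahead only asserts that a colon follows; it is left out of the match.
lookahead = re.findall(r".+(?=:)", text)

print(consumed)   # ['http:', 'https:', 'ftp:']
print(lookahead)  # ['http', 'https', 'ftp']
```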
example 2: lookbehind
text='ABC01: $23.45\n\
HGG42: $5.31\n\
CFMX1: $899.00 XTC99: $69.96\n\
Total items found: 4'
get_res(r"(?<=\$)[0-9.]+",text)

findall res: ['23.45', '5.31', '899.00', '69.96']
search res: <re.Match object; span=(8, 13), match='23.45'>
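The sketch below puts three variants side by side on a shortened, made-up price list: no anchor at all, a consumed `\$`, and the lookbehind. It is only meant to show why the lookbehind is the one that returns clean prices.

```python
import re

text = "ABC01: $23.45\nHGG42: $5.31"

# No condition on what precedes: digits in the item codes match too.
anything = re.findall(r"[0-9.]+", text)

# Matching \$ normally keeps the dollar sign in the result.
with_sign = re.findall(r"\$[0-9.]+", text)

# (?<=\$) requires a preceding "$" but leaves it out of the match.
prices = re.findall(r"(?<=\$)[0-9.]+", text)

print(anything)   # ['01', '23.45', '42', '5.31']
print(with_sign)  # ['$23.45', '$5.31']
print(prices)     # ['23.45', '5.31']
```

One caveat when adapting this: Python's `re` module only accepts lookbehind patterns of fixed width, so something like `(?<=\$+)` is rejected at compile time.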
example 3: lookbehind and lookahead together
text='<HEAD>\n\
<TITLE>Ben Forta’s Homepage</TITLE>\n\
</HEAD>'
get_res(r"(?<=<[tT][iI][tT][lL][eE]>).*(?=</[tT][iI][tT][lL][eE]>)",text)

findall res: ['Ben Forta’s Homepage']
search res: <re.Match object; span=(14, 34), match='Ben Forta’s Homepage'>
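In Python the `[tT][iI][tT][lL][eE]` spelling can be replaced by the `re.IGNORECASE` flag, which keeps the pattern readable. A minimal sketch, using a made-up lowercase `<title>` tag to show that the case no longer matters:

```python
import re

html = "<HEAD>\n<title>Ben Forta's Homepage</title>\n</HEAD>"

# Lookbehind and lookahead frame the text; re.IGNORECASE handles
# <TITLE>, <title>, <Title> and so on.
m = re.search(r"(?<=<title>).*(?=</title>)", html, flags=re.IGNORECASE)
title = m.group()

print(title)  # Ben Forta's Homepage
```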
example 4: negative lookaround
text="I paid $30 for 100 apples, 50 oranges, and 60 pears. I saved $5 on this order."
get_res(r"(?<!\$)\d+",text)

findall res: ['0', '100', '50', '60']
search res: <re.Match object; span=(9, 10), match='0'>
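The stray `'0'` in the result comes from `$30`: the `3` is rejected because a `$` precedes it, but the `0` is preceded by `3`, so it passes the check. One common fix, sketched below, is to wrap the pattern in `\b` so that a match can only start and end at a word boundary.

```python
import re

text = ("I paid $30 for 100 apples, 50 oranges, and 60 pears. "
        "I saved $5 on this order.")

# Negative lookbehind alone: the "0" of "$30" slips through.
naive = re.findall(r"(?<!\$)\d+", text)

# \b on both sides: only whole numbers are considered, so "$30" and "$5"
# are skipped entirely.
bounded = re.findall(r"\b(?<!\$)\d+\b", text)

print(naive)    # ['0', '100', '50', '60']
print(bounded)  # ['100', '50', '60']
```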
9. Embedded conditions
example 1: (?(backreference)true-regex)
text='<!-- Nav bar -->\n\
<TD>\n\
<A HREF=”/home”><IMG SRC=”/images/home.gif”></A>\n\
<IMG SRC=”/images/spacer.gif”>\n\
<A HREF=”/search”><IMG SRC=”/images/search.gif”></A>\n\
<IMG SRC=”/images/spacer.gif”>\n\
<A HREF=”/help”><IMG SRC=”/images/help.gif”></A> </TD>'
get_res(r"(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)",text)

findall res: ['<A HREF=”/home”>', '', '<A HREF=”/search”>', '', '<A HREF=”/help”>']
search res: <re.Match object; span=(22, 70), match='<A HREF=”/home”><IMG SRC=”/images/home.gif”></A>'>
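Because the pattern contains a capturing group, `findall` shows only group 1 (empty when the `<A>` tag is absent). To see what the condition actually matched, the sketch below collects whole matches with `finditer`, using a shortened, made-up nav bar with straight quotes.

```python
import re

html = ('<A HREF="/home"><IMG SRC="/images/home.gif"></A> '
        '<IMG SRC="/images/spacer.gif"> '
        '<A HREF="/help"><IMG SRC="/images/help.gif"></A>')

# Group 1 is the optional opening <A ...> tag.
# (?(1)\s*</[Aa]>) requires the closing </A> only if group 1 matched.
pattern = r"(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)"

matches = [m.group(0) for m in re.finditer(pattern, html)]

for item in matches:
    print(item)
# <A HREF="/home"><IMG SRC="/images/home.gif"></A>
# <IMG SRC="/images/spacer.gif">
# <A HREF="/help"><IMG SRC="/images/help.gif"></A>
```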
example 2: (?(backreference)true-regex|false-regex)
text='123-456-7890\n\
(123)456-7890\n\
(123)-456-7890\n\
(123-456-7890\n\
1234567890\n\
123 456 7890'
get_res(r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}",text)

findall res: ['', '(', '']
search res: <re.Match object; span=(0, 12), match='123-456-7890'>
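The third entry in the result comes from the line `(123-456-7890`: the regex simply skips the unmatched `(` and matches the rest. One possible way to reject such lines, sketched here as an assumption rather than as the original author's code, is to anchor the pattern with `^` and `$` and turn on `re.MULTILINE` so the anchors apply to each line.

```python
import re

text = ("123-456-7890\n(123)456-7890\n(123)-456-7890\n"
        "(123-456-7890\n1234567890\n123 456 7890")

core = r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}"

# Unanchored: the tail of "(123-456-7890" still matches.
unanchored = [m.group(0) for m in re.finditer(core, text)]

# Anchored per line: the whole line has to be a valid number.
anchored = [m.group(0)
            for m in re.finditer(r"^" + core + r"$", text, flags=re.MULTILINE)]

print(unanchored)  # ['123-456-7890', '(123)456-7890', '123-456-7890']
print(anchored)    # ['123-456-7890', '(123)456-7890']
```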
Code
I have collected everything above here:
Code for "One Article to Get You into Regular Expressions" (https://github.com/weizaiff/regexp)