Python's re Module --- Regular Expression Operations (Part 3) - Alibaba Cloud Developer Community

Regular Expression Objects

Compiled regular expression objects support the following methods and attributes:

Pattern.search(string[, pos[, endpos]])

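Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string: the '^' pattern character matches at the real beginning of the string and (in MULTILINE mode) at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).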
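The difference between pos and slicing is easy to miss, so here is a minimal sketch of it. The pattern and sample strings are made up for illustration; only the behaviour of '^', pos and endpos comes from the description above.

```python
import re

rx = re.compile(r"^b")
text = "abc"

# With pos=1 the search starts at index 1, but '^' still refers to the
# real start of the string, so nothing matches.
no_match = rx.search(text, 1)

# Slicing builds a new string whose real start is "b", so '^' matches.
sliced = rx.search(text[1:])

# endpos acts as if the string were truncated: both calls see "ab123".
digits = re.compile(r"\d+")
sample = "ab123456"
limited = digits.search(sample, 0, 5)
truncated = digits.search(sample[:5], 0)

print(no_match)                             # None
print(sliced.span())                        # (0, 1)
print(limited.group(), truncated.group())   # 123 123
```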

>>> pattern = re.compile("d")
>>> pattern.search("dog")     # Match at index 0
<re.Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

Pattern.match(string[, pos[, endpos]])

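If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o")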
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
<re.Match object; span=(1, 2), match='o'>

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Pattern.fullmatch(string[, pos[, endpos]])

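If the whole string matches this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o[gh]")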
>>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre")     # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
<re.Match object; span=(1, 3), match='og'>

New in version 3.4.

Pattern.split(string, maxsplit=0)

Identical to the split() function, using the compiled pattern.

Pattern.findall(string[, pos[, endpos]])

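Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for search().

Pattern.finditer(string[, pos[, endpos]])

Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for search().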
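As a small illustration of how pos and endpos restrict findall() and finditer() on a compiled pattern, consider the sketch below. The sample text and the index values are arbitrary choices for the example.

```python
import re

word = re.compile(r"\w+")
text = "alpha beta gamma delta"

# Only indices 6..15 are scanned (pos=6, endpos=16): "beta gamma".
found = word.findall(text, 6, 16)

# finditer() yields match objects; spans still refer to the full string.
spans = [m.span() for m in word.finditer(text, 6, 16)]

print(found)   # ['beta', 'gamma']
print(spans)   # [(6, 10), (11, 16)]
```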

Pattern.sub(repl, string, count=0)

Identical to the sub() function, using the compiled pattern.

Pattern.subn(repl, string, count=0)

Identical to the subn() function, using the compiled pattern.
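The split(), sub() and subn() methods mirror the module-level functions of the same name. A brief sketch, with a whitespace pattern and sample string chosen only for illustration:

```python
import re

spaces = re.compile(r"\s+")
text = "a  b   c"

parts = spaces.split(text, maxsplit=1)         # split at the first run only
collapsed = spaces.sub(" ", text)              # replace every run
first_only = spaces.sub(" ", text, count=1)    # replace only the first run
result, n = spaces.subn(" ", text)             # also report the count

print(parts)        # ['a', 'b   c']
print(collapsed)    # a b c
print(first_only)   # a b   c
print(result, n)    # a b c 2
```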

Pattern.flags

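The regex matching flags. This is a combination of the flags given to compile(), any (?…) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string.

Pattern.groups

The number of capturing groups in the pattern.

Pattern.groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

Pattern.pattern

The pattern string from which the pattern object was compiled.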
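The attributes above can be inspected directly on a compiled pattern. The date-like pattern below is a made-up example with two named groups and one unnamed group.

```python
import re

source = r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})"
p = re.compile(source, re.IGNORECASE)

group_count = p.groups            # number of capturing groups
names = dict(p.groupindex)        # named groups mapped to their numbers
has_ignorecase = bool(p.flags & re.IGNORECASE)
has_unicode = bool(p.flags & re.UNICODE)   # implicit for str patterns

print(p.pattern == source)   # True
print(group_count)           # 3
print(names)                 # {'year': 1, 'month': 2}
print(has_ignorecase, has_unicode)   # True True
```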

Changed in version 3.7: Added support for copy.copy() and copy.deepcopy(). Compiled regular expression objects are considered atomic.

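Match Objects

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement: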

match = re.search(pattern, string)
if match:
    process(match)

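Match objects support the following methods and attributes:

Match.expand(template)

Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

Changed in version 3.5: Unmatched groups are replaced with an empty string.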
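A short sketch of expand() may help here. The name pattern and the templates are illustrative only; the last call shows the 3.5+ behaviour for a group that did not take part in the match.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

by_number = m.expand(r"\2, \1")               # numeric backreferences
by_name = m.expand(r"\g<last>, \g<first>")    # named backreferences
with_escape = m.expand(r"\1\n\2")             # \n becomes a real newline

# Group 2 does not participate here, so it expands to an empty string.
m2 = re.match(r"(a)(b)?", "a")
unmatched = m2.expand(r"\1[\2]")

print(by_number)   # Newton, Isaac
print(by_name)     # Newton, Isaac
print(unmatched)   # a[]
```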

Match.group([group1, ...])

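Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")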
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

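If the regular expression uses the (?P<name>…) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.

A moderately complicated example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")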
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'

Named groups can also be referred to by their index:

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
>>> m.group(1)                        # Returns only the last match.
'c3'
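
The IndexError cases described for group() can be handled explicitly. The helper safe_group() below is hypothetical (it is not part of the re module) and simply turns an unknown group number or name into None.

```python
import re

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")

def safe_group(match, key):
    """Hypothetical helper: return the group, or None if it is not defined."""
    try:
        return match.group(key)
    except IndexError:
        # Raised for a negative number, a number larger than the group
        # count, or a name that is not a group name in the pattern.
        return None

print(safe_group(m, 1))          # Isaac
print(safe_group(m, 3))          # None
print(safe_group(m, "missing"))  # None
```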

Match.__getitem__(g)

This is identical to m.group(g). It allows easier access to an individual group from a match:

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m[0]       # The entire match
'Isaac Newton'
>>> m[1]       # The first parenthesized subgroup.
'Isaac'
>>> m[2]       # The second parenthesized subgroup.
'Newton'

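New in version 3.6.

Match.groups(default=None)

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

For example:

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")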
>>> m.groups()
('24', '1632')

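If we make the decimal point and everything after it optional, not all groups might participate in the match. These groups default to None unless the default argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")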
>>> m.groups()      # Second group defaults to None.
('24', None)
>>> m.groups('0')   # Now, the second group defaults to '0'.
('24', '0')

Match.groupdict(default=None)

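Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")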
>>> m.groupdict()
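{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

Match.start([group])

Match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.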
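The statements above can be checked directly. The first pattern is the one from the text; the second, with two alternative groups, is an added illustration of the -1 case.

```python
import re

m = re.search("b(c?)", "cba")
whole = (m.start(0), m.end(0))    # the match "b" occupies index 1..2
empty = (m.start(1), m.end(1))    # group 1 matched the empty string at 2

try:
    m.start(2)                    # there is no group 2 in the pattern
    raised = False
except IndexError:
    raised = True

# Group 2 exists but did not contribute to this match, so both are -1.
m2 = re.match("(a)|(b)", "a")
unused = (m2.start(2), m2.end(2))

# For a contributing group, slicing m.string reproduces m.group(g).
same = m2.string[m2.start(1):m2.end(1)] == m2.group(1)

print(whole, empty, raised, unused, same)
# (1, 2) (2, 2) True (-1, -1) True
```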

An example that will remove remove_this from email addresses:

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

Match.span([group])

For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

Match.pos

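The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

Match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.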

Match.lastindex

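The integer index of the last matched capturing group, or None if no group was matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) will have lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2 if applied to the same string.

Match.lastgroup

The name of the last matched capturing group, or None if the group didn't have a name, or if no group was matched at all.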
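The lastindex values quoted above can be reproduced with a few lines. The final pattern, with the group names number and word, is an added illustration of lastgroup.

```python
import re

text = "ab"
patterns = ["(a)b", "((a)(b))", "((ab))", "(a)(b)"]
last_indexes = {p: re.match(p, text).lastindex for p in patterns}

# With no capturing groups at all, both attributes are None.
no_groups = re.match("ab", text)

# With alternative named groups, lastgroup names the one that matched.
token = re.match(r"(?P<number>\d+)|(?P<word>[a-z]+)", "abc")

print(last_indexes)
# {'(a)b': 1, '((a)(b))': 1, '((ab))': 1, '(a)(b)': 2}
print(no_groups.lastindex, no_groups.lastgroup)   # None None
print(token.lastindex, token.lastgroup)           # 2 word
```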

Match.re

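The regular expression object whose match() or search() method produced this match instance.

Match.string

The string passed to match() or search().

Changed in version 3.7: Added support for copy.copy() and copy.deepcopy(). Match objects are considered atomic.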

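Regular Expression Examples

Checking for a Pair

In this example, we'll use the following helper function to display match objects a little more gracefully: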

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

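Suppose you are writing a poker program where a player's hand is represented as a 5-character string with each character representing a card: "a" for ace, "k" for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9" representing the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"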

That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))     # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))     # No pairs.
>>> displaymatch(pair.match("354aa"))     # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of the match object in the following manner:

>>> pair.match("717ak").group(1)
'7'
# Error because pair.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    pair.match("718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'
>>> pair.match("354aa").group(1)
'a'

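Simulating scanf()

Python does not currently have an equivalent to the C function scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.

scanf() Token     Regular Expression
%c                .
%5c               .{5}
%d                [-+]?\d+
%e, %E, %f, %g    [-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
%i                [-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
%o                [-+]?[0-7]+
%s                \S+
%u                \d+
%x, %X            [-+]?(0[xX])?[\dA-Fa-f]+

To extract the filename and numbers from a string like

/usr/sbin/sendmail - 0 errors, 4 warnings

you would use a scanf() format like

%s - %d errors, %d warnings

The equivalent regular expression would be:

(\S+) - (\d+) errors, (\d+) warnings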
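
As a rough sketch, the regular expression above could be applied as follows. The variable names are illustrative, and the explicit int() conversion stands in for what scanf()'s %d would do automatically.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
report = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")

m = report.match(line)
if m:
    filename = m.group(1)
    # Groups are always strings; convert the numeric fields by hand.
    n_errors = int(m.group(2))
    n_warnings = int(m.group(3))
    print(filename, n_errors, n_warnings)
    # /usr/sbin/sendmail 0 4
```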
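
search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> re.match("c", "abcdef")    # No match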
>>> re.search("c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with '^' can be used with search() to restrict the match to the beginning of the string:

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<re.Match object; span=(0, 1), match='a'>

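Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line:

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match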
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<re.Match object; span=(4, 5), match='X'>

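Making a Phonebook

split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified, as demonstrated in the following example that creates a phonebook.

First, here is the input. Normally it may come from a file; here we are using triple-quoted string syntax:

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street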
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry:

>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']

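Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it: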

>>> [re.split(":? ", entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

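The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name: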

>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

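Text Munging

sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to "munge" text, or randomize the order of all the characters in each word of a sentence except for the first and last characters:

>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
...
>>> text = "Professor Abdolmalek, please report your absences promptly."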
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

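Finding All Adverbs

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."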
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

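Finding All Adverbs and Their Positions

If one wants more information about all matches of a pattern than the matched text, finditer() is useful as it provides match objects instead of strings. Continuing with the previous example, if a writer wanted to find all of the adverbs and their positions in some text, they would use finditer() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."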
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

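Raw String Notation

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")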
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

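When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

>>> re.match(r"\\", r"\\")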
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
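
To make the two levels of escaping concrete, the sketch below compares the two spellings as plain Python strings before handing them to the regex engine. The subject string is an arbitrary example.

```python
import re

raw = r"\\"        # two characters: backslash, backslash
escaped = "\\\\"   # the same two characters, written without r""

subject = "a\\b"   # three characters: a, backslash, b
m = re.search(raw, subject)

print(raw == escaped, len(raw))   # True 2
print(len(subject))               # 3
print(m.span())                   # (1, 2)
```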

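Writing a Tokenizer

A tokenizer or scanner analyzes a string to categorize groups of characters. This is a useful first step in writing a compiler or interpreter.

The text categories are specified with regular expressions. The technique is to combine those into a single master regular expression and to loop over successive matches: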
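
Before the full listing, here is a deliberately tiny sketch of the same idea, assuming just three made-up categories. Each category becomes a named group, the groups are joined with '|', and lastgroup reports which category produced each match.

```python
import re

# Hypothetical three-category specification, for illustration only.
spec = [
    ("NUMBER", r"\d+"),
    ("ID", r"[A-Za-z]+"),
    ("SKIP", r"\s+"),
]
master = "|".join(f"(?P<{name}>{pattern})" for name, pattern in spec)

tokens = [
    (mo.lastgroup, mo.group())
    for mo in re.finditer(master, "x 42 y")
    if mo.lastgroup != "SKIP"      # drop whitespace tokens
]

print(tokens)   # [('ID', 'x'), ('NUMBER', '42'), ('ID', 'y')]
```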

import collections
import re

Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])


def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)


statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

The tokenizer produces the following output:

Token(type='IF', value='IF', line=2, column=4)
Token(type='ID', value='quantity', line=2, column=7)
Token(type='THEN', value='THEN', line=2, column=16)
Token(type='ID', value='total', line=3, column=8)
Token(type='ASSIGN', value=':=', line=3, column=14)
Token(type='ID', value='total', line=3, column=17)
Token(type='OP', value='+', line=3, column=23)
Token(type='ID', value='price', line=3, column=25)
Token(type='OP', value='*', line=3, column=31)
Token(type='ID', value='quantity', line=3, column=33)
Token(type='END', value=';', line=3, column=41)
Token(type='ID', value='tax', line=4, column=8)
Token(type='ASSIGN', value=':=', line=4, column=12)
Token(type='ID', value='price', line=4, column=15)
Token(type='OP', value='*', line=4, column=21)
Token(type='NUMBER', value=0.05, line=4, column=23)
Token(type='END', value=';', line=4, column=27)
Token(type='ENDIF', value='ENDIF', line=5, column=4)
Token(type='END', value=';', line=5, column=9)
