Regular Expressions in Practice - Alibaba Cloud Developer Community

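All the examples in this article use the regex flavors available on Linux. For the different regex flavors, see here.

I. Extracting the part of a string you need

There are generally two approaches:

awk + sed substitution
grep -o for an exact match; with a regex the match can be made as precise as you need.

Sometimes you need to extract just part of a line, for example the number in a blog link.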

regExp]# echo https://www.cnblogs.com/luoahong/p/11577600.html | sed 's/.*p\/\([0-9]*\).*/\1/g'
11577600

Or use a PCRE-flavored grep:

home]# echo https://www.cnblogs.com/luoahong/p/11577600.html | pcregrep -o '\d+'
11577600

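Notes:

1. s means substitute. \1 refers to the text matched between the first left parenthesis (counting from the left of the pattern) and its matching right parenthesis, which can be thought of as a subgroup.

By default sed uses POSIX basic regular expressions: * works, but ? and + are not quantifiers unless you pass -E (GNU sed also accepts \? and \+), and Perl-style shorthands such as \d are not available. This comes from the history of the regex flavors.

2. If the pattern contains a slash, it must be escaped (or another delimiter used, as in s#...#...#); otherwise sed reports an error, as below: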

 regExp]# echo https://www.cnblogs.com/luoahong/p/11577600.html | sed 's/.*p/\([0-9]*\).*/\1/g'
sed: -e expression #1, char 20: unknown option to `s'

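3. A lesson worth learning: previously, when I wanted to pull the needed information out of a line, my approach was to replace the unwanted parts with nothing, and the sed practice notes use that many times. You can turn the approach around: match the whole line with a regex in sed and replace it with the captured subgroup.

Here is another example to reinforce this.

Replace the matched text with a literal 1: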
[root@gptest01 regExp]# sed 's/\bhttps[^"]*html/1/g' html_parse.html 
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第二讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第三讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第五讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第六讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第七讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第八讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第九讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第十讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记：第十一讲</a></p>

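The regex clearly matches the part we need, but the replacement is a literal 1 rather than the subgroup \1.

Changing it to this still does not work, because without -E sed treats ( and ) as literal characters, so there is no group for \1 to refer to: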

regExp]# sed  's/.*(\bhttps[^"]*html).*/\1/g' html_parse.html 
sed: -e expression #1, char 29: invalid reference \1 on `s' command's RHS

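4. Changing approach, a PCRE-flavored tool can also do the job, as the following example shows.

It also suggests a different way of thinking: you do not have to replace the whole line with the matched text; printing only the match is enough.

# \b is a word boundary
# [^"] matches any character except a double quote (inside brackets, ^ negates)
# Start at the word boundary before https, take any run of non-quote characters, and end at html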
 regExp]# cat html_parse.html | pcregrep -o  '\bhttps[^"]*html'
https://www.cnblogs.com/luoahong/p/10353166.html
https://www.cnblogs.com/luoahong/p/10802102.html
https://www.cnblogs.com/luoahong/p/10808031.html
https://www.cnblogs.com/luoahong/p/10808936.html
https://www.cnblogs.com/luoahong/p/10809642.html
https://www.cnblogs.com/luoahong/p/10811704.html
https://www.cnblogs.com/luoahong/p/10815283.html
https://www.cnblogs.com/luoahong/p/10820252.html
https://www.cnblogs.com/luoahong/p/10592001.html
The original HTML looks like this:
home]# echo '<p><a class="entrylistItemTitle" href="https://www.cnblogs.com/luoahong/p/10882536.html">Linux性能优化实战学习笔记：第二十三讲</a></p>' | pcregrep -o  '\bhttps[^"]*html'
https://www.cnblogs.com/luoahong/p/10882536.html
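
The two approaches in this section (replace the line with a captured subgroup, or print only the match) can also be sketched with Python's re module. This is only an illustration: the one-line `html` string is a stand-in modeled on html_parse.html, and the variable names are assumptions, not part of the original commands.

```python
import re

url = "https://www.cnblogs.com/luoahong/p/11577600.html"

# Capture the digits after "/p/", playing the role of \1 in the sed command.
match = re.search(r"/p/(\d+)", url)
post_id = match.group(1) if match else None

# Hypothetical one-line input modeled on html_parse.html.
html = '<p><a class="entrylistItemTitle" href="https://www.cnblogs.com/luoahong/p/10882536.html">title</a></p>'

# Return only the matched text, similar to pcregrep -o.
links = re.findall(r'\bhttps[^"]*html', html)

print(post_id)
print(links)
```

re.findall returns the whole match here because the pattern contains no capture group.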

II. Handling words that repeat consecutively

the little cat cat is in the hat hat, we like it.

Here cat and hat each appear twice in a row; the required result is:

the little cat is in the hat, we like it.

subGroup]# echo "the little cat cat is in the hat hat, we like it."|sed -E 's/(\w+) \1/\1/g'
the little cat is in the hat, we like it.

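Why does this match consecutive repeated words?

(\w+) is a capture group; there is only one group here, and \1 refers to the content of that first group.

So the regex means: some word + a space + the same word, which is exactly a consecutively repeated word.

What if a word is repeated more than twice in a row?

the little cat cat is in the hat hat hat, we like it

For more than two repetitions, note that the pattern must be (\s\1)+ with the + outside the parentheses. What happens if the + is inside?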

subGroup]# echo "the little cat cat is in the hat hat hat, we like it."|sed -E 's/(\w+)( \1+)/\1/g'
the little cat is in the hat hat, we like it.

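Why? Inside the parentheses, the + applies only to \1, so it matches the first subgroup repeated one or more times with no space in between. Outside, it applies to the whole group (the space plus the word), so the pair may repeat one or more times.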

 subGroup]# echo "the little cat cat is in the hat hat hat, we like it." |sed -E 's/(\b\w+)(\s\1)+/\1/g'
the little cat is in the hat, we like it.
Using a literal space instead of \s also works:
 subGroup]# echo "the little cat cat is in the hat hat hat, we like it." |sed -E 's/(\b\w+)( \1)+/\1/g'
the little cat is in the hat, we like it.
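
The same backreference idea carries over to Python. The sketch below is an illustration, not the author's command: the function name `collapse_repeats` is made up, and the extra \b anchors are an added safeguard so that a short word is not matched against the start of a longer one.

```python
import re

def collapse_repeats(text):
    # (\w+) captures a word; (?:\s+\1\b)+ matches one or more repeats of it,
    # each preceded by whitespace. The whole run is replaced by one copy.
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text)

print(collapse_repeats("the little cat cat is in the hat hat hat, we like it."))
```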

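III. Matching timestamps with a regex

This is probably the simplest regex for matching a timestamp, and it breaks as soon as the format changes. For a deeper treatment, refer to articles by the experts.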

monitorDoubleAck]# echo "2020-08-03 06:56:38" |pcregrep '\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'
2020-08-03 06:56:38

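Recently a serious bug in the program on a data-uploading machine produced a bogus timestamp, 2020-09-06 03:60:02, and because the program did not check its validity, things went wrong downstream.

You might think this is easy.

Clearly, though, the regex above does not meet the requirement, because it accepts any two digits in every field.

The test cases in the figure cover essentially every invalid case; the premise is that every field is zero-padded to two digits.

The regex looks like this: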

\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]) (?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d

Notes:

The year field \d{4} is not very strict: 0212-12-01 is also matched.
Use (?:regex) to group without saving a subgroup, for example \d{15}(?:\d{3})?
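
A range-checking regex still cannot tell how many days a given month has, so one option is to pair it with a calendar check. The sketch below assumes zero-padded fields as stated above; the pattern is written in that spirit, and the names `TIMESTAMP_RE`, `matches_pattern`, and `is_valid_timestamp` are hypothetical.

```python
import re
from datetime import datetime

# Zero-padded "YYYY-MM-DD HH:MM:SS" with per-field range checks.
TIMESTAMP_RE = re.compile(
    r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]) "
    r"(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d"
)

def matches_pattern(text):
    # fullmatch requires the whole string to be a timestamp.
    return TIMESTAMP_RE.fullmatch(text) is not None

def is_valid_timestamp(text):
    # The regex rejects out-of-range fields; strptime then rejects
    # impossible calendar dates such as February 31.
    if not matches_pattern(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return True

print(matches_pattern("2020-09-06 03:60:02"))
print(is_valid_timestamp("2020-02-31 00:00:00"))
```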

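IV. Replacing the contents of parentheses with a regex

When writing a gpload YAML control file, you sometimes need to list the column names, as in the figure below.

This is obviously just the table structure, but how do you get it into that shape?

How do you convert the table-structure format into the format required in the figure above?

1. How should the numbers in parentheses be handled?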

[(\d)]

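The problem with this replacement: [(\d)] is a character class that matches any single (, digit, or ), so digits inside column names are removed as well.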

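2. The colon after the first field (insert a : before the second space in the line)

In Notepad++, switch to regular-expression mode and replace as follows:

Search: ^([^ ]+[ ]+[^ ]+)[ ]

Replace: \1:

From the start of the line, the pattern matches a run of non-space characters followed by spaces, then another run of non-space characters, and finally a single space character.

The replacement is everything before that final space (the captured group), followed by the colon.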
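
The search-and-replace above can be tried outside an editor as well. In this sketch the input lines are hypothetical (the original table structure is only shown in a figure), and the replacement uses an ASCII colon plus a space, which is an assumption about the desired YAML output.

```python
import re

# Hypothetical input lines; the real column list is not shown in the text.
lines = ["- seq_id bigint", "- glass_id varchar(32)"]

# First field, spaces, second field (captured together), then one space.
pattern = re.compile(r"^([^ ]+[ ]+[^ ]+)[ ]")

# Put the captured text back and turn the consumed space into ": ".
converted = [pattern.sub(r"\1: ", line) for line in lines]

print(converted)
```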

V. Replacing parentheses together with their contents

Requirement: remove the parentheses and everything inside them from the SQL statement below.

ALTER TABLE dm.basics_history ADD CONSTRAINT basics_history_pkey1 PRIMARY KEY (seq_id);

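Result after replacement:

The regular expression used is: \(.*\)

Note that .* is greedy and consumes as much as it can, so "(abc)(def)" becomes "". To remove each pair separately, match \([^)]*\) instead.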
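
To make the greedy behavior concrete, here is a small Python sketch. The test string "x(abc)y(def)z" is made up for illustration, and the negated-class pattern is one common alternative rather than the author's stated choice.

```python
import re

sql = "ALTER TABLE dm.basics_history ADD CONSTRAINT basics_history_pkey1 PRIMARY KEY (seq_id);"

# Greedy: .* runs from the first "(" to the last ")".
greedy = re.sub(r"\(.*\)", "", "x(abc)y(def)z")

# Negated class: each match stops at the nearest ")".
per_pair = re.sub(r"\([^)]*\)", "", "x(abc)y(def)z")

# With a single pair of parentheses both patterns give the same result.
stripped = re.sub(r"\([^)]*\)", "", sql)

print(greedy, per_pair)
print(stripped)
```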

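VI. Extracting the text between a left and a right boundary

Requirement: from the data below, extract

1. the index name between INDEX and ON

2. the table name after ON

3. the columns inside the parentheses

There are more than 100,000 lines in total.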

CREATE UNIQUE INDEX cell_pds_result_assm_t_n_pkey ON cell_pds_result_assm_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
CREATE UNIQUE INDEX cell_pds_result_other_t_n_pkey ON cell_pds_result_other_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
CREATE UNIQUE INDEX cell_pds_result_pipr_t_n_pkey ON cell_pds_result_pipr_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
CREATE UNIQUE INDEX cell_pds_result_psva_t_n_pkey ON cell_pds_result_psva_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
CREATE UNIQUE INDEX cf_mds_glass_h_cnvr_t_pkey ON cf_mds_glass_h_cnvr_t USING btree (glass_id, step_id, glass_start_time, equip_id, sub_equip_id)
......
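100,000 lines

Approach:

Getting the index name and table name

awk can get these.

1. Getting the table name: statements that create a primary-key index contain the extra word UNIQUE, so mind which field you split on.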

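# cat -n prepends a line number, so each field index is shifted by one
cat -n test.txt | awk '{print $6}' > ./otherindex.txt   # CREATE INDEX: table name is field 6
cat -n test.txt | awk '{print $7}' > ./pkindex.txt      # CREATE UNIQUE INDEX: table name is field 7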

2. Use a regex to get the string after INDEX and before ON (case-sensitive):

(?<=INDEX).*(?=ON)

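Pulling the match out is a hassle; it is easier to do it all in one step with Python.

Extracting the contents of parentheses with Python re:

1. Using the Python re module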

import re

# Compile the patterns once, outside the loop.
# Contents of each pair of parentheses (non-greedy).
p1 = re.compile(r'[(](.*?)[)]', re.S)
# Text between INDEX and ON, i.e. the index name with its surrounding spaces.
p11 = re.compile(r'(?<=INDEX).*(?=ON)', re.S)

if __name__ == '__main__':
    with open('tableList.txt', 'r', encoding='utf-8', newline='') as f:
        for line in f:
            line = line.rstrip()
            # Match the contents of the parentheses
            print(p1.findall(line))
            # Match the index name between INDEX and ON
            print(p11.findall(line))

2. Use a regex tool to find the matches, then export the matched lines

regex101: build, test, and debug regex

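First run the match there, then copy out what the regex matched.

For the actual requirement the longest (greedy) match is not what we want; deselecting that button does the job.

Is there a better way to get the index name, the table name, and the contents of the parentheses?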

什么是pcregrep?

/tmp]#pcregrep -o "[(](.*)[)]" test.txt

pcregrep -o "(?<=INDEX).*(?=ON)" test.txt |more
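
One possible way to get all three pieces in a single pass is a regex with named groups. This is a sketch under stated assumptions, not the author's method: it assumes index and table names consist only of word characters, and `INDEX_RE` and `parse_index` are hypothetical names.

```python
import re

# Assumes names are runs of word characters and UNIQUE is optional.
INDEX_RE = re.compile(
    r"CREATE\s+(?:UNIQUE\s+)?INDEX\s+(?P<index>\w+)\s+ON\s+(?P<table>\w+)"
    r".*?\((?P<columns>[^)]*)\)"
)

def parse_index(line):
    """Return (index name, table name, column list), or None if no match."""
    m = INDEX_RE.search(line)
    if m is None:
        return None
    columns = [c.strip() for c in m.group("columns").split(",")]
    return m.group("index"), m.group("table"), columns

line = ("CREATE UNIQUE INDEX cf_mds_glass_h_cnvr_t_pkey ON cf_mds_glass_h_cnvr_t "
        "USING btree (glass_id, step_id, glass_start_time, equip_id, sub_equip_id)")
print(parse_index(line))
```

Because UNIQUE is handled inside the pattern, there is no need to count fields differently for the two kinds of statement.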

Update 2022-01-06 20:15:08 ----------

sed can also do this easily:

[root@ninesun ~]# cat >> file2.log << EOF
> CREATE UNIQUE INDEX cell_pds_result_assm_t_n_pkey ON cell_pds_result_assm_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
> EOF
[root@ninesun ~]# 
[root@ninesun ~]# cat /tmp/file2.log 
cat: /tmp/file2.log: No such file or directory
[root@ninesun ~]# 
[root@ninesun ~]# 
[root@ninesun ~]# 
[root@ninesun ~]# cat file2.log 
CREATE UNIQUE INDEX cell_pds_result_assm_t_n_pkey ON cell_pds_result_assm_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
[root@ninesun ~]# 
[root@ninesun ~]# sed -i "s/[()]//g" ./file2.log 
[root@ninesun ~]# 
[root@ninesun ~]# cat file2.log 
CREATE UNIQUE INDEX cell_pds_result_assm_t_n_pkey ON cell_pds_result_assm_t USING btree component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name

VII. Greedy matching rules

Given the following text:

I can haz haz bacon. Mmmm. Tasty, tasty bacon.

Requirement: get the content between the first "az" and the last "as".

Look closely: Tasty and tasty later in the text both contain as.

Solution: use lookaround

(?<=az).*?(?=as)

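A left angle bracket means look to the left (lookbehind); no angle bracket means look to the right (lookahead); an exclamation mark means negation.

Stepping through in a regex debugger shows that with the ?, matching proceeds from left to right one character at a time and stops at the first as, the one in Tasty. In greedy mode, after matching az the engine first runs to the end of the text and then backtracks from right to left, stopping at the first as it meets, which is the last one in the text. The greedy form is therefore the one that satisfies the requirement: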

(?<=az).*(?=as)

Rubular: (?<=az).*(?=as)

A rather powerful feature here is the regex debug mode.
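
The difference between the lazy and greedy forms is easy to see by running both on the sample text. A minimal Python sketch (the variable names are illustrative):

```python
import re

text = "I can haz haz bacon. Mmmm. Tasty, tasty bacon."

# Lazy: stop at the first position that is followed by "as".
lazy = re.search(r"(?<=az).*?(?=as)", text).group()

# Greedy: run to the end, then backtrack to the last position followed by "as".
greedy = re.search(r"(?<=az).*(?=as)", text).group()

print(repr(lazy))
print(repr(greedy))
```

Both matches start right after the first az; only the end point differs.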

VIII. Extracting data from a log with a regex and doing arithmetic

The log looks like this:

2021-10-14 00:02:52.474  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A4041A
C:ARRAY ,value共4701个
2021-10-14 00:02:52.480  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A3051A
W:ARRAY ,value共4707个
2021-10-14 00:02:52.488  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A3051A
C:ARRAY ,value共4705个
2021-10-14 00:02:52.495  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A3052B
Y:ARRAY ,value共4903个
2021-10-14 00:02:52.504  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A200CA
V:ARRAY ,value共5751个
2021-10-14 00:02:52.511  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A119V04JB
L:ARRAY ,value共6646个
2021-10-14 00:02:52.516  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A119V045C
F:ARRAY ,value共6278个
2021-10-14 00:02:52.526  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A101BB
S:ARRAY ,value共6648个
2021-10-14 00:02:52.536  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A502DB
F:ARRAY ,value共4812个
2021-10-14 00:02:52.542  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A104NB
J:ARRAY ,value共4920个
2021-10-14 00:02:52.551  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A104NA
N:ARRAY ,value共4920个
2021-10-14 00:02:52.558  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A119U06GB
M:ARRAY ,value共4917个
2021-10-14 00:02:52.566  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A401XB
A:ARRAY ,value共5769个
2021-10-14 00:02:52.573  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A404GA
X:ARRAY ,value共5762个
2021-10-14 00:02:52.581  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A304NB
X:ARRAY ,value共5764个
2021-10-14 00:02:52.588  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A404GB
D:ARRAY ,value共5762个
2021-10-14 00:02:52.596  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A119W03FC
K:ARRAY ,value共6268个
2021-10-14 00:02:52.602  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A404GB
W:ARRAY ,value共5764个
2021-10-14 00:02:52.609  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A506SB
D:ARRAY ,value共4812个
2021-10-14 00:02:52.615  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A4075B
U:ARRAY ,value共4811个
2021-10-14 00:02:52.622  INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService    : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录  result:glass_id   key:result:A11A406VC
K:ARRAY ,value共4796个

Requirement: extract the number from each line and sum them.

more info.log.2021-10-14.1.log|grep result:glass_i|awk -F " " {'print $12'}|pcregrep -o "\d+"|awk '{sum += $1};END {print sum}'
6194572
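
The same extract-and-sum step can be sketched in Python. The three sample lines below are shortened, hypothetical excerpts of the log above (with the wrapped key joined back onto one line), and `COUNT_RE` and `sum_values` are made-up names.

```python
import re

# Shortened, hypothetical excerpts of the log lines shown above.
log_lines = [
    "2021-10-14 00:02:52.474  INFO 30780 --- key:result:A11A4041AC:ARRAY ,value共4701个",
    "2021-10-14 00:02:52.480  INFO 30780 --- key:result:A11A3051AW:ARRAY ,value共4707个",
    "2021-10-14 00:02:52.488  INFO 30780 --- key:result:A11A3051AC:ARRAY ,value共4705个",
]

# Capture the digits between "value共" and "个".
COUNT_RE = re.compile(r"value共(\d+)个")

def sum_values(lines):
    total = 0
    for line in lines:
        m = COUNT_RE.search(line)
        if m:
            total += int(m.group(1))
    return total

print(sum_values(log_lines))
```

Anchoring on the surrounding text avoids picking up other digits in the line, such as the timestamp or the process id.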

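IX. Using BASH_REMATCH

BASH_REMATCH is an array holding the pieces of text matched by the given regular expression.

The requirement is as follows:

Extract the id and name from the text below.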

/home/vagrant# more test.txt
<List>
    <Job id="1" name="abc"/>
    <Job id="2" name="zyz"/>
    <Job id="3" name="beew"/>
</List>

Result:

abc | 1

zyz | 2

beew | 3

Solution:

:/home/vagrant# while read line;do if [[ $line =~ id=\"([0-9]+).*name=\"([^\"]*) ]];then echo "${BASH_REMATCH[2]} | ${BASH_REMATCH[1]}";fi done <test.txt

For a description of =~, see man bash:

An additional binary operator, =~, is available, with the same precedence as == and !=.  When it is used, the string to the right of
       the  operator  is  considered an extended regular expression and matched accordingly (as in regex(3)).  The return value is 0 if the
       string matches the pattern, and 1 otherwise.  If the regular expression is syntactically  incorrect,  the  conditional  expression's
       return  value  is  2.   If  the shell option nocasematch is enabled, the match is performed without regard to the case of alphabetic
       characters.  Any part of the pattern may be quoted to force the quoted portion to be matched as a string.   Bracket  expressions  in
       regular expressions must be treated carefully, since normal quoting characters lose their meanings between brackets.  If the pattern
       is stored in a shell variable, quoting the variable expansion forces the entire pattern to  be  matched  as  a  string.   Substrings
       matched  by parenthesized subexpressions within the regular expression are saved in the array variable BASH_REMATCH.  The element of
       BASH_REMATCH with index 0 is the portion of the string matching the entire regular expression.  The  element  of  BASH_REMATCH  with
       index n is the portion of the string matching the nth parenthesized subexpression.

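Here is another example. Note that =~ uses POSIX extended regular expressions, which have no lazy quantifiers, so the ? after .* does not make the match lazy here: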

/home/vagrant# cat binlog.info
ster-bin.000001     481
/home/vagrant# if [[ `cat binlog.info` =~ ^(.*?)[[:space:]]+(.*?)$ ]];then  echo "${BASH_REMATCH[2]} | ${BASH_REMATCH[1]}"; fi
481 | ster-bin.000001
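
For comparison, Python match objects play a similar role to BASH_REMATCH: group(0) is the whole match and group(n) is the nth parenthesized subgroup. The sketch below reuses the pattern from the while loop above; `JOB_RE` and `rows` are illustrative names.

```python
import re

xml = """<List>
    <Job id="1" name="abc"/>
    <Job id="2" name="zyz"/>
    <Job id="3" name="beew"/>
</List>"""

# Same pattern as in the bash loop: group 1 is the id, group 2 the name.
JOB_RE = re.compile(r'id="([0-9]+).*name="([^"]*)')

rows = []
for line in xml.splitlines():
    m = JOB_RE.search(line)
    if m:
        # m.group(0) would be the whole match, like BASH_REMATCH[0].
        rows.append(f"{m.group(2)} | {m.group(1)}")

print("\n".join(rows))
```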
