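Everything in this article is hands-on practice with the regular expressions that Linux tools support. For the different regex flavors, see here.
I. Extracting the needed part of a string
There are generally two approaches:
- awk + sed substitution
- grep -o for an exact match; the better the regex, the more precise the match.
A common requirement is to extract just the needed part of a line, for example the number in a blog URL.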
regExp]# echo https://www.cnblogs.com/luoahong/p/11577600.html | sed 's/.*p\/\([0-9]*\).*/\1/g'
11577600
Or do the same with grep from the PCRE flavor:
home]# echo https://www.cnblogs.com/luoahong/p/11577600.html | pcregrep -o '\d+'
11577600
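Notes:
1. s means substitute. \1 refers to whatever is enclosed by the first opening parenthesis (counting from the left of the pattern) and its matching closing parenthesis; think of it as a subgroup.
In its default mode (BRE), sed supports * but treats an unescaped ? or + as a literal character, and PCRE shorthands such as \d are not available. This goes back to the history of the regex flavors; sed -E switches to ERE, where ? and + work unescaped.
2. A slash inside the pattern must be escaped (or a different delimiter chosen, such as s#...#...#); otherwise sed reports an error, as below: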
regExp]# echo https://www.cnblogs.com/luoahong/p/11577600.html | sed 's/.*p/\([0-9]*\).*/\1/g'
sed: -e expression #1, char 20: unknown option to `s'
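3. A lesson worth keeping: when I used to filter the needed information out of a line, my approach was to replace everything unwanted with an empty string, and that pattern shows up a lot in everyday sed use. It pays to turn the idea around: write a regex that matches the whole line and replace it with the captured subgroup.
Here is another example to reinforce this.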
Replace the matched content with 1:
[root@gptest01 regExp]# sed 's/\bhttps[^"]*html/1/g' html_parse.html
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第二讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第三讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第五讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第六讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第七讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第八讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第九讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第十讲</a></p>
<p><a class="entrylistItemTitle" href="1">Linux性能优化实战学习笔记:第十一讲</a></p>
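The regex clearly matches the part we need, but the replacement is the literal 1, not a subgroup.
Changing it like this still does not work: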
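regExp]# sed 's/.*(\bhttps[^"]*html).*/\1/g' html_parse.html
sed: -e expression #1, char 29: invalid reference \1 on `s' command's RHS
The reference is invalid because in BRE an unescaped ( is a literal parenthesis, so no group is ever defined; write \( ... \) or use sed -E.
4. Switching to the PCRE flavor also gets the job done, as the next example shows.
It also suggests another way of thinking: the matched text does not have to replace the whole line; it can simply be printed on its own.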
# \b is a word boundary
# [^"]* matches any run of characters that are not a double quote (inside brackets, ^ negates)
# so the match starts at https on a word boundary, runs through non-quote characters, and ends at html
regExp]# cat html_parse.html | pcregrep -o '\bhttps[^"]*html'
https://www.cnblogs.com/luoahong/p/10353166.html
https://www.cnblogs.com/luoahong/p/10802102.html
https://www.cnblogs.com/luoahong/p/10808031.html
https://www.cnblogs.com/luoahong/p/10808936.html
https://www.cnblogs.com/luoahong/p/10809642.html
https://www.cnblogs.com/luoahong/p/10811704.html
https://www.cnblogs.com/luoahong/p/10815283.html
https://www.cnblogs.com/luoahong/p/10820252.html
https://www.cnblogs.com/luoahong/p/10592001.html
The original HTML looks like this (single quotes around the echo argument keep the inner double quotes intact):
home]# echo '<p><a class="entrylistItemTitle" href="https://www.cnblogs.com/luoahong/p/10882536.html">Linux性能优化实战学习笔记:第二十三讲</a></p>' | pcregrep -o '\bhttps[^"]*html'
https://www.cnblogs.com/luoahong/p/10882536.html
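For comparison, the two ideas in this section (replace the whole line with a captured subgroup, or print only the matched text) can be sketched with Python's re module. This is a minimal sketch, not part of the original shell workflow; the helper names extract_post_id and extract_links and the shortened link title are assumptions made for illustration.

```python
import re

def extract_post_id(url):
    """Capture-group approach: replace the whole string with group 1, like the sed command."""
    return re.sub(r'.*p/([0-9]*).*', r'\1', url)

def extract_links(html):
    """Match-only approach: return just the matched text, like grep -o."""
    return re.findall(r'\bhttps[^"]*html', html)

url = 'https://www.cnblogs.com/luoahong/p/11577600.html'
# Hypothetical one-line sample in the same shape as html_parse.html
line = '<p><a class="entrylistItemTitle" href="https://www.cnblogs.com/luoahong/p/10882536.html">title</a></p>'

print(extract_post_id(url))   # 11577600
print(extract_links(line))    # ['https://www.cnblogs.com/luoahong/p/10882536.html']
```

Python's re uses a Perl-style syntax, so groups are written ( ) without backslashes and \d would also be available here.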
II. Handling words that are repeated consecutively
the little cat cat is in the hat hat, we like it.
Here cat and hat each appear several times in a row; the required result is
the little cat is in the hat, we like it.
subGroup]# echo "the little cat cat is in the hat hat, we like it."|sed -E 's/(\w+) \1/\1/g'
the little cat is in the hat, we like it.
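Why does this match consecutive repeats?
(\w+) is a capture group, the only one here, and \1 stands for whatever that first group matched.
So the regex reads: some word + a space + the same word, which is exactly a consecutively repeated word.
What if a word appears more than twice in a row?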
the little cat cat is in the hat hat hat, we like it
With more than two repeats, the thing to get right is (\s\1)+ : the + belongs outside the parentheses. What happens if it is placed inside?
subGroup]# echo "the little cat cat is in the hat hat hat, we like it."|sed -E 's/(\w+)( \1+)/\1/g'
the little cat is in the hat hat, we like it.
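Why? Inside the parentheses, + applies to \1 alone, so it matches the first subgroup repeated back to back one or more times (hathat, with no space in between). Outside, it applies to the whole group, so "space + word" may repeat one or more times.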
subGroup]# echo "the little cat cat is in the hat hat hat, we like it." |sed -E 's/(\b\w+)(\s\1)+/\1/g'
the little cat is in the hat, we like it.
Using a literal space instead of \s works just as well:
subGroup]# echo "the little cat cat is in the hat hat hat, we like it." |sed -E 's/(\b\w+)( \1)+/\1/g'
the little cat is in the hat, we like it.
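The difference between putting + outside and inside the group can be reproduced with Python's re.sub. This is a small sketch for comparison only; the variable names outside and inside are chosen for illustration.

```python
import re

text = "the little cat cat is in the hat hat hat, we like it."

# + outside the group: "whitespace + same word" may repeat any number of times
outside = re.sub(r'\b(\w+)(\s\1)+', r'\1', text)

# + inside the group: only \1 itself may repeat (e.g. "hathat"),
# so just one extra " hat" is consumed and one duplicate survives
inside = re.sub(r'(\w+)( \1+)', r'\1', text)

print(outside)  # the little cat is in the hat, we like it.
print(inside)   # the little cat is in the hat hat, we like it.
```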
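III. Matching timestamps with regex
This is about the simplest regex for matching a time; it breaks as soon as the format changes. For a deeper treatment, refer to more specialised articles.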
monitorDoubleAck]# echo "2020-08-03 06:56:38" |pcregrep '\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'
2020-08-03 06:56:38
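Recently a requirement came up: a serious bug in the program on the machine that uploads data produced absurd times such as 2020-09-06 03:60:02, and because the program never validated them, things went badly wrong.
You might think this is simple.
But the regex above clearly cannot meet the requirement, because \d{2} accepts any two digits.
The test screenshot covers basically every kind of invalid time; the premise is that every field shorter than two digits is zero-padded.
The regex looks like this: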
\d{4}\-(?:0[1-9]|1[0-2])\-(?:0[1-9]|[12]\d|3[01]) (?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d
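Notes:
- \d{4} for the year is not very strict; 0212-12-01 is also matched.
- The day range is not tied to the month, so a date such as 02-31 still passes.
- (?:regex) groups without saving a subgroup, for example \d{15}(?:\d{3})?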
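As a cross-check, the same kind of validation can be sketched in Python. This assumes zero-padded fields in the form YYYY-MM-DD HH:MM:SS; the names TS_RE, looks_valid and is_real_datetime are hypothetical, and the strptime check is an addition that the shell version does not have.

```python
import re
from datetime import datetime

# Assumes zero-padded fields: YYYY-MM-DD HH:MM:SS
TS_RE = re.compile(
    r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]) '
    r'(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d',
    re.ASCII,
)

def looks_valid(ts):
    """Range check by regex only; it does not know how many days a month has."""
    return TS_RE.fullmatch(ts) is not None

def is_real_datetime(ts):
    """Calendar-aware check using the standard library."""
    try:
        datetime.strptime(ts, '%Y-%m-%d %H:%M:%S')
        return True
    except ValueError:
        return False

for ts in ('2020-08-03 06:56:38', '2020-09-06 03:60:02', '2020-02-31 00:00:00'):
    print(ts, looks_valid(ts), is_real_datetime(ts))
```

The regex rejects 03:60:02 but accepts 2020-02-31, while strptime rejects both; on the other hand strptime tolerates fields without zero padding, so using the two checks together is one reasonable design.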
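IV. Replacing the content of parentheses with regex
When writing the YAML control file for gpload, the column names sometimes have to be listed explicitly, as in the figure below.
That is obviously just the table structure, so how can the table structure be turned into the format the figure shows?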
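1. How should the numbers in parentheses be handled?
[(\d)]
This replacement has a problem: [(\d)] is a character class that matches any single (, digit, or ), so digits inside column names are removed as well.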
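2. A colon after the first field (insert a : at the second space in the line)
In Notepad++, switch to regular expression mode and replace as follows:
Search: ^([^ ]+[ ]+[^ ]+)[ ]
Replace: \1:
Starting from the beginning of the line, the pattern looks for a run of non-space characters, then a run of spaces, then another run of non-space characters, and finally matches one space character.
The replacement is everything before that final space (group 1), followed by a colon.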
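Both steps can be tried out in Python before running them in an editor. The input line below is hypothetical (a list marker, a column name and a column type, since the original figure is not available), and \(\d+\) is a suggested alternative to the character class rather than the pattern used above.

```python
import re

# Hypothetical input line: list marker, column name, column type
line = '- col1 numeric(10)'

# [(\d)] is a character class: it deletes every "(", digit and ")",
# including the 1 in col1
too_broad = re.sub(r'[(\d)]', '', line)

# Matching "(digits)" as a sequence leaves the column name alone
no_size = re.sub(r'\(\d+\)', '', line)

# Same search/replace pair as in the editor: the space after the
# second token becomes a colon
with_colon = re.sub(r'^([^ ]+[ ]+[^ ]+)[ ]', r'\1:', no_size)

print(too_broad)   # - col numeric
print(no_size)     # - col1 numeric
print(with_colon)  # - col1:numeric
```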
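V. Replacing the content of parentheses (including the parentheses themselves)
Requirement: remove the parentheses and their content from the SQL statement below.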
ALTER TABLE dm.basics_history ADD CONSTRAINT basics_history_pkey1 PRIMARY KEY (seq_id);
Result after replacement:
The regex used is: \(.*\)
Note that because .* is greedy, it consumes as much as it can, so "(abc)(def)" becomes "".
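The greedy behaviour is easy to see in Python. The sketch below also shows \([^)]*\), a common alternative that handles each pair of parentheses separately; that pattern is a suggestion, not the one used in the original replacement.

```python
import re

sql = 'ALTER TABLE dm.basics_history ADD CONSTRAINT basics_history_pkey1 PRIMARY KEY (seq_id);'

# Greedy: one match from the first "(" to the last ")"
greedy = re.sub(r'\(.*\)', '', '(abc)(def)')

# Negated class: each "(...)" pair is matched on its own
per_group = re.sub(r'\([^)]*\)', '', 'x(abc)y(def)z')

stripped = re.sub(r'\([^)]*\)', '', sql)

print(repr(greedy))   # ''
print(per_group)      # xyz
print(stripped)       # ALTER TABLE dm.basics_history ADD CONSTRAINT basics_history_pkey1 PRIMARY KEY ;
```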
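VI. Getting the string between a left and a right boundary
Requirement: from the data below, extract
1. the index name between INDEX and ON
2. the table name after ON
3. the columns inside the parentheses
There are more than 100,000 lines in total.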
CREATE UNIQUE INDEX cell_pds_result_assm_t_n_pkey ON cell_pds_result_assm_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
CREATE UNIQUE INDEX cell_pds_result_other_t_n_pkey ON cell_pds_result_other_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
CREATE UNIQUE INDEX cell_pds_result_pipr_t_n_pkey ON cell_pds_result_pipr_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
CREATE UNIQUE INDEX cell_pds_result_psva_t_n_pkey ON cell_pds_result_psva_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
CREATE UNIQUE INDEX cf_mds_glass_h_cnvr_t_pkey ON cf_mds_glass_h_cnvr_t USING btree (glass_id, step_id, glass_start_time, equip_id, sub_equip_id)
...... (100,000+ lines)
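Approach:
Getting the index name and the table name
awk can extract both.
1. Getting the table name: note that a primary-key index has an extra UNIQUE keyword, which shifts the field position (cat -n also adds the line number as the first field).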
cat -n test.txt | awk '{print $6}' > ./otherindex.txt
cat -n test.txt | awk '{print $7}' > ./pkindex.txt
2. Use a regex to get the string after INDEX and before ON (case-sensitive)
(?<=INDEX).*(?=ON)
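Extracting it this way is fiddly; it is easier to do everything in one step with Python.
Using Python re to extract the content of the parentheses:
1. With the Python re module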
    import re

    # Compile once, outside the loop
    p1 = re.compile(r'[(](.*?)[)]', re.S)          # content inside the parentheses
    p11 = re.compile(r'(?<=INDEX).*(?=ON)', re.S)  # text between INDEX and ON

    if __name__ == '__main__':
        with open('tableList.txt', 'r', encoding='utf-8', newline='') as f:
            for line in f:
                line = line.rstrip()
                # content inside the parentheses
                print(p1.findall(line))
                # index name between INDEX and ON
                print(p11.findall(line))
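2. Use a regex tool to search, then export the matched lines
regex101: build, test, and debug regex
Run the match there first, then copy out what the regex matched.
For this requirement the longest (greedy) match is not wanted; deselecting this button does the job.
Is there a better way to get the index name, the table name, and the content of the parentheses?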
/tmp]#pcregrep -o "[(](.*)[)]" test.txt
pcregrep -o "(?<=INDEX).*(?=ON)" test.txt |more
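Update 2022-01-06 20:15:08 ----------
sed handles this easily as well: deleting the parentheses leaves the column list as plain text at the end of the line.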
[root@ninesun ~]# cat >> file2.log << EOF
> CREATE UNIQUE INDEX cell_pds_result_assm_t_n_pkey ON cell_pds_result_assm_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
> EOF
[root@ninesun ~]# cat file2.log
CREATE UNIQUE INDEX cell_pds_result_assm_t_n_pkey ON cell_pds_result_assm_t USING btree (component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name)
[root@ninesun ~]# sed -i "s/[()]//g" ./file2.log
[root@ninesun ~]# cat file2.log
CREATE UNIQUE INDEX cell_pds_result_assm_t_n_pkey ON cell_pds_result_assm_t USING btree component_id, step_id, component_start_time, equip_id, sub_equip_id, param_collection, param_name, site_name
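All three pieces (index name, table name, column list) can also be pulled out in a single pass with one regex and named groups. This is a sketch under the assumption that every line follows the CREATE [UNIQUE] INDEX ... ON ... (...) shape shown above; INDEX_RE and parse_index are hypothetical names.

```python
import re

# Assumed line shape: CREATE [UNIQUE] INDEX <index> ON <table> ... (<columns>)
INDEX_RE = re.compile(
    r'CREATE\s+(?:UNIQUE\s+)?INDEX\s+(?P<index>\S+)\s+ON\s+(?P<table>\S+)'
    r'.*?\((?P<columns>[^)]*)\)'
)

def parse_index(line):
    """Return (index name, table name, list of columns), or None if the line does not match."""
    m = INDEX_RE.search(line)
    if m is None:
        return None
    columns = [c.strip() for c in m.group('columns').split(',')]
    return m.group('index'), m.group('table'), columns

sample = ('CREATE UNIQUE INDEX cf_mds_glass_h_cnvr_t_pkey ON cf_mds_glass_h_cnvr_t '
          'USING btree (glass_id, step_id, glass_start_time, equip_id, sub_equip_id)')
print(parse_index(sample))
```

Because UNIQUE is optional in the pattern, the same function covers both kinds of index without counting field positions.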
VII. Greedy matching rules
Given the following text:
I can haz haz bacon. Mmmm. Tasty, tasty bacon.
Requirement: get the content between the first "az" and the last "as".
Look closely: both Tasty and tasty further along contain as.
Solution: use lookaround
(?<=az).*?(?=as)
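A left angle bracket means looking to the left (lookbehind); no angle bracket means looking to the right (lookahead); an exclamation mark means negation.
Stepping through in the regex debugger shows that with the ? (lazy) the engine extends the match from left to right one character at a time and stops at the first as, the one in Tasty. The greedy form first matches az from left to right, then runs to the end of the line and backtracks from right to left, stopping at the first as it meets on the way back, which is the last one in the text (in tasty):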
(?<=az).*(?=as)
A rather powerful feature is at work here: the regex debug mode.
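The two lookaround patterns can be compared directly in Python, which supports the same fixed-width lookbehind syntax. A minimal sketch; the variable names lazy and greedy are chosen for illustration.

```python
import re

text = "I can haz haz bacon. Mmmm. Tasty, tasty bacon."

# Lazy: stop at the first "as" after the first "az"
lazy = re.search(r'(?<=az).*?(?=as)', text).group()

# Greedy: run to the end, then backtrack to the last "as"
greedy = re.search(r'(?<=az).*(?=as)', text).group()

print(repr(lazy))    # ' haz bacon. Mmmm. T'
print(repr(greedy))  # ' haz bacon. Mmmm. Tasty, t'
```

Only the greedy form reaches the last as, so it is the one that covers "first az to last as".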
VIII. Extracting data from a log with regex and doing arithmetic
The log looks like this:
2021-10-14 00:02:52.474 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A4041A C:ARRAY ,value共4701个
2021-10-14 00:02:52.480 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A3051A W:ARRAY ,value共4707个
2021-10-14 00:02:52.488 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A3051A C:ARRAY ,value共4705个
2021-10-14 00:02:52.495 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A3052B Y:ARRAY ,value共4903个
2021-10-14 00:02:52.504 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A200CA V:ARRAY ,value共5751个
2021-10-14 00:02:52.511 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A119V04JB L:ARRAY ,value共6646个
2021-10-14 00:02:52.516 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A119V045C F:ARRAY ,value共6278个
2021-10-14 00:02:52.526 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A101BB S:ARRAY ,value共6648个
2021-10-14 00:02:52.536 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A502DB F:ARRAY ,value共4812个
2021-10-14 00:02:52.542 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A104NB J:ARRAY ,value共4920个
2021-10-14 00:02:52.551 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A104NA N:ARRAY ,value共4920个
2021-10-14 00:02:52.558 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A119U06GB M:ARRAY ,value共4917个
2021-10-14 00:02:52.566 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A401XB A:ARRAY ,value共5769个
2021-10-14 00:02:52.573 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A404GA X:ARRAY ,value共5762个
2021-10-14 00:02:52.581 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A304NB X:ARRAY ,value共5764个
2021-10-14 00:02:52.588 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A404GB D:ARRAY ,value共5762个
2021-10-14 00:02:52.596 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A119W03FC K:ARRAY ,value共6268个
2021-10-14 00:02:52.602 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A404GB W:ARRAY ,value共5764个
2021-10-14 00:02:52.609 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A506SB D:ARRAY ,value共4812个
2021-10-14 00:02:52.615 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A4075B U:ARRAY ,value共4811个
2021-10-14 00:02:52.622 INFO 30780 --- [QuartzScheduler_Worker-15] c.n.l.l.i.r.DeleteRedisDataNewService : delKeyByGlassForAC7710-delKeyByGlass-删除redis记录 result:glass_id key:result:A11A406VC K:ARRAY ,value共4796个
Requirement: extract the number from each line and add them up.
more info.log.2021-10-14.1.log|grep result:glass_i|awk -F " " {'print $12'}|pcregrep -o "\d+"|awk '{sum += $1};END {print sum}'
6194572
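The same extract-and-sum step can be sketched in Python by anchoring on the text around the number instead of on a field position. The three sample lines are abbreviated from the first three log entries; VALUE_RE and sum_values are hypothetical names, and the total printed here covers only these three lines, not the full log.

```python
import re

# Tails of the first three log entries; the leading timestamp and class name are omitted
lines = [
    'result:glass_id key:result:A11A4041A C:ARRAY ,value共4701个',
    'result:glass_id key:result:A11A3051A W:ARRAY ,value共4707个',
    'result:glass_id key:result:A11A3051A C:ARRAY ,value共4705个',
]

# Capture the digits between "value共" and "个"
VALUE_RE = re.compile(r'value共(\d+)个')

def sum_values(log_lines):
    """Add up the captured number from every line that contains one."""
    total = 0
    for line in log_lines:
        m = VALUE_RE.search(line)
        if m:
            total += int(m.group(1))
    return total

print(sum_values(lines))  # 14113
```

Matching on the surrounding text means the result does not depend on how many whitespace-separated fields precede the value.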
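IX. Using BASH_REMATCH
BASH_REMATCH is an array holding the pieces of text matched by the given regular expression.
The requirement:
Extract the id and name from the following text.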
/home/vagrant# more test.txt
<List>
<Job id="1" name="abc"/>
<Job id="2" name="zyz"/>
<Job id="3" name="beew"/>
</List>
Result:
abc | 1
zyz | 2
beew | 3
Solution:
:/home/vagrant# while read line; do
    if [[ $line =~ id=\"([0-9]+).*name=\"([^\"]*) ]]; then
        echo "${BASH_REMATCH[2]} | ${BASH_REMATCH[1]}"
    fi
done <test.txt
For a description of =~, see man bash:
An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise. If the regular expression is syntactically incorrect, the conditional expression's return value is 2. If the shell option nocasematch is enabled, the match is performed without regard to the case of alphabetic characters. Any part of the pattern may be quoted to force the quoted portion to be matched as a string. Bracket expressions in regular expressions must be treated carefully, since normal quoting characters lose their meanings between brackets. If the pattern is stored in a shell variable, quoting the variable expansion forces the entire pattern to be matched as a string. Substrings matched by parenthesized subexpressions within the regular expression are saved in the array variable BASH_REMATCH. The element of BASH_REMATCH with index 0 is the portion of the string matching the entire regular expression. The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression.
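Here is another example:
/home/vagrant# cat binlog.info
ster-bin.000001 481
/home/vagrant# if [[ `cat binlog.info` =~ ^(.*?)[[:space:]]+(.*?)$ ]];then echo "${BASH_REMATCH[2]} | ${BASH_REMATCH[1]}"; fi
481 | ster-bin.000001
Note that =~ uses POSIX extended regular expressions, which do not define lazy quantifiers, so the ? after .* should not be read as non-greedy. The split works here because the line contains a single run of whitespace.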
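For readers more at home in Python, the match object plays the role of BASH_REMATCH: group(0) is the whole match and group(n) is the nth parenthesised subexpression. A minimal sketch using the same pattern and the same sample data as test.txt; JOB_RE and rows are hypothetical names.

```python
import re

# Same pattern as in the bash loop, without the shell escaping
JOB_RE = re.compile(r'id="([0-9]+).*name="([^"]*)')

xml = '''<List>
<Job id="1" name="abc"/>
<Job id="2" name="zyz"/>
<Job id="3" name="beew"/>
</List>'''

rows = []
for line in xml.splitlines():
    m = JOB_RE.search(line)
    if m:
        # group(2) is the name, group(1) is the id
        rows.append(f'{m.group(2)} | {m.group(1)}')

print('\n'.join(rows))
```

Lines without a match, such as the List tags, are skipped in the same way as in the shell loop.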