14. Linux Shell Notes (7): Regular Expressions

Regular Expressions

A regular expression is a pattern template you define that a Linux utility (such as sed or gawk) uses to filter text.

Regular expressions are implemented by a regular expression engine.

In the Linux world, there are two popular regular expression engines:

The POSIX Basic Regular Expression (BRE) engine

The POSIX Extended Regular Expression (ERE) engine

Defining BRE Patterns

1. Plain text

$ echo "This is a test" | sed -n '/test/p'

This is a test

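A regular expression doesn't care where in the data stream the pattern occurs; all that matters is that the pattern matches some text in the data stream. Regular expression patterns are case sensitive. Spaces are treated like any other character.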

2. Special characters

The special characters recognized by regular expressions are:

.*[]^${}\+?|()

Don't use these characters by themselves in a text pattern. To use one of them as an ordinary character, escape it with a backslash (\).
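As a small illustration of escaping, the sketch below asks sed for a literal dollar sign and a literal dot. The sample text and the variable names are made up for this example.

```shell
# Hypothetical sample line; the text is made up for this illustration.
line='The cost is $4.00'

# \$ matches a literal dollar sign and \. matches a literal dot.
# Single quotes keep the shell from interpreting the backslashes.
escaped_hit=$(echo "$line" | sed -n '/\$4\.00/p')
echo "$escaped_hit"
```

Without the backslash, the dot would also accept any other character in that position, for example the x in 4x00.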

3. Anchor characters

1) The caret character (^) defines a pattern that starts at the beginning of a line of text in the data stream. If the pattern is located any place other than the start of the line of text, the regular expression pattern fails.

$ echo "Books are great" | sed -n '/^Book/p'

Books are great

2) The opposite of looking for a pattern at the start of a line is looking for it at the end of a line. The dollar sign ($) special character defines the end anchor. Add this special character after a text pattern to indicate that the line of data must end with the text pattern:

$ echo "This is a good book" | sed -n '/book$/p'

This is a good book

3) The dot special character is used to match any single character except a newline character. The dot must match a character, though; if there's no character in the place of the dot, the pattern fails.
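A minimal sketch of the dot in action, using three made-up input lines. The second line starts with "at", so there is no character for the dot to match and the line is dropped; in the third line the space before "at" counts as a character.

```shell
# Hypothetical input lines for this sketch.
dot_hits=$(printf '%s\n' \
    'The cat is sleeping.' \
    'at ten o clock' \
    'This test is at line two.' | sed -n '/.at/p')
echo "$dot_hits"
```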

4) Character classes

Use square brackets to define a character class.

$ sed -n '/[ch]at/p' data6

The cat is sleeping.

That is a very nice hat.

5) Negating character classes

$ sed -n '/[^ch]at/p' data6

This test is at line two.
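Since the contents of data6 are not shown in full, here is a self-contained sketch of a character class and its negation. The input words and variable names are hypothetical.

```shell
# [Yy] accepts either case, but only for the first letter.
class_hits=$(printf '%s\n' 'Yes' 'yes' 'YES' 'no' | sed -n '/[Yy]es/p')
echo "$class_hits"

# [^Yy] requires some character other than Y or y in front of "es".
# A bare "es" fails because the class still has to match one character.
negated_hits=$(printf '%s\n' 'yes' 'test' 'es' | sed -n '/[^Yy]es/p')
echo "$negated_hits"
```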

6) Using ranges

You can use a range of characters within a character class by using the dash symbol.

Just specify the first character in the range, a dash, then the last character in the range. The regular expression includes any character that's within the specified character range.

$ sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p' data8

60633

46201

45902
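A character class can also hold more than one range. The sketch below, with made-up input words, accepts a first letter from a through c or from h through m. Setting LC_ALL=C is an assumption made here so that the ranges follow plain ASCII order regardless of the system locale.

```shell
# Two ranges in one class: a-c and h-m.
range_hits=$(printf '%s\n' 'cat' 'hat' 'rat' 'mat' | LC_ALL=C sed -n '/^[a-ch-m]at$/p')
echo "$range_hits"
```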

7) Special character classes

BRE Special Character Classes

Class         Description
[[:alpha:]]   Match any alphabetical character, either upper or lower case.
[[:alnum:]]   Match any alphanumeric character 0–9, A–Z, or a–z.
[[:blank:]]   Match a space or Tab character.
[[:digit:]]   Match a numerical digit from 0 through 9.
[[:lower:]]   Match any lower-case alphabetical character a–z.
[[:print:]]   Match any printable character.
[[:punct:]]   Match a punctuation character.
[[:space:]]   Match any whitespace character: space, Tab, NL, FF, VT, CR.
[[:upper:]]   Match any upper-case alphabetical character A–Z.
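The special classes are used like any other character class. A short sketch with hypothetical input lines:

```shell
# Keep only the lines made up entirely of digits.
digit_lines=$(printf '%s\n' '60633' 'abc' '4620a' | sed -n '/^[[:digit:]][[:digit:]]*$/p')
echo "$digit_lines"

# Keep only the lines that contain at least one punctuation character.
punct_lines=$(printf '%s\n' 'no punctuation here' 'This is a test.' | sed -n '/[[:punct:]]/p')
echo "$punct_lines"
```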

8) The asterisk

Placing an asterisk after a character signifies that the character must appear zero or more times in the text to match the pattern:

$ echo "ik" | sed -n '/ie*k/p'

ik
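The asterisk can also follow the dot or a character class. The sketch below uses made-up input lines to show both combinations.

```shell
# .* allows any run of characters (including none) between the two words.
gap_hits=$(printf '%s\n' \
    'this is a regular pattern expression' \
    'an expression that is regular' | sed -n '/regular.*expression/p')
echo "$gap_hits"

# [ae]* allows any mix of a and e, or nothing at all, between b and t.
star_hits=$(printf '%s\n' 'bt' 'baeeaet' 'boot' | sed -n '/b[ae]*t/p')
echo "$star_hits"
```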

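Extended Regular Expressions

gawk supports these ERE patterns. sed uses BRE by default; GNU sed accepts ERE patterns only when it is run with the -E (or -r) option.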

1) The question mark

The question mark is similar to the asterisk, but with a slight twist. The question mark indicates that the preceding character can appear zero or one time, but that's all. It doesn't match repeating occurrences of the character:

$ echo "bt" | gawk '/be?t/{print $0}'

bt

2) The plus sign

The plus sign is another pattern symbol that's similar to the asterisk, but with a different twist than the question mark. The plus sign indicates that the preceding character can appear one or more times, but must be present at least once. The pattern doesn't match if the character is not present:

$ echo "beeet" | gawk '/be+t/{print $0}'

beeet
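To compare the question mark and the plus sign side by side, the sketch below runs both against the same made-up word list. It uses grep -E instead of gawk, on the assumption that grep is available; grep -E also takes ERE patterns.

```shell
# Hypothetical word list for this sketch.
words=$(printf '%s\n' 'bt' 'bat' 'bet' 'baet' 'beet')

# ? : zero or one character from the class between b and t.
optional_hits=$(echo "$words" | grep -E 'b[ae]?t')
echo "$optional_hits"

# + : one or more characters from the class between b and t.
required_hits=$(echo "$words" | grep -E 'b[ae]+t')
echo "$required_hits"
```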

3) Using braces

Curly braces are available in ERE to allow you to specify a limit on a repeatable regular expression. This is often referred to as an interval. You can express the interval in two formats:

m: The regular expression appears exactly m times.

m,n: The regular expression appears at least m times, but no more than n times.

This feature allows you to fine-tune exactly how many times you allow a character (or character class) to appear in a pattern.
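A minimal sketch of both interval formats, again using grep -E as a stand-in ERE tool (an assumption of this example, not part of the original notes). The word list is made up.

```shell
# Hypothetical word list for this sketch.
words=$(printf '%s\n' 'bt' 'bet' 'beet' 'beeet')

# {1}: the e must appear exactly once.
exact_hits=$(echo "$words" | grep -E '^be{1}t$')
echo "$exact_hits"

# {1,2}: the e must appear once or twice.
interval_hits=$(echo "$words" | grep -E '^be{1,2}t$')
echo "$interval_hits"
```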

4) The pipe symbol

The pipe symbol allows you to specify two or more patterns that the regular expression engine uses in a logical OR formula when examining the data stream. If any of the patterns match the data stream text, the text passes. If none of the patterns match, the data stream text fails.

The format for using the pipe symbol is:

expr1|expr2|...

$ echo "The cat is asleep" | gawk '/cat|dog/{print $0}'

The cat is asleep

5) Grouping expressions

Regular expression patterns can also be grouped by using parentheses. When you group a regular expression pattern, the group is treated like a standard character. You can apply a special character to the group just as you would to a regular character.

$ echo "Sat" | gawk '/Sat(urday)?/{print $0}'

Sat

$ echo "Saturday" | gawk '/Sat(urday)?/{print $0}'

Saturday

$
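Grouping is often combined with the pipe symbol to offer a choice at one position in the pattern. The sketch below uses grep -E as a stand-in ERE tool (an assumption of this example) and a made-up word list.

```shell
# (c|b) picks the first letter, (b|t) picks the last one.
group_hits=$(printf '%s\n' 'cat' 'cab' 'bat' 'bab' 'tab' 'tac' | grep -E '^(c|b)a(b|t)$')
echo "$group_hits"
```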

Some Examples

1) Counting directory files

$ cat countfiles

#!/bin/bash
# count number of files in your PATH
# replace each colon with a space so the for loop can split the list
mypath=$(echo "$PATH" | sed 's/:/ /g')
count=0
for directory in $mypath
do
    check=$(ls "$directory")
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    count=0
done

$ ./countfiles

/usr/local/bin - 79

/bin - 86

/usr/bin - 1502

/usr/X11R6/bin - 175

/usr/games - 2

/usr/java/j2sdk1.4.1_01/bin - 27

$
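The script above splits PATH on spaces, so a directory or file name that contains a space is counted incorrectly. One possible variation, sketched here under that concern, splits on the colon directly through IFS and counts entries with a glob. The function name count_path_files is hypothetical, and like ls without options it skips hidden files.

```shell
# count_path_files is a hypothetical helper name used only for this sketch.
# It takes a colon-separated directory list and prints "dir - count" per entry.
count_path_files() {
    old_ifs=$IFS
    IFS=:
    for directory in $1; do
        count=0
        for item in "$directory"/*; do
            # Skip the unexpanded pattern left behind by an empty directory.
            [ -e "$item" ] && count=$((count + 1))
        done
        echo "$directory - $count"
    done
    IFS=$old_ifs
}

count_path_files "$PATH"
```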

2) Validating a phone number

$ cat isphone

#!/bin/bash
# script to filter out bad phone numbers
gawk --re-interval '/^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}/{print $0}'

$

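Older versions of the gawk program (before 4.0) don't recognize regular expression intervals by default, so you must specify the --re-interval command line option for them. Newer versions recognize intervals by default and still accept the option.

The pattern is written for phone numbers in the following four layouts. It also requires the area code to begin with a digit from 2 through 9, so the 123 shown here only illustrates the layout: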

(123)456-7890

(123) 456-7890

123-456-7890

123.456.7890
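To try the pattern without gawk, the sketch below wraps it in a small function built on grep -E. Both the function name is_phone and the use of grep are assumptions of this example. The empty alternative in (| |-|\.) is written as ( |-|\.)? here, an equivalent form that does not depend on the tool accepting an empty alternative.

```shell
# is_phone is a hypothetical helper name for this sketch.
# It succeeds when its argument starts with a number in one of the accepted layouts.
is_phone() {
    echo "$1" | grep -Eq '^\(?[2-9][0-9]{2}\)?( |-|\.)?[0-9]{3}( |-|\.)[0-9]{4}'
}

for number in '317-555-1234' '(317) 555-1234' '000-000-0000'; do
    if is_phone "$number"; then
        echo "$number: valid"
    else
        echo "$number: invalid"
    fi
done
```

The last number is rejected because its area code starts with 0.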

3) Parsing an e-mail address

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
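The pattern can be exercised with a small function, sketched here with grep -E; the function name is_email is hypothetical. POSIX tools such as grep treat a backslash inside square brackets as a literal backslash, so this version lists the dot and the plus sign unescaped and places the hyphen last in each class. It also assumes the classes are meant to include the underscore.

```shell
# is_email is a hypothetical helper name for this sketch.
# Local part: letters, digits, underscore, dot, plus, hyphen.
# Host part: the same without the plus sign, then a dot and 2 to 5 letters.
is_email() {
    echo "$1" | grep -Eq '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'
}

for address in 'rich@here.now' 'rich@here' 'rich@here.n'; do
    if is_email "$address"; then
        echo "$address: valid"
    else
        echo "$address: invalid"
    fi
done
```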

Reference:

1) Linux命令行和SHELL脚本编程 (Linux Command Line and Shell Scripting)
