正则表达式
A regular expression is a pattern template you define that a Linux utility(sed,gawk等) uses to filter text.
正则表达式由正则表达式引擎来实现(regular expression engine)。
In the Linux world, there are two popular regular expression engines:
■ The POSIX Basic Regular Expression (BRE) engine
■ The POSIX Extended Regular Expression (ERE) engine
定义BRE模式
1、纯文本
$ echo "This is a test" | sed -n ’/test/p’
This is a test
正则表达式并不关心模式出现在数据流中的位置,关键是匹配正则表达式模式与数据流文本。正则表达式模式区分大小写。空格像其它字符一样处理。
2、特殊字符
The special characters recognized by regular expressions are:
.*[]^${}\+?|()
不要在文本模式中单独使用这些字符。可以用转义字符(\)把这些字符当作普通字符。
3、定位符
1)The caret character (^) defines a pattern that starts at the beginning of a line of text in the data stream. If the pattern is located any place other than the start of the line of text, the regular expression pattern fails.
$ echo "Books are great" | sed -n ’/^Book/p’
Books are great
2)The opposite of looking for a pattern at the start of a line is looking for it at the end of a line. The dollar sign ($) special character defines the end anchor. Add this special character after a text pattern to indicate that the line of data must end with the text pattern:
$ echo "This is a good book" | sed -n ’/book$/p’
This is a good book
3)The dot special character is used to match any single character except a newline character. The dot character must match a character though; if there’s no character in the place of the dot, then the pattern will fail.
4)字符类
用方括号来定义字符类。
$ sed -n ’/[ch]at/p’ data6
The cat is sleeping.
That is a very nice hat.
5)否定字符类
$ sed -n ’/[^ch]at/p’ data6
This test is at line two.
6)使用范围
You can use a range of characters within a character class by using the dash symbol。
Just specify the first character in the range, a dash, then the last character in the range. The regular expression includes any character that’s within the specified character range。
$ sed -n ’/^[0-9][0-9][0-9][0-9][0-9]$/p’ data8
60633
46201
45902
7)特殊字符类
BRE Special Character Classes
Class |
Description |
[[:alpha:]] |
Match any alphabetical character, either upper or lower case. |
[[:alnum:]] |
Match any alphanumeric character 0–9, A–Z, or a–z. |
[[:blank:]] |
Match a space or Tab character. |
[[:digit:]] |
Match a numerical digit from 0 through 9. |
[[:lower:]] |
Match any lower-case alphabetical character a–z. |
[[:print:]] |
Match any printable character. |
[[:punct:]] |
Match a punctuation character. |
[[:space:]] |
Match any whitespace character: space, Tab, NL, FF, VT, CR. |
[[:upper:]] |
Match any upper-case alphabetical character A–Z. |
8)星号
Placing an asterisk after a character signifies that the character must appear zero or more times in the text to match the pattern:
$ echo "ik" | sed -n ’/ie*k/p’
ik
扩展正则表达式
gawk支持,而sed不支持。
1)问号
The question mark is similar to the asterisk, but with a slight twist. The question mark indicates that the preceding character can appear zero or one time, but that’s all. It doesn’t match repeating occurrences of the character:
$ echo "bt" | gawk ’/be?t/{print $0}’
Bt
2)加号
The plus sign is another pattern symbol that’s similar to the asterisk, but with a different twist than the question mark. The plus sign indicates that the preceding character can appear one or more times, but must be present at least once. The pattern doesn’t match if the character is not present:
$ echo "beeet" | gawk ’/be+t/{print $0}’
beeet
3)大括号
Curly braces are available in ERE to allow you to specify a limit on a repeatable regular expression. This is often referred to as an interval. You can express the interval in two formats:
■ m: The regular expression appears exactly m times.
■ m,n: The regular expression appears at least m times, but no more than n times.
This feature allows you to fine-tune exactly how many times you allow a character (or character class) to appear in a pattern.
4)管道符号
The pipe symbol allows to you to specify two or more patterns that the regular expression engine uses in a logical OR formula when examining the data stream. If any of the patterns match the data stream text, the text passes. If none of the patterns match, the data stream text fails.
The format for using the pipe symbol is:
expr1|expr2|...
$ echo "The cat is asleep" | gawk ’/cat|dog/{print $0}’
The cat is asleep
5)将表达式分组
Regular expression patterns can also be grouped by using parentheses. When you group a regular expression pattern, the group is treated like a standard character. You can apply a special character to the group just as you would to a regular character.
$ echo "Sat" | gawk ’/Sat(urday)?/{print $0}’
Sat
$ echo "Saturday" | gawk ’/Sat(urday)?/{print $0}’
Saturday
$
几个例子
1)计算文件目录
$ cat countfiles
#!/bin/bash
# count number of files in your PATH
mypath=`echo $PATH | sed ’s/:/ /g’`
count=0
for directory in $mypath
do
check=`ls $directory`
for item in $check
do
count=$[ $count + 1 ]
done
echo "$directory - $count"
count=0
done
$ ./countfiles
/usr/local/bin - 79
/bin - 86
/usr/bin - 1502
/usr/X11R6/bin - 175
/usr/games - 2
/usr/java/j2sdk1.4.1 01/bin - 27
$
/usr/local/bin - 79
/bin - 86
/usr/bin - 1502
/usr/X11R6/bin - 175
2)验证电话号码
$ cat isphone
#!/bin/bash
# script to filter out bad phone numbers
gawk --re-interval ’/^\(?[2-9][0-9]{2}\)?(| |-|\.)
[0-9]{3}( |-|\.)[0-9]{4}/{print $0}’
$
By default, the gawk program doesn’t recognize regular expression intervals. You must specify the --re-interval command line option for the gawk program to recognize
regular expression intervals.
(123)456-7890
(123) 456-7890
123-456-7890
123.456.7890
3)解析电子邮件地址
^([a-zA-Z0-9 \-\.\+]+)@([a-zA-Z0-9 \-\.]+)\.([a-zA-Z]{2,5})$
参考: