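Pattern matching with regular expressions
We will learn regular expressions through the str_view() and str_view_all() functions. Both take a character vector and a regular expression and show how the two match.
Basic matching
str_view() checks whether each string matches the pattern and highlights the match when it does:
library(stringr)  # provides str_view() and the other str_*() functions used here
x <- c("apple", "banana", "pear")
str_view(x, "an")
A slightly more complex pattern uses ., which matches any character except a newline:
str_view(x, ".a.")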
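Because str_view() renders its result for interactive viewing, it can help to check the same patterns with functions that return plain values. The following is a minimal sketch using str_detect() and str_extract() on the same vector; the variable names has_an and dot_match are only illustrative.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Does each string contain the literal sequence "an"?
has_an <- str_detect(x, "an")
has_an
#> [1] FALSE  TRUE FALSE

# "." stands for any single character, so ".a." needs a character
# on both sides of the "a". "apple" starts with "a", so it has no match.
dot_match <- str_extract(x, ".a.")
dot_match
#> [1] NA    "ban" "ear"
```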
         
Anchors
- ^ matches at the start of the string.
- $ matches at the end of the string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
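As a rough illustration of how the two anchors behave, the sketch below checks them with str_detect(). The second vector y and the variable names are assumptions made for this example; combining ^ and $ forces the pattern to cover the whole string.

```r
library(stringr)

x <- c("apple", "banana", "pear")

starts_with_a <- str_detect(x, "^a")  # TRUE only for "apple"
ends_with_a   <- str_detect(x, "a$")  # TRUE only for "banana"

# With both anchors, only a string that is exactly "apple" matches.
y <- c("apple pie", "apple", "apple cake")
whole_apple <- str_detect(y, "^apple$")
whole_apple
#> [1] FALSE  TRUE FALSE
```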
         
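Character classes and alternatives
Several special patterns match more than one character:
- .: matches any character except a newline
- \d: matches any digit
- \s: matches any whitespace character (such as a space, tab, or newline)
- [abc]: matches a, b, or c
- [^abc]: matches any character except a, b, or c
Note: to create a regular expression that contains \d or \s, you must escape the \ inside the R string, so you need to type "\\d" or "\\s".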
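The escaping rule is easier to see in running code. This is a small sketch with made-up input strings ("room 42b", "a1b2" and so on are not from the original text); it shows that the R string "\\d" holds just two characters and then exercises each character class once.

```r
library(stringr)

# The R string "\\d" contains two characters: a backslash and "d".
nchar("\\d")
#> [1] 2

first_digit <- str_extract("room 42b", "\\d")        # first digit: "4"
has_space   <- str_detect(c("a b", "ab"), "\\s")     # TRUE, FALSE
n_abc       <- str_count("banana", "[abc]")          # b, a, a, a: 4 matches
only_ab     <- str_replace_all("a1b2", "[^ab]", "")  # drops "1" and "2": "ab"
```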
Repetition
Another powerful feature of regular expressions is control over how many times a pattern matches.
- ?: 0 or 1 time.
- +: 1 or more times.
- *: 0 or more times.

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")

str_view(x, 'C[LX]+')
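To see exactly which text each quantifier captures, the same patterns can be passed to str_extract(), which returns the first match as a string. This is only a sketch; the variable names are illustrative.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# "CC?" : a C followed by an optional second C
opt_match <- str_extract(x, "CC?")
# "CC+" : a C followed by one or more further Cs
plus_match <- str_extract(x, "CC+")
# "C[LX]+" : a C followed by one or more characters that are L or X
set_match <- str_extract(x, "C[LX]+")

c(opt_match, plus_match, set_match)
#> [1] "CC"    "CCC"   "CLXXX"
```

The quantifiers are greedy by default, which is why "CC?" takes the second C when one is available.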
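You can also specify the number of matches precisely:
- {n}: exactly n times
- {n,}: n or more times
- {,m}: at most m times ({0,m} is the more portable spelling, since some regex engines reject {,m})
- {n,m}: between n and m times

str_view(x, "C{2}")

str_view(x, "C{2,}")

str_view(x, "C{2,3}")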
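The counted forms can be checked the same way with str_extract(). The last line goes slightly beyond the text above: appending ? to a quantifier makes it lazy, so it takes the shortest match instead of the longest. Variable names are illustrative.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

exactly_two  <- str_extract(x, "C{2}")     # stops after two Cs
at_least_two <- str_extract(x, "C{2,}")    # greedy: takes all three Cs
two_to_three <- str_extract(x, "C{2,3}")   # greedy: takes three Cs
lazy_two     <- str_extract(x, "C{2,3}?")  # lazy: stops at two Cs

c(exactly_two, at_least_two, two_to_three, lazy_two)
#> [1] "CC"  "CCC" "CCC" "CC"
```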
         
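Grouping and backreferences
The following regular expression finds all fruits whose name contains the same letter twice in a row:
str_view(fruit, "(.)\\1", match = TRUE)

- .: matches any character
- (.): wraps the match in a capturing group, referred to as \\1; with two pairs of parentheses, the groups are \\1 and \\2
- \\1: a backreference to the first group
So (.)\\1 means: match any character, then require the same character immediately after it. To match an abab pattern:
str_view(fruit, "(..)\\1", match = TRUE)
To match an abba pattern:
str_view(fruit, "(.)(.)\\2\\1", match = TRUE)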
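The three backreference patterns can be compared side by side on a short vector. The vector small below is an assumption made for this sketch (it is not the fruit dataset), chosen so that each pattern selects different words; str_subset() keeps only the elements that match.

```r
library(stringr)

# Hypothetical sample words, picked to separate the three patterns.
small <- c("banana", "coconut", "pepper", "apple")

# aa: one character followed by itself ("pp")
doubled <- str_subset(small, "(.)\\1")
# abab: a pair of characters followed by the same pair ("anan", "coco")
abab <- str_subset(small, "(..)\\1")
# abba: two characters followed by the same two in reverse ("eppe")
abba <- str_subset(small, "(.)(.)\\2\\1")

doubled
#> [1] "pepper" "apple"
abab
#> [1] "banana"  "coconut"
abba
#> [1] "pepper"
```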
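Detecting matches
To determine whether each element of a character vector matches a pattern, use str_detect().
x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE
A variant of str_detect() is str_count(): rather than a simple yes or no, it returns the number of matches in each string:
x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels are there per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99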
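Since str_detect() returns a logical vector, sum() and mean() turn it into a count and a proportion. The sketch below also shows one property of str_count() that is easy to miss: matches never overlap. The string "abababa" and the variable names are illustrative.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# TRUE counts as 1 and FALSE as 0 in arithmetic.
n_with_e     <- sum(str_detect(x, "e"))   # how many strings contain "e"
share_with_e <- mean(str_detect(x, "e"))  # what proportion contain "e"

# Matches do not overlap: "aba" is found at positions 1-3 and 5-7 only.
n_aba <- str_count("abababa", "aba")
n_aba
#> [1] 2
```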
Extracting matches
To extract the actual text of a match, use str_extract(). We will use the Harvard sentences dataset (described on Wikipedia), which is available in stringr as sentences:
length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks."
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."
#> [4] "These days a chicken leg is a rare dish."
#> [5] "Rice is often served in round bowls."
#> [6] "The juice of lemons makes fine punch."
Suppose we want to find all sentences that contain a color. We first create a vector of color names and then turn it into a single regular expression:
colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(colors, collapse = "|")
color_match
#> [1] "red|orange|yellow|green|blue|purple"
Now we can select the sentences that contain a color and then extract the color from each one to see which colors occur:
has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"
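Two details of this approach are worth a quick sketch: str_extract() returns only the first match per string, while str_extract_all() returns every match as a list; and a bare alternation also matches inside longer words. The sample strings and the \\b word-boundary variant below are assumptions added for illustration.

```r
library(stringr)

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")

# Hypothetical sample sentences.
s <- c("the red and blue flag", "a green leaf", "no colour here")

first_color <- str_extract(s, color_match)      # first match or NA
all_colors  <- str_extract_all(s, color_match)  # list with every match

first_color
#> [1] "red"   "green" NA

# "flickered" ends in "red", so the plain pattern matches inside the word.
loose <- str_extract("the fire flickered", color_match)

# Wrapping the alternation in \\b ... \\b requires whole words.
strict_match <- str_c("\\b(", color_match, ")\\b")
strict <- str_extract("the fire flickered", strict_match)
```

Here loose is "red" and strict is NA, which is usually the behavior you want when searching for color words.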
Grouped matches
str_match() gives each individual group. Instead of a character vector it returns a matrix, with one column for the complete match followed by one column for each group:
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_match(noun)
#>       [,1]         [,2]  [,3]
#>  [1,] "the smooth" "the" "smooth"
#>  [2,] "the sheet"  "the" "sheet"
#>  [3,] "the depth"  "the" "depth"
#>  [4,] "a chicken"  "a"   "chicken"
#>  [5,] "the parked" "the" "parked"
#>  [6,] "the sun"    "the" "sun"
#>  [7,] "the huge"   "the" "huge"
#>  [8,] "the ball"   "the" "ball"
#>  [9,] "the woman"  "the" "woman"
#> [10,] "a helps"    "a"   "helps"
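The shape of the result is easier to see on a tiny input. In this sketch the vector phrases is made up for illustration; note that a string with no match produces a row of NA values rather than being dropped.

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"

# Hypothetical input: two phrases that match and one that does not.
phrases <- c("the cat sat", "a dog", "nothing")

m <- str_match(phrases, noun)

dim(m)   # one row per input string; full match plus two groups
#> [1] 3 3
m[1, ]   # full match, article, following word
#> [1] "the cat" "the"     "cat"
m[3, ]   # no match, so every column is NA
#> [1] NA NA NA
```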
Replacing matches
str_replace() replaces the first match with a new string, and str_replace_all() replaces every match.
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"
By supplying a named vector, str_replace_all() can perform several replacements at once:
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"
Instead of replacing with a fixed string, you can use backreferences to insert groups from the match. The following code swaps the second and third words:
sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."
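Backreferences in the replacement string are handy for reordering parts of a structured value. As a minimal sketch, assuming dates written as year-month-day (the sample dates and variable names are made up), the three captured groups can be emitted in reverse order:

```r
library(stringr)

# Hypothetical ISO-style dates.
iso <- c("2024-01-15", "1999-12-31")

# Capture year, month and day, then write them back as day/month/year.
dmy <- str_replace(iso, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
dmy
#> [1] "15/01/2024" "31/12/1999"

# str_replace() changes only the first match; str_replace_all() changes all.
first_only <- str_replace("banana", "a", "o")
every_one  <- str_replace_all("banana", "a", "o")
```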
Splitting
str_split() splits a string into pieces. For example, we can split sentences into words:
sentences %>%
  head(5) %>%
  str_split(" ")
#> [[1]]
#> [1] "The" "birch" "canoe" "slid" "on" "the"
#> [7] "smooth" "planks."
#>
#> [[2]]
#> [1] "Glue" "the" "sheet" "to"
#> [5] "the" "dark" "blue" "background."
#>
#> [[3]]
#> [1] "It's" "easy" "to" "tell" "the" "depth" "of"
#> [8] "a" "well."
#>
#> [[4]]
#> [1] "These" "days" "a" "chicken" "leg" "is"
#> [7] "a" "rare" "dish."
#>
#> [[5]]
#> [1] "Rice" "is" "often" "served" "in" "round" "bowls."
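Because str_split() returns a list with one element per input string, a single string is usually unwrapped with [[1]]. The sketch below also uses two arguments not mentioned above, n (maximum number of pieces) and simplify = TRUE (return a matrix instead of a list); the input strings are illustrative.

```r
library(stringr)

# One input string, so take the first (and only) list element.
pieces <- str_split("The birch canoe", " ")[[1]]
pieces
#> [1] "The"   "birch" "canoe"

# "|" is a regex metacharacter, so it is escaped as "\\|".
# n = 2 stops after the first split; simplify = TRUE returns a matrix.
parts <- str_split("a|b|c|d", "\\|", n = 2, simplify = TRUE)
parts
#>      [,1] [,2]
#> [1,] "a"  "b|c|d"
```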
Locating matches
str_locate() returns the start and end position of the first match in each string.
x <- c("apple", "banana", "pear")
str_locate(x, "a")
#>      start end
#> [1,]     1   1
#> [2,]     2   2
#> [3,]     3   3
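The positions returned by str_locate() can be fed straight into str_sub() to pull out the matched text, and str_locate_all() reports every match rather than only the first. A small sketch with illustrative variable names:

```r
library(stringr)

# First match of "an" in "banana": characters 2 to 3.
loc <- str_locate("banana", "an")
loc
#>      start end
#> [1,]     2   3

# Use the positions to extract the matched substring.
piece <- str_sub("banana", loc[1, "start"], loc[1, "end"])

# All matches of "a": one matrix per input string, one row per match.
all_a <- str_locate_all("banana", "a")[[1]]
all_a[, "start"]
#> [1] 2 4 6
```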


![image](https://ucc.alicdn.com/pic/developer-ecology/bc26bc6ee06a46b39877d0d648d625f0.png?x-oss-process=image/resize,w_1400/format,webp)

![image](https://ucc.alicdn.com/pic/developer-ecology/411854814ee84ff98de9079430b87ba2.png?x-oss-process=image/resize,w_1400/format,webp)

![image](https://ucc.alicdn.com/pic/developer-ecology/cf5020ec37d64f9998439046390a095a.png?x-oss-process=image/resize,w_1400/format,webp)

![image](https://ucc.alicdn.com/pic/developer-ecology/fae1d251d91a41a29e946bba2fe3c64d.png?x-oss-process=image/resize,w_1400/format,webp)

