R for Data Science | Chapter 10 Overview (Part 2) - Alibaba Cloud Developer Community

2022-05-20

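Summary: This chapter covers string handling and how to combine it with regular expressions for accurate pattern matching.

Pattern matching with regular expressions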

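We will learn regular expressions with str_view() and str_view_all(). Both functions take a character vector and a regular expression and show how they match. (In stringr 1.5.0 and later, str_view() shows every match and str_view_all() is deprecated.)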

Basic matches

str_view() shows whether each string matches the pattern and, if it does, highlights the match:

library(stringr)
x <- c("apple", "banana", "pear")
str_view(x, "an")

A slightly more complex pattern uses ., which matches any character (except a newline):

str_view(x, ".a.")
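
Because str_view() only displays its result, it can help to check the same pattern programmatically. The following is a minimal sketch that uses str_detect() and str_extract() (both covered later in this article) on the same vector; the variable names has_dot_match and dot_match are illustrative only.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# ".a." needs one character on each side of an "a"
has_dot_match <- str_detect(x, ".a.")
has_dot_match
#> [1] FALSE  TRUE  TRUE
# "apple" fails because its only "a" is the first character

# str_extract() returns the first matched substring, or NA if there is none
dot_match <- str_extract(x, ".a.")
dot_match
#> [1] NA    "ban" "ear"
```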

Anchors

^ matches at the start of the string.
$ matches at the end of the string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")

str_view(x, "a$")
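
The two anchors can also be combined to force a pattern to match the whole string. A small sketch, assuming an illustrative vector that is not part of the original example:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Without anchors, the pattern may match anywhere in the string
anywhere <- str_detect(x, "apple")
anywhere
#> [1] TRUE TRUE TRUE

# With both anchors, the entire string must be exactly "apple"
whole <- str_detect(x, "^apple$")
whole
#> [1] FALSE  TRUE FALSE
```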

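Character classes and alternatives

Many special patterns match more than one character:

.: matches any character except a newline
\d: matches any digit
\s: matches any whitespace character (such as a space, tab, or newline)
[abc]: matches a, b, or c
[^abc]: matches any character except a, b, or c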

Note: to create a regular expression containing \d or \s, you must escape the \ inside the R string, so you need to type "\\d" or "\\s".
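
To make the escaping rule concrete, here is a short sketch on an illustrative vector (the data and variable names are assumptions made for this example). writeLines() prints the string as the regex engine receives it.

```r
library(stringr)

x <- c("cab 42", "abc", "b\tc")

# The R string "\\d" holds two characters: a backslash and a "d"
digit_pattern <- "\\d"
writeLines(digit_pattern)
#> \d

has_digit <- str_detect(x, digit_pattern)
has_digit
#> [1]  TRUE FALSE FALSE

# "\\s" matches any whitespace, including the tab in the third string
has_space <- str_detect(x, "\\s")
has_space
#> [1]  TRUE FALSE  TRUE

# [^abc] finds the first character that is not a, b, or c
first_other <- str_extract(x, "[^abc]")
first_other
#> [1] " "  NA   "\t"
```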

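Repetition

Another powerful feature of regular expressions is that they let you control how many times a pattern matches.

?: 0 or 1 time.
+: 1 or more times.
*: 0 or more times.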

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")

str_view(x, "CC+")

str_view(x, 'C[LX]+')

You can also specify the number of matches precisely:

{n}: exactly n times
{n,}: n or more times
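{,m}: at most m times (stringr's regex engine may not accept this form; the equivalent {0,m} is the safer spelling)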
{n,m}: between n and m times

str_view(x, "C{2}")

str_view(x, "C{2,}")

str_view(x, "C{2,3}")
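
Since the highlighted output is not reproduced here, the following sketch uses str_extract() to return the first match of each quantifier pattern on the same string. The names patterns and first_match are illustrative only.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# str_extract() is vectorised over the pattern, so each pattern is tried on x
patterns <- c("CC?", "CC+", "C[LX]+", "C{2}", "C{2,}", "C{2,3}")
first_match <- str_extract(x, patterns)
names(first_match) <- patterns
first_match
#>     CC?     CC+  C[LX]+    C{2}   C{2,}  C{2,3}
#>    "CC"   "CCC" "CLXXX"    "CC"   "CCC"   "CCC"
```

Quantifiers are greedy by default, which is why "CC+" and "C{2,3}" both consume all three C characters.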

Grouping and backreferences

The following regular expression finds all fruits whose names contain the same letter twice in a row:

str_view(fruit, "(.)\\1", match = TRUE)

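.: matches any character
(.): wraps the match in a capturing group, which is referred to as \\1; with two groups, they are \\1 and \\2.
\\1: a backreference to the first group.
So (.)\\1 means: match any character, then require the same character immediately after it. To match an abab pattern instead: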

str_view(fruit, "(..)\\1", match = TRUE)

To match an abba pattern:

str_view(fruit, "(.)(.)\\2\\1", match = TRUE)
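
The three patterns can be compared side by side with str_detect(). This is a sketch on a small illustrative vector rather than on the full fruit dataset:

```r
library(stringr)

x <- c("banana", "apple", "pepper", "kiwi")

# aa: the same character twice in a row ("pp" in apple and pepper)
doubled <- str_detect(x, "(.)\\1")
doubled
#> [1] FALSE  TRUE  TRUE FALSE

# abab: a pair of characters repeated ("anan" in banana, "pepe" in pepper)
abab <- str_detect(x, "(..)\\1")
abab
#> [1]  TRUE FALSE  TRUE FALSE

# abba: two characters followed by the same two reversed ("eppe" in pepper)
abba <- str_detect(x, "(.)(.)\\2\\1")
abba
#> [1] FALSE FALSE  TRUE FALSE
```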

Detecting matches

To determine whether each element of a character vector matches a pattern, use str_detect().

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE

A variation on str_detect() is str_count(): rather than a simple yes or no, it returns the number of matches in each string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels are there per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99
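
Because str_detect() returns a logical vector, it combines naturally with sum() and mean(). The sketch below also shows that str_count() counts non-overlapping matches; the variable names are illustrative.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# TRUE counts as 1, so sum() gives the number of matching strings
n_ending_a <- sum(str_detect(x, "a$"))
n_ending_a
#> [1] 1

# mean() gives the proportion of strings that match
prop_with_e <- mean(str_detect(x, "e"))
prop_with_e
#> [1] 0.6666667

# Matches never overlap: "aba" is found at positions 1-3 and 5-7 only
overlap <- str_count("abababa", "aba")
overlap
#> [1] 2
```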

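Extracting matches

To extract the actual text of a match, use str_extract(). We will use the Harvard sentences dataset described on Wikipedia, which stringr provides as sentences: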

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks."
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."
#> [4] "These days a chicken leg is a rare dish."
#> [5] "Rice is often served in round bowls."
#> [6] "The juice of lemons makes fine punch."

Suppose we want to find all sentences that contain a color. We first create a vector of color names and then turn it into a single regular expression:

colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(colors, collapse = "|")
color_match
#> [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a color and then extract the color from each to see which one it is:

has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"
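
Two details are worth checking on a small example: str_extract() returns only the first match per string, and a bare alternation also matches inside longer words. The sketch below uses made-up sentences, and the word-boundary variant is one possible refinement rather than part of the original workflow.

```r
library(stringr)

color_match <- "red|orange|yellow|green|blue|purple"
x <- c("The red car passed a blue van.", "A green leaf fell.", "The fire flickered.")

# Only the first match per string; note "red" inside "flickered"
first_color <- str_extract(x, color_match)
first_color
#> [1] "red"   "green" "red"

# str_extract_all() returns a list with every match per string
all_colors <- str_extract_all(x, color_match)
all_colors[[1]]
#> [1] "red"  "blue"

# Wrapping the alternation in \b...\b restricts it to whole words
bounded <- str_c("\\b(", color_match, ")\\b")
whole_words <- str_extract(x, bounded)
whole_words
#> [1] "red"   "green" NA
```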

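Grouped matches

str_match() gives each individual group. Instead of a character vector, it returns a matrix with one column for the complete match, followed by one column for each group: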

noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_match(noun)
#>       [,1]         [,2]  [,3]
#>  [1,] "the smooth" "the" "smooth"
#>  [2,] "the sheet"  "the" "sheet"
#>  [3,] "the depth"  "the" "depth"
#>  [4,] "a chicken"  "a"   "chicken"
#>  [5,] "the parked" "the" "parked"
#>  [6,] "the sun"    "the" "sun"
#>  [7,] "the huge"   "the" "huge"
#>  [8,] "the ball"   "the" "ball"
#>  [9,] "the woman"  "the" "woman"
#> [10,] "a helps"    "a"   "helps"
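
The matrix returned by str_match() can be indexed by column to pull out a single group. A minimal sketch on three made-up strings, where the last one has no match and therefore yields a row of NA:

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
x <- c("the cat sat", "a dog ran", "no article")

m <- str_match(x, noun)
dim(m)
#> [1] 3 3

# Column 1 is the full match, column 2 the article, column 3 the next word
articles <- m[, 2]
articles
#> [1] "the" "a"   NA

next_words <- m[, 3]
next_words
#> [1] "cat" "dog" NA
```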

Replacing matches

str_replace() replaces the first match in each string with a new string; str_replace_all() replaces every match.

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"

By supplying a named vector, str_replace_all() can perform several replacements at once:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"

Besides replacing matches with a fixed string, you can use backreferences to insert groups from the match. In the code below, we swap the second and third words:

sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."
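
The same backreference technique works for any reordering task. As a small sketch with made-up data, the following turns "last, first" names into "first last":

```r
library(stringr)

full_names <- c("Doe, Jane", "Smith, John")

# \\1 is the surname, \\2 the given name; the replacement swaps them
swapped <- str_replace(full_names, "(\\w+), (\\w+)", "\\2 \\1")
swapped
#> [1] "Jane Doe"   "John Smith"
```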

Splitting

str_split() splits a string into pieces. For example, we can split sentences into words:

sentences %>%
  head(5) %>%
  str_split(" ")
#> [[1]]
#> [1] "The" "birch" "canoe" "slid" "on" "the"
#> [7] "smooth" "planks."
#>
#> [[2]]
#> [1] "Glue" "the" "sheet" "to"
#> [5] "the" "dark" "blue" "background."
#>
#> [[3]]
#> [1] "It's" "easy" "to" "tell" "the" "depth" "of"
#> [8] "a" "well."
#>
#> [[4]]
#> [1] "These" "days" "a" "chicken" "leg" "is"
#> [7] "a" "rare" "dish."
#>
#> [[5]]
#> [1] "Rice" "is" "often" "served" "in" "round"
#> [7] "bowls."
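
Since str_split() returns a list, a few variations are often convenient. The sketch below shows extracting the single element, returning a matrix with simplify = TRUE, and splitting on word boundaries with boundary("word") so that punctuation is dropped. The inputs are illustrative.

```r
library(stringr)

# For a single string, take the first list element to get a plain vector
pieces <- str_split("a,b,c", ",")[[1]]
pieces
#> [1] "a" "b" "c"

# simplify = TRUE returns a character matrix, padded with "" where needed
mat <- str_split(c("a b", "c d e"), " ", simplify = TRUE)
dim(mat)
#> [1] 2 3

# Splitting on word boundaries removes the trailing full stop
words_only <- str_split("Rice is often served in round bowls.", boundary("word"))[[1]]
words_only
#> [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls"
```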

Locating matches

str_locate() gives the start and end positions of the first match in each string.

x <- c("apple", "banana", "pear")
str_locate(x,"a")
#>      start end
#> [1,]     1   1
#> [2,]     2   2
#> [3,]     3   3
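
The positions returned by str_locate() can be passed to str_sub() to pull out the matched text, and str_locate_all() reports every match rather than just the first. A short sketch, with the vowel pattern chosen only for illustration:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Locate the first run of vowels in each string
loc <- str_locate(x, "[aeiou]+")
loc
#>      start end
#> [1,]     1   1
#> [2,]     2   2
#> [3,]     2   3

# Use the start and end columns to extract the matched substrings
first_vowels <- str_sub(x, loc[, "start"], loc[, "end"])
first_vowels
#> [1] "a"  "a"  "ea"

# str_locate_all() returns a list with one matrix per input string
all_a <- str_locate_all("banana", "a")[[1]]
all_a[, "start"]
#> [1] 2 4 6
```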
