R for Data Science | Section 3.4: Overview and Exercise Solutions

2022-05-20

Summary: Using operations based on variable names, select() lets you quickly narrow a data frame down to a useful subset of variables.

3.4 Selecting columns with select()

select() works on variable names and lets you reduce a data set to just the variables you need. Basic usage is as follows:

Select columns by name

select(flights, year, month, day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # ... with 3.368e+05 more rows

Select all columns between year and day (inclusive)

select(flights, year:day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # ... with 3.368e+05 more rows

Select all columns except those from year to day (year and day are themselves excluded)

select(flights, -(year:day))
#> # A tibble: 336,776 × 16
#>   dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>      <int>          <int>     <dbl>    <int>          <int>
#> 1      517            515         2      830            819
#> 2      533            529         4      850            830
#> 3      542            540         2      923            850
#> 4      544            545        -1     1004           1022
#> 5      554            600        -6      812            837
#> 6      554            558        -4      740            728
#> # ... with 3.368e+05 more rows, and 11 more variables:
#> #   arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
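
To see how the range and its negation relate without loading the full flights data, here is a minimal sketch. The data frame toy is a made-up stand-in (one invented row, column names borrowed from flights), not part of the original article.

```r
library(dplyr)

# Hypothetical stand-in for flights: same leading column names, one made-up row.
toy <- data.frame(
  year = 2013L, month = 1L, day = 1L,
  dep_time = 517L, dep_delay = 2
)

inside  <- select(toy, year:day)     # year, month, day
outside <- select(toy, -(year:day))  # every column outside that range

names(inside)
names(outside)

# Together the two selections cover every column exactly once.
covers_all <- setequal(c(names(inside), names(outside)), names(toy))
covers_all
```

The two selections are complementary: a column ends up in exactly one of them.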

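You can also use a number of helper functions inside select():

starts_with("abc"): matches names that begin with "abc".
ends_with("xyz"): matches names that end with "xyz".
contains("ijk"): matches names that contain "ijk".
matches("(.)\\1"): selects variables whose names match a regular expression. This one matches any name that contains a repeated character. Note that the backslash must be doubled inside an R string.
num_range("x", 1:3): matches x1, x2, and x3.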
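
The helpers listed above can be tried on a small data frame. This is only an illustrative sketch: the data frame df and its column names are invented so that each helper has something to match.

```r
library(dplyr)

# Hypothetical columns, named only to exercise each helper.
df <- data.frame(
  abc_1 = 1, val_xyz = 2, mid_ijk = 3,
  aa = 4, x1 = 5, x2 = 6, x3 = 7
)

starts  <- names(select(df, starts_with("abc")))
ends    <- names(select(df, ends_with("xyz")))
has_ijk <- names(select(df, contains("ijk")))
# Regular expression with a back-reference: a character followed by itself.
repeated <- names(select(df, matches("(.)\\1")))
numbered <- names(select(df, num_range("x", 1:3)))

list(starts, ends, has_ijk, repeated, numbered)
```

In this toy data only aa contains two identical adjacent characters, so it is the only column the regular expression picks up.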

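3.4 Exercises

The answers below are for reference only; if you spot a mistake, please point it out in the comments.

Question 1

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from the flights data set.

Solution

Method 1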

select(flights, dep_time, dep_delay, arr_time, arr_delay)

Method 2

select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")

Method 3

select(flights, 4, 6, 7, 9)

Method 4

select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

Method 5

select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

Method 6

variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, all_of(variables))

Method 7

select(flights, starts_with("dep_"), starts_with("arr_"))
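
As a quick sanity check that these approaches agree, the sketch below runs several of them on a made-up one-row data frame whose first nine column names follow the order in flights. The object names toy, target, and results are illustrative assumptions, not from the original article.

```r
library(dplyr)

# Hypothetical stand-in with the first nine column names of flights, in order.
toy <- data.frame(
  year = 2013L, month = 1L, day = 1L,
  dep_time = 517L, sched_dep_time = 515L, dep_delay = 2,
  arr_time = 830L, sched_arr_time = 819L, arr_delay = 11
)
target <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

results <- list(
  by_name     = select(toy, dep_time, dep_delay, arr_time, arr_delay),
  by_position = select(toy, 4, 6, 7, 9),
  by_all_of   = select(toy, all_of(target)),
  by_any_of   = select(toy, any_of(target)),
  by_prefix   = select(toy, starts_with("dep_"), starts_with("arr_"))
)

# TRUE for a method if it picks the target columns in the same order.
same_columns <- sapply(results, function(x) identical(names(x), target))
same_columns
```

Selecting by position (Method 3) only works while the column order stays the same, so the name-based methods are usually the safer choice.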

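Question 2

What happens if you include the name of a variable multiple times in a select() call?

Solution

select() ignores the duplicates. Any duplicated variable is included only once, at the position where it first appears. select() raises no error or warning and prints no message when duplicates are present.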

select(flights, year, month, day, year, year)

This behavior is very useful: it means we can combine select() with everything() to change the column order easily, without having to name every column.

select(flights, arr_delay, everything())
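
Both behaviors are easy to confirm on a small example. The sketch below uses a made-up data frame toy with a few flights-style column names; it is an illustration, not output from the original article.

```r
library(dplyr)

# Hypothetical one-row stand-in for flights.
toy <- data.frame(year = 2013L, month = 1L, day = 1L, arr_delay = 11)

# Repeated names are kept once, at the position of their first appearance.
dedup <- names(select(toy, year, month, day, year, year))

# everything() appends the columns not already named, so arr_delay moves first.
reordered <- names(select(toy, arr_delay, everything()))

dedup
reordered
```

Because arr_delay is already selected, everything() does not add it a second time; it only fills in the remaining columns.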

Question 3

What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c(
"year", "month", "day", "dep_delay", "arr_delay"
)

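Solution

one_of() selects variables using a character vector rather than unquoted variable names. This is useful because a character vector of variable names is easier to generate programmatically than unquoted variable names, whereas unquoted names are easier to type by hand.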

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))

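In recent versions of dplyr, one_of() has been superseded by two functions: all_of() and any_of(). When every variable in the vector is present in the data frame, these functions behave alike.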

select(flights, any_of(vars))

select(flights, all_of(vars))

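The functions differ in how strict they are. If one of the variable names does not exist, all_of() raises an error, whereas any_of() ignores it. one_of() issues a warning when it encounters an unknown column.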
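
The difference in strictness can be observed directly. In this sketch the data frame df and the deliberately missing name no_such_column are assumptions made for illustration; the error and the warning are caught so the script runs to the end.

```r
library(dplyr)

df <- data.frame(year = 2013L, month = 1L, day = 1L)
vars <- c("year", "month", "no_such_column")  # last name is deliberately absent

# any_of() silently skips names that are not in the data.
any_of_names <- names(select(df, any_of(vars)))

# all_of() raises an error for the missing name; record that it failed.
all_of_failed <- tryCatch(
  {
    select(df, all_of(vars))
    FALSE
  },
  error = function(e) TRUE
)

# one_of() warns about the unknown column but still returns the others.
one_of_warned <- FALSE
one_of_names <- names(withCallingHandlers(
  select(df, one_of(vars)),
  warning = function(w) {
    one_of_warned <<- TRUE
    invokeRestart("muffleWarning")
  }
))

any_of_names
all_of_failed
one_of_warned
one_of_names
```

A reasonable rule of thumb is to use all_of() when a missing column indicates a mistake, and any_of() when the columns are genuinely optional.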

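Question 4

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?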

select(flights, contains("TIME"))

Solution

By default, contains() ignores case.

select(flights, contains("TIME"))
#> # A tibble: 336,776 x 6
#>   dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
#>      <int>          <int>    <int>          <int>    <dbl> <dttm>             
#> 1      517            515      830            819      227 2013-01-01 05:00:00
#> 2      533            529      850            830      227 2013-01-01 05:00:00
#> 3      542            540      923            850      160 2013-01-01 05:00:00
#> 4      544            545     1004           1022      183 2013-01-01 05:00:00
#> 5      554            600      812            837      116 2013-01-01 06:00:00
#> 6      554            558      740            728      150 2013-01-01 05:00:00
#> # … with 336,770 more rows

To change the default, add the argument ignore.case = FALSE.

select(flights, contains("TIME", ignore.case = FALSE))
#> # A tibble: 336,776 x 0
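
The same contrast can be reproduced on a small data frame. The object toy below is a made-up stand-in with three flights-style column names, used here only to illustrate the ignore.case argument.

```r
library(dplyr)

# Hypothetical one-row stand-in for flights.
toy <- data.frame(dep_time = 517L, air_time = 227, carrier = "UA")

# Default: matching is case-insensitive, so "TIME" matches the *_time columns.
insensitive <- names(select(toy, contains("TIME")))

# Case-sensitive: no column name contains upper-case "TIME".
sensitive <- select(toy, contains("TIME", ignore.case = FALSE))

insensitive
ncol(sensitive)
```

The case-sensitive call returns a data frame with zero columns rather than an error, which matches the empty tibble shown above.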
