Merge with dplyr()
Dplyr提供了一种很好的、方便的组合数据集的方法。
Left_join()
right_join()
inner_join()
full_join()
首先,我们建立两个数据集。表1包含两个变量,ID和y,而表2包含了ID和z。在每种情况下,我们都需要有一个关键变量。在本例中,ID是关键变量。该函数将在两个表中寻找相同的值,并将返回值添加到表1的右侧。
library(dplyr) df_primary <- tribble( ~ID, ~y, "A", 5, "B", 5, "C", 8, "D", 0, "F", 9) df_secondary <- tribble( ~ID, ~z, "A", 30, "B", 21, "C", 22, "D", 25, "E", 29)
> df_primary # A tibble: 5 x 2 ID y <chr> <dbl> 1 A 5 2 B 5 3 C 8 4 D 0 5 F 9 > df_secondary # A tibble: 5 x 2 ID z <chr> <dbl> 1 A 30 2 B 21 3 C 22 4 D 25 5 E 29 left_join()
合并两个数据集的最常见方法是使用left_join()函数。我们可以从下面的图片中看到,关键变量完美地匹配了两个数据集中的行A、B、C和D。但是,剩下E和F。我们如何看待这两个观察结果?使用left_join(),我们将在原始表中保留所有变量,而不考虑目标表中没有匹配到的变量。在我们的示例中,变量E不存在于表1中。因此,该行将被删除。变量F来自原始表,因此它将保留在left_join()之后,并在列z显示为NA。如下图所示:
210510_1
left_join(df_primary, df_secondary, by ='ID')
> left_join(df_primary, df_secondary, by ='ID') # A tibble: 5 x 3 ID y z <chr> <dbl> <dbl> 1 A 5 30 2 B 5 21 3 C 8 22 4 D 0 25 5 F 9 NA
right_join()
right_join()函数的工作方式与left_join()完全相同。唯一的区别是被删除的行。right_join()将目标数据集中的变量E存在于新表中,并为列y取值NA。
210510_2
right_join(df_primary, df_secondary, by = 'ID')
> right_join(df_primary, df_secondary, by = 'ID') # A tibble: 5 x 3 ID y z <chr> <dbl> <dbl> 1 A 5 30 2 B 5 21 3 C 8 22 4 D 0 25 5 E NA 29
inner_join()
当两个数据集不匹配时,我们可以考虑只返回两个数据集中存在的行。inner_join()可以排除不匹配的行。
210510_3
inner_join(df_primary, df_secondary, by ='ID')
> inner_join(df_primary, df_secondary, by ='ID') # A tibble: 4 x 3 ID y z <chr> <dbl> <dbl> 1 A 5 30 2 B 5 21 3 C 8 22 4 D 0 25
full_join()
full_join()函数保留所有的观察结果,并用NA替换缺失的值。
210510_4
full_join(df_primary, df_secondary, by = 'ID')
> full_join(df_primary, df_secondary, by = 'ID') # A tibble: 6 x 3 ID y z.y <chr> <dbl> <dbl> 1 A 5 30 2 B 5 21 3 C 8 22 4 D 0 25 5 F 9 NA 6 E NA 29
Multiple keys pairs
如果有多个关键变量时,则需一起使用。
210510_5
df_primary <- tribble( ~ID, ~year, ~items, "A", 2015,3, "A", 2016,7, "A", 2017,6, "B", 2015,4, "B", 2016,8, "B", 2017,7, "C", 2015,4, "C", 2016,6, "C", 2017,6) df_secondary <- tribble( ~ID, ~year, ~prices, "A", 2015,9, "A", 2016,8, "A", 2017,12, "B", 2015,13, "B", 2016,14, "B", 2017,6, "C", 2015,15, "C", 2016,15, "C", 2017,13) left_join(df_primary, df_secondary, by = c('ID', 'year'))
> left_join(df_primary, df_secondary, by = c('ID', 'year')) # A tibble: 9 x 4 ID year items prices <chr> <dbl> <dbl> <dbl> 1 A 2015 3 9 2 A 2016 7 8 3 A 2017 6 12 4 B 2015 4 13 5 B 2016 8 14 6 B 2017 7 6 7 C 2015 4 15 8 C 2016 6 15 9 C 2017 6 13