是否在不同的电影中,总是能看到那些熟悉却叫不上名字的演员么,想知道他们之间相互的关系么?本文将带你一步一步地挖掘出他们的关系。想更了解自己的偶像么,那就试试吧。
以下为译文
我最近换了个工作,在入职之前,我看了很多电影来打发闲暇时光。然而,演员之间的联系引起了我的注意。我经常回想,为什么我总能看到他们在一起工作呢?然后,我就使用 IMDB 电影数据库来分析演员之间的关联。
# Load up the useful libraries for building and visualizing networks:
library(reshape2)
library(network)
library(sna)
library(ggplot2)
library(GGally)
library(readr)
system("ls ../input")
# Read in the data, and strip out the unnecessary details:
data <- read_csv("../input/movie_metadata.csv")
network: Classes for Relational Data
Version 1.13.0 created on 2015-08-31.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
Mark S. Handcock, University of California -- Los Angeles
David R. Hunter, Penn State University
Martina Morris, University of Washington
Skye Bender-deMoll, University of Washington
For citation information, type citation("network").
Type help("network-package") to get started.
Loading required package: statnet.common
sna: Tools for Social Network Analysis
Version 2.4 created on 2016-07-23.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
For citation information, type citation("sna").
Type help(package="sna") to get started.
Parsed with column specification:
cols(
.default = col_integer(),
color = col_character(),
director_name = col_character(),
actor_2_name = col_character(),
genres = col_character(),
actor_1_name = col_character(),
movie_title = col_character(),
actor_3_name = col_character(),
plot_keywords = col_character(),
movie_imdb_link = col_character(),
language = col_character(),
country = col_character(),
content_rating = col_character(),
imdb_score = col_double(),
aspect_ratio = col_double()
)
See spec(...) for full column specifications.
输出结果
Warning message:
“4 parsing failures.
row col expected actual
2324 budget an integer 2400000000
2989 budget an integer 12215500000
3006 budget an integer 2500000000
3860 budget an integer 4200000000
”
需要做的第一件事就是构建可生成网络图的对象。我们有很多库可以完成这件事情,而我选择了sna和network两个库。另外,我推荐包含GGally的GGplot2库可以将网络完美地可视化。我决定对这些进行一层简单的封装来获得我们需要的对象。
getIMDBGraph<-function(data,
firstyear =1,
lastyear = 3000,
genre = FALSE,
minscore = 0,
maxscore = 10){
if(is.character(genre)){
data<-data[grep(genre, data$genres),]}
data<-subset(data,
data$title_year >= firstyear &
data$title_year <= lastyear &
data$imdb_score <= maxscore &
data$imdb_score >= minscore)
df<- data.frame(data$movie_title,
data$actor_1_name,
data$actor_2_name,
data$actor_3_name)
df<- melt(df, id.vars = 'data.movie_title')
names(df)<-c('title',
'actornum',
'actor')
df<-df[,c(1,3)]
edges<-merge(x = df,
y = df,
by = 'title')
edges<-subset(edges,edges$actor.x != edges$actor.y)
edgelist<-as.matrix(edges[,c(2:3)])
graph<-network(edgelist)
return(graph)}
现在就可以传递之前导入的按时间排序的数据了。在这次案例中,我使用的数据是2000年到2009年的。
graph <- getIMDBGraph(data, firstyear = 2000, lastyear = 2009)
Warning message:
“attributes are not identical across measure variables; they will be dropped”
试用GGplot2 的扩展可以很容易地画出交互性好的网络图。下面是一些图中的参数。
p <- ggnet2(graph,
size = 'degree', # feature by which the nodes are scaled
size.min = 40, # lower bound of the nodes to be plotted
label = T, #plot labels?
# mode = 'kamadakawai', # plotting algo for node placement (defaults to FR)
label.size = 2.5,
node.size = 7,
node.color = 'grey70',
node.alpha = 0.7,
edge.alpha = 0.2,
edge.size = 0.3,
legend.size = FALSE, # I don't want a legend
legend.position = 'None')
# Now add a title and subtitle
p <- p + ggtitle('Network of actor connections',
subtitle = '2000\'s')
# And format the title
p <- p + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
p
输出结果
Loading required package: scales
Attaching package: ‘scales’
The following object is masked from ‘package:readr’:
col_factor
size.min removed 3072 nodes out of 3144
这个图的密度由我们过滤节点的方法决定。在这个案例中,我们一般用节点的度来过滤,也就是说一个节点和其它节点关联的数量。
raph <- getIMDBGraph(data, firstyear = 1990, lastyear = 1999)
p <- ggnet2(graph,
size = 'degree',
size.min = 18,
label = T,
# mode = 'kamadakawai',
label.size = 2.5,
node.size = 7,
node.color = 'grey70',
node.alpha = 0.7,
edge.alpha = 0.2,
edge.size = 0.3,
legend.size = FALSE,
legend.position = 'None')
# Now add a title and subtitle
p <- p + ggtitle('Network of actor connections',
subtitle = '90\'s')
# And format the title
p <- p + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
p
输出结果
Warning message:
“attributes are not identical across measure variables; they will be dropped”size.min removed 1256 nodes out of 1336
接下来让我们尝试90年代的动作电影:
genre <- 'Action'
firstyear <- 1990
lastyear <- 1999
graph <- getIMDBGraph(data, firstyear = firstyear, lastyear = lastyear, genre = genre)
p <- ggnet2(graph,
size = 'degree',
size.min = 6,
label = T,
# mode = 'kamadakawai',
label.size = 2.5,
node.size = 7,
node.color = 'grey70',
node.alpha = 0.7,
edge.alpha = 0.2,
edge.size = 0.3,
legend.size = FALSE,
legend.position = 'None')
# Now add a title and subtitle
p <- p + ggtitle('Network of actor connections',
subtitle = paste(genre, 'movies between', firstyear, '&', lastyear))
# And format the title
p <- p + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
p
输出结果
Warning message:
“attributes are not identical across measure variables; they will be dropped”size.min removed 347 nodes out of 446
那么对于80年代以后且评分7.0分以上的电影又是怎样的情况呢?为了方便浏览,我将图着了色。
genre <- 'Action'
firstyear <- 1980
minscore <- 7
graph <- getIMDBGraph(data,
firstyear = firstyear,
genre = genre,
minscore = minscore)
p <- ggnet2(graph,
size = 'degree',
size.min = 6,
label = T,
label.color = 'white',
label.size = 3,
node.color = 'grey70',
node.alpha = 0.5,
edge.alpha = 0.3,
edge.size = 0.5,
legend.size = FALSE,
legend.position = 'None')
# Now add a title and subtitle
p <- p + ggtitle('Network of actor connections', subtitle = 'from the best action films:')
# And format the title
p <- p + theme(plot.title = element_text(hjust = 0.5, colour = 'white', size = 25),
plot.subtitle = element_text(hjust = 0.5, colour = 'white', size = 17),
plot.background = element_rect(fill = 'black'))
# And plot it
p
输出结果
Warning message:
“attributes are not identical across measure variables; they will be dropped”size.min removed 392 nodes out of 518
数十款阿里云产品限时折扣中,赶紧点击领劵开始云上实践吧!