Reading a PDF in R and Mining Its Text

Copyright notice: this is an original article by the blogger chszs and may not be reproduced without the author's permission. https://blog.csdn.net/chszs/article/details/8035102

The following example reads a PDF in R, converts it to text, and runs a simple text-mining step on the result:

# here is a pdf for mining
url <- "http://www.noisyroom.net/blog/RomneySpeech072912.pdf"
dest <- tempfile(fileext = ".pdf")
download.file(url, dest, mode = "wb")

# set the path to pdftotext.exe and convert the PDF to text
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
# wait = TRUE blocks until the conversion has finished,
# so the text file exists before it is opened
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = TRUE)

# get the txt file name and open it (shell.exec is Windows-only)
filetxt <- sub("\\.pdf$", ".txt", dest)
shell.exec(filetxt)

# do something with it, e.g. a simple word cloud
library(tm)
library(wordcloud)
library(Rstem) # provides wordStem(); SnowballC::wordStem() is an alternative

txt <- readLines(filetxt) # the warning can be ignored

txt <- tolower(txt)
txt <- removeWords(txt, c("\\f", stopwords()))

corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

# stem the words
d$stem <- wordStem(row.names(d), language = "english")

# copy the words into a column, otherwise they would be lost when aggregating
d$word <- row.names(d)

# remove web addresses (very long strings)
d <- d[nchar(row.names(d)) < 20, ]

# aggregate frequency by word stem and
# keep the first word of each stem
agg_freq <- aggregate(freq ~ stem, data = d, sum)
agg_word <- aggregate(word ~ stem, data = d, function(x) x[1])

d <- cbind(freq = agg_freq[, 2], agg_word)

# sort by frequency
d <- d[order(d$freq, decreasing = TRUE), ]

# draw the word cloud
wordcloud(d$word, d$freq)

# remove the files in the session's temporary directory
file.remove(dir(tempdir(), full.names = TRUE))
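
The script derives the text file's name from the PDF's name and wraps both paths in double quotes by hand. The sketch below looks at those two steps in isolation using base R only. The helper name pdf_to_txt_name and the example paths are made up for illustration and are not part of the original post. It shows why the dot in the pattern should be escaped (an unescaped dot matches any character) and how shQuote() can produce the double-quoted form that the paste() call builds manually.

```r
# Hypothetical helper: replace a trailing ".pdf" extension with ".txt".
pdf_to_txt_name <- function(pdf_path) {
  sub("\\.pdf$", ".txt", pdf_path, ignore.case = TRUE)
}

example_path <- "C:/tmp/xpdf_report.pdf"  # made-up path for illustration

# Unescaped dot: the pattern matches "xpdf" first, not the extension.
naive <- sub(".pdf", ".txt", example_path)

# Escaped dot anchored at the end: only the extension is replaced.
safe <- pdf_to_txt_name(example_path)

print(naive)
print(safe)

# shQuote() with type = "cmd" wraps a path in double quotes,
# which is what the paste() call in the script does by hand.
quoted <- shQuote("C:/Program Files/xpdf/pdftotext.exe", type = "cmd")
print(quoted)
```

With a tempfile() name the naive pattern usually happens to work, but the anchored version does not depend on what the rest of the path looks like.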

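The aggregation step in the script sums the frequencies per word stem and keeps one surface form for each stem. A minimal sketch of that step on a toy table follows. The words, counts, and hand-assigned stems are invented for illustration (a real run would take the stems from wordStem()), and merge() is used here instead of cbind() so that the two aggregates are matched on the stem explicitly.

```r
# Toy data: words, hand-assigned stems, and made-up counts.
toy <- data.frame(
  word = c("america", "taxes", "jobs", "tax", "job"),
  stem = c("america", "tax", "job", "tax", "job"),
  freq = c(7, 5, 4, 3, 1),
  stringsAsFactors = FALSE
)

# Sort by frequency so the first word in each stem group is the most frequent.
toy <- toy[order(toy$freq, decreasing = TRUE), ]

# Sum the counts per stem, and keep the first word seen for each stem.
agg_freq <- aggregate(freq ~ stem, data = toy, FUN = sum)
agg_word <- aggregate(word ~ stem, data = toy, FUN = function(x) x[1])

# Join the two aggregates on the stem, then sort by total frequency.
result <- merge(agg_freq, agg_word, by = "stem")
result <- result[order(result$freq, decreasing = TRUE), ]
rownames(result) <- NULL

print(result)
```

Keeping the most frequent surface form as the label means the word cloud shows a readable word rather than a truncated stem.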
