The RAKE Algorithm
RAKE (Rapid Automatic Keyword Extraction) is used for keyword extraction; what it actually extracts are key phrases, and it favors longer phrases. In English, a keyword typically consists of several words, but it rarely contains punctuation or stop words such as "and", "the", and "of", or other words that carry no semantic information.
Core Idea
1. RAKE first splits a document into sentences using punctuation marks (periods, question marks, exclamation marks, commas, and so on).
2. Each sentence is then split into phrases using stop words as delimiters; these phrases are the candidates for the extracted keywords.
3. Finally, each phrase is split on whitespace into individual words. Each word is assigned a score, and the score of a phrase is obtained by summing the scores of its words.
A key point is that the co-occurrence of the words within these phrases is taken into account. The resulting definition is wordScore(w) = deg(w) / freq(w), where freq(w) is the number of times w occurs among the candidate phrases, and deg(w) is the degree of w, i.e., how often it co-occurs with words (itself included) inside candidate phrases. The score of a phrase is the sum of the scores of its words, as the sketch below shows.
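As a worked example (a minimal sketch; the phrase list below is an assumed toy input, not taken from the original post), the scoring can be computed as follows:

# Minimal sketch of RAKE word and phrase scoring.
from collections import defaultdict

phrases = ["deep learning", "deep neural networks", "learning"]  # assumed candidates

freq = defaultdict(int)  # freq(w): occurrences of w among the candidate phrases
deg = defaultdict(int)   # deg(w): co-occurrence degree, counting w itself
for phrase in phrases:
    words = phrase.split()
    for w in words:
        freq[w] += 1
        deg[w] += len(words)  # w co-occurs with every word in the phrase, itself included

word_score = {w: deg[w] / freq[w] for w in freq}
phrase_score = {p: sum(word_score[w] for w in p.split()) for p in phrases}
print(phrase_score)
# {'deep learning': 4.0, 'deep neural networks': 8.5, 'learning': 1.5}

Here "deep" scores deg/freq = 5/2 = 2.5, and the longer phrase "deep neural networks" ends up with the highest score, which reflects RAKE's preference for longer phrases.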
Two different interpretations of the word co-occurrence matrix:
1. Summing the entries in a word's column of the co-occurrence matrix gives that word's degree deg (a graph concept: the degree increases by 1 for every word it co-occurs with in a phrase, counting the word itself), and the number of times the word occurs in the text is its frequency freq.
2. Count the co-occurrences of words within a window of a pre-specified size, and use the counts of the words around a given word as that word's vector. Concretely, a word representation is defined by building such a co-occurrence matrix from a large text corpus.
An example for the first interpretation:
Each diagonal element of the co-occurrence matrix is set to the number of times the corresponding word occurs across all articles. Loop over the list of feature words to build every pairwise combination, then traverse the tokenized result of each article; whenever the two words of a pair occur in the same article, increase the weight of that pair by 1 and store it in the corresponding cell of the co-occurrence matrix.
For an implementation, see [4].
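A minimal sketch of this construction (the tokenized articles and vocabulary below are assumed toy inputs, not from the original post):

# Minimal sketch of a document-level co-occurrence matrix (interpretation 1).
from itertools import combinations

docs = [["rake", "keyword", "extraction"],
        ["keyword", "extraction", "graph"]]  # assumed tokenized articles
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

n = len(vocab)
matrix = [[0] * n for _ in range(n)]

for doc in docs:
    for w in doc:
        matrix[index[w]][index[w]] += 1  # diagonal: total occurrences of the word
    for a, b in combinations(sorted(set(doc)), 2):  # each pair in the same article
        matrix[index[a]][index[b]] += 1
        matrix[index[b]][index[a]] += 1

# deg(w) as described above: the sum of w's column in the matrix
deg = {w: sum(matrix[i][index[w]] for i in range(n)) for w in vocab}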
An example for the second interpretation:
I like deep learning. I like NLP. I enjoy flying.
Assume the window covers one word on each side of the current word (a window spanning 3 tokens in total). The word co-occurrence matrix is then as shown below:
Table 1. Word co-occurrence matrix
| count    | I | like | enjoy | deep | learning | NLP | flying | . |
|----------|---|------|-------|------|----------|-----|--------|---|
| I        | 0 | 2    | 1     | 0    | 0        | 0   | 0      | 0 |
| like     | 2 | 0    | 0     | 1    | 0        | 1   | 0      | 0 |
| enjoy    | 1 | 0    | 0     | 0    | 0        | 0   | 1      | 0 |
| deep     | 0 | 1    | 0     | 0    | 1        | 0   | 0      | 0 |
| learning | 0 | 0    | 0     | 1    | 0        | 0   | 0      | 1 |
| NLP      | 0 | 1    | 0     | 0    | 0        | 0   | 0      | 1 |
| flying   | 0 | 0    | 1     | 0    | 0        | 0   | 0      | 1 |
| .        | 0 | 0    | 0     | 0    | 1        | 1   | 1      | 0 |
The entry (I, like) = 2 because, within this 3-token window, "I like deep" and "I like NLP" each contribute one co-occurrence; the other entries follow in the same way. Note that the window does not cross sentence boundaries, which is why, for example, the (., I) entry is 0.
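A minimal sketch that reproduces Table 1 (tokenization is simplified: sentences are pre-split and "." is kept as a token, to match the table):

# Minimal sketch of a window-based co-occurrence matrix (interpretation 2).
sentences = ["I like deep learning .",
             "I like NLP .",
             "I enjoy flying ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
index = {w: i for i, w in enumerate(vocab)}

n = len(vocab)
matrix = [[0] * n for _ in range(n)]
window = 1  # one word to the left and one to the right of the current word

for sent in sentences:
    tokens = sent.split()
    for i, t in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                matrix[index[t]][index[tokens[j]]] += 1

for w, row in zip(vocab, matrix):
    print(w, row)  # prints the rows of Table 1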
References
[1] "Building a co-occurrence matrix in Python." CSDN blog post.
[2] https://mp.weixin.qq.com/s/nX_idrMlI_cv9aufnHoRwQ
[3] "Co-occurrence matrix." CSDN blog post (贾世林jiashilin).
[4] https://github.com/zelandiya/RAKE-tutorial
[5] Rose, S., Engel, D., Cramer, N., and Cowley, W. "Automatic Keyword Extraction from Individual Documents." Text Mining: Applications and Theory, 2010.
[6] "English keyword extraction with the RAKE algorithm." CSDN blog post (lizzy05).
Appendix
Code Implementation
import re
import operator
import argparse
import codecs


def isNumber(s):
    try:
        float(s) if '.' in s else int(s)
        return True
    except ValueError:
        return False


class Rake:
    def __init__(self, inputFilePath, stopwordsFilePath, outputFilePath,
                 minPhraseChar, maxPhraseLength):
        self.outputFilePath = outputFilePath
        self.minPhraseChar = minPhraseChar
        self.maxPhraseLength = maxPhraseLength
        # read documents, one document per line
        self.docs = []
        for document in codecs.open(inputFilePath, 'r', 'utf-8'):
            self.docs.append(document)
        # read stopwords and build a regex that matches any of them
        stopwords = []
        for word in codecs.open(stopwordsFilePath, 'r', 'utf-8'):
            stopwords.append(word.strip())
        stopwordsRegex = []
        for word in stopwords:
            regex = r'\b' + word + r'(?![\w-])'
            stopwordsRegex.append(regex)
        self.stopwordsPattern = re.compile('|'.join(stopwordsRegex), re.IGNORECASE)

    def separateWords(self, text):
        splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
        words = []
        for word in splitter.split(text):
            word = word.strip().lower()
            # leave numbers in the phrase, but don't count them as words,
            # since they tend to invalidate the scores of their phrases
            if len(word) > 0 and not isNumber(word):
                words.append(word)
        return words

    def calculatePhraseScore(self, phrases):
        # calculate wordFrequency and wordDegree
        wordFrequency = {}
        wordDegree = {}
        for phrase in phrases:
            wordList = self.separateWords(phrase)
            wordListDegree = len(wordList) - 1
            for word in wordList:
                wordFrequency.setdefault(word, 0)
                wordFrequency[word] += 1
                wordDegree.setdefault(word, 0)
                wordDegree[word] += wordListDegree
        for item in wordFrequency:
            wordDegree[item] = wordDegree[item] + wordFrequency[item]
        # wordScore(w) = wordDegree(w) / wordFrequency(w)
        wordScore = {}
        for item in wordFrequency:
            wordScore[item] = wordDegree[item] * 1.0 / wordFrequency[item]
        # phraseScore(p) = sum of the scores of the words in p
        phraseScore = {}
        for phrase in phrases:
            phraseScore.setdefault(phrase, 0)
            wordList = self.separateWords(phrase)
            candidateScore = 0
            for word in wordList:
                candidateScore += wordScore[word]
            phraseScore[phrase] = candidateScore
        return phraseScore

    def execute(self):
        file = codecs.open(self.outputFilePath, 'w', 'utf-8')
        for document in self.docs:
            # split a document into sentences
            sentenceDelimiters = re.compile(
                u'[.!?,;:\t\\\\"\\(\\)\\\'\u2019\u2013]|\\s\\-\\s')
            sentences = sentenceDelimiters.split(document)
            # generate all valid candidate phrases
            phrases = []
            for s in sentences:
                tmp = re.sub(self.stopwordsPattern, '|', s.strip())
                phrasesOfSentence = tmp.split("|")
                for phrase in phrasesOfSentence:
                    phrase = phrase.strip().lower()
                    if (phrase != "" and len(phrase) >= self.minPhraseChar
                            and len(phrase.split()) <= self.maxPhraseLength):
                        phrases.append(phrase)
            # score the phrases and keep the top third as keywords
            phraseScore = self.calculatePhraseScore(phrases)
            keywords = sorted(phraseScore.items(),
                              key=operator.itemgetter(1), reverse=True)
            file.write(str(keywords[0:int(len(keywords) / 3)]) + "\n")
        file.close()


def readParamsFromCmd():
    parser = argparse.ArgumentParser(
        description="This is a python implementation of RAKE (Rapid Automatic Keyword Extraction).")
    parser.add_argument('inputFilePath',
                        help='The file path of the input document(s). One line represents a document.')
    parser.add_argument('stopwordsFilePath',
                        help='The file path of the stopwords. Each line represents a word.')
    parser.add_argument('-o', '--outputFilePath',
                        help='The file path of the output (default: output.txt in the current dir).',
                        default='output.txt')
    parser.add_argument('-m', '--minPhraseChar', type=int,
                        help='The minimum number of characters of a phrase (default: 1).',
                        default=1)
    parser.add_argument('-a', '--maxPhraseLength', type=int,
                        help='The maximum length of a phrase (default: 3).',
                        default=3)
    return parser.parse_args()


if __name__ == '__main__':
    params = readParamsFromCmd().__dict__
    rake = Rake(params['inputFilePath'], params['stopwordsFilePath'],
                params['outputFilePath'], params['minPhraseChar'],
                params['maxPhraseLength'])
    rake.execute()
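A typical invocation might look like the following, assuming the script is saved as rake.py (the file names here are illustrative):

python rake.py documents.txt stopwords.txt -o output.txt -m 1 -a 3

Each line of documents.txt is treated as one document, and the top third of the candidate phrases by score is written to output.txt, one list per document.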