1. Stemming

1. beautiful and beautifully share the same stem, beauti.
2. good, better, and best have the stems good, better, and best respectively.



Code for stemming with the Porter2 algorithm:

#!pip install stemming
from stemming.porter2 import stem
# Reduce a single word to its Porter2 stem
print(stem("casually"))
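
To make the idea of suffix stripping concrete, here is a minimal sketch that needs no third-party package. It is not the Porter2 algorithm: the suffix list and the minimum stem length are assumptions chosen only to reproduce the two examples above.

```python
# Toy suffix-stripping stemmer (NOT Porter2).
# SUFFIXES and MIN_STEM_LEN are illustrative assumptions.
SUFFIXES = ["fully", "ful", "ly", "ing", "ed", "s"]  # checked longest first
MIN_STEM_LEN = 3


def toy_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


for w in ["beautiful", "beautifully", "good", "better", "best"]:
    print(w, "->", toy_stem(w))
```

Because the rules only look at surface suffixes, better and best are left untouched, which is exactly the limitation that lemmatization addresses.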


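2. Lemmatization

1. beautiful and beautifully are lemmatized to beautiful and beautifully respectively.
2. good, better, and best are lemmatized to good, good, and good.

Related paper 1: This paper discusses the different approaches to lemmatization in detail. It is a must-read for understanding how traditional lemmatization works. (http://www.ijrat.org/downloads/icatest2015/ICATEST-2015127.pdf)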



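#!pip install spacy
#!python -m spacy download en_core_web_sm
import spacy
# Load an English pipeline (older spaCy releases used the 'en' shortcut)
nlp = spacy.load("en_core_web_sm")
doc = "good better best"
# Print each token next to its lemma
for token in nlp(doc):
    print(token, token.lemma_)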
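
The difference from stemming can be sketched with a plain dictionary lookup. This is only an illustration: the table of irregular forms is a hand-written assumption, whereas real lemmatizers rely on large lexicons and part-of-speech information.

```python
# Toy dictionary-based lemmatizer.
# IRREGULAR_LEMMAS is a hand-written assumption for illustration only.
IRREGULAR_LEMMAS = {"better": "good", "best": "good"}


def toy_lemma(word):
    """Return the dictionary form if known, otherwise the word itself."""
    word = word.lower()
    return IRREGULAR_LEMMAS.get(word, word)


for w in ["beautiful", "beautifully", "good", "better", "best"]:
    print(w, "->", toy_lemma(w))
```

Unlike a stemmer, the lookup always returns a real word, and it can map better and best back to good.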


3. Word Embeddings

Related blog post: This article explains word embeddings in detail.
(https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)

(https://ronxin.github.io/wevi/)


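#!pip install gensim
from gensim.models.keyedvectors import KeyedVectors
# Load pretrained vectors in word2vec format.
# The file path is a placeholder (an assumption); the original snippet does not name a file.
word_vectors = KeyedVectors.load_word2vec_format('path/to/pretrained-vectors.bin', binary=True)
# Look up the vector for one word
word_vectors['human']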


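import gensim
# Train word2vec on your own tokenized sentences
sentence = [['first', 'sentence'], ['second', 'sentence']]
# vector_size was called size before gensim 4.0
model = gensim.models.Word2Vec(sentence, min_count=1, vector_size=300, workers=4)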
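
For intuition about what a word vector encodes, here is a minimal count-based sketch using only the standard library. It is not word2vec: each word is represented by how often it shares a sentence with every other word, and vectors are compared with cosine similarity. The sentence-level context window is an assumption made to keep the example small.

```python
import math

# Same toy corpus as in the gensim snippet
sentences = [["first", "sentence"], ["second", "sentence"]]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# vectors[w][i] counts how often vocab[i] appears in a sentence with w
vectors = {w: [0] * len(vocab) for w in vocab}
for s in sentences:
    for i, w in enumerate(s):
        for j, c in enumerate(s):
            if i != j:
                vectors[w][index[c]] += 1


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


print(vocab)
print(vectors)
print(cosine(vectors["first"], vectors["second"]))
print(cosine(vectors["first"], vectors["sentence"]))
```

Here first and second end up with identical vectors because they occur in the same context, which is the core intuition behind both count-based and predictive embeddings.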


4. Part-of-Speech Tagging

• Ashok: proper noun
• killed: verb
• the: determiner
• snake: noun
• with: preposition
• a: determiner
• stick: noun
• .: punctuation
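Paper 1: Choi's aptly titled paper "The Last Gist to the State-of-the-Art" introduces a new method called dynamic feature induction, which it reports as the state of the art for part-of-speech tagging. (https://aclweb.org/anthology/N16-1031.pdf)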



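#!pip install spacy
import spacy
# Load an English pipeline (older spaCy releases used the 'en' shortcut)
nlp = spacy.load("en_core_web_sm")
sentence = "Ashok killed the snake with a stick"
# Print each token next to its coarse part-of-speech tag
for token in nlp(sentence):
    print(token, token.pos_)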
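
As a rough illustration of what a tagger has to decide, the sketch below tags the same sentence with a tiny hand-written lexicon and two fallback rules. The lexicon and the rules are assumptions for this one sentence; statistical taggers such as spaCy's learn these decisions from annotated data instead. Tag names follow the Universal POS tag set.

```python
# Toy rule-based tagger. LEXICON and the fallback rules are
# hand-written assumptions that only cover the example sentence.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "with": "ADP",   # adposition, i.e. preposition
    "killed": "VERB",
    ".": "PUNCT",
}


def toy_pos_tag(tokens):
    """Tag tokens by lexicon lookup, then capitalization, then default to NOUN."""
    tagged = []
    for tok in tokens:
        if tok.lower() in LEXICON:
            tag = LEXICON[tok.lower()]
        elif tok[0].isupper():
            tag = "PROPN"
        else:
            tag = "NOUN"
        tagged.append((tok, tag))
    return tagged


for tok, tag in toy_pos_tag("Ashok killed the snake with a stick .".split()):
    print(tok, tag)
```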


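5. Named Entity Disambiguation

Paper 1: Huang's paper uses a deep semantic relatedness model built on deep neural networks and knowledge bases, and reports leading results on named entity disambiguation.
(https://arxiv.org/pdf/1504.07678.pdf)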



6. Named Entity Recognition

“Ram of Apple Inc. travelled to Sydney on 5th October 2017”


Ram
of
Apple ORG
Inc. ORG
travelled
to
Sydney GPE
on
5th DATE
October DATE
2017 DATE



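Paper: This excellent paper uses a bidirectional LSTM (long short-term memory) network, combining supervised and unsupervised learning, and reports state-of-the-art named entity recognition results in four languages. (https://arxiv.org/pdf/1603.01360.pdf)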


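import spacy
# The 'en' shortcut used by older releases was removed in spaCy 3
nlp = spacy.load("en_core_web_sm")
sentence = "Ram of Apple Inc. travelled to Sydney on 5th October 2017"
# Print each token next to its entity type (empty if it is not part of an entity)
for token in nlp(sentence):
    print(token, token.ent_type_)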
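
Per-token labels like the ones listed above are usually merged into entity spans before they are used. The sketch below shows one simple way to do that in plain Python. The input list is copied from the listing above, and the merging rule (join consecutive tokens with the same non-empty label) is a simplifying assumption: it would wrongly merge two adjacent entities of the same type.

```python
# Token/label pairs copied from the listing above; "" means no entity
tagged = [
    ("Ram", ""), ("of", ""), ("Apple", "ORG"), ("Inc.", "ORG"),
    ("travelled", ""), ("to", ""), ("Sydney", "GPE"), ("on", ""),
    ("5th", "DATE"), ("October", "DATE"), ("2017", "DATE"),
]


def group_entities(tagged_tokens):
    """Merge runs of consecutive tokens that share a non-empty label."""
    spans = []
    current_tokens, current_label = [], ""
    for token, label in tagged_tokens:
        if label and label == current_label:
            current_tokens.append(token)
            continue
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = ([token], label) if label else ([], "")
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


print(group_entities(tagged))
```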


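7. Sentiment Analysis

"I don't like chocolate ice cream" is a negative statement about the ice cream.

"I don't hate chocolate ice cream" can be considered a neutral statement.

Blog post 1: This post focuses on sentiment analysis of tweets about movies. (https://www.analyticsvidhya.com/blog/2016/02/step-step-guide-building-sentiment-analysis-model-graphlab/)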
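
The two example sentences show why negation matters. A minimal lexicon-based sketch of that idea follows. The word lists and the rule that a negated negative word counts as neutral are assumptions chosen to mirror the examples above, not a production approach.

```python
# Toy lexicon-based sentiment scorer.
# LEXICON, NEGATIONS, and the negation rule are illustrative assumptions.
LEXICON = {"like": 1, "love": 1, "hate": -1, "dislike": -1}
NEGATIONS = {"not", "don't", "never"}


def toy_sentiment(text):
    """Classify text as positive, negative, or neutral."""
    score = 0
    negate = False
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            polarity = LEXICON[word]
            if negate:
                # "not like" reads as negative; "not hate" only as neutral
                polarity = -1 if polarity > 0 else 0
            score += polarity
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(toy_sentiment("I don't like chocolate ice cream"))
print(toy_sentiment("I don't hate chocolate ice cream"))
```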



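8. Semantic Text Similarity

Paper 1: This paper describes the different approaches to measuring text similarity in detail. It is a must-read for a one-stop overview of the existing methods. (https://pdfs.semanticscholar.org/5b5c/a878c534aee3882a038ef9e82f46e102131b.pdf)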
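
As one very simple baseline, similarity can be approximated by word overlap. The sketch below computes the Jaccard similarity of two token sets. Note that this measures lexical overlap only, so it is a rough proxy for semantic similarity; the whitespace tokenization is an assumption.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of the lowercase, whitespace-split token sets."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(jaccard_similarity("the cat sat on the mat", "the cat lay on the mat"))
```

The two sentences share four of six distinct words, so the score is 4/6. Sentences that mean the same thing but use different words would score low, which is why embedding-based measures are often preferred.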



9. Language Identification

Blog post: This post from fastText introduces a new tool that can identify 170 languages while using 1MB of memory. (https://fasttext.cc/blog/2017/10/02/blog-post.html)



10. Text Summarization

Paper 1: This paper describes a neural attention model for abstractive sentence summarization. (https://arxiv.org/pdf/1509.00685.pdf)



# Note: gensim.summarization was removed in gensim 4.0, so this needs gensim < 4.0
from gensim.summarization import summarize
sentence="Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. Automatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a subset of data which contains the information of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context. There are two general approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might express. Such a summary might include verbal innovations. Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization."
# Print an extractive summary of the text
print(summarize(sentence))
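
The extractive approach described in the sample text, selecting existing sentences, can be sketched without any library. The version below scores each sentence by the average corpus frequency of its words and keeps the top ones in their original order. This frequency heuristic is an assumption for illustration and is not what gensim's summarize does internally; the helper name and sample text are hypothetical.

```python
import re
from collections import Counter


def toy_extractive_summary(text, num_sentences=1):
    """Keep the highest-scoring sentences, in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


sample = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
print(toy_extractive_summary(sample, num_sentences=1))
```

Because every output sentence is copied verbatim from the input, the result is always grammatical, but it cannot rephrase or merge ideas the way an abstractive method can.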


