Word cloud for fake news:
from wordcloud import WordCloud

# Join every fake article into one string and build the word cloud from it
fake_data = data[data["target"] == "fake"]
all_words = ' '.join([text for text in fake_data.text])

wordcloud = WordCloud(width=800, height=500, max_font_size=110,
                      collocations=False).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Word cloud for real news:
from wordcloud import WordCloud

# Join every real article into one string (the text must come from real_data, not fake_data)
real_data = data[data["target"] == "true"]
all_words = ' '.join([text for text in real_data.text])

wordcloud = WordCloud(width=800, height=500, max_font_size=110,
                      collocations=False).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Word frequency count:
# Most frequent words counter (Code adapted from https://www.kaggle.com/rodolfoluna/fake-news-detector)
import nltk
from nltk import tokenize

token_space = tokenize.WhitespaceTokenizer()

def counter(text, column_text, quantity):
    # Concatenate the column, split it on whitespace and count each token
    all_words = ' '.join([text for text in text[column_text]])
    token_phrase = token_space.tokenize(all_words)
    frequency = nltk.FreqDist(token_phrase)
    df_frequency = pd.DataFrame({"Word": list(frequency.keys()),
                                 "Frequency": list(frequency.values())})
    # Keep only the `quantity` most frequent words
    df_frequency = df_frequency.nlargest(columns="Frequency", n=quantity)
    plt.figure(figsize=(12, 8))
    ax = sns.barplot(data=df_frequency, x="Word", y="Frequency", color='blue')
    ax.set(ylabel="Count")
    plt.xticks(rotation='vertical')
    plt.show()
Most frequent words in fake news:
counter(data[data["target"] == "fake"], "text", 20)
Most frequent words in real news:
counter(data[data["target"] == "true"], "text", 20)
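Modeling
Modeling consists of vectorizing the corpus stored in the "text" column, applying TF-IDF, and finally fitting a classification algorithm. These are all standard text analysis and NLP operations.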
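To make the first two steps concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and variable names are illustrative assumptions, not the article's data). It shows that a word appearing in every document, such as "the", gets a lower IDF weight than a rare word such as "alien".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for the "text" column
toy_corpus = [
    "the president said the economy is growing",
    "shocking video shows the president is an alien",
    "the senate passed the budget bill",
]

# Step 1: vectorize, i.e. turn each document into a row of raw word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_corpus)

# Step 2: reweight the counts with TF-IDF so ubiquitous words matter less
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

# Look up the learned IDF weight of a common word and of a rare word
idf_the = transformer.idf_[vectorizer.vocabulary_["the"]]
idf_alien = transformer.idf_[vectorizer.vocabulary_["alien"]]

print("matrix shape (documents, vocabulary):", tfidf.shape)
print("idf('the')   =", round(idf_the, 3))
print("idf('alien') =", round(idf_alien, 3))
```

The resulting sparse matrix is what the classifier in the third step receives as input.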
For modeling, we use this function to plot a model's confusion matrix:
# Function to plot the confusion matrix (code from https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
from sklearn import metrics
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Normalize before drawing so the colors and the printed numbers agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write each cell's value, in white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
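Splitting the data
from sklearn.model_selection import train_test_split

# Hold out 20% of the articles for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(data['text'], data.target, test_size=0.2, random_state=42)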
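As a rough illustration of what `test_size` and `random_state` do, the sketch below splits ten made-up samples. The `stratify` argument is an addition that the article's own call does not use; it is shown only as one possible way to keep the fake/true ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten placeholder articles, half fake and half true
toy_texts = ["article {}".format(i) for i in range(10)]
toy_labels = ["fake", "true"] * 5

# test_size=0.2 keeps 2 of the 10 samples for testing;
# stratify (an assumption, not in the article) preserves the class ratio
tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

# The same random_state yields exactly the same split on a second call
tr_again, te_again, _, _ = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

print("train size:", len(tr_texts), "test size:", len(te_texts))
print("test labels:", sorted(te_labels))
print("reproducible:", te_texts == te_again)
```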
Logistic regression
# Vectorizing and applying TF-IDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', LogisticRegression())])

# Fitting the model
model = pipe.fit(X_train, y_train)

# Accuracy
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100, 2)))
The accuracy is 98.76%. The confusion matrix is shown below:
cm = metrics.confusion_matrix(y_test, prediction)
plot_confusion_matrix(cm, classes=['Fake', 'Real'])
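For reading such a matrix, a small hand-made example may help. The labels below are invented for illustration. Rows are true labels and columns are predicted labels, in sorted label order ("fake" before "true"), which is why `classes=['Fake', 'Real']` lines up with the matrix in the call above.

```python
from sklearn import metrics

# Hypothetical labels: 4 fake and 4 true articles, with 2 fakes misclassified
y_true = ["fake"] * 4 + ["true"] * 4
y_pred = ["fake", "fake", "true", "true", "true", "true", "true", "true"]

# Fix the label order explicitly so rows and columns are unambiguous
toy_cm = metrics.confusion_matrix(y_true, y_pred, labels=["fake", "true"])

# Precision: of the articles predicted fake, how many really are fake
precision_fake = metrics.precision_score(y_true, y_pred, pos_label="fake")
# Recall: of the truly fake articles, how many were caught
recall_fake = metrics.recall_score(y_true, y_pred, pos_label="fake")

print(toy_cm)
print("precision (fake):", precision_fake)
print("recall (fake):", recall_fake)
```

Accuracy alone hides this kind of asymmetry: here every "fake" prediction is correct, yet half of the fake articles go undetected.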
Decision tree
from sklearn.tree import DecisionTreeClassifier

# Vectorizing and applying TF-IDF
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', DecisionTreeClassifier(criterion='entropy',
                                                  max_depth=20,
                                                  splitter='best',
                                                  random_state=42))])

# Fitting the model
model = pipe.fit(X_train, y_train)

# Accuracy
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100, 2)))
The accuracy is 99.71%. The confusion matrix is shown below:
cm = metrics.confusion_matrix(y_test, prediction)
plot_confusion_matrix(cm, classes=['Fake', 'Real'])
Random forest
from sklearn.ensemble import RandomForestClassifier

# Vectorizing and applying TF-IDF
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', RandomForestClassifier(n_estimators=50, criterion="entropy"))])

# Fitting the model
model = pipe.fit(X_train, y_train)

# Accuracy
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100, 2)))
The accuracy is 98.98%. The confusion matrix is shown below:
cm = metrics.confusion_matrix(y_test, prediction)
plot_confusion_matrix(cm, classes=['Fake', 'Real'])
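The three pipelines differ only in their final estimator, so the comparison could also be written as a single loop. The sketch below runs on a tiny made-up corpus, not the article's dataset, so its scores say nothing about the real results; the `random_state` on the random forest is an added assumption for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus; with the real data use data['text'] and data.target
toy_texts = (["shocking secret the media will not tell you"] * 10
             + ["the senate passed the budget bill on tuesday"] * 10)
toy_labels = ["fake"] * 10 + ["true"] * 10

tr_x, te_x, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42)

# Same hyperparameters as in the sections above, except the added random_state
candidates = {
    "Logistic regression": LogisticRegression(),
    "Decision tree": DecisionTreeClassifier(criterion="entropy", max_depth=20,
                                            random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=50, criterion="entropy",
                                            random_state=42),
}

results = {}
for name, estimator in candidates.items():
    # Identical preprocessing for every model; only the last step changes
    pipe = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("model", estimator)])
    pipe.fit(tr_x, tr_y)
    results[name] = accuracy_score(te_y, pipe.predict(te_x))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print("{}: {}%".format(name, round(acc * 100, 2)))
```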
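Conclusion
Text analysis and natural language processing can be used to tackle the very important problem of fake news. We have seen how strongly fake news can influence people's opinions and the way the world thinks about a topic.
We built a machine learning model in Python that uses sample data to detect fake articles, and compared the accuracy of several classification models.
Thank you for reading. I hope this article helps with your current work or with your investigation and understanding of data science.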