Text Feature Extraction: TfidfVectorizer and CountVectorizer

2022-05-17

Bag of words

Count how many times each word appears in each document.

from sklearn.feature_extraction.text import CountVectorizer

# Two pre-tokenized documents; tokens are separated by spaces
documents = ['我 爱 北京 天安门，天安门 很 壮观',
             '我 经常 在 广场 拍照']
count_vec = CountVectorizer()
# Learn the vocabulary and return a sparse document-term count matrix
count_data = count_vec.fit_transform(documents)
print(count_data, count_data.shape, type(count_data))
# Convert the sparse matrix to a dense NumPy array
count_array = count_data.toarray()
print(count_array, count_array.shape, type(count_array))
print('Vocabulary:\n', count_vec.vocabulary_)

Output:

  (0, 1)    1  # 0 is the first document (the first sentence in the corpus); 1 is the vocabulary index, here 壮观
  (0, 2)    2
  (0, 0)    1
  (1, 4)    1
  (1, 3)    1
  (1, 5)    1 (2, 6) <class 'scipy.sparse.csr.csr_matrix'>
[[1 1 2 0 0 0] => count of each vocabulary index in the first document
 [0 0 0 1 1 1]] (2, 6) <class 'numpy.ndarray'>
Vocabulary:
 {'北京': 0, '天安门': 2, '壮观': 1, '经常': 5, '广场': 3, '拍照': 4}
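
One thing the output above does not explain is why single-character words such as 我 or 爱 are missing from the vocabulary. CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only keeps tokens of two or more word characters. The sketch below compares the default with a pattern that also accepts single characters; the alternative pattern and the variable names are my own choices for illustration, not something from the original example.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ['我 爱 北京 天安门，天安门 很 壮观',
             '我 经常 在 广场 拍照']

# Default pattern: tokens need at least two word characters
default_vec = CountVectorizer()
default_vec.fit(documents)
default_vocab = sorted(default_vec.vocabulary_)

# Assumed alternative: accept tokens of one or more word characters
single_char_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vec.fit(documents)
single_char_vocab = sorted(single_char_vec.vocabulary_)

print(len(default_vocab), default_vocab)
print(len(single_char_vocab), single_char_vocab)
```

With the default pattern the vocabulary has 6 terms; with the relaxed pattern it grows to 10, because 我, 爱, 很 and 在 are now kept.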

TF-IDF

Compute the TF-IDF value of each word in each document.

from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses `documents` from the bag-of-words example
tfidf_vec = TfidfVectorizer()
# Learn the vocabulary and return a sparse TF-IDF matrix
tfidf_data = tfidf_vec.fit_transform(documents)
print(tfidf_data, tfidf_data.shape, type(tfidf_data))
# Convert the sparse matrix to a dense NumPy array
tfidf_array = tfidf_data.toarray()
print(tfidf_array, tfidf_array.shape, type(tfidf_array))
print('Vocabulary:\n', tfidf_vec.vocabulary_)

Output:

  (0, 0)    0.408248290463863
  (0, 2)    0.816496580927726
  (0, 1)    0.408248290463863
  (1, 5)    0.5773502691896257
  (1, 3)    0.5773502691896257
  (1, 4)    0.5773502691896257 (2, 6) <class 'scipy.sparse.csr.csr_matrix'>
[[0.40824829 0.40824829 0.81649658 0.         0.         0.        ]
 [0.         0.         0.         0.57735027 0.57735027 0.57735027]] (2, 6) <class 'numpy.ndarray'>
Vocabulary:
 {'北京': 0, '天安门': 2, '壮观': 1, '经常': 5, '广场': 3, '拍照': 4}
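
To see where values like 0.408 and 0.816 come from, here is a minimal sketch that recomputes them by hand with NumPy. It assumes TfidfVectorizer's default settings (smooth_idf=True, norm='l2', raw counts as term frequency), so idf = ln((1 + n) / (1 + df)) + 1 and each row is then scaled to unit length. The names counts, manual and sk are mine, chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['我 爱 北京 天安门，天安门 很 壮观',
             '我 经常 在 广场 拍照']

# Raw counts per document; columns follow the vocabulary indices 0..5
counts = np.array([[1, 1, 2, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (assumes smooth_idf=True)

manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

sk = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(manual, sk))
```

Because every term here appears in exactly one document, all idf values are equal and cancel out in the normalization. The first row is therefore just [1, 1, 2] divided by sqrt(6), and the second is [1, 1, 1] divided by sqrt(3).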

csr_matrix

What puzzled me at first was the toarray() method: how can it turn count_data into that dense form? So I looked it up.

Here is an example using csr_matrix:

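import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0, 1, 0, 2, 0], [1, 1, 0, 2, 0], [2, 0, 5, 0, 0]])
b = csr_matrix(arr)
print(b.shape)
# Number of non-zero elements
print(b.nnz)
# The non-zero values, stored row by row
print(b.data)
# Column index of each non-zero element
print(b.indices)
# The first element is 0; each later element is the cumulative number of non-zero elements up to and including that row
print(b.indptr)
print(b)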

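Output:

(3, 5)
7
[1 2 1 1 2 2 5]
[1 3 0 1 3 0 2]
[0 2 5 7]  => because [0, 1, 0, 2, 0] has 2 non-zero elements, [1, 1, 0, 2, 0] has 3, and [2, 0, 5, 0, 0] has 2; the first entry is always 0, then the counts accumulate: 2, 2+3=5, 5+2=7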
  (0, 1)    1
  (0, 3)    2
  (1, 0)    1
  (1, 1)    1
  (1, 3)    2
  (2, 0)    2
  (2, 2)    5
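
The three arrays printed above fully describe the matrix. As a small sketch of that idea, the snippet below rebuilds the same matrix from data, indices and indptr using the csr_matrix((data, indices, indptr), shape=...) constructor, then uses indptr to slice out each row's entries. The name rebuilt is mine.

```python
import numpy as np
from scipy.sparse import csr_matrix

data = np.array([1, 2, 1, 1, 2, 2, 5])      # non-zero values
indices = np.array([1, 3, 0, 1, 3, 0, 2])   # column index of each value
indptr = np.array([0, 2, 5, 7])             # row boundaries into data/indices

rebuilt = csr_matrix((data, indices, indptr), shape=(3, 5))

# Row i owns the slice indptr[i]:indptr[i + 1] of both data and indices
for i in range(rebuilt.shape[0]):
    start, end = indptr[i], indptr[i + 1]
    print(i, indices[start:end], data[start:end])
```

For row 1, for example, the slice is 2:5, which gives columns [0 1 3] with values [1 1 2].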

Now let's look at the result of toarray().

b is:
  (0, 1)    1
  (0, 3)    2
  (1, 0)    1
  (1, 1)    1
  (1, 3)    2
  (2, 0)    2
  (2, 2)    5

print(b.toarray())
[[0 1 0 2 0]
 [1 1 0 2 0]
 [2 0 5 0 0]]

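We can see that [0 1 0 2 0] is simply the values from (0, 1) 1 and (0, 3) 2 placed at their index positions, with zeros everywhere else.

This part is a bit convoluted, so let me sort out my thoughts O(∩_∩)O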
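
To make the "fill values into their index positions" idea concrete, here is a rough sketch of a dense conversion written by hand. It is not how SciPy implements toarray() internally; csr_to_dense is a hypothetical helper of my own, and it assumes the matrix holds no duplicate entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_to_dense(data, indices, indptr, shape):
    """Hypothetical helper: expand CSR arrays into a dense 2-D array."""
    dense = np.zeros(shape, dtype=data.dtype)
    for row in range(shape[0]):
        # Entries of this row live in positions indptr[row]..indptr[row + 1] - 1
        for k in range(indptr[row], indptr[row + 1]):
            dense[row, indices[k]] = data[k]
    return dense

arr = np.array([[0, 1, 0, 2, 0], [1, 1, 0, 2, 0], [2, 0, 5, 0, 0]])
b = csr_matrix(arr)

manual = csr_to_dense(b.data, b.indices, b.indptr, b.shape)
print(manual)
print(np.array_equal(manual, b.toarray()))
```

The final comparison checks that the hand-written version agrees with b.toarray() on this example.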

References:

The csr_matrix matrix

Compressed storage in sparse.csr_matrix

Parameter settings for CountVectorizer and TfidfVectorizer
