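ML / NB: Using the Naive Bayes (NB) Algorithm (TfidfVectorizer, Stop Words Not Removed) to Classify and Evaluate the 20 Newsgroups Text Dataset - Alibaba Cloud Developer Community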


2021-10-30



Output

Design approach

Core code
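As a minimal sketch of the approach named in the title (TF-IDF features with stop words kept, fed to a naive Bayes classifier), something along the following lines could be used. It assumes scikit-learn's `MultinomialNB` as the NB variant and uses a tiny hand-made corpus as a stand-in for the 20 Newsgroups data, which would normally be loaded with `sklearn.datasets.fetch_20newsgroups`. The helper name `build_nb_pipeline`, the toy documents, and the train/test split are illustrative assumptions, not taken from the original article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_nb_pipeline():
    """Hypothetical helper: TF-IDF features (stop words kept) + multinomial NB."""
    # stop_words=None is the default and means no stop-word filtering.
    return make_pipeline(TfidfVectorizer(stop_words=None), MultinomialNB())


# Toy stand-in corpus (assumption). Real use would load the 20 Newsgroups
# data instead, e.g. with sklearn.datasets.fetch_20newsgroups(subset="all").
train_docs = [
    "the rocket launch was delayed by the space agency",
    "the orbit of the satellite is stable",
    "the team won the hockey game in overtime",
    "the goalie made a great save in the hockey match",
]
train_labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]
test_docs = ["the satellite reached orbit", "a hockey game with a great save"]
test_labels = ["sci.space", "rec.sport.hockey"]

# Fit the vectorizer and the classifier together, then predict and evaluate.
model = build_nb_pipeline()
model.fit(train_docs, train_labels)
predicted = model.predict(test_docs)

print("accuracy:", model.score(test_docs, test_labels))
print(classification_report(test_labels, predicted))
```

Wrapping both steps in a pipeline keeps the vocabulary and idf weights learned on the training documents only, so the test documents are transformed with the same fitted vectorizer.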

class TfidfVectorizer Found at: sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):

"""Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.

Read more in the :ref:`User Guide <text_feature_extraction>`.

Parameters
----------
input : string {'filename', 'file', 'content'}
    If 'filename', the sequence passed as an argument to fit is
    expected to be a list of filenames that need reading to fetch
    the raw content to analyze.

    If 'file', the sequence items must have a 'read' method (file-like
    object) that is called to fetch the bytes in memory.

    Otherwise the input is expected to be a sequence of string or
    bytes items, which are analyzed directly.

encoding : string, 'utf-8' by default.
    If bytes or files are given to analyze, this encoding is used to
    decode.

decode_error : {'strict', 'ignore', 'replace'}
    Instruction on what to do if a byte sequence is given to analyze that
    contains characters not of the given `encoding`. By default, it is
    'strict', meaning that a UnicodeDecodeError will be raised. Other
    values are 'ignore' and 'replace'.

strip_accents : {'ascii', 'unicode', None}
    Remove accents during the preprocessing step.
    'ascii' is a fast method that only works on characters that have
    a direct ASCII mapping.
    'unicode' is a slightly slower method that works on any characters.
    None (default) does nothing.

analyzer : string, {'word', 'char'} or callable
    Whether the feature should be made of word or character n-grams.

    If a callable is passed it is used to extract the sequence of features
    out of the raw, unprocessed input.

preprocessor : callable or None (default)
    Override the preprocessing (string transformation) stage while
    preserving the tokenizing and n-grams generation steps.

tokenizer : callable or None (default)
    Override the string tokenization step while preserving the
    preprocessing and n-grams generation steps.
    Only applies if ``analyzer == 'word'``.

ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different
    n-grams to be extracted. All values of n such that min_n <= n <= max_n
    will be used.

stop_words : string {'english'}, list, or None (default)
    If a string, it is passed to _check_stop_list and the appropriate stop
    list is returned. 'english' is currently the only supported string
    value.

    If a list, that list is assumed to contain stop words, all of which
    will be removed from the resulting tokens.
    Only applies if ``analyzer == 'word'``.

    If None, no stop words will be used. max_df can be set to a value
    in the range [0.7, 1.0) to automatically detect and filter stop
    words based on intra corpus document frequency of terms.

lowercase : boolean, default True
    Convert all characters to lowercase before tokenizing.

token_pattern : string
    Regular expression denoting what constitutes a "token", only used
    if ``analyzer == 'word'``. The default regexp selects tokens of 2
    or more alphanumeric characters (punctuation is completely ignored
    and always treated as a token separator).

max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document
    frequency strictly higher than the given threshold (corpus-specific
    stop words).
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1
    When building the vocabulary ignore terms that have a document
    frequency strictly lower than the given threshold. This value is also
    called cut-off in the literature.
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.

max_features : int or None, default=None
    If not None, build a vocabulary that only considers the top
    max_features ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

vocabulary : Mapping or iterable, optional
    Either a Mapping (e.g., a dict) where keys are terms and values are
    indices in the feature matrix, or an iterable over terms. If not
    given, a vocabulary is determined from the input documents.

binary : boolean, default=False
    If True, all non-zero term counts are set to 1. This does not mean
    outputs will have only 0/1 values, only that the tf term in tf-idf
    is binary. (Set idf and normalization to False to get 0/1 outputs.)

dtype : type, optional
    Type of the matrix returned by fit_transform() or transform().

norm : 'l1', 'l2' or None, optional
    Norm used to normalize term vectors. None for no normalization.

use_idf : boolean, default=True
    Enable inverse-document-frequency reweighting.

smooth_idf : boolean, default=True
    Smooth idf weights by adding one to document frequencies, as if an
    extra document was seen containing every term in the collection
    exactly once. Prevents zero divisions.

sublinear_tf : boolean, default=False
    Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

Attributes
----------
vocabulary_ : dict
    A mapping of terms to feature indices.

idf_ : array, shape = [n_features], or None
    The learned idf vector (global term weights)
    when ``use_idf`` is set to True, None otherwise.

stop_words_ : set
    Terms that were ignored because they either:

      - occurred in too many documents (`max_df`)
      - occurred in too few documents (`min_df`)
      - were cut off by feature selection (`max_features`).

    This is only available if no vocabulary was given.
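
To make two statements from the docstring concrete, here is a small sketch on a made-up three-document corpus (the documents are an assumption, not from the article). It first checks that `TfidfVectorizer` gives the same matrix as `CountVectorizer` followed by `TfidfTransformer` when both use their default settings, and then shows `max_df` acting as a corpus-specific stop-word filter when `stop_words` is left as None.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy documents (assumption), only to have something to vectorize.
# The word "the" occurs in every document; all other words are rarer.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and dogs and birds",
]

# One step: raw text -> TF-IDF matrix.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> term counts -> TF-IDF matrix.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(one_step.toarray(), two_step.toarray())
print("identical matrices:", same)

# max_df=0.7 drops terms whose document frequency exceeds 70% of the corpus.
filtered = TfidfVectorizer(max_df=0.7).fit(docs)
print("'the' kept in vocabulary:", "the" in filtered.vocabulary_)
print("vocabulary:", sorted(filtered.vocabulary_))
```

The one-step form is the more convenient choice when only TF-IDF features are needed; the two-step form is useful when the raw counts are also wanted, for example to compare count features against TF-IDF features with the same classifier.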
