ML NB: Classifying and Evaluating the 20-Newsgroups Text Dataset with the Naive Bayes (NB) Algorithm (TfidfVectorizer, Stop Words Not Removed)

Summary: Use the Naive Bayes (NB) algorithm with TfidfVectorizer features (stop words not removed) to classify the 20-newsgroups text dataset and evaluate the predictions.

Output



Design Approach

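A minimal sketch of the workflow the title describes: load the 20-newsgroups data, split it, extract TF-IDF features with TfidfVectorizer's default settings (stop_words=None, i.e. stop words are not removed), train a multinomial Naive Bayes classifier, and evaluate the predictions. This assumes a recent scikit-learn; the 25% test split and random_state value are illustrative choices, not taken from the original post.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load all documents of the 20-class newsgroups dataset.
news = fetch_20newsgroups(subset='all')

# Hold out 25% of the documents for testing (illustrative split).
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# TF-IDF features with default settings: stop_words=None,
# so stop words are NOT removed -- the configuration in the title.
vec = TfidfVectorizer()
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)

# Train a multinomial Naive Bayes classifier on the tf-idf features.
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)

# Evaluate: overall accuracy plus per-class precision/recall/F1.
print('Accuracy:', mnb.score(X_test, y_test))
print(classification_report(y_test, y_pred, target_names=news.target_names))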


Core Code

class TfidfVectorizer, found at: sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.

        Otherwise the input is expected to be a sequence of items of
        type string or bytes, which are analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze
        that contains characters not of the given `encoding`. By default,
        it is 'strict', meaning that a UnicodeDecodeError will be raised.
        Other values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of
        features out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for
        different n-grams to be extracted. All values of n such that
        min_n <= n <= max_n will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate
        stop list is returned. 'english' is currently the only supported
        string value.

        If a list, that list is assumed to contain stop words, all of
        which will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold
        (corpus-specific stop words).
        If float, the parameter represents a proportion of documents;
        if integer, absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is
        also called cut-off in the literature.
        If float, the parameter represents a proportion of documents;
        if integer, absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only considers the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values
        are indices in the feature matrix, or an iterable over terms. If
        not given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if
        an extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:

        - occurred in too many documents (`max_df`)
        - occurred in too few documents (`min_df`)
        - were cut off by feature selection (`max_features`).

        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents, count the occurrences of tokens and
        return them as a sparse matrix.

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to
        a sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model
    size when pickling. This attribute is provided only for introspection
    and can be safely removed using delattr or set to None before
    pickling.
    """

    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding,
            decode_error=decode_error, strip_accents=strip_accents,
            lowercase=lowercase, preprocessor=preprocessor,
            tokenizer=tokenizer, analyzer=analyzer, stop_words=stop_words,
            token_pattern=token_pattern, ngram_range=ngram_range,
            max_df=max_df, min_df=min_df, max_features=max_features,
            vocabulary=vocabulary, binary=binary, dtype=dtype)
        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer
    # instance for easy grid search and repr
    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more
        efficiently implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit
        (or fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
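To connect the class documentation above to the post's theme, here is a small usage sketch (the toy documents are invented for illustration): with the default stop_words=None, TfidfVectorizer keeps high-frequency function words such as "the" in the vocabulary, which is exactly the "stop words not removed" configuration used in this post.

from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents, invented for illustration only.
docs = ["the cat sat on the mat",
        "the dog chased the cat"]

vec = TfidfVectorizer()         # stop_words=None: stop words are NOT removed
X = vec.fit_transform(docs)     # learn vocabulary and idf, return tf-idf matrix

print(sorted(vec.vocabulary_))  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(X.shape)                  # (2, 7): 2 documents x 7 vocabulary terms
print(vec.idf_)                 # learned idf weights (smoothed, since smooth_idf=True)

Passing stop_words='english' instead would drop words like "the" and "on" from the vocabulary; the properties (norm, use_idf, smooth_idf, sublinear_tf) shown above simply forward to the internal TfidfTransformer, which is what makes them tunable via grid search.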

