ML / NB: Using the Naive Bayes (NB) algorithm (TfidfVectorizer, stop words not removed) to classify and evaluate the 20-category news text dataset

Summary: Use the Naive Bayes (NB) algorithm with TfidfVectorizer features (stop words not removed) to classify and evaluate the 20-category newsgroups text dataset.

Output Results

(image: output results)


Design Approach

(image: design approach diagram)
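
A minimal sketch of this pipeline, assuming the standard scikit-learn APIs: TF-IDF features are extracted with TfidfVectorizer left at its default stop_words=None (i.e., stop words are retained), then a MultinomialNB classifier is trained and evaluated. The variable names and the 25% test split are illustrative, not prescribed by the original post.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load all 20 newsgroup categories (downloads on first use).
news = fetch_20newsgroups(subset='all')

# Hold out 25% of the documents for evaluation (split ratio is illustrative).
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# TF-IDF features; stop_words defaults to None, so stop words are retained.
tfidf_vec = TfidfVectorizer()
X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)

# Multinomial Naive Bayes suits non-negative TF-IDF term-weight features.
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)
y_pred = mnb.predict(X_test_tfidf)

print('Accuracy:', mnb.score(X_test_tfidf, y_test))
print(classification_report(y_test, y_pred, target_names=news.target_names))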


Core Code

The TfidfVectorizer class, found at sklearn.feature_extraction.text:

class TfidfVectorizer(CountVectorizer):

   """Convert a collection of raw documents to a matrix of TF-IDF features.

 

   Equivalent to CountVectorizer followed by TfidfTransformer.

 

   Read more in the :ref:`User Guide <text_feature_extraction>`.

 

   Parameters

   ----------

   input : string {'filename', 'file', 'content'}

   If 'filename', the sequence passed as an argument to fit is

   expected to be a list of filenames that need reading to fetch

   the raw content to analyze.

 

   If 'file', the sequence items must have a 'read' method (file-like

   object) that is called to fetch the bytes in memory.

 

   Otherwise the input is expected to be a sequence of string or
   bytes items, which are analyzed directly.

 

   encoding : string, 'utf-8' by default.

   If bytes or files are given to analyze, this encoding is used to

   decode.

 

   decode_error : {'strict', 'ignore', 'replace'}

   Instruction on what to do if a byte sequence is given to analyze that

   contains characters not of the given `encoding`. By default, it is

   'strict', meaning that a UnicodeDecodeError will be raised. Other

   values are 'ignore' and 'replace'.

 

   strip_accents : {'ascii', 'unicode', None}

   Remove accents during the preprocessing step.

   'ascii' is a fast method that only works on characters that have

   a direct ASCII mapping.

   'unicode' is a slightly slower method that works on any characters.

   None (default) does nothing.

 

   analyzer : string, {'word', 'char'} or callable

   Whether the feature should be made of word or character n-grams.

 

   If a callable is passed it is used to extract the sequence of features

   out of the raw, unprocessed input.

 

   preprocessor : callable or None (default)

   Override the preprocessing (string transformation) stage while

   preserving the tokenizing and n-grams generation steps.

 

   tokenizer : callable or None (default)

   Override the string tokenization step while preserving the

   preprocessing and n-grams generation steps.

   Only applies if ``analyzer == 'word'``.

 

   ngram_range : tuple (min_n, max_n)

   The lower and upper boundary of the range of n-values for different

   n-grams to be extracted. All values of n such that min_n <= n <= max_n

   will be used.

 

   stop_words : string {'english'}, list, or None (default)

   If a string, it is passed to _check_stop_list and the appropriate stop

   list is returned. 'english' is currently the only supported string

   value.

 

   If a list, that list is assumed to contain stop words, all of which

   will be removed from the resulting tokens.

   Only applies if ``analyzer == 'word'``.

 

   If None, no stop words will be used. max_df can be set to a value

   in the range [0.7, 1.0) to automatically detect and filter stop

   words based on intra corpus document frequency of terms.

 

   lowercase : boolean, default True

   Convert all characters to lowercase before tokenizing.

 

   token_pattern : string

   Regular expression denoting what constitutes a "token", only used

   if ``analyzer == 'word'``. The default regexp selects tokens of 2

   or more alphanumeric characters (punctuation is completely ignored

   and always treated as a token separator).

 

   max_df : float in range [0.0, 1.0] or int, default=1.0

   When building the vocabulary ignore terms that have a document

   frequency strictly higher than the given threshold (corpus-specific

   stop words).

   If float, the parameter represents a proportion of documents, integer

   absolute counts.

   This parameter is ignored if vocabulary is not None.

 

   min_df : float in range [0.0, 1.0] or int, default=1

   When building the vocabulary ignore terms that have a document

   frequency strictly lower than the given threshold. This value is also

   called cut-off in the literature.

   If float, the parameter represents a proportion of documents, integer

   absolute counts.

   This parameter is ignored if vocabulary is not None.

 

   max_features : int or None, default=None

   If not None, build a vocabulary that only consider the top

   max_features ordered by term frequency across the corpus.

 

   This parameter is ignored if vocabulary is not None.

 

   vocabulary : Mapping or iterable, optional

   Either a Mapping (e.g., a dict) where keys are terms and values are

   indices in the feature matrix, or an iterable over terms. If not

   given, a vocabulary is determined from the input documents.

 

   binary : boolean, default=False

   If True, all non-zero term counts are set to 1. This does not mean

   outputs will have only 0/1 values, only that the tf term in tf-idf

   is binary. (Set idf and normalization to False to get 0/1 outputs.)

 

   dtype : type, optional

   Type of the matrix returned by fit_transform() or transform().

 

   norm : 'l1', 'l2' or None, optional

   Norm used to normalize term vectors. None for no normalization.

 

   use_idf : boolean, default=True

   Enable inverse-document-frequency reweighting.

 

   smooth_idf : boolean, default=True

   Smooth idf weights by adding one to document frequencies, as if an

   extra document was seen containing every term in the collection

   exactly once. Prevents zero divisions.

 

   sublinear_tf : boolean, default=False

   Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

 

   Attributes

   ----------

   vocabulary_ : dict

   A mapping of terms to feature indices.

 

   idf_ : array, shape = [n_features], or None

   The learned idf vector (global term weights)

   when ``use_idf`` is set to True, None otherwise.

 

   stop_words_ : set

   Terms that were ignored because they either:

 

   - occurred in too many documents (`max_df`)

   - occurred in too few documents (`min_df`)

   - were cut off by feature selection (`max_features`).

 

   This is only available if no vocabulary was given.

 

   See also

   --------

   CountVectorizer

   Tokenize the documents and count the occurrences of tokens,
   returning them as a sparse matrix.

 

   TfidfTransformer

   Apply Term Frequency Inverse Document Frequency normalization to a

   sparse matrix of occurrence counts.

 

   Notes

   -----

   The ``stop_words_`` attribute can get large and increase the model size

   when pickling. This attribute is provided only for introspection and can

   be safely removed using delattr or set to None before pickling.

   """

   def __init__(self, input='content', encoding='utf-8',

       decode_error='strict', strip_accents=None, lowercase=True,

       preprocessor=None, tokenizer=None, analyzer='word',

       stop_words=None, token_pattern=r"(?u)\b\w\w+\b",

       ngram_range=(1, 1), max_df=1.0, min_df=1,

       max_features=None, vocabulary=None, binary=False,

       dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,

       sublinear_tf=False):

       super(TfidfVectorizer, self).__init__(input=input, encoding=encoding,

        decode_error=decode_error, strip_accents=strip_accents,

        lowercase=lowercase, preprocessor=preprocessor, tokenizer=tokenizer,

        analyzer=analyzer, stop_words=stop_words,

        token_pattern=token_pattern, ngram_range=ngram_range,

        max_df=max_df, min_df=min_df, max_features=max_features,

        vocabulary=vocabulary, binary=binary, dtype=dtype)

       self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,

           smooth_idf=smooth_idf,

           sublinear_tf=sublinear_tf)

 

   # Broadcast the TF-IDF parameters to the underlying transformer instance
   # for easy grid search and repr

   @property

   def norm(self):

       return self._tfidf.norm

 

   @norm.setter

   def norm(self, value):

       self._tfidf.norm = value

 

   @property

   def use_idf(self):

       return self._tfidf.use_idf

 

   @use_idf.setter

   def use_idf(self, value):

       self._tfidf.use_idf = value

 

   @property

   def smooth_idf(self):

       return self._tfidf.smooth_idf

 

   @smooth_idf.setter

   def smooth_idf(self, value):

       self._tfidf.smooth_idf = value

 

   @property

   def sublinear_tf(self):

       return self._tfidf.sublinear_tf

 

   @sublinear_tf.setter

   def sublinear_tf(self, value):

       self._tfidf.sublinear_tf = value

 

   @property

   def idf_(self):

       return self._tfidf.idf_

 

   def fit(self, raw_documents, y=None):

       """Learn vocabulary and idf from training set.

       Parameters

       ----------

       raw_documents : iterable

           an iterable which yields either str, unicode or file objects

       Returns

       -------

       self : TfidfVectorizer

       """

       X = super(TfidfVectorizer, self).fit_transform(raw_documents)

       self._tfidf.fit(X)

       return self

 

   def fit_transform(self, raw_documents, y=None):

       """Learn vocabulary and idf, return term-document matrix.

       This is equivalent to fit followed by transform, but more efficiently

       implemented.

       Parameters

       ----------

       raw_documents : iterable

           an iterable which yields either str, unicode or file objects

       Returns

       -------

       X : sparse matrix, [n_samples, n_features]

           Tf-idf-weighted document-term matrix.

       """

       X = super(TfidfVectorizer, self).fit_transform(raw_documents)

       self._tfidf.fit(X)

       # X is already a transformed view of raw_documents so

       # we set copy to False

       return self._tfidf.transform(X, copy=False)

 

   def transform(self, raw_documents, copy=True):

       """Transform documents to document-term matrix.

       Uses the vocabulary and document frequencies (df) learned by fit (or

       fit_transform).

       Parameters

       ----------

       raw_documents : iterable

           an iterable which yields either str, unicode or file objects

       copy : boolean, default True

           Whether to copy X and operate on the copy or perform in-place

           operations.

       Returns

       -------

       X : sparse matrix, [n_samples, n_features]

           Tf-idf-weighted document-term matrix.

       """

       check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')

       X = super(TfidfVectorizer, self).transform(raw_documents)

       return self._tfidf.transform(X, copy=False)
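
Because TfidfVectorizer broadcasts norm, use_idf, smooth_idf, and sublinear_tf to its internal TfidfTransformer (via the properties shown above), these options can be grid-searched like any other estimator parameter. A small sketch, assuming a Pipeline with MultinomialNB; the grid values are illustrative:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

# The broadcast properties make the TF-IDF options tunable by name.
param_grid = {
    'tfidf__sublinear_tf': [True, False],    # tf vs. 1 + log(tf)
    'tfidf__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. uni+bigrams
    'nb__alpha': [0.1, 1.0],                 # NB smoothing (illustrative)
}
search = GridSearchCV(pipe, param_grid, cv=3)
# search.fit(X_train, y_train)  # X_train/y_train as in the sketch above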

