ML — NB: Using the Naive Bayes (NB) Algorithm (TfidfVectorizer, Without Stop-Word Removal) to Classify and Evaluate the 20 Newsgroups Text Dataset

Overview: This post applies a Naive Bayes (NB) classifier to the 20-class newsgroups text dataset, using TF-IDF features extracted by TfidfVectorizer without removing stop words, and evaluates the resulting predictions.

Output

[Figure: evaluation output of the NB classifier — original image unavailable]


Design Approach

[Figure: design flowchart — original image unavailable; a reconstructed pipeline sketch follows]
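Since the original flowchart image is unavailable, the sketch below reconstructs the pipeline the title describes, assuming the standard scikit-learn APIs (fetch_20newsgroups, train_test_split, TfidfVectorizer, MultinomialNB, classification_report). The split ratio, random seed, and variable names are illustrative assumptions, not taken from the original post.

# Reconstructed pipeline (an assumed sketch, not the author's original script):
# Naive Bayes classification of the 20 Newsgroups dataset using TF-IDF
# features WITHOUT stop-word removal.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# 1) Load the full 20-category news corpus.
news = fetch_20newsgroups(subset='all')

# 2) Hold out 25% of the documents for evaluation (illustrative split).
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# 3) TF-IDF features; stop_words=None (the default) keeps stop words.
tfidf_vec = TfidfVectorizer(stop_words=None)
X_train_vec = tfidf_vec.fit_transform(X_train)  # learn vocabulary + idf
X_test_vec = tfidf_vec.transform(X_test)        # reuse them on the test set

# 4) Train a multinomial Naive Bayes classifier and evaluate it.
mnb = MultinomialNB()
mnb.fit(X_train_vec, y_train)
print('Accuracy:', mnb.score(X_test_vec, y_test))
print(classification_report(y_test, mnb.predict(X_test_vec),
                            target_names=news.target_names))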


Core Code

class TfidfVectorizer — found at sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.

        Otherwise the input is expected to be a sequence of string or
        bytes items, which are analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that
        min_n <= n <= max_n will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents; if
        integer, absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents; if
        integer, absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only considers the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:

        - occurred in too many documents (`max_df`)
        - occurred in too few documents (`min_df`)
        - were cut off by feature selection (`max_features`).

        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents, count the occurrences of tokens, and
        return them as a sparse matrix.

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """

    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding,
            decode_error=decode_error, strip_accents=strip_accents,
            lowercase=lowercase, preprocessor=preprocessor,
            tokenizer=tokenizer, analyzer=analyzer, stop_words=stop_words,
            token_pattern=token_pattern, ngram_range=ngram_range,
            max_df=max_df, min_df=min_df, max_features=max_features,
            vocabulary=vocabulary, binary=binary, dtype=dtype)
        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer instance
    # for easy grid search and repr
    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit (or
        fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects
        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
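
A quick usage note (a small illustrative sketch with a toy corpus, not from the original post): fit_transform learns the vocabulary and idf weights on one corpus and transform reuses them, while the norm/use_idf/smooth_idf/sublinear_tf properties simply forward to the inner TfidfTransformer — which is what makes them directly tunable, e.g. in a grid search:

# Illustrative sketch; the toy corpus and variable names are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

vec = TfidfVectorizer()          # stop_words=None by default: stop words kept
X = vec.fit_transform(docs)      # learn vocabulary + idf, return tf-idf matrix
print(sorted(vec.vocabulary_))   # 'the' is kept, since no stop list is used

# The tf-idf parameters are plain properties that proxy the inner
# TfidfTransformer, so they can be changed after construction (handy for
# GridSearchCV over parameters such as sublinear_tf).
vec.sublinear_tf = True          # same effect as passing sublinear_tf=True
X2 = vec.fit_transform(docs)     # refit with 1 + log(tf) scaling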

