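ML/NB: Using the Naive Bayes (NB) algorithm (TfidfVectorizer, stop words not removed) to classify and evaluate the 20 newsgroups text dataset

Summary: This post applies the Naive Bayes (NB) algorithm, with TfidfVectorizer features and stop words left in place, to classify the 20 newsgroups text dataset and evaluate the predictions.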

Output

[Figure: output results (image.png)]


Design approach

[Figure: design approach (image.png)]
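The approach named in the title (TF-IDF features with stop words kept, fed to a Naive Bayes classifier, then evaluated) might be sketched roughly as follows. This is only an illustration under stated assumptions: the toy corpus, its labels, and the choice of `MultinomialNB` and `accuracy_score` are hypothetical stand-ins, not the author's code. With network access, the real data could presumably be loaded with `sklearn.datasets.fetch_20newsgroups` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the 20 newsgroups data.
train_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new space mission",
    "the shuttle docked with the space station",
    "the team won the hockey game",
    "the goalie made a great save in the game",
    "hockey playoffs start this week",
]
train_labels = ["space", "space", "space", "hockey", "hockey", "hockey"]
test_docs = ["a new mission to orbit the moon", "the hockey team lost the game"]
test_labels = ["space", "hockey"]

# stop_words=None (the default) keeps stop words such as "the" in the vocabulary.
model = make_pipeline(TfidfVectorizer(stop_words=None), MultinomialNB())
model.fit(train_docs, train_labels)

# Predict on held-out documents and evaluate.
predictions = model.predict(test_docs)
accuracy = accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```

Wrapping both steps in one pipeline means the vocabulary and idf weights are learned from the training documents only and then reused unchanged at prediction time.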


Core code

class TfidfVectorizer Found at: sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.

        Otherwise the input is expected to be a sequence of string or
        bytes items that are analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only considers the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:

        - occurred in too many documents (`max_df`)
        - occurred in too few documents (`min_df`)
        - were cut off by feature selection (`max_features`).

        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of tokens and
        return them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """
    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype)
        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer instance
    # for easy grid search and repr
    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit (or
        fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects
        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
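
The docstring above states that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, and that `smooth_idf` adds one to the document frequencies. The following is a minimal sketch that checks both statements on a hypothetical three-document corpus (the documents are an assumption made for illustration). The smoothed formula used here, idf = ln((1 + n) / (1 + df)) + 1, is the one scikit-learn documents for `smooth_idf=True`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy documents, used only to compare the two routes.
docs = ["the cat sat", "the dog sat", "the dog barked loudly"]

# Route 1: TF-IDF features in a single step.
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# Route 2: raw term counts first, then TF-IDF reweighting.
counts = CountVectorizer().fit_transform(docs)
transformer = TfidfTransformer()
tfidf_two_step = transformer.fit_transform(counts)

same = np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())

# Recompute the smoothed idf by hand: ln((1 + n_docs) / (1 + df)) + 1.
dense_counts = counts.toarray()
n_docs = dense_counts.shape[0]
doc_freq = (dense_counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
idf_matches = np.allclose(manual_idf, transformer.idf_)

# With the default norm='l2', every document vector has unit length.
row_norms = np.linalg.norm(tfidf_direct.toarray(), axis=1)
print(same, idf_matches, row_norms)
```

Because the default `norm='l2'` rescales each row to unit length, documents of different lengths become comparable before they reach the classifier.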

