ML & NB: Text Classification on a News Dataset with Pure Statistical Methods, kNN, Naive Bayes (Gaussian / Multivariate Bernoulli / Multinomial), Linear Discriminant Analysis (LDA), and the Perceptron (Part 2)

Overview: Text classification on a news dataset using pure statistical methods, kNN, Naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and related algorithms. This part focuses on the core scikit-learn source behind the Naive Bayes classifiers.
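
Before looking at the library source below, here is a minimal sketch of how the algorithms listed above could be compared on a news-text corpus. It is illustrative only, not the article's exact code: scikit-learn's 20 newsgroups corpus stands in for the news dataset, and the category names, feature size, and train/test split are arbitrary choices.

# Hedged sketch: compare the classifiers named in the title on a news-text corpus.
# The 20 newsgroups corpus stands in for the article's dataset (an assumption).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

news = fetch_20newsgroups(subset='all',
                          categories=['rec.sport.baseball', 'sci.space', 'talk.politics.misc'])
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=42)

# Turn raw text into tf-idf features (MultinomialNB also works with raw counts)
vec = TfidfVectorizer(max_features=5000)
X_train = vec.fit_transform(X_train_txt)
X_test = vec.transform(X_test_txt)

# Models that accept sparse input directly
for name, model in [('kNN', KNeighborsClassifier(n_neighbors=5)),
                    ('MultinomialNB', MultinomialNB()),
                    ('BernoulliNB', BernoulliNB()),
                    ('Perceptron', Perceptron(max_iter=1000))]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))

# GaussianNB and LDA need dense arrays, so densify the (small) feature matrices
X_train_d, X_test_d = X_train.toarray(), X_test.toarray()
for name, model in [('GaussianNB', GaussianNB()),
                    ('LDA', LinearDiscriminantAnalysis())]:
    model.fit(X_train_d, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test_d)))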

Core code

class GaussianNB Found at: sklearn.naive_bayes

class GaussianNB(_BaseNB):

   """

   Gaussian Naive Bayes (GaussianNB)

 

   Can perform online updates to model parameters via :meth:`partial_fit`.

   For details on algorithm used to update feature means and variance online,

   see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

 

  http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

 

   Read more in the :ref:`User Guide <gaussian_naive_bayes>`.

 

   Parameters

   ----------

   priors : array-like of shape (n_classes,)

   Prior probabilities of the classes. If specified the priors are not

   adjusted according to the data.

 

   var_smoothing : float, default=1e-9

   Portion of the largest variance of all features that is added to

   variances for calculation stability.

 

   .. versionadded:: 0.20

 

   Attributes

   ----------

   class_count_ : ndarray of shape (n_classes,)

   number of training samples observed in each class.

 

   class_prior_ : ndarray of shape (n_classes,)

   probability of each class.

 

   classes_ : ndarray of shape (n_classes,)

   class labels known to the classifier

 

   epsilon_ : float

   absolute additive value to variances

 

   sigma_ : ndarray of shape (n_classes, n_features)

   variance of each feature per class

 

   theta_ : ndarray of shape (n_classes, n_features)

   mean of each feature per class

 

   Examples

   --------

   >>> import numpy as np

   >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

   >>> Y = np.array([1, 1, 1, 2, 2, 2])

   >>> from sklearn.naive_bayes import GaussianNB

   >>> clf = GaussianNB()

   >>> clf.fit(X, Y)

   GaussianNB()

   >>> print(clf.predict([[-0.8, -1]]))

   [1]

   >>> clf_pf = GaussianNB()

   >>> clf_pf.partial_fit(X, Y, np.unique(Y))

   GaussianNB()

   >>> print(clf_pf.predict([[-0.8, -1]]))

   [1]

   """

   @_deprecate_positional_args

   def __init__(self, *, priors=None, var_smoothing=1e-9):

       self.priors = priors

       self.var_smoothing = var_smoothing

 

   def fit(self, X, y, sample_weight=None):

       """Fit Gaussian Naive Bayes according to X, y

       Parameters

       ----------

       X : array-like of shape (n_samples, n_features)

           Training vectors, where n_samples is the number of samples

           and n_features is the number of features.

       y : array-like of shape (n_samples,)

           Target values.

       sample_weight : array-like of shape (n_samples,), default=None

           Weights applied to individual samples (1. for unweighted).

           .. versionadded:: 0.17

              Gaussian Naive Bayes supports fitting with *sample_weight*.

       Returns

       -------

       self : object

       """

       X, y = self._validate_data(X, y)

       y = column_or_1d(y, warn=True)

       return self._partial_fit(X, y, np.unique(y), _refit=True,

           sample_weight=sample_weight)

 

   def _check_X(self, X):

       return check_array(X)

 

   @staticmethod

   def _update_mean_variance(n_past, mu, var, X, sample_weight=None):

       """Compute online update of Gaussian mean and variance.

       Given starting sample count, mean, and variance, a new set of

       points X, and optionally sample weights, return the updated mean and

       variance. (NB - each dimension (column) in X is treated as independent

       -- you get variance, not covariance).

       Can take scalar mean and variance, or vector mean and variance to

       simultaneously update a number of independent Gaussians.

       See Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and

        LeVeque:

      http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

       Parameters

       ----------

       n_past : int

           Number of samples represented in old mean and variance. If sample

           weights were given, this should contain the sum of sample

           weights represented in old mean and variance.

       mu : array-like of shape (number of Gaussians,)

           Means for Gaussians in original set.

       var : array-like of shape (number of Gaussians,)

           Variances for Gaussians in original set.

       sample_weight : array-like of shape (n_samples,), default=None

           Weights applied to individual samples (1. for unweighted).

       Returns

       -------

       total_mu : array-like of shape (number of Gaussians,)

           Updated mean for each Gaussian over the combined set.

       total_var : array-like of shape (number of Gaussians,)

           Updated variance for each Gaussian over the combined set.

       """

       if X.shape[0] == 0:

           return mu, var

       # Compute (potentially weighted) mean and variance of new datapoints

       if sample_weight is not None:

           n_new = float(sample_weight.sum())

           new_mu = np.average(X, axis=0, weights=sample_weight)

           new_var = np.average((X - new_mu) ** 2, axis=0,

            weights=sample_weight)

       else:

           n_new = X.shape[0]

           new_var = np.var(X, axis=0)

           new_mu = np.mean(X, axis=0)

       if n_past == 0:

           return new_mu, new_var

       n_total = float(n_past + n_new)

       # Combine mean of old and new data, taking into consideration

       # (weighted) number of observations

       total_mu = (n_new * new_mu + n_past * mu) / n_total

       # Combine variance of old and new data, taking into consideration

       # (weighted) number of observations. This is achieved by combining

       # the sum-of-squared-differences (ssd)

       old_ssd = n_past * var

       new_ssd = n_new * new_var

        total_ssd = (old_ssd + new_ssd +

                     (n_new * n_past / n_total) * (mu - new_mu) ** 2)

       total_var = total_ssd / n_total

       return total_mu, total_var

 

   def partial_fit(self, X, y, classes=None, sample_weight=None):

       """Incremental fit on a batch of samples.

       This method is expected to be called several times consecutively

       on different chunks of a dataset so as to implement out-of-core

       or online learning.

       This is especially useful when the whole dataset is too big to fit in

       memory at once.

       This method has some performance and numerical stability overhead,

       hence it is better to call partial_fit on chunks of data that are

       as large as possible (as long as fitting in the memory budget) to

       hide the overhead.

       Parameters

       ----------

       X : array-like of shape (n_samples, n_features)

           Training vectors, where n_samples is the number of samples and

           n_features is the number of features.

       y : array-like of shape (n_samples,)

           Target values.

       classes : array-like of shape (n_classes,), default=None

           List of all the classes that can possibly appear in the y vector.

           Must be provided at the first call to partial_fit, can be omitted

           in subsequent calls.

       sample_weight : array-like of shape (n_samples,), default=None

           Weights applied to individual samples (1. for unweighted).

           .. versionadded:: 0.17

       Returns

       -------

       self : object

       """

       return self._partial_fit(X, y, classes, _refit=False,

           sample_weight=sample_weight)

 

   def _partial_fit(self, X, y, classes=None, _refit=False,

       sample_weight=None):

       """Actual implementation of Gaussian NB fitting.

       Parameters

       ----------

       X : array-like of shape (n_samples, n_features)

           Training vectors, where n_samples is the number of samples and

           n_features is the number of features.

       y : array-like of shape (n_samples,)

           Target values.

       classes : array-like of shape (n_classes,), default=None

           List of all the classes that can possibly appear in the y vector.

           Must be provided at the first call to partial_fit, can be omitted

           in subsequent calls.

       _refit : bool, default=False

           If true, act as though this were the first time we called

           _partial_fit (ie, throw away any past fitting and start over).

       sample_weight : array-like of shape (n_samples,), default=None

           Weights applied to individual samples (1. for unweighted).

       Returns

       -------

       self : object

       """

       X, y = check_X_y(X, y)

       if sample_weight is not None:

           sample_weight = _check_sample_weight(sample_weight, X)

       # If the ratio of data variance between dimensions is too small, it

       # will cause numerical errors. To address this, we artificially

       # boost the variance by epsilon, a small fraction of the standard

       # deviation of the largest dimension.

       self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()

       if _refit:

           self.classes_ = None

       if _check_partial_fit_first_call(self, classes):

           # This is the first call to partial_fit:

           # initialize various cumulative counters

           n_features = X.shape[1]

           n_classes = len(self.classes_)

           self.theta_ = np.zeros((n_classes, n_features))

           self.sigma_ = np.zeros((n_classes, n_features))

           self.class_count_ = np.zeros(n_classes, dtype=np.float64)

           # Initialise the class prior

           # Take into account the priors

           if self.priors is not None:

               priors = np.asarray(self.priors)

                # Check that the provided priors match the number of classes

               if len(priors) != n_classes:

                   raise ValueError('Number of priors must match number of'

                       ' classes.')

               # Check that the sum is 1

               if not np.isclose(priors.sum(), 1.0):

                    raise ValueError('The sum of the priors should be 1.')

                # Check that the priors are non-negative

               if (priors < 0).any():

                   raise ValueError('Priors must be non-negative.')

               self.class_prior_ = priors

           else:

               self.class_prior_ = np.zeros(len(self.classes_),

                   dtype=np.float64) # Initialize the priors to zeros for each class

       else:

           if X.shape[1] != self.theta_.shape[1]:

               msg = "Number of features %d does not match previous data %d."

               raise ValueError(msg % (X.shape[1], self.theta_.shape[1]))

            # Put epsilon back in each time

            self.sigma_[:, :] -= self.epsilon_

       classes = self.classes_

       unique_y = np.unique(y)

       unique_y_in_classes = np.in1d(unique_y, classes)

       if not np.all(unique_y_in_classes):

           raise ValueError("The target label(s) %s in y do not exist in the "

               "initial classes %s" %

               (unique_y[~unique_y_in_classes], classes))

       for y_i in unique_y:

           i = classes.searchsorted(y_i)

            X_i = X[y == y_i, :]

           if sample_weight is not None:

               sw_i = sample_weight[y == y_i]

               N_i = sw_i.sum()

           else:

               sw_i = None

               N_i = X_i.shape[0]

           new_theta, new_sigma = self._update_mean_variance(

                self.class_count_[i], self.theta_[i, :], self.sigma_[i, :],

                X_i, sw_i)

            self.theta_[i, :] = new_theta

            self.sigma_[i, :] = new_sigma

           self.class_count_[i] += N_i

     

        self.sigma_[:, :] += self.epsilon_

        # Update the class priors only if none were provided

       if self.priors is None:

           # Empirical prior, with sample_weight taken into account

           self.class_prior_ = self.class_count_ / self.class_count_.sum()

       return self

 

   def _joint_log_likelihood(self, X):

       joint_log_likelihood = []

       for i in range(np.size(self.classes_)):

           jointi = np.log(self.class_prior_[i])

            n_ij = -0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))

            n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /

                (self.sigma_[i, :]), 1)

           joint_log_likelihood.append(jointi + n_ij)

     

       joint_log_likelihood = np.array(joint_log_likelihood).T

       return joint_log_likelihood
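
GaussianNB models each feature with a class-conditional normal distribution, and partial_fit (backed by the online mean/variance update above) allows training in chunks when the data does not fit in memory. A minimal sketch with synthetic dense data (illustrative values, not the article's news features):

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# Two synthetic classes with different feature means
X = np.vstack([rng.normal(0.0, 1.0, (100, 20)),
               rng.normal(2.0, 1.0, (100, 20))])
y = np.array([0] * 100 + [1] * 100)

clf = GaussianNB()
classes = np.unique(y)          # must be passed on the first partial_fit call
for start in range(0, X.shape[0], 50):
    sl = slice(start, start + 50)
    clf.partial_fit(X[sl], y[sl], classes=classes)

# predict() picks the class with the largest joint log likelihood
# (log prior + summed per-feature Gaussian log densities, see _joint_log_likelihood)
print(clf.predict(X[:3]), clf.predict_proba(X[:3]).round(3))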

class MultinomialNB Found at: sklearn.naive_bayes

class MultinomialNB(_BaseDiscreteNB):

   """

   Naive Bayes classifier for multinomial models

 

   The multinomial Naive Bayes classifier is suitable for classification with

   discrete features (e.g., word counts for text classification). The

   multinomial distribution normally requires integer feature counts. However,

   in practice, fractional counts such as tf-idf may also work.

 

   Read more in the :ref:`User Guide <multinomial_naive_bayes>`.

 

   Parameters

   ----------

   alpha : float, default=1.0

   Additive (Laplace/Lidstone) smoothing parameter

   (0 for no smoothing).

 

   fit_prior : bool, default=True

   Whether to learn class prior probabilities or not.

   If false, a uniform prior will be used.

 

   class_prior : array-like of shape (n_classes,), default=None

   Prior probabilities of the classes. If specified the priors are not

   adjusted according to the data.

 

   Attributes

   ----------

   class_count_ : ndarray of shape (n_classes,)

   Number of samples encountered for each class during fitting. This

   value is weighted by the sample weight when provided.

 

   class_log_prior_ : ndarray of shape (n_classes, )

   Smoothed empirical log probability for each class.

 

   classes_ : ndarray of shape (n_classes,)

   Class labels known to the classifier

 

   coef_ : ndarray of shape (n_classes, n_features)

   Mirrors ``feature_log_prob_`` for interpreting MultinomialNB

   as a linear model.

 

   feature_count_ : ndarray of shape (n_classes, n_features)

   Number of samples encountered for each (class, feature)

   during fitting. This value is weighted by the sample weight when

   provided.

 

   feature_log_prob_ : ndarray of shape (n_classes, n_features)

   Empirical log probability of features

   given a class, ``P(x_i|y)``.

 

   intercept_ : ndarray of shape (n_classes, )

   Mirrors ``class_log_prior_`` for interpreting MultinomialNB

   as a linear model.

 

   n_features_ : int

   Number of features of each sample.

 

   Examples

   --------

   >>> import numpy as np

   >>> rng = np.random.RandomState(1)

   >>> X = rng.randint(5, size=(6, 100))

   >>> y = np.array([1, 2, 3, 4, 5, 6])

   >>> from sklearn.naive_bayes import MultinomialNB

   >>> clf = MultinomialNB()

   >>> clf.fit(X, y)

   MultinomialNB()

   >>> print(clf.predict(X[2:3]))

   [3]

 

   Notes

   -----

   For the rationale behind the names `coef_` and `intercept_`, i.e.

   naive Bayes as a linear classifier, see J. Rennie et al. (2003),

   Tackling the poor assumptions of naive Bayes text classifiers, ICML.

 

   References

   ----------

   C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to

   Information Retrieval. Cambridge University Press, pp. 234-265.

    https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

   """

   @_deprecate_positional_args

   def __init__(self, *, alpha=1.0, fit_prior=True, class_prior=None):

       self.alpha = alpha

       self.fit_prior = fit_prior

       self.class_prior = class_prior

 

   def _more_tags(self):

       return {'requires_positive_X':True}

 

   def _count(self, X, Y):

       """Count and smooth feature occurrences."""

       check_non_negative(X, "MultinomialNB (input X)")

       self.feature_count_ += safe_sparse_dot(Y.T, X)

       self.class_count_ += Y.sum(axis=0)

 

   def _update_feature_log_prob(self, alpha):

       """Apply smoothing to raw counts and recompute log probabilities"""

       smoothed_fc = self.feature_count_ + alpha

       smoothed_cc = smoothed_fc.sum(axis=1)

        self.feature_log_prob_ = (np.log(smoothed_fc) -

                                  np.log(smoothed_cc.reshape(-1, 1)))

 

   def _joint_log_likelihood(self, X):

       """Calculate the posterior log probability of the samples X"""

       return safe_sparse_dot(X, self.feature_log_prob_.T) + self.class_log_prior_
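
MultinomialNB is the standard choice for word-count features: _count accumulates per-class word totals and _update_feature_log_prob applies Laplace/Lidstone smoothing before taking logs. A small hedged sketch on a made-up corpus (the documents and labels are illustrative only):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the team won the game", "stocks fell on the market",
        "the player scored again", "the market rallied today"]
labels = ["sports", "finance", "sports", "finance"]

vec = CountVectorizer()
X = vec.fit_transform(docs)       # sparse matrix of raw word counts
clf = MultinomialNB(alpha=1.0)    # alpha=1.0 is plain Laplace smoothing
clf.fit(X, labels)

print(clf.predict(vec.transform(["the game was won by the team"])))
# feature_log_prob_ holds the smoothed log P(word | class) matrix
print(clf.feature_log_prob_.shape)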

class BernoulliNB Found at: sklearn.naive_bayes

class BernoulliNB(_BaseDiscreteNB):

   """Naive Bayes classifier for multivariate Bernoulli models.

 

   Like MultinomialNB, this classifier is suitable for discrete data. The

   difference is that while MultinomialNB works with occurrence counts,

   BernoulliNB is designed for binary/boolean features.

 

   Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.

 

   Parameters

   ----------

   alpha : float, default=1.0

   Additive (Laplace/Lidstone) smoothing parameter

   (0 for no smoothing).

 

   binarize : float or None, default=0.0

   Threshold for binarizing (mapping to booleans) of sample features.

   If None, input is presumed to already consist of binary vectors.

 

   fit_prior : bool, default=True

   Whether to learn class prior probabilities or not.

   If false, a uniform prior will be used.

 

   class_prior : array-like of shape (n_classes,), default=None

   Prior probabilities of the classes. If specified the priors are not

   adjusted according to the data.

 

   Attributes

   ----------

   class_count_ : ndarray of shape (n_classes)

   Number of samples encountered for each class during fitting. This

   value is weighted by the sample weight when provided.

 

   class_log_prior_ : ndarray of shape (n_classes)

   Log probability of each class (smoothed).

 

   classes_ : ndarray of shape (n_classes,)

   Class labels known to the classifier

 

   feature_count_ : ndarray of shape (n_classes, n_features)

   Number of samples encountered for each (class, feature)

   during fitting. This value is weighted by the sample weight when

   provided.

 

   feature_log_prob_ : ndarray of shape (n_classes, n_features)

   Empirical log probability of features given a class, P(x_i|y).

 

   n_features_ : int

   Number of features of each sample.

 

   Examples

   --------

   >>> import numpy as np

   >>> rng = np.random.RandomState(1)

   >>> X = rng.randint(5, size=(6, 100))

   >>> Y = np.array([1, 2, 3, 4, 4, 5])

   >>> from sklearn.naive_bayes import BernoulliNB

   >>> clf = BernoulliNB()

   >>> clf.fit(X, Y)

   BernoulliNB()

   >>> print(clf.predict(X[2:3]))

   [3]

 

   References

   ----------

   C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to

   Information Retrieval. Cambridge University Press, pp. 234-265.

    https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

 

    A. McCallum and K. Nigam (1998). A comparison of event models for naive

    Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for

    Text Categorization, pp. 41-48.

 

    V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with

    naive Bayes -- Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).

   """

   @_deprecate_positional_args

   def __init__(self, *, alpha=1.0, binarize=.0, fit_prior=True,

       class_prior=None):

       self.alpha = alpha

       self.binarize = binarize

       self.fit_prior = fit_prior

       self.class_prior = class_prior

 

   def _check_X(self, X):

       X = super()._check_X(X)

       if self.binarize is not None:

           X = binarize(X, threshold=self.binarize)

       return X

 

   def _check_X_y(self, X, y):

       X, y = super()._check_X_y(X, y)

       if self.binarize is not None:

           X = binarize(X, threshold=self.binarize)

       return X, y

 

   def _count(self, X, Y):

       """Count and smooth feature occurrences."""

       self.feature_count_ += safe_sparse_dot(Y.T, X)

       self.class_count_ += Y.sum(axis=0)

 

   def _update_feature_log_prob(self, alpha):

       """Apply smoothing to raw counts and recompute log

        probabilities"""

       smoothed_fc = self.feature_count_ + alpha

       smoothed_cc = self.class_count_ + alpha * 2

        self.feature_log_prob_ = (np.log(smoothed_fc) -

                                  np.log(smoothed_cc.reshape(-1, 1)))

 

   def _joint_log_likelihood(self, X):

       """Calculate the posterior log probability of the samples X"""

       n_classes, n_features = self.feature_log_prob_.shape

       n_samples, n_features_X = X.shape

       if n_features_X != n_features:

           raise ValueError(

               "Expected input with %d features, got %d instead" %

                (n_features, n_features_X))

       neg_prob = np.log(1 - np.exp(self.feature_log_prob_))

       # Compute  neg_prob · (1 - X).T  as  ∑neg_prob - X · neg_prob

       jll = safe_sparse_dot(X, (self.feature_log_prob_ - neg_prob).T)

       jll += self.class_log_prior_ + neg_prob.sum(axis=1)

       return jll
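
Unlike MultinomialNB, BernoulliNB models only word presence/absence: with the default binarize=0.0 every positive count is mapped to 1, and _joint_log_likelihood also penalizes words that are absent from a document. A small hedged comparison on an illustrative corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["space launch delayed", "rocket launch succeeded",
        "election results announced", "parliament vote delayed"]
labels = ["sci", "sci", "politics", "politics"]

X = CountVectorizer().fit_transform(docs)
bnb = BernoulliNB(alpha=1.0, binarize=0.0).fit(X, labels)   # counts binarized to 0/1
mnb = MultinomialNB(alpha=1.0).fit(X, labels)               # counts used as-is

# The two event models can disagree on longer documents where word frequency matters
print(bnb.predict(X))
print(mnb.predict(X))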

