Deep Dive into Bayesian Classification Algorithm

Summary: A lot of our online experience depends significantly on machine learning algorithms. One such algorithm is the Bayesian classification algorithm.

1. Introduction:

Bayesian classification is the general term for a family of classification algorithms based on Bayes' theorem. This article starts by introducing the classification problem and giving its formal definition. Then we introduce Bayes' theorem as it applies to Bayesian classification. Finally, we discuss the simplest type of Bayesian classification, naive Bayesian classification, and explain it along with application cases.

2. Bayesian Classification

2.1 Summary - The Issue of Classification

Most of us are already familiar with the concept of classification, as we use it, more or less, every day. For example, when you see a stranger, your brain subconsciously determines whether this person is a man or a woman. This is, in fact, a sort of classification operation.

From a mathematical standpoint, we can define classification problems as follows:

Given the known sets C = {y1, y2, ..., yn} and I = {x1, x2, ..., xm, ...}, determine the mapping rule y = f(x) such that for any xi ∈ I there is one and only one yi ∈ C for which yi = f(xi) holds.

Here C is called the category set, each element of which is a category; I is called the item set, each element of which is an item to be classified; and f is called the classifier. The task of a classification algorithm is to construct the classifier f.
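To make the notation concrete, here is a minimal Python sketch (the categories, items, and mapping rule below are all made up for illustration): a classifier is simply a function f that maps every item in I to exactly one category in C.

```python
# Toy illustration of the classification setup: C is the category set,
# I is the set of items to classify, and f is the classifier mapping I -> C.
C = {"man", "woman"}                       # category set
I = [{"height": 182, "hair": "short"},     # items to be classified (made-up features)
     {"height": 165, "hair": "long"}]

def f(x):
    """A hand-written (empirical) mapping rule; real classifiers learn this from data."""
    return "man" if x["height"] > 175 else "woman"

for x in I:
    print(x, "->", f(x))   # each item is mapped to exactly one category in C
```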

We emphasize here that classification problems often rely on empirical methods to construct the mapping rule. That is to say, a general classification problem frequently lacks enough information to construct a 100% correct mapping rule; however, by learning from data, we can achieve correct classification with a certain probability. Therefore, the trained classifier does not necessarily map every item to its correct category. The quality of the classifier depends on the construction method, the characteristics of the data being classified, and the number of training samples provided.

For example, a doctor's diagnosis of a patient is a typical classification process. No doctor can directly see a patient's condition; he or she can only observe the patient's visible symptoms and use various test data to infer the likely condition. The doctor therefore acts as a classifier, and the doctor's diagnostic accuracy is closely related to:

•How he or she was trained (the classifier construction method),
•Whether the patient's symptoms are prominent (the characteristics of the data being classified), and
•How much experience the doctor has (the number of training samples).

2.2 Bayesian Classification Based on Bayes' Theorem

Bayes' Theorem solves many problems that we often encounter in real life. We can generalize most of these problems as follows:

"Given that, an event has occurred (by assumption, presumption, assertion or evidence) what is the probability that a second event will occur?"

We express the probability we can observe directly as P(A|B) and the probability we actually want as P(B|A). We begin by explaining what conditional probability is:

P(A|B) denotes the probability of event A given that event B has already occurred; it is called the conditional probability of A given B. Its basic formula is:

P(A|B) = P(A∩B) / P(B)

Bayes' Theorem is useful because we often encounter this sort of thing in our daily lives: we can easily draw a direct conclusion P(A|B), but P(B|A) is harder to infer directly. However, we're more concerned about P(B|A). Bayes Theorem is the path that lets us travel from P(A|B) to P(B|A).
Below is Bayes' theorem, stated without proof:

P(B|A) = P(A|B) P(B) / P(A)
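As a quick sanity check, here is a minimal Python sketch of the formula with made-up numbers (the probabilities below are purely illustrative):

```python
# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
# Illustrative numbers only: B = "has disease", A = "test is positive".
p_b = 0.01           # prior P(B)
p_a_given_b = 0.95   # likelihood P(A|B), easy to observe directly
p_a_given_not_b = 0.05

# Total probability: P(A) = P(A|B)P(B) + P(A|not B)P(not B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

p_b_given_a = p_a_given_b * p_b / p_a
print(f"P(B|A) = {p_b_given_a:.3f}")  # ~0.161: the quantity we actually care about
```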

2.3 Naive Bayesian classification: Principles and Processes

The naive Bayes classifier is a type of generative model that models each possible category based on the training samples. We call it "naive" because of the attribute conditional independence assumption: each attribute is assumed to have an independent effect on the classification result. This gives us the following formula:

P(c|x) = P(c) P(x|c) / P(x) = (P(c) / P(x)) × P(x1|c) × P(x2|c) × ... × P(xd|c)

The naive Bayes classifier then picks the category that maximizes this expression:

h(x) = argmax over c of P(c) × P(x1|c) × P(x2|c) × ... × P(xd|c)

The next thing to note is that a probability estimate of 0 (for an attribute value that never appears with a given category in the training set) would wipe out the information carried by all other attributes. Therefore, instead of 0, we assign a very small, non-zero value to such probabilities. This adjustment is called the Laplacian correction:

P(c) = (|Dc| + 1) / (|D| + N)
P(xi|c) = (|Dc,xi| + 1) / (|Dc| + Ni)

where D is the training set, Dc is the subset of D belonging to category c, Dc,xi is the subset of Dc whose i-th attribute takes the value xi, N is the number of categories, and Ni is the number of possible values of the i-th attribute.

Laplacian correction assumes an even distribution of attribute values and categories and introduces additional prior knowledge into the learning process.
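The following is a minimal from-scratch sketch of a naive Bayes classifier with the Laplacian correction for discrete attributes (the tiny dataset is made up purely for illustration):

```python
from collections import Counter, defaultdict

# Tiny made-up training set: each sample is (attribute tuple, category).
train = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
         (("rainy", "mild"), "yes"), (("overcast", "hot"), "yes"),
         (("rainy", "cool"), "yes")]

categories = sorted({c for _, c in train})
n_attrs = len(train[0][0])
# Possible values of each attribute (needed for the denominator Ni).
values = [sorted({x[i] for x, _ in train}) for i in range(n_attrs)]

class_count = Counter(c for _, c in train)
attr_count = defaultdict(Counter)          # (attribute index, category) -> value counts
for x, c in train:
    for i, v in enumerate(x):
        attr_count[(i, c)][v] += 1

def prior(c):
    # P(c) = (|Dc| + 1) / (|D| + N)
    return (class_count[c] + 1) / (len(train) + len(categories))

def likelihood(i, v, c):
    # P(xi|c) = (|Dc,xi| + 1) / (|Dc| + Ni)
    return (attr_count[(i, c)][v] + 1) / (class_count[c] + len(values[i]))

def classify(x):
    # h(x) = argmax_c P(c) * prod_i P(xi|c)
    def score(c):
        p = prior(c)
        for i, v in enumerate(x):
            p *= likelihood(i, v, c)
        return p
    return max(categories, key=score)

print(classify(("rainy", "hot")))   # classifies an unseen attribute combination
```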

2.3.1 Three Stages of the Naive Bayesian Classification

Stage 1: Preparatory Work Stage
In this stage, we do the necessary preparation for naive Bayesian classification. The main tasks are to determine the feature attributes according to the specific situation, to divide each feature attribute appropriately, and then to select and label a portion of the items to be classified to form the training sample set. The input of this stage is all of the data to be classified, and the output is the feature attributes and the training samples. This is the only stage of naive Bayesian classification that requires manual work, and its quality has a significant influence on the whole process: the feature attributes, the way they are divided, and the quality of the training samples largely determine the quality of the classifier.

Stage 2: Classifier Training Stage
In this stage, we generate the classifier. The main task is to calculate the frequency of each category in the training samples and the conditional probability estimate of each feature attribute value for each category, and to record the results. The inputs are the feature attributes and the training samples; the output is the classifier. This stage is mechanical: based on the formulas discussed above, it can be completed automatically by a program.

Stage 3: Application Stage
The task in this stage is to classify new items using the classifier. The inputs are the classifier and the items to be classified, and the output is the mapping between those items and the categories. This stage is also mechanical and can be completed by a program.
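As an illustration of the three stages, here is a minimal sketch using scikit-learn's GaussianNB (the library choice and dataset are assumptions for illustration, not part of the original article): preparation loads and splits the data, training fits the classifier, and application maps new items to categories.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stage 1 (preparation): choose feature attributes and form the training sample set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 2 (classifier training): estimate P(c) and P(xi|c) from the training samples.
clf = GaussianNB()
clf.fit(X_train, y_train)

# Stage 3 (application): map each item to be classified to a category.
print(clf.predict(X_test[:5]))
print("accuracy:", clf.score(X_test, y_test))
```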

2.4 Semi-naive Bayesian Classification

In naive Bayesian classification, we assumed that the attributes are independent of one another. We did this for convenience of computation and to avoid the explosion of calculations caused by dependencies among many attributes; this is why we call it "naive." Although the overall performance of naive Bayesian classification is not bad, in reality attributes are often related to each other, and the value of one attribute may depend on another. Semi-naive Bayesian classifiers were introduced to relax this independence assumption.

To limit the size of the computation, we assume that each attribute depends on at most one other attribute. In this way, we can describe the real situation more accurately.

The formula becomes:

P(c|x) ∝ P(c) × P(x1|c, pa1) × P(x2|c, pa2) × ... × P(xd|c, pad)

Here pai denotes the attribute that xi depends on, called the parent attribute of xi.

We can use the following methods to determine the parent attribute:

SPODE method: This approach assumes that all attributes depend on a common parent attribute (the "super-parent").
TAN method: A maximum weighted spanning tree determines the single attribute that each attribute depends on (see the code sketch after Image 1 below):
•First, compute the mutual information between each pair of attributes and use it as the weight between them.
•Build a complete graph over the attributes whose edge weights are the mutual information values just obtained, then use a maximum weighted spanning tree algorithm to obtain the graph's maximum spanning tree.
•Pick a root variable and orient the edges to turn the tree into a directed graph.
•Add a directed edge from the class variable y to each attribute.

Image 1: Attribute dependencies of the three methods
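Below is a rough sketch of the TAN construction steps using NumPy and NetworkX (the discrete dataset is made up, and this only builds the attribute dependency tree, not the full classifier; TAN proper uses mutual information conditioned on the class, while plain mutual information is used here for brevity):

```python
import itertools
import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

# Made-up discrete attribute matrix: rows are samples, columns are attributes.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 4))

# Step 1: mutual information between every pair of attributes as edge weights.
G = nx.Graph()
for i, j in itertools.combinations(range(X.shape[1]), 2):
    G.add_edge(i, j, weight=mutual_info_score(X[:, i], X[:, j]))

# Step 2: maximum weighted spanning tree of the complete attribute graph.
tree = nx.maximum_spanning_tree(G)

# Step 3: pick a root and orient the edges to get each attribute's parent.
root = 0
directed = nx.bfs_tree(tree, root)
parents = {child: parent for parent, child in directed.edges()}
print("parent attribute of each attribute:", parents)
# Step 4 (not shown): add a directed edge from the class variable y to every attribute.
```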

3. Application Examples of a Bayesian Algorithm

Here are some practical examples to illustrate the universality of the Bayesian approach. All of the below examples focus on machine learning.

3.1 Chinese Word Segmentation

Wu Jun, a researcher at Google, describes in "The Beauty of Mathematics" series how Chinese word segmentation can be done. Here, we will only introduce the core of the idea.

Because the Chinese language does not have spaces, there can be considerable difficulty in determining where a word ends and another word begins. For example, in the given sentence/word string - 南京市长江大桥 (Nanjing City Yangtze River Bridge) - there is no clear differentiation. What is the most reliable way to perform segmentation on this word string so that we can parse it correctly? We have the following choices available to us:

•Nanjing City / Yangtze River Bridge
•Nanjing / Mayor / River Bridge

Which one of these choices is more correct?

We use the Bayesian formula to formalize the problem, making X a string (sentence) and Y a word string (a particular hypothesis for correct word segmentation). We just need to find the Y that maximizes P(Y|X):

P(Y|X) ∝ P(Y) × P(X|Y)

In natural language: the probability of a candidate segmentation (word string) times the probability of generating our sentence from that word string. Going one step further, we can easily see that P(X|Y) is approximately equal to 1, because any hypothesized word string can generate our sentence exactly (just drop the delimiters between the words). Therefore, maximizing P(Y|X) reduces to maximizing P(Y), that is, finding the word string with the maximum probability. So how do we compute the probability of a word string W1, W2, W3, W4, ...?

According to the expansion formula of the joint probability, we know that P(W1, W2, W3, W4, ...) = P(W1) P(W2|W1) P(W3|W1, W2) P(W4|W1, W2, W3) ... Therefore, we can express the whole joint probability as a product of a series of conditional probabilities. Unfortunately, as the number of conditioning words grows toward n-1, as in P(Wn|Wn-1, Wn-2, ..., W1), the data sparsity problem becomes progressively more serious: no matter how large the corpus, we cannot count on getting a reliable estimate of P(Wn|Wn-1, Wn-2, ..., W1).

To alleviate this problem, computer scientists use the "naive" assumption: the probability of a word appearing in a sentence depends only on a finite number k of preceding words (k usually does not exceed 3). If a word depends only on the single word before it, we get the 2-gram language model; the same idea gives the 3-gram and 4-gram models.

This is the so-called "limited horizon" assumption. Although you may initially find this assumption somewhat idealistic, the results are often very strong. The conditional independence assumption used in the naive Bayesian method below is of exactly the same kind, and we will see further why such an idealized hypothesis can lead to powerful results.

With this assumption, we can rewrite the product as P(W1) P(W2|W1) P(W3|W2) P(W4|W3) ... (supposing each word relies only on the single word in front of it). Statistically, P(W2|W1) is no longer plagued by data sparsity. Returning to the example 南京市长江大桥: a simple greedy longest-match segmentation may give not Nanjing City / Yangtze River Bridge but instead Nanjing Mayor / River Bridge. However, if we use the Bayesian approach to segment the words (say, with a 3-gram model), then because "Nanjing Mayor" followed by "River Bridge" has a corpus frequency of 0, the algorithm judges that segmentation's probability to be 0. As a result, the correct word string Nanjing City / Yangtze River Bridge wins out.
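Here is a minimal sketch of this idea in Python: we score two candidate segmentations with a 2-gram model estimated from a tiny, invented corpus (the corpus and counts are purely illustrative, and Laplace smoothing is added so unseen pairs get a small rather than zero probability):

```python
from collections import Counter

# A tiny invented corpus of already-segmented sentences (lists of words).
corpus = [["南京市", "长江大桥", "通车"],
          ["长江大桥", "位于", "南京市"],
          ["南京", "市长", "讲话"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
V = len(unigrams)

def p(word, prev=None):
    # 2-gram probability with Laplace smoothing; P(W1) uses the unigram count.
    if prev is None:
        return (unigrams[word] + 1) / (sum(unigrams.values()) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def score(words):
    # P(W1, W2, ...) ~= P(W1) * P(W2|W1) * P(W3|W2) * ...
    prob = p(words[0])
    for prev, w in zip(words, words[1:]):
        prob *= p(w, prev)
    return prob

for cand in (["南京市", "长江大桥"], ["南京", "市长", "江大桥"]):
    print("/".join(cand), score(cand))   # the correct segmentation scores higher
```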

3.2 Bayes Spam Filters

For a given email, we want to determine whether the system should send it to the spam box. As before, we use the variable D to represent the given email; note that D is made up of N words. We use h+ to represent spam and h- to represent normal mail. We can describe the problem formally as follows:

P(h+|D) = P(h+) * P(D|h+) / P(D)
P(h-|D) = P(h-) * P(D|h-) / P(D)

The two priors P(h+) and P(h-) are easy enough to find: we simply calculate the proportion of spam and of normal messages in the inbox. However, P(D|h+) is not easy to find, because D contains N words d1, d2, d3, and so on, so P(D|h+) = P(d1, d2, ..., dn|h+). Once again, we have encountered the problem of data sparsity. Why do we say that?

P(d1, d2, ..., dn|h+) is the probability that a spam message is exactly identical to the incoming message, which is vanishingly small, because every message is different and countless new emails keep arriving. This is data sparsity: no matter how many emails the training database contains, it will never be large enough for an incoming email to be identical to one already in it. So how do we compute P(d1, d2, ..., dn|h+)?

We expand P(d1, d2, ..., dn|h+) to P(d1|h+) P(d2|d1, h+) P(d3|d2, d1, h+) ... This formula should look familiar by now, and here we use an even more radical assumption: we assume that di is completely independent of di-1 (and of the other words), so the expression reduces to P(d1|h+) P(d2|h+) P(d3|h+) ... This is the so-called conditional independence assumption, and it is also what makes the naive Bayesian approach "naive." With it, calculating P(d1|h+) P(d2|h+) P(d3|h+) ... becomes simple: we only need to count how frequently each word di appears in previously identified spam.
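Here is a minimal sketch of such a word-counting spam filter (the toy labeled emails are invented for illustration, and Laplace smoothing is applied as discussed earlier so unseen words do not zero out the whole product):

```python
from collections import Counter
import math

# Toy labeled emails: (list of words, label), invented for illustration.
emails = [(["win", "cash", "now"], "+"), (["cheap", "cash", "offer"], "+"),
          (["meeting", "at", "noon"], "-"), (["project", "report", "attached"], "-")]

word_counts = {"+": Counter(), "-": Counter()}
class_counts = Counter()
for words, label in emails:
    class_counts[label] += 1
    word_counts[label].update(words)

vocab = {w for words, _ in emails for w in words}

def log_posterior(words, label):
    # log P(h) + sum_i log P(di|h), with Laplace smoothing for unseen words.
    total = sum(word_counts[label].values())
    logp = math.log(class_counts[label] / sum(class_counts.values()))
    for w in words:
        logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return logp

incoming = ["cheap", "cash", "now"]
label = max(("+", "-"), key=lambda h: log_posterior(incoming, h))
print("spam" if label == "+" else "normal")
```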

From the above, we see that Bayesian inference can never give a definitive positive answer, because it cannot enumerate every possibility. However, after a great deal of testing, if the test results keep coming out correct, we gain confidence in the algorithm (even though we have not formally proved its accuracy). In fact, as new test results come in, the number of errors produced by the algorithm gradually decreases, and the algorithm becomes more trustworthy.

In our daily lives, we often make use of Bayes' theorem, intentionally or unintentionally. From an algorithmic point of view, if you want to understand the underlying principles, you will also need to study mathematical statistics in more depth and put it into practice.

4. Conclusion

We have seen how the Bayesian classification algorithm helps us solve some real-world problems. Applications built on this algorithm can become progressively better over time. Hopefully, you have enjoyed this article. To learn more about Alibaba Cloud and its offerings, please visit www.alibabacloud.com
