A Brief Overview of the ID3 and C4.5 Decision Tree Classification Algorithms - Alibaba Cloud Developer Community

2017-04-05 1645

Let's begin with the ID3 decision tree.
The ID3 algorithm tries to obtain the largest information gain at each step when growing a decision tree. The information gain is defined as

$$\mathrm{Gain}(A) = I(s_1, s_2, \dots, s_m) - E(A)$$

where $I$ is the information entropy of a given sample set,

$$I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

and $E(A)$ is the information entropy of the subsets obtained by partitioning on attribute $A = (a_1, a_2, \dots, a_v)$,

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj})$$

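Moreover, $p_i$ is the probability of a sample belonging to class $C_i$, which can be estimated as $p_i = s_i / |S|$, and $p_{ij}$ is the probability of a sample belonging to class $C_i$ given the attribute value $A = a_j$, i.e. $p_{ij} = s_{ij} / |S_j|$. Here $s_i$ is the number of samples in class $C_i$, and $s_{ij}$ is the number of samples of class $C_i$ in the subset $S_j$.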
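
As a minimal sketch of the definitions above, the following Python computes $I$, $E(A)$ and $\mathrm{Gain}(A)$ directly from class labels. The function names (`entropy`, `expected_entropy`, `information_gain`) and the toy data are illustrative assumptions, not part of the original article.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """I(s_1, ..., s_m): entropy of a list of class labels, with p_i = s_i / |S|."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def expected_entropy(values, labels):
    """E(A): entropy of each subset S_j (where A = a_j), weighted by |S_j| / |S|."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    total = len(labels)
    return sum(len(sub) / total * entropy(sub) for sub in subsets.values())


def information_gain(values, labels):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return entropy(labels) - expected_entropy(values, labels)


# Hypothetical toy data: one attribute column and the matching class labels.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play = ["no", "no", "yes", "no", "yes", "yes"]

print(round(entropy(play), 3))                    # 1.0
print(round(information_gain(outlook, play), 3))  # 0.667
```

In this toy data only the "rain" subset is mixed, so $E(A) = \frac{2}{6} \cdot 1 = 0.333$ and the gain is $1 - 0.333 = 0.667$.
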
The ID3 algorithm can be summarized as follows:

1. For every attribute A, calculate its information gain Gain(A).
2. Pick the attribute with the largest Gain(A) as the root node or internal node.
3. Remove the chosen attribute A from the candidates, and for every value aj of attribute A, work out the next node to grow on the corresponding subset.
4. Repeat steps 1~3 until each subset contains only one label/class Ci.
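
The steps above can be sketched as a short recursive function. This is only an illustration under stated assumptions: samples are represented as dictionaries, the tree as nested dictionaries, and the names `gain` and `id3` are hypothetical. The majority-vote fallback used when the attributes run out is also an assumption, since the steps above only describe stopping on pure subsets.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, labels, attr):
    """Gain(A) for attribute `attr`; each row is a dict mapping attribute -> value."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    if len(set(labels)) == 1:      # step 4: the subset has a single class, stop
        return labels[0]
    if not attrs:                  # assumption: majority vote when no attribute is left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))  # steps 1 and 2
    remaining = [a for a in attrs if a != best]             # step 3: drop the used attribute
    tree = {best: {}}
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return tree


# Hypothetical toy data set.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["no", "no", "yes", "no", "yes", "yes"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
# {'outlook': {'overcast': 'yes', 'rain': {'windy': {'no': 'yes', 'yes': 'no'}}, 'sunny': 'no'}}
```

Here "outlook" has the larger gain, so it becomes the root; only the "rain" branch is still mixed and is split again on "windy".
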

ID3 is an early machine learning algorithm, created in 1979 and based on information entropy. However, it has several problems:

1. ID3 prefers attributes with more values, even when such an attribute is not the optimal one.
2. ID3 has to calculate the information entropy for every value of every attribute. As a result, it often grows many levels and branches that cover very few samples, so the tree tends to overfit the training set and classify the test set poorly.
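
The first problem can be illustrated with a small sketch. The data below is hypothetical: `record_id` is a made-up attribute with a unique value per sample, so it carries no predictive meaning, yet every subset it produces is trivially pure and its gain equals the full entropy of the labels.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Gain(A) = I(S) - E(A) for one attribute column."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    total = len(labels)
    expected = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - expected


# Hypothetical data: 8 samples, 4 per class.
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
humidity = ["low", "low", "low", "high", "high", "high", "high", "high"]
record_id = list(range(8))  # unique per sample, useless for prediction

print(round(information_gain(humidity, labels), 3))   # 0.549
print(round(information_gain(record_id, labels), 3))  # 1.0
```

ID3 would therefore split on `record_id` first, producing eight single-sample leaves that cannot classify any unseen record.
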

C4.5 decision tree
The C4.5 algorithm uses Gain Ratio instead of Gain to select attributes.

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}$$

where $\mathrm{Gain}(S, A)$ is simply $\mathrm{Gain}(A)$ from ID3, and $\mathrm{SplitInfo}(S, A)$ is defined as

$$\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{c} \frac{|s_i|}{|S|} \log_2\left(\frac{|s_i|}{|S|}\right)$$

in which $s_1$ to $s_c$ are the sample subsets produced by the $c$ values of attribute $A$.
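
A minimal sketch of the gain ratio, assuming the same list-based representation as before. The names `split_info` and `gain_ratio`, the toy data, and the choice to return 0 when SplitInfo is 0 (an attribute with a single value) are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Gain(S, A), identical to Gain(A) in ID3."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    total = len(labels)
    expected = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - expected


def split_info(values):
    """SplitInfo(S, A) = -sum(|s_i|/|S| * log2(|s_i|/|S|)) over the c values of A."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(values, labels):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    si = split_info(values)
    if si == 0:  # assumption: an attribute with one value cannot split anything
        return 0.0
    return information_gain(values, labels) / si


# Hypothetical data: 8 samples, 4 per class.
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
humidity = ["low", "low", "low", "high", "high", "high", "high", "high"]
record_id = list(range(8))  # unique per sample, so SplitInfo = log2(8) = 3

print(round(gain_ratio(humidity, labels), 3))   # 0.575
print(round(gain_ratio(record_id, labels), 3))  # 0.333
```

Although `record_id` has the larger raw gain (1.0 versus about 0.549), its SplitInfo of 3 pulls its gain ratio below that of `humidity`, so C4.5 would choose `humidity` in this example.
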
