LGBMClassifier, XGBClassifier and CatBoostClassifier: a detailed walkthrough of how feature_importances_ is computed in the source code
LGBMClassifier
The LGBMClassifier.feature_importances_ property computes importance with the 'split' method by default.
LGBMClassifier.feature_importances_ (importance_type='split' by default):

```python
@property
def feature_importances_(self):
    """Get feature importances.

    Note
    ----
    Feature importance in sklearn interface used to normalize to 1,
    it's deprecated after 2.0.4 and is the same as Booster.feature_importance() now.
    ``importance_type`` attribute is passed to the function
    to configure the type of importance values to be extracted.
    """
    if self._n_features is None:
        raise LGBMNotFittedError('No feature_importances found. Need to call fit beforehand.')
    return self.booster_.feature_importance(importance_type=self.importance_type)
```

The booster_ property it relies on simply returns the fitted low-level Booster:

```python
@property
def booster_(self):
    """Get the underlying lightgbm Booster of this model."""
    if self._Booster is None:
        raise LGBMNotFittedError('No booster found. Need to call fit beforehand.')
    return self._Booster
```

Booster.num_feature, used below to size the result array:

```python
def num_feature(self):
    """Get number of features.

    Returns
    -------
    num_feature : int
        The number of features.
    """
    out_num_feature = ctypes.c_int(0)
    _safe_call(_LIB.LGBM_BoosterGetNumFeature(
        self.handle,
        ctypes.byref(out_num_feature)))
    return out_num_feature.value
```

self.booster_.feature_importance(importance_type=self.importance_type) dispatches to Booster.feature_importance:

```python
def feature_importance(self, importance_type='split', iteration=None):
    """Get feature importances.

    Parameters
    ----------
    importance_type : string, optional (default="split")
        How the importance is calculated.
        If "split", result contains numbers of times the feature is used in a model.
        If "gain", result contains total gains of splits which use the feature.
    iteration : int or None, optional (default=None)
        Limit number of iterations in the feature importance calculation.
        If None, if the best iteration exists, it is used; otherwise, all trees are used.
        If <= 0, all trees are used (no limits).

    Returns
    -------
    result : numpy array
        Array with feature importances.
    """
    if iteration is None:
        iteration = self.best_iteration
    if importance_type == "split":
        importance_type_int = 0
    elif importance_type == "gain":
        importance_type_int = 1
    else:
        importance_type_int = -1
    result = np.zeros(self.num_feature(), dtype=np.float64)
    _safe_call(_LIB.LGBM_BoosterFeatureImportance(
        self.handle,
        ctypes.c_int(iteration),
        ctypes.c_int(importance_type_int),
        result.ctypes.data_as(ctypes.POINTER(ctypes.c_double))))
    # 'split' importances are integer use-counts; 'gain' importances stay float.
    if importance_type_int == 0:
        return result.astype(int)
    else:
        return result
```
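A minimal usage sketch contrasting the two importance types (the synthetic data and settings below are illustrative, not from the original post):

```python
import numpy as np
from lightgbm import LGBMClassifier

# Synthetic binary-classification data (illustrative).
rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Default 'split': integer counts of how often each feature is used in a split.
clf_split = LGBMClassifier(importance_type='split').fit(X, y)
print(clf_split.feature_importances_)

# 'gain': total gain contributed by splits on each feature (floats).
clf_gain = LGBMClassifier(importance_type='gain').fit(X, y)
print(clf_gain.feature_importances_)
```

Note that, as the docstring above says, these values are no longer normalized to sum to 1.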
XGBClassifier
The XGBClassifier.feature_importances_ property computes importance with the 'weight' method by default.
XGBClassifier.feature_importances_ (importance_type="weight" by default; 'gain', 'cover', 'total_gain' and 'total_cover' are also accepted):

```python
@property
def feature_importances_(self):
    """
    Feature importances property

    .. note:: Feature importance is defined only for tree boosters

        Feature importance is only defined when the decision tree model is chosen as base
        learner (`booster=gbtree`). It is not defined for other base learner types,
        such as linear learners (`booster=gblinear`).

    Returns
    -------
    feature_importances_ : array of shape ``[n_features]``
    """
    if getattr(self, 'booster', None) is not None and self.booster != 'gbtree':
        raise AttributeError('Feature importance is not defined for Booster type {}'
                             .format(self.booster))
    b = self.get_booster()
    score = b.get_score(importance_type=self.importance_type)
    # Align scores to the booster's feature order; unused features get 0.
    all_features = [score.get(f, 0.) for f in b.feature_names]
    all_features = np.array(all_features, dtype=np.float32)
    # Unlike LightGBM, XGBoost normalizes so the importances sum to 1.
    return all_features / all_features.sum()
```
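A minimal usage sketch (synthetic data; names and settings are illustrative) showing the normalized property next to the raw counts it is derived from:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = (X[:, 0] > 0.5).astype(int)

# Ask the sklearn wrapper for weight-based importances explicitly.
clf = XGBClassifier(n_estimators=20, importance_type='weight').fit(X, y)
print(clf.feature_importances_)        # normalized: sums to 1.0

# The raw, unnormalized split counts behind it; features that are
# never used in any split are simply absent from the dict.
print(clf.get_booster().get_score(importance_type='weight'))
```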
Internally, feature_importances_ delegates to Booster.get_score, which does the actual counting by parsing the text dump of the trees:

```python
def get_score(self, fmap='', importance_type='weight'):
    """Get feature importance of each feature.
    Importance type can be defined as:

    * 'weight': the number of times a feature is used to split the data across all trees.
    * 'gain': the average gain across all splits the feature is used in.
    * 'cover': the average coverage across all splits the feature is used in.
    * 'total_gain': the total gain across all splits the feature is used in.
    * 'total_cover': the total coverage across all splits the feature is used in.

    .. note:: Feature importance is defined only for tree boosters

        Feature importance is only defined when the decision tree model is chosen as base
        learner (`booster=gbtree`). It is not defined for other base learner types,
        such as linear learners (`booster=gblinear`).

    Parameters
    ----------
    fmap: str (optional)
        The name of feature map file.
    importance_type: str, default 'weight'
        One of the importance types defined above.
    """
    if getattr(self, 'booster', None) is not None and self.booster not in {'gbtree', 'dart'}:
        raise ValueError('Feature importance is not defined for Booster type {}'
                         .format(self.booster))

    allowed_importance_types = ['weight', 'gain', 'cover', 'total_gain', 'total_cover']
    if importance_type not in allowed_importance_types:
        msg = ("importance_type mismatch, got '{}', expected one of " +
               repr(allowed_importance_types))
        raise ValueError(msg.format(importance_type))

    # if it's weight, then omap stores the number of missing values
    if importance_type == 'weight':
        # do a simpler tree dump to save time
        trees = self.get_dump(fmap, with_stats=False)

        fmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # extract feature name from string between []
                fid = arr[1].split(']')[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                else:
                    fmap[fid] += 1

        return fmap

    else:
        average_over_splits = True
        if importance_type == 'total_gain':
            importance_type = 'gain'
            average_over_splits = False
        elif importance_type == 'total_cover':
            importance_type = 'cover'
            average_over_splits = False

        trees = self.get_dump(fmap, with_stats=True)

        importance_type += '='
        fmap = {}
        gmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # look for the closing bracket, extract only info within that bracket
                fid = arr[1].split(']')

                # extract gain or cover from string after closing bracket
                g = float(fid[1].split(importance_type)[1].split(',')[0])

                # extract feature name from string before closing bracket
                fid = fid[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                    gmap[fid] = g
                else:
                    fmap[fid] += 1
                    gmap[fid] += g

        # calculate average value (gain/cover) for each feature
        if average_over_splits:
            for fid in gmap:
                gmap[fid] = gmap[fid] / fmap[fid]

        return gmap
```
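The 'weight' branch literally counts how many times each feature name appears in a split node of the dumped trees. A standalone sketch of that parsing step (the dump string below is illustrative, in the same format get_dump() produces):

```python
# One tree in xgboost text-dump format (illustrative fragment).
tree_dump = (
    "0:[f2<0.5] yes=1,no=2,missing=1\n"
    "\t1:[f0<1.5] yes=3,no=4,missing=3\n"
    "\t\t3:leaf=0.1\n"
    "\t\t4:leaf=-0.2\n"
    "\t2:leaf=0.3\n"
)

counts = {}
for line in tree_dump.split('\n'):
    arr = line.split('[')
    if len(arr) == 1:      # leaf node or empty line: no '[feature<threshold]' part
        continue
    # Take what sits between '[' and ']', then drop the '<threshold' tail.
    fid = arr[1].split(']')[0].split('<')[0]
    counts[fid] = counts.get(fid, 0) + 1

print(counts)  # {'f2': 1, 'f0': 1}
```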
CatBoostClassifier
The CatBoostClassifier.feature_importances_ property picks the computation method based on is_groupwise_metric(loss): LossFunctionChange for groupwise (ranking) losses, PredictionValuesChange otherwise.
CatBoostClassifier.feature_importances_:

```python
@property
def feature_importances_(self):
    loss = self._object._get_loss_function_name()
    if loss and is_groupwise_metric(loss):
        # Groupwise (ranking) losses: LossFunctionChange importances.
        return np.array(getattr(self, "_loss_value_change", None))
    else:
        # All other losses: PredictionValuesChange importances.
        return np.array(getattr(self, "_prediction_values_change", None))
```
For LossFunctionChange, CatBoost simply takes the difference between the metric (loss function) obtained with the model in the normal scenario (i.e., with the feature included) and the metric obtained with a model lacking that feature, built approximately by removing the feature from all trees in the ensemble. The larger the difference, the more important the feature.
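A minimal usage sketch (synthetic data; iterations and other settings are illustrative). With a plain Logloss objective the property returns PredictionValuesChange values, and the same numbers can be requested explicitly through get_feature_importance:

```python
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = (X[:, 0] + X[:, 3] > 1).astype(int)

model = CatBoostClassifier(iterations=50, verbose=False).fit(X, y)

# Logloss is not a groupwise metric -> PredictionValuesChange.
print(model.feature_importances_)

# Equivalent explicit call; types such as 'LossFunctionChange' or
# 'ShapValues' can be requested the same way (some need a data Pool).
print(model.get_feature_importance(type='PredictionValuesChange'))
```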