sklearn XGBModel: A detailed guide to the feature_importances_ property and the plot_importance function



Contents

feature_importances_

1. Explanation of the feature_importances_ property

2. Source code of feature_importances_

plot_importance

1. Explanation of the plot_importance method

2. Source code of plot_importance in XGBModel


 

Related articles

ML / xgboost: Interpreting the get_score(importance_type=self.importance_type) method in the xgboost library's core.py

ML / xgboost: Interpreting the xgboost.plot_importance() function

 

 

feature_importances_

1. Explanation of the feature_importances_ property

XGBRegressor().feature_importances_

Parameters

  • Note: feature importance is only defined for tree boosters, i.e. when a decision tree model is chosen as the base learner (`booster=gbtree`). It is not defined for other base learner types, such as linear learners (`booster=gblinear`).

Returns

  • feature_importances_ : array of shape ``[n_features]``

Note: importance_type : string, default "gain". The feature importance type for the feature_importances_ property: one of "gain", "weight", "cover", "total_gain" or "total_cover".
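For orientation before the source listing, here is a minimal usage sketch (my illustration, not part of the library source): it trains an XGBRegressor on a synthetic scikit-learn dataset, whose size and the model's hyperparameters are arbitrary choices, and reads the normalized importances.

# Minimal sketch: one importance value per feature, normalized to sum to 1.
import xgboost as xgb
from sklearn.datasets import make_regression

# Toy data: 200 samples, 5 features (arbitrary illustrative sizes).
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

model = xgb.XGBRegressor(n_estimators=50, max_depth=3)
model.fit(X, y)

print(model.feature_importances_)        # array of shape [n_features]
print(model.feature_importances_.sum())  # ~1.0 (the scores are normalized)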

 

2. Source code of feature_importances_

class XGBModel(XGBModelBase):
    # pylint: disable=too-many-arguments, too-many-instance-attributes, invalid-name
    """Implementation of the Scikit-Learn API for XGBoost.

    Parameters
    ----------
    max_depth : int
        Maximum tree depth for base learners.
    learning_rate : float
        Boosting learning rate (xgb's "eta")
    n_estimators : int
        Number of boosted trees to fit.
    silent : boolean
        Whether to print messages while running boosting.
    objective : string or callable
        Specify the learning task and the corresponding learning objective or
        a custom objective function to be used (see note below).
    booster : string
        Specify which booster to use: gbtree, gblinear or dart.
    nthread : int
        Number of parallel threads used to run xgboost.  (Deprecated, please use ``n_jobs``)
    n_jobs : int
        Number of parallel threads used to run xgboost.  (replaces ``nthread``)
    gamma : float
        Minimum loss reduction required to make a further partition on a leaf node of the tree.
    min_child_weight : int
        Minimum sum of instance weight (hessian) needed in a child.
    max_delta_step : int
        Maximum delta step we allow each tree's weight estimation to be.
    subsample : float
        Subsample ratio of the training instances.
    colsample_bytree : float
        Subsample ratio of columns when constructing each tree.
    colsample_bylevel : float
        Subsample ratio of columns for each split, in each level.
    reg_alpha : float (xgb's alpha)
        L1 regularization term on weights
    reg_lambda : float (xgb's lambda)
        L2 regularization term on weights
    scale_pos_weight : float
        Balancing of positive and negative weights.
    base_score :
        The initial prediction score of all instances, global bias.
    seed : int
        Random number seed.  (Deprecated, please use random_state)
    random_state : int
        Random number seed.  (replaces seed)
    missing : float, optional
        Value in the data which needs to be present as a missing value. If
        None, defaults to np.nan.
    importance_type : string, default "gain"
        The feature importance type for the feature_importances_ property: either "gain",
        "weight", "cover", "total_gain" or "total_cover".
    \*\*kwargs : dict, optional
        Keyword arguments for XGBoost Booster object.  Full documentation of parameters can
        be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.
        Attempting to set a parameter via the constructor args and \*\*kwargs dict simultaneously
        will result in a TypeError.

        .. note:: \*\*kwargs unsupported by scikit-learn

            \*\*kwargs is unsupported by scikit-learn.  We do not guarantee that parameters
            passed via this argument will interact properly with scikit-learn.

    Note
    ----
    A custom objective function can be provided for the ``objective``
    parameter. In this case, it should have the signature
    ``objective(y_true, y_pred) -> grad, hess``:

    y_true : array_like of shape [n_samples]
        The target values
    y_pred : array_like of shape [n_samples]
        The predicted values

    grad : array_like of shape [n_samples]
        The value of the gradient for each sample point.
    hess : array_like of shape [n_samples]
        The value of the second derivative for each sample point
    """

    def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
                 silent=True, objective="reg:linear", booster='gbtree',
                 n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0,
                 subsample=1, colsample_bytree=1, colsample_bylevel=1,
                 reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                 base_score=0.5, random_state=0, seed=None, missing=None,
                 importance_type="gain", **kwargs):
        if not SKLEARN_INSTALLED:
            raise XGBoostError('sklearn needs to be installed in order to use this module')
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        self.n_estimators = n_estimators
        self.silent = silent
        self.objective = objective
        self.booster = booster
        self.gamma = gamma
        self.min_child_weight = min_child_weight
        self.max_delta_step = max_delta_step
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.colsample_bylevel = colsample_bylevel
        self.reg_alpha = reg_alpha
        self.reg_lambda = reg_lambda
        self.scale_pos_weight = scale_pos_weight
        self.base_score = base_score
        self.missing = missing if missing is not None else np.nan
        self.kwargs = kwargs
        self._Booster = None
        self.seed = seed
        self.random_state = random_state
        self.nthread = nthread
        self.n_jobs = n_jobs
        self.importance_type = importance_type

    @property
    def feature_importances_(self):
        """
        Feature importances property

        .. note:: Feature importance is defined only for tree boosters

        Feature importance is only defined when the decision tree model is chosen as base
        learner (``booster=gbtree``). It is not defined for other base learner types, such
        as linear learners (``booster=gblinear``).

        Returns
        -------
        feature_importances_ : array of shape ``[n_features]``

        """
        if getattr(self, 'booster', None) is not None and self.booster != 'gbtree':
            raise AttributeError(
                'Feature importance is not defined for Booster type {}'.format(self.booster))
        b = self.get_booster()
        # Raw per-feature scores from the underlying Booster, then normalized.
        score = b.get_score(importance_type=self.importance_type)
        all_features = [score.get(f, 0.) for f in b.feature_names]
        all_features = np.array(all_features, dtype=np.float32)
        return all_features / all_features.sum()
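The closing lines of the property are easy to reproduce by hand. The sketch below (an illustration under the same assumptions as above, not library code) shows that feature_importances_ is simply Booster.get_score() normalized to sum to one; feature names such as 'f0' are the defaults xgboost assigns to plain NumPy input.

# Sketch: recompute feature_importances_ manually from Booster.get_score().
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = xgb.XGBRegressor(n_estimators=20).fit(X, y)

b = model.get_booster()
score = b.get_score(importance_type=model.importance_type)  # e.g. {'f0': ..., ...}
raw = np.array([score.get(f, 0.) for f in b.feature_names], dtype=np.float32)

print(raw / raw.sum())             # manual normalization ...
print(model.feature_importances_)  # ... matches the property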

 

 

 

 

plot_importance

1. Explanation of the plot_importance method

Purpose: plot feature importance based on the fitted trees.

Parameters

  •    booster : Booster, XGBModel or dict. Booster or XGBModel instance, or dict taken by Booster.get_fscore()
  •    ax : matplotlib Axes, default None. Target axes instance. If None, new figure and axes will be created.
  •    grid : bool, default True (on). Turn the axes grids on or off.
  •    importance_type : str, default "weight". How the importance is calculated: either "weight", "gain", or "cover"
           * "weight" is the number of times a feature appears in a tree
           * "gain" is the average gain of splits which use the feature
           * "cover" is the average coverage of splits which use the feature, where coverage is defined as the number of samples affected by the split
  •    max_num_features : int, default None. Maximum number of top features displayed on the plot. If None, all features will be displayed.
  •    height : float, default 0.2. Bar height, passed to ax.barh()
  •    xlim : tuple, default None. Tuple passed to axes.xlim()
  •    ylim : tuple, default None. Tuple passed to axes.ylim()
  •    title : str, default "Feature importance". Axes title. To disable, pass None.
  •    xlabel : str, default "F score". X axis title label. To disable, pass None.
  •    ylabel : str, default "Features". Y axis title label. To disable, pass None.
  •    show_values : bool, default True. Show values on plot. To disable, pass False.
  •    kwargs : other keywords passed to ax.barh()

Returns

  • ax : matplotlib Axes
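Putting the parameters together, here is a hedged usage sketch (illustrative, not from the library source): it ranks features by average split gain instead of the default "weight" and caps the plot at the five strongest features; the toy dataset and styling values are arbitrary choices.

# Sketch: visualize a fitted model's feature importances.
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=10, random_state=0)
model = xgb.XGBRegressor(n_estimators=50).fit(X, y)

# Rank by average split gain; show only the 5 strongest features.
ax = xgb.plot_importance(model, importance_type='gain',
                         max_num_features=5, height=0.4, xlabel='Gain')
plt.show()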

 

2. Source code of plot_importance in XGBModel

# coding: utf-8
# pylint: disable=too-many-locals, too-many-arguments, invalid-name,
# pylint: disable=too-many-branches
"""Plotting Library."""
from __future__ import absolute_import

import re
from io import BytesIO
import numpy as np
from .core import Booster
from .sklearn import XGBModel


def plot_importance(booster, ax=None, height=0.2,
                    xlim=None, ylim=None, title='Feature importance',
                    xlabel='F score', ylabel='Features',
                    importance_type='weight', max_num_features=None,
                    grid=True, show_values=True, **kwargs):
    """Plot importance based on fitted trees.

    Parameters
    ----------
    booster : Booster, XGBModel or dict
        Booster or XGBModel instance, or dict taken by Booster.get_fscore()
    ax : matplotlib Axes, default None
        Target axes instance. If None, new figure and axes will be created.
    grid : bool, default True (on)
        Turn the axes grids on or off.
    importance_type : str, default "weight"
        How the importance is calculated: either "weight", "gain", or "cover"

        * "weight" is the number of times a feature appears in a tree
        * "gain" is the average gain of splits which use the feature
        * "cover" is the average coverage of splits which use the feature
          where coverage is defined as the number of samples affected by the split
    max_num_features : int, default None
        Maximum number of top features displayed on plot. If None, all features will be displayed.
    height : float, default 0.2
        Bar height, passed to ax.barh()
    xlim : tuple, default None
        Tuple passed to axes.xlim()
    ylim : tuple, default None
        Tuple passed to axes.ylim()
    title : str, default "Feature importance"
        Axes title. To disable, pass None.
    xlabel : str, default "F score"
        X axis title label. To disable, pass None.
    ylabel : str, default "Features"
        Y axis title label. To disable, pass None.
    show_values : bool, default True
        Show values on plot. To disable, pass False.
    kwargs :
        Other keywords passed to ax.barh()

    Returns
    -------
    ax : matplotlib Axes
    """
    # TODO: move this to compat.py
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError('You must install matplotlib to plot importance')

    # Accept a fitted sklearn wrapper, a raw Booster, or a precomputed dict.
    if isinstance(booster, XGBModel):
        importance = booster.get_booster().get_score(importance_type=importance_type)
    elif isinstance(booster, Booster):
        importance = booster.get_score(importance_type=importance_type)
    elif isinstance(booster, dict):
        importance = booster
    else:
        raise ValueError('tree must be Booster, XGBModel or dict instance')

    if len(importance) == 0:
        raise ValueError('Booster.get_score() results in empty')

    # Sort features by score; keep only the strongest max_num_features if set.
    tuples = [(k, importance[k]) for k in importance]
    if max_num_features is not None:
        tuples = sorted(tuples, key=lambda x: x[1])[-max_num_features:]
    else:
        tuples = sorted(tuples, key=lambda x: x[1])
    labels, values = zip(*tuples)

    if ax is None:
        _, ax = plt.subplots(1, 1)

    ylocs = np.arange(len(values))
    ax.barh(ylocs, values, align='center', height=height, **kwargs)

    if show_values is True:
        for x, y in zip(values, ylocs):
            ax.text(x + 1, y, x, va='center')

    ax.set_yticks(ylocs)
    ax.set_yticklabels(labels)

    if xlim is not None:
        if not isinstance(xlim, tuple) or len(xlim) != 2:
            raise ValueError('xlim must be a tuple of 2 elements')
    else:
        xlim = (0, max(values) * 1.1)
    ax.set_xlim(xlim)

    if ylim is not None:
        if not isinstance(ylim, tuple) or len(ylim) != 2:
            raise ValueError('ylim must be a tuple of 2 elements')
    else:
        ylim = (-1, len(values))
    ax.set_ylim(ylim)

    if title is not None:
        ax.set_title(title)
    if xlabel is not None:
        ax.set_xlabel(xlabel)
    if ylabel is not None:
        ax.set_ylabel(ylabel)
    ax.grid(grid)
    return ax


# Regular expressions for parsing the text dump of a single tree.
_NODEPAT = re.compile(r'(\d+):\[(.+)\]')
_LEAFPAT = re.compile(r'(\d+):(leaf=.+)')
_EDGEPAT = re.compile(r'yes=(\d+),no=(\d+),missing=(\d+)')
_EDGEPAT2 = re.compile(r'yes=(\d+),no=(\d+)')


def _parse_node(graph, text, condition_node_params, leaf_node_params):
    """parse dumped node"""
    match = _NODEPAT.match(text)
    if match is not None:
        node = match.group(1)
        graph.node(node, label=match.group(2), **condition_node_params)
        return node
    match = _LEAFPAT.match(text)
    if match is not None:
        node = match.group(1)
        graph.node(node, label=match.group(2), **leaf_node_params)
        return node
    raise ValueError('Unable to parse node: {0}'.format(text))


def _parse_edge(graph, node, text, yes_color='#0000FF', no_color='#FF0000'):
    """parse dumped edge"""
    try:
        match = _EDGEPAT.match(text)
        if match is not None:
            yes, no, missing = match.groups()
            if yes == missing:
                graph.edge(node, yes, label='yes, missing', color=yes_color)
                graph.edge(node, no, label='no', color=no_color)
            else:
                graph.edge(node, yes, label='yes', color=yes_color)
                graph.edge(node, no, label='no, missing', color=no_color)
            return
    except ValueError:
        pass
    match = _EDGEPAT2.match(text)
    if match is not None:
        yes, no = match.groups()
        graph.edge(node, yes, label='yes', color=yes_color)
        graph.edge(node, no, label='no', color=no_color)
        return
    raise ValueError('Unable to parse edge: {0}'.format(text))


def to_graphviz(booster, fmap='', num_trees=0, rankdir='UT',
                yes_color='#0000FF', no_color='#FF0000',
                condition_node_params=None, leaf_node_params=None, **kwargs):
    """Convert specified tree to graphviz instance. IPython can automatically plot the
    returned graphviz instance. Otherwise, you should call .render() method
    of the returned graphviz instance.

    Parameters
    ----------
    booster : Booster, XGBModel
        Booster or XGBModel instance
    fmap : str (optional)
        The name of feature map file
    num_trees : int, default 0
        Specify the ordinal number of target tree
    rankdir : str, default "UT"
        Passed to graphviz via graph_attr
    yes_color : str, default '#0000FF'
        Edge color when meets the node condition.
    no_color : str, default '#FF0000'
        Edge color when doesn't meet the node condition.
    condition_node_params : dict (optional)
        condition node configuration,
        {'shape': 'box',
         'style': 'filled,rounded',
         'fillcolor': '#78bceb'}
    leaf_node_params : dict (optional)
        leaf node configuration
        {'shape': 'box',
         'style': 'filled',
         'fillcolor': '#e48038'}
    kwargs :
        Other keywords passed to graphviz graph_attr

    Returns
    -------
    graph : graphviz.Digraph
    """
    if condition_node_params is None:
        condition_node_params = {}
    if leaf_node_params is None:
        leaf_node_params = {}

    try:
        from graphviz import Digraph
    except ImportError:
        raise ImportError('You must install graphviz to plot tree')

    if not isinstance(booster, (Booster, XGBModel)):
        raise ValueError('booster must be Booster or XGBModel instance')

    if isinstance(booster, XGBModel):
        booster = booster.get_booster()

    # Text dump of the requested tree, split into node and edge tokens.
    tree = booster.get_dump(fmap=fmap)[num_trees]
    tree = tree.split()

    kwargs = kwargs.copy()
    kwargs.update({'rankdir': rankdir})
    graph = Digraph(graph_attr=kwargs)

    for i, text in enumerate(tree):
        if text[0].isdigit():
            node = _parse_node(
                graph, text, condition_node_params=condition_node_params,
                leaf_node_params=leaf_node_params)
        else:
            if i == 0:
                # 1st string must be node
                raise ValueError('Unable to parse given string as tree')
            _parse_edge(graph, node, text, yes_color=yes_color,
                        no_color=no_color)

    return graph


def plot_tree(booster, fmap='', num_trees=0, rankdir='UT', ax=None, **kwargs):
    """Plot specified tree.

    Parameters
    ----------
    booster : Booster, XGBModel
        Booster or XGBModel instance
    fmap : str (optional)
        The name of feature map file
    num_trees : int, default 0
        Specify the ordinal number of target tree
    rankdir : str, default "UT"
        Passed to graphviz via graph_attr
    ax : matplotlib Axes, default None
        Target axes instance. If None, new figure and axes will be created.
    kwargs :
        Other keywords passed to to_graphviz

    Returns
    -------
    ax : matplotlib Axes

    """
    try:
        import matplotlib.pyplot as plt
        import matplotlib.image as image
    except ImportError:
        raise ImportError('You must install matplotlib to plot tree')

    if ax is None:
        _, ax = plt.subplots(1, 1)

    g = to_graphviz(booster, fmap=fmap, num_trees=num_trees,
                    rankdir=rankdir, **kwargs)

    # Render the graphviz graph to PNG bytes and display it on the axes.
    s = BytesIO()
    s.write(g.pipe(format='png'))
    s.seek(0)
    img = image.imread(s)

    ax.imshow(img)
    ax.axis('off')
    return ax
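Although this article focuses on plot_importance, the module listed above also ships to_graphviz and plot_tree. A hedged usage sketch follows; it requires the graphviz Python package plus the Graphviz system binaries in addition to matplotlib, and the output filename is a hypothetical choice.

# Sketch: render the first boosted tree, via matplotlib or graphviz directly.
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = xgb.XGBRegressor(n_estimators=10, max_depth=2).fit(X, y)

xgb.plot_tree(model, num_trees=0)  # draws tree 0 onto a matplotlib Axes
plt.show()

g = xgb.to_graphviz(model, num_trees=0)  # graphviz.Digraph instance
g.render('tree0')                        # writes tree0.pdf (hypothetical name)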

