sklearn XGBModel: A Detailed Guide to feature_importances_ and plot_importance

Overview: an introduction to the feature_importances_ property and the plot_importance function of XGBoost's scikit-learn API, with their native source code and usage notes.


Contents

feature_importances_

1. Explanation of the feature_importances_ property

2. Native source code of feature_importances_

plot_importance

1. Explanation of the plot_importance function

2. Native source code of plot_importance


 

Related articles

ML之xgboost: Interpreting the get_score(importance_type=self.importance_type) method in the core.py file of the xgboost library

ML之xgboost: Interpreting the xgboost.plot_importance() function

 

 

feature_importances_

1. Explanation of the feature_importances_ property

XGBRegressor().feature_importances_

Parameters

  • Note: feature importance is defined only for tree boosters. It is only available when a decision tree model is chosen as the base learner (`booster=gbtree`); it is not defined for other base learner types, such as linear learners (`booster=gblinear`).

Returns

  • feature_importances_ : array of shape ``[n_features]``

Note: importance_type (string, default "gain") is the feature importance type used by the feature_importances_ property: either "gain", "weight", "cover", "total_gain" or "total_cover".
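
A minimal usage sketch (the synthetic data below is made up for illustration; any fitted tree-booster model works the same way):

import numpy as np
from xgboost import XGBRegressor

# Synthetic regression data: 100 samples, 4 features.
X = np.random.rand(100, 4)
y = 3 * X[:, 0] + X[:, 1] + 0.1 * np.random.rand(100)

model = XGBRegressor(n_estimators=50, importance_type="gain")
model.fit(X, y)

# Defined only for tree boosters (booster='gbtree'); the values sum to 1.
print(model.feature_importances_)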

 

2. Native source code of feature_importances_

# Excerpt from xgboost/sklearn.py. In the upstream module, `np` is numpy, and
# XGBModelBase, SKLEARN_INSTALLED and XGBoostError come from xgboost's compat
# and core modules.
class XGBModel(XGBModelBase):
    # pylint: disable=too-many-arguments, too-many-instance-attributes, invalid-name
    """Implementation of the Scikit-Learn API for XGBoost.

    Parameters
    ----------
    max_depth : int
        Maximum tree depth for base learners.
    learning_rate : float
        Boosting learning rate (xgb's "eta")
    n_estimators : int
        Number of boosted trees to fit.
    silent : boolean
        Whether to print messages while running boosting.
    objective : string or callable
        Specify the learning task and the corresponding learning objective or
        a custom objective function to be used (see note below).
    booster: string
        Specify which booster to use: gbtree, gblinear or dart.
    nthread : int
        Number of parallel threads used to run xgboost.  (Deprecated, please use ``n_jobs``)
    n_jobs : int
        Number of parallel threads used to run xgboost.  (replaces ``nthread``)
    gamma : float
        Minimum loss reduction required to make a further partition on a leaf node of the tree.
    min_child_weight : int
        Minimum sum of instance weight(hessian) needed in a child.
    max_delta_step : int
        Maximum delta step we allow each tree's weight estimation to be.
    subsample : float
        Subsample ratio of the training instance.
    colsample_bytree : float
        Subsample ratio of columns when constructing each tree.
    colsample_bylevel : float
        Subsample ratio of columns for each split, in each level.
    reg_alpha : float (xgb's alpha)
        L1 regularization term on weights
    reg_lambda : float (xgb's lambda)
        L2 regularization term on weights
    scale_pos_weight : float
        Balancing of positive and negative weights.
    base_score:
        The initial prediction score of all instances, global bias.
    seed : int
        Random number seed.  (Deprecated, please use random_state)
    random_state : int
        Random number seed.  (replaces seed)
    missing : float, optional
        Value in the data which needs to be present as a missing value. If
        None, defaults to np.nan.
    importance_type: string, default "gain"
        The feature importance type for the feature_importances_ property: either "gain",
        "weight", "cover", "total_gain" or "total_cover".
    \*\*kwargs : dict, optional
        Keyword arguments for XGBoost Booster object.  Full documentation of parameters can
        be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.
        Attempting to set a parameter via the constructor args and \*\*kwargs dict simultaneously
        will result in a TypeError.

        .. note:: \*\*kwargs unsupported by scikit-learn

            \*\*kwargs is unsupported by scikit-learn.  We do not guarantee that parameters
            passed via this argument will interact properly with scikit-learn.

    Note
    ----
    A custom objective function can be provided for the ``objective``
    parameter. In this case, it should have the signature
    ``objective(y_true, y_pred) -> grad, hess``:

    y_true: array_like of shape [n_samples]
        The target values
    y_pred: array_like of shape [n_samples]
        The predicted values

    grad: array_like of shape [n_samples]
        The value of the gradient for each sample point.
    hess: array_like of shape [n_samples]
        The value of the second derivative for each sample point
    """

    def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
                 silent=True, objective="reg:linear", booster='gbtree',
                 n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0,
                 subsample=1, colsample_bytree=1, colsample_bylevel=1,
                 reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                 base_score=0.5, random_state=0, seed=None, missing=None,
                 importance_type="gain", **kwargs):
        if not SKLEARN_INSTALLED:
            raise XGBoostError('sklearn needs to be installed in order to use this module')
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        self.n_estimators = n_estimators
        self.silent = silent
        self.objective = objective
        self.booster = booster
        self.gamma = gamma
        self.min_child_weight = min_child_weight
        self.max_delta_step = max_delta_step
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.colsample_bylevel = colsample_bylevel
        self.reg_alpha = reg_alpha
        self.reg_lambda = reg_lambda
        self.scale_pos_weight = scale_pos_weight
        self.base_score = base_score
        self.missing = missing if missing is not None else np.nan
        self.kwargs = kwargs
        self._Booster = None
        self.seed = seed
        self.random_state = random_state
        self.nthread = nthread
        self.n_jobs = n_jobs
        self.importance_type = importance_type

    @property
    def feature_importances_(self):
        """
        Feature importances property

        .. note:: Feature importance is defined only for tree boosters

            Feature importance is only defined when the decision tree model is chosen as base
            learner (`booster=gbtree`). It is not defined for other base learner types, such
            as linear learners (`booster=gblinear`).

        Returns
        -------
        feature_importances_ : array of shape ``[n_features]``

        """
        if getattr(self, 'booster', None) is not None and self.booster != 'gbtree':
            raise AttributeError(
                'Feature importance is not defined for Booster type {}'.format(self.booster))
        b = self.get_booster()
        # Per-feature scores from the underlying Booster; features that never
        # appear in a split get a score of 0.
        score = b.get_score(importance_type=self.importance_type)
        all_features = [score.get(f, 0.) for f in b.feature_names]
        all_features = np.array(all_features, dtype=np.float32)
        # Normalize so the importances sum to 1.
        return all_features / all_features.sum()
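
As a sketch of what the property computes, the following (assuming `model` is the fitted XGBRegressor from the earlier example) prints the raw Booster scores for each importance type:

# feature_importances_ is just the Booster's get_score(), with 0 for features
# never used in a split, normalized to sum to 1.
booster = model.get_booster()
for imp_type in ("weight", "gain", "cover", "total_gain", "total_cover"):
    print(imp_type, booster.get_score(importance_type=imp_type))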

 

 

 

 

plot_importance

1. Explanation of the plot_importance function

Purpose: plot feature importance based on the fitted trees (a usage sketch follows the parameter list below).

Parameters

  •    booster : Booster, XGBModel or dict. Booster or XGBModel instance, or dict taken by Booster.get_fscore()
  •    ax : matplotlib Axes, default None. Target axes instance. If None, new figure and axes will be created.
  •    grid : bool, default True. Turn the axes grid on or off.
  •    importance_type : str, default "weight". How the importance is calculated: either "weight", "gain", or "cover"
           * "weight" is the number of times a feature appears in a tree
           * "gain" is the average gain of splits which use the feature
           * "cover" is the average coverage of splits which use the feature, where coverage is defined as the number of samples affected by the split
  •    max_num_features : int, default None. Maximum number of top features displayed on plot. If None, all features will be displayed.
  •    height : float, default 0.2. Bar height, passed to ax.barh()
  •    xlim : tuple, default None. Tuple passed to axes.xlim()
  •    ylim : tuple, default None. Tuple passed to axes.ylim()
  •    title : str, default "Feature importance". Axes title. To disable, pass None.
  •    xlabel : str, default "F score". X axis title label. To disable, pass None.
  •    ylabel : str, default "Features". Y axis title label. To disable, pass None.
  •    show_values : bool, default True. Show values on plot. To disable, pass False.
  •    kwargs : Other keywords passed to ax.barh()

Returns

  • ax : matplotlib Axes
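
A minimal usage sketch, reusing the fitted `model` from the earlier example (matplotlib is required):

# Bar chart of the top-10 features ranked by average split gain.
import matplotlib.pyplot as plt
from xgboost import plot_importance

ax = plot_importance(model, importance_type="gain",
                     max_num_features=10, height=0.4)
plt.show()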

 

2. Native source code of plot_importance

# coding: utf-8
# pylint: disable=too-many-locals, too-many-arguments, invalid-name,
# pylint: disable=too-many-branches
"""Plotting Library."""
from __future__ import absolute_import

import re
from io import BytesIO
import numpy as np
from .core import Booster
from .sklearn import XGBModel


def plot_importance(booster, ax=None, height=0.2,
                    xlim=None, ylim=None, title='Feature importance',
                    xlabel='F score', ylabel='Features',
                    importance_type='weight', max_num_features=None,
                    grid=True, show_values=True, **kwargs):
    """Plot importance based on fitted trees.

    Parameters
    ----------
    booster : Booster, XGBModel or dict
        Booster or XGBModel instance, or dict taken by Booster.get_fscore()
    ax : matplotlib Axes, default None
        Target axes instance. If None, new figure and axes will be created.
    grid : bool, Turn the axes grids on or off.  Default is True (On).
    importance_type : str, default "weight"
        How the importance is calculated: either "weight", "gain", or "cover"

        * "weight" is the number of times a feature appears in a tree
        * "gain" is the average gain of splits which use the feature
        * "cover" is the average coverage of splits which use the feature
          where coverage is defined as the number of samples affected by the split
    max_num_features : int, default None
        Maximum number of top features displayed on plot. If None, all features will be displayed.
    height : float, default 0.2
        Bar height, passed to ax.barh()
    xlim : tuple, default None
        Tuple passed to axes.xlim()
    ylim : tuple, default None
        Tuple passed to axes.ylim()
    title : str, default "Feature importance"
        Axes title. To disable, pass None.
    xlabel : str, default "F score"
        X axis title label. To disable, pass None.
    ylabel : str, default "Features"
        Y axis title label. To disable, pass None.
    show_values : bool, default True
        Show values on plot. To disable, pass False.
    kwargs :
        Other keywords passed to ax.barh()

    Returns
    -------
    ax : matplotlib Axes
    """
    # TODO: move this to compat.py
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError('You must install matplotlib to plot importance')

    if isinstance(booster, XGBModel):
        importance = booster.get_booster().get_score(importance_type=importance_type)
    elif isinstance(booster, Booster):
        importance = booster.get_score(importance_type=importance_type)
    elif isinstance(booster, dict):
        importance = booster
    else:
        raise ValueError('tree must be Booster, XGBModel or dict instance')

    if len(importance) == 0:
        raise ValueError('Booster.get_score() results in empty')

    # Sort features by importance; optionally keep only the top ones.
    tuples = [(k, importance[k]) for k in importance]
    if max_num_features is not None:
        tuples = sorted(tuples, key=lambda x: x[1])[-max_num_features:]
    else:
        tuples = sorted(tuples, key=lambda x: x[1])
    labels, values = zip(*tuples)

    if ax is None:
        _, ax = plt.subplots(1, 1)

    ylocs = np.arange(len(values))
    ax.barh(ylocs, values, align='center', height=height, **kwargs)

    if show_values is True:
        for x, y in zip(values, ylocs):
            ax.text(x + 1, y, x, va='center')

    ax.set_yticks(ylocs)
    ax.set_yticklabels(labels)

    if xlim is not None:
        if not isinstance(xlim, tuple) or len(xlim) != 2:
            raise ValueError('xlim must be a tuple of 2 elements')
    else:
        xlim = (0, max(values) * 1.1)
    ax.set_xlim(xlim)

    if ylim is not None:
        if not isinstance(ylim, tuple) or len(ylim) != 2:
            raise ValueError('ylim must be a tuple of 2 elements')
    else:
        ylim = (-1, len(values))
    ax.set_ylim(ylim)

    if title is not None:
        ax.set_title(title)
    if xlabel is not None:
        ax.set_xlabel(xlabel)
    if ylabel is not None:
        ax.set_ylabel(ylabel)
    ax.grid(grid)
    return ax


_NODEPAT = re.compile(r'(\d+):\[(.+)\]')
_LEAFPAT = re.compile(r'(\d+):(leaf=.+)')
_EDGEPAT = re.compile(r'yes=(\d+),no=(\d+),missing=(\d+)')
_EDGEPAT2 = re.compile(r'yes=(\d+),no=(\d+)')


def _parse_node(graph, text, condition_node_params, leaf_node_params):
    """parse dumped node"""
    match = _NODEPAT.match(text)
    if match is not None:
        node = match.group(1)
        graph.node(node, label=match.group(2), **condition_node_params)
        return node
    match = _LEAFPAT.match(text)
    if match is not None:
        node = match.group(1)
        graph.node(node, label=match.group(2), **leaf_node_params)
        return node
    raise ValueError('Unable to parse node: {0}'.format(text))


def _parse_edge(graph, node, text, yes_color='#0000FF', no_color='#FF0000'):
    """parse dumped edge"""
    try:
        match = _EDGEPAT.match(text)
        if match is not None:
            yes, no, missing = match.groups()
            if yes == missing:
                graph.edge(node, yes, label='yes, missing', color=yes_color)
                graph.edge(node, no, label='no', color=no_color)
            else:
                graph.edge(node, yes, label='yes', color=yes_color)
                graph.edge(node, no, label='no, missing', color=no_color)
            return
    except ValueError:
        pass
    match = _EDGEPAT2.match(text)
    if match is not None:
        yes, no = match.groups()
        graph.edge(node, yes, label='yes', color=yes_color)
        graph.edge(node, no, label='no', color=no_color)
        return
    raise ValueError('Unable to parse edge: {0}'.format(text))


def to_graphviz(booster, fmap='', num_trees=0, rankdir='UT',
                yes_color='#0000FF', no_color='#FF0000',
                condition_node_params=None, leaf_node_params=None, **kwargs):
    """Convert specified tree to graphviz instance. IPython can automatically plot the
    returned graphviz instance. Otherwise, you should call .render() method
    of the returned graphviz instance.

    Parameters
    ----------
    booster : Booster, XGBModel
        Booster or XGBModel instance
    fmap: str (optional)
       The name of feature map file
    num_trees : int, default 0
        Specify the ordinal number of target tree
    rankdir : str, default "UT"
        Passed to graphviz via graph_attr
    yes_color : str, default '#0000FF'
        Edge color when meets the node condition.
    no_color : str, default '#FF0000'
        Edge color when doesn't meet the node condition.
    condition_node_params : dict (optional)
        condition node configuration,
        {'shape': 'box',
         'style': 'filled,rounded',
         'fillcolor': '#78bceb'}
    leaf_node_params : dict (optional)
        leaf node configuration
        {'shape': 'box',
         'style': 'filled',
         'fillcolor': '#e48038'}
    kwargs :
        Other keywords passed to graphviz graph_attr

    Returns
    -------
    graph : graphviz.Digraph
    """

    if condition_node_params is None:
        condition_node_params = {}
    if leaf_node_params is None:
        leaf_node_params = {}

    try:
        from graphviz import Digraph
    except ImportError:
        raise ImportError('You must install graphviz to plot tree')

    if not isinstance(booster, (Booster, XGBModel)):
        raise ValueError('booster must be Booster or XGBModel instance')

    if isinstance(booster, XGBModel):
        booster = booster.get_booster()

    # Text dump of the requested tree, tokenized into node/edge descriptions.
    tree = booster.get_dump(fmap=fmap)[num_trees]
    tree = tree.split()

    kwargs = kwargs.copy()
    kwargs.update({'rankdir': rankdir})
    graph = Digraph(graph_attr=kwargs)

    for i, text in enumerate(tree):
        if text[0].isdigit():
            node = _parse_node(
                graph, text, condition_node_params=condition_node_params,
                leaf_node_params=leaf_node_params)
        else:
            if i == 0:
                # 1st string must be node
                raise ValueError('Unable to parse given string as tree')
            _parse_edge(graph, node, text, yes_color=yes_color,
                        no_color=no_color)

    return graph


def plot_tree(booster, fmap='', num_trees=0, rankdir='UT', ax=None, **kwargs):
    """Plot specified tree.

    Parameters
    ----------
    booster : Booster, XGBModel
        Booster or XGBModel instance
    fmap: str (optional)
       The name of feature map file
    num_trees : int, default 0
        Specify the ordinal number of target tree
    rankdir : str, default "UT"
        Passed to graphviz via graph_attr
    ax : matplotlib Axes, default None
        Target axes instance. If None, new figure and axes will be created.
    kwargs :
        Other keywords passed to to_graphviz

    Returns
    -------
    ax : matplotlib Axes

    """

    try:
        import matplotlib.pyplot as plt
        import matplotlib.image as image
    except ImportError:
        raise ImportError('You must install matplotlib to plot tree')

    if ax is None:
        _, ax = plt.subplots(1, 1)

    # Render the graphviz source to a PNG in memory and display it on the axes.
    g = to_graphviz(booster, fmap=fmap, num_trees=num_trees,
                    rankdir=rankdir, **kwargs)

    s = BytesIO()
    s.write(g.pipe(format='png'))
    s.seek(0)
    img = image.imread(s)

    ax.imshow(img)
    ax.axis('off')
    return ax
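
A hedged usage sketch for the tree plotting helpers (requires the graphviz Python package and the Graphviz binaries; `model` is the fitted XGBRegressor from the earlier examples, and the output name 'tree0' is arbitrary):

# Visualize tree 0 of the fitted model.
import matplotlib.pyplot as plt
from xgboost import plot_tree, to_graphviz

ax = plot_tree(model, num_trees=0)   # renders the tree image onto a matplotlib Axes
plt.show()

g = to_graphviz(model, num_trees=0)  # graphviz.Digraph instance
g.render('tree0')                    # writes 'tree0' source and a rendered file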

