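4. Partial Dependence Plots (PDP)
The partial dependence plot (PDP) was proposed by Friedman (2001). It shows the average relationship between one feature of a model and the prediction target y, under the assumption that the features are independent of one another, and it presents that relationship visually.
4.1 How it is computed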
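A PDP is computed by feeding the training data through the model and averaging the predictions (a Monte Carlo estimate). The partial dependence function is:
$$\hat{f}_{x_S}(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_S, x_C^{(i)}\right)$$
- x_S is the set of features whose partial dependence we want to plot;
- x_C is the set of remaining features, and x_C^(i) is their values in the i-th sample;
- n is the number of samples.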
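Suppose we have trained a model f̂ with features x₁, x₂, x₃, x₄ and target variable y, and we want to study the relationship between x₁ and y. Then x_S = {x₁} and x_C = {x₂, x₃, x₄}.
- Fix x_S at a chosen value and plug it in together with x_C^(1), the other feature values of the first sample, to get the first result.
- Then swap in x_C^(2) to get the second result.
- Continue in the same way up to i = n, and average the results.
Repeating this for each value of x_S on a grid traces out the curve. Partial dependence plots let data scientists see how each feature affects the prediction.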
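The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: `partial_dependence_1d` and `toy_predict` are hypothetical names, and the toy model y = 2*x1 + x2 stands in for a fitted f̂.

```python
import numpy as np


def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                # fix x_S at this value for every row
        averages.append(predict(X_mod).mean())   # average over the n observed x_C^(i)
    return np.array(averages)


# Hypothetical stand-in for a fitted model: y = 2*x1 + x2
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1]


X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])
grid = np.array([0.0, 1.0, 2.0])
pd_values = partial_dependence_1d(toy_predict, X, feature=0, grid=grid)
print(pd_values)  # 2*value + mean(x2) = 2*value + 3
```

Because the toy model is additive, the curve is simply 2*x1 shifted by the mean of x2; with a real model, `predict` would be something like `model.predict`.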
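4.2 Interpreting the result
The plot shows a positive relationship between newborn head circumference and newborn weight, and it shows how head circumference affects the predicted weight.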
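Summary of advantages and disadvantages
Advantages:
1. Easy to compute and generate.
2. Intuitive and easy to understand.
3. Easy to interpret.
Disadvantages:
1. At most two features can be shown against y at the same time, because a plot with more than three dimensions cannot be drawn in a readable way.
2. It relies on a strong feature-independence assumption; if the features are correlated, the PDP estimate is biased.
3. A PDP shows the average effect of a feature on the target variable, so it can hide heterogeneous effects in the data.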
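The third disadvantage can be seen in a small constructed example. The sketch below uses a hypothetical model with an interaction, y = x1 * x2, and made-up data in which x2 is -1 for half of the rows and +1 for the other half. The per-sample curves (often called ICE curves, a term not used in the text above) slope in opposite directions, yet their average, the PDP, is flat.

```python
import numpy as np


# Hypothetical model with an interaction: the effect of x1 flips sign with x2
def toy_predict(X):
    return X[:, 0] * X[:, 1]


# Made-up data: x2 is -1 for two rows and +1 for the other two
X = np.array([[0.0, -1.0], [0.5, -1.0], [0.0, 1.0], [0.5, 1.0]])
grid = np.array([0.0, 1.0, 2.0])

# One curve per sample: prediction as x1 sweeps the grid, x2 kept as observed
ice = np.empty((len(X), len(grid)))
for j, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = value
    ice[:, j] = toy_predict(X_mod)

pdp = ice.mean(axis=0)  # the PDP is the average of the per-sample curves
print(ice)
print(pdp)
```

Here the averaged curve suggests that x1 has no effect at all, although it has a strong effect in every individual row.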
4.3 A simple example with the Melbourne Housing Data
The data comes from Kaggle: https://www.kaggle.com/gunjanpathak/melb-data
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer  # replaces the old sklearn.preprocessing.Imputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay


def get_some_data():
    cols_to_use = ['Distance', 'Landsize', 'BuildingArea']
    # https://www.kaggle.com/gunjanpathak/melb-data
    data = pd.read_csv('data/melb_data.csv')
    y = data.Price
    X = data[cols_to_use]
    # Fill missing values with the column mean (the SimpleImputer default)
    my_imputer = SimpleImputer()
    imputed_X = my_imputer.fit_transform(X)
    return imputed_X, y


# Build the data
X, y = get_some_data()

# scikit-learn originally implemented partial dependence plots only for Gradient Boosting models
# Create the GradientBoostingRegressor instance
my_model = GradientBoostingRegressor()
# Train the model
my_model.fit(X, y)

# Draw the plots. plot_partial_dependence was removed in scikit-learn 1.2;
# PartialDependenceDisplay.from_estimator (available since 1.0) replaces it.
my_plots = PartialDependenceDisplay.from_estimator(
    my_model,
    X=X,  # raw predictors data
    features=[0, 1, 2],  # column numbers of plots we want to show
    feature_names=['Distance', 'Landsize', 'BuildingArea'],  # labels on graphs
    grid_resolution=10,  # number of values to plot on x axis
)
plt.show()
4.4 Example: partial dependence plots for identifying prediabetes risk factors
from time import time

import numpy as np
import matplotlib.pyplot as plt
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay, partial_dependence

print(__doc__)  # in IPython this prints the banner shown in the first line of the output

filename = "data/ln_skin_ln_insulin_imp_data.csv"
names = ['preg', 'gluc', 'dbp', 'skin', 'insul', 'bmi', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)

# Split dataset into inputs and outputs
print(dataset.head())
values = dataset.values
X = values[:, 0:8]
print(X.shape)
y = values[:, 8]

# Split 80/20 train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
names = ['preg', 'gluc', 'dbp', 'skin', 'insul', 'bmi', 'pedi', 'age']

print("Training GBRT...")
# The loss is left at its default, which is the one the original code requested as
# loss='deviance' (renamed 'log_loss' in scikit-learn 1.1).
model = GradientBoostingClassifier(n_estimators=100, max_depth=4,
                                   learning_rate=0.1, random_state=1)
t0 = time()
model.fit(X_train, y_train)
print(" done.")
print("done in %0.3fs" % (time() - t0))
importances = model.feature_importances_
print(importances)

# One-way plots for all eight features plus a two-way plot for (insul, pedi).
# plot_partial_dependence was removed in scikit-learn 1.2;
# PartialDependenceDisplay.from_estimator (available since 1.0) replaces it.
features = [0, 1, 2, 3, 4, 5, 6, 7, (4, 6)]
display = PartialDependenceDisplay.from_estimator(model, X_train, features,
                                                  feature_names=names, n_jobs=3,
                                                  grid_resolution=50)
Output
Automatically created module for IPython interactive environment
   preg   gluc   dbp       skin       insul   bmi   pedi   age  class
0   6.0  148.0  72.0  35.000000  165.475260  33.6  0.627  50.0    1.0
1   1.0   85.0  66.0  29.000000   62.304286  26.6  0.351  31.0    0.0
2   8.0  183.0  64.0  20.078082  210.991380  23.3  0.672  32.0    1.0
3   1.0   89.0  66.0  23.000000   94.000000  28.1  0.167  21.0    0.0
4   0.0  137.0  40.0  35.000000  168.000000  43.1  2.288  33.0    1.0
(768, 8)
Training GBRT...
 done.
done in 0.128s
[0.0544944 0.23726041 0.04045046 0.04838732 0.23154373 0.15424069 0.12426746 0.10935554]
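The importance array in the output is easier to read when paired with the feature names. The snippet below is only a convenience sketch: it copies the values printed above and sorts them; the variable name `ranked` is introduced here for illustration.

```python
names = ['preg', 'gluc', 'dbp', 'skin', 'insul', 'bmi', 'pedi', 'age']
# Values copied from the printed output above
importances = [0.0544944, 0.23726041, 0.04045046, 0.04838732,
               0.23154373, 0.15424069, 0.12426746, 0.10935554]

# Sort features from most to least important
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:>5}  {value:.3f}")
```

In this run gluc and insul rank highest, followed by bmi, which is one way to decide which partial dependence plots to look at first.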
3D plot
# fig.suptitle('Partial dependence plots of pre diabetes on risk factors')
plt.subplots_adjust(bottom=0.1, right=1.1, top=1.4)  # tight_layout causes overlap with suptitle

print('Custom 3d plot via ``partial_dependence``')
fig = plt.figure()

target_feature = (4, 6)  # insul and pedi
# partial_dependence returns a Bunch. The "grid_values" key assumes scikit-learn >= 1.3;
# older releases returned a (pdp, axes) tuple or used the key "values".
pd_result = partial_dependence(model, X_train, features=target_feature,
                               grid_resolution=50)
axes = pd_result["grid_values"]
XX, YY = np.meshgrid(axes[0], axes[1])
Z = pd_result["average"][0].T  # transpose so the shape matches the meshgrid

# Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib releases
ax = fig.add_subplot(projection='3d')
surf = ax.plot_surface(XX, YY, Z, rstride=1, cstride=1,
                       cmap=plt.cm.BuPu, edgecolor='k')
ax.set_xlabel(names[target_feature[0]])
ax.set_ylabel(names[target_feature[1]])
ax.set_zlabel('Partial dependence')
# Pretty initial view
ax.view_init(elev=22, azim=142)
plt.colorbar(surf, ax=ax)
plt.suptitle('Partial dependence of pre diabetes risk factors')
plt.subplots_adjust(right=1, top=.9)
plt.show()

# Needed on Windows because the plotting call uses multiprocessing (n_jobs)
# if __name__ == '__main__':
#     main()

# Check the model
print(model)
Custom 3d plot via ``partial_dependence``