0. Downloading the dataset
The source data for the property-valuation model in this article is Xiamen housing-price data. Download link: https://pan.baidu.com/s/1vOact6MsyZZlTSxjmMqTbw  password: 8zg6
The downloaded file looks like the figure below when opened:
As the figure shows, the data has already been lightly preprocessed; only minor adjustments are needed before it can be fed into model training.
1. Comparing the MLPR and GBR models
df_y = df['unitPrice']
gets the unitPrice column of the DataFrame;
y = df_y.values
gives an ndarray of shape (21935,), i.e. a one-dimensional array of length 21935;
df_x = df.drop(['unitPrice'],axis=1)
gets every column of the DataFrame except unitPrice;
x = df_x.values
gives an ndarray of shape (21935,120), i.e. a 21935×120 two-dimensional array.
The data is standardized with sklearn's preprocessing.StandardScaler(): the scaler is fitted on the training set, and the same transform is then applied to the test set.
Call MLPRegressor() to obtain the multi-layer perceptron regression model, train it on the training set, and score it on the test set.
Call GradientBoostingRegressor() to obtain the ensemble (gradient boosting) regression model, train it on the training set, and score it on the test set. For both models, score() returns the R² coefficient of determination.
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
df = pd.read_excel("数据处理结果.xlsx")
df_y = df['unitPrice']
df_x = df.drop(['unitPrice'],axis=1)
x = df_x.values
y = df_y.values
train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
random_state=33)
ss_x = preprocessing.StandardScaler()
train_x1 = ss_x.fit_transform(train_x)   # fit the scaler on the training features only
test_x1 = ss_x.transform(test_x)         # reuse the training-set scaling on the test features
ss_y = preprocessing.StandardScaler()
train_y1 = ss_y.fit_transform(train_y.reshape(-1,1))   # StandardScaler expects a 2D array
test_y1 = ss_y.transform(test_y.reshape(-1,1))
model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x1,train_y1.ravel())
mlp_score = model_mlp.score(test_x1,test_y1.ravel())
print("sklearn多层感知器-回归模型得分",mlp_score)
model_gbr = GradientBoostingRegressor()
model_gbr.fit(train_x1,train_y1.ravel())
gbr_score = model_gbr.score(test_x1,test_y1.ravel())
print("sklearn集成-回归模型得分",gbr_score)
The printed result is:
sklearn MLP regressor score 0.683941816792
sklearn ensemble regressor score 0.762351806857
For a first pass at the model, this result is acceptable.
2. Outlier removal
From the figure we can see that some listings have unit prices of several hundred thousand or even millions; such outliers need to be removed.
I have not found a function that can be called directly to remove outliers, so we write our own. The code below defines a cleanOutlier function whose job is to delete outliers. First, the concepts of lower and upper quartile: with 100 numbers sorted in ascending order, the median is the 50th value, the lower quartile is the 25th value, and the upper quartile is the 75th value.
The interquartile range (IQR) is the upper quartile minus the lower quartile; for example, if the upper quartile is 900 and the lower quartile is 700, the IQR is 200.
Outliers are values that are unusually large or small. In the removal method used here, any value below (lower quartile - 3 × IQR) or above (upper quartile + 3 × IQR) is treated as an outlier and deleted. With an upper quartile of 900 and a lower quartile of 700, values below 100 or above 1500 would be removed.
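For reference, numpy can compute the same quartiles directly. Here is a minimal sketch on hypothetical toy prices (note that np.percentile interpolates, so its quartiles can differ slightly from the index-based ones used in the cleanOutlier function below):
import numpy as np

prices = np.array([700., 720., 760., 800., 850., 880., 900., 5000.])  # toy unit prices
q1, q3 = np.percentile(prices, [25, 75])     # lower and upper quartile
iqr = q3 - q1                                # interquartile range
low_bound, high_bound = q1 - 3 * iqr, q3 + 3 * iqr
kept = prices[(prices >= low_bound) & (prices <= high_bound)]   # drop the outliers
print(low_bound, high_bound, kept)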
Converting the DataFrame to an ndarray only takes df.values; since models generally train on floating-point values, df.values.astype('float') is used to obtain a float matrix.
The cleanOutlier function is used to delete the outliers; then column 0 is assigned to the y variable and columns 1 through the last are assigned to the x variable.
Because x consists mostly of one-hot encoded columns, it is not standardized again here (a quick check of this is sketched right below).
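As a quick sanity check of that claim, one can count how many columns of x contain only 0s and 1s; this is only a sketch and assumes x is the feature ndarray produced in the script below, with numpy imported as np:
import numpy as np

is_binary = np.isin(x, [0, 1]).all(axis=0)   # True for columns that contain only 0s and 1s
print("one-hot style columns: {} of {}".format(is_binary.sum(), x.shape[1]))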
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()]   # ndarray sorted by the given column
    l = len(data)
    low = int(l/4)      # index of the lower quartile
    high = int(l/4*3)   # index of the upper quartile
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {}, upper quartile {}".format(lowValue,highValue))
    # clip the deletion bounds to the actual data range
    if lowValue - mul * (highValue - lowValue) < data[0,column]:
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("removing rows of column {} with values below {} or above {}".format(column,
          delLowValue,delHighValue))
    recordLow, recordHigh = low, high   # safe defaults in case the scans below find nothing
    # first row to keep, scanning up from the low end
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    # last row to keep, scanning down from the high end
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what is being removed
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,
          recordHigh+1-recordLow))
    return data
df = pd.read_excel("数据处理结果.xlsx")
data = df.values.astype('float')   # DataFrame -> float ndarray
data = cleanOutlier(data,0)        # remove outliers in column 0 (unitPrice)
x = data[:,1:]                     # feature columns
y = data[:,0]                      # target: unitPrice
train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
random_state=33)
ss_y = preprocessing.StandardScaler()
train_y = ss_y.fit_transform(train_y.reshape(-1,1))
test_y = ss_y.transform(test_y.reshape(-1,1))
model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x,train_y.ravel())
mlp_score = model_mlp.score(test_x,test_y.ravel())
print("sklearn多层感知器-回归模型得分",mlp_score)
model_gbr = GradientBoostingRegressor(learning_rate=0.1)
model_gbr.fit(train_x,train_y.ravel())
# train_y and test_y were already standardized above, so no further scaling is needed here
gbr_score = model_gbr.score(test_x,test_y.ravel())
print("sklearn ensemble regressor score",gbr_score)
The printed result is:
sklearn MLP regressor score 0.795028773029
sklearn ensemble regressor score 0.767157061712
After this second adjustment, the MLP regressor's score improves noticeably, while the ensemble regressor's score barely changes. Overall, the outlier removal was a success.
3. Normalization (log transform)
Normalization here means taking the natural logarithm of y and using the resulting column as the new target.
In the script it is done with a loop: for i in range(len(y)): y[i] = math.log(y[i])
In principle no further standardization should be needed after the log transform, but experiments showed that standardizing both x and y still improves the score.
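Since y is a numpy ndarray here, the same transform can also be written as a single vectorized call; this is only a sketch of the equivalent, assuming y is the target ndarray from the script, and is not a change to the script below:
import numpy as np

y = np.log(y)    # element-wise natural log, same effect as the math.log loop
# np.exp(...) undoes it if predictions are needed back on the price scale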
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import math
def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()]   # ndarray sorted by the given column
    l = len(data)
    low = int(l/4)      # index of the lower quartile
    high = int(l/4*3)   # index of the upper quartile
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {}, upper quartile {}".format(lowValue,highValue))
    # clip the deletion bounds to the actual data range
    if lowValue - mul * (highValue - lowValue) < data[0,column]:
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("removing rows of column {} with values below {} or above {}".format(column,
          delLowValue,delHighValue))
    recordLow, recordHigh = low, high   # safe defaults in case the scans below find nothing
    # first row to keep, scanning up from the low end
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    # last row to keep, scanning down from the high end
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what is being removed
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,
          recordHigh+1-recordLow))
    return data
df = pd.read_excel("数据处理结果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data,0)
x = data[:,1:]
y = data[:,0]
for i in range(len(y)):
    y[i] = math.log(y[i])   # natural-log transform of the target
train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
random_state=33)
ss_x = preprocessing.StandardScaler()
train_x = ss_x.fit_transform(train_x)
test_x = ss_x.transform(test_x)
ss_y = preprocessing.StandardScaler()
train_y = ss_y.fit_transform(train_y.reshape(-1,1))
test_y = ss_y.transform(test_y.reshape(-1,1))
model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x,train_y.ravel())
mlp_score = model_mlp.score(test_x,test_y.ravel())
print("sklearn多层感知器-回归模型得分",mlp_score)
model_gbr = GradientBoostingRegressor(learning_rate=0.1)
model_gbr.fit(train_x,train_y.ravel())
gbr_score = model_gbr.score(test_x,test_y.ravel())
print("sklearn集成-回归模型得分",gbr_score)
The printed result is:
sklearn MLP regressor score 0.831448099649
sklearn ensemble regressor score 0.780133207248
Compared with the previous run, both scores improve again, so this adjustment is also a success.
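One practical note: because y was log-transformed and then standardized, the model's raw predictions are not in price units. A minimal sketch of recovering actual unit prices, assuming model_mlp, ss_y and test_x come from the script above:
import numpy as np

pred_std = model_mlp.predict(test_x).reshape(-1, 1)   # predictions on the standardized log scale
pred_log = ss_y.inverse_transform(pred_std)           # undo the StandardScaler -> log(unitPrice)
pred_price = np.exp(pred_log)                         # undo the log -> unitPrice
print(pred_price[:5].ravel())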
4. Cross-validation
This section mainly uses the KFold class from sklearn.model_selection to choose the training and test sets.
kf = KFold(n_splits=5,shuffle=True) initializes the KFold object.
The line for train_index,test_index in kf.split(x): shows that kf.split(x) yields n_splits pairs, here 5; each pair is a tuple whose elements are the training-set indices and the test-set indices. (Strictly speaking kf.split(x) returns a generator rather than a list, but it is iterated the same way.)
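A tiny, self-contained demonstration of what kf.split() yields, using hypothetical toy data rather than the housing set:
from sklearn.model_selection import KFold
import numpy as np

x_demo = np.arange(10).reshape(5, 2)                  # 5 samples, 2 features
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kf_demo.split(x_demo):
    print("train:", train_index, "test:", test_index)  # each is an ndarray of row indices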
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import math
from sklearn.model_selection import KFold
def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()]   # ndarray sorted by the given column
    l = len(data)
    low = int(l/4)      # index of the lower quartile
    high = int(l/4*3)   # index of the upper quartile
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {}, upper quartile {}".format(lowValue,highValue))
    # clip the deletion bounds to the actual data range
    if lowValue - mul * (highValue - lowValue) < data[0,column]:
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("removing rows of column {} with values below {} or above {}".format(column,
          delLowValue,delHighValue))
    recordLow, recordHigh = low, high   # safe defaults in case the scans below find nothing
    # first row to keep, scanning up from the low end
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    # last row to keep, scanning down from the high end
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what is being removed
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,
          recordHigh+1-recordLow))
    return data
df = pd.read_excel("数据处理结果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data,0)
x = data[:,1:]
y = data[:,0]
for i in range(len(y)):
    y[i] = math.log(y[i])   # natural-log transform of the target
kf = KFold(n_splits=5,shuffle=True)
for train_index,test_index in kf.split(x):
    train_x = x[train_index]
    test_x = x[test_index]
    train_y = y[train_index]
    test_y = y[test_index]
    # scalers are refit on each training fold, then applied to that fold's test set
    ss_x = preprocessing.StandardScaler()
    train_x = ss_x.fit_transform(train_x)
    test_x = ss_x.transform(test_x)
    ss_y = preprocessing.StandardScaler()
    train_y = ss_y.fit_transform(train_y.reshape(-1,1))
    test_y = ss_y.transform(test_y.reshape(-1,1))
    model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
    model_mlp.fit(train_x,train_y.ravel())
    mlp_score = model_mlp.score(test_x,test_y.ravel())
    print("sklearn MLP regressor score",mlp_score)
    model_gbr = GradientBoostingRegressor(learning_rate=0.1)
    model_gbr.fit(train_x,train_y.ravel())
    gbr_score = model_gbr.score(test_x,test_y.ravel())
    print("sklearn ensemble regressor score",gbr_score)
The printed result is:
sklearn MLP regressor score 0.8427725943791746
sklearn ensemble regressor score 0.7915684454283963
sklearn MLP regressor score 0.8317854959807023
sklearn ensemble regressor score 0.7705608099963528
sklearn MLP regressor score 0.8369280445356948
sklearn ensemble regressor score 0.7851823734454625
sklearn MLP regressor score 0.8364897250676866
sklearn ensemble regressor score 0.7833199279062474
sklearn MLP regressor score 0.8335782493590231
sklearn ensemble regressor score 0.7722233325504181
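The loop above prints one score per fold; to summarize a cross-validation run it is common to average the fold scores. As a hedged alternative sketch, sklearn's cross_val_score combined with a Pipeline does the per-fold scaling, fitting and scoring in one call (it assumes x and the log-transformed y from the script above and does not standardize y, so its numbers will not match the per-fold output exactly):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor

pipe = make_pipeline(StandardScaler(),
                     MLPRegressor(solver='lbfgs', hidden_layer_sizes=(20, 20, 20), random_state=1))
scores = cross_val_score(pipe, x, y, cv=KFold(n_splits=5, shuffle=True))
print("MLP mean R^2 over 5 folds:", np.mean(scores))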