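Predicting Daily Bike-Share Counts for 2012
1. Task description
Run a regression analysis on the bike rental data provided by Capital Bikeshare (a bike-sharing company in Washington, D.C., USA). The 2011 data is the training set, and the goal is to predict the number of shared-bike rentals for each day of 2012.
Original dataset: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset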
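- File descriptions
day.csv: bike-share counts per day (only this file is needed)
hour.csv: bike-share counts per hour (can be ignored)
readme: data description file
- Field descriptions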
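instant: record index
dteday: date
season: season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
hr: hour (0 to 23) (only in hour.csv; ignore this field)
holiday: whether the day is a holiday
weekday: day of the week, 0 to 6
workingday: whether the day is a working day (1 = working day, 0 = weekend or holiday)
weathersit: weather (1: clear or partly cloudy; 2: mist or overcast; 3: light snow or light rain; 4: heavy rain, heavy snow, or thick fog)
temp: temperature in Celsius (stored in normalized form; the values in day.csv lie between 0 and 1)
atemp: feels-like temperature
hum: humidity
windspeed: wind speed
casual: number of rentals by casual (unregistered) users
registered: number of rentals by registered users
cnt: total number of rentals for the given day (day.csv) or hour (hour.csv); the response variable y
cnt is the target y to be predicted.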
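Because cnt is the sum of casual and registered, those two columns give the target away and should not be used as features (the modeling step in this notebook drops them). The check below is a minimal sketch on three rows copied from the trainData.head() output shown in this notebook; on the real data the same comparison would be run on the DataFrame loaded from day.csv.

```python
import pandas as pd

# Three rows copied from the notebook's own trainData.head() output.
sample = pd.DataFrame({
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt": [985, 801, 1349],
})

# cnt should equal casual + registered on every row.
is_consistent = bool((sample["casual"] + sample["registered"] == sample["cnt"]).all())
print(is_consistent)
```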
Import the required packages
# Import the required packages
# Data loading and basic processing
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Date handling
import time
import datetime
# Model
from sklearn.linear_model import LinearRegression
# Model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score  # measures how well a regression model predicts
%matplotlib inline
Load the data
data = pd.read_csv("day.csv")
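Split into training and test data
Split on the yr field: the 2011 rows (yr == 0) form the training set trainData, and the 2012 rows (yr == 1) form the test set testData.
trainData = data[data.loc[:, 'yr'] == 0].copy()
testData = data[data.loc[:, 'yr'] == 1].copy()
trainData.info()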
<class 'pandas.core.frame.DataFrame'>
Int64Index: 365 entries, 0 to 364
Data columns (total 16 columns):
instant       365 non-null int64
dteday        365 non-null object
season        365 non-null int64
yr            365 non-null int64
mnth          365 non-null int64
holiday       365 non-null int64
weekday       365 non-null int64
workingday    365 non-null int64
weathersit    365 non-null int64
temp          365 non-null float64
atemp         365 non-null float64
hum           365 non-null float64
windspeed     365 non-null float64
casual        365 non-null int64
registered    365 non-null int64
cnt           365 non-null int64
dtypes: float64(4), int64(11), object(1)
memory usage: 48.5+ KB
Feature engineering (and data exploration)
Convert dteday into the feature dayCount, the number of days elapsed since January 1 of the same year. This turns the date string into an integer the model can use. Then drop dteday, together with instant and yr.
def getInterval(row, year):
    # Time elapsed between January 1 of `year` and the row's dteday
    date = datetime.datetime.strptime(row['dteday'], "%Y-%m-%d")
    return date - datetime.datetime(year, 1, 1)

# Training set: day offsets relative to 2011-01-01
trainData['dayCount'] = trainData.apply(lambda r: getInterval(r, 2011), axis=1).dt.days
trainData = trainData.drop(['dteday', 'instant', 'yr'], axis=1)

# Test set: day offsets relative to 2012-01-01
testData['dayCount'] = testData.apply(lambda r: getInterval(r, 2012), axis=1).dt.days
testData = testData.drop(['dteday', 'instant', 'yr'], axis=1)

# mydate = pd.to_numeric(data["dteday"].str.replace('-',''))
# year = mydate//10000
# month = (mydate-year*10000)//100
# date = mydate%100
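The same day offset can also be computed without a row-wise apply. The sketch below is an alternative, not the notebook's own method: it parses the strings with pd.to_datetime and uses the dayofyear accessor. The dates Series is a hypothetical stand-in for the dteday column.

```python
import pandas as pd

# Hypothetical stand-in for the dteday column of day.csv.
dates = pd.Series(["2011-01-01", "2011-01-02", "2011-12-31",
                   "2012-01-01", "2012-12-31"])

# dayofyear is 1-based, so subtract 1 to get a 0-based offset like dayCount.
day_count = pd.to_datetime(dates, format="%Y-%m-%d").dt.dayofyear - 1
print(day_count.tolist())  # [0, 1, 364, 0, 365]
```

The last value is 365 rather than 364 because 2012 is a leap year.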
Summary statistics for each attribute
trainData.describe()
 | season | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | dayCount |
count | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 |
mean | 2.498630 | 6.526027 | 0.027397 | 3.008219 | 0.684932 | 1.421918 | 0.486665 | 0.466835 | 0.643665 | 0.191403 | 677.402740 | 2728.358904 | 3405.761644 | 182.000000 |
std | 1.110946 | 3.452584 | 0.163462 | 2.006155 | 0.465181 | 0.571831 | 0.189596 | 0.168836 | 0.148744 | 0.076890 | 556.269121 | 1060.110413 | 1378.753666 | 105.510663 |
min | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.059130 | 0.079070 | 0.000000 | 0.022392 | 9.000000 | 416.000000 | 431.000000 | 0.000000 |
25% | 2.000000 | 4.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.325000 | 0.321954 | 0.538333 | 0.135583 | 222.000000 | 1730.000000 | 2132.000000 | 91.000000 |
50% | 3.000000 | 7.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 0.479167 | 0.472846 | 0.647500 | 0.186900 | 614.000000 | 2915.000000 | 3740.000000 | 182.000000 |
75% | 3.000000 | 10.000000 | 0.000000 | 5.000000 | 1.000000 | 2.000000 | 0.656667 | 0.612379 | 0.742083 | 0.235075 | 871.000000 | 3632.000000 | 4586.000000 | 273.000000 |
max | 4.000000 | 12.000000 | 1.000000 | 6.000000 | 1.000000 | 3.000000 | 0.849167 | 0.840896 | 0.972500 | 0.507463 | 3065.000000 | 4614.000000 | 6043.000000 | 364.000000 |
Normalize the numeric attributes
numerical_features = ['temp', 'atemp', 'hum', 'windspeed']
numerical_features_nor = ['temp_nor', 'atemp_nor', 'hum_nor', 'windspeed_nor']
for col in numerical_features:
    # Min-max scale using statistics from the training set only, and reuse
    # them for the test set so that both sets share the same scale
    col_min = trainData[col].min()
    col_max = trainData[col].max()
    trainData[col + '_nor'] = (trainData[col] - col_min) / (col_max - col_min)
    testData[col + '_nor'] = (testData[col] - col_min) / (col_max - col_min)
trainData.head()
 | season | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | dayCount | temp_nor | atemp_nor | hum_nor | windspeed_nor |
0 | 1 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 | 0 | 0.360789 | 0.373517 | 0.828620 | 0.284606 |
1 | 1 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 | 1 | 0.385232 | 0.360541 | 0.715771 | 0.466215 |
2 | 1 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 | 2 | 0.173705 | 0.144830 | 0.449638 | 0.465740 |
3 | 1 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 | 3 | 0.178308 | 0.174649 | 0.607131 | 0.284297 |
4 | 1 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 82 | 1518 | 1600 | 4 | 0.212429 | 0.197158 | 0.449313 | 0.339143 |
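As a worked check of the min-max formula (x - min) / (max - min), the sketch below recomputes temp_nor for the first three rows, using the training-set minimum and maximum of temp from the describe() table. The helper name min_max_scale is introduced here for illustration only. Because the inputs are the rounded values printed in the tables, the results agree with the temp_nor column only to about five decimal places.

```python
import pandas as pd

def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return (values - lo) / (hi - lo)

# Training-set min and max of temp, copied from the describe() table.
temp_min, temp_max = 0.059130, 0.849167

# First three temp values, copied from the head() table.
temp = pd.Series([0.344167, 0.363478, 0.196364])
temp_nor = min_max_scale(temp, temp_min, temp_max)
print(temp_nor.tolist())
```

The same lo and hi can be applied to rows from another year; values outside the training range then fall slightly below 0 or above 1, which is harmless for linear regression.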
Plot distributions and scatter plots to check the effect of normalization
myShow = numerical_features
for col in myShow:
    plt.figure(figsize=(12, 12))
    plt.subplot(2, 2, 1)
    # histplot replaces the deprecated distplot (seaborn >= 0.11)
    sns.histplot(trainData[col], bins=30)
    plt.title("Distribution of %s" % col)
    plt.subplot(2, 2, 2)
    plt.scatter(range(trainData.shape[0]), trainData[col].values, color='purple')
    plt.title("Scatter of %s" % col)
myShow = numerical_features_nor
for col in myShow:
    plt.figure(figsize=(12, 12))
    plt.subplot(2, 2, 1)
    # histplot replaces the deprecated distplot (seaborn >= 0.11)
    sns.histplot(trainData[col], bins=30)
    plt.title("Distribution of %s" % col)
    plt.subplot(2, 2, 2)
    plt.scatter(range(trainData.shape[0]), trainData[col].values, color='purple')
    plt.title("Scatter of %s" % col)
Drop the un-normalized features
trainData = trainData.drop(numerical_features, axis=1)
testData = testData.drop(numerical_features, axis=1)
trainData.head()
 | season | mnth | holiday | weekday | workingday | weathersit | casual | registered | cnt | dayCount | temp_nor | atemp_nor | hum_nor | windspeed_nor |
0 | 1 | 1 | 0 | 6 | 0 | 2 | 331 | 654 | 985 | 0 | 0.360789 | 0.373517 | 0.828620 | 0.284606 |
1 | 1 | 1 | 0 | 0 | 0 | 2 | 131 | 670 | 801 | 1 | 0.385232 | 0.360541 | 0.715771 | 0.466215 |
2 | 1 | 1 | 0 | 1 | 1 | 1 | 120 | 1229 | 1349 | 2 | 0.173705 | 0.144830 | 0.449638 | 0.465740 |
3 | 1 | 1 | 0 | 2 | 1 | 1 | 108 | 1454 | 1562 | 3 | 0.178308 | 0.174649 | 0.607131 | 0.284297 |
4 | 1 | 1 | 0 | 3 | 1 | 1 | 82 | 1518 | 1600 | 4 | 0.212429 | 0.197158 | 0.449313 | 0.339143 |
Heatmap of the correlations between features
# Absolute pairwise correlation between all columns
data_corr = trainData.corr().abs()
plt.figure(figsize=(13, 9))
sns.heatmap(data_corr, annot=True)
plt.show()
Handle the categorical features
categorical_features = ['season', 'mnth', 'weathersit', 'weekday']
for col in categorical_features:
    print("\nDistinct values of %s and their counts" % col)
    print(trainData[col].value_counts())
    # Cast to object so that get_dummies treats the column as categorical
    trainData[col] = trainData[col].astype('object')
    testData[col] = testData[col].astype('object')

Distinct values of season and their counts
3    94
2    92
1    90
4    89
Name: season, dtype: int64

Distinct values of mnth and their counts
12    31
10    31
8     31
7     31
5     31
3     31
1     31
11    30
9     30
6     30
4     30
2     28
Name: mnth, dtype: int64

Distinct values of weathersit and their counts
1    226
2    124
3     15
Name: weathersit, dtype: int64

Distinct values of weekday and their counts
6    53
5    52
4    52
3    52
2    52
1    52
0    52
Name: weekday, dtype: int64
categorical_features = ['season', 'mnth', 'weathersit', 'weekday']
# One-hot encode the categorical columns and join them with the remaining columns
x_train_cat = trainData[categorical_features]
x_train_cat = pd.get_dummies(x_train_cat)
x_train_rest = trainData.drop(categorical_features, axis=1)
trainData = pd.concat([x_train_cat, x_train_rest], axis=1, ignore_index=False)
trainData.columns
Index(['season_1', 'season_2', 'season_3', 'season_4', 'mnth_1', 'mnth_2', 'mnth_3', 'mnth_4', 'mnth_5', 'mnth_6', 'mnth_7', 'mnth_8', 'mnth_9', 'mnth_10', 'mnth_11', 'mnth_12', 'weathersit_1', 'weathersit_2', 'weathersit_3', 'weekday_0', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'holiday', 'workingday', 'casual', 'registered', 'cnt', 'dayCount', 'temp_nor', 'atemp_nor', 'hum_nor', 'windspeed_nor'], dtype='object')
# Apply the same one-hot encoding to the test set
x_test_cat = testData[categorical_features]
x_test_cat = pd.get_dummies(x_test_cat)
x_test_rest = testData.drop(categorical_features, axis=1)
testData = pd.concat([x_test_cat, x_test_rest], axis=1, ignore_index=False)
testData.columns
Index(['season_1', 'season_2', 'season_3', 'season_4', 'mnth_1', 'mnth_2', 'mnth_3', 'mnth_4', 'mnth_5', 'mnth_6', 'mnth_7', 'mnth_8', 'mnth_9', 'mnth_10', 'mnth_11', 'mnth_12', 'weathersit_1', 'weathersit_2', 'weathersit_3', 'weekday_0', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'holiday', 'workingday', 'casual', 'registered', 'cnt', 'dayCount', 'temp_nor', 'atemp_nor', 'hum_nor', 'windspeed_nor'], dtype='object')
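Here the training and test sets end up with identical dummy columns because every category value occurs in both years. That is not guaranteed in general, since get_dummies only creates columns for values it actually sees. The sketch below, on hypothetical toy data, shows one way to force the test columns to match the training columns with reindex.

```python
import pandas as pd

# Hypothetical toy data: the test set happens to lack weathersit == 3.
train = pd.DataFrame({"weathersit": [1, 2, 3]}).astype("object")
test = pd.DataFrame({"weathersit": [1, 2, 2]}).astype("object")

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Reindex the test columns to the training columns; missing dummies become 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```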
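Regression
Prepare the training data by separating the features from the target
# casual and registered sum to cnt, so they are dropped along with the target
y_train = trainData["cnt"]
X_train = trainData.drop(['cnt', 'casual', 'registered', 'dayCount'], axis=1)
y_test = testData["cnt"]
X_test = testData.drop(['cnt', 'casual', 'registered', 'dayCount'], axis=1)
y_test.head()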
365    2294
366    1951
367    2236
368    2368
369    3272
Name: cnt, dtype: int64
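# Standardize the target
# Standardizing y is not required, but it keeps the range of the weights w comparable across problems
# Implemented by hand here; mean_y and std_y must be kept, because the
# test-set predictions have to be transformed back afterwards
mean_y = y_train.mean()
std_y = y_train.std()
y_train = (y_train - mean_y) / std_y
y_test = (y_test - mean_y) / std_y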
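To illustrate why standardizing y is optional for ordinary least squares with an intercept, the sketch below fits the same model on raw and on standardized targets and compares the predictions after the inverse transform. The data is synthetic and the helper ols_predict is hypothetical; it uses NumPy's least-squares solver rather than the notebook's LinearRegression.

```python
import numpy as np

# Synthetic data standing in for the notebook's features and rental counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 3000 + X @ np.array([500.0, -200.0, 100.0]) + rng.normal(scale=50, size=50)

def ols_predict(X, y):
    """Fit least squares with an intercept and return in-sample predictions."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

mean_y, std_y = y.mean(), y.std(ddof=1)  # ddof=1 matches pandas Series.std()

pred_raw = ols_predict(X, y)
# Fit on standardized y, then map the predictions back to the original scale.
pred_std = ols_predict(X, (y - mean_y) / std_y) * std_y + mean_y

print(np.allclose(pred_raw, pred_std))
```

Only the fitted weights change scale; the predictions in rental counts are the same either way.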
Ordinary least squares linear regression
# Linear Regression
# 1. Create the learner
lr = LinearRegression()
# 2. Train the learner on the training set
lr.fit(X_train, y_train)
# 3. Predict on the training set to obtain the training error;
#    this step is not needed in a real task
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
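The notebook imports mean_squared_error and r2_score but does not show them in use. The sketch below shows how the two metrics could be computed for a fitted LinearRegression. It runs on synthetic data, so the variable names X_tr, y_tr, X_te and y_te are stand-ins and the printed numbers say nothing about the bike-share model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data; in the notebook these would be
# X_train, y_train, X_test and y_test.
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(80, 2)), rng.normal(size=(20, 2))
w = np.array([2.0, -1.0])
y_tr = X_tr @ w + 0.5 + rng.normal(scale=0.1, size=80)
y_te = X_te @ w + 0.5 + rng.normal(scale=0.1, size=20)

model = LinearRegression().fit(X_tr, y_tr)

# RMSE is the square root of the mean squared error; R^2 is 1 for a perfect fit.
rmse_test = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
r2_train = r2_score(y_tr, model.predict(X_tr))
r2_test = r2_score(y_te, model.predict(X_te))
print(f"train R^2={r2_train:.3f}  test R^2={r2_test:.3f}  test RMSE={rmse_test:.3f}")
```

With a standardized target, as in this notebook, the RMSE is expressed in standard deviations of y; multiplying it by std_y converts it back to rental counts.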
Generate the submission file
y_test_pred = lr.predict(X_test)
# Undo the standardization so that predictions are in rental counts again
y_test_pred = y_test_pred * std_y + mean_y
# Build the submission file
df = pd.DataFrame({'instant': np.arange(len(y_test_pred)) + 1, 'cnt': y_test_pred})
#df.reindex(columns=['instant'])
#y = pd.Series(data = y_test_pred, name = 'cnt')
#df = pd.concat([testID, y], axis = 1, ignore_index=True)
df.to_csv('submission.csv')
# Inspect the predictions
df.drop(['instant'], axis=1, inplace=True)
df.head(20)