Machine Learning: A Regression Case Study on Toyota Corolla Prices


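This article builds four regression models on the Toyota Corolla dataset: linear regression, polynomial regression, ridge regression, and lasso regression. It then measures and visualizes each model's performance. The material draws on course slides by Huang Haiguang.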


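1. Overview


Data columns:

Age: age of the car

KM: accumulated mileage

FuelType: fuel type (Petrol, Diesel, CNG)

HP: engine power

MetColor: metallic paint (Yes=1, No=0)

Automatic: automatic transmission (Yes=1, No=0)

CC: engine displacement

Doors: number of doors

Weight: vehicle weight

Price: sale price (euros)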


2. Importing Libraries and Loading the Dataset


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from collections import Counter
from IPython.display import display, HTML
sns.set_style('white')
import warnings
warnings.filterwarnings("ignore")  # suppress warnings
dataset = pd.read_csv('data/ToyotaCorolla.csv')
dataset.head()



dataset.count()


dataset.describe()



3. Data Processing and Visualization


dataset.isnull().sum()


# FuelType is a string column, so restrict the correlation to numeric columns
corr = dataset.corr(numeric_only=True)
# Set the figure size
fig, ax = plt.subplots(figsize=(8, 8))
# Draw the heat map with annotated correlation values; seaborn labels the
# ticks with the column names and centers them on each cell
sns.heatmap(corr, cmap='magma', annot=True, fmt=".2f", ax=ax)
# Show the plot
plt.show()


f, axes = plt.subplots(2, 2, figsize=(12, 8))
sns.regplot(x='Price',
            y='Age',
            data=dataset,
            scatter_kws={'alpha': 0.6},
            ax=axes[0, 0])
axes[0, 0].set_xlabel('Price', fontsize=14)
axes[0, 0].set_ylabel('Age', fontsize=14)
axes[0, 0].yaxis.tick_left()
sns.regplot(x='Price',
            y='KM',
            data=dataset,
            scatter_kws={'alpha': 0.6},
            ax=axes[0, 1])
axes[0, 1].set_xlabel('Price', fontsize=14)
axes[0, 1].set_ylabel('KM', fontsize=14)
axes[0, 1].yaxis.set_label_position("right")
axes[0, 1].yaxis.tick_right()
sns.regplot(x='Price',
            y='Weight',
            data=dataset,
            scatter_kws={'alpha': 0.6},
            ax=axes[1, 0])
axes[1, 0].set_xlabel('Price', fontsize=14)
axes[1, 0].set_ylabel('Weight', fontsize=14)
sns.regplot(x='Price',
            y='HP',
            data=dataset,
            scatter_kws={'alpha': 0.6},
            ax=axes[1, 1])
axes[1, 1].set_xlabel('Price', fontsize=14)
axes[1, 1].set_ylabel('HP', fontsize=14)
axes[1, 1].yaxis.set_label_position("right")
axes[1, 1].yaxis.tick_right()
axes[1, 1].set(ylim=(40, 160))
plt.show()


f, axes = plt.subplots(1,2,figsize=(14,4))
sns.histplot(dataset['KM'], kde=True, ax = axes[0])
axes[0].set_xlabel('KM', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].yaxis.tick_left()
sns.scatterplot(x = 'Price', y = 'KM', data = dataset, ax = axes[1])
axes[1].set_xlabel('Price', fontsize=14)
axes[1].set_ylabel('KM', fontsize=14)
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
plt.show()


fuel_list= Counter(dataset['FuelType'])
labels = fuel_list.keys()
sizes = fuel_list.values()
f, axes = plt.subplots(1,2,figsize=(14,4))
sns.countplot(x='FuelType', data=dataset, ax = axes[0], palette="Set1")
axes[0].set_xlabel('Fuel Type', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].yaxis.tick_left()
sns.violinplot(x = 'FuelType', y = 'Price', data = dataset, ax = axes[1])
axes[1].set_xlabel('Fuel Type', fontsize=14)
axes[1].set_ylabel('Price', fontsize=14)
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
plt.show()


f, axes = plt.subplots(1,2,figsize=(14,4))
sns.histplot(dataset['HP'], kde=True, ax = axes[0])
axes[0].set_xlabel('HP', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].yaxis.tick_left()
sns.scatterplot(x = 'HP', y = 'Price', data = dataset, ax = axes[1])
axes[1].set_xlabel('HP', fontsize=14)
axes[1].set_ylabel('Price', fontsize=14)
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
plt.show()


f, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(dataset['MetColor'], kde=True, ax=axes[0])
axes[0].set_xlabel('MetColor', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].yaxis.tick_left()
sns.boxplot(x='MetColor', y='Price', data=dataset, ax=axes[1])
axes[1].set_xlabel('MetColor', fontsize=14)
axes[1].set_ylabel('Price', fontsize=14)
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
plt.show()


f, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(dataset['Automatic'], kde=True, ax=axes[0])
axes[0].set_xlabel('Automatic', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].yaxis.tick_left()
sns.boxenplot(x='Automatic', y='Price', data=dataset, ax=axes[1])
axes[1].set_xlabel('Automatic', fontsize=14)
axes[1].set_ylabel('Price', fontsize=14)
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
plt.show()


f, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(dataset['CC'], kde=True, ax=axes[0])
axes[0].set_xlabel('CC', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].yaxis.tick_left()
sns.boxplot(x='CC', y='Price', data=dataset, ax=axes[1])
axes[1].set_xlabel('CC', fontsize=14)
axes[1].set_ylabel('Price', fontsize=14)
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
plt.show()


f, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(dataset['Doors'], kde=True, ax=axes[0])
axes[0].set_xlabel('Doors', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].yaxis.tick_left()
sns.boxenplot(x='Doors', y='Price', data=dataset, ax=axes[1])
axes[1].set_xlabel('Doors', fontsize=14)
axes[1].set_ylabel('Price', fontsize=14)
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
plt.show()


dataset = pd.get_dummies(dataset)
dataset.head()
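
To make the encoding step concrete, here is a minimal sketch of what pd.get_dummies does to a string column. The four-row frame `demo` is made up for illustration and is not taken from the dataset. Each distinct FuelType value becomes its own indicator column, so 8 numeric predictors plus 3 fuel indicators would give the 11 feature columns reported in the shapes printed after the train/test split.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
demo = pd.DataFrame({
    'Price': [13500, 13750, 16900, 7750],
    'FuelType': ['Diesel', 'Petrol', 'CNG', 'Petrol'],
})

# Numeric columns pass through unchanged; each FuelType category
# becomes a separate indicator column named FuelType_<value>
encoded = pd.get_dummies(demo)
print(encoded.columns.tolist())
```

Keeping all three indicator columns, as the article does, is fine for the regularized models; for plain linear regression one could also pass drop_first=True to avoid a redundant column.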


X = dataset.drop('Price', axis = 1).values
# Select the target by name rather than by column position
y = dataset['Price'].values.reshape(-1,1)
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42)
print("Shape of X_train: ",X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ",y_train.shape)
print("Shape of y_test",y_test.shape)


Shape of X_train: (1077, 11)

Shape of X_test: (359, 11)

Shape of y_train: (1077, 1)

Shape of y_test (359, 1)


4. Regression Models


Linear Regression


from sklearn.linear_model import LinearRegression
regressor_linear = LinearRegression()
regressor_linear.fit(X_train, y_train)


LinearRegression()


from sklearn.metrics import r2_score
# 10-fold cross-validation score, computed on the training set
cv_linear = cross_val_score(estimator=regressor_linear,
                            X=X_train,
                            y=y_train,
                            cv=10)
# R2 score on the training set
y_pred_linear_train = regressor_linear.predict(X_train)
r2_score_linear_train = r2_score(y_train, y_pred_linear_train)
# R2 score on the test set
y_pred_linear_test = regressor_linear.predict(X_test)
r2_score_linear_test = r2_score(y_test, y_pred_linear_test)
# RMSE on the test set
rmse_linear = (np.sqrt(mean_squared_error(y_test, y_pred_linear_test)))
print("CV: ", cv_linear.mean())
print('R2_score (train): ', r2_score_linear_train)
print('R2_score (test): ', r2_score_linear_test)
print("RMSE: ", rmse_linear)


CV: 0.8480754345159047

R2_score (train): 0.8702260786694702

R2_score (test): 0.8621869690956068

RMSE: 1398.4596051422188
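
As a reference for how the two reported metrics are defined, the sketch below computes RMSE and R2 by hand with NumPy. The arrays are made-up values, not predictions from the model above.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE: square root of the mean squared residual, in the units of the target
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R2: one minus the ratio of the residual sum of squares to the total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(round(float(rmse), 4), round(float(r2), 4))
```

Because RMSE is expressed in the units of the target, the RMSE values in this article are in euros.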


Second-Degree Polynomial Regression


from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
# Fit the feature expansion on the training data only
X_poly = poly_reg.fit_transform(X_train)
regressor_poly2 = LinearRegression()
regressor_poly2.fit(X_poly, y_train)
from sklearn.metrics import r2_score
# 10-fold cross-validation on the training set. The expansion is wrapped in a
# pipeline so that every fold is scored on polynomial features. The original
# code passed the raw X_train to a plain LinearRegression here, which is why
# the CV value printed below is identical to the linear regression score.
cv_poly2 = cross_val_score(estimator = make_pipeline(PolynomialFeatures(degree = 2), LinearRegression()),
                           X = X_train, y = y_train, cv = 10)
# R2 score on the training set
y_pred_poly2_train = regressor_poly2.predict(X_poly)
r2_score_poly2_train = r2_score(y_train, y_pred_poly2_train)
# R2 score on the test set (transform only; do not refit on test data)
y_pred_poly2_test = regressor_poly2.predict(poly_reg.transform(X_test))
r2_score_poly2_test = r2_score(y_test, y_pred_poly2_test)
# RMSE on the test set
rmse_poly2 = (np.sqrt(mean_squared_error(y_test, y_pred_poly2_test)))
print('CV: ', cv_poly2.mean())
print('R2_score (train): ', r2_score_poly2_train)
print('R2_score (test): ', r2_score_poly2_test)
print("RMSE: ", rmse_poly2)


CV: 0.8480754345159047

R2_score (train): 0.9157086185553889

R2_score (test): 0.7619825755103794

RMSE: 1837.8461795439769
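
The gap between the training and test R2 above is consistent with overfitting. One likely contributor is how quickly the polynomial expansion grows: with n input features and degree d, PolynomialFeatures (with its default bias column) produces C(n + d, d) columns. The sketch below evaluates that count for the 11 features used here; the helper name n_poly_features is made up for illustration.

```python
from math import comb

def n_poly_features(n_features: int, degree: int) -> int:
    """Count the monomials of total degree <= degree, including the constant term."""
    return comb(n_features + degree, degree)

# Column counts for 11 input features at degrees 1, 2 and 3
for d in (1, 2, 3):
    print(d, n_poly_features(11, d))
```

With only 1077 training rows, an expansion of this size leaves relatively few observations per coefficient, which is one reason regularized models are worth trying.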


Ridge Regression


from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
steps = [
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=3)),
    ('model', Ridge(alpha=1777, fit_intercept=True))
]
ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
from sklearn.metrics import r2_score
# 10-fold cross-validation score, computed on the training set
cv_ridge = cross_val_score(estimator = ridge_pipe, X = X_train, y = y_train.ravel(), cv = 10)
# R2 score on the training set
y_pred_ridge_train = ridge_pipe.predict(X_train)
r2_score_ridge_train = r2_score(y_train, y_pred_ridge_train)
# R2 score on the test set
y_pred_ridge_test = ridge_pipe.predict(X_test)
r2_score_ridge_test = r2_score(y_test, y_pred_ridge_test)
# RMSE on the test set
rmse_ridge = (np.sqrt(mean_squared_error(y_test, y_pred_ridge_test)))
print('CV: ', cv_ridge.mean())
print('R2_score (train): ', r2_score_ridge_train)
print('R2_score (test): ', r2_score_ridge_test)
print("RMSE: ", rmse_ridge)


CV: 0.7785178588873436

R2_score (train): 0.87000985560043

R2_score (test): 0.8697806448706517

RMSE: 1359.3852529159908
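
The article does not say how alpha=1777 was chosen. One common way to pick a regularization strength is a cross-validated grid search over the whole pipeline; the sketch below shows the idea on synthetic data. The data, the alpha grid, and the polynomial degree are assumptions made for illustration, not the author's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge()),
])

# Search alpha on a log-spaced grid; 'model__alpha' addresses the Ridge step
alphas = np.logspace(-2, 3, 6)
search = GridSearchCV(pipe, param_grid={'model__alpha': alphas},
                      cv=5, scoring='r2')
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['model__alpha']
print(best_alpha, search.best_score_)
```

Searching inside the pipeline means the scaler and the polynomial expansion are refit on each training fold, so the validation folds do not leak into preprocessing.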


Lasso Regression


from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
steps = [('scalar', StandardScaler()), ('poly', PolynomialFeatures(degree=3)),
         ('model',
          Lasso(alpha=2.36, fit_intercept=True, tol=0.0199, max_iter=2000))]
lasso_pipe = Pipeline(steps)
lasso_pipe.fit(X_train, y_train)
from sklearn.metrics import r2_score
# 10-fold cross-validation score, computed on the training set
cv_lasso = cross_val_score(estimator = lasso_pipe, X = X_train, y = y_train, cv = 10)
# R2 score on the training set
y_pred_lasso_train = lasso_pipe.predict(X_train)
r2_score_lasso_train = r2_score(y_train, y_pred_lasso_train)
# R2 score on the test set
y_pred_lasso_test = lasso_pipe.predict(X_test)
r2_score_lasso_test = r2_score(y_test, y_pred_lasso_test)
# RMSE on the test set
rmse_lasso = (np.sqrt(mean_squared_error(y_test, y_pred_lasso_test)))
print('CV: ', cv_lasso.mean())
print('R2_score (train): ', r2_score_lasso_train)
print('R2_score (test): ', r2_score_lasso_test)
print("RMSE: ", rmse_lasso)


CV: 0.7427712620107894

R2_score (train): 0.9273633923675705

R2_score (test): 0.9022945020939632

RMSE: 1177.509135460343
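
A practical difference between the two penalties is that lasso can set coefficients exactly to zero, while ridge only shrinks them toward zero. The sketch below illustrates this on synthetic data (an assumption; it does not use the Corolla dataset) in which only 3 of 20 features carry signal.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 3 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=20,
                                 n_informative=3, noise=5.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count the coefficients each model keeps away from exactly zero
n_nonzero_lasso = int(np.sum(lasso.coef_ != 0))
n_nonzero_ridge = int(np.sum(ridge.coef_ != 0))
print(n_nonzero_lasso, n_nonzero_ridge)
```

In a degree-3 pipeline like the one above, the same effect would let lasso discard many of the expanded polynomial terms.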


5. Measuring the Error


models = [
    ('Linear Regression', rmse_linear, r2_score_linear_train,
     r2_score_linear_test, cv_linear.mean()),
    ('Polynomial Regression (2nd)', rmse_poly2, r2_score_poly2_train,
     r2_score_poly2_test, cv_poly2.mean()),
    ('Ridge Regression', rmse_ridge, r2_score_ridge_train, r2_score_ridge_test,
     cv_ridge.mean()),
    ('Lasso Regression', rmse_lasso, r2_score_lasso_train, r2_score_lasso_test,
     cv_lasso.mean()),
]
predict = pd.DataFrame(data = models, columns=['Model', 'RMSE', 'R2_Score(training)', 'R2_Score(test)', 'Cross-Validation'])
predict
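
The four model sections repeat the same evaluation steps. As a sketch, they could be folded into one helper that returns a row in the same column order as the table above. The name evaluate_model and the synthetic data are hypothetical; the article itself computes each metric inline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

def evaluate_model(name, model, X_tr, X_te, y_tr, y_te, cv=10):
    """Return (name, RMSE, R2 train, R2 test, mean CV score) for one estimator."""
    # cross_val_score clones the estimator, so it does not affect the fit below
    cv_mean = cross_val_score(model, X_tr, y_tr, cv=cv).mean()
    model.fit(X_tr, y_tr)
    r2_train = r2_score(y_tr, model.predict(X_tr))
    pred_test = model.predict(X_te)
    r2_test = r2_score(y_te, pred_test)
    rmse = np.sqrt(mean_squared_error(y_te, pred_test))
    return (name, rmse, r2_train, r2_test, cv_mean)

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)

row = evaluate_model('Linear Regression', LinearRegression(),
                     X_tr, X_te, y_tr, y_te)
print(row)
```

Passing a Pipeline as `model` would work the same way, since pipelines expose the same fit and predict methods.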


6. Visualizing Model Performance


f, axe = plt.subplots(1,1, figsize=(18,6))
predict.sort_values(by=['Cross-Validation'], ascending=False, inplace=True)
sns.barplot(x='Cross-Validation', y='Model', data = predict, ax = axe)
axe.set_xlabel('Cross-Validation Score', size=16)
axe.set_ylabel('Model',size=16)
axe.set_xlim(0,1.0)
axe.set_xticks(np.arange(0, 1.1, 0.1))
plt.show()



f, axes = plt.subplots(2, 1, figsize=(14, 10))
predict.sort_values(by=['R2_Score(training)'], ascending=False, inplace=True)
sns.barplot(x='R2_Score(training)',
            y='Model',
            data=predict,
            palette='Blues_d',
            ax=axes[0])
axes[0].set_xlabel('R2 Score (Training)', size=16)
axes[0].set_ylabel('Model', size=16)
axes[0].set_xlim(0, 1.0)
axes[0].set_xticks(np.arange(0, 1.1, 0.1))
predict.sort_values(by=['R2_Score(test)'], ascending=False, inplace=True)
sns.barplot(x='R2_Score(test)',
            y='Model',
            data=predict,
            palette='Reds_d',
            ax=axes[1])
axes[1].set_xlabel('R2 Score (Test)', size=16)
axes[1].set_ylabel('Model', size=16)
axes[1].set_xlim(0, 1.0)
axes[1].set_xticks(np.arange(0, 1.1, 0.1))
plt.show()


predict.sort_values(by=['RMSE'], ascending=False, inplace=True)
f, axe = plt.subplots(1, 1, figsize=(10, 6))
sns.barplot(x='Model', y='RMSE', data=predict, ax=axe)
axe.set_xlabel('Model', size=16)
axe.set_ylabel('RMSE', size=16)
plt.show()



7. Conclusion


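In this code, I built four regression models on the Toyota Corolla dataset (linear regression, polynomial regression, ridge regression, and lasso regression), then measured and visualized their performance. On the test set, lasso regression achieved the highest R2 (about 0.90) and the lowest RMSE (about 1178), while second-degree polynomial regression showed the largest gap between training and test R2 (about 0.92 versus 0.76).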
