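ML: LiR & LassoR: Using the Boston housing dataset (PCA-processed) to build and evaluate a house price prediction model with linear regression and Lasso regression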


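Contents

Using the Boston housing dataset (PCA-processed) to build and evaluate a house price prediction model with linear regression and Lasso regression

Design approach

Output

Core code

Using the Boston housing dataset (PCA-processed) to build and evaluate a house price prediction model with linear regression and Lasso regression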

Design approach

To be updated…

Output

   Id  MSSubClass MSZoning  ...  SaleType  SaleCondition SalePrice
0   1          60       RL  ...        WD         Normal    208500
1   2          20       RL  ...        WD         Normal    181500
2   3          60       RL  ...        WD         Normal    223500
3   4          70       RL  ...        WD        Abnorml    140000
4   5          60       RL  ...        WD         Normal    250000

[5 rows x 81 columns]
numeric_columns 36 ['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice']
(1460, 36)
   LotFrontage  LotArea  OverallQual  ...  MoSold  YrSold  SalePrice
0         65.0     8450            7  ...       2    2008     208500
1         80.0     9600            6  ...       5    2007     181500
2         68.0    11250            7  ...       9    2008     223500
3         60.0     9550            7  ...       2    2006     140000
4         84.0    14260            8  ...      12    2008     250000


Number of missing values in each column, in order: 
36 [259, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 81, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Missing_data_Per_dict_0: (33, 0.9167, {'LotArea': 0.0, 'OverallQual': 0.0, 'OverallCond': 0.0, 'YearBuilt': 0.0, 'YearRemodAdd': 0.0, 'BsmtFinSF1': 0.0, 'BsmtFinSF2': 0.0, 'BsmtUnfSF': 0.0, 'TotalBsmtSF': 0.0, '1stFlrSF': 0.0, '2ndFlrSF': 0.0, 'LowQualFinSF': 0.0, 'GrLivArea': 0.0, 'BsmtFullBath': 0.0, 'BsmtHalfBath': 0.0, 'FullBath': 0.0, 'HalfBath': 0.0, 'BedroomAbvGr': 0.0, 'KitchenAbvGr': 0.0, 'TotRmsAbvGrd': 0.0, 'Fireplaces': 0.0, 'GarageCars': 0.0, 'GarageArea': 0.0, 'WoodDeckSF': 0.0, 'OpenPorchSF': 0.0, 'EnclosedPorch': 0.0, '3SsnPorch': 0.0, 'ScreenPorch': 0.0, 'PoolArea': 0.0, 'MiscVal': 0.0, 'MoSold': 0.0, 'YrSold': 0.0, 'SalePrice': 0.0})
Missing_data_Per_dict_Not0: (3, 0.0833, {'LotFrontage': 0.177397, 'MasVnrArea': 0.005479, 'GarageYrBlt': 0.055479})
Missing_data_Per_dict_under01: (2, 0.0556, {'MasVnrArea': 0.005479, 'GarageYrBlt': 0.055479})
Ratio of missing values in each column, in order: {'LotFrontage': 0.177397, 'MasVnrArea': 0.005479, 'GarageYrBlt': 0.055479}
data_Missing_dict {'LotFrontage': 0.1773972602739726, 'LotArea': 0.0, 'OverallQual': 0.0, 'OverallCond': 0.0, 'YearBuilt': 0.0, 'YearRemodAdd': 0.0, 'MasVnrArea': 0.005479452054794521, 'BsmtFinSF1': 0.0, 'BsmtFinSF2': 0.0, 'BsmtUnfSF': 0.0, 'TotalBsmtSF': 0.0, '1stFlrSF': 0.0, '2ndFlrSF': 0.0, 'LowQualFinSF': 0.0, 'GrLivArea': 0.0, 'BsmtFullBath': 0.0, 'BsmtHalfBath': 0.0, 'FullBath': 0.0, 'HalfBath': 0.0, 'BedroomAbvGr': 0.0, 'KitchenAbvGr': 0.0, 'TotRmsAbvGrd': 0.0, 'Fireplaces': 0.0, 'GarageYrBlt': 0.05547945205479452, 'GarageCars': 0.0, 'GarageArea': 0.0, 'WoodDeckSF': 0.0, 'OpenPorchSF': 0.0, 'EnclosedPorch': 0.0, '3SsnPorch': 0.0, 'ScreenPorch': 0.0, 'PoolArea': 0.0, 'MiscVal': 0.0, 'MoSold': 0.0, 'YrSold': 0.0, 'SalePrice': 0.0}
after dropna (1121, 36)
<class 'numpy.ndarray'>
      LotFrontage   LotArea  OverallQual  ...    MiscVal    MoSold    YrSold
0       -0.233570 -0.205885     0.570704  ...  -0.141407 -1.615345  0.153084
1        0.384834 -0.064358    -0.153825  ...  -0.141407 -0.498715 -0.596291
2       -0.109889  0.138702     0.570704  ...  -0.141407  0.990125  0.153084
3       -0.439705 -0.070512     0.570704  ...  -0.141407 -1.615345 -1.345665
4        0.549742  0.509132     1.295234  ...  -0.141407  2.106755  0.153084
...           ...       ...          ...  ...        ...       ...       ...
1116    -0.357251 -0.271480    -0.153825  ...  -0.141407  0.617915 -0.596291
1117     0.590968  0.375605    -0.153825  ...  -0.141407 -1.615345  1.651832
1118    -0.192343 -0.133030     0.570704  ...  14.947388 -0.498715  1.651832
1119    -0.109889 -0.049960    -0.878355  ...  -0.141407 -0.870925  1.651832
1120     0.178699 -0.022885    -0.878355  ...  -0.141407 -0.126505  0.153084

[1121 rows x 35 columns]
The first 10 principal components explain 63.80% of the variance in the data
After PCA, analysis of the first principal component-------------------------------------
[(0.16970682313415306, 'LotFrontage'), (0.1211669980146095, 'LotArea'), (0.3008665261375608, 'OverallQual'), (-0.1017783758120348, 'OverallCond'), (0.23754113423286216, 'YearBuilt'), (0.21067267847804322, 'YearRemodAdd'), (0.19125461510335365, 'MasVnrArea'), (0.14136511574315347, 'BsmtFinSF1'), (-0.013552848692716916, 'BsmtFinSF2'), (0.11439764110410199, 'BsmtUnfSF'), (0.259354275741638, 'TotalBsmtSF'), (0.2591780447881022, '1stFlrSF'), (0.11504305093601253, '2ndFlrSF'), (0.004231304806602964, 'LowQualFinSF'), (0.2877802164879641, 'GrLivArea'), (0.08317879411803167, 'BsmtFullBath'), (-0.02114280846249704, 'BsmtHalfBath'), (0.25499633884283257, 'FullBath'), (0.11080279874459822, 'HalfBath'), (0.1017767099777179, 'BedroomAbvGr'), (-0.01012145139988125, 'KitchenAbvGr'), (0.23572236584667458, 'TotRmsAbvGrd'), (0.17611466785004926, 'Fireplaces'), (0.23726651555979883, 'GarageYrBlt'), (0.2831568046802727, 'GarageCars'), (0.279827792756442, 'GarageArea'), (0.13036585867815073, 'WoodDeckSF'), (0.16664693092097654, 'OpenPorchSF'), (-0.08602539908222213, 'EnclosedPorch'), (0.010532579475601184, '3SsnPorch'), (0.02556170369869493, 'ScreenPorch'), (0.06246570190310543, 'PoolArea'), (-0.015493399959318557, 'MiscVal'), (0.028399126033275164, 'MoSold'), (-0.011129722622237775, 'YrSold')]
[(0.3008665261375608, 'OverallQual'), (0.2877802164879641, 'GrLivArea'), (0.2831568046802727, 'GarageCars'), (0.279827792756442, 'GarageArea'), (0.259354275741638, 'TotalBsmtSF'), (0.2591780447881022, '1stFlrSF'), (0.25499633884283257, 'FullBath'), (0.23754113423286216, 'YearBuilt'), (0.23726651555979883, 'GarageYrBlt'), (0.23572236584667458, 'TotRmsAbvGrd'), (0.21067267847804322, 'YearRemodAdd'), (0.19125461510335365, 'MasVnrArea'), (0.17611466785004926, 'Fireplaces'), (0.16970682313415306, 'LotFrontage'), (0.16664693092097654, 'OpenPorchSF'), (0.14136511574315347, 'BsmtFinSF1'), (0.13036585867815073, 'WoodDeckSF'), (0.1211669980146095, 'LotArea'), (0.11504305093601253, '2ndFlrSF'), (0.11439764110410199, 'BsmtUnfSF'), (0.11080279874459822, 'HalfBath'), (0.1017767099777179, 'BedroomAbvGr'), (0.08317879411803167, 'BsmtFullBath'), (0.06246570190310543, 'PoolArea'), (0.028399126033275164, 'MoSold'), (0.02556170369869493, 'ScreenPorch'), (0.010532579475601184, '3SsnPorch'), (0.004231304806602964, 'LowQualFinSF'), (-0.01012145139988125, 'KitchenAbvGr'), (-0.011129722622237775, 'YrSold'), (-0.013552848692716916, 'BsmtFinSF2'), (-0.015493399959318557, 'MiscVal'), (-0.02114280846249704, 'BsmtHalfBath'), (-0.08602539908222213, 'EnclosedPorch'), (-0.1017783758120348, 'OverallCond')]
After PCA, analysis of the second principal component-------------------------------------
[(0.037140668512444255, 'LotFrontage'), (0.005762269875424171, 'LotArea'), (-0.02265545744738413, 'OverallQual'), (0.06797580738610676, 'OverallCond'), (-0.22034458100877843, 'YearBuilt'), (-0.11769773674122082, 'YearRemodAdd'), (-0.02330741979867707, 'MasVnrArea'), (-0.26830830083400875, 'BsmtFinSF1'), (-0.06776753790369254, 'BsmtFinSF2'), (0.10349973537774373, 'BsmtUnfSF'), (-0.2014230745261159, 'TotalBsmtSF'), (-0.14501101153644946, '1stFlrSF'), (0.43960496790131565, '2ndFlrSF'), (0.11932040000909688, 'LowQualFinSF'), (0.2706724094458561, 'GrLivArea'), (-0.2741406761479087, 'BsmtFullBath'), (-0.001880261013674545, 'BsmtHalfBath'), (0.12608264523927462, 'FullBath'), (0.23358978781221817, 'HalfBath'), (0.3864399252645517, 'BedroomAbvGr'), (0.12179545892853964, 'KitchenAbvGr'), (0.3371810668951179, 'TotRmsAbvGrd'), (0.06581774146310777, 'Fireplaces'), (-0.1834261688794573, 'GarageYrBlt'), (-0.04640661259007604, 'GarageCars'), (-0.08613653500685643, 'GarageArea'), (-0.047991361825782064, 'WoodDeckSF'), (0.03130768246434415, 'OpenPorchSF'), (0.13376424222015906, 'EnclosedPorch'), (-0.02564456693744644, '3SsnPorch'), (0.04211790221668751, 'ScreenPorch'), (0.03032238859229474, 'PoolArea'), (0.04968459727862472, 'MiscVal'), (0.02754218343139985, 'MoSold'), (-0.04555808126996797, 'YrSold')]
[(0.43960496790131565, '2ndFlrSF'), (0.3864399252645517, 'BedroomAbvGr'), (0.3371810668951179, 'TotRmsAbvGrd'), (0.2706724094458561, 'GrLivArea'), (0.23358978781221817, 'HalfBath'), (0.13376424222015906, 'EnclosedPorch'), (0.12608264523927462, 'FullBath'), (0.12179545892853964, 'KitchenAbvGr'), (0.11932040000909688, 'LowQualFinSF'), (0.10349973537774373, 'BsmtUnfSF'), (0.06797580738610676, 'OverallCond'), (0.06581774146310777, 'Fireplaces'), (0.04968459727862472, 'MiscVal'), (0.04211790221668751, 'ScreenPorch'), (0.037140668512444255, 'LotFrontage'), (0.03130768246434415, 'OpenPorchSF'), (0.03032238859229474, 'PoolArea'), (0.02754218343139985, 'MoSold'), (0.005762269875424171, 'LotArea'), (-0.001880261013674545, 'BsmtHalfBath'), (-0.02265545744738413, 'OverallQual'), (-0.02330741979867707, 'MasVnrArea'), (-0.02564456693744644, '3SsnPorch'), (-0.04555808126996797, 'YrSold'), (-0.04640661259007604, 'GarageCars'), (-0.047991361825782064, 'WoodDeckSF'), (-0.06776753790369254, 'BsmtFinSF2'), (-0.08613653500685643, 'GarageArea'), (-0.11769773674122082, 'YearRemodAdd'), (-0.14501101153644946, '1stFlrSF'), (-0.1834261688794573, 'GarageYrBlt'), (-0.2014230745261159, 'TotalBsmtSF'), (-0.22034458100877843, 'YearBuilt'), (-0.26830830083400875, 'BsmtFinSF1'), (-0.2741406761479087, 'BsmtFullBath')]
MSE of linear regression without PCA: 1644140595.6636596
MSE of linear regression on the first 10 PCA principal components: 1836601962.4751632
[1e-10, 1e-09, 1e-08, 1e-07, 1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1]
[1642818822.3530025, 1642818822.3529558, 1642818822.3524888, 1642818822.3471866, 1642818822.3005185, 1642818821.7415214, 1642818817.1179569, 1642818756.7038794, 1642818283.0732899, 1642813588.5752773]
[1e-10, 1e-09, 1e-08, 1e-07, 1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1]
[1836601962.4751682, 1836601962.4752123, 1836601962.475657, 1836601962.480097, 1836601962.5245085, 1836601962.9652405, 1836601967.4063494, 1836602011.8174434, 1836602455.9288514, 1836606882.1034737]
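
The preprocessing stages visible in the output above (per-column missing-value counts and ratios, `dropna`, then standardization into a NumPy array) can be sketched roughly as follows. This is a minimal illustration, not the author's script: the source data file is not shown, so the sketch uses a small hypothetical DataFrame whose column names are borrowed from the output, and the variable names (`missing_counts`, `not_zero`, and so on) are assumptions. The column layout in the output (1460 rows, 81 columns, a `SalePrice` target) resembles the Kaggle House Prices training file, but that is only an observation, not something the article states.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the numeric columns of the real data.
df = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Count missing values per column, then express them as a ratio of the row count.
missing_counts = df.isnull().sum().tolist()
missing_ratio = df.isnull().mean().round(6).to_dict()
not_zero = {col: ratio for col, ratio in missing_ratio.items() if ratio > 0}

# Drop every row that still has a missing value, then split features and target.
clean = df.dropna()
X = clean.drop(columns="SalePrice")
y = clean["SalePrice"]

# Standardize the features (zero mean, unit variance); the result is a NumPy array.
X_scaled = StandardScaler().fit_transform(X)

print(len(missing_counts), missing_counts)
print("columns with missing values:", not_zero)
print("after dropna", clean.shape)
print(type(X_scaled))
```

Dropping rows is the simplest strategy and matches the `after dropna (1121, 36)` line in the output; imputing `LotFrontage` instead would keep more rows, at the cost of an extra modelling assumption.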

Core code

PCA
class TruncatedSVD Found at: sklearn.decomposition._truncated_svd

class TruncatedSVD(TransformerMixin, BaseEstimator):
    """Dimensionality reduction using truncated SVD (aka LSA).

    This transformer performs linear dimensionality reduction by means of
    truncated singular value decomposition (SVD). Contrary to PCA, this
    estimator does not center the data before computing the singular value
    decomposition. This means it can work with sparse matrices
    efficiently.

    In particular, truncated SVD works on term count/tf-idf matrices as
    returned by the vectorizers in :mod:`sklearn.feature_extraction.text`. In
    that context, it is known as latent semantic analysis (LSA).

    This estimator supports two algorithms: a fast randomized SVD solver, and
    a "naive" algorithm that uses ARPACK as an eigensolver on `X * X.T` or
    `X.T * X`, whichever is more efficient.

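The docstring above notes that `TruncatedSVD`, unlike PCA, does not center the data. When the features have already been standardized, as in the output section, the data is centered, so the two should agree up to the sign of each component. The sketch below checks this on hypothetical synthetic data (the feature names `f0` to `f4` and the choice of 2 components are assumptions for illustration). It also shows one way to produce the sorted `(loading, feature)` lists that appear in the output.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 correlated features built from 2 latent factors.
base = rng.normal(size=(200, 2))
extra = base @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(200, 3))
X = np.hstack([base, extra])
feature_names = ["f0", "f1", "f2", "f3", "f4"]

# Standardizing centers every column, which is the step TruncatedSVD skips.
X_scaled = StandardScaler().fit_transform(X)

# ARPACK gives an exact truncated SVD; n_components must be < n_features.
svd = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0).fit(X_scaled)
pca = PCA(n_components=2).fit(X_scaled)

explained = svd.explained_variance_ratio_.sum()
print(f"The first 2 components explain {explained:.2%} of the variance")

# Pair each loading of the first component with its feature name,
# then sort from the largest loading to the smallest.
loadings = list(zip(svd.components_[0], feature_names))
loadings_sorted = sorted(loadings, reverse=True)
print(loadings_sorted)
```

On uncentered or sparse data the two estimators would generally differ, which is why the standardization step matters if the reduced features are to be read as principal components.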
LinearRegression
class LinearRegression Found at: sklearn.linear_model._base

class LinearRegression(MultiOutputMixin, RegressorMixin, LinearModel):
    """
    Ordinary least squares Linear Regression.

    LinearRegression fits a linear model with coefficients w = (w1, ..., wp)
    to minimize the residual sum of squares between the observed targets in
    the dataset, and the targets predicted by the linear approximation.

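The output section compares the MSE of linear regression on all standardized features with the MSE on the first 10 principal components. A comparison of that kind could be sketched as below. The data is synthetic, the 70/30 hold-out split and the choice of 3 components out of 8 features are assumptions made for the illustration, and the article does not say how its own MSE values were evaluated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical regression problem: 8 features, linear signal plus noise.
X = rng.normal(size=(300, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.5 * rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Baseline: ordinary least squares on all standardized features.
lr_full = LinearRegression().fit(X_train_s, y_train)
mse_full = mean_squared_error(y_test, lr_full.predict(X_test_s))

# Reduced model: ordinary least squares on the first 3 principal components.
pca = PCA(n_components=3).fit(X_train_s)
lr_pca = LinearRegression().fit(pca.transform(X_train_s), y_train)
mse_pca = mean_squared_error(y_test, lr_pca.predict(pca.transform(X_test_s)))

print("MSE without PCA:", mse_full)
print("MSE on the first 3 principal components:", mse_pca)
```

Because PCA picks directions by feature variance without looking at the target, discarding components can remove signal the regression needs. That is consistent with the higher MSE reported for the 10-component model in the output.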
Lasso
class Lasso Found at: sklearn.linear_model._coordinate_descent
class Lasso(ElasticNet):
    """Linear Model trained with L1 prior as regularizer (aka the Lasso)

    The optimization objective for Lasso is::

        (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

    Technically the Lasso model is optimizing the same objective function as
    the Elastic Net with ``l1_ratio=1.0`` (no L2 penalty).

    Read more in the :ref:`User Guide <lasso>`.

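The last four lines of the output list ten `alpha` values from 1e-10 to 0.1, each followed by a list of MSE values that barely move. A sweep of that shape could look like the sketch below. The synthetic data, the hold-out split, and the `tol` and `max_iter` settings are assumptions. Given the objective in the docstring above, an `alpha` close to zero leaves almost no L1 penalty, so the Lasso fit should land very close to ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical regression problem with standardized-looking features.
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Reference point: ordinary least squares, i.e. no penalty at all.
ols = LinearRegression().fit(X_train, y_train)
mse_ols = mean_squared_error(y_test, ols.predict(X_test))

# Same grid as in the output: 1e-10, 1e-09, ..., 0.1.
alphas = [10.0 ** k for k in range(-10, 0)]
mses = []
for alpha in alphas:
    # A tight tolerance and a generous iteration cap help coordinate descent
    # converge when the penalty is tiny.
    model = Lasso(alpha=alpha, tol=1e-8, max_iter=100_000).fit(X_train, y_train)
    mses.append(mean_squared_error(y_test, model.predict(X_test)))

print(alphas)
print(mses)
print("OLS MSE:", mse_ols)
```

With very small `alpha` values scikit-learn may emit a convergence warning. The penalty only becomes noticeable once `alpha` is comparable to the scale of the coefficients, so a grid extending to larger values would be needed to see Lasso actually shrink or zero out features.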