2 Model-based feature selection (SelectFromModel)
Model-based feature selection uses the sklearn.feature_selection.SelectFromModel class. Here we use a random forest model to select the features.
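import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

def selectFromModel():
    stock = pd.read_csv('stock.csv', encoding='GBK')
    y = stock['涨跌幅']
    features = stock.loc[:, '价格':'流通市值']
    X = features.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=62)
    # Preprocessing: fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    # Keep the features whose random forest importance is at least the median importance
    sfm = SelectFromModel(estimator=RandomForestRegressor(n_estimators=100, random_state=38), threshold='median')
    sfm.fit(X_train_scaled, y_train)
    X_train_sfm = sfm.transform(X_train_scaled)
    print('Data shape after random forest model-based feature selection: {}'.format(X_train_sfm.shape))
    # Boolean mask: True marks a feature that was kept
    mask = sfm.get_support()
    print(mask)
Output
Data shape after random forest model-based feature selection: (306, 8)
[False True False False False False True False False True True False True True True True]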
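Features 2, 7, 10, 11, 13, 14, 15 and 16 (涨跌幅 percentage change, 最高 high, 成交额 trading value, 换手 turnover rate, 委比 bid/ask order ratio, 振幅 amplitude, 市盈率 P/E ratio, 流通市值 free-float market capitalisation) are kept. Features 1, 3, 4, 5, 6, 8, 9 and 12 (价格 price, 涨跌额 price change, 5分钟涨跌额 5-minute price change, 今开 today's open, 昨收 previous close, 最低 low, 成交量 trading volume, 量比 volume ratio) are discarded. Because threshold='median', 50% of the features are kept. Note that feature 2, 涨跌幅, is also the target y, so the model sees the target among its inputs.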
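The median rule described above can be illustrated in plain Python. This is only a sketch of the selection rule, not of scikit-learn's internals: the function name select_by_median and the importance values are hypothetical, and the values are not taken from the stock data.

```python
from statistics import median

def select_by_median(importances):
    """Return a boolean mask that keeps features with importance >= the median."""
    threshold = median(importances)
    return [score >= threshold for score in importances]

# Hypothetical importances for six features (not taken from the stock data).
importances = [0.05, 0.30, 0.02, 0.10, 0.08, 0.45]
mask = select_by_median(importances)
print(mask)
print(sum(mask), "of", len(mask), "features kept")
```

With an even number of features and no tied importances, exactly half survive, which matches the 8 of 16 seen above. With an odd count or ties, slightly more than half may be kept.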
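    # (continues inside selectFromModel(); MLPRegressor comes from sklearn.neural_network)
    # Train a neural network on the dataset produced by random forest feature selection
    X_test_sfm = sfm.transform(X_test_scaled)
    mlp_sfm = MLPRegressor(random_state=62, hidden_layer_sizes=[100, 200, 100], alpha=0.1)
    mlp_sfm.fit(X_train_sfm, y_train)
    print('Training set score after random forest model-based feature selection: {:.2%}'.format(mlp_sfm.score(X_train_sfm, y_train)))
    print('Test set score after random forest model-based feature selection: {:.2%}'.format(mlp_sfm.score(X_test_sfm, y_test)))
Output
Training set score after random forest model-based feature selection: 96.75%
Test set score after random forest model-based feature selection: 97.31%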
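For a regressor, score() returns the coefficient of determination R², which is what the percentages above report. The following is a minimal sketch of that formula in plain Python. The helper name r_squared and the sample values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions; only the last prediction is off by one.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print('{:.2%}'.format(r_squared(y_true, y_pred)))
```

A score of 100% means perfect predictions, 0% means no better than always predicting the mean of y, and the value can be negative for a model that does worse than that.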
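3 Iterative feature selection (RFE)
Iterative feature selection uses the sklearn.feature_selection.RFE class (recursive feature elimination), which repeatedly fits the estimator and removes the least important features until the requested number remains.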
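import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

def elimination():
    stock = pd.read_csv('stock.csv', encoding='GBK')
    y = stock['涨跌幅']
    features = stock.loc[:, '价格':'流通市值']
    X = features.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=62)
    # Preprocessing: fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    # Recursively eliminate features until 8 remain
    rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=38), n_features_to_select=8)
    rfe.fit(X_train_scaled, y_train)
    X_train_rfe = rfe.transform(X_train_scaled)
    print('Data shape after iterative feature selection with the random forest model: {}'.format(X_train_rfe.shape))
    # Boolean mask: True marks a feature that was kept
    mask = rfe.get_support()
    print(mask)
    # Show the selection result as an image; set the font before drawing the Chinese label
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    plt.matshow(mask.reshape(1, -1), cmap=plt.cm.cool)
    plt.xlabel(u"特征选择")  # "Feature selection"
    plt.show()
Output
Data shape after iterative feature selection with the random forest model: (306, 8)
[False True True False False False False False False True True True True True True False]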
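Features 2, 3, 10, 11, 12, 13, 14 and 15 (涨跌幅 percentage change, 涨跌额 price change, 成交额 trading value, 换手 turnover rate, 量比 volume ratio, 委比 bid/ask order ratio, 振幅 amplitude, 市盈率 P/E ratio) are kept. Features 1, 4, 5, 6, 7, 8, 9 and 16 (价格 price, 5分钟涨跌额 5-minute price change, 今开 today's open, 昨收 previous close, 最高 high, 最低 low, 成交量 trading volume, 流通市值 free-float market capitalisation) are discarded. Because n_features_to_select=8, eight features are kept.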
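The elimination loop behind this kind of selection can be sketched in plain Python. This is an illustration of the idea under simplifying assumptions, not scikit-learn's implementation: recursive_elimination, toy_score_fn and the importance numbers are all hypothetical, and the toy scorer stands in for refitting a real model on the surviving features.

```python
from typing import Callable, Dict, List

def recursive_elimination(
    features: List[str],
    score_fn: Callable[[List[str]], Dict[str, float]],
    n_features_to_select: int,
) -> List[str]:
    """Drop the lowest-scoring feature one at a time until n remain."""
    remaining = list(features)
    while len(remaining) > n_features_to_select:
        scores = score_fn(remaining)  # stands in for refitting on the survivors
        weakest = min(remaining, key=lambda name: scores[name])
        remaining.remove(weakest)
    return remaining

# Toy stand-in for a fitted model's importances (hypothetical numbers).
TOY_IMPORTANCE = {"a": 0.40, "b": 0.05, "c": 0.25, "d": 0.10, "e": 0.20}

def toy_score_fn(names: List[str]) -> Dict[str, float]:
    """Renormalise the toy importances over the features still in play."""
    total = sum(TOY_IMPORTANCE[n] for n in names)
    return {n: TOY_IMPORTANCE[n] / total for n in names}

kept = recursive_elimination(list(TOY_IMPORTANCE), toy_score_fn, n_features_to_select=3)
print(kept)
```

Because the scores are recomputed after every removal, a feature that looks weak at first can survive if its importance rises once a correlated feature has been dropped. That is one reason this result can differ from a single-pass median cut.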
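    # (continues inside elimination(); MLPRegressor comes from sklearn.neural_network)
    # Train a neural network on the dataset produced by random forest iterative feature selection
    X_test_rfe = rfe.transform(X_test_scaled)
    mlp_rfe = MLPRegressor(random_state=62, hidden_layer_sizes=[100, 200, 100], alpha=0.1)
    mlp_rfe.fit(X_train_rfe, y_train)
    print('Training set score after random forest iterative feature selection: {:.2%}'.format(mlp_rfe.score(X_train_rfe, y_train)))
    print('Test set score after random forest iterative feature selection: {:.2%}'.format(mlp_rfe.score(X_test_rfe, y_test)))
Output
Training set score after random forest iterative feature selection: 96.64%
Test set score after random forest iterative feature selection: 96.66%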
4 Combining the results
Finally, let us compare which features each of the three selection methods keeps and discards.
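1 价格 (price)
2 涨跌幅 (percentage change)
3 涨跌额 (price change)
4 5分钟涨跌额 (5-minute price change)
5 今开 (today's open)
6 昨收 (previous close)
7 最高 (high)
8 最低 (low)
9 成交量 (trading volume)
10 成交额 (trading value)
11 换手 (turnover rate)
12 量比 (volume ratio)
13 委比 (bid/ask order ratio)
14 振幅 (amplitude)
15 市盈率 (P/E ratio)
16 流通市值 (free-float market capitalisation)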
Univariate | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
Model-based | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
Iterative | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
Combined | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
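import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectPercentile, SelectFromModel, RFE

def summary():
    stock = pd.read_csv('stock.csv', encoding='GBK')
    y = stock['涨跌幅']
    features = stock.loc[:, '价格':'流通市值']
    X = features.values
    # Fit the three selectors on the full, unscaled dataset
    clf_sp = SelectPercentile(percentile=50)
    clf_sp.fit(X, y)
    sp_masks = clf_sp.get_support()
    clf_sfm = SelectFromModel(estimator=RandomForestRegressor(n_estimators=100, random_state=38), threshold='median')
    clf_sfm.fit(X, y)
    sfm_masks = clf_sfm.get_support()
    clf_rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=38), n_features_to_select=8)
    clf_rfe.fit(X, y)
    rfe_masks = clf_rfe.get_support()
    # Keep a feature only if all three selectors keep it
    merage_masks = []
    for sp_mask, sfm_mask, rfe_mask in zip(sp_masks, sfm_masks, rfe_masks):
        merage = sp_mask and sfm_mask and rfe_mask
        merage_masks.append(merage)
    # Build the reduced matrix column by column, starting from a placeholder column
    i = 0
    New_X = np.empty([X.shape[0], 1])
    for merage_mask in merage_masks:
        if merage_mask:
            # Each selected column is prepended, so the placeholder stays in the last position
            New_X = np.column_stack((X[:, i], New_X))
        i = i + 1
    # NOTE: this removes column 0, i.e. the most recently added feature;
    # the uninitialised placeholder column remains at the end
    New_X = np.delete(New_X, 0, 1)
    X_train, X_test, y_train, y_test = train_test_split(New_X, y, random_state=88)
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    mlp_merage = MLPRegressor(random_state=62, hidden_layer_sizes=[100, 200, 100], alpha=0.1)
    # NOTE: the network is fitted and scored on the unscaled arrays; the scaled ones are unused
    mlp_merage.fit(X_train, y_train)
    print('Training set score: {:.2%}'.format(mlp_merage.score(X_train, y_train)))
    print('Test set score: {:.2%}'.format(mlp_merage.score(X_test, y_test)))
Output
Training set score: 36.97%
Test set score: 51.17%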
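After combining the three selections, the result is not satisfactory. Note, however, that summary() as written deletes the last selected feature instead of the np.empty placeholder column, and trains the network on unscaled data, so the low scores may reflect these issues rather than the combined feature set itself.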
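The intersection step in summary() builds the reduced matrix column by column. A more direct way to express it is an element-wise AND of the masks followed by column filtering, sketched here in plain Python. The two masks are copied from the SelectFromModel and RFE outputs printed earlier in this article, purely as an illustration: the univariate mask is not shown in this section, and summary() refits the selectors on the full dataset, so its masks may differ. The helper names combine_masks and select_columns, and the one-row dataset, are hypothetical.

```python
# Masks copied from the SelectFromModel and RFE outputs printed earlier.
sfm_mask = [False, True, False, False, False, False, True, False,
            False, True, True, False, True, True, True, True]
rfe_mask = [False, True, True, False, False, False, False, False,
            False, True, True, True, True, True, True, False]

def combine_masks(*masks):
    """Element-wise AND: a feature survives only if every mask keeps it."""
    return [all(flags) for flags in zip(*masks)]

def select_columns(rows, mask):
    """Keep only the columns whose mask entry is True, preserving their order."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]

combined = combine_masks(sfm_mask, rfe_mask)
kept_numbers = [i + 1 for i, keep in enumerate(combined) if keep]
print(kept_numbers)  # 1-based feature numbers kept by both methods

# Hypothetical one-row dataset whose values equal the feature numbers.
rows = [list(range(1, 17))]
print(select_columns(rows, combined))
```

With NumPy arrays the same selection is a single boolean-indexing expression, X[:, mask], which needs no placeholder column and therefore no delete step.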