
What is feature engineering? (2)



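Random forest feature importance

Component function

Computes feature importance from the original data and a trained random forest model.

PAI command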

<pre>
pai -name feature_importance -project algo_public
    -DinputTableName=pai_dense_10_10
    -DmodelName=xlab_m_random_forests_1_20318_v0
    -DoutputTableName=erkang_test_dev.pai_temp_2252_20319_1
    -DlabelColName=y
    -DfeatureColNames="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
    -Dlifecycle=28;
</pre>


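Algorithm parameters

Parameter | Description | Allowed values | Default
--- | --- | --- | ---
inputTableName | Required. Input table. | - | -
outputTableName | Required. Output table. | - | -
labelColName | Required. Column that holds the label. | - | -
modelName | Required. Name of the input model. | - | -
featureColNames | Optional. Feature columns selected from the input table. | - | All columns except the label column
inputTablePartitions | Optional. Partitions selected from the input table. | - | Entire table
lifecycle | Optional. Lifecycle of the output table. | - | Not set
coreNum | Optional. Number of cores. | - | Computed automatically
memSizePerCore | Optional. Memory size per core. | - | Computed automatically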
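
To illustrate how the parameters in the table map onto the command line, here is a minimal Python sketch that assembles a `pai -name ...` command from a dictionary. The helper `build_pai_command` and its required-parameter check are hypothetical conveniences, not part of PAI. Only the parameter names and the `-Dkey=value` form come from the command and table above.

```python
# Hypothetical helper: builds the command text only, it does not run PAI.
REQUIRED = ("inputTableName", "outputTableName", "labelColName", "modelName")

def build_pai_command(name, params, project="algo_public"):
    """Return a `pai -name ...` command string built from -D parameters."""
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    parts = [f"pai -name {name} -project {project}"]
    for key, value in params.items():
        if value is None:
            continue  # optional parameter left at its default
        if isinstance(value, (list, tuple)):
            value = ",".join(value)  # column lists are comma-separated
        parts.append(f"-D{key}={value}")
    return " ".join(parts) + ";"

command = build_pai_command("feature_importance", {
    "inputTableName": "pai_dense_10_10",
    "modelName": "xlab_m_random_forests_1_20318_v0",
    "outputTableName": "pai_temp_2252_20319_1",
    "labelColName": "y",
    "featureColNames": ["age", "campaign", "pdays"],
    "lifecycle": 28,
    "coreNum": None,  # omitted from the command
})
print(command)
```

Leaving an optional parameter as `None` keeps it out of the command, so the component falls back to the default listed in the table.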


Example

Training data

<pre>
drop table if exists pai_dense_10_10;
create table if not exists pai_dense_10_10 as
select
    age, campaign, pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
from bank_data limit 10;
</pre>
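
The drop-then-create pattern above can be tried locally. The following is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for the PAI SQL environment, with a made-up `bank_data` table reduced to three columns. The rows are hypothetical, and SQLite may differ from the real engine in other respects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for bank_data: 25 invented rows, three columns only.
conn.execute("create table bank_data (age integer, campaign integer, y integer)")
conn.executemany(
    "insert into bank_data values (?, ?, ?)",
    [(20 + i, i % 4, i % 2) for i in range(25)],
)

def rebuild_sample(connection):
    """Drop and recreate the 10-row sample table, then return its row count."""
    connection.execute("drop table if exists pai_dense_10_10")
    connection.execute(
        "create table if not exists pai_dense_10_10 as "
        "select age, campaign, y from bank_data limit 10"
    )
    return connection.execute("select count(*) from pai_dense_10_10").fetchone()[0]

first_count = rebuild_sample(conn)
second_count = rebuild_sample(conn)  # safe to rerun because of the drop
print(first_count, second_count)
```

Because `limit` is used without an `order by`, which 10 rows end up in the sample is not guaranteed to be the same on every run.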


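Parameter configuration

In the example workflow, the data source is pai_dense_10_10. Column y is the label column for the random forest and all other columns are feature columns. age and campaign are selected as forced-conversion columns, which means these two features are treated as enumerated (categorical) features. All other parameters keep their defaults. Then run the workflow.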


Results

colname | gini | entropy
--- | --- | ---
age | 0.06625000000000003 | 0.13978726292803723
campaign | 0.0017500000000000003 | 0.004348515545596772
cons_conf_idx | 0.013999999999999999 | 0.02908409497018851
cons_price_idx | 0.002 | 0.0049804499913461255
emp_var_rate | 0.014700000000000003 | 0.026786360680260933
euribor3m | 0.06300000000000003 | 0.1321936348846039
nr_employed | 0.10499999999999998 | 0.2203227248076733
pdays | 0.0845 | 0.17750329234397513
poutcome | 0.03360000000000001 | 0.07050327193845542
previous | 0.017700000000000004 | 0.03810381005801592

Right-click the random forest feature importance component to view the visual analysis.
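
The result columns are named after two impurity measures, gini and entropy. The post does not say how the component aggregates them over the forest, so the following is only a generic sketch of the two measures and of the impurity decrease produced by a single split. The label lists are hypothetical, and the base-2 logarithm is an assumption.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits (log base 2 assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Hypothetical labels at one tree node and after splitting on some feature.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

gini_gain = impurity_decrease(parent, left, right, gini)
entropy_gain = impurity_decrease(parent, left, right, entropy)
print(gini_gain, entropy_gain)
```

A feature whose splits produce larger decreases of this kind would generally be expected to score higher, which is consistent with gini and entropy ranking the features in the same order in the table above.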



GBDT feature importance

Component function

Computes feature importance for a gradient boosting decision tree (GBDT) model.

PAI command

<pre>
pai -name gbdt_importance -project algo_public
    -DmodelName=xlab_m_GBDT_LR_1_20307_v0
    -Dlifecycle=28 -DoutputTableName=pai_temp_2252_20308_1 -DlabelColName=y
    -DfeatureColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
    -DinputTableName=pai_dense_10_9;
</pre>


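Algorithm parameters

Parameter | Description | Allowed values | Default
--- | --- | --- | ---
inputTableName | Required. Input table. | - | -
outputTableName | Required. Output table. | - | -
labelColName | Required. Column that holds the label. | - | -
modelName | Required. Name of the input model. | - | -
featureColNames | Optional. Feature columns selected from the input table. | - | All columns except the label column
inputTablePartitions | Optional. Partitions selected from the input table. | - | Entire table
lifecycle | Optional. Lifecycle of the output table. | - | Not set
coreNum | Optional. Number of cores. | - | Computed automatically
memSizePerCore | Optional. Memory size per core. | - | Computed automatically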


Example

Input data

<pre>
drop table if exists pai_dense_10_9;
create table if not exists pai_dense_10_9 as
select
    age, campaign, pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
from bank_data limit 10;
</pre>


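Parameter configuration

In the example workflow, the input data is pai_dense_10_9. The GBDT binary classification component uses y as the label column and all other fields as feature columns. In the component parameters, the minimum number of samples per leaf node is set to 1. Then run the workflow.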


Results

colname | feature_importance
--- | ---
age | 0.004667214954427797
campaign | 0.001962038566773853
cons_conf_idx | 0.04857761873887033
cons_price_idx | 0.01925292649801252
emp_var_rate | 0.044881269590771274
euribor3m | 0.025034606434306696
nr_employed | 0.036085457464908766
pdays | 0.639121250405536
previous | 0.18041761734639272

Right-click the component to view the visual analysis report.
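
To make the table easier to read, a short Python sketch can rank the features and express each score as a share of the total. The values are copied from the results above. The ranking and share calculation are an illustration and not something the component itself outputs.

```python
# feature_importance values copied from the GBDT results table.
gbdt_importance = {
    "age": 0.004667214954427797,
    "campaign": 0.001962038566773853,
    "cons_conf_idx": 0.04857761873887033,
    "cons_price_idx": 0.01925292649801252,
    "emp_var_rate": 0.044881269590771274,
    "euribor3m": 0.025034606434306696,
    "nr_employed": 0.036085457464908766,
    "pdays": 0.639121250405536,
    "previous": 0.18041761734639272,
}

# Sort from most to least important.
ranked = sorted(gbdt_importance.items(), key=lambda item: item[1], reverse=True)
total = sum(gbdt_importance.values())

for name, score in ranked:
    print(f"{name:<15} {score:.4f} {score / total:6.1%}")
```

In this 10-row example the scores sum to approximately 1, and pdays and previous together account for most of the total.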



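Linear model feature importance

Component function

Computes feature importance for linear models, including linear regression and binary logistic regression. Both sparse and dense input formats are supported.

PAI command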

<pre>
PAI -name regression_feature_importance -project algo_public
    -DmodelName=xlab_m_logisticregressi_20317_v0
    -DoutputTableName=pai_temp_2252_20321_1
    -DlabelColName=y
    -DfeatureColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
    -DenableSparse=false -DinputTableName=pai_dense_10_9;
</pre>


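Algorithm parameters

Parameter | Description | Allowed values | Default
--- | --- | --- | ---
inputTableName | Required. Input table. | - | -
outputTableName | Required. Output table. | - | -
modelName | Required. Name of the input model. | - | -
labelColName | Required. Column that holds the label. | - | -
featureColNames | Optional. Feature columns selected from the input table. | - | All columns except the label column
inputTablePartitions | Optional. Partitions selected from the input table. | - | Entire table
enableSparse | Optional. Whether the input table data is in sparse format. | true, false | false
itemDelimiter | Optional. Delimiter between key-value pairs when the input table data is in sparse format. | - | Space
kvDelimiter | Optional. Delimiter between a key and its value when the input table data is in sparse format. | - | Colon
lifecycle | Optional. Lifecycle of the output table. | - | Not set
coreNum | Optional. Number of cores. | - | Computed automatically
memSizePerCore | Optional. Memory size per core. | - | Computed automatically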
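
To show what itemDelimiter and kvDelimiter mean in practice, here is a minimal Python sketch that parses one sparse-format value into a dictionary. The function `parse_sparse_row` and the sample strings are hypothetical. Only the two delimiters and their defaults (space and colon) come from the table.

```python
def parse_sparse_row(row, item_delimiter=" ", kv_delimiter=":"):
    """Split a sparse row such as "1:0.5 3:1.2" into {key: value} pairs."""
    features = {}
    for item in row.split(item_delimiter):
        if not item:
            continue  # skip empty pieces caused by repeated delimiters
        key, value = item.split(kv_delimiter, 1)
        features[key] = float(value)
    return features

# Default delimiters: space between pairs, colon between key and value.
default_row = parse_sparse_row("1:0.5 3:1.2")

# Custom delimiters, as they would be passed via itemDelimiter and kvDelimiter.
custom_row = parse_sparse_row("a=1,b=2", item_delimiter=",", kv_delimiter="=")
print(default_row, custom_row)
```

Features that do not appear in a sparse row are simply absent from the result, which is what distinguishes the sparse format from the dense one-column-per-feature layout used in the example below.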


Example

Input data

<pre>
create table if not exists pai_dense_10_9 as
select
    age, campaign, pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
from bank_data limit 10;
</pre>


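Parameter configuration

In the modeling workflow, the logistic regression multiclass component uses y as the label column and all other fields as feature columns. All other parameters keep their defaults. Then run the workflow.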


Results

colname | weight | importance
--- | --- | ---
pdays | 0.033942600256583334 | 16.31387797440866
previous | 0.00004248130342485344 | 0.000030038817725357177
emp_var_rate | 0.00006720242617694611 | 0.00010554561260753949
cons_price_idx | 0.00012311047229142307 | 0.00006581255124425219
cons_conf_idx | 0.00017227965471819213 | 0.0008918770542818432
euribor3m | 0.00006113758212679113 | 0.00010427128177450988
nr_employed | 0.0034541377310490697 | 0.26048098230126043
age | 0.00009618162708080744 | 0.0009267659744232966
campaign | 0.000019142551785274455 | 0.000041793353660529855

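Metric formulas

Column | Formula
--- | ---
weight | abs(w_j)
importance | abs(w_j) * STD(f_j)

Here w_j is the model weight of feature j and STD(f_j) is the standard deviation of that feature's values.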
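
The two formulas can be sketched in a few lines of Python. This is an illustration under stated assumptions: the post does not say whether the component uses the population or the sample standard deviation, so `statistics.pstdev` (population) is an arbitrary choice, and the weights and feature values below are hypothetical.

```python
import statistics

def linear_feature_importance(weights, columns):
    """Return {feature: {"weight": |w_j|, "importance": |w_j| * std(f_j)}}."""
    result = {}
    for name, w in weights.items():
        std = statistics.pstdev(columns[name])  # population std (assumption)
        result[name] = {"weight": abs(w), "importance": abs(w) * std}
    return result

# Hypothetical model weights and feature columns.
weights = {"f1": -0.5, "f2": 2.0}
columns = {"f1": [1.0, 3.0, 5.0, 7.0], "f2": [10.0, 10.0, 10.0, 10.0]}
scores = linear_feature_importance(weights, columns)
print(scores)

# Reading the formula backwards on the pdays row of the results table:
# importance / weight gives the standard deviation the component implied.
implied_pdays_std = 16.31387797440866 / 0.033942600256583334
print(implied_pdays_std)
```

Scaling by the standard deviation means a feature with a large weight but no variation, like `f2` here, gets an importance of zero. It also explains why pdays dominates the table: its implied standard deviation is in the hundreds, while most other features vary on a much smaller scale.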

Right-click the linear model feature importance component to view the visual analysis.

Posted by nicenelly, 2017-10-25 10:44:24