What Is Feature Engineering? (2) - Q&A - Alibaba Cloud Developer Community

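Random Forest Feature Importance

Component Function

Computes feature importance from the original data and a trained random forest model.

PAI Command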


pai -name feature_importance -project algo_public
-DinputTableName=pai_dense_10_10
-DmodelName=xlab_m_random_forests_1_20318_v0
-DoutputTableName=erkang_test_dev.pai_temp_2252_20319_1
-DlabelColName=y
-DfeatureColNames="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
-Dlifecycle=28 ;

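Algorithm Parameters

Parameter	Description	Allowed Values	Default
inputTableName	Required. Input table.	-	-
outputTableName	Required. Output table.	-	-
labelColName	Required. Column that holds the label.	-	-
modelName	Required. Name of the input model.	-	-
featureColNames	Optional. Feature columns selected from the input table.	-	All columns except the label column
inputTablePartitions	Optional. Partitions selected from the input table.	-	Entire table
lifecycle	Optional. Lifecycle of the output table.	-	Not set
coreNum	Optional. Number of cores.	-	Computed automatically
memSizePerCore	Optional. Memory per core.	-	Computed automatically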
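
The parameter table can be turned into a small command builder. The sketch below is a hypothetical helper, not part of PAI: the function name `build_pai_command` and the shortened feature list are assumptions for illustration. It only assembles the text of a `pai` command and checks that the four required parameters are present.

```python
# Hypothetical helper, not part of PAI: assembles the text of a `pai` command
# from the parameters listed in the table above.
REQUIRED = ("inputTableName", "outputTableName", "labelColName", "modelName")


def build_pai_command(name, params, project="algo_public"):
    """Return a `pai -name ...` command string, checking required parameters."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    parts = ["pai -name {} -project {}".format(name, project)]
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            # Column lists become one comma-separated value, quoted as in the
            # random forest command above.
            value = '"' + ",".join(value) + '"'
        parts.append("-D{}={}".format(key, value))
    return "\n".join(parts) + ";"


command = build_pai_command(
    "feature_importance",
    {
        "inputTableName": "pai_dense_10_10",
        "modelName": "xlab_m_random_forests_1_20318_v0",
        "outputTableName": "erkang_test_dev.pai_temp_2252_20319_1",
        "labelColName": "y",
        # Shortened feature list, for illustration only.
        "featureColNames": ["pdays", "previous", "age", "campaign"],
        "lifecycle": 28,
    },
)
print(command)
```

Leaving out any of the four required parameters raises a `ValueError` before a command is produced, which mirrors the required/optional split in the table.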

Example

Training Data


drop table if exists pai_dense_10_10;
create table if not exists pai_dense_10_10 as
select
age,campaign,pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
from bank_data limit 10;

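Parameter Configuration

In the example workflow, the data source is pai_dense_10_10, column y is the label column for the random forest, and the remaining columns are feature columns. age and campaign are selected as forced-conversion columns, meaning these two features are treated as enumerated (categorical) features. All other parameters keep their defaults. The workflow then runs successfully.

Results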

colname	gini	entropy
age	0.06625000000000003	0.13978726292803723
campaign	0.0017500000000000003	0.004348515545596772
cons_conf_idx	0.013999999999999999	0.02908409497018851
cons_price_idx	0.002	0.0049804499913461255
emp_var_rate	0.014700000000000003	0.026786360680260933
euribor3m	0.06300000000000003	0.1321936348846039
nr_employed	0.10499999999999998	0.2203227248076733
pdays	0.0845	0.17750329234397513
poutcome	0.03360000000000001	0.07050327193845542
previous	0.017700000000000004	0.03810381005801592
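
To read the result table more easily, the features can be ranked by each metric. The sketch below copies the values from the table into a plain dictionary (the variable and function names are illustrative) and compares the two orderings.

```python
# Values copied from the result table above: colname -> (gini, entropy).
results = {
    "age": (0.06625000000000003, 0.13978726292803723),
    "campaign": (0.0017500000000000003, 0.004348515545596772),
    "cons_conf_idx": (0.013999999999999999, 0.02908409497018851),
    "cons_price_idx": (0.002, 0.0049804499913461255),
    "emp_var_rate": (0.014700000000000003, 0.026786360680260933),
    "euribor3m": (0.06300000000000003, 0.1321936348846039),
    "nr_employed": (0.10499999999999998, 0.2203227248076733),
    "pdays": (0.0845, 0.17750329234397513),
    "poutcome": (0.03360000000000001, 0.07050327193845542),
    "previous": (0.017700000000000004, 0.03810381005801592),
}


def rank(metric_index):
    """Return column names sorted from most to least important."""
    return sorted(results, key=lambda col: results[col][metric_index], reverse=True)


by_gini = rank(0)
by_entropy = rank(1)

# Positions where the two metrics disagree on the ordering.
differing = [(g, e) for g, e in zip(by_gini, by_entropy) if g != e]

print("top 4 by gini:   ", by_gini[:4])
print("top 4 by entropy:", by_entropy[:4])
print("rank differences:", differing)
```

For this table both metrics agree on the top six features, led by nr_employed, pdays, age, and euribor3m. They differ only in the order of emp_var_rate and cons_conf_idx.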

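Right-click the Random Forest Feature Importance component to view the visual analysis.

GBDT Feature Importance

Component Function

Computes feature importance for a gradient boosting decision tree (GBDT) model.

PAI Command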


pai -name gbdt_importance -project algo_public
-DmodelName=xlab_m_GBDT_LR_1_20307_v0
-Dlifecycle=28 -DoutputTableName=pai_temp_2252_20308_1 -DlabelColName=y
-DfeatureColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
-DinputTableName=pai_dense_10_9;

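Algorithm Parameters

Parameter	Description	Allowed Values	Default
inputTableName	Required. Input table.	-	-
outputTableName	Required. Output table.	-	-
labelColName	Required. Column that holds the label.	-	-
modelName	Required. Name of the input model.	-	-
featureColNames	Optional. Feature columns selected from the input table.	-	All columns except the label column
inputTablePartitions	Optional. Partitions selected from the input table.	-	Entire table
lifecycle	Optional. Lifecycle of the output table.	-	Not set
coreNum	Optional. Number of cores.	-	Computed automatically
memSizePerCore	Optional. Memory per core.	-	Computed automatically

Example

Input Data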


drop table if exists pai_dense_10_9;
create table if not exists pai_dense_10_9 as
select
age,campaign,pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
from bank_data limit 10;

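Parameter Configuration

In the example workflow, the input data is pai_dense_10_9. The GBDT binary classification component uses y as the label column and the remaining fields as feature columns. In the component's parameter settings, the minimum number of samples per leaf node is set to 1. Then run the workflow.

Results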

colname	feature_importance
age	0.004667214954427797
campaign	0.001962038566773853
cons_conf_idx	0.04857761873887033
cons_price_idx	0.01925292649801252
emp_var_rate	0.044881269590771274
euribor3m	0.025034606434306696
nr_employed	0.036085457464908766
pdays	0.639121250405536
previous	0.18041761734639272
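
A quick check on the table above shows how concentrated the importance is. The sketch below copies the values into a dictionary (names are illustrative), sums them, and prints a cumulative share. That the values sum to 1 is an observation about this particular output; whether the component always normalizes its output is an assumption the source does not confirm.

```python
# Values copied from the result table above: colname -> feature_importance.
importance = {
    "age": 0.004667214954427797,
    "campaign": 0.001962038566773853,
    "cons_conf_idx": 0.04857761873887033,
    "cons_price_idx": 0.01925292649801252,
    "emp_var_rate": 0.044881269590771274,
    "euribor3m": 0.025034606434306696,
    "nr_employed": 0.036085457464908766,
    "pdays": 0.639121250405536,
    "previous": 0.18041761734639272,
}

total = sum(importance.values())
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)

# Print each feature with its own share and the running cumulative share.
running = 0.0
for name, value in ranked:
    running += value
    print("{:<15} {:.4f}  cumulative {:.4f}".format(name, value, running))

print("total = {:.6f}".format(total))
```

In this run pdays and previous together account for about 82% of the total importance, so the remaining seven features contribute comparatively little to the model.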

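Right-click the component to view the visual analysis report.

Linear Model Feature Importance

Component Function

Computes feature importance for linear models, including linear regression and binary logistic regression. Both sparse and dense input formats are supported.

PAI Command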


PAI -name regression_feature_importance -project algo_public
-DmodelName=xlab_m_logisticregressi_20317_v0
-DoutputTableName=pai_temp_2252_20321_1
-DlabelColName=y
-DfeatureColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
-DenableSparse=false -DinputTableName=pai_dense_10_9;

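Algorithm Parameters

Parameter	Description	Allowed Values	Default
inputTableName	Required. Input table.	-	-
outputTableName	Required. Output table.	-	-
modelName	Required. Name of the input model.	-	-
labelColName	Required. Column that holds the label.	-	-
featureColNames	Optional. Feature columns selected from the input table.	-	All columns except the label column
inputTablePartitions	Optional. Partitions selected from the input table.	-	Entire table
enableSparse	Optional. Whether the input table data is in sparse format.	true, false	false
itemDelimiter	Optional. Delimiter between key-value pairs when the input data is in sparse format.	-	Space
kvDelimiter	Optional. Delimiter between key and value when the input data is in sparse format.	-	Colon
lifecycle	Optional. Lifecycle of the output table.	-	Not set
coreNum	Optional. Number of cores.	-	Computed automatically
memSizePerCore	Optional. Memory per core.	-	Computed automatically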

Example

Input Data

create table if not exists pai_dense_10_9 as
select
age,campaign,pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
from bank_data limit 10;

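Parameter Configuration

In the modeling workflow, the multiclass logistic regression component uses y as the label column and the remaining fields as feature columns. All other parameters keep their defaults. Then run the workflow.

Results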

colname	weight	importance
pdays	0.033942600256583334	16.31387797440866
previous	0.00004248130342485344	0.000030038817725357177
emp_var_rate	0.00006720242617694611	0.00010554561260753949
cons_price_idx	0.00012311047229142307	0.00006581255124425219
cons_conf_idx	0.00017227965471819213	0.0008918770542818432
euribor3m	0.00006113758212679113	0.00010427128177450988
nr_employed	0.0034541377310490697	0.26048098230126043
age	0.00009618162708080744	0.0009267659744232966
campaign	0.000019142551785274455	0.000041793353660529855

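Metric Formulas

Column	Formula
weight	abs(w_j)
importance	abs(w_j) * STD(f_j)

Here w_j is the model weight of feature j, and STD(f_j) is the standard deviation of that feature's values in the input table.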
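
The two formulas can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: the function name and the toy data are hypothetical, and using the population standard deviation (`statistics.pstdev`) is an assumption, since the source does not say which variant STD refers to.

```python
from statistics import pstdev


def linear_feature_importance(weights, columns):
    """Return {feature: (weight, importance)} using the formulas above.

    weight     = abs(w_j)
    importance = abs(w_j) * STD(f_j)   # population std is an assumption
    """
    result = {}
    for name, w in weights.items():
        weight = abs(w)
        result[name] = (weight, weight * pstdev(columns[name]))
    return result


# Hypothetical toy data: x1 varies widely, x2 is constant.
weights = {"x1": -0.5, "x2": 2.0}
columns = {
    "x1": [0.0, 10.0, 20.0, 30.0],
    "x2": [1.0, 1.0, 1.0, 1.0],
}

scores = linear_feature_importance(weights, columns)
for name, (weight, importance) in scores.items():
    print("{}: weight={:.4f} importance={:.4f}".format(name, weight, importance))
```

In the toy data x2 has the larger weight but zero importance because its column never varies, which is the effect the STD factor is meant to capture. Read in reverse, the result table gives importance / weight of roughly 480.6 for pdays, the standard deviation implied by the formula for that column.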

Right-click the Linear Model Feature Importance component to view the visual analysis.
