Feature Engineering
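Principal Component Analysis
- The PCA component uses principal component analysis to reduce the dimensionality of the input data. For the principles of the algorithm, see the PCA entry on Wikipedia.
- Only the dense data format is currently supported.
PAI command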
<pre>
PAI -name PrinCompAnalysis
    -project algo_public
    -DinputTableName=bank_data
    -DeigOutputTableName=pai_temp_2032_17900_2
    -DprincompOutputTableName=pai_temp_2032_17900_1
    -DselectedColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
    -DtransType=Simple
    -DcalcuType=CORR
    -DcontriRate=0.9;
</pre>
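The command above passes `contriRate=0.9`. The sketch below shows one common reading of such a setting: keep the smallest number of leading components whose eigenvalues account for at least that share of the total variance. This interpretation is an assumption, since the parameter is not defined here, and the eigenvalues and the helper name `components_for_rate` are hypothetical.

```python
def components_for_rate(eigenvalues, contri_rate):
    """Return how many leading components reach the cumulative contribution rate."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= contri_rate:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a 7 x 7 correlation matrix (they sum to 7),
# one per column listed in selectedColNames. Not real output of the component.
eigenvalues = [3.5, 1.75, 0.875, 0.5, 0.25, 0.1, 0.025]
print(components_for_rate(eigenvalues, 0.9))
```

With these made-up eigenvalues, the first three components cover 87.5% of the variance, so a fourth is needed to pass the 0.9 threshold.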
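Algorithm parameters
PCA output example
Table of dimension-reduced data
Table of eigenvalues and eigenvectors
Feature Scaling
Description
- Supports the common scaling functions log2, log10, ln, abs, and sqrt. Both dense and sparse data are supported.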
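As a rough illustration of what these five functions do to a numeric column, here is a minimal Python sketch. The name `scale_column` is hypothetical, and it is an assumption that invalid inputs (for example the logarithm of a non-positive value) simply raise an error; how the component itself treats such values is not stated here.

```python
import math

# Map each documented method name to the corresponding standard-library function.
SCALE_FUNCTIONS = {
    "log2": math.log2,
    "log10": math.log10,
    "ln": math.log,
    "abs": abs,
    "sqrt": math.sqrt,
}


def scale_column(values, method):
    """Apply one scaling function element-wise to a list of numbers."""
    if method not in SCALE_FUNCTIONS:
        raise ValueError(f"unsupported scale method: {method}")
    func = SCALE_FUNCTIONS[method]
    # math.log2/log10/log raise ValueError for values <= 0 (assumed behavior).
    return [func(v) for v in values]


print(scale_column([1, 8, 1024], "log2"))
```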
PAI command
<pre>
PAI -name fe_scale_runner -project algo_public
    -Dlifecycle=28
    -DscaleMethod=log2
    -DscaleCols=nr_employed
    -DinputTable=pai_dense_10_1
    -DoutputTable=pai_temp_2262_20380_1;
</pre>
Algorithm parameters
Example
Input data
<pre>
create table if not exists pai_dense_10_1 as
select
    nr_employed
from bank_data limit 10;
</pre>
Parameter configuration
Select nr_employed as the feature to scale (only numeric features are supported), and select log2 as the scaling function.
Results
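Feature Discretization
Description
- Supports discretization of numeric features in dense or sparse format.
- Supports equal-frequency discretization and equal-width discretization (the default).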
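The difference between the two methods can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the component's implementation: the function names are hypothetical, bins are numbered from 0, and the handling of ties and bin boundaries is chosen for simplicity.

```python
def equal_width_bins(values, max_bins):
    """Split the value range [min, max] into max_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / max_bins
    # The maximum value would fall into bin max_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), max_bins - 1) for v in values]


def equal_frequency_bins(values, max_bins):
    """Assign bins by rank so that each bin holds roughly the same number of rows."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        # Simplification: tied values may end up in different bins.
        bins[index] = rank * max_bins // len(values)
    return bins


skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(equal_width_bins(skewed, 5))      # bins follow the value range
print(equal_frequency_bins(skewed, 5))  # bins follow the row counts
```

On skewed data such as this, equal-width binning puts almost every row into the first bin, while equal-frequency binning spreads the rows evenly.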
PAI command
<pre>
PAI -name fe_discrete_runner -project algo_public
    -DdiscreteMethod=SameFrequecy
    -Dlifecycle=28
    -DmaxBins=5
    -DinputTable=pai_dense_10_1
    -DdiscreteCols=nr_employed
    -DoutputTable=pai_temp_2262_20382_1;
</pre>
Algorithm parameters
Modeling example
Input data
<pre>
create table if not exists pai_dense_10_1 as
select
    nr_employed
from bank_data limit 10;
</pre>
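Parameter configuration
The input table is pai_dense_10_1. Select nr_employed as the feature to discretize, and use equal-width discretization to split it into 5 intervals. The result is shown below.
Results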
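Feature Anomaly Smoothing
Component description
- Function: smooths anomalous values in the input features into a specified range. Both sparse and dense data are supported.
Note: the smoothing component only corrects anomalous feature values to normal ones. It does not filter or delete any records, so the dimensions and row count of the input data stay the same.
Smoothing methods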
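- Zscore smoothing: if the feature follows a normal distribution, noise generally lies more than 3 x alpha away from the mean, where alpha is the standard deviation. Zscore smoothing clamps such values to [-3 x alpha, 3 x alpha] around the mean.
e.g. a feature follows a normal distribution with mean 0 and standard deviation 3. A value of -10 is identified as anomalous and corrected to -3 x 3 + 0 = -9; likewise, 10 is corrected to 3 x 3 + 0 = 9.
- Percentile smoothing: values that fall outside the [minPer, maxPer] percentile range are clamped to the minPer or maxPer percentile.
e.g. the age feature takes values 0-200. With minPer set to 0 and maxPer set to 50%, values outside 0-100 are corrected to 0 or 100.
- Threshold smoothing: values that fall outside [minThresh, maxThresh] are clamped to minThresh or maxThresh.
e.g. the age feature takes values 0-200. With minThresh=10 and maxThresh=80, values outside 10-80 are corrected to 10 or 80.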
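All three methods reduce to clamping each value into an interval; they differ only in how the interval is derived. The Python sketch below makes that explicit. The function names are hypothetical, the mean and standard deviation are passed in rather than estimated from the column, and the percentile uses linear interpolation between sorted values. All of these are assumptions for illustration, not the component's documented behavior.

```python
def clamp(values, low, high):
    """Pull every value below low up to low and every value above high down to high."""
    return [min(max(v, low), high) for v in values]


def zscore_smooth(values, mean, std):
    """Clamp to [mean - 3 * std, mean + 3 * std]."""
    return clamp(values, mean - 3 * std, mean + 3 * std)


def percentile(sorted_values, per):
    """Percentile (0-100) of a sorted list, linearly interpolated (assumed method)."""
    pos = per / 100 * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def percentile_smooth(values, min_per, max_per):
    """Clamp to the values found at the min_per and max_per percentiles."""
    ordered = sorted(values)
    return clamp(values, percentile(ordered, min_per), percentile(ordered, max_per))


def threshold_smooth(values, min_thresh, max_thresh):
    """Clamp to the fixed interval [min_thresh, max_thresh]."""
    return clamp(values, min_thresh, max_thresh)


print(zscore_smooth([-10, 0, 10], mean=0, std=3))
print(threshold_smooth([5, 50, 200], 10, 80))
```

The output list always has the same length as the input, which matches the note that the component never drops records.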
PAI command
<pre>
PAI -name fe_soften_runner -project algo_public
    -DminThresh=5000 -Dlifecycle=28
    -DsoftenMethod=min-max-thresh
    -DsoftenCols=nr_employed
    -DmaxThresh=6000
    -DinputTable=pai_dense_10_1
    -DoutputTable=pai_temp_2262_20381_1;
</pre>
Algorithm parameters
Example
Input data
<pre>
create table if not exists pai_dense_10_1 as
select
    nr_employed
from bank_data limit 10;
</pre>
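Parameter configuration
Select nr_employed as the column to smooth. In the parameter settings, choose threshold smoothing with a lower bound of 5000 and an upper bound of 6000.
Results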