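Principal Component Analysis
- PCA uses principal component analysis to perform dimensionality reduction. For the principles of the PCA algorithm, see the wiki.
- Currently the dense data format is supported.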
PAI command
- PAI -name PrinCompAnalysis
- -project algo_public
- -DinputTableName=bank_data
- -DeigOutputTableName=pai_temp_2032_17900_2
- -DprincompOutputTableName=pai_temp_2032_17900_1
- -DselectedColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
- -DtransType=Simple
- -DcalcuType=CORR
- -DcontriRate=0.9;
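The calcuType=CORR and contriRate=0.9 settings in the command above can be read roughly as follows. This is a sketch under two assumptions that this page does not confirm: that CORR means the eigen-decomposition is taken on the correlation matrix, and that contriRate is the cumulative contribution threshold used to decide how many components to keep. The symbols Z, R, lambda, v, c, k and Y are notation introduced only for this illustration.

```latex
% Z: n x p matrix of the selected columns, each standardized to mean 0 and unit sample variance
R = \frac{1}{n-1} Z^{\top} Z, \qquad R\,v_j = \lambda_j v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

% contribution rate of component j (the eigenvalues of a correlation matrix sum to p)
c_j = \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i} = \frac{\lambda_j}{p}

% keep the smallest k whose cumulative contribution reaches the threshold, here 0.9
k = \min\Bigl\{ m : \sum_{j=1}^{m} c_j \ge 0.9 \Bigr\}, \qquad Y = Z\,[v_1, \dots, v_k]
```

Under this reading, the eigenvalues and eigenvectors would populate the table named by eigOutputTableName and the projected rows Y the table named by princompOutputTableName.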
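Algorithm parameters
PCA output example
Dimension-reduced data table
Eigenvalue and eigenvector table
Feature Scaling
Function description
- Supports the common scale transformation functions log2, log10, ln, abs, and sqrt. Supports dense or sparse data.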
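As a minimal sketch of what these five functions do to a numeric column, the snippet below maps each method name to its standard-library counterpart. The names SCALE_FUNCS and scale_column are hypothetical helpers for illustration, not part of the PAI component, and the sketch does not model how the component treats zero or negative inputs to log or sqrt.

```python
import math

# Method names mirror the scale functions listed above; the mapping itself is illustrative.
SCALE_FUNCS = {
    "log2": math.log2,
    "log10": math.log10,
    "ln": math.log,      # natural logarithm
    "abs": abs,
    "sqrt": math.sqrt,
}

def scale_column(values, method):
    """Apply one scale function to every value of a numeric column."""
    func = SCALE_FUNCS[method]
    return [func(v) for v in values]

# Example: log2-scale a few positive values.
print(scale_column([1.0, 8.0, 1024.0], "log2"))
```

In this sketch, non-positive inputs to the log functions and negative inputs to sqrt raise ValueError, so real data may need cleaning first.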
PAI command
- PAI -name fe_scale_runner -project algo_public
- -Dlifecycle=28
- -DscaleMethod=log2
- -DscaleCols=nr_employed
- -DinputTable=pai_dense_10_1
- -DoutputTable=pai_temp_2262_20380_1;
Algorithm parameters
Example
Input data
- create table if not exists pai_dense_10_1 as
- select
- nr_employed
- from bank_data limit 10;
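Parameter configuration
Select nr_employed as the feature to scale and select log2 as the scale function; only numeric features are supported.
Result
Feature Discretization
Discretization module overview
- Supports discretizing numeric features in dense or sparse format
- Supports equal-frequency discretization and equal-width discretization (default)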
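The difference between the two methods can be sketched as below. Both functions are hypothetical illustrations rather than the component's implementation: equal-width splits the value range into bins of the same size, while equal-frequency assigns bins by rank so that each bin holds about the same number of rows. As a simplification, tied values may land in different bins in the equal-frequency sketch.

```python
def equal_width_bins(values, max_bins):
    """Bin index by position in the value range; every bin spans the same width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / max_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), max_bins - 1) for v in values]

def equal_frequency_bins(values, max_bins):
    """Bin index by rank; every bin receives roughly the same number of rows."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * max_bins // len(values)
    return bins

skewed = [1, 2, 3, 4, 100]
print(equal_width_bins(skewed, 2))
print(equal_frequency_bins(skewed, 2))
```

On skewed data the equal-width bins are dominated by the outlier, while the equal-frequency bins stay balanced, which is the usual reason to prefer one over the other.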
PAI command
- PAI -name fe_discrete_runner -project algo_public
- -DdiscreteMethod=SameFrequecy
- -Dlifecycle=28
- -DmaxBins=5
- -DinputTable=pai_dense_10_1
- -DdiscreteCols=nr_employed
- -DoutputTable=pai_temp_2262_20382_1;
Algorithm parameters
Modeling example
Input data
- create table if not exists pai_dense_10_1 as
- select
- nr_employed
- from bank_data limit 10;
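Parameter configuration
The input data is pai_dense_10_1. Select nr_employed as the feature to discretize and use the equal-width method to split it into 5 intervals. The result is as follows.
Result
Feature Outlier Smoothing
Component overview
- Function: smooths abnormal values in the input features into a given range; supports sparse and dense data
ps: the smoothing component only corrects abnormal feature values to normal values. It does not filter or delete any records, so the dimensions and row count of the input data stay unchanged.
Smoothing methods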
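- Z-score smoothing: if a feature follows a normal distribution, noise generally falls outside the range from mean - 3 x sigma to mean + 3 x sigma, where sigma is the standard deviation. Z-score smoothing pulls values outside this range back to [mean - 3 x sigma, mean + 3 x sigma].
eg: a feature follows a normal distribution with mean 0 and standard deviation 3, so a value of -10 is identified as abnormal and corrected to -3x3+0=-9; likewise, 10 is corrected to 3x3+0=9
- Percentile smoothing: values outside the [minPer, maxPer] percentile range are smoothed to the minPer/maxPer percentile points
eg: the age feature takes values 0-200; with minPer set to 0 and maxPer set to 50%, values outside 0-100 are corrected to 0 or 100
- Threshold smoothing: values outside [minThresh, maxThresh] are smoothed to minThresh or maxThresh.
eg: the age feature takes values 0-200; with minThresh=10 and maxThresh=80, values outside 10-80 are corrected to 10 or 80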
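The three methods above all reduce to clipping values into an interval; they differ only in how the interval is chosen. The following is a minimal sketch with hypothetical function names, not the component's own code. The percentile helper uses linear interpolation between ranks, which is an assumption, since the page does not say how percentile points are computed.

```python
import statistics

def clip(value, low, high):
    """Pull a value back into [low, high]; values already inside are unchanged."""
    return max(low, min(high, value))

def zscore_soften(values, mean=None, std=None, k=3.0):
    """Clip to [mean - k*std, mean + k*std]; estimate mean/std from the data if omitted."""
    mean = statistics.fmean(values) if mean is None else mean
    std = statistics.pstdev(values) if std is None else std
    return [clip(v, mean - k * std, mean + k * std) for v in values]

def percentile(sorted_values, p):
    """Percentile point for p in [0, 1], interpolating linearly between ranks."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (pos - lower) * (sorted_values[upper] - sorted_values[lower])

def percentile_soften(values, min_per, max_per):
    """Clip to the [min_per, max_per] percentile points of the data."""
    ordered = sorted(values)
    low, high = percentile(ordered, min_per), percentile(ordered, max_per)
    return [clip(v, low, high) for v in values]

def thresh_soften(values, min_thresh, max_thresh):
    """Clip to fixed thresholds."""
    return [clip(v, min_thresh, max_thresh) for v in values]

# The three examples from the text: mean 0 / std 3, ages 0-200 at 0%-50%, thresholds 10-80.
print(zscore_soften([-10, 0, 10], mean=0, std=3))
print(max(percentile_soften(list(range(201)), 0.0, 0.5)))
print(thresh_soften([0, 50, 200], 10, 80))
```

Every function returns a list of the same length as its input, matching the note that smoothing never drops records.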
3. PAI command
- PAI -name fe_soften_runner -project algo_public
- -DminThresh=5000 -Dlifecycle=28
- -DsoftenMethod=min-max-thresh
- -DsoftenCols=nr_employed
- -DmaxThresh=6000
- -DinputTable=pai_dense_10_1
- -DoutputTable=pai_temp_2262_20381_1;
4. Algorithm parameters
5. Example
1. Input data
- create table if not exists pai_dense_10_1 as
- select
- nr_employed
- from bank_data limit 10;
Parameter configuration
Select nr_employed as the feature column to smooth, then choose threshold smoothing in the parameter configuration with a lower bound of 5000 and an upper bound of 6000.
Result
Random Forest Feature Importance
Component function
Computes feature importance from the original data and a random forest model.
PAI command
- pai -name feature_importance -project algo_public
- -DinputTableName=pai_dense_10_10
- -DmodelName=xlab_m_random_forests_1_20318_v0
- -DoutputTableName=erkang_test_dev.pai_temp_2252_20319_1
- -DlabelColName=y
- -DfeatureColNames="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
- -Dlifecycle=28 ;
Algorithm parameters
Example
Training data
- drop table if exists pai_dense_10_10;
- create table if not exists pai_dense_10_10 as
- select
- age,campaign,pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
- from bank_data limit 10;
Parameter configuration
The example workflow is shown below. The data source is pai_dense_10_10; column y is the label column for the random forest and the remaining columns are feature columns. age and campaign are selected as forced-conversion columns, which means these two features are treated as enumerated features. All other parameters keep their defaults. A successful run looks as follows.
Result
On the Random Forest Feature Importance component, right-click to view the visual analysis; the result is shown below.
GBDT Feature Importance
Component function
Computes feature importance for gradient boosting decision trees (GBDT).
PAI command
- pai -name gbdt_importance -project algo_public
- -DmodelName=xlab_m_GBDT_LR_1_20307_v0
- -Dlifecycle=28 -DoutputTableName=pai_temp_2252_20308_1 -DlabelColName=y
- -DfeatureColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
- -DinputTableName=pai_dense_10_9;
Algorithm parameters
Example
Input data
- drop table if exists pai_dense_10_9;
- create table if not exists pai_dense_10_9 as
- select
- age,campaign,pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
- from bank_data limit 10;
Parameter configuration
The example workflow is shown below. The input data is pai_dense_10_9. In the GBDT binary classification component, select y as the label column and the other fields as feature columns. In the component parameters, set the minimum number of samples per leaf node to 1, then run.
Result
Right-click to view the visual analysis report.
Linear Model Feature Importance
Component function
Computes feature importance for linear models, including linear regression and binary logistic regression. Supports sparse and dense data.
PAI command
- PAI -name regression_feature_importance -project algo_public
- -DmodelName=xlab_m_logisticregressi_20317_v0
- -DoutputTableName=pai_temp_2252_20321_1
- -DlabelColName=y
- -DfeatureColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
- -DenableSparse=false -DinputTableName=pai_dense_10_9;
Algorithm parameters
Example
Input data
- create table if not exists pai_dense_10_9 as
- select
- age,campaign,pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
- from bank_data limit 10;
Parameter configuration
The modeling workflow is shown in the figure below. In the logistic regression multiclass component, select y as the label column and the other fields as feature columns, keep the other parameters at their defaults, and run.
Result
Metric formulas
On the Linear Model Feature Importance component, right-click to view the visual analysis.
Preference Calculation
Overview
- Given detailed user behavior feature data, automatically computes each user's preference score for the feature values.
- The input table contains the user id and the user's detailed behavior features. For example, in the Koubei in-store scenario, suppose user 2088xxx1 ate Sichuan cuisine (川菜) twice and Western fast food (西式快餐) once within 3 months; the input table then has the following form:
- The output table contains the user's preference scores for 川菜 and 西式快餐, in the following form:
PAI command example
- pai -name=preference
- -project=algo_public
- -DInputTableName=preference_input_table
- -DIdColumnName=user_id
- -DFeatureColNames=cate
- -DOutputTableName=preference_output_table
- -DmapInstanceNum=2
- -DreduceInstanceNum=1;
Algorithm parameters
Example
Test data
SQL to create the data
- drop table if exists preference_input_table;
- create table preference_input_table as
- select
- *
- from
- (
- select '2088xxx1' as user_id, '川菜' as cate from alipaydw.dual
- union all
- select '2088xxx1' as user_id, '川菜' as cate from alipaydw.dual
- union all
- select '2088xxx1' as user_id, '西式快餐' as cate from alipaydw.dual
- union all
- select '2088xxx3' as user_id, '川菜' as cate from alipaydw.dual
- union all
- select '2088xxx3' as user_id, '川菜' as cate from alipaydw.dual
- union all
- select '2088xxx3' as user_id, '西式快餐' as cate from alipaydw.dual
- ) tmp;
Result
- +------------+------------+
- | user_id | cate |
- +------------+------------+
- | 2088xxx1 | 川菜:0.0544694,西式快餐:0.0272347 |
- | 2088xxx3 | 川菜:0.0544694,西式快餐:0.0272347 |
- +------------+------------+
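The scoring formula is not given on this page, so the sketch below reproduces only the aggregation step that any such score has to start from: counting how often each user shows each feature value. The function count_behaviors is a hypothetical helper, and the rows are the same ones inserted into preference_input_table above.

```python
from collections import Counter, defaultdict

def count_behaviors(rows):
    """Count how often each (user_id, feature value) pair occurs."""
    counts = defaultdict(Counter)
    for user_id, cate in rows:
        counts[user_id][cate] += 1
    return counts

# Same rows as preference_input_table.
rows = [
    ("2088xxx1", "川菜"), ("2088xxx1", "川菜"), ("2088xxx1", "西式快餐"),
    ("2088xxx3", "川菜"), ("2088xxx3", "川菜"), ("2088xxx3", "西式快餐"),
]
counts = count_behaviors(rows)
for user_id, per_cate in counts.items():
    print(user_id, dict(per_cate))
```

Each user has two 川菜 rows and one 西式快餐 row, and the reported scores 0.0544694 and 0.0272347 are in the same 2:1 ratio. This is consistent with a score proportional to the behavior count, although the normalization behind the absolute values is not stated here.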