统计分析
目录
百分位
对一个存在的表,单列数据计算百分位
参数设置
选择需要分析的字段,仅支持double类型和bigint类型
运行结果,如下
PAI 命令
<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- PAI -name Percentile -project algo_public -DoutputTableName="pai_temp_666_6014_1"\
- -DcolName="euribor3m" -DinputTableName="bank_data";
- name: 组件名字
- project: project名字,用于指定算法所在空间。系统默认是algo_public,用户自己更改后系统会报错
- outputTableName: 系统执行百分位运算后自动分配的结果表
- colName:要计算百分位的列,仅支持数字型
- inputTableName: 输入表的名字
全表统计
对一个存在的表,进行全表基本统计,或者仅对选中的列做统计
PAI 命令
<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- PAI -name stat_summary
- -project algo_public
- -DinputTableName=test_data
- -DoutputTableName=test_summary_out
- -DinputTablePartitions="ds='20160101'"
- -DselectColNames=col0,col1,col2
- -Dlifecycle=1
参数说明
输入格式:选择输入列框中可选择需要进行统计的列,默认情况下统计全部列
输出格式:输出统计结果的全部字段如下
实例
测试数据
新建数据SQL<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- drop table if exists summary_test_input;
- create table summary_test_input as
- select
- *
- from
- (
- select 'a' as col1, 1 as col2, 0.001 as col3 from dual
- union all
- select 'b' as col1, 2 as col2, 100.01 as col3 from dual
- ) tmp;
运行命令<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- PAI -name stat_summary
- -project algo_public
- -DinputTableName=summary_test_input
- -DoutputTableName=summary_test_input_out
- -DselectColNames=col1,col2,col3
- -Dlifecycle=1;
运行结果<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- | colname | datatype | totalcount | count | missingcount | nancount | positiveinfinitycount | negativeinfinitycount | min | max | mean | variance | standarddeviation | standarderror | skewness | kurtosis | moment2 | moment3 | moment4 | centralmoment2 | centralmoment3 | centralmoment4 | sum | sum2 | sum3 | sum4 |
- | col1 | string | 2 | 2 | 0 | 0 | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
- | col2 | bigint | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 2 | 1.5 | 0.5 | 0.7071067811865476 | 0.5 | 0 | -2 | 2.5 | 4.5 | 8.5 | 0.25 | 0 | 0.0625 | 3 | 5 | 9 | 17 |
- | col3 | double | 2 | 2 | 0 | 0 | 0 | 0 | 0.001 | 100.01 | 50.0055 | 5000.900040500001 | 70.71704207968544 | 50.00450000000001 | 2.327677906939552e-16 | -1.999999999999999 | 5001.000050500001 | 500150.0150005006 | 50020003.00020002 | 2500.45002025 | 2.91038304567337e-11 | 6252250.303768232 | 100.011 | 10002.000101 | 1000300.030001001 | 100040006.0004 |
皮尔森系数
对输入表或分区的2列(必须为数值列),计算其pearson相关系数,结果存入输出表。
使用说明
组件的仅两个参数:输入列1、输入列2;将需要计算相关系数的两列的列名填入即可;
运行后,组件右击菜单—> 查看分析报告,如下最后一列皮尔森系数值
pai命令示例
<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- pai -name pearson
- -project algo_test
- -DinputTableName=wpbc
- -Dcol1Name=f1
- -Dcol2Name=f2
- -DoutputTableName=wpbc_pear;
算法参数
直方图
对一个存在的表,单列数据计算直方图
参数设置
选择需要分析字段,支持double类型和bigint类型
查看分析报告,如下
可调节步长大小,以及滑动查看直方图
离散值特征分析
离散值特征分析统计离散特征的分布,gini,entropy,gini gain,infomation gain,infomation gain ratio等指标。其中计算每个离散值对应的gini,entropy,计算单列对应的gini gain,infomation gain,infomation gain ratio。
gini index:
entropy:
pai命令示例
<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- PAI
- -name enum_feature_selection
- -project algo_public
- -DinputTableName=enumfeautreselection_input
- -DlabelColName=label
- -DfeatureColNames=col0,col1
- -DenableSparse=false
- -DoutputCntTableName=enumfeautreselection_output_cntTable
- -DoutputValueTableName=enumfeautreselection_output_valuetable
- -DoutputEnumValueTableName=enumfeautreselection_output_enumvaluetable;
算法参数
示例
测试数据
新建数据SQL<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- drop table if exists enum_feature_selection_test_input;
- create table enum_feature_selection_test_input
- as
- select
- *
- from
- (
- select
- '00' as col_string,
- 1 as col_bigint,
- 0.0 as col_double
- from dual
- union all
- select
- cast(null as string) as col_string,
- 0 as col_bigint,
- 0.0 as col_double
- from dual
- union all
- select
- '01' as col_string,
- 0 as col_bigint,
- 1.0 as col_double
- from dual
- union all
- select
- '01' as col_string,
- 1 as col_bigint,
- cast(null as double) as col_double
- from dual
- union all
- select
- '01' as col_string,
- 1 as col_bigint,
- 1.0 as col_double
- from dual
- union all
- select
- '00' as col_string,
- 0 as col_bigint,
- 0.0 as col_double
- from dual
- ) tmp;
输入数据说明<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- +------------+------------+------------+
- | col_string | col_bigint | col_double |
- +------------+------------+------------+
- | 01 | 1 | 1.0 |
- | 01 | 0 | 1.0 |
- | 01 | 1 | NULL |
- | NULL | 0 | 0.0 |
- | 00 | 1 | 0.0 |
- | 00 | 0 | 0.0 |
- +------------+------------+------------+
运行命令<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- drop table if exists enum_feature_selection_test_input_enum_value_output;
- drop table if exists enum_feature_selection_test_input_cnt_output;
- drop table if exists enum_feature_selection_test_input_value_output;
- PAI -name enum_feature_selection -project algo_public -DitemDelimiter=":" -Dlifecycle="28" -DoutputValueTableName="enum_feature_selection_test_input_value_output" -DkvDelimiter="," -DlabelColName="col_bigint" -DfeatureColNames="col_double,col_string" -DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output" -DenableSparse="false" -DinputTableName="enum_feature_selection_test_input" -DoutputCntTableName="enum_feature_selection_test_input_cnt_output";
界面
参数界面
界面运行结果
运行结果
enum_feature_selection_test_input_cnt_output<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- +------------+------------+------------+------------+
- | colname | colvalue | labelvalue | cnt |
- +------------+------------+------------+------------+
- | col_double | NULL | 1 | 1 |
- | col_double | 0 | 0 | 2 |
- | col_double | 0 | 1 | 1 |
- | col_double | 1 | 0 | 1 |
- | col_double | 1 | 1 | 1 |
- | col_string | NULL | 0 | 1 |
- | col_string | 00 | 0 | 1 |
- | col_string | 00 | 1 | 1 |
- | col_string | 01 | 0 | 1 |
- | col_string | 01 | 1 | 2 |
- +------------+------------+------------+------------+
enum_feature_selection_test_input_value_output<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- +------------+------------+------------+------------+------------+---------------+
- | colname | gini | entropy | infogain | ginigain | infogainratio |
- +------------+------------+------------+------------+------------+---------------+
- | col_double | 0.3888888888888889 | 0.792481250360578 | 0.20751874963942196 | 0.1111111111111111 | 0.14221913160264427 |
- | col_string | 0.38888888888888884 | 0.792481250360578 | 0.20751874963942196 | 0.11111111111111116 | 0.14221913160264427 |
- +------------+------------+------------+------------+------------+---------------+
enum_feature_selection_test_input_enum_value_output<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- +------------+------------+------------+------------+
- | colname | colvalue | gini | entropy |
- +------------+------------+------------+------------+
- | col_double | NULL | 0.0 | 0.0 |
- | col_double | 0 | 0.22222222222222224 | 0.4591479170272448 |
- | col_double | 1 | 0.16666666666666666 | 0.3333333333333333 |
- | col_string | NULL | 0.0 | 0.0 |
- | col_string | 00 | 0.16666666666666666 | 0.3333333333333333 |
- | col_string | 01 | 0.2222222222222222 | 0.4591479170272448 |
- +------------+------------+------------+------------+
T检验
单样本T检验是检验某个变量的总体均值和某指定值之间是否存在显著差异。T检验的前提是样本总体服从正态分布。
pai命令示例
<pre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
- pai -name t_test -project algo_public
- -DxTableName=pai_t_test_all_type
- -DxColName=col1_double
- -DoutputTableName=pai_t_test_out
- -DxTablePartitions=ds=2010/dt=1
- -Dalternative=less
- -Dmu=47
- -DconfidenceLevel=0.95