开发者社区> 问答> 正文

统计分析(1)



统计分析




目录




百分位


对一个存在的表,单列数据计算百分位

参数设置


选择需要分析的字段,仅支持double类型和bigint类型
运行结果,如下

PAI 命令

<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. PAI -name Percentile -project algo_public -DoutputTableName="pai_temp_666_6014_1"\
  2.     -DcolName="euribor3m" -DinputTableName="bank_data";
  • name: 组件名字
  • project: project名字,用于指定算法所在空间。系统默认是algo_public,用户自己更改后系统会报错
  • outputTableName: 系统执行百分位运算后自动分配的结果表
  • colName:要计算百分位的列,仅支持数字型
  • inputTableName: 输入表的名字



全表统计


对一个存在的表,进行全表基本统计,或者仅对选中的列做统计


PAI 命令

<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. PAI -name stat_summary
  2. -project algo_public
  3. -DinputTableName=test_data
  4. -DoutputTableName=test_summary_out
  5. -DinputTablePartitions="ds='20160101'"
  6. -DselectColNames=col0,col1,col2
  7. -Dlifecycle=1


参数说明


参数名称参数描述参数值可选项默认值
inputTableName必选,输入表名--
outputTableName必选,推荐结果的输出表名--
inputTablePartitions可选,输入表的分区-“”
selectColNames可选,指定需要统计的列名-“”
lifecycle可选,输出结果表的生命周期-不设生命周期
coreNum可选,指定instance的总数--1
memSizePerCore可选,指定memory大小,范围在100~64*1024之间--1


输入格式:选择输入列框中可选择需要进行统计的列,默认情况下统计全部列



输出格式:输出统计结果的全部字段如下


列名描述
colname列名
datatype类型
totalcount总数
count非NULL数量
missingcountNULL数量
nancountNAN数量
positiveinfinitycount正无穷数量
negativeinfinitycount负无穷数量
min最小值
max最大值
mean平均值
variance方差
standarddeviation标准差
standarderror标准误差
skewness偏度
kurtosis峰度
moment2二阶矩
moment3三阶矩
moment4四阶矩
centralmoment2二阶中心距
centralmoment3三阶中心距
centralmoment4四阶中心距
sum总和
sum2平方和
sum3立方和
sum4四次方和


实例


测试数据
新建数据SQL<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. drop table if exists summary_test_input;
  2. create table summary_test_input as
  3. select
  4.   *
  5. from
  6. (
  7.   select 'a' as col1, 1 as col2, 0.001 as col3 from dual
  8.     union all
  9.   select 'b' as col1, 2 as col2, 100.01 as col3 from dual
  10. ) tmp;

运行命令<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. PAI -name stat_summary
  2. -project algo_public
  3. -DinputTableName=summary_test_input
  4. -DoutputTableName=summary_test_input_out
  5. -DselectColNames=col1,col2,col3
  6. -Dlifecycle=1;

运行结果<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1.     | colname    | datatype   | totalcount | count      | missingcount | nancount   | positiveinfinitycount | negativeinfinitycount | min        | max        | mean       | variance   | standarddeviation | standarderror | skewness   | kurtosis   | moment2    | moment3    | moment4    | centralmoment2 | centralmoment3 | centralmoment4 | sum        | sum2       | sum3       | sum4       |
  2.     | col1       | string     | 2          | 2          | 0            | 0          | 0                     | 0                     | NULL       | NULL       | NULL       | NULL       | NULL              | NULL          | NULL       | NULL       | NULL       | NULL       | NULL       | NULL           | NULL           | NULL           | NULL       | NULL       | NULL       | NULL       |
  3.     | col2       | bigint     | 2          | 2          | 0            | 0          | 0                     | 0                     | 1          | 2          | 1.5        | 0.5        | 0.7071067811865476 | 0.5           | 0          | -2         | 2.5        | 4.5        | 8.5        | 0.25           | 0              | 0.0625         | 3          | 5          | 9          | 17         |
  4.     | col3       | double     | 2          | 2          | 0            | 0          | 0                     | 0                     | 0.001      | 100.01     | 50.0055    | 5000.900040500001 | 70.71704207968544 | 50.00450000000001 | 2.327677906939552e-16 | -1.999999999999999 | 5001.000050500001 | 500150.0150005006 | 50020003.00020002 | 2500.45002025  | 2.91038304567337e-11 | 6252250.303768232 | 100.011    | 10002.000101 | 1000300.030001001 | 100040006.0004 |



皮尔森系数


对输入表或分区的2列(必须为数值列),计算其pearson相关系数,结果存入输出表。

使用说明


  1. 组件的仅两个参数:输入列1、输入列2;将需要计算相关系数的两列的列名填入即可;

  2. 运行后,组件右击菜单—> 查看分析报告,如下最后一列皮尔森系数值


pai命令示例

<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. pai -name pearson
  2.     -project algo_test
  3.     -DinputTableName=wpbc
  4.     -Dcol1Name=f1
  5.     -Dcol2Name=f2
  6.     -DoutputTableName=wpbc_pear;


算法参数


参数key名称参数描述取值范围是否必选,默认值/行为
inputTableName输入表的表名表名必选
inputTablePartitions输入表中指定哪些分区参与计算格式为: partition_name=value。如果是多级格式为name1=value1/name2=value2;如果是指定多个分区,中间用’,’分开输入表的所有partition
col1Name输入列1列名必选
col2Name输入列2列名必选
outputTableName输出结果表表名必选



直方图


对一个存在的表,单列数据计算直方图

参数设置


选择需要分析字段,支持double类型和bigint类型
查看分析报告,如下 可调节步长大小,以及滑动查看直方图


离散值特征分析


离散值特征分析统计离散特征的分布,gini,entropy,gini gain,infomation gain,infomation gain ratio等指标。其中计算每个离散值对应的gini,entropy,计算单列对应的gini gain,infomation gain,infomation gain ratio。

gini index:
entropy:


pai命令示例

<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. PAI
  2. -name enum_feature_selection
  3. -project algo_public
  4. -DinputTableName=enumfeautreselection_input
  5. -DlabelColName=label
  6. -DfeatureColNames=col0,col1
  7. -DenableSparse=false
  8. -DoutputCntTableName=enumfeautreselection_output_cntTable
  9. -DoutputValueTableName=enumfeautreselection_output_valuetable
  10. -DoutputEnumValueTableName=enumfeautreselection_output_enumvaluetable;


算法参数


参数key名称参数描述取值范围默认值
inputTableName必选,输入表名--
inputTablePartitions可选,输入表选择的分区-默认选择全表
featureColNames可选,输入表选择的列名-默认选择除label外的其他列,如果输入表为KV格式,则默认选择所有的string类型的列
labelColName必选,label所在的列--
enableSparse可选,输入表是否是KV格式-默认为表
kvFeatureColNames可选,KV格式的特征-默认选择全表
kvDelimiter可选,KV之间的分隔符-默认为:
itemDelimiter可选,K和V的分隔符-默认为,
outputCntTableName必选,输出离散特征的枚举值分布数表--
outputValueTableName必选,输出离散特征的gini,entropy表--
outputEnumValueTableName必选,输出离散特征枚举值gini,entropy表--
lifecycle可选,输入表的声明周期-默认不设置声明周期
coreNum可选,总得core个数-默认自动设置
memSizePerCore可选,单个core对应的内存数量,单位为M-默认为自动设置


示例


测试数据
新建数据SQL<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. drop table if exists enum_feature_selection_test_input;
  2. create table enum_feature_selection_test_input
  3. as
  4. select
  5.     *
  6. from
  7. (
  8.     select
  9.         '00' as col_string,
  10.         1 as col_bigint,
  11.         0.0 as col_double
  12.     from dual
  13.     union all
  14.         select
  15.             cast(null as string) as col_string,
  16.             0 as col_bigint,
  17.             0.0 as col_double
  18.         from dual
  19.     union all
  20.         select
  21.             '01' as col_string,
  22.             0 as col_bigint,
  23.             1.0 as col_double
  24.         from dual
  25.     union all
  26.         select
  27.             '01' as col_string,
  28.             1 as col_bigint,
  29.             cast(null as double) as col_double
  30.         from dual
  31.     union all
  32.         select
  33.             '01' as col_string,
  34.             1 as col_bigint,
  35.             1.0 as col_double
  36.         from dual
  37.     union all
  38.         select
  39.             '00' as col_string,
  40.             0 as col_bigint,
  41.             0.0 as col_double
  42.         from dual
  43. ) tmp;

输入数据说明<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. +------------+------------+------------+
  2. | col_string | col_bigint | col_double |
  3. +------------+------------+------------+
  4. | 01         | 1          | 1.0        |
  5. | 01         | 0          | 1.0        |
  6. | 01         | 1          | NULL       |
  7. | NULL       | 0          | 0.0        |
  8. | 00         | 1          | 0.0        |
  9. | 00         | 0          | 0.0        |
  10. +------------+------------+------------+

运行命令<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. drop table if exists enum_feature_selection_test_input_enum_value_output;
  2. drop table if exists enum_feature_selection_test_input_cnt_output;
  3. drop table if exists enum_feature_selection_test_input_value_output;
  4. PAI -name enum_feature_selection -project algo_public -DitemDelimiter=":" -Dlifecycle="28" -DoutputValueTableName="enum_feature_selection_test_input_value_output" -DkvDelimiter="," -DlabelColName="col_bigint" -DfeatureColNames="col_double,col_string" -DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output" -DenableSparse="false" -DinputTableName="enum_feature_selection_test_input" -DoutputCntTableName="enum_feature_selection_test_input_cnt_output";

界面

参数界面

界面运行结果

运行结果
enum_feature_selection_test_input_cnt_output<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. +------------+------------+------------+------------+
  2. | colname    | colvalue   | labelvalue | cnt        |
  3. +------------+------------+------------+------------+
  4. | col_double | NULL       | 1          | 1          |
  5. | col_double | 0          | 0          | 2          |
  6. | col_double | 0          | 1          | 1          |
  7. | col_double | 1          | 0          | 1          |
  8. | col_double | 1          | 1          | 1          |
  9. | col_string | NULL       | 0          | 1          |
  10. | col_string | 00         | 0          | 1          |
  11. | col_string | 00         | 1          | 1          |
  12. | col_string | 01         | 0          | 1          |
  13. | col_string | 01         | 1          | 2          |
  14. +------------+------------+------------+------------+

enum_feature_selection_test_input_value_output<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. +------------+------------+------------+------------+------------+---------------+
  2. | colname    | gini       | entropy    | infogain   | ginigain   | infogainratio |
  3. +------------+------------+------------+------------+------------+---------------+
  4. | col_double | 0.3888888888888889 | 0.792481250360578 | 0.20751874963942196 | 0.1111111111111111 | 0.14221913160264427 |
  5. | col_string | 0.38888888888888884 | 0.792481250360578 | 0.20751874963942196 | 0.11111111111111116 | 0.14221913160264427 |
  6. +------------+------------+------------+------------+------------+---------------+

enum_feature_selection_test_input_enum_value_output<divre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. +------------+------------+------------+------------+
  2. | colname    | colvalue   | gini       | entropy    |
  3. +------------+------------+------------+------------+
  4. | col_double | NULL       | 0.0        | 0.0        |
  5. | col_double | 0          | 0.22222222222222224 | 0.4591479170272448 |
  6. | col_double | 1          | 0.16666666666666666 | 0.3333333333333333 |
  7. | col_string | NULL       | 0.0        | 0.0        |
  8. | col_string | 00         | 0.16666666666666666 | 0.3333333333333333 |
  9. | col_string | 01         | 0.2222222222222222 | 0.4591479170272448 |
  10. +------------+------------+------------+------------+



T检验


单样本T检验是检验某个变量的总体均值和某指定值之间是否存在显著差异。T检验的前提是样本总体服从正态分布。

pai命令示例

<pre style='background: rgb(246, 246, 246); font: 12px/1.6 "YaHei Consolas Hybrid", Consolas, "Meiryo UI", "Malgun Gothic", "Segoe UI", "Trebuchet MS", Helvetica, monospace, monospace; margin: 0px 0px 16px; padding: 10px; outline: 0px; border-radius: 3px; border: 1px solid rgb(221, 221, 221); color: rgb(51, 51, 51); text-transform: none; text-indent: 0px; letter-spacing: normal; overflow: auto; word-spacing: 0px; white-space: pre-wrap; word-wrap: break-word; box-sizing: border-box; orphans: 2; widows: 2; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;' prettyprinted?="" linenums="">
  1. pai -name t_test -project algo_public
  2.     -DxTableName=pai_t_test_all_type
  3.     -DxColName=col1_double
  4.     -DoutputTableName=pai_t_test_out
  5.     -DxTablePartitions=ds=2010/dt=1
  6.     -Dalternative=less
  7.     -Dmu=47
  8.     -DconfidenceLevel=0.95

展开
收起
nicenelly 2017-10-25 10:47:01 1579 0
0 条回答
写回答
取消 提交回答
问答排行榜
最热
最新

相关电子书

更多
实时数据分析演示 立即下载
《实时数据分析演示)》 立即下载
BigGraph:大规模在线图分析平台 立即下载