I have recently been testing importing Hive data into Solr; there is a related project on GitHub:
https://github.com/chimpler/hive-solr
It works by implementing an InputFormat and an OutputFormat, using the mapred API to read data from and write data to Solr.
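On the read side, the idea is a RecordReader that pages results out of Solr and hands them to the map task one row at a time. The sketch below is a simplified, Solr-free stand-in for that read path (the "result set" is an in-memory list, and the class and method names are illustrative, not chimpler's actual API):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the record reader a Solr-backed InputFormat must
// supply: next() fills the value object with one "document" per call and
// returns false once the result set is drained, mirroring the semantics of
// org.apache.hadoop.mapred.RecordReader.next(K, V).
public class SolrLikeRecordReader {
    private final Iterator<Map<String, String>> docs;
    private long pos = 0;

    public SolrLikeRecordReader(List<Map<String, String>> results) {
        this.docs = results.iterator();
    }

    public boolean next(Map<String, String> value) {
        if (!docs.hasNext()) {
            return false;          // input exhausted: map task stops calling us
        }
        value.clear();
        value.putAll(docs.next()); // copy the next document into the reused value
        pos++;
        return true;
    }

    public long getPos() {
        return pos;                // number of records emitted so far
    }
}
```

A real implementation would additionally issue paged Solr queries (the table's `solr.buffer.input.rows` property suggests the buffer size used per fetch), but the read loop above is the contract the map task relies on.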
The structure of the test table:
show create table table_in_solr1;

CREATE EXTERNAL TABLE table_in_solr1(
  id string COMMENT 'from deserializer',
  cookie_id_s string COMMENT 'from deserializer',
  first_url_s string COMMENT 'from deserializer',
  warehouse_s string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'com.chimpler.hive.solr.SolrSerDe'
STORED BY
  'com.chimpler.hive.solr.SolrStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'solr.column.mapping'='id,cookie_id_s,first_url_s,warehouse_s')
LOCATION
  'hdfs://xxxxxx:9000/bip/hive_warehouse/table_in_solr1'
TBLPROPERTIES (
  'solr.url'='http://xxxxxxx:8888/solr/userinfo',
  'transient_lastDdlTime'='1401357066',
  'solr.buffer.input.rows'='10000',
  'solr.buffer.output.rows'='10000')
After loading the data, I ran a count test:
select count(1) from table_in_solr1;

Ended Job = job_1401419652664_0010 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1401419652664_0010_m_000000 (and more) from job job_1401419652664_0010

Task with the most failures(4):
-----
Task ID:
  task_1401419652664_0010_m_000000

URL:
  http://nfzm:8088/taskdetails.jsp?jobid=job_1401419652664_0010&tipid=task_1401419652664_0010_m_000000
-----
Diagnostic Messages for this Task:
Error: java.io.IOException: java.lang.NumberFormatException: For input string: ""
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:243)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:526)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:166)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:407)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
Caused by: java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
	at java.lang.Integer.parseInt(Integer.java:470)
	at java.lang.Integer.parseInt(Integer.java:499)
	at com.chimpler.hive.solr.SolrInputFormat.getReadColumnIDs(SolrInputFormat.java:38)
	at com.chimpler.hive.solr.SolrInputFormat.getRecordReader(SolrInputFormat.java:50)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:240)
	... 9 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   HDFS Read: 0  HDFS Write: 0  FAIL
Total MapReduce CPU Time Spent: 0 msec
count(<column>), however, works fine.
Comparing the explain output, count(<column>) adds a Select Operator that count(1) does not have:
Select Operator
  expressions:
        expr: id
        type: string
  outputColumnNames: id
  Group By Operator
    aggregations:
          expr: count(id)
The job's error message shows the failure comes from a number-format conversion, and the stack trace pinpoints the offending code:
at com.chimpler.hive.solr.SolrInputFormat.getReadColumnIDs(SolrInputFormat.java:38)

That is:

public static final String READ_COLUMN_IDS_CONF_STR = "hive.io.file.readcolumn.ids";
public static final String READ_COLUMN_NAMES_CONF_STR = "hive.io.file.readcolumn.names";
private static final String READ_COLUMN_IDS_CONF_STR_DEFAULT = "";
private static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns";
private static final boolean READ_ALL_COLUMNS_DEFAULT = true;

/**
 * Returns an array of column ids (starting from zero) which is set in the given
 * parameter <tt>conf</tt>.
 */
public static List<Integer> getReadColumnIDs(JobConf conf) {
    String skips = conf.get(READ_COLUMN_IDS_CONF_STR, READ_COLUMN_IDS_CONF_STR_DEFAULT);
    String[] list = skips.split(",");
    List<Integer> result = new ArrayList<Integer>(list.length);
    for (String element : list) {
        // it may contain duplicates, remove duplicates
        Integer toAdd = Integer.parseInt(element);
        if (!result.contains(toAdd)) {
            result.add(toAdd);
        }
    }
    return result;
}
In other words, when data is read through SolrInputFormat in Hive's map phase, hive.io.file.readcolumn.ids defaults to the empty string. If the query references no columns (as count(1) does), splitting that empty string yields a single empty element, and Integer.parseInt(element) then throws.
This can be verified with the following code:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ttt {
    public static void main(String[] args) {
        String skips = "";
        String[] list = skips.split(",");
        System.out.println(list);
        List<Integer> result = new ArrayList<Integer>(list.length);
        for (String element : list) {
            try {
                Integer toAdd = Integer.parseInt(element);
                if (!result.contains(toAdd)) {
                    result.add(toAdd);
                }
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }
}
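Given that root cause, a defensive version of getReadColumnIDs that treats the empty default as "no projected columns" avoids the crash. The following is a sketch of one possible patch, not chimpler's actual fix; the class name ReadColumnIds and the standalone parse helper are mine for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class ReadColumnIds {
    // Parse a comma-separated column-id string the way SolrInputFormat does,
    // but tolerate the empty default: "" means Hive projected no columns
    // (the count(1) case), so return an empty list instead of crashing.
    static List<Integer> parse(String ids) {
        List<Integer> result = new ArrayList<Integer>();
        if (ids == null || ids.isEmpty()) {
            return result;
        }
        for (String element : ids.split(",")) {
            Integer toAdd = Integer.valueOf(element.trim());
            if (!result.contains(toAdd)) {   // drop duplicate ids
                result.add(toAdd);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse(""));    // -> [] : count(1), no projected columns
        System.out.println(parse("0,2")); // -> [0, 2] : e.g. count(id) projects columns
    }
}
```

In the real getReadColumnIDs the ids string would come from conf.get(READ_COLUMN_IDS_CONF_STR, ...) as before; only the empty-string guard changes.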
This article is reposted from 菜菜光's 51CTO blog; original link: http://blog.51cto.com/caiguangguang/1422733. To reprint, please contact the original author.