MapReduce Development Tips - Alibaba Cloud Developer Community

2016-03-18

Choosing data types

Custom data types

See: Custom sorting and grouping for submitted Hadoop jobs

MapWritable/SortedMapWritable

Map collections that Hadoop can serialize. They are used much like Java's Map, but they can also be used to pass data between the mapper and the reducer.

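Emitting values of different types from the mapper

A custom data type that extends GenericWritable lets the mapper emit values of several different types.

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// With this data type, the mapper can emit both IntWritable and Text values
public class MultiValueWritable extends GenericWritable {
    // Every type that may be wrapped must be listed here
    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] CLASSES = new Class[] {
        IntWritable.class,
        Text.class
    };

    // No-argument constructor, needed for deserialization
    public MultiValueWritable() {
    }

    public MultiValueWritable(Writable value) {
        set(value);
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }
}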

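In the mapper, context.write can then be called like this:

context.write(key, new MultiValueWritable(new Text("1")));
context.write(key, new MultiValueWritable(new IntWritable(1)));

In the reducer, the actual type of each element of the values iterator can be checked like this:

// value is the current MultiValueWritable taken from the values iterator
Writable inner = value.get();
if (inner instanceof Text) {
    ...
}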
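
To make the idea behind such a wrapper more concrete, here is a minimal sketch in plain Java that does not depend on Hadoop. It assumes the wrapper serializes a small type tag (the position of the class in a fixed list) followed by the payload, so the reader knows which type to rebuild. The class TaggedValueSketch and its methods are hypothetical names made up for this illustration; this is not Hadoop's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a GenericWritable-style wrapper; not Hadoop code.
final class TaggedValueSketch {
    // The index in this array plays the role of the position in getTypes().
    private static final Class<?>[] TYPES = { Integer.class, String.class };

    // Writes one tag byte, then the payload in a type-specific encoding.
    static byte[] encode(Object value) {
        int tag = -1;
        for (int i = 0; i < TYPES.length; i++) {
            if (TYPES[i].isInstance(value)) {
                tag = i;
                break;
            }
        }
        if (tag < 0) {
            throw new IllegalArgumentException("unregistered type: " + value);
        }
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(tag);
            if (tag == 0) {
                out.writeInt((Integer) value);
            } else {
                out.writeUTF((String) value);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the tag byte first and uses it to decide how to read the payload.
    static Object decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int tag = in.readByte();
            if (tag == 0) {
                return in.readInt();
            }
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After decoding, the caller checks the runtime type with instanceof, which mirrors the check in the reducer.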

Choosing a suitable InputFormat/OutputFormat

Nearly every InputFormat has a corresponding OutputFormat.

TextInputFormat

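The default input format. It reads the input line by line: the key is the byte offset of each line and the value is the content of the line.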
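
The key/value pairs described above can be pictured with a small plain-Java sketch. It does not use Hadoop; it assumes UTF-8 text with '\n' line terminators, and LineOffsetSketch is a hypothetical name used only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "key = byte offset, value = line"; not Hadoop code.
final class LineOffsetSketch {
    static Map<Long, String> read(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        int start = 0;    // character index of the current line
        long offset = 0;  // byte offset of the current line
        while (start < content.length()) {
            int end = content.indexOf('\n', start);
            boolean hasNewline = end >= 0;
            if (!hasNewline) {
                end = content.length();
            }
            String line = content.substring(start, end);
            records.put(offset, line);
            // Advance by the encoded length of the line plus its terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + (hasNewline ? 1 : 0);
            start = end + 1;
        }
        return records;
    }
}
```

Because the key is a byte offset rather than a line number, a line that follows multi-byte characters gets a larger key than its character position would suggest.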

NLineInputFormat

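Lets you specify how many lines of the input file go into each split:

// Each split (and therefore each map task) receives 50 lines
NLineInputFormat.setNumLinesPerSplit(job, 50);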
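
As a rough picture of what a lines-per-split setting means, the following plain-Java sketch cuts a list of lines into groups of at most N lines. It is a simplified model under the assumption that each group corresponds to one split; NLineSplitSketch is a hypothetical name and none of this is Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "N lines per split"; not Hadoop code.
final class NLineSplitSketch {
    static List<List<String>> split(List<String> lines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            // The last split may hold fewer lines than the others.
            int end = Math.min(start + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }
}
```

With 120 input lines and 50 lines per split, this model yields three groups of 50, 50 and 20 lines.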

SequenceFileInputFormat

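The input is stored as binary records of roughly the form keylen, key, valuelen, value, which makes it well suited to passing data between chained jobs.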

DBInputFormat

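Reads input from a database. Not yet tried or tested here.

Custom InputFormat

See: A custom split strategy for the many-small-files problem

Handling different types of input at the same time

See: Multiple Mappers and Reducers for multiple inputs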

Choosing a Partitioner

TotalOrderPartitioner

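Produces output that is sorted across all reducers. By default, each reducer sorts only its own input, independently of the others, so there is no global order.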
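
One way to picture how a total order across reducers can come about is range partitioning: if partition i only receives keys that sort before every key in partition i+1, then concatenating the sorted reducer outputs gives a globally sorted result. The plain-Java sketch below assumes the split points are already known and sorted; TotalOrderSketch and partitionFor are hypothetical names, and the sketch is not Hadoop's implementation.

```java
import java.util.Arrays;

// Hypothetical range partitioner over sorted split points; not Hadoop code.
final class TotalOrderSketch {
    // splitPoints must be sorted; k split points define k + 1 partitions.
    static int partitionFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found: a key equal to split point i goes to partition i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = { "g", "n" };
        for (String key : new String[] { "apple", "kiwi", "pear" }) {
            System.out.println(key + " -> partition " + partitionFor(key, splitPoints));
        }
    }
}
```

In this model the quality of the split points decides how evenly the keys are spread over the reducers, which is why they are usually derived from a sample of the input.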

Custom partitioner

See: Custom partitioning and reduction in Hadoop jobs

KeyFieldBasedPartitioner

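Only part of the mapper's output key takes part in the partition computation.
It is used together with these parameters:

map.output.key.field.separator
num.key.fields.for.partition

They specify the field separator and the number of leading key fields that take part in partitioning.

For example, with key="name-price", map.output.key.field.separator="-" and num.key.fields.for.partition=1, only the first field of the key (the name part) takes part in the partition computation.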
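
The effect of partitioning on part of the key can be sketched in plain Java. The sketch assumes that the field count selects fields from the start of the key and that the partition is chosen by hashing those fields; KeyFieldPartitionSketch and partitionFor are hypothetical names, and the hash used here is String.hashCode, not whatever Hadoop computes internally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical model of partitioning on leading key fields; not Hadoop code.
final class KeyFieldPartitionSketch {
    static int partitionFor(String key, String separator, int numFields, int numPartitions) {
        if (numFields < 1 || numPartitions < 1) {
            throw new IllegalArgumentException("numFields and numPartitions must be positive");
        }
        // Quote the separator so characters like "." or "|" are taken literally.
        String[] fields = key.split(Pattern.quote(separator), -1);
        int used = Math.min(numFields, fields.length);
        String partitionKey = String.join(separator, Arrays.copyOfRange(fields, 0, used));
        // Mask the sign bit so the result is never negative.
        return (partitionKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Under these assumptions, "apple-10" and "apple-25" always land in the same partition when one field is used, because only "apple" is hashed.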

Secondary sort

setSortComparatorClass

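The sort comparator is called on the map side to sort the records within each partition, and again on the reduce side after the shuffle.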

setGroupingComparatorClass

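The second stage of ordering: consecutive records that the grouping comparator treats as equal belong to the same group and are placed into the same value iterator.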
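
The interplay of the two comparators can be simulated in plain Java. The sketch below assumes a composite key of (name, price): the sort comparator orders by name and then by price, while the grouping comparator looks at the name only, so each group sees its records already ordered by price. SecondarySortSketch, Pair and sortAndGroup are hypothetical names for this illustration, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of sort comparator + grouping comparator; not Hadoop code.
final class SecondarySortSketch {
    // Composite key: the natural key plus the field that orders the values.
    static final class Pair {
        final String name;
        final int price;

        Pair(String name, int price) {
            this.name = name;
            this.price = price;
        }
    }

    // Plays the role of the class passed to setSortComparatorClass.
    static final Comparator<Pair> SORT =
            Comparator.comparing((Pair p) -> p.name).thenComparingInt(p -> p.price);

    // Plays the role of the class passed to setGroupingComparatorClass.
    static final Comparator<Pair> GROUP = Comparator.comparing((Pair p) -> p.name);

    static List<List<Pair>> sortAndGroup(List<Pair> input) {
        List<Pair> sorted = new ArrayList<>(input);
        sorted.sort(SORT);
        List<List<Pair>> groups = new ArrayList<>();
        Pair previous = null;
        for (Pair current : sorted) {
            // A new group starts whenever the grouping comparator sees a change.
            if (previous == null || GROUP.compare(previous, current) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(current);
            previous = current;
        }
        return groups;
    }
}
```

Each inner list stands for the values handed to one reduce call in this model. The grouping step only compares neighbours, which is why it relies on the sort having run first.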

Using the distributed cache

See: DistributedCache in MapReduce

Author: @小黑
