34 MAPREDUCE自定义inputFormat-阿里云开发者社区

34 MAPREDUCE自定义inputFormat

2023-11-01 59

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 34 MAPREDUCE自定义inputFormat

需求

无论hdfs还是mapreduce，对于小文件都有损效率，实践中，又难免面临处理大量小文件的场景，此时，就需要有相应解决方案。

分析

小文件的优化无非以下几种方式：

在数据采集的时候，就将小文件或小批数据合成大文件再上传HDFS。
在业务处理之前，在HDFS上使用mapreduce程序对小文件进行合并。
在mapreduce处理时，可采用combineInputFormat提高效率。

实现

本节实现的是上述第二种方式

程序的核心机制：

自定义一个InputFormat
改写RecordReader，实现一次读取一个完整文件封装为KV
在输出时使用SequenceFileOutPutFormat输出合并文件

代码如下：

自定义InputFromat

public class WholeFileInputFormat extends
    FileInputFormat<NullWritable, BytesWritable> {
  //设置每个小文件不可分片,保证一个小文件生成一个key-value键值对
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException,
      InterruptedException {
    WholeFileRecordReader reader = new WholeFileRecordReader();
    reader.initialize(split, context);
    return reader;
  }
}

自定义RecordReader

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
  private FileSplit fileSplit;
  private Configuration conf;
  private BytesWritable value = new BytesWritable();
  private boolean processed = false;
  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }
  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }
    return false;
  }
  @Override
  public NullWritable getCurrentKey() throws IOException,
      InterruptedException {
    return NullWritable.get();
  }
  @Override
  public BytesWritable getCurrentValue() throws IOException,
      InterruptedException {
    return value;
  }
  @Override
  public float getProgress() throws IOException {
    return processed ? 1.0f : 0.0f;
  }
  @Override
  public void close() throws IOException {
    // do nothing
  }
}

定义mapreduce处理流程

public class SmallFilesToSequenceFileConverter extends Configured implements
    Tool {
  static class SequenceFileMapper extends
      Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    private Text filenameKey;
    @Override
    protected void setup(Context context) throws IOException,
        InterruptedException {
      InputSplit split = context.getInputSplit();
      Path path = ((FileSplit) split).getPath();
      filenameKey = new Text(path.toString());
    }
    @Override
    protected void map(NullWritable key, BytesWritable value,
        Context context) throws IOException, InterruptedException {
      context.write(filenameKey, value);
    }
  }
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.setProperty("HADOOP_USER_NAME", "hdfs");
    String[] otherArgs = new GenericOptionsParser(conf, args)
        .getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: combinefiles <in> <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf,"combine small files to sequencefile");
//    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    job.setMapperClass(SequenceFileMapper.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(),
        args);
    System.exit(exitCode);
  }
}

34 MAPREDUCE自定义inputFormat

需求

分析

实现

热门文章

最新文章

相关课程

相关电子书

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

34 MAPREDUCE自定义inputFormat

需求

分析

实现

热门文章

最新文章

相关课程

相关电子书