Hadoop MapReduce编程 API入门系列之二次排序(十六)

简介:

 

 

复制代码
2016-12-12 17:04:32,012 INFO [org.apache.hadoop.metrics.jvm.JvmMetrics] - Initializing JVM Metrics with processName=JobTracker, sessionId=
2016-12-12 17:04:33,056 WARN [org.apache.hadoop.mapreduce.JobSubmitter] - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2016-12-12 17:04:33,059 WARN [org.apache.hadoop.mapreduce.JobSubmitter] - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2016-12-12 17:04:33,083 INFO [org.apache.hadoop.mapreduce.lib.input.FileInputFormat] - Total input paths to process : 1
2016-12-12 17:04:33,161 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - number of splits:1
2016-12-12 17:04:33,562 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - Submitting tokens for job: job_local1173601391_0001
2016-12-12 17:04:34,242 INFO [org.apache.hadoop.mapreduce.Job] - The url to track the job: http://localhost:8080/
2016-12-12 17:04:34,244 INFO [org.apache.hadoop.mapreduce.Job] - Running job: job_local1173601391_0001
2016-12-12 17:04:34,247 INFO [org.apache.hadoop.mapred.LocalJobRunner] - OutputCommitter set in config null
2016-12-12 17:04:34,264 INFO [org.apache.hadoop.mapred.LocalJobRunner] - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2016-12-12 17:04:34,371 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Waiting for map tasks
2016-12-12 17:04:34,373 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Starting task: attempt_local1173601391_0001_m_000000_0
2016-12-12 17:04:34,439 INFO [org.apache.hadoop.yarn.util.ProcfsBasedProcessTree] - ProcfsBasedProcessTree currently is supported only on Linux.
2016-12-12 17:04:34,667 INFO [org.apache.hadoop.mapred.Task] - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@65bb90dc
2016-12-12 17:04:34,676 INFO [org.apache.hadoop.mapred.MapTask] - Processing split: file:/D:/Code/MyEclipseJavaCode/myMapReduce/data/secondarySort/secondarySort.txt:0+120
2016-12-12 17:04:34,762 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 0 kvi 26214396(104857584)
2016-12-12 17:04:34,763 INFO [org.apache.hadoop.mapred.MapTask] - mapreduce.task.io.sort.mb: 100
2016-12-12 17:04:34,763 INFO [org.apache.hadoop.mapred.MapTask] - soft limit at 83886080
2016-12-12 17:04:34,763 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 0; bufvoid = 104857600
2016-12-12 17:04:34,763 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 26214396; length = 6553600
2016-12-12 17:04:34,771 INFO [org.apache.hadoop.mapred.MapTask] - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2016-12-12 17:04:34,789 INFO [org.apache.hadoop.mapred.LocalJobRunner] - 
2016-12-12 17:04:34,789 INFO [org.apache.hadoop.mapred.MapTask] - Starting flush of map output
2016-12-12 17:04:34,789 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2016-12-12 17:04:34,789 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 0; bufend = 216; bufvoid = 104857600
2016-12-12 17:04:34,790 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 26214396(104857584); kvend = 26214328(104857312); length = 69/6553600
2016-12-12 17:04:34,809 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 0
2016-12-12 17:04:34,818 INFO [org.apache.hadoop.mapred.Task] - Task:attempt_local1173601391_0001_m_000000_0 is done. And is in the process of committing
2016-12-12 17:04:34,838 INFO [org.apache.hadoop.mapred.LocalJobRunner] - map
2016-12-12 17:04:34,838 INFO [org.apache.hadoop.mapred.Task] - Task 'attempt_local1173601391_0001_m_000000_0' done.
2016-12-12 17:04:34,838 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Finishing task: attempt_local1173601391_0001_m_000000_0
2016-12-12 17:04:34,839 INFO [org.apache.hadoop.mapred.LocalJobRunner] - map task executor complete.
2016-12-12 17:04:34,846 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Waiting for reduce tasks
2016-12-12 17:04:34,846 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Starting task: attempt_local1173601391_0001_r_000000_0
2016-12-12 17:04:34,864 INFO [org.apache.hadoop.yarn.util.ProcfsBasedProcessTree] - ProcfsBasedProcessTree currently is supported only on Linux.
2016-12-12 17:04:34,950 INFO [org.apache.hadoop.mapred.Task] - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@59b59452
2016-12-12 17:04:34,954 INFO [org.apache.hadoop.mapred.ReduceTask] - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@73d5cf65
2016-12-12 17:04:34,974 INFO [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] - MergerManager: memoryLimit=1327077760, maxSingleShuffleLimit=331769440, mergeThreshold=875871360, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2016-12-12 17:04:35,011 INFO [org.apache.hadoop.mapreduce.task.reduce.EventFetcher] - attempt_local1173601391_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2016-12-12 17:04:35,048 INFO [org.apache.hadoop.mapreduce.task.reduce.LocalFetcher] - localfetcher#1 about to shuffle output of map attempt_local1173601391_0001_m_000000_0 decomp: 254 len: 258 to MEMORY
2016-12-12 17:04:35,060 INFO [org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput] - Read 254 bytes from map-output for attempt_local1173601391_0001_m_000000_0
2016-12-12 17:04:35,123 INFO [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] - closeInMemoryFile -> map-output of size: 254, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->254
2016-12-12 17:04:35,125 INFO [org.apache.hadoop.mapreduce.task.reduce.EventFetcher] - EventFetcher is interrupted.. Returning
2016-12-12 17:04:35,126 INFO [org.apache.hadoop.mapred.LocalJobRunner] - 1 / 1 copied.
2016-12-12 17:04:35,126 INFO [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2016-12-12 17:04:35,136 INFO [org.apache.hadoop.mapred.Merger] - Merging 1 sorted segments
2016-12-12 17:04:35,137 INFO [org.apache.hadoop.mapred.Merger] - Down to the last merge-pass, with 1 segments left of total size: 244 bytes
2016-12-12 17:04:35,139 INFO [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] - Merged 1 segments, 254 bytes to disk to satisfy reduce memory limit
2016-12-12 17:04:35,139 INFO [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] - Merging 1 files, 258 bytes from disk
2016-12-12 17:04:35,140 INFO [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] - Merging 0 segments, 0 bytes from memory into reduce
2016-12-12 17:04:35,141 INFO [org.apache.hadoop.mapred.Merger] - Merging 1 sorted segments
2016-12-12 17:04:35,142 INFO [org.apache.hadoop.mapred.Merger] - Down to the last merge-pass, with 1 segments left of total size: 244 bytes
2016-12-12 17:04:35,143 INFO [org.apache.hadoop.mapred.LocalJobRunner] - 1 / 1 copied.
2016-12-12 17:04:35,150 INFO [org.apache.hadoop.conf.Configuration.deprecation] - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2016-12-12 17:04:35,158 INFO [org.apache.hadoop.mapred.Task] - Task:attempt_local1173601391_0001_r_000000_0 is done. And is in the process of committing
2016-12-12 17:04:35,160 INFO [org.apache.hadoop.mapred.LocalJobRunner] - 1 / 1 copied.
2016-12-12 17:04:35,160 INFO [org.apache.hadoop.mapred.Task] - Task attempt_local1173601391_0001_r_000000_0 is allowed to commit now
2016-12-12 17:04:35,166 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - Saved output of task 'attempt_local1173601391_0001_r_000000_0' to file:/D:/Code/MyEclipseJavaCode/myMapReduce/out/secondarySort/_temporary/0/task_local1173601391_0001_r_000000
2016-12-12 17:04:35,167 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
2016-12-12 17:04:35,167 INFO [org.apache.hadoop.mapred.Task] - Task 'attempt_local1173601391_0001_r_000000_0' done.
2016-12-12 17:04:35,167 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Finishing task: attempt_local1173601391_0001_r_000000_0
2016-12-12 17:04:35,168 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce task executor complete.
2016-12-12 17:04:35,248 INFO [org.apache.hadoop.mapreduce.Job] - Job job_local1173601391_0001 running in uber mode : false
2016-12-12 17:04:35,249 INFO [org.apache.hadoop.mapreduce.Job] - map 100% reduce 100%
2016-12-12 17:04:35,251 INFO [org.apache.hadoop.mapreduce.Job] - Job job_local1173601391_0001 completed successfully
2016-12-12 17:04:35,271 INFO [org.apache.hadoop.mapreduce.Job] - Counters: 33
File System Counters
FILE: Number of bytes read=1186
FILE: Number of bytes written=394623
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=18
Map output records=18
Map output bytes=216
Map output materialized bytes=258
Input split bytes=145
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=258
Reduce input records=18
Reduce output records=18
Spilled Records=36
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=534773760
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters 
Bytes Read=120
File Output Format Counters 
Bytes Written=115
复制代码

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

代码

IntPair.java

复制代码
package zhouls.bigdata.myMapReduce.SecondarySort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;


//第一步:自定义IntPair类,将示例数据中的key/value封装成一个整体作为Key,同时实现 WritableComparable 接口并重写其方法。
/**
* 自己定义的key类应该实现WritableComparable接口
*/
public  class IntPair implements WritableComparable<IntPair>{//类似对应于如TextPair
    int first;//第一个成员变量
    int second;//第二个成员变量
    
    public void set(int left, int right){//赋值
        first = left;
        second = right;
    }
    public int getFirst(){//读值
        return first;
    }
    public int getSecond(){//读值
        return second;
    }
    
    //反序列化,从流中的二进制转换成IntPair
    public void readFields(DataInput in) throws IOException{
        first = in.readInt();
        second = in.readInt();
    }
    
    //序列化,将IntPair转化成使用流传送的二进制
    public void write(DataOutput out) throws IOException{
        out.writeInt(first);
        out.writeInt(second);
    }
    
    //key的比较
    public int compareTo(IntPair o){
        // TODO Auto-generated method stub
        if (first != o.first){
            return first < o.first ? -1 : 1;
        }else if (second != o.second)
        {
            return second < o.second ? -1 : 1;
        }else
        {
            return 0;
        }
    }
    
    @Override
    public int hashCode(){
        return first * 157 + second;
    }
    @Override
    public boolean equals(Object right){
        if (right == null)
            return false;
        if (this == right)
            return true;
        if (right instanceof IntPair){
            IntPair r = (IntPair) right;
            return r.first == first && r.second == second;
        }else{
            return false;
        }
    }
}
复制代码

 

 

 

 

 

 

 

SecondarySort.java

复制代码
package zhouls.bigdata.myMapReduce.SecondarySort;

import zhouls.bigdata.myMapReduce.Join.JoinRecordAndStationName;

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;



/*
SecondarySort内容是
40    20
40    10
40    30
40    5
40    1
30    30
30    20
30    10
30    1
20    20
20    10
20    1
50    50
50    40
50    30 
50    20
50    10
50    1
*/


public class SecondarySort extends Configured implements Tool{
    // 自定义map
    public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>{
        private final IntPair intkey = new IntPair();
        private final IntWritable intvalue = new IntWritable();
        
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            int left = 0;
            int right = 0;
            if (tokenizer.hasMoreTokens()){
                left = Integer.parseInt(tokenizer.nextToken());
                if (tokenizer.hasMoreTokens())
                    right = Integer.parseInt(tokenizer.nextToken());
                intkey.set(left, right);//设为k2
                intvalue.set(right);//设为v2
                context.write(intkey,intvalue);//写入intkeyk2,intvalue是v2
//                context.write(new IntPair(intkey),new IntWritable(intvalue));等价               
                
            }
        }
    }
    
    
  //第二步:自定义分区函数类FirstPartitioner,根据 IntPair 中的first实现分区。
    /**
    * 分区函数类。根据first确定Partition。
    */
    public static class FirstPartitioner extends Partitioner< IntPair, IntWritable>{
            @Override
            public int getPartition(IntPair key, IntWritable value,int numPartitions){     
                return Math.abs(key.getFirst() * 127) % numPartitions;
            }
    }
    
    
  //第三步:自定义 SortComparator 实现 IntPair 类中的first和second排序。本课程中没有使用这种方法,而是使用 IntPair 中的compareTo()方法实现的。
  //第四步:自定义 GroupingComparator 类,实现分区内的数据分组。
  /**
  *继承WritableComparator
  */
  public static class GroupingComparator extends WritableComparator{
          protected GroupingComparator(){
              super(IntPair.class, true);
          }
          @Override
          //Compare two WritableComparables.
          public int compare(WritableComparable w1, WritableComparable w2){
              IntPair ip1 = (IntPair) w1;
              IntPair ip2 = (IntPair) w2;
              int l = ip1.getFirst();
              int r = ip2.getFirst();
              return l == r ? 0 : (l < r ? -1 : 1);
          }
  }

  
    // 自定义reduce
    public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>{
        private final Text left = new Text();      
        public void reduce(IntPair key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException{ 
            left.set(Integer.toString(key.getFirst()));//设为k3
            for (IntWritable val : values){
                context.write(left, val);//写入left是k3,val是v3
//                context.write(new Text(left),new IntWritable(val));等价               
            }
        }
    }
   
    
    public int run(String[] args)throws Exception{
         // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        Path mypath=new Path(args[1]);
        FileSystem hdfs = mypath.getFileSystem(conf);
        if (hdfs.isDirectory(mypath)){
            hdfs.delete(mypath, true);
        }
        
        Job job = new Job(conf, "secondarysort");
        job.setJarByClass(SecondarySort.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));//输入路径
        FileOutputFormat.setOutputPath(job, new Path(args[1]));//输出路径

        job.setMapperClass(Map.class);// Mapper
        job.setReducerClass(Reduce.class);// Reducer
        //job.setNumReducerTask(3);
        
        job.setPartitionerClass(FirstPartitioner.class);// 分区函数
        //job.setSortComparatorClass(KeyComparator.Class);//本课程并没有自定义SortComparator,而是使用IntPair自带的排序
        job.setGroupingComparatorClass(GroupingComparator.class);// 分组函数


        job.setMapOutputKeyClass(IntPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
       
       
        return job.waitForCompletion(true) ? 0 : 1;
    }
    
    /**
     * @param args
     * @throws Exception 
     */
    public static void main(String[] args) throws Exception{
        // TODO Auto-generated method stub
        
//        String[] args0={"hdfs://HadoopMaster:9000/secondarySort/secondarySort.txt",
//                        "hdfs://HadoopMaster:9000/out/secondarySort"};
    
        String[] args0={"./data/secondarySort/secondarySort.txt",
                         "./out/secondarySort"};
        
        
        int ec =ToolRunner.run(new Configuration(),new SecondarySort(),args0);
        System.exit(ec);
    }
}
复制代码

 


本文转自大数据躺过的坑博客园博客,原文链接:http://www.cnblogs.com/zlslch/p/6165256.html,如需转载请自行联系原作者

相关文章
|
4月前
|
开发框架 .NET API
RESTful API 设计与实现:C# 开发者的一分钟入门
【10月更文挑战第5天】本文从零开始,介绍了如何使用 C# 和 ASP.NET Core 设计并实现一个简单的 RESTful API。首先解释了 RESTful API 的概念及其核心原则,然后详细说明了设计 RESTful API 的关键步骤,包括资源识别、URI 设计、HTTP 方法选择、状态码使用和错误处理。最后,通过一个用户管理 API 的示例,演示了如何创建项目、定义模型、实现控制器及运行测试,帮助读者掌握 RESTful API 的开发技巧。
131 7
|
6月前
|
安全 Java API
告别繁琐编码,拥抱Java 8新特性:Stream API与Optional类助你高效编程,成就卓越开发者!
【8月更文挑战第29天】Java 8为开发者引入了多项新特性,其中Stream API和Optional类尤其值得关注。Stream API对集合操作进行了高级抽象,支持声明式的数据处理,避免了显式循环代码的编写;而Optional类则作为非空值的容器,有效减少了空指针异常的风险。通过几个实战示例,我们展示了如何利用Stream API进行过滤与转换操作,以及如何借助Optional类安全地处理可能为null的数据,从而使代码更加简洁和健壮。
174 0
|
4月前
|
API 数据安全/隐私保护 开发者
淘宝 API:关键词搜商品列表接口,助力商家按价格销量排序分析数据
此接口用于通过关键词搜索淘宝商品列表。首先需在淘宝开放平台注册并创建应用获取API权限,之后利用应用密钥和访问令牌调用接口。请求参数包括关键词、页码、每页数量、排序方式及价格区间等。返回结果含总商品数量及具体商品详情。使用时需注意签名验证及官方文档更新。
|
4月前
|
机器学习/深度学习 算法 API
机器学习入门(五):KNN概述 | K 近邻算法 API,K值选择问题
机器学习入门(五):KNN概述 | K 近邻算法 API,K值选择问题
|
5月前
|
网络协议 API Windows
MASM32编程调用 API函数RtlIpv6AddressToString,windows 10 容易,Windows 7 折腾
MASM32编程调用 API函数RtlIpv6AddressToString,windows 10 容易,Windows 7 折腾
|
4月前
|
IDE API 定位技术
Python--API编程:IP地址翻译成实际的物理地址
Python--API编程:IP地址翻译成实际的物理地址
93 0
|
6月前
|
开发者
告别繁琐代码,JSF标签库带你走进高效开发的新时代!
【8月更文挑战第31天】JSF(JavaServer Faces)标准标签库为页面开发提供了大量组件标签,如`&lt;h:inputText&gt;`、`&lt;h:dataTable&gt;`等,简化代码、提升效率并确保稳定性。本文通过示例展示如何使用这些标签实现常见功能,如创建登录表单和展示数据列表,帮助开发者更高效地进行Web应用开发。
66 0
|
6月前
|
前端开发 API 开发者
【React状态管理新思路】Context API入门:从零开始摆脱props钻孔的优雅之道,全面解析与实战案例分享!
【8月更文挑战第31天】React 的 Context API 有效解决了多级组件间状态传递的 &quot;props 钻孔&quot; 问题,使代码更简洁、易维护。本文通过电子商务网站登录状态管理案例,详细介绍了 Context API 的使用方法,包括创建、提供及消费 Context,以及处理多个 Context 的场景,适合各水平开发者学习与应用,提高开发效率和代码质量。
61 0
|
6月前
|
JSON API 数据库
神秘编程力量来袭!Rails 究竟隐藏着怎样的魔力,能构建出强大的 RESTful API?快来一探究竟!
【8月更文挑战第31天】《构建 RESTful API:使用 Rails 进行服务端开发》介绍了如何利用 Ruby on Rails 框架高效构建可扩展的 RESTful API。Rails 采用“约定优于配置”,简化开发流程,通过示例展示了路由定义、控制器设计及模型层交互等内容,帮助开发者快速搭建稳定可靠的服务端。无论小型项目还是大型应用,Rails 均能提供强大支持,提升开发效率。
52 0
|
6月前
|
分布式计算 Hadoop Java
面向开发者的Hadoop编程指南
【8月更文第28天】Hadoop是一个开源软件框架,用于分布式存储和处理大规模数据集。它由Hadoop分布式文件系统(HDFS)和MapReduce编程模型组成。本指南旨在帮助初学者和中级开发者快速掌握Hadoop的基本概念和编程技巧,并通过一些简单的示例来加深理解。
301 0

热门文章

最新文章