【Big Data】Hadoop - MapReduce Classic Exercises in Practice (Word Count + Grade Sort + Inverted Index)

Summary: 🍊 Three classic MapReduce exercises worked through in practice 🍊 Code modeled on the official Hadoop examples, keeping a fairly clean style 🍊 Detailed explanations throughout.


🍊 This article works through three classic MapReduce exercises

🍊 The code is modeled on the official Hadoop examples, keeping the style fairly clean

🍊 Every step is explained in detail

一、Introduction

MapReduce is a programming framework for distributed computation. Its core job is to combine the business logic you write with its own built-in default code into a complete distributed program that runs concurrently on a Hadoop cluster.

The overall flow looks like this (a minimal local sketch after the list makes it concrete):

Map: read the data and do simple per-record processing
Shuffle: group and consolidate the Map output by key
Reduce: compute the final result from the shuffled data
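As a rough mental model before touching any Hadoop code, here is a minimal local sketch (plain Java, a made-up two-line input, no Hadoop involved) of what the three stages do for word counting: map emits <word, 1> pairs, shuffle groups them by word, and reduce sums each group.

import java.util.*;

public class MiniMapReduceSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello mapreduce"); // assumed sample input

        // Map: emit a <word, 1> pair for every word of every line
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle: group the 1s by word (Hadoop also sorts the keys, which TreeMap mimics here)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Reduce: sum each group -> hadoop=1, hello=2, mapreduce=1
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}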


二、WordCount

2.1 Problem

Count how many times each word appears in a file: the input is plain text, and the output is one <word, count> pair per line.
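Since the original screenshot is not reproduced here, a tiny assumed example: an input file containing

hello hadoop
hello mapreduce

should produce output along the lines of

hadoop	1
hello	2
mapreduce	1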


2.2 Analysis

Counting words with WordCount is the most basic exercise. Beyond simply meeting the requirement, we want the code to be as clean as possible, so it mainly follows the official WordCount example shipped with Hadoop.

The data flows like this: the mapper emits a <word, 1> pair for every word, the shuffle groups the pairs by word, and the reducer sums the counts of each group.


2.3 Mapper

The thing to pay attention to in the Mapper is its signature: Mapper<LongWritable, Text, Text, IntWritable>.

<LongWritable, Text> is the input and <Text, IntWritable> is the output. The input type may look puzzling at first: Text holds one full line of the file, and LongWritable is the byte offset of that line's first character within the whole file. Printing the key and value on every map() call (the two System.out.println calls below) makes this concrete.
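For instance, with the same assumed two-line file as above, the <key, value> pairs handed to map() would look like this (the second line's offset, 13, is the byte length of "hello hadoop" plus its newline):

key = 0     value = hello hadoop
key = 13    value = hello mapreduce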


package com.bcn.mapreduce.wordcount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
 * Input:  <byte offset of the line within the file, the line of text>
 * Output: <word, 1> for every occurrence of a word
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Create just one Text and one IntWritable object and reuse them to avoid unnecessary allocations
    Text outK = new Text();
    IntWritable outV = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get one line
        String line = value.toString();
        System.out.println(line);
        System.out.println(key);
        // split the word by space
        String[] words = line.split(" ");
        // output
        for (String word : words) {
            outK.set(word);
            context.write(outK, outV);
        }
    }
}


2.4 Reducer

Once again the thing to watch is the data flow: Reducer<Text, IntWritable, Text, IntWritable>. <Text, IntWritable> is the input, which matches the Mapper's output exactly.

A common point of confusion: the declared input is <Text, IntWritable>, yet the reduce method we override receives <Text key, Iterable<IntWritable> values>. The reason is the Shuffle phase that runs invisibly in between: it groups all values that share a key, so reduce is called once per key with all of that key's values.
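Continuing the assumed sample, this is the hidden transformation: the mapper emitted individual pairs, and after the shuffle the reducer receives each key once, together with all of its values:

mapper output:   (hello, 1), (hadoop, 1), (hello, 1), (mapreduce, 1)
reducer input:   hadoop    -> (1)
                 hello     -> (1, 1)
                 mapreduce -> (1)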

package com.bcn.mapreduce.wordcount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer <Text, IntWritable,Text,IntWritable> {
    int sum;
    IntWritable outV=new IntWritable();
    @Override
    protected void reduce(Text key,Iterable<IntWritable> values, Context context)throws IOException, InterruptedException {
        // Sum up the 1s for this word
        sum = 0;
        // e.g. for the key "apple" the values iterable is (1, 1, 1)
        for (IntWritable count:values){
            sum += count.get();
        }
        //Output
        outV.set(sum);
        context.write(key,outV);
    }
}


2.5 Driver

Finally we write the driver class, i.e. the main method, which wires up the seven standard steps; once they are configured the whole MapReduce job can run.

package com.bcn.mapreduce.wordcount;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1.Get the config and job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2.Connect Driver with jar
        job.setJarByClass(WordCountDriver.class);
        // 3.Connect with Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // 4.Set the class of Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5.Set the class of final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6.Set the input and output path
        FileInputFormat.setInputPaths(job, new Path("E:\\Hadoop and Spark\\data\\word.txt"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\Hadoop and Spark\\output\\wordCount"));
        // 7.Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
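The hard-coded Windows paths mean this driver is meant to be run directly from the IDE in local mode. To run the same job on a real cluster you would normally read the paths from args, package the classes into a jar, and submit it with the hadoop command. A hedged example of the submission (the jar name and HDFS paths are assumptions; as written, the driver ignores the two path arguments, so you would also swap the hard-coded Path objects for new Path(args[0]) and new Path(args[1])):

hadoop jar wordcount.jar com.bcn.mapreduce.wordcount.WordCountDriver /input/word.txt /output/wordCount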


三、Grade Sort

3.1 Problem

Sort the student grade table by total score from high to low; when two totals are equal, sort by the math score, also from high to low.
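The screenshot of the table is not reproduced here; judging from the Mapper below, each input line is assumed to be comma-separated as id,chinese,math,english, for example:

1001,85,90,78
1002,92,95,88
1003,75,60,96

The expected output is these records re-ordered by total score, with the math score breaking ties, both descending.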


3.2 Analysis

Compared with WordCount, the extra twist here is the sorting requirement. To sort, we model the record as its own class and implement readFields(), write(), toString() and compareTo(); the first three are largely boilerplate, while compareTo() is where the business rule goes.

3.3 Entity

About compareTo(): Java's natural ordering is ascending, so prefixing the comparison result with a minus sign turns it into descending order.
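A tiny sanity check of that sign flip (the numbers are made up):

public class CompareDemo {
    public static void main(String[] args) {
        // Double.compare alone gives ascending order: a negative result means 250 sorts before 280
        System.out.println(Double.compare(250.0, 280.0));   // -1
        // Negating the result flips the order, so the higher total now comes first
        System.out.println(-Double.compare(250.0, 280.0));  // 1
    }
}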

package com.bcn.mapreduce.gradesort;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
@Data
@AllArgsConstructor
@NoArgsConstructor
public class Student implements WritableComparable<Student> {
    private long id;
    private double chinese;
    private double math;
    private double english;
    private double total;
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Read the fields in exactly the same order as write() writes them
        this.id = dataInput.readLong();
        this.chinese = dataInput.readDouble();
        this.math = dataInput.readDouble();
        this.english = dataInput.readDouble();
        this.total = dataInput.readDouble();
    }
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(this.id);
        dataOutput.writeDouble(this.chinese);
        dataOutput.writeDouble(this.math);
        dataOutput.writeDouble(this.english);
        dataOutput.writeDouble(this.total);
    }
    @Override
    public String toString() {
        return this.id + "  , Chinese: " + this.chinese + ", Math: " + this.math + ", English: " + this.english + ", Total: " + this.total;
    }
    @Override
    public int compareTo(Student o) {
        int r1=-Double.compare(this.total, o.total);
        if(r1==0){
            return -Double.compare(this.math, o.math);
        }else {
            return r1;
        }
    }
}


3.4 Mapper

Note that the Mapper's output is <Student, Text>. The Reducer only really needs the Student object, so the Text value (here simply the original input line) is just along for the ride.

package com.bcn.mapreduce.gradesort;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class GradeSortMapper extends Mapper<LongWritable, Text,Student,Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //  Get data
        String[] data=value.toString().split(",");
        long id= Long.parseLong(data[0]);
        double chinese= Double.parseDouble(data[1]);
        double math=Double.parseDouble(data[2]);
        double english=Double.parseDouble(data[3]);
        double total =chinese+math+english;
        Student s=new Student(id,chinese,math,english,total);
        context.write(s,value);
    }
}


3.5 Reducer

package com.bcn.mapreduce.gradesort;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class GradeSortReducer extends Reducer<Student, Text, Text, NullWritable> {
    public static int count=0;
    public Text text=new Text();
    @Override
    protected void reduce(Student s, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Keys arrive sorted by Student.compareTo (highest total first), so the first 10 records written are the top 10
        for (Text t : values) {
            if (GradeSortReducer.count < 10) {
                count++;
                text.set(s.toString());
                context.write(text,NullWritable.get());
            }
        }
    }
}


3.6 Driver

package com.bcn.mapreduce.gradesort;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class GradeSortDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1.Get the config and job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2.Connect Driver with jar
        job.setJarByClass(GradeSortDriver.class);
        // 3.Connect with Mapper and Reducer
        job.setMapperClass(GradeSortMapper.class);
        job.setReducerClass(GradeSortReducer.class);
        // 4.Set the class of Mapper output
        job.setMapOutputKeyClass(Student.class);
        job.setMapOutputValueClass(Text.class);
        // 5.Set the class of final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // 6.Set the input and output path
        FileInputFormat.setInputPaths(job, new Path("E:\\Hadoop and Spark\\data\\top10input.txt"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\Hadoop and Spark\\output\\top10input"));
        // 7.Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}


四、Inverted Index

4.1 Problem

Given a large amount of text (documents, web pages), build a search index. The input here consists of three documents.
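The three document screenshots are not reproduced here, so assume a miniature version like this:

doc1.txt: hello hadoop
doc2.txt: hello spark
doc3.txt: hadoop spark spark

The inverted index we want maps every word to the documents it appears in, together with how many times it appears there, e.g. spark -> doc2.txt once and doc3.txt twice.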


4.2 Analysis

The difference from WordCount is that besides counting each word, we also need to record which document each occurrence came from. A single MapReduce job has one Map phase and one Reduce phase, so for a more involved task like this the simplest, brute-force approach is to chain two MapReduce jobs (it can also be squeezed into a single job, but two jobs back to back are more straightforward): the first job counts how often each word appears in each document, and the second regroups those counts by word to show which documents every word appears in.
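Concretely, with the assumed miniature documents above, the first job's output (which becomes the second job's input) pairs a word--document key with a count, separated by a tab, and the second job regroups the entries by word, roughly like this:

after job 1:
hadoop--doc1.txt	1
hadoop--doc3.txt	1
hello--doc1.txt	1
hello--doc2.txt	1
spark--doc2.txt	1
spark--doc3.txt	2

after job 2:
hadoop	  doc1.txt -->1 doc3.txt -->1
hello	  doc1.txt -->1 doc2.txt -->1
spark	  doc2.txt -->1 doc3.txt -->2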

4.3 Mapper

Mapper1

package com.bcn.mapreduce.document;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
import java.util.StringTokenizer;
public class Document1Mapper extends Mapper<Object, Text, Text, Text> {
    private FileSplit filesplit;
    private Text word = new Text();
    private Text temp = new Text("1");
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // The name of the file this split belongs to tells us which document the line came from
        filesplit = (FileSplit) context.getInputSplit();
        String fileName = filesplit.getPath().getName();
        // Emit <word--fileName, "1"> for every token in the line
        StringTokenizer itr = new StringTokenizer(value.toString());
        while(itr.hasMoreTokens()){
            word.set(itr.nextToken() + "--" + fileName);
            context.write(word, temp);
        }
    }
}


Mapper2

package com.bcn.mapreduce.document;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
import java.util.StringTokenizer;
public class Document2Mapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text Word = new Text();
    private Text Filename = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Job 1 separates its key and value with a tab (TextOutputFormat's default separator),
        // so split on any whitespace to recover the "word--document" key and its count
        String[] all = line.split("\\s+");
        String[] wandc = all[0].split("--");
        String word = wandc[0];
        String document = wandc[1];
        String num = all[1];
        Word.set(word);
        Filename.set(document + " -->" + num + " ");
        context.write(Word, Filename);
    }
}


4.4 Reducer

Reducer1

package com.bcn.mapreduce.document;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class Document1Reducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Every value emitted by Document1Mapper is the string "1", so counting the values
        // gives the number of occurrences of this <word--document> key
        int count = 0;
        for (Text ignored : values) {
            count++;
        }
        context.write(key, new Text(String.valueOf(count)));
    }
}


Reducer2

package com.bcn.mapreduce.document;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class Document2Reducer extends Reducer<Text, Text, Text, Text> {
    public Text text=new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        StringBuilder filename= new StringBuilder("  ");
        for(Text t :values){
            filename.append(t);
        }
        text.set(filename.toString());
        context.write(key,text);
    }
}


4.5 Driver

Driver1

package com.bcn.mapreduce.document;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class Document1Driver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1.Get the config and job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2.Connect Driver with jar
        job.setJarByClass(Document1Driver.class);
        // 3.Connect with Mapper and Reducer
        job.setMapperClass(Document1Mapper.class);
        job.setReducerClass(Document1Reducer.class);
        // 4.Set the class of Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // 5.Set the class of final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // 6.Set the input and output path
        FileInputFormat.setInputPaths(job, new Path("E:\\Hadoop and Spark\\data\\doc"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\Hadoop and Spark\\output\\doc"));
        // 7.Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}


Driver2

package com.bcn.mapreduce.document;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class Document2Driver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1.Get the config and job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2.Connect Driver with jar
        job.setJarByClass(Document2Driver.class);
        // 3.Connect with Mapper and Reducer
        job.setMapperClass(Document2Mapper.class);
        job.setReducerClass(Document2Reducer.class);
        // 4.Set the class of Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // 5.Set the class of final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // 6.Set the input and output path
        FileInputFormat.setInputPaths(job, new Path("E:\\Hadoop and Spark\\data\\period1.txt"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\Hadoop and Spark\\output\\doc_result"));
        // 7.Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
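Running the two drivers by hand works, but the jobs can also be chained in a single driver so that job 2 starts automatically once job 1 succeeds. A minimal sketch (the class name, job names and the intermediate directory are assumptions, not part of the original code):

package com.bcn.mapreduce.document;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class DocumentChainDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Path input = new Path("E:\\Hadoop and Spark\\data\\doc");
        Path temp = new Path("E:\\Hadoop and Spark\\output\\doc_temp");      // intermediate directory (assumed)
        Path output = new Path("E:\\Hadoop and Spark\\output\\doc_result");
        // Job 1: count occurrences of each <word--document> pair
        Job job1 = Job.getInstance(conf, "doc-word-count");
        job1.setJarByClass(DocumentChainDriver.class);
        job1.setMapperClass(Document1Mapper.class);
        job1.setReducerClass(Document1Reducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, temp);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }
        // Job 2: regroup the counts by word to build the inverted index
        Job job2 = Job.getInstance(conf, "inverted-index");
        job2.setJarByClass(DocumentChainDriver.class);
        job2.setMapperClass(Document2Mapper.class);
        job2.setReducerClass(Document2Reducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job2, temp);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}

The intermediate directory plays the role that period1.txt plays in the stand-alone Driver2 above.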


