WordCount is a simple application: it reads a text file and counts how often each word occurs. The input is a text file, and the output is also a text file in which each line contains a word and the number of times it appears, separated by a tab. It is the introductory Map/Reduce programming example; you could call it the Map/Reduce version of Hello, World.
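For example, the output file contains lines of the following form (the words and counts here are made up for illustration):
hello	2
mapreduce	1
world	3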
First find any English text file, rename it a01.dat, and upload it to DFS with Upload files to DFS.
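If you prefer the command line to the Eclipse plugin, the HDFS shell does the same upload. This is only a sketch; it assumes the NameNode address used in the code below (hdfs://localhost:9000) is the default file system and that a01.dat is in the current local directory:
hadoop fs -put a01.dat /a01.dat    # copy the local file to the root of DFS
hadoop fs -ls /                    # check that a01.dat is now there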
In the New Project wizard, create a Map/Reduce project. A Map/Reduce project contains three main files: a Map file, a Reduce file, and a main (driver) file. The source code is as follows:
Map.java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each value is one line of the input file; break it into words
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Emit (word, 1) for every word in the line
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Reduce.java
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s emitted by the mappers for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
WordCount.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Input file and output directory on DFS; the output directory must not already exist
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/a01.dat"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/output"));
        job.waitForCompletion(true);
    }
}
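The driver above mirrors the original listing. Two lines that are not in it but are commonly added to the standard WordCount driver are sketched below: setJarByClass helps Hadoop locate the job jar when the job is submitted as a packaged jar rather than from the Eclipse plugin, and reusing the reducer as a combiner pre-sums the counts on the map side:
        job.setJarByClass(WordCount.class);  // tell Hadoop which jar contains the job classes
        job.setCombinerClass(Reduce.class);  // pre-aggregate (word, 1) pairs inside each map task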
Select Run As - Run on Hadoop.
The result is written to the output directory and can be viewed through http://localhost:50070/.
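The output can also be read from the command line; with this API the reducer output normally ends up in a file named part-r-00000 (the exact name is assumed here) inside the output directory:
hadoop fs -ls /output                  # list the job output files
hadoop fs -cat /output/part-r-00000    # print the word counts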
The program feeds the text file through the map function, which turns it into a set of (key, value) pairs. The framework then merges the pairs by key into (key, value1, value2, ...), and the reduce function adds the values up, computing how many times each word appears, and writes the new (key, sum) pairs as the output.
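As a small made-up trace, if one input line is "hello world hello", the job proceeds roughly like this:
map:    (hello, 1), (world, 1), (hello, 1)
group:  (hello, [1, 1]), (world, [1])
reduce: (hello, 2), (world, 1)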
I happened to have a mailing list with tens of thousands of e-mail addresses on hand, so I modified the map function to count how heavily each mail provider is used. The modified map is:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the address at '@' and count the domain part
    String[] sarray = value.toString().split("@");
    if (sarray.length < 2) {
        return; // skip lines that are not well-formed addresses
    }
    word.set(sarray[1]);
    context.write(word, one);
}
Running it produced the following result:
126.com 17230
139.com 573
163.com 35928
21cn.com 1372
citiz.net 223
eyou.com 385
foxmail.com 143
gmail.com 2228
hotmail.com 11021
live.cn 437
msn.com 562
qq.com 22185
sina.com 9671
sina.com.cn 540
sogou.com 222
sohu.com 4106
tom.com 2676
vip.163.com 129
vip.qq.com 589
vip.sina.com 355
vip.sohu.com 285
yahoo.cn 14607
yahoo.com 315
yahoo.com.cn 10770
yahoo.com.hk 252
yeah.net 828