Hadoop MapReduce: wordcount (word frequency counting)

1. Create test.log

  [root@sht-sgmhadoopnn-01 mapreduce]# more /tmp/test.log
  1
  2
  3
  a
  b
  a
  v
  a a a
  abc
  我是谁
  %……
  %
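
If you want to reproduce this input file, one way (a sketch; any editor works just as well) is a quoted heredoc, so the shell does not expand anything in the contents:

  # Recreate the test input exactly as shown above
  [root@sht-sgmhadoopnn-01 mapreduce]# cat > /tmp/test.log <<'EOF'
  1
  2
  3
  a
  b
  a
  v
  a a a
  abc
  我是谁
  %……
  %
  EOF
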
2. Create an HDFS directory and upload the file

  [root@sht-sgmhadoopnn-01 ~]# hadoop fs -mkdir /testdir
  16/02/28 19:40:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  [root@sht-sgmhadoopnn-01 ~]# hadoop fs -put /tmp/test.log /testdir/
  16/02/28 19:40:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
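
The NativeCodeLoader warning is harmless here: it only means the native Hadoop libraries are not available for this platform, so the built-in Java implementations are used instead; the commands still succeed. A quick way to confirm the upload (listing output omitted):

  # Sanity check: the file should now be visible in HDFS
  [root@sht-sgmhadoopnn-01 ~]# hadoop fs -ls /testdir
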
3. List the example programs bundled with Hadoop; we will use wordcount

  [root@sht-sgmhadoopnn-01 ~]# cd /hadoop/hadoop-2.7.2/share/hadoop/mapreduce
  [root@sht-sgmhadoopnn-01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.2.jar
  An example program must be given as the first argument.
  Valid program names are:
    aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
    aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
    bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
    dbcount: An example job that count the pageview counts from a database.
    distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
    grep: A map/reduce program that counts the matches of a regex in the input.
    join: A job that effects a join over sorted, equally partitioned datasets
    multifilewc: A job that counts words from several files.
    pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
    pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
    randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    randomwriter: A map/reduce program that writes 10GB of random data per node.
    secondarysort: An example defining a secondary sort to the reduce.
    sort: A map/reduce program that sorts the data written by the random writer.
    sudoku: A sudoku solver.
    teragen: Generate data for the terasort
    terasort: Run the terasort
    teravalidate: Checking results of terasort
    wordcount: A map/reduce program that counts the words in the input files.
    wordmean: A map/reduce program that counts the average length of the words in the input files.
    wordmedian: A map/reduce program that counts the median length of the words in the input files.
    wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
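
Invoking wordcount with no further arguments shows the parameters it expects; in this version it prints roughly the following (a sketch, not captured from this session):

  [root@sht-sgmhadoopnn-01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount
  Usage: wordcount <in> [<in>...] <out>
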
4. Run wordcount

# hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /testdir /out1
#            (examples jar)                      (program)  (input)  (output dir; must not exist yet)
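
MapReduce refuses to write into an existing output directory, so /out1 must not exist before the run. If you repeat the job, remove the old output first (a small sketch):

  # Clear the previous output so the job can create /out1 afresh
  [root@sht-sgmhadoopnn-01 mapreduce]# hadoop fs -rm -r /out1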

  [root@sht-sgmhadoopnn-01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /testdir /out1
  16/02/28 19:40:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  16/02/28 19:40:53 INFO input.FileInputFormat: Total input paths to process : 1
  16/02/28 19:40:53 INFO mapreduce.JobSubmitter: number of splits:1
  16/02/28 19:40:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456590271264_0002
  16/02/28 19:40:54 INFO impl.YarnClientImpl: Submitted application application_1456590271264_0002
  16/02/28 19:40:54 INFO mapreduce.Job: The url to track the job: http://sht-sgmhadoopnn-01:8088/proxy/application_1456590271264_0002/
  16/02/28 19:40:54 INFO mapreduce.Job: Running job: job_1456590271264_0002
  16/02/28 19:41:04 INFO mapreduce.Job: Job job_1456590271264_0002 running in uber mode : false
  16/02/28 19:41:04 INFO mapreduce.Job: map 0% reduce 0%
  16/02/28 19:41:12 INFO mapreduce.Job: map 100% reduce 0%
  16/02/28 19:41:21 INFO mapreduce.Job: map 100% reduce 100%
  16/02/28 19:41:22 INFO mapreduce.Job: Job job_1456590271264_0002 completed successfully
  16/02/28 19:41:22 INFO mapreduce.Job: Counters: 49
          File System Counters
                  FILE: Number of bytes read=102
                  FILE: Number of bytes written=244621
                  FILE: Number of read operations=0
                  FILE: Number of large read operations=0
                  FILE: Number of write operations=0
                  HDFS: Number of bytes read=142
                  HDFS: Number of bytes written=56
                  HDFS: Number of read operations=6
                  HDFS: Number of large read operations=0
                  HDFS: Number of write operations=2
          Job Counters
                  Launched map tasks=1
                  Launched reduce tasks=1
                  Data-local map tasks=1
                  Total time spent by all maps in occupied slots (ms)=5537
                  Total time spent by all reduces in occupied slots (ms)=6555
                  Total time spent by all map tasks (ms)=5537
                  Total time spent by all reduce tasks (ms)=6555
                  Total vcore-milliseconds taken by all map tasks=5537
                  Total vcore-milliseconds taken by all reduce tasks=6555
                  Total megabyte-milliseconds taken by all map tasks=5669888
                  Total megabyte-milliseconds taken by all reduce tasks=6712320
          Map-Reduce Framework
                  Map input records=12
                  Map output records=14
                  Map output bytes=100
                  Map output materialized bytes=102
                  Input split bytes=98
                  Combine input records=14
                  Combine output records=10
                  Reduce input groups=10
                  Reduce shuffle bytes=102
                  Reduce input records=10
                  Reduce output records=10
                  Spilled Records=20
                  Shuffled Maps =1
                  Failed Shuffles=0
                  Merged Map outputs=1
                  GC time elapsed (ms)=79
                  CPU time spent (ms)=2560
                  Physical memory (bytes) snapshot=445992960
                  Virtual memory (bytes) snapshot=1775263744
                  Total committed heap usage (bytes)=306184192
          Shuffle Errors
                  BAD_ID=0
                  CONNECTION=0
                  IO_ERROR=0
                  WRONG_LENGTH=0
                  WRONG_MAP=0
                  WRONG_REDUCE=0
          File Input Format Counters
                  Bytes Read=44
          File Output Format Counters
                  Bytes Written=56
  You have mail in /var/spool/mail/root
  [root@sht-sgmhadoopnn-01 mapreduce]#
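
The counters line up with the input: Map input records=12 matches the 12 lines of test.log, Map output records=14 is the total token count (the line "a a a" contributes three tokens), and Reduce output records=10 is the number of distinct words. The example's mapper splits each line on whitespace, so you can approximate the same tally locally with standard shell tools (a sketch; sort order may differ from the job output):

  # Split space-separated tokens onto their own lines, then count duplicates
  [root@sht-sgmhadoopnn-01 mapreduce]# tr -s ' ' '\n' < /tmp/test.log | sort | uniq -c
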
5. Verify the wordcount result (the word frequencies)

  [root@sht-sgmhadoopnn-01 mapreduce]# hadoop fs -ls /out1
  16/02/28 19:43:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  Found 2 items
  -rw-r--r--   3 root supergroup          0 2016-02-28 19:41 /out1/_SUCCESS
  -rw-r--r--   3 root supergroup         56 2016-02-28 19:41 /out1/part-r-00000
  [root@sht-sgmhadoopnn-01 mapreduce]# hadoop fs -text /out1/part-r-00000
  16/02/28 19:43:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  %       1
  %……     1
  1       1
  2       1
  3       1
  a       5
  abc     1
  b       1
  v       1
  我是谁  1
  You have mail in /var/spool/mail/root
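
hadoop fs -text also decodes compressed output; for plain text like this, hadoop fs -cat is equivalent. When a job runs with several reducers, -getmerge collects all the part files into one local file (a sketch; the local path /tmp/wordcount.out is arbitrary):

  # Read the single part file directly
  [root@sht-sgmhadoopnn-01 mapreduce]# hadoop fs -cat /out1/part-r-00000
  # Or merge every part file in /out1 into one local file
  [root@sht-sgmhadoopnn-01 mapreduce]# hadoop fs -getmerge /out1 /tmp/wordcount.out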
