开发者社区 问答 正文

E-MapReduce Hadoop Streaming是什么?



python 写hadoop streaming作业


mapper代码如下

  1. #!/usr/bin/env python
  2. import sys
  3. for line in sys.stdin:
  4.     line = line.strip()
  5.     words = line.split()
  6.     for word in words:
  7.         print '%s\t%s' % (word, 1)

reducer代码如下
  1. #!/usr/bin/env python
  2. from operator import itemgetter
  3. import sys
  4. current_word = None
  5. current_count = 0
  6. word = None
  7. for line in sys.stdin:
  8.     line = line.strip()
  9.     word, count = line.split('\t', 1)
  10.     try:
  11.         count = int(count)
  12.     except ValueError:
  13.         continue
  14.     if current_word == word:
  15.         current_count += count
  16.     else:
  17.         if current_word:
  18.             print '%s\t%s' % (current_word, current_count)
  19.         current_count = count
  20.         current_word = word
  21. if current_word == word:
  22.     print '%s\t%s' % (current_word, current_count)

假设mapper代码保存在/home/hadoop/mapper.py, reducer代码保存在/home/hadoop/reducer.py ,输入路径为hdfs文件系统的/tmp/input,输出路径为hdfs文件系统的/tmp/output。则在E-MapReduce集群上提交下面的hadoop命令
hadoop jar/usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-*.jar -file/home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducerreducer.py -input /tmp/hosts -output /tmp/output

展开
收起
nicenelly 2017-10-30 16:16:13 1513 分享 版权
0 条回答
写回答
取消 提交回答