Writing a Hadoop Streaming job in Python
The mapper code is as follows:
    #!/usr/bin/env python
    import sys

    # Read input lines from stdin and emit "<word>\t1" for every word.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            # The parenthesized form runs under both Python 2 and Python 3.
            print('%s\t%s' % (word, 1))
The reducer code is as follows:
    #!/usr/bin/env python
    import sys

    current_word = None
    current_count = 0

    # Hadoop Streaming sorts mapper output by key before the reduce phase,
    # so all lines for the same word arrive next to each other.
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # Skip lines whose count is not an integer.
            continue
        if current_word == word:
            current_count += count
        else:
            # A new word starts: emit the total for the previous one.
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # Emit the total for the last word, unless no valid input was seen.
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))
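The reducer above works only because equal words sit on adjacent lines. As a minimal sketch of the same idea, the grouping can also be written with `itertools.groupby` from the standard library. The function names `parse` and `reduce_counts` and the sample data are illustrative assumptions, not part of the original scripts.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from tab-separated lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            # No tab on this line, so there is no count to read.
            continue
        try:
            yield word, int(count)
        except ValueError:
            # The count field is not an integer.
            continue


def reduce_counts(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Sorted sample input in the shape the mapper emits (hypothetical data).
sample = ["a\t1\n", "a\t1\n", "b\t1\n", "b\tx\n", "c\t1\n"]
for word, total in reduce_counts(sample):
    print("%s\t%s" % (word, total))
```

In an actual reducer script, `reduce_counts(sys.stdin)` would take the place of the in-memory `sample` list.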
Assume the mapper code is saved at /home/hadoop/mapper.py, the reducer code at /home/hadoop/reducer.py, the input path is /tmp/input on HDFS, and the output path is /tmp/output on HDFS. Then submit the following hadoop command on the E-MapReduce cluster:
    hadoop jar /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -file /home/hadoop/mapper.py -mapper mapper.py \
        -file /home/hadoop/reducer.py -reducer reducer.py \
        -input /tmp/input -output /tmp/output
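Before submitting to the cluster, it can help to check the map, sort, and reduce flow on a small sample. The following is a rough local sketch, assuming in-memory stand-ins for the three phases. The names `map_words`, `shuffle`, `sum_sorted_pairs`, and `run_local` are hypothetical helpers for illustration and are not part of Hadoop.

```python
def map_words(lines):
    """Map phase: emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Stand-in for the sort by key that Hadoop performs between phases."""
    return sorted(pairs, key=lambda pair: pair[0])


def sum_sorted_pairs(pairs):
    """Reduce phase: sum counts of adjacent pairs that share a word."""
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count


def run_local(lines):
    """Chain the three phases over an in-memory list of lines."""
    return list(sum_sorted_pairs(shuffle(map_words(lines))))


print(run_local(["hello world", "hello hadoop"]))
# prints [('hadoop', 1), ('hello', 2), ('world', 1)]
```

The real scripts can be checked in a similar way with a shell pipeline such as `cat sample.txt | python mapper.py | sort | python reducer.py`, where `sample.txt` is a hypothetical local test file.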