1. Component Versions
Linux ISO: CentOS-6.0-i386-bin-DVD.iso (32-bit)
JDK version: "1.6.0_25-ea"
Hadoop version: hadoop-0.20.2.tar.gz
HBase version: hbase-0.90.5
Pig version: pig-0.9.2.tar.gz, downloaded from the Beijing Jiaotong University Apache mirror at http://mirror.bjtu.edu.cn/apache/pig/pig-0.9.2/pig-0.9.2.tar.gz. This is not the latest release, but it matches Hadoop 0.20.2; Pig releases have compatibility requirements against Hadoop, so check which Hadoop version you have installed and look up the matching Pig release. The mirror above carries the full Pig series (for example pig-0.10.0.tar.gz), so I will not list them all here.
2. Pig Installation Modes
Local mode: essentially single-machine mode. Pig can only access the local host, nothing is distributed, and Hadoop does not even have to be installed; all command execution and file I/O happen locally. It is typically used for experimenting with jobs.
Local mode only needs one environment variable:
export PATH=/usr/java/jdk1.6.0_25/bin:/home/grid/hadoop-0.20.2/bin:/home/grid/pig-0.9.2/bin:$PATH
A quick start example is shown below.
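To confirm that local mode works before setting up MapReduce mode (a minimal sketch, not taken from the original session), Pig can be started with the -x local flag, in which case it reads and writes the local filesystem instead of HDFS:
[grid@h1 grid]$ pig -x local    -- start the Grunt shell against the local filesystem
grunt> quit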
MapReduce mode: this is the mode used in real work. Files are uploaded into HDFS, and when a Pig Latin job runs, the work is distributed across the Hadoop cluster, which is exactly the MapReduce idea; the Pig client connects to the Hadoop cluster to manage and analyze the data.
It requires configuring PATH and PIG_CLASSPATH, checking the hosts file, and then starting Pig.
This article mainly covers the MapReduce-mode installation, because it is the most common and most useful mode in practice.
Pig acts as a Hadoop client, so the Pig package can be installed on any node of the cluster and jobs can be submitted from any node. I installed it on the master node simply to make the deployment layout easier to follow.
3. Verify the Hadoop Cluster Status
Verify from the shell command line:
[grid@h1 hadoop-0.20.2]$ bin/hadoop dfsadmin -report
Configured Capacity: 19865944064 (18.5 GB)
Present Capacity: 8833888256 (8.23 GB)
DFS Remaining: 8833495040 (8.23 GB)
DFS Used: 393216 (384 KB)
DFS Used%: 0%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)    -- both datanodes alive, none shut down
Name: 192.168.2.103:50010    -- slave h2
Decommission Status : Normal    -- status normal
Configured Capacity: 9932972032 (9.25 GB)
DFS Used: 196608 (192 KB)
Non DFS Used: 5401513984 (5.03 GB)
DFS Remaining: 4531261440(4.22 GB)
DFS Used%: 0%
DFS Remaining%: 45.62%
Last contact: Fri Nov 02 18:58:02 CST 2012
Name: 192.168.2.105:50010    -- slave h4
Decommission Status : Normal    -- status normal
Configured Capacity: 9932972032 (9.25 GB)
DFS Used: 196608 (192 KB)
Non DFS Used: 5630541824 (5.24 GB)
DFS Remaining: 4302233600(4.01 GB)
DFS Used%: 0%
DFS Remaining%: 43.31%
Last contact: Fri Nov 02 18:58:02 CST 2012
[grid@h1 hadoop-0.20.2]$ jps    -- master: both Hadoop and HBase are running
22926 HQuorumPeer
4709 JobTracker
22977 HMaster
4515 NameNode
4650 SecondaryNameNode
31681 Jps
[grid@h2 tmp]$ jps    -- slave1: both Hadoop and HBase are running
17188 TaskTracker
22181 Jps
13800 HRegionServer
13727 HQuorumPeer
17077 DataNode
[grid@h4 logs]$ jps    -- slave2: both Hadoop and HBase are running
27829 TaskTracker
19978 Jps
26875 Jps
17119 DataNode
11636 HRegionServer
11557 HQuorumPeer
4. Pig Installation and Configuration
(1) Upload pig-0.9.2.tar.gz to h1:/home/grid/ and unpack it with tar
[grid@h1 grid]$ pwd
/home/grid
[grid@h1 grid]$ ll
total 46832
-rwxrwxrwx. 1 grid hadoop 44 Sep 18 19:10 abc.txt
-rwxrwxrwx. 1 grid hadoop 5519 Oct 12 22:09 Exercise_1.jar
drwxr-xr-x. 14 grid hadoop 4096 Sep 18 07:05 hadoop-0.20.2
drwxr-xr-x. 10 grid hadoop 4096 Oct 28 21:13 hbase-0.90.5
-rwxrw-rw-. 1 grid hadoop 47875717 Nov 2 06:44 pig-0.9.2.tar.gz
[grid@h1 grid]$ tar -zxvf pig-0.9.2.tar.gz
[grid@h1 grid]$ ll
total 46836
-rwxrwxrwx. 1 grid hadoop 44 Sep 18 19:10 abc.txt
-rwxrwxrwx. 1 grid hadoop 5519 Oct 12 22:09 Exercise_1.jar
drwxr-xr-x. 14 grid hadoop 4096 Sep 18 07:05 hadoop-0.20.2
drwxr-xr-x. 10 grid hadoop 4096 Oct 28 21:13 hbase-0.90.5
drwxr-xr-x. 2 grid hadoop 4096 Sep 16 19:57 input
drwxr-xr-x. 15 grid hadoop 4096 Jan 18 2012 pig-0.9.2
-rwxrw-rw-. 1 grid hadoop 47875717 Nov 2 06:44 pig-0.9.2.tar.gz
(2) Configure Pig's environment variables; the annotated entries below are the ones that need to be modified
[grid@h1 grid]$ vim .bashrc
export JAVA_HOME=/usr    -- do not point this at the Java directory itself; it only takes effect when set to the parent directory
export JRE_HOME=/usr/java/jdk1.6.0_25/jre
export PATH=/usr/java/jdk1.6.0_25/bin:/home/grid/hadoop-0.20.2/bin:/home/grid/pig-0.9.2/bin:$PATH
-- add the Hadoop and Pig bin directories so the shell knows where to look for their commands and programs
export CLASSPATH=./:/usr/java/jdk1.6.0_25/lib:/usr/java/jdk1.6.0_25/jre/lib
export PIG_CLASSPATH=/home/grid/hadoop-0.20.2/conf    -- since this is MapReduce mode, Pig has to be able to find the Hadoop cluster. This variable tells Pig where the Hadoop configuration files live; from core-site.xml, hdfs-site.xml and mapred-site.xml it can read the key parameters, namely the locations and ports of the NameNode and the JobTracker, and with that information it can drive the whole cluster.
Method two: editing /home/grid/pig-0.9.2/conf/pig.properties also enables MapReduce mode.
Add:
fs.default.name=hdfs://h1:9000    -- so Pig can locate the NameNode
mapred.job.tracker=h1:9001    -- so Pig can locate the JobTracker
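Put together, the relevant part of pig.properties would look roughly like this (a sketch of the two properties above; the values must match your own Hadoop configuration):
# /home/grid/pig-0.9.2/conf/pig.properties
fs.default.name=hdfs://h1:9000
mapred.job.tracker=h1:9001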
(3) Make the environment variables take effect
[grid@h1 grid]$ source .bashrc    -- reload the environment variables so they take effect
(4) Check the hosts file
[grid@h1 grid]$ cat /etc/hosts
192.168.2.102 h1 # Added by NetworkManager
127.0.0.1 localhost.localdomain localhost
::1 h1 localhost6.localdomain6 localhost6
192.168.2.102 h1
192.168.2.103 h2
192.168.2.105 h4
This file maps hostnames to IP addresses. In a Hadoop cluster the nodes generally communicate by hostname, and the configuration files refer to nodes by hostname as well.
(5) Start Pig
[grid@h1 grid]$ pig -x mapreduce    -- start the Grunt shell in MapReduce mode; the bare pig command also works (shown further below)
2012-11-02 20:09:22,149 [main] INFO org.apache.pig.Main - Logging error messages to: /home/grid/pig_1351858162147.log
2012-11-02 20:09:23,314 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://h1:9000    -- Pig has found the NameNode
2012-11-02 20:09:27,950 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: h1:9001    -- Pig has found the JobTracker
grunt> quit    -- exit the Pig client
[grid@h1 grid]$
[grid@h1 grid]$ pig    -- the bare pig command also enters the shell
2012-11-02 20:16:17,968 [main] INFO org.apache.pig.Main - Logging error messages to: /home/grid/pig_1351858577966.log
2012-11-02 20:16:18,100 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://h1:9000
2012-11-02 20:16:18,338 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: h1:9001
grunt> help    -- list the available commands
Commands:
<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig
File system commands:
fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html
Diagnostic commands:
describe <alias>[::<alias] - Show the schema for the alias. Inner aliases can be described as A::B.
explain [-script <pigscript>] [-out <path>] [-brief] [-dot] [-param <param_name>=<param_value>]
[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.
-script - Explain the entire script.
-out - Store the output into directory rather than print to stdout.
-brief - Don't expand nested plans (presenting a smaller graph for overview).
-dot - Generate the output in .dot format. Default is text format.
-param <param_name - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
alias - Alias to explain.
dump <alias> - Compute the alias and writes the results to stdout.
Utility Commands:
exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment including aliases.
-param <param_name - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
run [-param <param_name>=param_value] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment.
-param <param_name - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
The following keys are supported:
default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.
debug - Set debug on or off. Default is off.
job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal
stream.skippath - String that contains the path. This is used by streaming.
any hadoop property.
help - Display this message.
quit - Quit the grunt shell.
(6) Pig shell commands
Tab completion (case-sensitive): just like command completion in Linux, type part of a command and press Tab to complete it; note that it does not complete file names.
grunt> ls    -- list the contents of the current HDFS directory (the user's home directory)
hdfs://h1:9000/user/grid/in <dir>    -- <dir> marks a directory; <r 3> marks a file (replicated 3 times)
hdfs://h1:9000/user/grid/out1 <dir>
hdfs://h1:9000/user/grid/out2 <dir>
grunt> cd in    -- change into the in subdirectory
grunt> ls
hdfs://h1:9000/user/grid/in/test_1<r 3> 324    -- 324 bytes
hdfs://h1:9000/user/grid/in/test_2<r 3> 134    -- 134 bytes
grunt> cat test_1    -- show the contents of test_1
Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
grunt> cat test_2    -- show the contents of test_2
13599999999 10086
13899999999 120
13944444444 13800138000
13722222222 13800138000
18800000000 120
13722222222 10086
18944444444 10086
At the grunt> prompt everything is shown as an absolute path; there is no relative-path display.
The grunt> shell introduces the notion of a current directory, which it remembers and manages for you.
At the grunt> prompt you operate on HDFS directly, without having to type the verbose HDFS commands (see the fs example below).
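For anything the built-in shortcuts do not cover, the fs command listed in the help output above runs an arbitrary HDFS shell command from inside grunt (an illustrative example, not from the original session):
grunt> fs -ls /user/grid    -- equivalent to bin/hadoop fs -ls /user/grid on the command line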
copyFromLocal: copy something from the local operating system into the HDFS filesystem
grunt> copyFromLocal /home/grid/access_log.txt pig/access_log.txt
grunt> ls
hdfs://h1:9000/user/grid/in <dir>
hdfs://h1:9000/user/grid/out1 <dir>
hdfs://h1:9000/user/grid/out2 <dir>
hdfs://h1:9000/user/grid/pig <dir>
grunt> cd pig
grunt> ls
hdfs://h1:9000/user/grid/pig/access_log.txt<r 2> 7118627    -- the byte count matches
copyToLocal: copy something from the HDFS filesystem back to the local operating system
grunt> copyToLocal test_1 ttt
grunt> ls
hdfs://h1:9000/user/grid/in/test_1<r 3> 324
hdfs://h1:9000/user/grid/in/test_2<r 3> 134
[grid@h1 grid]$ cat ttt    -- a perfect copy
Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
sh: run operating-system commands directly from the grunt> prompt
grunt> sh pwd
/home/grid
grunt> sh cat ttt
Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
5. A Pig Case Study
Task: use Pig Latin to process the access_log.txt log and compute the number of hits for each IP address.
Let's first look at the list of commands the Grunt shell recognizes; these are the ones we use most often:
<EOF>
"cat" ...
"fs" ...
"sh" ...
"cd" ...
"cp" ...
"copyFromLocal" ...
"copyToLocal" ...
"dump" ...
"describe" ...
"aliases" ...
"explain" ...
"help" ...
"kill" ...
"ls" ...
"mv" ...
"mkdir" ...
"pwd" ...
"quit" ...
"register" ...
"rm" ...
"rmf" ...
"set" ...
"illustrate" ...
"run" ...
"exec" ...
"scriptDone" ...
"" ...
<EOL> ...
";" ...
grunt> pwd
hdfs://h1:9000/user/grid/pig
grunt> ls
hdfs://h1:9000/user/grid/pig/access_log.txt<r 2> 7118627    -- this is the file we are going to process
grunt> cat access_log.txt    -- look at the file contents before doing the analysis
119.146.220.12 - - [31/Jan/2012:23:59:51 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:52 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /static/js/smilies.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /data/cache/common_smilies_var.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
Approach:
This is a slice of a dataguru web access log. From the structure of each line, the IP address comes first, so we only need to extract the IP addresses into an ip_text relation, group on the ip column (which effectively splits it into many small tables, one per IP), and then count the rows of each small table, which gives the number of hits per IP. The complete script is sketched below, followed by the same steps run one by one.
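For reference, the whole pipeline fits in one short Pig Latin script (a sketch assembled from the statements used below; the file name count_ip.pig is my own choice, and the output directory pig/sort_count_ip must not already exist). It could be run non-interactively with pig -x mapreduce count_ip.pig:
-- count_ip.pig: number of hits per IP in an Apache access log
ip_text = LOAD 'pig/access_log.txt' USING PigStorage(' ') AS (ip:chararray);
group_ip = GROUP ip_text BY ip;                               -- one bag of rows per IP
count_ip = FOREACH group_ip GENERATE group, COUNT($1) AS count_ip;
sort_count_ip = ORDER count_ip BY count_ip DESC;              -- most active IPs first
STORE sort_count_ip INTO 'pig/sort_count_ip';                 -- persist the result in HDFS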
(1) Load the contents of access_log.txt from HDFS into a Pig relation (table), using a space as the delimiter; only the ip column needs to be loaded.
grunt> ip_text = LOAD 'pig/access_log.txt' USING PigStorage(' ') AS (ip:chararray);
ip_text: a relation (a table, a variable) that holds all the IP records
LOAD 'pig/access_log.txt': the file to load
USING PigStorage(' '): use a space as the field delimiter
ip:chararray: the first column is named ip and has the character type chararray
(2) Inspect the structure and contents of the ip_text relation
Pay close attention to the details, for example do not drop the trailing semicolon. When a Pig Latin statement is executed, Pig transparently converts it into a MapReduce job: it first creates a jar file, then submits the MR job, a Hadoop job id is generated, the job runs, and finally the result is displayed.
grunt> DESCRIBE ip_text;    -- show the schema: a single column of character type
ip_text: {ip: chararray}
grunt> DUMP ip_text;    -- show the contents (only an excerpt is reproduced here)
creating jar file Job2594979755419279957.jar
1 map-reduce job(s) waiting for submission
HadoopJobId: job_201210121146_0002
(119.146.220.12)
(180.153.227.41)
(180.153.227.44)
(180.153.227.44)
(180.153.227.44)
(221.194.180.166)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(220.181.94.221)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
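As an aside (not part of the original session), the MapReduce plan Pig builds for a relation can be inspected without submitting a job, using the EXPLAIN command from the help list above:
grunt> EXPLAIN ip_text;    -- prints the logical, physical and map-reduce plans; no job is run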
(3) Group on the ip column and inspect the structure and contents of the grouped relation; mind the upper and lower case of the keywords
Each IP's records are collected into one small table, and the grouped result is stored in the group_ip relation.
grunt> group_ip = GROUP ip_text BY ip;    -- group by ip and assign the result to group_ip
grunt> DESCRIBE group_ip;    -- inspect the schema of group_ip
group_ip: {group: chararray,ip_text: {(ip: chararray)}}
One glance at the schema shows that group_ip is a nested relation: the first field is group, which holds the grouped IP value;
the second field is a nested small table, also called a bag, containing the whole set of rows for that IP.
grunt> DUMP group_ip;    -- this submits another MR job
Pig script settings are added to the job    -- the Pig statement is automatically turned into an MR job
creating jar file Job2785495206577164389.jar    -- creating the jar
jar file Job2785495206577164389.jar created    -- jar created
map-reduce job(s) waiting for submission.    -- the job is submitted
HadoopJobId: job_201210121146_0003    -- the Hadoop job id
(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),
(4) Count the rows of each small table, i.e. the number of hits per IP
grunt> count_ip = FOREACH group_ip GENERATE group,COUNT($1) AS count_ip;
FOREACH group_ip: scan group_ip row by row; the result is assigned to count_ip
GENERATE group: emit the grouped IP value
COUNT($1) AS count_ip: count the rows of the nested small table (bag), i.e. the hits for that IP; the column is aliased count_ip so it can be sorted in descending order. $1 refers to the second field, the bag (positional references start at $0), and is equivalent to COUNT(ip_text.ip)
grunt> sort_count_ip = ORDER count_ip BY count_ip DESC;    -- sort by the count_ip column from largest to smallest
# grunt> sort_count_ip = ORDER count_ip BY count_ip ASC;    -- ascending order instead
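Optionally (not in the original session; the alias top10 is my own), the result can be previewed without dumping all of it by applying LIMIT right after ORDER, which preserves the sort order:
grunt> top10 = LIMIT sort_count_ip 10;    -- keep only the first 10 rows
grunt> DUMP top10;                        -- still runs a job, but prints only the 10 busiest IPs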
(5) Inspect the structure and contents of sort_count_ip
grunt> DESCRIBE sort_count_ip;    -- show the schema: two columns
sort_count_ip: {group: chararray,count_ip: long}    -- the first field is group, a chararray (the grouped IP value); the second is count_ip, a long (the hit count)
grunt> DUMP sort_count_ip;    -- show the contents (excerpt only); the job statistics are printed first, then the result
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.9.2 grid 2012-11-03 21:13:05 2012-11-03 21:18:39 GROUP_BY,ORDER_BY
Success!
Input(s):
Successfully read 28134 records (7118627 bytes) from: "hdfs://h1:9000/user/grid/pig/access_log.txt"
Output(s):
Successfully stored 476 records (14515 bytes) in: "hdfs://h1:9000/tmp/temp1703385752/tmp-1916755802"
Counters:
Total records written : 476
Total bytes written : 14515
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201210121146_0004 -> job_201210121146_0005,
job_201210121146_0005 -> job_201210121146_0006,
job_201210121146_0006
(218.20.24.203,4597)
(221.194.180.166,4576)
(119.146.220.12,1850)
(117.136.31.144,1647)
(121.28.95.48,1597)
(113.109.183.126,1596)
(182.48.112.2,870)
(120.84.24.200,773)
(61.144.125.162,750)
(27.115.124.75,470)
(115.236.48.226,439)
(59.41.62.100,339)
(89.126.54.40,305)
(114.247.10.132,243)
(125.46.45.78,236)
(220.181.94.221,205)
(218.19.42.168,181)
(118.112.183.164,179)
(116.235.194.89,171)
(6) Write the contents of sort_count_ip into HDFS, i.e. persist the result to a file on disk
grunt> STORE sort_count_ip INTO 'pig/sort_count_ip';
Counters:
Total records written : 476
Total bytes written : 8051
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201210121146_0007 -> job_201210121146_0008,
job_201210121146_0008 -> job_201210121146_0009,
job_201210121146_0009
2012-11-03 21:28:41,520 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
When we see Success, the result has been saved.
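STORE uses PigStorage with a tab delimiter by default, which is why the stored file shown below is tab-separated. If needed, the result can be loaded back into Pig (an illustrative sketch; the alias result and field name hits are my own):
grunt> result = LOAD 'pig/sort_count_ip' USING PigStorage('\t') AS (ip:chararray, hits:long);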
(7) Examine the result file saved in HDFS
grunt> ls
hdfs://h1:9000/user/grid/in <dir>
hdfs://h1:9000/user/grid/out1 <dir>
hdfs://h1:9000/user/grid/out2 <dir>
hdfs://h1:9000/user/grid/pig <dir>
grunt> cd pig
grunt> ls
hdfs://h1:9000/user/grid/pig/access_log.txt<r 2> 7118627
hdfs://h1:9000/user/grid/pig/sort_count_ip <dir>
grunt> cat sort_count_ip
218.20.24.203 4597
221.194.180.166 4576
119.146.220.12 1850
117.136.31.144 1647
121.28.95.48 1597
113.109.183.126 1596
182.48.112.2 870
120.84.24.200 773
61.144.125.162 750
27.115.124.75 470
115.236.48.226 439
59.41.62.100 339
89.126.54.40 305
114.247.10.132 243
125.46.45.78 236
220.181.94.221 205
218.19.42.168 181
118.112.183.164 179
116.235.194.89 171
With that, the task is complete.