SQOOP:Apache基金会下一个开源产品,Hadoop家族的一个产品,关系型数据库与HDFS文件系统之间进行数据交换,数据迁移的一个工具。
一、环境描述
Mysql版本:mysql-installer-community-5.5.27.1 32位
Mysql for Windows 7 32位:我把mysql数据库安装在了自己win7的笔记本上,这样的好处就是减少了虚拟机 master slave的开销和使用空间还可以多利用一台机器的资源,如果你的虚拟机资源很紧张的话也可以这样部署。
Linux ISO:CentOS-6.0-i386-bin-DVD.iso 32位
JDK version:"1.6.0_25-ea" for linux
Hadoop software version:hadoop-0.20.205.0.tar.gz for linux
Mysql version:mysql-installer-community-5.5.27.1 32位 for windows
sqoop version:sqoop-1.2.0-CDH3B4.tar.gz for linux
主机名
|
IP
|
节点名
|
备注
|
h1
|
192.168.2.102
|
master
|
namenode和jobtracker
|
h2
|
192.168.2.103
|
slave1
|
datanode和tasktracker
|
H4
|
192.168.2.105
|
slave2
|
datanode和tasktracker
|
MySQL部署在宿主环境中:http://f.dataguru.cn/thread-34746-1-1.html 参考飚哥风靡版
二、下载软件安装包
帖子名:hadoop第十周cloudera版sqoop包和hadoop-core-jar包下载
帖子网址:http://f.dataguru.cn/forum.php?mod=viewthread&tid=36867&fromuid=303
欢迎大家下载使用
sqoop-1.2.0-CDH3B4.tar.gz 和 hadoop-core-jar包 mysql-connector-java-5.1.22-bin.jar 包 是我们这次用到的
三、把下载好的文件加载到linux并解压
下载
[grid@h1 ~]$ pwd
/home/grid/
-rwxrw-rw-. 1 grid hadoop 67339212 4月 12 2011 hadoop-0.20.2-CDH3B4.tar.gz
-rwxrw-rw-. 1 grid hadoop 832960 11月 19 16:06 mysql-connector-java-5.1.22-bin.jar
-rwxrw-rw-. 1 grid hadoop 1543137 4月 12 2011 sqoop-1.2.0-CDH3B4.tar.gz
解压包
[grid@h1 ~]$ tar -zxvf hadoop-0.20.2-CDH3B4.tar.gz
[grid@h1 ~]$ tar -zxvf sqoop-1.2.0-CDH3B4.tar.gz
[grid@h1 ~]$ pwd
/home/grid/
drwxr-xr-x. 15 grid hadoop 4096 2月 22 2011 hadoop-0.20.2-CDH3B4 解压后目录
-rwxrw-rw-. 1 grid hadoop 67339212 4月 12 2011 hadoop-0.20.2-CDH3B4.tar.gz
-rwxrw-rw-. 1 grid hadoop 832960 11月 19 16:06 mysql-connector-java-5.1.22-bin.jar
drwxr-xr-x. 11 grid hadoop 4096 2月 22 2011 sqoop-1.2.0-CDH3B4 解压后目录
-rwxrw-rw-. 1 grid hadoop 1543137 4月 12 2011 sqoop-1.2.0-CDH3B4.tar.gz
四、拷贝hadoop-core-0.20.2-CDH3B4.jar和mysql-connector-java-5.1.22-bin.jar到/home/grid/sqoop-1.2.0-CDH3B4/lib/目录下
[grid@h1 ~]$ cd hadoop-0.20.2-CDH3B4
[grid@h1 hadoop-0.20.2-CDH3B4]$ cp hadoop-core-0.20.2-CDH3B4.jar /home/grid/sqoop-1.2.0-CDH3B4/lib/
[grid@h1 grid]$ cp mysql-connector-java-5.1.22-bin.jar /home/grid/sqoop-1.2.0-CDH3B4/lib/
五、配置sqoop-1.2.0-CDH3B4/bin/configure-sqoop文件
[grid@h1 conf]$ cd ../bin
[grid@h1 bin]$ pwd
/home/grid/sqoop-1.2.0-CDH3B4/bin
[grid@h1 bin]$ vim configure-sqoop
注释掉hbase和zookeeper检查(除非你准备使用HABASE等HADOOP上的组件)
# Check: If we can't find our dependencies, give up here.
if [ ! -d "${HADOOP_HOME}" ]; then
echo "Error: $HADOOP_HOME does not exist!"
echo 'Please set $HADOOP_HOME to the root of your Hadoop installation.'
exit 1
fi 只有红色需要修改
#if [ ! -d "${HBASE_HOME}" ]; then
#echo "Error: $HBASE_HOME does not exist!"
#echo 'Please set $HBASE_HOME to the root of your HBase installation.'
#exit 1
#fi
#if [ ! -d "${ZOOKEEPER_HOME}" ]; then
# echo "Error: $ZOOKEEPER_HOME does not exist!"
# echo 'Please set $ZOOKEEPER_HOME to the root of your ZooKeeper installation.'
# exit 1
#fi
六、配置所需环境变量
在哪里执行sqoop,就在哪台机器上设置一下
[grid@h1 grid]$ vim .bashrc 添加
export JAVA_HOME=/usr
export JRE_HOME=/usr/java/jdk1.6.0_25/jre
export PATH=/usr/java/jdk1.6.0_25/bin:/home/grid/hadoop-0.20.2/bin:/home/grid/pig-0.9.2/bin:$PATH
export CLASSPATH=./:/usr/java/jdk1.6.0_25/lib:/usr/java/jdk1.6.0_25/jre/lib
export PIG_CLASSPATH=/home/grid/hadoop-0.20.2/conf
export HIVE_HOME=/home/grid/hive-0.8.1
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HADOOP_HOME=/home/grid/hadoop-0.20.2
作用:让sqoop程序从环境变量里找到hadoop的位置,从而找到hadoop配置文件,知道集群的部署情况
[grid@h1 grid]$ echo $HADOOP_HOME 检查一下没有问题
/home/grid/hadoop-0.20.2
七、配置启动HADOOP集群
H1机器 master
[grid@h1 bin]$ pwd
/home/grid/hadoop-0.20.2/bin
[grid@h1 bin]$ ./start-all.sh
starting namenode, logging to /home/grid/hadoop-0.20.2/bin/../logs/hadoop-grid-namenode-h1.out
h2: starting datanode, logging to /home/grid/hadoop-0.20.2/bin/../logs/hadoop-grid-datanode-h2.out
h4: starting datanode, logging to /home/grid/hadoop-0.20.2/bin/../logs/hadoop-grid-datanode-h4.out
h1: starting secondarynamenode, logging to /home/grid/hadoop-0.20.2/bin/../logs/hadoop-grid-secondarynamenode-h1.out
starting jobtracker, logging to /home/grid/hadoop-0.20.2/bin/../logs/hadoop-grid-jobtracker-h1.out
h2: starting tasktracker, logging to /home/grid/hadoop-0.20.2/bin/../logs/hadoop-grid-tasktracker-h2.out
h4: starting tasktracker, logging to /home/grid/hadoop-0.20.2/bin/../logs/hadoop-grid-tasktracker-h4.out
[grid@h1 bin]$ jps
17191 JobTracker
16955 NameNode
17442 Jps
17121 SecondaryNameNode
H2机器 slave
[grid@h2 ~]$ jps
32523 Jps
17188 TaskTracker
13727 HQuorumPeer
17077 DataNode
H4机器 slave
[grid@h4 ~]$ jps
27829 TaskTracker
26875 Jps
17119 DataNode
31083 Jps
11557 HQuorumPeer
[grid@h1 bin]$ ./hadoop dfsadmin –report 检查hadoop集群状态
Configured Capacity: 19865944064 (18.5 GB)
Present Capacity: 8741523456 (8.14 GB)
DFS Remaining: 8726482944 (8.13 GB)
DFS Used: 15040512 (14.34 MB)
DFS Used%: 0.17%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead) --2个节点存活无shutdown
Name: 192.168.2.103:50010 -- slaves h2
Decommission Status : Normal --状态正常
Configured Capacity: 9932972032 (9.25 GB)
DFS Used: 7520256 (7.17 MB)
Non DFS Used: 5447561216 (5.07 GB)
DFS Remaining: 4477890560(4.17 GB)
DFS Used%: 0.08%
DFS Remaining%: 45.08%
Last contact: Fri Dec 14 18:10:11 CST 2012
Name: 192.168.2.105:50010 -- slaves h4
Decommission Status : Normal --状态正常
Configured Capacity: 9932972032 (9.25 GB)
DFS Used: 7520256 (7.17 MB)
Non DFS Used: 5676859392 (5.29 GB)
DFS Remaining: 4248592384(3.96 GB)
DFS Used%: 0.08%
DFS Remaining%: 42.77%
Last contact: Fri Dec 14 18:10:11 CST 2012
集群正常启动了
八、启动mysql,创建leo用户进行sqoop连接
1. 必须启动服务才能操作数据库

数据库端口:3306
Mysqll服务名:MySQL55
Mysql状态:已经启动
创建leo用户
grant all privileges on *.* to 'leo'@'%' identified by 'leo' with grant option;
select * from mysql.user;
flush privileges;


九、mysql 中建立sqoop库,test表,添加数据
[grid@h1 bin]$ ping 192.168.2.110 检查linux for windows 的连接性
PING 192.168.2.110 (192.168.2.110) 56(84) bytes of data.
64 bytes from 192.168.2.110: icmp_seq=1 ttl=64 time=14.5 ms
64 bytes from 192.168.2.110: icmp_seq=2 ttl=64 time=3.43 ms
64 bytes from 192.168.2.110: icmp_seq=3 ttl=64 time=9.68 ms
64 bytes from 192.168.2.110: icmp_seq=4 ttl=64 time=0.549 ms
^C
--- 192.168.2.110 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3630ms
rtt min/avg/max/mdev = 0.549/7.063/14.577/5.453 ms
[grid@h1 grid]$ mysql -h192.168.2.110 -uleo –pleo 使用leo用户登录数据库
命令列表
show databases; 显示当前有哪些数据库
create database sqoop; 创建sqoop数据库
use sqoop; 只有打开sqoop数据库才能操作哦
create table leo1 (user_id int, user_name varchar(10),class int); 创建leo1表
insert into leo1 values(1,'leonarding',10); 插入5条记录
insert into leo1 values(2,'wubiao',20);
insert into leo1 values(3,'alan',30);
insert into leo1 values(4,'sun',40);
insert into leo1 values(5,'liyang',50);
show tables; 显示当前数据库中存在哪些表
[grid@h1 grid]$ mysql -h192.168.2.110 -uleo -pleo
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.5.27 MySQL Community Server (GPL)
Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| mysql |
| performance_schema |
| sakila |
| test |
| world |
+--------------------+
7 rows in set (0.01 sec)
mysql> create database sqoop; 创建sqoop数据库
Query OK, 1 row affected (0.06 sec)
mysql> use sqoop;
Database changed
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| mysql |
| performance_schema |
| sakila |
| sqoop | sqoop数据库已经创建完毕
| test |
| world |
+--------------------+
8 rows in set (0.00 sec)
mysql> create table leo1 (user_id int, user_name varchar(10),class int); 创建leo1表
Query OK, 0 rows affected (1.82 sec)
mysql> insert into leo1 values(1,'leonarding',10);
Query OK, 1 row affected (0.12 sec)
mysql> insert into leo1 values(2,'wubiao',20);
Query OK, 1 row affected (0.06 sec)
mysql> insert into leo1 values(3,'alan',30);
Query OK, 1 row affected (1.02 sec)
mysql> insert into leo1 values(4,'sun',40);
Query OK, 1 row affected (0.05 sec)
mysql> insert into leo1 values(5,'liyang',50);
Query OK, 1 row affected (0.05 sec)
mysql> show tables; sqoop数据库中就有一个leo1表
+-----------------+
| Tables_in_sqoop |
+-----------------+
| leo1 |
+-----------------+
1 row in set (0.00 sec)
mysql> select * from leo1; 表中有5行数据
+---------+------------+-------+
| user_id | user_name | class |
+---------+------------+-------+
| 1 | leonarding | 10 |
| 2 | wubiao | 20 |
| 3 | alan | 30 |
| 4 | sun | 40 |
| 5 | liyang | 50 |
+---------+------------+-------+
5 rows in set (0.00 sec)
十、测试sqoop连接性
[grid@h1 grid]$ sqoop-1.2.0-CDH3B4/bin/sqoop list-databases --connect jdbc:mysql://192.168.2.110:3306/ --username leo --password leo
参数解释:
--connect jdbc:mysql://192.168.2.110:3306/ 指定mysql数据库主机名和端口号
--username leo 数据库用户名
--password leo 数据库密码
12/12/15 00:15:56 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 这里提示密码复杂度低安全性差
12/12/15 00:16:16 INFO manager.MySQLManager: Executing SQL statement: SHOW DATABASES 显示所有数据库
information_schema
hive
mysql
performance_schema
sakila
sqoop 这是我们刚才建立的数据库
test
world
从linux上通过sqoop可以正常连接到mysql数据库中
十一、从mysql中导出数据->SQOOP->导入HDFS文件系统
[grid@h1 grid]$ sqoop-1.2.0-CDH3B4/bin/sqoop import --connect jdbc:mysql://192.168.2.110:3306/sqoop --username leo --password leo --table leo1 -m 1
参数解释:
--connect jdbc:mysql://192.168.2.110:3306/sqoop 指定mysql数据库主机名和端口号和数据库名
--username leo 指定数据库用户名
--password leo 指定数据库密码
--table leo1 mysql中即将导出的表
-m 1 指定启动一个map进程,如果表很大,可以启动多个map进程
导入路径 默认/user/grid/leo1/part-m-00000
12/12/15 00:36:30 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
12/12/15 00:36:30 INFO tool.CodeGenTool: Beginning code generation
12/12/15 00:36:30 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `leo1` AS t LIMIT 1
12/12/15 00:36:30 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `leo1` AS t LIMIT 1 访问的表
12/12/15 00:36:31 INFO orm.CompilationManager: HADOOP_HOME is /home/grid/hadoop-0.20.2/bin/..
12/12/15 00:36:31 INFO orm.CompilationManager: Found hadoop core jar at: /home/grid/hadoop-0.20.2/bin/../hadoop-0.20.2-core.jar 找到hadoop核心jar
12/12/15 00:36:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-grid/compile/8d5e146de1ec99ef7d7ea6789b6b4441/leo1.jar 写入jar包
12/12/15 00:36:39 WARN manager.MySQLManager: It looks like you are importing from mysql.
12/12/15 00:36:39 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
12/12/15 00:36:39 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
12/12/15 00:36:39 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
12/12/15 00:36:39 INFO mapreduce.ImportJobBase: Beginning import of leo1
12/12/15 00:36:43 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `leo1` AS t LIMIT 1
12/12/15 00:37:05 INFO mapred.JobClient: Running job: job_201212141802_0001 作业编号(开始)
12/12/15 00:37:07 INFO mapred.JobClient: map 0% reduce 0%
12/12/15 00:39:27 INFO mapred.JobClient: map 100% reduce 0%
12/12/15 00:39:29 INFO mapred.JobClient: Job complete: job_201212141802_0001 作业编号(完成)
12/12/15 00:39:29 INFO mapred.JobClient: Counters: 5
12/12/15 00:39:29 INFO mapred.JobClient: Job Counters
12/12/15 00:39:29 INFO mapred.JobClient: Launched map tasks=1 启动一个map进程
12/12/15 00:39:29 INFO mapred.JobClient: FileSystemCounters
12/12/15 00:39:29 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=59
12/12/15 00:39:29 INFO mapred.JobClient: Map-Reduce Framework
12/12/15 00:39:29 INFO mapred.JobClient: Map input records=5 map导入5条记录
12/12/15 00:39:29 INFO mapred.JobClient: Spilled Records=0 无溢出
12/12/15 00:39:29 INFO mapred.JobClient: Map output records=5 map导出5条记录
12/12/15 00:39:29 INFO mapreduce.ImportJobBase: Transferred 59 bytes in 165.1492 seconds (0.3573 bytes/sec) 导出59个字节,用时165秒
12/12/15 00:39:29 INFO mapreduce.ImportJobBase: Retrieved 5 records. 导入HDFS中5行
我们在HDFS中检查一下
[grid@h1 grid]$ hadoop dfs -ls
Found 5 items
drwxr-xr-x - grid supergroup 0 2012-11-02 20:55 /user/grid/in
drwxr-xr-x - grid supergroup 0 2012-12-15 00:39 /user/grid/leo1
drwxr-xr-x - grid supergroup 0 2012-10-12 12:15 /user/grid/out1
drwxr-xr-x - grid supergroup 0 2012-10-13 18:02 /user/grid/out2
drwxr-xr-x - grid supergroup 0 2012-11-03 21:28 /user/grid/pig
[grid@h1 grid]$ hadoop dfs -ls leo1
Found 2 items
drwxr-xr-x - grid supergroup 0 2012-12-15 00:37 /user/grid/leo1/_logs
-rw-r--r-- 2 grid supergroup 59 2012-12-15 00:39 /user/grid/leo1/part-m-00000
[grid@h1 grid]$ hadoop dfs -cat leo1/part-m-00000
1,leonarding,10
2,wubiao,20
3,alan,30
4,sun,40
5,liyang,50
到此我们导入和验证完毕,完成了从mysql数据库成功导入HDFS文件系统
十二、从HDFS中导出数据->SQOOP->导入MYSQL数据库
[grid@h1 grid]$ sqoop-1.2.0-CDH3B4/bin/sqoop export --connect jdbc:mysql://192.168.2.110:3306/sqoop --username leo --password leo --table leo1 --export-dir hdfs://h1:9000/user/grid/leo1/part-m-00000 -m 1
参数解释:
--connect jdbc:mysql://192.168.2.110:3306/sqoop 指定mysql数据库主机名和端口号和数据库名
--username leo 指定数据库用户名
--password leo 指定数据库密码
--table leo1 mysql即将导入的表
-m 1
--export-dir hdfs://h1:9000/user/grid/leo1/part-m-00000 HDFS导出文件路径
12/12/15 01:10:01 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
12/12/15 01:10:01 INFO tool.CodeGenTool: Beginning code generation
12/12/15 01:10:02 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `leo1` AS t LIMIT 1
12/12/15 01:10:02 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `leo1` AS t LIMIT 1
12/12/15 01:10:02 INFO orm.CompilationManager: HADOOP_HOME is /home/grid/hadoop-0.20.2/bin/..
12/12/15 01:10:02 INFO orm.CompilationManager: Found hadoop core jar at: /home/grid/hadoop-0.20.2/bin/../hadoop-0.20.2-core.jar
12/12/15 01:10:03 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-grid/compile/d3851be739254c3d3ae5e0e71da52f5c/leo1.jar
12/12/15 01:10:03 INFO mapreduce.ExportJobBase: Beginning export of leo1 始导入
12/12/15 01:10:04 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `leo1` AS t LIMIT 1 导入到哪张表
12/12/15 01:10:04 INFO input.FileInputFormat: Total input paths to process : 1
12/12/15 01:10:04 INFO input.FileInputFormat: Total input paths to process : 1
12/12/15 01:10:04 INFO mapred.JobClient: Running job: job_201212141802_0002 作业编号(开始)
12/12/15 01:10:05 INFO mapred.JobClient: map 0% reduce 0%
12/12/15 01:12:23 INFO mapred.JobClient: map 100% reduce 0%
12/12/15 01:12:25 INFO mapred.JobClient: Job complete: job_201212141802_0002 作业编号(完成)
12/12/15 01:12:26 INFO mapred.JobClient: Counters: 6
12/12/15 01:12:26 INFO mapred.JobClient: Job Counters
12/12/15 01:12:26 INFO mapred.JobClient: Rack-local map tasks=1
12/12/15 01:12:26 INFO mapred.JobClient: Launched map tasks=1 启动一个map进程
12/12/15 01:12:26 INFO mapred.JobClient: FileSystemCounters
12/12/15 01:12:26 INFO mapred.JobClient: HDFS_BYTES_READ=65
12/12/15 01:12:26 INFO mapred.JobClient: Map-Reduce Framework
12/12/15 01:12:26 INFO mapred.JobClient: Map input records=5
12/12/15 01:12:26 INFO mapred.JobClient: Spilled Records=0
12/12/15 01:12:26 INFO mapred.JobClient: Map output records=5
12/12/15 01:12:26 INFO mapreduce.ExportJobBase: Transferred 65 bytes in 141.9968 seconds (0.4578 bytes/sec) 导出65个字节,用时141秒
12/12/15 01:12:26 INFO mapreduce.ExportJobBase: Exported 5 records.
我们在MYSQL中检查一下,已经成功导入到mysql,现在是10条记录比原来多了5条
mysql> select * from leo1;
+---------+------------+-------+
| user_id | user_name | class |
+---------+------------+-------+
| 1 | leonarding | 10 |
| 2 | wubiao | 20 |
| 3 | alan | 30 |
| 4 | sun | 40 |
| 5 | liyang | 50 |
| 1 | leonarding | 10 |
| 2 | wubiao | 20 |
| 3 | alan | 30 |
| 4 | sun | 40 |
| 5 | liyang | 50 |
+---------+------------+-------+
10 rows in set (0.00 sec)
到此我们导出和验证完毕,完成从HDFS文件系统成功导出到mysql数据库
本文转自 leonarding151CTO博客,原文链接:http://blog.51cto.com/leonarding/1092764,如需转载请自行联系原作者