1. Sqoop Architecture
Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Put simply, Sqoop is a bridge between relational databases and Hadoop. Under the hood it runs MapReduce jobs, and it works in two directions (import and export):
A. Import data from a relational database into Hadoop and related systems such as Hive and HBase (see the sketch after this list).
B. Extract data from the Hadoop side and export it back into a relational database.
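A minimal sketch of the two directions; the JDBC URL, credentials, table name, and HDFS paths below are placeholders for illustration, not values from this article's environment:

# RDBMS -> HDFS (import); placeholder connection values
sqoop import \
    --connect jdbc:mysql://db-host:3306/mydb \
    --username dbuser \
    --password dbpass \
    --table my_table \
    --target-dir /user/demo/my_table

# HDFS -> RDBMS (export); the target table must already exist in the database
sqoop export \
    --connect jdbc:mysql://db-host:3306/mydb \
    --username dbuser \
    --password dbpass \
    --table my_table \
    --export-dir /user/demo/my_table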
2. Sqoop import Walkthrough
When running an import you may hit an error caused by Sqoop missing the java-json.jar package. Download it from http://www.java2s.com/Code/Jar/j/Downloadjavajsonjar.htm and place it in the sqoop/lib directory.
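Assuming the jar has been downloaded to the current directory and SQOOP_HOME points at the Sqoop installation, installing it is a single copy:

# put the downloaded jar on Sqoop's classpath
cp java-json.jar $SQOOP_HOME/lib/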
(1) Prepare the data
Create a MySQL table and load some sample data:
CREATE TABLE user_info (
    id int(11) NOT NULL AUTO_INCREMENT,
    username varchar(20) DEFAULT NULL,
    password varchar(20) DEFAULT NULL,
    PRIMARY KEY (id)
);
insert into user_info values(1,'admin','admin');
insert into user_info values(2,'wang','111111');
insert into user_info values(3,'zhang','000000');
insert into user_info values(4,'lili','000000');
insert into user_info values(5,'henry','000000');
insert into user_info values(6,'cherry','000000');
insert into user_info values(7,'ben','111111');
insert into user_info values(8,'leo','000000');
insert into user_info values(9,'test','test');
insert into user_info values(10,'system','000000');
insert into user_info values(11,'xiao','111111');
(2) Import parameters explained
The available import options can be listed with:
sqoop import --help
Import the MySQL data into HDFS. By default 4 map tasks are used and the data lands in the current user's HDFS home directory, here hdfs:/user/caizhengjie:
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --table user_info
Import the MySQL data into HDFS, deleting the target directory first if it already exists, with a single map task and an explicit target directory of hdfs:/user/kfk/sqoop/import/user_info:
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --table user_info \
    --delete-target-dir \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info
[caizhengjie@bigdata-pro-m01 hadoop]$ bin/hdfs dfs -text /user/kfk/sqoop/import/user_info/par*
1,admin,admin
2,wang,111111
3,zhang,000000
4,lili,000000
5,henry,000000
6,cherry,000000
7,ben,111111
8,leo,000000
9,test,test
10,system,000000
11,xiao,111111
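With more than one mapper, Sqoop splits the work on the table's primary key by default; a table without a primary key needs either an explicit split column or a single mapper. A sketch with two mappers splitting on id (same connection values as above):

bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --table user_info \
    --delete-target-dir \
    --split-by id \
    --num-mappers 2 \
    --target-dir /user/kfk/sqoop/import/user_info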
Adding --direct speeds up the transfer (for MySQL it uses the native mysqldump path):
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --table user_info \
    --delete-target-dir \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --direct
Files written to HDFS are comma-delimited by default; the field delimiter can be customized:
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --table user_info \
    --delete-target-dir \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --fields-terminated-by '\t'
[caizhengjie@bigdata-pro-m01 hadoop]$ bin/hdfs dfs -text /user/kfk/sqoop/import/user_info/par*
1    admin    admin
2    wang    111111
3    zhang    000000
4    lili    000000
5    henry    000000
6    cherry    000000
7    ben    111111
8    leo    000000
9    test    test
10    system    000000
11    xiao    111111
Store the data on HDFS in the Parquet file format:
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --table user_info \
    --delete-target-dir \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --fields-terminated-by '\t' \
    --as-parquetfile
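Parquet is a binary columnar format, so hdfs dfs -text output is not human-readable; listing the directory confirms the .parquet files were written. Dumping the rows needs a Parquet-aware tool such as parquet-tools, if one happens to be installed (shown only as a hint, not part of this setup):

bin/hdfs dfs -ls /user/kfk/sqoop/import/user_info
# optionally, copy one file locally and dump it with parquet-tools (if installed):
# bin/hdfs dfs -get /user/kfk/sqoop/import/user_info/<some-file>.parquet .
# parquet-tools cat <some-file>.parquet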
Use Snappy compression:
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --table user_info \
    --delete-target-dir \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --fields-terminated-by '\t' \
    --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Import only the username and password columns of the rows with id > 5 into HDFS:
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --columns 'username,password' \
    --table user_info \
    --where 'id > 5' \
    --delete-target-dir \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --fields-terminated-by '\t'
[caizhengjie@bigdata-pro-m01 hadoop]$ bin/hdfs dfs -text /user/kfk/sqoop/import/user_info/par*
cherry    000000
ben    111111
leo    000000
test    test
system    000000
xiao    111111
Append to the data already on HDFS instead of replacing it; incremental mode requires a check column, specified with --check-column 'id':
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --columns 'username,password' \
    --table user_info \
    --where 'id > 5' \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --fields-terminated-by '\t' \
    --check-column 'id' \
    --incremental append
[caizhengjie@bigdata-pro-m01 hadoop]$ bin/hdfs dfs -text /user/kfk/sqoop/import/user_info/par*
cherry    000000
ben    111111
leo    000000
test    test
system    000000
xiao    111111
cherry    000000
ben    111111
leo    000000
test    test
system    000000
xiao    111111
Incremental append again, this time importing only rows whose id is greater than 10 (the value 10 itself is excluded):
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --columns 'username,password' \
    --table user_info \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --fields-terminated-by '\t' \
    --check-column 'id' \
    --incremental append \
    --last-value 10
[caizhengjie@bigdata-pro-m01 hadoop]$ bin/hdfs dfs -text /user/kfk/sqoop/import/user_info/par*
cherry    000000
ben    111111
leo    000000
test    test
system    000000
xiao    111111
cherry    000000
ben    111111
leo    000000
test    test
system    000000
xiao    111111
xiao    111111
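Tracking --last-value by hand gets tedious for recurring loads; a saved Sqoop job records the last imported value in Sqoop's metastore and updates it after every run. A minimal sketch (the job name incremental_user_info is made up for illustration):

bin/sqoop job --create incremental_user_info -- import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --table user_info \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --fields-terminated-by '\t' \
    --check-column 'id' \
    --incremental append \
    --last-value 0

# execute the saved job; the recorded last-value advances automatically after each run
bin/sqoop job --exec incremental_user_info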
A free-form query can be used instead of --table/--where via --query; the literal $CONDITIONS token must appear in the WHERE clause so Sqoop can substitute its split conditions:
bin/sqoop import \
    --connect jdbc:mysql://bigdata-pro-m01:3306/db_sqoop \
    --username root \
    --password 199911 \
    --columns 'username,password' \
    --num-mappers 1 \
    --target-dir /user/kfk/sqoop/import/user_info \
    --fields-terminated-by '\t' \
    --check-column 'id' \
    --incremental append \
    --query 'select * from user_info where id > 5 and $CONDITIONS'
[caizhengjie@bigdata-pro-m01 hadoop]$ bin/hdfs dfs -text /user/kfk/sqoop/import/user_info/par*
cherry    000000
ben    111111
leo    000000
test    test
system    000000
xiao    111111
cherry    000000
ben    111111
leo    000000
test    test
system    000000
xiao    111111
xiao    111111
6    cherry
7    ben
8    leo
9    test
10    system
11    xiao