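DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.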
Environment setup:
Download the DataX package
cd /opt/
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Extract the downloaded archive
tar zxvf datax.tar.gz
Delete the hidden files
rm -rf /opt/datax/plugin/*/._*
If these hidden files are not deleted, running a job fails with an error.
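The same cleanup can be scripted, which is handy when the install has to be repeated on several hosts. The following is a minimal sketch, assuming the install lives under /opt/datax as above; the helper name `remove_hidden_entries` is hypothetical and not part of DataX.

```python
import shutil
from pathlib import Path


def remove_hidden_entries(plugin_dir):
    """Delete every entry matching <plugin_dir>/*/._* and return the removed paths.

    Mirrors `rm -rf /opt/datax/plugin/*/._*`: files and directories are both removed.
    """
    removed = []
    for path in sorted(Path(plugin_dir).glob("*/._*")):
        if path.is_dir() and not path.is_symlink():
            shutil.rmtree(path)
        else:
            path.unlink()
        removed.append(path)
    return removed


if __name__ == "__main__":
    # Assumed install location, taken from the steps above.
    for entry in remove_hidden_entries("/opt/datax/plugin"):
        print("removed", entry)
```

Returning the removed paths makes it easy to log what was cleaned up before the first job is run.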
Verify that the installation succeeded
cd /opt/datax/bin/
python datax.py ../job/job.json
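The verification step can also be wrapped in a small script, for example to run it from a scheduler. This is a rough sketch under a few assumptions: `build_datax_command` and `run_job` are hypothetical helper names, `python_bin` must point to an interpreter that the bundled datax.py supports, and the script is assumed to exit with a non-zero status when the job fails.

```python
import subprocess
from pathlib import PurePosixPath


def build_datax_command(datax_home, job_file, python_bin="python"):
    """Build the argument list equivalent to `python datax.py <job_file>`."""
    datax_py = PurePosixPath(datax_home) / "bin" / "datax.py"
    return [python_bin, str(datax_py), str(job_file)]


def run_job(datax_home, job_file, python_bin="python"):
    """Run the job and return True when the process exits with status 0.

    Assumption: datax.py reports failure through a non-zero exit status.
    """
    result = subprocess.run(build_datax_command(datax_home, job_file, python_bin))
    return result.returncode == 0


if __name__ == "__main__":
    ok = run_job("/opt/datax", "/opt/datax/job/job.json")
    print("job succeeded" if ok else "job failed")
```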
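Issues:
Under the hood, the reader scans the whole table with select *, which can put a heavy load on the source database, so the risk is high.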
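One way to soften this is to list the columns explicitly and narrow the read with a filter. The sketch below builds an oraclereader block that way; it relies on the `where` and `splitPk` reader parameters described in the DataX oraclereader documentation, while the helper name `build_oracle_reader` and the sample column names are made up for illustration. Whether this actually reduces load depends on the table's indexes and the filter chosen.

```python
def build_oracle_reader(jdbc_url, table, columns, username, password,
                        where=None, split_pk=None):
    """Return an oraclereader config dict with explicit columns instead of '*'."""
    if not columns or "*" in columns:
        raise ValueError("list the columns explicitly instead of using '*'")
    parameter = {
        "username": username,
        "password": password,
        "column": list(columns),
        "connection": [{"jdbcUrl": [jdbc_url], "table": [table]}],
    }
    if where:
        # Filter condition applied by the reader, e.g. an incremental window.
        parameter["where"] = where
    if split_pk:
        # Column the reader may use to split the read into ranges.
        parameter["splitPk"] = split_pk
    return {"name": "oraclereader", "parameter": parameter}


if __name__ == "__main__":
    # ID and NAME are placeholder column names.
    reader = build_oracle_reader(
        "jdbc:oracle:thin:@//ip:port/database", "table",
        ["ID", "NAME"], "username", "password", where="ID > 100",
    )
    print(reader)
```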
Example job: oracle->hdfs
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "oraclereader",
          "parameter": {
            "column": ["*"],
            "connection": [
              {
                "jdbcUrl": ["jdbc:oracle:thin:@//ip:port/database"],
                "table": ["table"]
              }
            ],
            "password": "password",
            "username": "username"
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "column": ["*"],
            "defaultFS": "hdfs://ip:port",
            "fieldDelimiter": " ",
            "fileName": "oracle.txt",
            "fileType": "text",
            "path": "path",
            "writeMode": "append"
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": "1"
      }
    }
  }
}
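Before handing a job file like this to datax.py, a quick structural check can catch typos early. The following is only a sketch: `check_job` is a hypothetical helper, and the rules it applies (valid JSON, non-empty content, named reader and writer, a warning on column "*") are illustrative sanity checks, not DataX's own validation.

```python
import json


def check_job(job_text):
    """Return a list of problems found in a DataX job given as a JSON string."""
    try:
        job = json.loads(job_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(job, dict) or not isinstance(job.get("job"), dict):
        return ["top level must be an object with a 'job' object"]

    problems = []
    content = job["job"].get("content") or []
    if not content:
        problems.append("job.content is empty")
    for i, item in enumerate(content):
        for side in ("reader", "writer"):
            plugin = item.get(side) or {}
            if not plugin.get("name"):
                problems.append(f"content[{i}].{side} has no name")
            columns = (plugin.get("parameter") or {}).get("column") or []
            if "*" in columns:
                problems.append(f"content[{i}].{side} uses column '*'")
    return problems


if __name__ == "__main__":
    # Assumed job location, taken from the verification step.
    with open("/opt/datax/job/job.json", encoding="utf-8") as handle:
        for problem in check_job(handle.read()):
            print("warning:", problem)
```

An empty list means none of these checks fired; it does not guarantee that the job will run.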