About Hadoop:
Hadoop is a distributed computing framework built around two core services: HDFS for distributed storage and YARN for cluster resource management, with engines such as MapReduce running on top for computation. Besides storing massive data sets, the ecosystem supports fast interactive query: components such as HBase and Impala can return results over very large tables in seconds or less, which lets a bank, for example, serve near-real-time transaction data to its customers. It also supports data mining: custom algorithms run over transaction data can quickly flag suspicious trading records, giving regulators convenient and solid technical support. Although Hadoop is an open-source project that is still maturing, it is already in production at many large companies, including Taobao, Baidu, and Tencent in China, and Yahoo, Amazon, and Facebook abroad.
Deployment environment:
System information

| Name | Version |
| --- | --- |
| Operating system | CentOS release 6.9 (Final) |
| Kernel | 2.6.32-696.el6.x86_64 |
Deployment information

| IP address | Hostname | Specs | Deployed components |
| --- | --- | --- | --- |
| 192.168.199.132 | cdh1 | 2c/1g/20G | NameNode, ResourceManager, HBase, Hive metastore, Impala Catalog, Impala statestore, Sentry |
| 192.168.199.133 | cdh2 | 2c/1g/20G | DataNode, SecondaryNameNode, NodeManager, HBase, Hive Server2, Impala Server |
| 192.168.199.134 | cdh3 | 2c/1g/20G | DataNode, HBase, NodeManager, Hive Server2, Impala Server |
Part 1: Preparation
1. Set the hostnames to cdh1, cdh2, and cdh3 respectively.
2. Configure the hosts file so the three machines can resolve each other.
On cdh1, cdh2, and cdh3, add the following entries:
192.168.199.132 cdh1
192.168.199.133 cdh2
192.168.199.134 cdh3
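Steps 1 and 2 can be scripted on CentOS 6 as follows (a sketch, shown for cdh1; /etc/sysconfig/network is where CentOS 6 persists the hostname):

```shell
# Set the hostname for the current session
hostname cdh1

# Persist it across reboots (CentOS 6 reads HOSTNAME from /etc/sysconfig/network)
sed -i 's/^HOSTNAME=.*/HOSTNAME=cdh1/' /etc/sysconfig/network

# Append the cluster entries to /etc/hosts so all three nodes resolve each other
cat >> /etc/hosts <<'EOF'
192.168.199.132 cdh1
192.168.199.133 cdh2
192.168.199.134 cdh3
EOF
```

Run the same commands on cdh2 and cdh3, substituting the matching hostname.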
3. Synchronize the clocks.
On cdh1: ntpdate cn.pool.ntp.org
On cdh2: ntpdate cdh1
On cdh3: ntpdate cdh1
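ntpdate is a one-shot sync, so the clocks will drift apart again over time; a root cron entry (a sketch, assuming cron is running and cdh1 stays reachable as the local time source) can repeat the sync hourly on cdh2 and cdh3:

```shell
# On cdh2 and cdh3: resync against cdh1 at the top of every hour
echo '0 * * * * root /usr/sbin/ntpdate cdh1 >/dev/null 2>&1' >> /etc/crontab
```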
4. Set up passwordless SSH login between the nodes.
(1) On cdh2:
Generate an RSA key pair:
ssh-keygen -t rsa (press Enter at every prompt)
Copy the public key to every node, including cdh2 itself:
ssh-copy-id -i .ssh/id_rsa.pub cdh1
ssh-copy-id -i .ssh/id_rsa.pub cdh3
ssh-copy-id -i .ssh/id_rsa.pub cdh2
(2) On cdh3:
ssh-keygen -t rsa
ssh-copy-id -i .ssh/id_rsa.pub cdh1
ssh-copy-id -i .ssh/id_rsa.pub cdh2
ssh-copy-id -i .ssh/id_rsa.pub cdh3
(3) On cdh1:
ssh-keygen -t rsa
ssh-copy-id -i .ssh/id_rsa.pub cdh1
ssh-copy-id -i .ssh/id_rsa.pub cdh2
ssh-copy-id -i .ssh/id_rsa.pub cdh3
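Once the keys are distributed, passwordless login can be verified from any node; BatchMode makes ssh fail immediately instead of prompting for a password when a key is missing:

```shell
# Each iteration should print the remote hostname without any password prompt
for h in cdh1 cdh2 cdh3; do
  ssh -o BatchMode=yes "$h" hostname || echo "passwordless login to $h failed"
done
```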
5. CDH requires IPv4; IPv6 is not supported. Disable IPv6 on cdh1, cdh2, and cdh3 as follows:
vim /etc/sysctl.conf
Append the following lines at the end of the file:
#disable ipv6
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
Apply the change:
sysctl -p
Finally, confirm that IPv6 is disabled (a value of 1 means disabled):
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
6. Disable the firewall on all three nodes. Note that iptables -F only flushes the rules currently loaded; also stop the service and keep it off across reboots:
iptables -F
service iptables stop
chkconfig iptables off
7. Install JDK 1.7 on cdh1, cdh2, and cdh3:
rpm -ivh jdk-7u55-linux-x64.rpm
Check the installed version with java -version.
8. Download the cdh5.4.0.tar.gz package (a vanilla Apache Hadoop install is also an option).
After extracting it, configure a yum repository that points at the extracted packages; otherwise the yum installs in the following steps will fail.
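A local repo definition might look like this (a sketch: the repo name and the baseurl path are assumptions that depend on where the tarball was extracted):

```shell
# /etc/yum.repos.d/cdh.repo -- point yum at the extracted CDH 5.4.0 packages
cat > /etc/yum.repos.d/cdh.repo <<'EOF'
[cdh-local]
name=CDH 5.4.0 local repository
baseurl=file:///opt/cdh/5.4.0/
gpgcheck=0
enabled=1
EOF

# Refresh yum metadata and confirm the repo is visible
yum clean all
yum repolist
```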
Part 2: Deploying Hadoop
1. Install the packages.
On cdh1, install hadoop-hdfs-namenode:
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-namenode
On cdh2, install hadoop-hdfs-secondarynamenode and hadoop-hdfs-datanode:
yum install hadoop-hdfs-secondarynamenode -y
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-datanode -y
On cdh3, install hadoop-hdfs-datanode:
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-datanode -y
2. Edit the Hadoop configuration files.
On the NameNode (cdh1), configure the following parameters:
vim /etc/hadoop/conf/core-site.xml
Add the following between the <configuration> and </configuration> tags so the file reads:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cdh1:8020</value>
  </property>
</configuration>
vim /etc/hadoop/conf/hdfs-site.xml
Add the following between the <configuration> and </configuration> tags so the file reads (dfs.namenode.name.dir must appear only once, pointing at the directory created below):
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/dfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/dfs/dn</value>
  </property>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>hadoop</value>
  </property>
</configuration>
On the NameNode (cdh1), create the dfs.namenode.name.dir directory:
mkdir -p /data/dfs/nn
chown -R hdfs:hdfs /data/dfs/nn
chmod 700 /data/dfs/nn
On the DataNodes (cdh2 and cdh3), create the dfs.datanode.data.dir directory:
mkdir -p /data/dfs/dn
chown -R hdfs:hdfs /data/dfs/dn
On the NameNode (cdh1), configure the SecondaryNameNode:
vim /etc/hadoop/conf/hdfs-site.xml
Add the following property between the <configuration> tags:
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>cdh2:50090</value>
</property>
On the NameNode (cdh1), enable WebHDFS:
yum install hadoop-httpfs -y
Then edit /etc/hadoop/conf/core-site.xml to configure the proxy user:
vim /etc/hadoop/conf/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cdh1:8020</value>
  </property>
  <property>
    <name>hadoop.proxyuser.httpfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.httpfs.groups</name>
    <value>*</value>
  </property>
</configuration>
On the NameNode (cdh1), configure LZO support:
cd /etc/yum.repos.d
vim cloudera-gplextras5.repo
[cloudera-gplextras5]
# Packages for Cloudera's GPLExtras, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera's GPLExtras, Version 5
baseurl=http://archive.cloudera.com/gplextras5/redhat/6/x86_64/gplextras/5/
gpgkey = http://archive.cloudera.com/gplextras5/redhat/6/x86_64/gplextras/RPM-GPG-KEY-cloudera
gpgcheck = 1
yum clean all
yum install hadoop-lzo* impala-lzo -y
vim /etc/hadoop/conf/core-site.xml
Add the following between the <configuration> tags:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Part 3: Starting HDFS
Push the configuration files from cdh1 to every other node:
scp -r /etc/hadoop/conf root@cdh2:/etc/hadoop/
scp -r /etc/hadoop/conf root@cdh3:/etc/hadoop/
On cdh1, format the NameNode:
sudo -u hdfs hadoop namenode -format
Start HDFS on every node:
for x in `ls /etc/init.d/|grep hadoop-hdfs` ; do service $x start ; done
Create the /tmp directory in HDFS and set its permissions to 1777:
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
Start the HttpFS service:
service hadoop-httpfs start
Verify
The NameNode web UI is available at http://192.168.199.132:50070/.
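Beyond the web UI, the cluster state can also be checked from the command line (the HttpFS port 14000 is the CDH default; adjust if it was changed):

```shell
# Confirm both DataNodes have registered with the NameNode
sudo -u hdfs hdfs dfsadmin -report

# HttpFS smoke test: list the HDFS root directory over HTTP
curl -s 'http://cdh1:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs'
```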
Part 4: Installing and configuring YARN
Install the services.
On cdh1:
yum install hadoop-yarn hadoop-yarn-resourcemanager -y
Also on cdh1, install the history server and proxy server:
yum install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver -y
On cdh2 and cdh3:
yum install hadoop-yarn hadoop-yarn-nodemanager hadoop-mapreduce -y
On cdh1, edit the configuration:
vim /etc/hadoop/conf/mapred-site.xml
Add the following between the <configuration> tags so the file reads:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
vim /etc/hadoop/conf/yarn-site.xml
Add the following between the <configuration> tags:
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>cdh1:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>cdh1:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>cdh1:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>cdh1:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>cdh1:8088</value>
</property>
In /etc/hadoop/conf/yarn-site.xml on cdh1, also add the following (note that log aggregation is controlled by the single property yarn.log-aggregation-enable):
vim /etc/hadoop/conf/yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/*,
    $HADOOP_COMMON_HOME/lib/*,
    $HADOOP_HDFS_HOME/*,
    $HADOOP_HDFS_HOME/lib/*,
    $HADOOP_MAPRED_HOME/*,
    $HADOOP_MAPRED_HOME/lib/*,
    $HADOOP_YARN_HOME/*,
    $HADOOP_YARN_HOME/lib/*
  </value>
</property>
In /etc/hadoop/conf/yarn-site.xml on cdh1, add the following as well:
vim /etc/hadoop/conf/yarn-site.xml
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data/yarn/logs</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/yarn/apps</value>
</property>
Create the directories referenced by yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs (they are needed on every node running a NodeManager, i.e. cdh2 and cdh3, once the configuration is synced):
mkdir -p /data/yarn/{local,logs}
chown -R yarn:yarn /data/yarn
In HDFS (from cdh1), create the directory referenced by yarn.nodemanager.remote-app-log-dir:
sudo -u hdfs hadoop fs -mkdir -p /yarn/apps
sudo -u hdfs hadoop fs -chown yarn:mapred /yarn/apps
sudo -u hdfs hadoop fs -chmod 1777 /yarn/apps
Configure the MapReduce History Server in /etc/hadoop/conf/mapred-site.xml on cdh1:
vim /etc/hadoop/conf/mapred-site.xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>cdh1:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>cdh1:19888</value>
</property>
Add the following parameters to /etc/hadoop/conf/core-site.xml:
<property>
  <name>hadoop.proxyuser.mapred.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.mapred.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>
Then create the corresponding directory in HDFS (from cdh1):
sudo -u hdfs hadoop fs -mkdir -p /user
sudo -u hdfs hadoop fs -chmod 777 /user
Next, create the history directory in HDFS and set its permissions:
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history
Sync the configuration files from cdh1:
scp -r /etc/hadoop/conf root@cdh2:/etc/hadoop/
scp -r /etc/hadoop/conf root@cdh3:/etc/hadoop/
Start the services.
Start YARN on every node:
for x in `ls /etc/init.d/|grep hadoop-yarn` ; do service $x start ; done
Verify
The YARN management page is available at http://192.168.199.132:8088.
The JobHistory page is available at http://192.168.199.132:19888/jobhistory.
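A quick end-to-end check is to submit one of the bundled example jobs (the jar path shown is the usual CDH package location; adjust it if your install differs):

```shell
# Estimate pi with 2 mappers x 10 samples each; a successful run exercises
# HDFS, YARN scheduling, and the JobHistory server in one shot
sudo -u hdfs hadoop jar \
  /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10
```

If the job completes, it should also appear on the JobHistory page listed above.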