Source code: https://github.com/hiszm/hadoop-train
HDFS Overview (Hadoop Distributed File System)
- Distributed
- Commodity, low-cost hardware: decentralized, no reliance on expensive IOE-style equipment
- Fault-tolerant: highly fault tolerant, with 3 replicas by default
- High throughput: moving computation is cheaper than moving data
- Large data sets: data sets are typically at GB and TB scale
HDFS Architecture
HDFS follows a master/slave architecture, consisting of a single NameNode (NN, the master) and multiple DataNodes (DN, the slaves):
- NameNode: executes operations on the file system namespace, which behaves much like most file systems (e.g. Linux): creating, deleting, renaming, and listing files and directories. It also stores the cluster metadata, recording the location of every block of every file.
- DataNode: serves read and write requests from file system clients and performs block creation, deletion, and so on.
- HDFS stores each file as a sequence of blocks; each block is replicated for fault tolerance, and the replicas are stored on DataNodes. Both the block size and the replication factor are configurable (by default the block size is 128 MB and the replication factor is 3).
- HDFS runs on GNU/Linux and is written in Java.
Example
A file a.txt of 150 MB, with a block size of 128 MB,
is split into two blocks: block1 = 128 MB and block2 = 22 MB.
Which DataNodes should block1 and block2 be stored on?
This is transparent to the user; HDFS decides the placement itself.
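The split above is simple integer arithmetic; a minimal sketch in shell, using the sizes from the example:

```shell
# Split a 150 MB file into 128 MB blocks: count the full blocks,
# then the size of the final partial block.
FILE_MB=150
BLOCK_MB=128
FULL=$((FILE_MB / BLOCK_MB))     # number of full 128 MB blocks
REM=$((FILE_MB % BLOCK_MB))      # size of the last, partial block
echo "full blocks: $FULL, last block: $REM MB"
# prints: full blocks: 1, last block: 22 MB
```

On a real cluster, `hdfs fsck /a.txt -files -blocks -locations` shows the actual blocks of a file and the DataNodes they were placed on.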
- File system namespace
  - A user or an application can create directories and store files inside these directories.
  - The file system namespace hierarchy is similar to most other existing file systems (e.g. Linux).
  - One can create and remove files, move a file from one directory to another, or rename a file.
  - HDFS supports user quotas and access permissions, but it does not support hard links or soft links.
  - The NameNode maintains the file system namespace (a single NN, multiple DNs).
Architecture Stability
- Heartbeats and re-replication: each DataNode periodically sends a heartbeat message to the NameNode. If no heartbeat arrives within the configured timeout, the DataNode is marked dead. The NameNode stops forwarding new I/O requests to dead DataNodes and no longer uses the data on them. Because that data is no longer available, some blocks may fall below their specified replication factor; the NameNode tracks such blocks and re-replicates them when necessary.
- Data integrity: blocks stored on a DataNode can become corrupted, for example through storage device failure. To avoid returning corrupted data, HDFS verifies data integrity with checksums: when a client creates an HDFS file, it computes a checksum for each block and stores the checksums in a separate hidden file in the same HDFS namespace. When the client later reads the file, it verifies that the data received from each DataNode matches the checksum stored in the associated checksum file. On a mismatch the data is considered corrupted, and the client fetches another available replica of that block from a different DataNode.
- Metadata disk failure: the FsImage and EditLog are the core data of HDFS; losing them unexpectedly can make the entire HDFS service unavailable. To avoid this, the NameNode can be configured to keep multiple synchronized copies of the FsImage and EditLog, so that any change to either is applied to every copy.
- Snapshot support: snapshots store a copy of the data at a particular point in time; if data is accidentally corrupted, a rollback restores a healthy state.
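The data-integrity scheme can be sketched with local files. This is only an analogy (real HDFS computes per-chunk CRC checksums, 512 bytes per checksum by default, and stores them in hidden `.crc` metadata files; the file names below are made up):

```shell
# Analogy of HDFS block checksumming using local files and cksum.
echo "some block data" > block1.dat
cksum block1.dat | awk '{print $1}' > .block1.crc   # "hidden" checksum file
stored=$(cat .block1.crc)
current=$(cksum block1.dat | awk '{print $1}')
if [ "$stored" = "$current" ]; then
    echo "checksum ok"                      # safe to use this replica
else
    echo "corrupted, fetch another replica" # fall back to a different DN
fi
```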
HDFS Replication
- HDFS stores each file as a sequence of blocks, and the blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file (by default the block size is 128 MB and the replication factor is 3).
- An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and changed later.
To minimize bandwidth consumption and read latency, HDFS serves a read request from the replica closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred; if the cluster spans multiple data centers, a replica in the local data center is preferred.
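The "closest replica" choice amounts to picking the replica with the smallest network distance. A toy sketch (the node names are invented; the scores follow Hadoop's network-topology convention of 0 = same node, 2 = same rack, 4 = off-rack):

```shell
# Three replicas of one block, each tagged with its distance from the reader.
# Sort numerically by distance and take the nearest one.
printf '%s %s\n' dn3 4 dn1 0 dn2 2 |
    sort -k2 -n | head -n1 | awk '{print $1}'   # prints: dn1
```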
Linux Environment
(base) JackSundeMBP:~ jacksun$ ssh hadoop@192.168.68.200
[hadoop@hadoop000 ~]$ pwd
/home/hadoop
[hadoop@hadoop000 ~]$ ls
app Desktop Downloads maven_resp Pictures README.txt software t.txt
data Documents lib Music Public shell Templates Videos
| Directory | Purpose |
| ---- | ---- |
| software | installation packages |
| app | installed software |
| data | data files |
| lib | jar files |
| shell | scripts |
| maven_resp | Maven dependencies |
[hadoop@hadoop000 ~]$ sudo vi /etc/hosts
192.168.68.200 hadoop000
Hadoop Deployment
- The Hadoop distribution used here is CDH
- CDH package downloads: http://archive.cloudera.com/cdh5/cdh/5/
- Hadoop version: hadoop-2.6.0-cdh5.15.1
- Hive version: hive-1.1.0-cdh5.15.1
JDK 1.8 Installation
- Copy the file to the server
scp jdk_name hadoop@192.168.1.200
- Extract the JDK
tar -zxvf jdk_name -C ~/app
- Configure the environment
vi .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
export PATH=$JAVA_HOME/bin:$PATH
Apply the changes with source .bash_profile, then verify:
java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
If the above is printed, the installation succeeded.
Passwordless SSH Login
- Generate a key pair
ssh-keygen -t rsa
cd .ssh
- Append the public key to authorized_keys
cat id_rsa.pub >> authorized_keys
-rw------- 1 hadoop hadoop 796 8月 16 06:17 authorized_keys
-rw------- 1 hadoop hadoop 1675 8月 16 06:14 id_rsa
-rw-r--r-- 1 hadoop hadoop 398 8月 16 06:14 id_rsa.pub
-rw-r--r-- 1 hadoop hadoop 1230 8月 16 18:05 known_hosts
id_rsa is the private key; id_rsa.pub is the public key.
[hadoop@hadoop000 ~]$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:LZvkeJHnqH0AtihqFB2AcQJKwMpH1/DorPi0bIEKcQM.
ECDSA key fingerprint is MD5:9f:b5:f3:bd:f2:aa:61:97:8b:8a:e2:a3:98:5a:e4:3d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Last login: Sun Aug 16 18:03:23 2020 from 192.168.1.3
[hadoop@hadoop000 ~]$ ls
app Desktop lib Pictures shell t.txt
authorized_keys Documents maven_resp Public software Videos
data Downloads Music README.txt Templates
[hadoop@hadoop000 ~]$ ssh localhost
Last login: Sun Aug 16 18:05:21 2020 from 127.0.0.1
Hadoop Installation Layout and hadoop-env Configuration
- Download: https://download.csdn.net/download/jankin6/12668545
- Extract
tar -zxvf hadoop_name.tar.gz -C ~/app
- Add environment variables
- Edit the configuration: set JAVA_HOME in hadoop-env.sh
[hadoop@hadoop000 hadoop]$ ls
capacity-scheduler.xml httpfs-env.sh mapred-env.sh
configuration.xsl httpfs-log4j.properties mapred-queues.xml.template
container-executor.cfg httpfs-signature.secret mapred-site.xml
core-site.xml httpfs-site.xml mapred-site.xml.template
hadoop-env.cmd kms-acls.xml slaves
hadoop-env.sh kms-env.sh ssl-client.xml.example
hadoop-metrics2.properties kms-log4j.properties ssl-server.xml.example
hadoop-metrics.properties kms-site.xml yarn-env.cmd
hadoop-policy.xml log4j.properties yarn-env.sh
hdfs-site.xml mapred-env.cmd yarn-site.xml
[hadoop@hadoop000 hadoop]$ pwd
/home/hadoop/app/hadoop-2.6.0-cdh5.15.1/etc/hadoop
[hadoop@hadoop000 hadoop]$ sudo vi hadoop-env.sh
-----------------------------
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
vi ~/.bash_profile
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1
export PATH=$HADOOP_HOME/bin:$PATH
cd $HADOOP_HOME/bin
- Directory layout
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ ls
bin etc include LICENSE.txt README.txt src
bin-mapreduce1 examples lib logs sbin
cloudera examples-mapreduce1 libexec NOTICE.txt share
| Directory | Purpose |
| ---- | ---- |
| bin | Hadoop client commands |
| etc/hadoop | Hadoop configuration files |
| sbin | scripts that start and stop the Hadoop daemons |
| share | common examples |
HDFS Formatting and Startup
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.15.1/
vi etc/hadoop/core-site.xml:
This puts the NameNode on port 8020 of this machine:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop000:8020</value>
</property>
</configuration>
vi etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/app/tmp</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
vi etc/hadoop/slaves (lists the DataNode hostnames; localhost for a single-node setup)
Format the file system the first time only; do not run it again: hdfs namenode -format
The related commands live in $HADOOP_HOME/bin:
cd $HADOOP_HOME/bin
- Start the cluster
$HADOOP_HOME/sbin/start-dfs.sh
Verify with jps:
[hadoop@hadoop000 sbin]$ jps
13607 NameNode
14073 Jps
13722 DataNode
13915 SecondaryNameNode
- Firewall interference
http://192.168.1.200:50070
If jps shows the processes but the web UI will not open in a browser, the firewall is usually the cause.
Check the firewall: firewall-cmd --state
Stop the firewall: systemctl stop firewalld.service
[hadoop@hadoop000 sbin]$ firewall-cmd --state
not running
- Stop the cluster
$HADOOP_HOME/sbin/stop-dfs.sh
- Note
start-dfs.sh is equivalent to:
hadoop-daemons.sh start namenode
hadoop-daemons.sh start datanode
hadoop-daemons.sh start secondarynamenode
Likewise, stop-dfs.sh runs the corresponding stop commands.
Hadoop Command-Line Operations
After changing environment variables, remember to run source ~/.bash_profile.
[hadoop@hadoop000 bin]$ ./hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
fs run a generic filesystem user client
version print the version
jar <jar> run a jar file
checknative [-a|-h] check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the Hadoop jar and the required libraries
credential interact with credential providers
daemonlog get/set the log level for each daemon
s3guard manage data on S3
trace view and modify Hadoop tracing settings
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
[hadoop@hadoop000 bin]$ ./hadoop fs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-x] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-x] <path> ...]
[-find <path> ... <expression> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]
- Common commands
hadoop fs -ls /
hadoop fs -cat /          (equivalent: hadoop fs -text /)
hadoop fs -put            (equivalent: hadoop fs -copyFromLocal)
hadoop fs -get /README.txt ./
hadoop fs -mkdir /hdfs-test
hadoop fs -mv
hadoop fs -rm
hadoop fs -rmdir
hadoop fs -rmr            (equivalent: hadoop fs -rm -r)
hadoop fs -getmerge
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -put README.txt /
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -ls /
Found 1 items
-rw-r--r-- 1 hadoop supergroup 1366 2020-08-17 21:35 /README.txt
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -cat /README.txt
......
and our wiki, at:
......
Hadoop Core uses the SSL libraries from the Jetty project written
by mortbay.org.
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -get /README.txt ./
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -mkdir /hdfs-test
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -ls /
Found 2 items
-rw-r--r-- 1 hadoop supergroup 1366 2020-08-17 21:35 /README.txt
drwxr-xr-x - hadoop supergroup 0 2020-08-17 21:48 /hdfs-test
HDFS Storage Internals
In the figure above we can see a file split into two blocks, but where are they actually stored?
From this we conclude: put splits one file into n blocks and stores them on different nodes; get first locates the n blocks on those nodes and reads back the corresponding data.
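The put/get mechanics can be mimicked with plain local files. A sketch using only `split` and `cat` (not real HDFS; a 300-byte file with 128-byte "blocks" stands in for the 150 MB example):

```shell
# "put": split a file into fixed-size blocks.
head -c 300 /dev/urandom > a.dat   # a 300-byte "file"
split -b 128 -d a.dat blk_         # blocks blk_00 (128 B), blk_01 (128 B), blk_02 (44 B)
# "get": reassemble the blocks in order and check the result.
cat blk_00 blk_01 blk_02 > restored.dat
cmp -s a.dat restored.dat && echo "files identical"
```

In real HDFS the block-to-node mapping is held by the NameNode, so the client never has to know the block file names.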