Scenario
172.19.9.202  master node  JobManager (master/worker)
172.19.9.201  worker node  TaskManager (master/worker)
172.19.9.203  worker node  TaskManager (master/worker)
Step 1: Configure SSH the same way on the master and worker nodes
ssh-keygen -t rsa -P ""    # generate a key pair with no passphrase
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
ssh localhost              # verify passwordless login to the local machine
Step 2: Add the master's public key to the worker nodes (the other nodes are configured the same way)
scp /root/.ssh/id_rsa.pub root@172.19.9.203:/root/.ssh/authorized_keys
scp /root/.ssh/id_rsa.pub root@172.19.9.201:/root/.ssh/authorized_keys
Note that scp overwrites the target authorized_keys, discarding any keys already in it (including the self-login key from Step 1); appending the key instead (for example with ssh-copy-id) preserves them.
Step 3: From the master, verify that you can log in to both workers without a password
ssh 172.19.9.201
ssh 172.19.9.203
Repeat Steps 2 and 3 on each of the other machines.
Step 4: On 172.19.9.202, go to the Flink configuration directory
cd /usr/local/flink/flink-1.11.2/conf/
Edit flink-conf.yaml as follows (adjust the other settings as needed); point the jobmanager.rpc.address key at the master node:
jobmanager.rpc.address: 172.19.9.202
1. If every machine sets jobmanager.rpc.address: localhost and the masters file lists 201/203, you get an HA cluster.
2. If every machine sets jobmanager.rpc.address: 172.19.9.202, the masters file can only list 202, and you get a standalone cluster with a single master node.
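The two modes can be written as flink-conf.yaml fragments (hosts taken from this setup; only one of the two settings should be active on a given cluster):

```yaml
# HA mode: every node uses localhost for the RPC address, and the masters
# file lists all JobManager candidates.
jobmanager.rpc.address: localhost

# Standalone mode: every node points at the single master 202, and the
# masters file lists only that node.
# jobmanager.rpc.address: 172.19.9.202
```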
1. Responsibilities of the JobManager
The JobManager coordinates the distributed compute nodes and is also called the master node. It schedules tasks, coordinates checkpoints, and handles failure recovery. The JobManager splits a job into multiple tasks and communicates with the TaskManagers through the Actor system to deploy, stop, and cancel tasks. In an HA deployment there are multiple JobManagers: one leader and several followers. The leader is always in the active state and serves the cluster; the followers are in standby, and when the leader goes down a new leader is elected from among the followers to continue serving the cluster. JobManager leader election is implemented with ZooKeeper.
2. Responsibilities of the TaskManager
The TaskManager, also called a worker node, executes the tasks (more precisely, the subtasks) assigned by the JobManager. A TaskManager divides its system resources (CPU, network, memory) into multiple task slots, and tasks run in specific task slots. The TaskManager communicates with the JobManager through the Actor system, periodically reporting the state of its tasks and of the TaskManager itself to the JobManager. Tasks on different TaskManagers exchange state and results through data streams.
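How many task slots each TaskManager offers is configured in flink-conf.yaml; the value below is only an illustration (a common rule of thumb is one slot per CPU core):

```yaml
# Number of task slots this TaskManager divides its resources into
# (illustrative value, not part of the original setup).
taskmanager.numberOfTaskSlots: 2
```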
Set the maximum amount of main memory the JVM is allowed to allocate on each node (in megabytes):
jobmanager.heap.size: 1024m
taskmanager.heap.size: 2048m
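As an aside, in recent Flink releases (1.10/1.11 onward) the heap.size keys above are deprecated in favor of the total-process-size keys; the values below are illustrative, not taken from this setup:

```yaml
# Newer memory model: total JVM process size per daemon.
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 1728m
```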
Edit masters as follows:
172.19.9.202:8081
#172.19.9.201:8081
#172.19.9.203:8081
Edit workers as follows. If the master node 202 is not listed as a worker, the web UI will not show it among the TaskManagers.
172.19.9.203
172.19.9.202
172.19.9.201
Step 5: Download and install ZooKeeper
Version: apache-zookeeper-3.5.6-bin.tar.gz. Extract it, then configure the zoo.cfg file:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/tmp/zookeeper/data
dataLogDir=/tmp/zookeeper/log
# the port at which the clients will connect
clientPort=2181
#
# To run an ensemble, configure the server IPs here:
#server.1=192.168.180.132:2888:3888
#server.2=192.168.180.133:2888:3888
#server.3=192.168.180.134:2888:3888
#
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
After configuring, remember to create the corresponding data and log directories under /tmp.
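A minimal sketch of that step, using the dataDir/dataLogDir paths from the zoo.cfg above (the myid file is my addition; it is only needed once the commented-out server.N ensemble lines are enabled):

```shell
# Create the snapshot and transaction-log directories referenced in zoo.cfg.
mkdir -p /tmp/zookeeper/data /tmp/zookeeper/log
# Each server in an ensemble also needs a myid file matching its server.N
# entry in zoo.cfg; for the single-node setup here it is harmless but unused.
echo 1 > /tmp/zookeeper/data/myid
```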
Then go to the bin directory and start ZooKeeper:
./zkServer.sh start
Step 6: Edit Flink's configuration file flink-conf.yaml so the ZooKeeper settings point at the ZooKeeper instance on 172.19.9.202:
high-availability: zookeeper
# high-availability.storageDir: hdfs:///flink/ha/
# storage path for HA state
high-availability.storageDir: file:///data/flink/checkpoints
high-availability.zookeeper.quorum: 172.19.9.202:2181
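Two related keys are often set alongside these when several Flink clusters share one ZooKeeper; the values below are assumptions, not part of the original setup:

```yaml
# Root znode under which Flink keeps its HA data (assumed value).
high-availability.zookeeper.path.root: /flink
# Distinguishes this cluster's znodes from other clusters on the same
# ZooKeeper (assumed value).
high-availability.cluster-id: /flink_cluster
```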
Step 7: Copy the modified flink-conf.yaml, masters, and workers files to the corresponding conf directory on the other two Flink nodes (/usr/local/flink/flink-1.14.0/conf):
scp /usr/local/flink/flink-1.14.0/conf/flink-conf.yaml root@172.19.9.201:/usr/local/flink/flink-1.14.0/conf/
scp /usr/local/flink/flink-1.14.0/conf/flink-conf.yaml root@172.19.9.203:/usr/local/flink/flink-1.14.0/conf/
scp /usr/local/flink/flink-1.14.0/conf/masters root@172.19.9.201:/usr/local/flink/flink-1.14.0/conf/
scp /usr/local/flink/flink-1.14.0/conf/masters root@172.19.9.203:/usr/local/flink/flink-1.14.0/conf/
scp /usr/local/flink/flink-1.14.0/conf/workers root@172.19.9.201:/usr/local/flink/flink-1.14.0/conf/
scp /usr/local/flink/flink-1.14.0/conf/workers root@172.19.9.203:/usr/local/flink/flink-1.14.0/conf/
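The six copies above can be collapsed into a loop. The sketch below is a dry run: it only prints the scp commands it would execute (pipe the output to sh to actually copy); hosts and paths are the ones used above.

```shell
# Build the list of scp commands for every (host, file) pair and print it;
# nothing is copied yet.
CONF_DIR=/usr/local/flink/flink-1.14.0/conf
CMDS=$(
  for host in 172.19.9.201 172.19.9.203; do
    for f in flink-conf.yaml masters workers; do
      echo "scp $CONF_DIR/$f root@$host:$CONF_DIR/"
    done
  done
)
printf '%s\n' "$CMDS"
```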
Step 8: Start the cluster
In theory, with SSH configured, starting the cluster on the master node alone should bring up the other workers as well. In my case, though, starting on the master 202 showed only one TaskManager; starting 203 made it two, and starting 201 made it three, even though all three workers are listed in the workers file. Why won't they start together? The log output claims they all started. (Note the password prompts in the log below: passwordless SSH was evidently not in effect here, possibly because the scp in Step 2 overwrote each node's authorized_keys.)
After both 203 and 201 are started, three TaskManagers are shown.
Running ./start-cluster.sh on the 202 server alone prints the following:
[root@localhost bin]# ./start-cluster.sh
Starting HA cluster with 3 masters.
root@172.19.9.202's password: Starting standalonesession daemon on host localhost.localdomain3.
root@172.19.9.201's password: Starting standalonesession daemon on host localhost.localdomain2.
root@172.19.9.203's password: Starting standalonesession daemon on host localhost.localdomain4.
root@172.19.9.203's password: Starting taskexecutor daemon on host localhost.localdomain4.
root@172.19.9.202's password: Starting taskexecutor daemon on host localhost.localdomain3.
root@172.19.9.201's password: Starting taskexecutor daemon on host localhost.localdomain2.
[root@localhost bin]#
The log from starting on 203:
[root@localhost bin]# ./start-cluster.sh
Starting HA cluster with 3 masters.
[INFO] 1 instance(s) of standalonesession are already running on localhost.localdomain3.
Starting standalonesession daemon on host localhost.localdomain3.
[INFO] 1 instance(s) of standalonesession are already running on localhost.localdomain2.
Starting standalonesession daemon on host localhost.localdomain2.
root@172.19.9.203's password: [INFO] 1 instance(s) of standalonesession are already running on localhost.localdomain4.
Starting standalonesession daemon on host localhost.localdomain4.
root@172.19.9.203's password: [INFO] 1 instance(s) of taskexecutor are already running on localhost.localdomain4.
Starting taskexecutor daemon on host localhost.localdomain4.
[INFO] 1 instance(s) of taskexecutor are already running on localhost.localdomain3.
Starting taskexecutor daemon on host localhost.localdomain3.
[INFO] 1 instance(s) of taskexecutor are already running on localhost.localdomain2.
Starting taskexecutor daemon on host localhost.localdomain2.
Then it dawned on me: because I put all three server addresses in the masters file, all three servers are masters as well as workers, three masters and three workers, so starting one master by itself does not also start the master processes on the other two.
So I set Flink's masters file to the single node 202 and tried again. After the change, running start-cluster.sh on the 202 server alone still showed only one TaskManager. It turned out I had forgotten to update the masters files on 201 and 203. But even after fixing that, starting only the master still did not bring the other nodes up with it, so the workers have to be started by hand (e.g. ./taskmanager.sh start in the Flink bin directory on each worker).