0x00 Tutorial Contents
- Writing the Dockerfile
- Preparation before verifying Kafka
- Verifying that Kafka was installed successfully
0x01 Writing the Dockerfile
1. Write the Dockerfile
For convenience, I copied the flume_sny_all directory and named the copy kafka_sny_all.
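For example, assuming both directories sit under /home/shaonaiyi/docker_bigdata (the paths used later in this tutorial), a copy along these lines should do:
cd /home/shaonaiyi/docker_bigdata
cp -r flume_sny_all kafka_sny_all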
a. Kafka installation steps
Reference article: D011 复制粘贴玩大数据之安装与配置Kafka集群
The installation content is exactly the same; here I have simply reorganized it to follow the steps written there.
2. Key points when writing the Dockerfile
Compared with the "0x01 3. a. Dockerfile reference file" section of D010 复制粘贴玩大数据之Dockerfile安装Flume集群, the differences are as follows:
Specific steps:
a. Add the installation package and extract it (the ADD instruction extracts it automatically)
#Add Kafka
ADD ./kafka_2.11-1.0.0.tgz /usr/local/
b. Add environment variables (KAFKA_HOME, PATH)
#Kafka environment variable
ENV KAFKA_HOME /usr/local/kafka_2.11-1.0.0
#Content appended to PATH
$KAFKA_HOME/bin:
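The full ENV PATH line is not reproduced here; as a rough sketch only (assuming the other *_HOME variables already defined in the reference Dockerfile below, not the author's exact line), the appended variable might look like this:
#Append $KAFKA_HOME/bin to the existing PATH (sketch)
ENV PATH $KAFKA_HOME/bin:$FLUME_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin:$ZK_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$JAVA_HOME/bin:$PATH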
c. Add the configuration files (remember to append "&& \" to the preceding statement to indicate it has not ended yet)
&& \
mv /tmp/server.properties $KAFKA_HOME/config/server.properties && \
mv /tmp/init_kafka.sh ~/init_kafka.sh
d. Grant permissions on the Kafka initialization script
#Change the permissions of init_kafka.sh to 700
RUN chmod 700 init_kafka.sh
3. Complete Dockerfile reference
a. Installs hadoop, spark, zookeeper, hbase, hive, flume, and kafka
FROM ubuntu
MAINTAINER shaonaiyi shaonaiyi@163.com

ENV BUILD_ON 2019-03-12

RUN apt-get update -qqy

RUN apt-get -qqy install vim wget net-tools iputils-ping openssh-server
#Add JDK
ADD ./jdk-8u161-linux-x64.tar.gz /usr/local/
#Add hadoop
ADD ./hadoop-2.7.5.tar.gz /usr/local/
#Add scala
ADD ./scala-2.11.8.tgz /usr/local/
#Add spark
ADD ./spark-2.2.0-bin-hadoop2.7.tgz /usr/local/
#Add zookeeper
ADD ./zookeeper-3.4.10.tar.gz /usr/local/
#Add HBase
ADD ./hbase-1.2.6-bin.tar.gz /usr/local/
#Add Hive
ADD ./apache-hive-2.3.3-bin.tar.gz /usr/local/
#Add Flume
ADD ./apache-flume-1.8.0-bin.tar.gz /usr/local/
#Add Kafka
ADD ./kafka_2.11-1.0.0.tgz /usr/local/

ENV CHECKPOINT 2019-03-12
#Add the JAVA_HOME environment variable
ENV JAVA_HOME /usr/local/jdk1.8.0_161
#hadoop environment variable
ENV HADOOP_HOME /usr/local/hadoop-2.7.5
#scala environment variable
ENV SCALA_HOME /usr/local/scala-2.11.8
#spark environment variable
ENV SPARK_HOME /usr/local/spark-2.2.0-bin-hadoop2.7
#zk environment variable
ENV ZK_HOME /usr/local/zookeeper-3.4.10
#HBase environment variable
ENV HBASE_HOME /usr/local/hbase-1.2.6
#Hive environment variable
ENV HIVE_HOME /usr/local/apache-hive-2.3.3-bin
#Flume environment variable
ENV FLUME_HOME /usr/local/apache-flume-1.8.0-bin
#Kafka environment variable
ENV KAFKA_HOME /usr/local/kafka_2.11-1.0.0
#Add the environment variables to the system variables
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
    chmod 600 ~/.ssh/authorized_keys
#Copy the configuration files to the /tmp directory
COPY config /tmp
#Move the configuration files to their correct locations
RUN mv /tmp/ssh_config ~/.ssh/config && \
    mv /tmp/profile /etc/profile && \
    mv /tmp/masters $SPARK_HOME/conf/masters && \
    cp /tmp/slaves $SPARK_HOME/conf/ && \
    mv /tmp/spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf && \
    mv /tmp/spark-env.sh $SPARK_HOME/conf/spark-env.sh && \
    mv /tmp/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
    mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
    mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
    mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
    mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
    mv /tmp/master $HADOOP_HOME/etc/hadoop/master && \
    mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
    mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
    mv /tmp/init_zk.sh ~/init_zk.sh && \
    mkdir -p /usr/local/hadoop2.7/dfs/data && \
    mkdir -p /usr/local/hadoop2.7/dfs/name && \
    mkdir -p /usr/local/zookeeper-3.4.10/datadir && \
    mkdir -p /usr/local/zookeeper-3.4.10/log && \
    mv /tmp/zoo.cfg $ZK_HOME/conf/zoo.cfg && \
    mv /tmp/hbase-env.sh $HBASE_HOME/conf/hbase-env.sh && \
    mv /tmp/hbase-site.xml $HBASE_HOME/conf/hbase-site.xml && \
    mv /tmp/regionservers $HBASE_HOME/conf/regionservers && \
    mv /tmp/hive-env.sh $HIVE_HOME/conf/hive-env.sh && \
    mv /tmp/flume-env.sh $FLUME_HOME/conf/flume-env.sh && \
    mv /tmp/server.properties $KAFKA_HOME/config/server.properties && \
    mv /tmp/init_kafka.sh ~/init_kafka.sh

RUN echo $JAVA_HOME
#Set the working directory
WORKDIR /root
#Start the sshd service
RUN /etc/init.d/ssh start
#Change the permissions of start-hadoop.sh to 700
RUN chmod 700 start-hadoop.sh
#Change the permissions of init_zk.sh to 700
RUN chmod 700 init_zk.sh
#Change the permissions of init_kafka.sh to 700
RUN chmod 700 init_kafka.sh
#Change the root password
RUN echo "root:shaonaiyi" | chpasswd

CMD ["/bin/bash"]
0x02 Preparation Before Verifying Kafka
1. Prepare the environment and resources
a. Install Docker
Please refer to the "0x01 Installing Docker" section of D001.5 Docker入门(超级详细基础篇)
b. Prepare the Kafka installation package and place it in the same directory as the Dockerfile
c. Prepare the Kafka configuration file (placed in the config directory)
cd /home/shaonaiyi/docker_bigdata/kafka_sny_all/config
Configuration file 1: vi server.properties
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured. Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600

############################# Log Basics #############################

# A comma seperated list of directories under which to store log files
log.dirs=/root/logs/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended for to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to exceessive seeks.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=hadoop-master:2181,hadoop-slave1:2181,hadoop-slave2:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000

############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0
d. Modify the environment-variable configuration file (placed in the config directory)
Configuration file 2: vi profile
Content to add:
export KAFKA_HOME=/usr/local/kafka_2.11-1.0.0
export PATH=$PATH:$KAFKA_HOME/bin
e. Write the Kafka initialization script and place it in the config directory (it needs to modify Kafka's broker.id)
vi init_kafka.sh
#!/bin/bash
# Replace the string "broker.id=0" in server.properties with "broker.id=x"; the hadoop-master line can be removed
ssh root@hadoop-master "sed -i 's/broker.id=0/broker.id=0/g' $KAFKA_HOME/config/server.properties"
ssh root@hadoop-slave1 "sed -i 's/broker.id=0/broker.id=1/g' $KAFKA_HOME/config/server.properties"
ssh root@hadoop-slave2 "sed -i 's/broker.id=0/broker.id=2/g' $KAFKA_HOME/config/server.properties"
0x03 Verifying That Kafka Was Installed Successfully
1. Modify the container-creation script
a. Modify the start_containers.sh file (change the image name to shaonaiyi/kafka)
vi start_containers.sh
I changed the three occurrences of shaonaiyi/flume in it to shaonaiyi/kafka.
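If you would rather not edit the file by hand, a sed one-liner along these lines should do the same thing (assuming the image name appears literally in the script):
sed -i 's#shaonaiyi/flume#shaonaiyi/kafka#g' start_containers.sh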
PS: Of course, you could create a brand-new network and change the IPs; I took the lazy route here, reused the old network, and only changed the IPs.
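If you do want a separate network, a minimal sketch could look like the following (the network name, subnet, and address below are made-up examples, not values from this tutorial; only the container name hadoop-master and image shaonaiyi/kafka come from these steps):
docker network create --subnet=172.20.0.0/16 kafka_sny_net
docker run -itd --net kafka_sny_net --ip 172.20.0.2 --name hadoop-master shaonaiyi/kafka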
2. Build the image
a. Delete the previous Flume cluster containers (to save resources); skip this step if they are already deleted
cd /home/shaonaiyi/docker_bigdata/flume_sny_all/config/
If the directory was copied over (so the permission is already set), this command can be skipped: chmod 700 stop_containers.sh
./stop_containers.sh
b. Build the image with hadoop, spark, zookeeper, hbase, hive, flume, and kafka installed (if the earlier shaonaiyi/flume image has not been deleted, this build will be much faster)
cd /home/shaonaiyi/docker_bigdata/kafka_sny_all
docker build -t shaonaiyi/kafka .
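To confirm the image was built, you can list it (an optional check, not part of the original steps):
docker images | grep shaonaiyi/kafka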
3. Create the containers
a. Create the containers (if start_containers.sh is not executable, grant it execute permission first):
config/start_containers.sh
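If the permission is missing, the same kind of chmod used for stop_containers.sh above should work:
chmod 700 config/start_containers.sh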
b. Enter the master container
sh ~/master.sh
4. Start Kafka
a. First make sure the ZooKeeper cluster is already running
b. Start Kafka
Initialization is required before the first start:
./init_kafka.sh
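To confirm that each broker.id was actually changed, you can check on each node, for example:
grep broker.id $KAFKA_HOME/config/server.properties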
Start command (run it on all three nodes; you could also write a small script to do it, as sketched below!):
kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
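A minimal sketch of such a start-all script (my own addition, assuming passwordless SSH between the containers as configured in the Dockerfile):
#!/bin/bash
# Start a Kafka broker on each of the three nodes
for host in hadoop-master hadoop-slave1 hadoop-slave2; do
    ssh root@$host "source /etc/profile; kafka-server-start.sh -daemon \$KAFKA_HOME/config/server.properties"
done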
Check the processes:
./jps_all.sh
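Beyond checking the processes, a quick optional smoke test (the topic name below is made up) is to create and list a topic from the master container:
kafka-topics.sh --create --zookeeper hadoop-master:2181 --replication-factor 3 --partitions 1 --topic test_sny
kafka-topics.sh --list --zookeeper hadoop-master:2181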
0xFF Summary
The installation is simple once you know the steps; if anything is unclear, please refer to the article D011 复制粘贴玩大数据之安装与配置Kafka集群.
In this tutorial we also picked up a new kind of script line: following the format below, it changes broker.id=0 to broker.id=1, and you can adapt it by following the same pattern:
"sed -i 's/broker.id=0/broker.id=1/g' $KAFKA_HOME/config/server.properties"
For commonly used Dockerfile instructions, please refer to the article D004.1 Dockerfile例子详解及常用指令.
So far the basic big data framework has been fully set up, so I can happily get on with writing tutorials about the underlying principles [laughing through tears.jpg]. Of course, it still needs some tuning, which I will come back to when I have time.