
Linux远程批量工具mooon_ssh和mooon_upload使用示例

目录
1. 前言
2. 批量执行命令工具:mooon_ssh
3. 批量上传文件工具:mooon_upload
4. 使用示例
4.1. 使用示例1:上传/etc/hosts
4.2. 使用示例2:检查/etc/profile文件是否一致
4.3. 使用示例3:批量查看crontab
4.4. 使用示例4:批量清空crontab
4.5. 使用示例5:批量更新crontab
4.6. 使用示例6:取远端机器IP
4.7. 使用示例7:批量查看kafka进程(环境变量方式)
4.8. 使用示例8:批量停止kafka进程(参数方式)
5. 如何编译批量工具?
5.1. GO版本
5.2. C++版本

1. 前言
远程批量工具包含:
1) 批量命令工具mooon_ssh;
2) 批量上传文件工具mooon_upload;
3) 批量下载文件工具mooon_download。
可执行二进制包下载地址:https://github.com/eyjian/libmooon/releases
源代码包下载地址:https://github.com/eyjian/libmooon/archive/master.zip
批量工具除由三个工具组成外,还分两个版本:
1) C++版本
2) GO版本
当前C++版本比较成熟,GO版本相当简略。C++版本依赖C++运行时库,不同环境需要特定编译;而GO版本不依赖C和C++运行时库,无需编译即可应用到广泛的Linux环境。
使用简单,直接执行命令即会提示用法,如C++版本:
$ mooon_ssh
parameter[-c]'s value not set
usage:
-h[]: Connect to the remote machines on the given hosts separated by comma, can be replaced by environment variable 'H', example: -h='192.168.1.10,192.168.1.11'
-P[36000/10,65535]: Specifies the port to connect to on the remote machines, can be replaced by environment variable 'PORT'
-u[]: Specifies the user to log in as on the remote machines, can be replaced by environment variable 'U'
-p[]: The password to use when connecting to the remote machines, can be replaced by environment variable 'P'
-t[60/1,65535]: The number of seconds before connection timeout
-c[]: The command is executed on the remote machines, example: -c='grep ERROR /tmp/*.log'
-v[1/0,2]: Verbosity, how much troubleshooting info to print

2. 批量执行命令工具:mooon_ssh
参数名 | 默认值 | 说明
-u | 无 | 用户名参数,可用环境变量U替代
-p | 无 | 密码参数,可用环境变量P替代
-h | 无 | IP列表参数,可用环境变量H替代
-P | 22(可修改源码,编译为常用端口号) | SSH端口参数,可用环境变量PORT替代
-c | 无 | 在远程机器上执行的命令,建议以单引号方式指定值,除非要执行的命令本身已包含单引号产生冲突;使用双引号时要注意转义,否则会被本地shell解释
-v | 1 | 工具输出的详细度

3. 批量上传文件工具:mooon_upload
参数名 | 默认值 | 说明
-u | 无 | 用户名参数,可用环境变量U替代
-p | 无 | 密码参数,可用环境变量P替代
-h | 无 | IP列表参数,可用环境变量H替代
-P | 22(可修改源码,编译为常用端口号) | SSH端口参数,可用环境变量PORT替代
-s | 无 | 以逗号分隔的、需要上传的本地文件列表,可以带相对或绝对目录
-d | 无 | 文件上传到远程机器的目录,只能为单个目录

4. 使用示例
4.1. 使用示例1:上传/etc/hosts
mooon_upload -s=/etc/hosts -d=/etc
4.2. 使用示例2:检查/etc/profile文件是否一致
mooon_ssh -c='md5sum /etc/profile'
4.3. 使用示例3:批量查看crontab
mooon_ssh -c='crontab -l'
4.4. 使用示例4:批量清空crontab
mooon_ssh -c='rm -f /tmp/crontab.empty;touch /tmp/crontab.empty'
mooon_ssh -c='crontab /tmp/crontab.empty'
4.5. 使用示例5:批量更新crontab
mooon_ssh -c='crontab /tmp/crontab.online'
4.6. 使用示例6:取远端机器IP
因为awk本身用到了单引号,所以参数“-c”的值不能再用单引号括起来,内容需要转义,相对其它示例要复杂一点:
mooon_ssh -c="netstat -ie | awk -F[\\ :]+ 'BEGIN{ok=0;}{if (match(\$0, \"eth1\")) ok=1; if ((1==ok) && match(\$0,\"inet\")) { ok=0; if (7==NF) printf(\"%s\\n\",\$3); else printf(\"%s\\n\",\$4);} }'"
不同的环境,IP在“netstat -ie”输出中的位置稍有不同,所以awk中加了“7==NF”判断,但仍不一定适用于所有环境。需要转义的字符包括:双引号、美元符和反斜杠。
4.7.
使用示例7:批量查看kafka进程(环境变量方式)
$ export H=192.168.31.15,192.168.31.16,192.168.31.17,192.168.31.18,192.168.31.19
$ export U=kafka
$ export P='123456'
$ mooon_ssh -c='/usr/local/jdk/bin/jps -m'
[192.168.31.15]
50928 Kafka /data/kafka/config/server.properties
125735 Jps -m
[192.168.31.15] SUCCESS

[192.168.31.16]
147842 Jps -m
174902 Kafka /data/kafka/config/server.properties
[192.168.31.16] SUCCESS

[192.168.31.17]
51409 Kafka /data/kafka/config/server.properties
178771 Jps -m
[192.168.31.17] SUCCESS

[192.168.31.18]
73568 Jps -m
62314 Kafka /data/kafka/config/server.properties
[192.168.31.18] SUCCESS

[192.168.31.19]
123908 Jps -m
182845 Kafka /data/kafka/config/server.properties
[192.168.31.19] SUCCESS

================================
[192.168.31.15 SUCCESS] 0 seconds
[192.168.31.16 SUCCESS] 0 seconds
[192.168.31.17 SUCCESS] 0 seconds
[192.168.31.18 SUCCESS] 0 seconds
[192.168.31.19 SUCCESS] 0 seconds
SUCCESS: 5, FAILURE: 0

4.8. 使用示例8:批量停止kafka进程(参数方式)
$ mooon_ssh -c='/data/kafka/bin/kafka-server-stop.sh' -u=kafka -p='123456' -h=192.168.31.15,192.168.31.16,192.168.31.17,192.168.31.18,192.168.31.19
[192.168.31.15]
No kafka server to stop
command return 1

[192.168.31.16]
No kafka server to stop
command return 1

[192.168.31.17]
No kafka server to stop
command return 1

[192.168.31.18]
No kafka server to stop
command return 1

[192.168.31.19]
No kafka server to stop
command return 1

================================
[192.168.31.15 FAILURE] 0 seconds
[192.168.31.16 FAILURE] 0 seconds
[192.168.31.17 FAILURE] 0 seconds
[192.168.31.18 FAILURE] 0 seconds
[192.168.31.19 FAILURE] 0 seconds
SUCCESS: 0, FAILURE: 5

5. 如何编译批量工具?
5.1. GO版本
依赖的crypto包从https://github.com/golang/crypto下载,放到目录$GOPATH/src/golang.org/x或$GOROOT/src/golang.org/x下。注意需要先创建好目录$GOROOT/src/golang.org/x,然后在此目录下解压crypto包。如果下载的包名为crypto-master.zip,则解压后的目录名为crypto-master,需要重命名为crypto。
安装crypto包示例:
1) 安装go
cd /usr/local
tar xzf go1.10.3.linux-386.tar.gz
2) mkdir -p go/src/golang.org/x
3) cd go/src/golang.org/x
4) unzip crypto-master.zip
5) mv crypto-master crypto
命令行执行“go help gopath”可了解gopath,或执行“go env”查看当前的设置。编译方法:
go build -o mooon_ssh mooon_ssh.go
上述编译会依赖glibc,如果不想依赖,可这样编译:
go build -o mooon_ssh -ldflags '-linkmode "external" -extldflags "-static"' mooon_ssh.go
5.2. C++版本
C++版本为libmooon的组成部分,编译libmooon即可得到mooon_ssh、mooon_upload和mooon_download。但libmooon依赖libssh2,而libssh2又依赖openssl,所以需要先依次安装好openssl和libssh2。
libssh2下载地址:http://www.libssh2.org。
1) openssl编译安装方法
解压后进入openssl源码目录,以版本openssl-1.0.2i为例,依次执行:
./config --prefix=/usr/local/openssl-1.0.2i shared threads
make
make install
ln -s /usr/local/openssl-1.0.2i /usr/local/openssl
2) libssh2编译安装方法
解压后进入libssh2源码目录,以版本libssh2-1.6.0为例,依次执行:
./configure --prefix=/usr/local/libssh2-1.6.0 --with-libssl-prefix=/usr/local/openssl
make
make install
注意:libssh2和较新版本的openssl可能存在兼容问题。
3) libmooon编译方法
libmooon采用cmake编译,所以需要先安装好cmake,并要求cmake版本不低于2.8.11(可执行“cmake --version”查看)。当cmake、libssh2和openssl准备好后,执行下列命令编译libmooon,即可得到批量工具:
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/usr/local/mooon .
make
make install
在make一步成功后,即可在tools子目录中找到mooon_ssh、mooon_upload和mooon_download,实践中mooon_download可能使用得比较少。
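最后给出一个把第2、3节参数组合使用的示意脚本(非工具自带示例),其中IP列表、用户名、密码和文件路径均为假设值,仅演示环境变量方式(H、U、P)与mooon_upload、mooon_ssh的配合,实际使用时请替换为真实值:

#!/bin/bash
# 示例:先批量分发文件,再批量校验结果(IP、用户名、密码、路径均为假设值)
export H=192.168.1.10,192.168.1.11,192.168.1.12   # IP列表,等同于参数-h
export U=root                                      # 用户名,等同于参数-u
export P='password'                                # 密码,等同于参数-p

# 1) 把本地/etc/hosts上传到所有远程机器的/etc目录(对应参数-s和-d)
mooon_upload -s=/etc/hosts -d=/etc

# 2) 批量校验各机器上该文件的md5是否一致
mooon_ssh -c='md5sum /etc/hosts'

如果远程机器的SSH端口不是编译时指定的默认端口,再配合环境变量PORT或参数-P指定即可。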
Kafka常用命令收录.pdf 目录 目录 1 1. 前言 2 2. Broker默认端口号 2 3. 启动Kafka 2 4. 创建Topic 2 5. 列出所有Topic 2 6. 删除Topic 3 7. 查看Topic 3 8. 增加topic的partition数 3 9. 生产消息 3 10. 消费消息 4 11. 查看有哪些消费者Group 4 12. 查看新消费者详情 4 13. 查看Group详情 5 14. 删除Group 5 15. 设置consumer group的offset 5 16. RdKafka自带示例 5 17. 平衡leader 6 18. 自带压测工具 6 19. 查看topic指定分区offset的最大值或最小值 6 20. 查看__consumer_offsets 6 21. 获取指定consumer group的位移信息 6 22. 20) 查看kafka的zookeeper 7 23. 如何增加__consumer_offsets的副本数? 9 24. 问题 9 附1:进程监控工具process_monitor.sh 9 附2:批量操作工具 10 附2.1:批量执行命令工具:mooon_ssh 10 附2.2:批量上传文件工具:mooon_upload 11 附2.3:使用示例 11 附3:批量设置broker.id和listeners工具 13 附4:批量设置hostname工具 13 附5:Kafka监控工具kafka-manager 13 附6:kafka的安装 14 附7:__consumer_offsets 15 1. 前言 本文内容主要来自两个方面:一是网上的分享,二是自研的随手记。日记月累,收录kafka各种命令,会持续更新。 在0.9.0.0之后的Kafka,出现了几个新变动,一个是在Server端增加了GroupCoordinator这个角色,另一个较大的变动是将topic的offset 信息由之前存储在zookeeper上改为存储到一个特殊的topic(__consumer_offsets)中。 2. Broker默认端口号 9092,建议安装时,在zookeeper中指定kafka的根目录,比如“/kafka”,而不是直接使用“/”,这样多套kafka也可共享同一个zookeeper集群。 3. 启动Kafka kafka-server-start.sh config/server.properties 后台常驻方式,请带上参数“-daemon”,如: /usr/local/kafka/bin/kafka-server-start.sh -daemon /usr/local/kafka/config/server.properties 4. 创建Topic 参数--topic指定Topic名,--partitions指定分区数,--replication-factor指定备份数: kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test 注意,如果配置文件server.properties指定了kafka在zookeeper上的目录,则参数也要指定,否则会报无可用的brokers,如: kafka-topics.sh --create --zookeeper localhost:2181/kafka --replication-factor 1 --partitions 1 --topic test 5. 列出所有Topic kafka-topics.sh --list --zookeeper localhost:2181 注意,如果配置文件server.properties指定了kafka在zookeeper上的目录,则参数也要指定,否则会报无可用的brokers,如: kafka-topics.sh --list --zookeeper localhost:2181/kafka 输出示例: __consumer_offsets my-replicated-topic test 6. 删除Topic 1) kafka-topics.sh --zookeeper localhost:2181 --topic test --delete 2) kafka-topics.sh --zookeeper localhost:2181/kafka --topic test --delete 3) kafka-run-class.sh kafka.admin.DeleteTopicCommand --zookeeper localhost:2181 --topic test 7. 查看Topic kafka-topics.sh --describe --zookeeper localhost:2181 --topic test 注意,如果配置文件server.properties指定了kafka在zookeeper上的目录,则参数也要指定,否则会报无可用的brokers,如: kafka-topics.sh --describe --zookeeper localhost:2181/kafka --topic test 输出示例: Topic:test PartitionCount:3 ReplicationFactor:2 Configs: Topic: test Partition: 0 Leader: 140 Replicas: 140,214 Isr: 140,214 Topic: test Partition: 1 Leader: 214 Replicas: 214,215 Isr: 214,215 Topic: test Partition: 2 Leader: 215 Replicas: 215,138 Isr: 215,138 8. 增加topic的partition数 kafka-topics.sh --zookeeper localhost:2181 --alter --topic test --partitions 5 9. 生产消息 kafka-console-producer.sh --broker-list localhost:9092 --topic test 10. 消费消息 1) 从头开始 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning 2) 从尾部开始 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset latest 3) 指定分区 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset latest --partition 1 4) 取指定个数 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset latest --partition 1 --max-messages 1 5) 新消费者(ver>=0.9) kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --new-consumer --from-beginning --consumer.config config/consumer.properties 11. 
查看有哪些消费者Group 1) 分ZooKeeper方式(老) kafka-consumer-groups.sh --zookeeper 127.0.0.1:2181/kafka --list 2) API方式(新) kafka-consumer-groups.sh --new-consumer --bootstrap-server 127.0.0.1:9092 --list 输出示例: test console-consumer-37602 console-consumer-75637 console-consumer-59893 12. 查看新消费者详情 仅支持offset存储在zookeeper上的: kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group test 13. 查看Group详情 kafka-consumer-groups.sh --new-consumer --bootstrap-server 127.0.0.1:9092 --group test --describe 输出示例: TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID test 1 87 87 0 - - - 14. 删除Group 老版本的ZooKeeper方式可以删除Group,新版本则自动删除,当执行: kafka-consumer-groups.sh --new-consumer --bootstrap-server 127.0.0.1:9092 --group test --delete 输出如下提示: Option '[delete]' is only valid with '[zookeeper]'. Note that there's no need to delete group metadata for the new consumer as the group is deleted when the last committed offset for that group expires. 15. 设置consumer group的offset 执行zkCli.sh进入zookeeper命令行界面,假设需将group为testgroup的topic的offset设置为2018,则:set /consumers/testgroup/offsets/test/0 2018 如果kakfa在zookeeper中的根目录不是“/”,而是“/kafka”,则: set /kafka/consumers/testgroup/offsets/test/0 2018 另外,还可以使用kafka自带工具kafka-run-class.sh kafka.tools.UpdateOffsetsInZK修改,命令用法: kafka.tools.UpdateOffsetsInZK$ [earliest | latest] consumer.properties topic 从用法提示可以看出,只能修改为earliest或latest,没有直接修改zookeeper灵活。 16. RdKafka自带示例 rdkafka_consumer_example -b 127.0.0.1:9092 -g test test rdkafka_consumer_example -e -b 127.0.0.1:9092 -g test test 17. 平衡leader kafka-preferred-replica-election.sh --zookeeper localhost:2181/chroot 18. 自带压测工具 kafka-producer-perf-test.sh --topic test --num-records 100 --record-size 1 --throughput 100 --producer-props bootstrap.servers=localhost:9092 19. 查看topic指定分区offset的最大值或最小值 time为-1时表示最大值,为-2时表示最小值: kafka-run-class.sh kafka.tools.GetOffsetShell --topic hive-mdatabase-hostsltable --time -1 --broker-list 127.0.0.1:9092 --partitions 0 20. 查看__consumer_offsets 需consumer.properties中设置exclude.internal.topics=false: 1) 0.11.0.0之前版本 kafka-console-consumer.sh --topic __consumer_offsets --zookeeper localhost:2181 --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter" --consumer.config config/consumer.properties --from-beginning 2) 0.11.0.0之后版本(含) kafka-console-consumer.sh --topic __consumer_offsets --zookeeper localhost:2181 --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" --consumer.config config/consumer.properties --from-beginning 21. 获取指定consumer group的位移信息 需consumer.properties中设置exclude.internal.topics=false: 1) 0.11.0.0版本之前: kafka-simple-consumer-shell.sh --topic __consumer_offsets --partition 11 --broker-list localhost:9091,localhost:9092,localhost:9093 --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter" 2) 0.11.0.0版本以后(含): kafka-simple-consumer-shell.sh --topic __consumer_offsets --partition 11 --broker-list localhost:9091,localhost:9092,localhost:9093 --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" 22. 
20) 查看kafka的zookeeper 1) 查看Kakfa在zookeeper的根目录 [zk: localhost:2181(CONNECTED) 0] ls /kafka [cluster, controller_epoch, controller, brokers, admin, isr_change_notification, consumers, config] 2) 查看brokers [zk: localhost:2181(CONNECTED) 1] ls /kafka/brokers [ids, topics, seqid] 3) 查看有哪些brokers(214和215等为server.properties中配置的broker.id值): [zk: localhost:2181(CONNECTED) 2] ls /kafka/brokers/ids [214, 215, 138, 139] 4) 查看broker 214,下列数据显示该broker没有设置JMX_PORT: [zk: localhost:2181(CONNECTED) 4] get /kafka/brokers/ids/214 {"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://test-204:9092"],"jmx_port":-1,"host":"test-204","timestamp":"1498467464861","port":9092,"version":4} cZxid = 0x200002400 ctime = Mon Jun 26 16:57:44 CST 2017 mZxid = 0x200002400 mtime = Mon Jun 26 16:57:44 CST 2017 pZxid = 0x200002400 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x45b9d9e841f0136 dataLength = 190 numChildren = 0 5) 查看controller,下列数据显示broker 214为controller: [zk: localhost:2181(CONNECTED) 9] get /kafka/controller {"version":1,"brokerid":214,"timestamp":"1498467946988"} cZxid = 0x200002438 ctime = Mon Jun 26 17:05:46 CST 2017 mZxid = 0x200002438 mtime = Mon Jun 26 17:05:46 CST 2017 pZxid = 0x200002438 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x45b9d9e841f0136 dataLength = 56 numChildren = 0 6) 查看kafka集群的id: [zk: localhost:2181(CONNECTED) 13] get /kafka/cluster/id {"version":"1","id":"OCAEJy4qSf29bhwOfO7kNQ"} cZxid = 0x2000023e7 ctime = Mon Jun 26 16:57:28 CST 2017 mZxid = 0x2000023e7 mtime = Mon Jun 26 16:57:28 CST 2017 pZxid = 0x2000023e7 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x0 dataLength = 45 numChildren = 0 7) 查看有哪些topics: [zk: localhost:2181(CONNECTED) 16] ls /kafka/brokers/topics [test, my-replicated-topic, test1, test2, test3, test123, __consumer_offsets, info] 8) 查看topic下有哪些partitions: [zk: localhost:2181(CONNECTED) 19] ls /kafka/brokers/topics/__consumer_offsets/partitions [44, 45, 46, 47, 48, 49, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43] 9) 查看“partition 0”的状态: [zk: localhost:2181(CONNECTED) 22] get /kafka/brokers/topics/__consumer_offsets/partitions/0/state {"controller_epoch":2,"leader":215,"version":1,"leader_epoch":1,"isr":[215,214]} cZxid = 0x2000024c6 ctime = Mon Jun 26 18:02:07 CST 2017 mZxid = 0x200bc4fc3 mtime = Mon Aug 27 18:58:10 CST 2018 pZxid = 0x2000024c6 cversion = 0 dataVersion = 1 aclVersion = 0 ephemeralOwner = 0x0 dataLength = 80 numChildren = 0 23. 如何增加__consumer_offsets的副本数? 可使用kafka-reassign-partitions.sh来增加__consumer_offsets的副本数,方法如下: 构造一JSON文件reassign.json: { "version":1, "partitions":[ {"topic":"__consumer_offsets","partition":0,"replicas":[1,2,3]}, {"topic":"__consumer_offsets","partition":1,"replicas":[2,3,1]}, {"topic":"__consumer_offsets","partition":2,"replicas":[3,1,2]}, {"topic":"__consumer_offsets","partition":3,"replicas":[1,2,3]}, ... {"topic":"__consumer_offsets","partition":100,"replicas":[2,3,1]} ] } 然后执行: kafka-reassign-partitions.sh --zookeeper localhost:2181/kafka --reassignment-json-file reassign.json --execute “[1,2,3]”中的数字为broker.id值。 24. 问题 1) -190,Local: Unknown partition 比如单机版只有一个分区,但prodcue参数的分区值为1等。 2) Rdkafka程序日志“delivery failed. 
errMsg:[Local: Message timed out]” 附1:进程监控工具process_monitor.sh process_monitor.sh为shell脚本,本身含详细的使用说明和帮助提示。适合放在crontab中,检测到进程不在时,3秒左右时间重拉起。支持不同用户运行相同程序,也支持同一用户带不同参数运行相同程序。 下载网址: https://github.com/eyjian/libmooon/blob/master/shell/process_monitor.sh 使用示例: * * * * * /usr/local/bin/process_monitor.sh "/usr/local/jdk/bin/java kafkaServer" "/data/kafka/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties" 由于所有的java程序均运行在JVM中,所以程序名均为java,“kafkaServer”用于限定只监控kafka。如果同一用户运行多个kafka实例,则需加端口号区分,并且要求端口号为命令行参数,和“kafkaServer”共同组成匹配模式。 当检测到进程不存在时,则执行第三列的重启指令“/data/kafka/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties”。 使用示例2,监控zooekeeper: * * * * * /usr/local/bin/process_monitor.sh "/usr/local/jdk/bin/java -Dzookeeper" "/data/zookeeper/bin/zkServer.sh start" 附2:批量操作工具 适用用来批量安装kafka和日常运维。 下载网址: https://github.com/eyjian/libmooon/releases 监控工具有两个版本:一是C++版本,另一是GO版本。当前C++版本比较成熟,GO版本相当简略,但C++版本依赖C++运行时库,不同环境需要特定编译,而GO版本可不依赖C和C++运行时库,所以不需编译即可应用到广泛的Linux环境。 使用简单,直接执行命令,即会提示用法。 附2.1:批量执行命令工具:mooon_ssh 参数名 默认值 说明 -u 无 用户名参数,可用环境变量U替代 -p 无 密码参数,可用环境变量P替代 -h 无 IP列表参数,可用环境变量H替代 -P 22,可修改源码,编译为常用端口号 SSH端口参数,可用环境变量PORT替代 -c 无 在远程机器上执行的命令,建议单引号方式指定值,除非要执行的命令本身已经包含了单引号有冲突。使用双引号时,要注意转义,否则会被本地shell解释 -v 1 工具输出的详细度 附2.2:批量上传文件工具:mooon_upload 参数名 默认值 说明 -u 无 用户名参数,可用环境变量U替代 -p 无 密码参数,可用环境变量P替代 -h 无 IP列表参数,可用环境变量H替代 -P 22,可修改源码,编译为常用端口号 SSH端口参数,可用环境变量PORT替代 -s 无 以逗号分隔的,需要上传的本地文件列表,可以带相对或绝对目录 -d 无 文件上传到远程机器的目录,只能为单个目录 附2.3:使用示例 1) 使用示例1:上传/etc/hosts mooon_upload -s=/etc/hosts -d=/etc 2) 使用示例2:检查/etc/profile文件是否一致 mooon_ssh -c='md5sum /etc/hosts' 3) 使用示例3:批量查看crontab mooon_ssh -c='crontab -l' 4) 使用示例4:批量清空crontab mooon_ssh -c='rm -f /tmp/crontab.empty;touch /tmp/crontab.empty' mooon_ssh -c='crontab /tmp/crontab.emtpy' 5) 使用示例5:批量更新crontab mooon_ssh -c='crontab /tmp/crontab.online' 6) 使用示例6:取远端机器IP 因为awk用单引号,所以参数“-c”的值不能使用单引号,所以内容需要转义,相对其它来说要复杂点: mooon_ssh -c="netstat -ie | awk -F[\\ :]+ 'BEGIN{ok=0;}{if (match(\$0, \"eth1\")) ok=1; if ((1==ok) && match(\$0,\"inet\")) { ok=0; if (7==NF) printf(\"%s\\n\",\$3); else printf(\"%s\\n\",\$4);} }'" 不同的环境,IP在“netstat -ie”输出中的位置稍有不同,所以awk中加了“7==NF”判断,但仍不一定适用于所有的环境。需要转义的字符包含:双引号、美元符和斜杠。 7) 使用示例7:批量查看kafka进程(环境变量方式) $ export H=192.168.31.9,192.168.31.10,192.168.31.11,192.168.31.12,192.168.31.13 $ export U=kafka $ export P='123456' $ mooon_ssh -c='/usr/local/jdk/bin/jps -m' [192.168.31.15] 50928 Kafka /data/kafka/config/server.properties 125735 Jps -m [192.168.31.15] SUCCESS [192.168.31.16] 147842 Jps -m 174902 Kafka /data/kafka/config/server.properties [192.168.31.16] SUCCESS [192.168.31.17] 51409 Kafka /data/kafka/config/server.properties 178771 Jps -m [192.168.31.17] SUCCESS [192.168.31.18] 73568 Jps -m 62314 Kafka /data/kafka/config/server.properties [192.168.31.18] SUCCESS [192.168.31.19] 123908 Jps -m 182845 Kafka /data/kafka/config/server.properties [192.168.31.19] SUCCESS ================================ [192.168.31.15 SUCCESS] 0 seconds [192.168.31.16 SUCCESS] 0 seconds [192.168.31.17 SUCCESS] 0 seconds [192.168.31.18 SUCCESS] 0 seconds [192.168.31.19 SUCCESS] 0 seconds SUCCESS: 5, FAILURE: 0 8) 使用示例8:批量停止kafka进程(参数方式) $ mooon_ssh -c='/data/kafka/bin/kafka-server-stop.sh' -u=kafka -p='123456' -h=192.168.31.15,192.168.31.16,192.168.31.17,192.168.31.18,192.168.31.19 [192.168.31.15] No kafka server to stop command return 1 [192.168.31.16] No kafka server to stop command return 1 [192.168.31.17] No kafka server to stop command return 1 [192.168.31.18] No kafka server to stop command return 1 [192.168.31.19] No kafka server to stop command 
return 1 ================================ [192.168.31.15 FAILURE] 0 seconds [192.168.31.16 FAILURE] 0 seconds [192.168.31.17 FAILURE] 0 seconds [192.168.31.18 FAILURE] 0 seconds [192.168.31.19 FAILURE] 0 seconds SUCCESS: 0, FAILURE: 5 附3:批量设置broker.id和listeners工具 为shell脚本,有详细的使用说明和帮助提示,依赖mooon_ssh和mooon_upload: https://github.com/eyjian/libmooon/blob/master/shell/set_kafka_id_and_ip.sh 附4:批量设置hostname工具 为shell脚本,有详细的使用说明和帮助提示,依赖mooon_ssh和mooon_upload: https://github.com/eyjian/libmooon/blob/master/shell/set_hostname.sh 附5:Kafka监控工具kafka-manager 官网:https://github.com/yahoo/kafka-manager kafka-manager的数据主要来源两个方便:一是kafka的zookeeper数据,二是kafka的JMX数据。 kafka-manager要求JDK版本不低于1.8,从源码编译kafka-manager相对复杂,但编译拿到二进制包后,只需修改application.conf中的“kafka-manager.zkhosts”值,即可开始启动kafka-manager。“kafka-manager.zkhosts”值,不是kafka的zookeeper配置值,而是kafka-manager自己用的zookeeper配置,所以两者可以为不同的zookeeper,注意值用双引号引起来。 crontab启动示例: JMX_PORT=9999 * * * * * /usr/local/bin/process_monitor.sh "/usr/local/jdk/bin/java kafkaServer" "/data/kafka/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties" 指定JMX_PORT不是必须的,但建议设置,这样kafka-manager可以更详细的查看brokers。 crontab中启动kafka-manager示例(指定服务端口为8080,不指定的默认值为9000): * * * * * /usr/local/bin/process_monitor.sh "/usr/local/jdk/bin/java kafka-manager" "/data/kafka/kafka-manager/bin/kafka-manager -Dconfig.file=/data/kafka/kafka-manager/conf/application.conf -Dhttp.port=8080 > /dev/null 2>&1" process_monitor.sh下载: https://github.com/eyjian/libmooon/blob/master/shell/process_monitor.sh 注意crontab的用户密码有效,crontab才能正常执行。 附6:kafka的安装 最基本的两个配置项为server.properties文件中的: 1) Broker.id 2) zookeeper.connect 其中broker.id每个节点要求不同,zookeeper.connect值建议指定目录,不要直接放在zookeeper根目录下。另外也建议设置listeners值,不然需要客户端配置hostname和IP的映射关系。 因broker.id和listeners的原因,每个节点的server.properties不一致,可利用工具set_kafka_id_and_ip.sh实现批量的替换,以简化kafka集群的部署。set_kafka_id_and_ip.sh下载地址:https://github.com/eyjian/libmooon/blob/master/shell/set_kafka_id_and_ip.sh。 crontab中启动kafka示例: JMX_PORT=9999 * * * * * /usr/local/bin/process_monitor.sh "/usr/local/jdk/bin/java kafkaServer" "/data/kafka/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties" 设置JMX_PORT是为方便kafka-manager管理kafka。 附7:__consumer_offsets __consumer_offsets是kafka内置的Topic,在0.9.0.0之后的Kafka,将topic的offset 信息由之前存储在zookeeper上改为存储到内置的__consumer_offsets中。 server.properties中的配置项num.partitions和default.replication.factor对__consumer_offsets无效,而是受offsets.topic.num.partitions和offsets.topic.replication.factor两个控制。
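补充:针对第23节“如何增加__consumer_offsets的副本数”,逐个手写几十上百个分区的reassign.json比较繁琐,下面给出一个生成该JSON并执行重分配的示意脚本。其中分区数、broker.id列表(1、2、3)和zookeeper地址均为假设值(分区数由offsets.topic.num.partitions决定,默认为50,正文示例中为0~100),请按实际环境修改:

#!/bin/bash
# 示意脚本:为__consumer_offsets生成副本重分配JSON并执行(参照正文第23节)
# 假设:分区数50、broker.id为1/2/3、kafka在zookeeper中的根目录为/kafka,均需按实际修改
partitions=50
brokers=(1 2 3)
json_file=reassign.json

{
  echo '{"version":1,"partitions":['
  for ((p = 0; p < partitions; p++)); do
    # 轮转分配3个副本,效果同正文中[1,2,3]、[2,3,1]、[3,1,2]的排列
    a=${brokers[p % 3]}
    b=${brokers[(p + 1) % 3]}
    c=${brokers[(p + 2) % 3]}
    if ((p + 1 < partitions)); then sep=","; else sep=""; fi
    echo "  {\"topic\":\"__consumer_offsets\",\"partition\":$p,\"replicas\":[$a,$b,$c]}$sep"
  done
  echo ']}'
} > "$json_file"

# 执行重分配,工具及参数见正文第23节
kafka-reassign-partitions.sh --zookeeper localhost:2181/kafka \
  --reassignment-json-file "$json_file" --execute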
版本:redis-3.0.5 redis-3.2.0 redis-3.2.9 redis-4.0.11 参考:http://redis.io/topics/cluster-tutorial。 集群部署交互式命令行工具:https://github.com/eyjian/redis-tools/tree/master/deploy 集群运维命令行工具:https://github.com/eyjian/redis-tools/tree/master 批量操作工具:https://github.com/eyjian/libmooon/releases 目录 目录 1 1. 前言 2 2. 部署计划 2 3. 目录结构 2 4. 编译安装 3 5. 修改系统参数 3 5.1. 修改最大可打开文件数 3 5.2. TCP监听队列大小 4 5.3. OOM相关:vm.overcommit_memory 5 5.4. /sys/kernel/mm/transparent_hugepage/enabled 5 6. 配置redis 5 7. 启动redis实例 8 8. 创建和启动redis cluster前的准备工作 9 8.1. 安装ruby 9 8.2. 安装rubygems 9 8.3. 安装redis-3.0.0.gem 9 9. redis-trib.rb 10 10. 创建和启动redis集群 10 10.1. 复制redis-trib.rb 10 10.2. 创建redis cluster 11 10.3. ps aux|grep redis 12 11. redis cluster client 13 11.1. 命令行工具redis-cli 13 11.2. 从slaves读数据 13 11.3. jedis(java cluster client) 13 11.4. r3c(C++ cluster client) 14 12. 新增节点 14 12.1. 添加一个新主(master)节点 14 12.2. 添加一个新从(slave)节点 15 13. 删除节点 15 14. master机器硬件故障 17 15. 检查节点状态 17 16. 变更主从关系 17 17. slots相关命令 17 17.1. 迁移slosts 18 17.2. redis-trib.rb rebalance 18 18. 人工主备切换 18 19. 查看集群信息 19 20. 禁止指定命令 20 21. 各版本配置文件 20 22. 大压力下Redis参数调整要点 20 23. 问题排查 22 1. 前言 本文参考官方文档而成:http://redis.io/topics/cluster-tutorial。经测试,安装过程也适用于redis-3.2.0、redis-4.0.11等。 Redis运维工具和部署工具:https://github.com/eyjian/redis-tools。 2. 部署计划 依据官网介绍,部署6个redis节点,为3主3从。3台物理机每台都创建2个redis节点: 服务端口 IP地址 配置文件名 6379 192.168.0.251 redis-6379.conf 6379 192.168.0.252 redis-6379.conf 6379 192.168.0.253 redis-6379.conf 6380 192.168.0.251 redis-6380.conf 6380 192.168.0.252 redis-6380.conf 6380 192.168.0.253 redis-6380.conf 疑问:3台物理机,会不会主和从节点分布在同一个物理机上? 3. 目录结构 redis.conf为从https://raw.githubusercontent.com/antirez/redis/3.0/redis.conf下载的配置文件。redis-6379.conf和redis-6380.conf指定了服务端口,两者均通过include复用(包含)了redis.conf。 本文将redis安装在/data/redis(每台机器完全相同,同一台机器上的多个节点对应相同的目录和文件,并建议将bin目录加入到环境变量PATH中,以简化后续的使用): /data/redis |-- bin | |-- redis-benchmark | |-- redis-check-aof | |-- redis-check-dump | |-- redis-cli | |-- redis-sentinel -> redis-server | `-- redis-server |-- conf | |-- redis-6379.conf | |-- redis-6380.conf | `-- redis.conf `-- log 3 directories, 9 files 4. 编译安装 打开redis的Makefile文件,可以看到如下内容: PREFIX?=/usr/local INSTALL_BIN=$(PREFIX)/bin INSTALL=install Makefile中的“?=”表示,如果该变量之前没有定义过,则赋值为/usr/local,否则什么也不做。 如果不设置环境变量PREFIX或不修改Makefile中的值,则默认安装到/usr/local/bin目录下。建议不要使用默认配置,而是指定安装目录,如/data/redis-3.0.5: $ make $ make install PREFIX=/data/redis-3.0.5 $ ln -s /data/redis-3.0.5 /data/redis $ mkdir /data/redis/conf $ mkdir /data/redis/log $ mkdir /data/redis/data 5. 修改系统参数 5.1. 修改最大可打开文件数 修改文件/etc/security/limits.conf,加入以下两行: * soft nofile 102400 * hard nofile 102400 # End of file 其中102400为一个进程最大可以打开的文件个数,当与RedisServer的连接数多时,需要设定为合适的值。 有些环境修改后,root用户需要重启机器才生效,而普通用户重新登录后即生效。如果是crontab,则需要重启crontab,如:service crond restart,有些平台可能是service cron restart。 有些环境下列设置即可让root重新登录即生效,而不用重启机器: root soft nofile 102400 root hard nofile 102400 # End of file 但是要小心,有些环境上面这样做,可能导致无法ssh登录,所以在修改时最好打开两个窗口,万一登录不了还可自救。 如何确认更改对一个进程生效?按下列方法(其中$PID为被查的进程ID): $ cat /proc/$PID/limits 系统关于/etc/security/limits.conf文件的说明: #This file sets the resource limits for the users logged in via PAM. #It does not affect resource limits of the system services. PAM:全称“Pluggable Authentication Modules”,中文名“插入式认证模块”。/etc/security/limits.conf实际为pam_limits.so(位置:/lib/security/pam_limits.so)的配置文件,只针对单个会话。要使用limits.conf生效,必须保证pam_limits.so被加入到了启动文件中。 注释说明只对通过PAM登录的用户生效,与PAM相关的文件(均位于/etc/pam.d目录下): /etc/pam.d/login /etc/pam.d/sshd /etc/pam.d/crond 如果需要设置Linux用户的密码策略,可以修改文件/etc/login.defs,但这个只对新增的用户有效,如果要影响已有用户,可使用命令chage。 5.2. 
TCP监听队列大小
即TCP listen的backlog大小,“/proc/sys/net/core/somaxconn”的默认值一般较小(如128),需要改大一点,比如改成32767。立即生效可以使用命令:sysctl -w net.core.somaxconn=32767。
要想永久生效,需要在文件/etc/sysctl.conf中增加一行:net.core.somaxconn = 32767,然后执行命令“sysctl -p”使之生效。
Redis配置项tcp-backlog的值不能超过somaxconn的大小。
5.3. OOM相关:vm.overcommit_memory
如果“/proc/sys/vm/overcommit_memory”的值为0(内核默认的启发式策略),在内存紧张时Redis的fork操作(如后台保存RDB、重写AOF)可能因申请不到内存而失败。建议设置为1,即允许内核过量分配内存,设置方法请参照net.core.somaxconn完成。
5.4. /sys/kernel/mm/transparent_hugepage/enabled
默认值为“[always] madvise never”,建议设置为never,以关闭内核的“Transparent Huge Pages (THP)”特性(THP可能带来内存访问延迟增大以及fork时的额外开销),设置后redis进程需要重启。为了永久生效,请将“echo never > /sys/kernel/mm/transparent_hugepage/enabled”加入到文件/etc/rc.local中。
什么是Transparent Huge Pages?为提升性能,内核通过大内存页(每页2MB)来替代传统的4KB页,使得需要管理的虚拟地址数变少,加快从虚拟地址到物理地址的映射,并减少内存页面的换入换出,以提高内存的整体性能。相应的系统进程为khugepaged。
在Linux中,有两种方式使用Huge Pages:一种是2.6内核引入的HugeTLBFS,另一种是2.6.36内核引入的THP。HugeTLBFS主要用于数据库,THP则广泛应用于一般应用程序。一般可以在rc.local或/etc/default/grub中对Huge Pages进行设置。
6. 配置redis
从https://raw.githubusercontent.com/antirez/redis/3.0/redis.conf下载配置文件(也可直接复制源代码包中的redis.conf),在它的基础上进行如下表所示的修改(配置文件名redis-6379.conf中的6379建议设置为实际使用的端口号):

redis-6379.conf(节点6379专用配置,格式为“配置项 | 值 | 说明”):
include | redis.conf | 引用公共的配置文件,建议为全路径值
port | 6379 | 客户端连接端口;同时总有一个比它刚好大10000的端口(如16379),用于集群节点间的内部通讯
cluster-config-file | nodes-6379.conf | 默认放在dir指定的目录下
pidfile | /var/run/redis-6379.pid | 只有当daemonize值为yes时才有意义;并且要求对目录/var/run有写权限,否则可以考虑设置为/tmp/redis-6379.pid
dir | /data/redis/data/6379 | AOF和RDB文件的存放目录
dbfilename | dump-6379.rdb | RDB文件名,位于dir指定的目录下
appendonly | yes |
appendfilename | "appendonly-6379.aof" |
logfile | /data/redis/log/redis-6379.log | 日志文件,包含目录和文件名,注意redis不会自动滚动日志文件

redis-6380.conf(节点6380专用配置,与6379同理):
include | redis.conf | 引用公共的配置文件
port | 6380 |
cluster-config-file | nodes-6380.conf | 默认放在dir指定的目录下
pidfile | /var/run/redis-6380.pid |
dir | /data/redis/data/6380 | AOF和RDB文件存放目录
dbfilename | dump-6380.rdb | RDB文件名
appendfilename | appendonly-6380.aof | AOF文件名
logfile | /data/redis/log/redis-6380.log |

redis.conf(公共配置文件):
loglevel | verbose | 日志级别,建议为notice;另外注意redis不会滚动日志文件,每次写日志都是先打开日志文件、写完再关闭的方式
maxclients | 10000 | 最大连接数
timeout | 0 | 客户端多长时间(秒)没有发包过来则关闭连接,0表示永不关闭
cluster-enabled | yes | 表示以集群方式运行,为no表示以非集群方式运行
cluster-node-timeout | 15000 | 单位为毫秒:repl-ping-slave-period+(cluster-node-timeout*cluster-slave-validity-factor)。判断节点失效(fail)之前允许不可用的最大时长(毫秒),如果master不可用时长超过此值,则会被failover。不能太小,建议使用默认值15000
cluster-slave-validity-factor | 0 | 如果要最大的可用性,值设置为0。该值定义slave与master失联时长的倍数,如果为0,则只要失联slave就总是尝试发起failover,而不管与master失联多久。失联最大时长为:cluster-slave-validity-factor*cluster-node-timeout
repl-timeout | 10 | 该配置项的值要求大于repl-ping-slave-period的值
repl-ping-slave-period | 1 | 定义slave多久(秒)ping一次master,如果超过repl-timeout指定的时长都没有收到响应,则认为master挂了
slave-read-only | yes | slave是否只读
slave-serve-stale-data | yes | 当slave与master断开连接时,slave是否继续提供服务
slave-priority | 100 | slave权重值,当master挂掉时,只有权重最大的slave接替master
aof-use-rdb-preamble | | 4.0新增配置项,用于控制是否启用RDB-AOF混用,值为no表示关闭
appendonly | yes | 当同时开启AOF和RDB时,redis启动时只会加载AOF,AOF包含了全量数据。如果当队列使用且入队压力很大,建议设置为no
appendfsync | no | 可取值everysec,其中no表示由系统自动调度刷盘。当写压力很大时,建议设置为no,否则容易造成整个集群不可用
daemonize | yes | 相关配置项为pidfile
protected-mode | no | 3.2.0新增的配置项,默认值为yes,限制只能从127.0.0.1登录Redis server。为保证redis-trib.rb工具的正常运行,需要设置为no,完成后可以改回yes,但每次使用redis-trib.rb时都需要改回no。要想从非127.0.0.1的机器访问,也需要设置为no
tcp-backlog | 32767 | 取值不能超过系统的/proc/sys/net/core/somaxconn
auto-aof-rewrite-percentage | 100 | 设置自动rewrite AOF文件的增长百分比(手工rewrite只需要调用命令BGREWRITEAOF)
auto-aof-rewrite-min-size | 64mb | 触发rewrite的AOF文件大小,只有大于此大小时才会触发rewrite
no-appendfsync-on-rewrite | yes | 子进程在做rewrite时,主进程不调用fsync(由内核默认调度)
stop-writes-on-bgsave-error | yes | 如果因为磁盘故障等导致保存RDB失败,则停止写操作;可设置为no
cluster-require-full-coverage | no | 为no表示有slots不可服务时,其它slots仍然继续服务
maxmemory | 26843545600 | 设置最大可用内存,单位为字节
maxmemory-policy |
volatile-lru 设置达到最大内存时的淘汰策略 client-output-buffer-limit 设置master端的客户端缓存,三种:normal、slave和pubsub cluster-migration-barrier 1 最少slave数,用来保证集群中不会有裸奔的master。当某个master节点的slave节点挂掉裸奔后,会从其他富余的master节点分配一个slave节点过来,确保每个master节点都有至少一个slave节点,不至于因为master节点挂掉而没有相应slave节点替换为master节点导致集群崩溃不可用。 repl-backlog-size 1mb 当slave失联时的,环形复制缓区大小,值越大可容忍更长的slave失联时长 repl-backlog-ttl slave失联的时长达到该值时,释放backlog缓冲区 save save 900 1 save 300 10 save 60 10000 刷新快照(RDB)到磁盘的策略,根据实际调整值,“save 900 1”表示900秒后至少有1个key被修改才触发save操作,其它类推。 注意执行flushall命令也会产生RDB文件,不过是空文件。 如果不想生成RDB文件,可以将save全注释掉。 7. 启动redis实例 登录3台物理机,启动两个redis实例(启动之前,需要创建好配置中的各目录): 1) redis-server redis-6379.conf 2) redis-server redis-6380.conf 可以写一个启动脚本start-redis-cluster.sh: #!/bin/sh REDIS_HOME=/data/redis $REDIS_HOME/bin/redis-server $REDIS_HOME/conf/redis-6379.conf $REDIS_HOME/bin/redis-server $REDIS_HOME/conf/redis-6380.conf 8. 创建和启动redis cluster前的准备工作 上一步启动的redis只是单机版本,在启动redis cluster之前,需要完成如下一些依赖的安装。在此之后,才可以创建和启动redis cluster。 8.1. 安装ruby 安装命令:yum install ruby 安装过程中,如提示“[y/d/N]”,请选“y”然后回车。 查看版本: $ ruby --version ruby 2.0.0p353 (2013-11-22) [x86_64-linux] 也可以从Ruby官网https://www.ruby-lang.org下载安装包(如ruby-2.3.1.tar.gz)来安装Ruby。截至2016/5/13,Ruby的最新稳定版本为Ruby 2.3.1。 8.2. 安装rubygems 安装命令:yum install rubygems 如果不使用yum安装,也可以手动安装RubyGems,RubyGems是一个Ruby包管理框架,它的下载网址:https://rubygems.org/pages/download。 比如下载安装包rubygems-2.6.4.zip后解压,然后进入解压生成的目录,里面有个setup.rb文件,以root用户执行:ruby setup.rb安装RubyGems。 8.3. 安装redis-3.0.0.gem 安装命令:gem install -l redis-3.0.0.gem 安装之前,需要先下载好redis-3.0.0.gem。 redis-3.0.0.gem官网:https://rubygems.org/gems/redis/versions/3.0.0 redis-3.0.0.gem下载网址:https://rubygems.org/downloads/redis-3.0.0.gem redis-3.3.0.gem官网:https://rubygems.org/gems/redis/versions/3.3.0 redis-3.3.3.gem官网:https://rubygems.org/gems/redis/versions/3.3.3 redis-4.0.1.gem官网:https://rubygems.org/gems/redis/versions/4.0.1 集群的创建只需要在一个节点上操作,所以只需要在一个节点上安装redis-X.X.X.gem即可。 # gem install -l redis-3.3.3.gem Successfully installed redis-3.3.3 Parsing documentation for redis-3.3.3 Installing ri documentation for redis-3.3.3 Done installing documentation for redis after 1 seconds 1 gem installed 9. redis-trib.rb redis-trib.rb是redis官方提供的redis cluster管理工具,使用ruby实现。 10. 创建和启动redis集群 10.1. 复制redis-trib.rb 将redis源代码的src目录下的集群管理程序redis-trib.rb复制到/data/redis/bin目录,并将bin目录加入到环境变量PATH中,以简化后续的操作。 redis-trib.rb用法(不带任何参数执行redis-trib.rb即显示用法): $ ./redis-trib.rb Usage: redis-trib rebalance host:port --auto-weights --timeout --pipeline --use-empty-masters --weight --threshold --simulate add-node new_host:new_port existing_host:existing_port --slave --master-id reshard host:port --timeout --pipeline --yes --slots --to --from check host:port set-timeout host:port milliseconds call host:port command arg arg .. arg fix host:port --timeout info host:port create host1:port1 ... hostN:portN --replicas import host:port --replace --copy --from help (show this help) del-node host:port node_id For check, fix, reshard, del-node, set-timeout you can specify the host and port of any working node in the cluster. 10.2. 创建redis cluster 创建命令(3主3从): redis-trib.rb create --replicas 1 192.168.0.251:6379 192.168.0.252:6379 192.168.0.253:6379 192.168.0.251:6380 192.168.0.252:6380 192.168.0.253:6380 ? 
参数说明: 1) create 表示创建一个redis cluster集群。 2) --replicas 1 表示为集群中的每一个主节点指定一个从节点,即一比一的复制。\ 运行过程中,会有个提示,输入yes回车即可。从屏幕输出,可以很容易地看出哪些是主(master)节点,哪些是从(slave)节点: >>> Creating cluster Connecting to node 192.168.0.251:6379: OK /usr/local/share/gems/gems/redis-3.0.0/lib/redis.rb:182: warning: wrong element type nil at 0 (expected array) /usr/local/share/gems/gems/redis-3.0.0/lib/redis.rb:182: warning: ignoring wrong elements is deprecated, remove them explicitly /usr/local/share/gems/gems/redis-3.0.0/lib/redis.rb:182: warning: this causes ArgumentError in the next release >>> Performing hash slots allocation on 6 nodes... Using 3 masters: 192.168.0.251:6379 192.168.0.252:6379 192.168.0.253:6379 Adding replica 192.168.0.252:6380 to 192.168.0.251:6379 Adding replica 192.168.0.251:6380 to 192.168.0.252:6379 Adding replica 192.168.0.253:6380 to 192.168.0.253:6379 M: 150f77d1000003811fb3c38c3768526a0b25ec31 192.168.0.251:6379 slots:0-5460 (5461 slots) master M: de461d3337b17d2119b79024d57d8b119e7320a6 192.168.0.252:6379 slots:5461-10922 (5462 slots) master M: faf50658fb7b0bae64cee5371da782e0f4919eee 192.168.0.253:6379 slots:10923-16383 (5461 slots) master S: c567db02cc40eebf577f71f703214dd2f4f26dfb 192.168.0.251:6380 replicates de461d3337b17d2119b79024d57d8b119e7320a6 S: 284f8196b250ad9ac272316db84a07bebf661ab7 192.168.0.252:6380 replicates 150f77d1000003811fb3c38c3768526a0b25ec31 S: 39fdef9fd5778dc94d8add819789d7d73ca06899 192.168.0.253:6380 replicates faf50658fb7b0bae64cee5371da782e0f4919eee Can I set the above configuration? (type 'yes' to accept): yes >>> Nodes configuration updated >>> Assign a different config epoch to each node >>> Sending CLUSTER MEET messages to join the cluster Waiting for the cluster to join.... >>> Performing Cluster Check (using node 192.168.0.251:6379) M: 150f77d1000003811fb3c38c3768526a0b25ec31 192.168.0.251:6379 slots:0-5460 (5461 slots) master M: de461d3337b17d2119b79024d57d8b119e7320a6 192.168.0.252:6379 slots:5461-10922 (5462 slots) master M: faf50658fb7b0bae64cee5371da782e0f4919eee 192.168.0.253:6379 slots:10923-16383 (5461 slots) master M: c567db02cc40eebf577f71f703214dd2f4f26dfb 192.168.0.251:6380 slots: (0 slots) master replicates de461d3337b17d2119b79024d57d8b119e7320a6 M: 284f8196b250ad9ac272316db84a07bebf661ab7 192.168.0.252:6380 slots: (0 slots) master replicates 150f77d1000003811fb3c38c3768526a0b25ec31 M: 39fdef9fd5778dc94d8add819789d7d73ca06899 192.168.0.253:6380 slots: (0 slots) master replicates faf50658fb7b0bae64cee5371da782e0f4919eee [OK] All nodes agree about slots configuration. >>> Check for open slots... >>> Check slots coverage... [OK] All 16384 slots covered. 10.3. ps aux|grep redis [test@test-168-251 ~]$ ps aux|grep redis test 3824 0.7 5.9 6742404 3885144 ? Ssl 2015 1639:13 /data/redis/bin/redis-server *:6379 [cluster] test 3831 0.5 3.9 6709636 2618536 ? Ssl 2015 1235:43 /data/redis/bin/redis-server *:6380 [cluster] 停止redis实例,直接使用kill命令即可,如:kill 3831,重启和单机版相同,经过上述一系列操作后,重启会自动转换成cluster模式。。 11. redis cluster client 11.1. 命令行工具redis-cli 官方提供的命令行客户端工具,在单机版redis基础上指定参数“-c”即可。以下是在192.168.0.251上执行redis-cli的记录: $ ./redis-cli -c -p 6379 127.0.0.1:6379> set foo bar -> Redirected to slot [12182] located at 192.168.0.253:6379 OK 192.168.0.253:6379> set hello world -> Redirected to slot [866] located at 192.168.0.251:6379 OK 192.168.0.251:6379> get foo -> Redirected to slot [12182] located at 192.168.0.253:6379 "bar" 192.168.0.253:6379> get hello -> Redirected to slot [866] located at 192.168.0.251:6379 "world" 查看集群中的节点: 192.168.0.251:6379> cluster nodes 11.2. 
从slaves读数据
默认不能从slaves读取数据,但建立连接后,执行一次命令READONLY,即可从slaves读取数据。如果想恢复为不能从slaves读取数据,执行命令READWRITE即可。
11.3. jedis(java cluster client)
官网:https://github.com/xetorthio/jedis
编程示例:
Set<HostAndPort> jedisClusterNodes = new HashSet<HostAndPort>();
// Jedis Cluster will attempt to discover cluster nodes automatically
jedisClusterNodes.add(new HostAndPort("127.0.0.1", 7379));
JedisCluster jc = new JedisCluster(jedisClusterNodes);
jc.set("foo", "bar");
String value = jc.get("foo");
11.4. r3c(C++ cluster client)
官网:https://github.com/eyjian/r3c
12. 新增节点
12.1. 添加一个新主(master)节点
先以单机版配置并启动好redis-server,然后执行命令:
./redis-trib.rb add-node 127.0.0.1:7006 127.0.0.1:7000
执行上面这条命令时,可能遇到错误“[ERR] Sorry, can't connect to node 127.0.0.1:7006”。引起该问题的一个原因是ruby的版本过低(运行ruby -v可以查看ruby的版本),可以尝试升级ruby后再试,比如ruby 1.8.7版本就需要升级。对于Redis 3.0.5和Redis 3.2.0,使用Ruby 2.3.1操作正常。请注意,升级到最新版本的ruby也可能遇到这个错误。
另一个会引起这个问题的原因是:从Redis 3.2.0版本开始引入了“保护模式(protected mode)”,防止redis-cli远程访问,仅限redis-cli绑定到127.0.0.1才可以连接Redis server。
为了完成添加新主节点,可以暂时性地关闭保护模式:使用redis-cli,不指定-h参数(但可以指定-p参数,或者-h参数值为127.0.0.1)进入操作界面,执行:CONFIG SET protected-mode no。
注意7006是新增的节点,而7000是已存在的节点(可为master或slave)。如果需要将7006变成某master的slave节点,执行命令:
cluster replicate 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e
新加入的master节点上没有任何数据(即不管理任何slots,运行redis命令cluster nodes可以看到这个情况)。当一个slave想成为master时,由于这个新的master节点不管理任何slots,它不参与选举。
可以使用工具redis-trib.rb的resharding特性为这个新master节点分配slots,如:
redis-trib.rb reshard 127.0.0.1:7000
其中7000为集群中任意一个节点即可,redis-trib.rb将自动发现其它节点。
在reshard过程中,会询问要reshard多少slots:“How many slots do you want to move (from 1 to 16384)?”,取值范围为1~16384,其中16384为redis cluster拥有的slots总数,比如只想移动100个,输入100即可。如果迁移的slots数量较多,应当把redis-trib.rb的超时参数--timeout设置得大一点,否则迁移过程中容易遇到超时错误“[ERR] Calling MIGRATE: IOERR error or timeout reading to target instance”,导致只完成部分迁移,可能会造成数据丢失。
接着,会提示“What is the receiving node ID?”,输入新加入的master节点ID。过程中如果遇到错误“Sorry, can't connect to node 10.225.168.253:6380”,则可能需要暂时先关闭相应节点的保护模式。
如果在迁移过程遇到下面这样的错误:
>>> Check for open slots...
[WARNING] Node 192.168.0.3:6379 has slots in importing state (5461).
[WARNING] Node 192.168.0.5:6380 has slots in migrating state (5461).
[WARNING] The following slots are open: 5461
可以考虑使用命令“redis-trib.rb fix 192.168.0.3:6379”尝试修复。如果仍显示有节点处于migrating或importing状态,可以登录到相应的节点,使用命令“cluster setslot 5461 stable”取消该状态,参数5461为提示中显示的slot的ID。
12.2. 添加一个新从(slave)节点
./redis-trib.rb add-node --slave 127.0.0.1:7006 127.0.0.1:7000
注意,这种方式如果添加了多个slave节点,可能导致各master的slaves不均衡,比如有的master有3个slave,有的只有1个slave。可以在slave节点上执行redis命令“CLUSTER REPLICATE”进行调整,让它成为其它master的slave。“CLUSTER REPLICATE”带一个参数,即master ID,注意需使用redis-cli -c登录到该slave上执行。
上面的方法没有指定7006的master,而是随机指定。下面的方法可以明确指定为哪个master的slave:
./redis-trib.rb add-node --slave --master-id 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e 127.0.0.1:7006 127.0.0.1:7000
其中“127.0.0.1:7006”是新节点,“127.0.0.1:7000”是集群中已有的节点。
13. 删除节点
从集群中删除一个节点:
./redis-trib.rb del-node 127.0.0.1:7000 <node-id>
第一个参数为集群中任意一个节点,第二个参数为需要删除节点的ID。
成功删除后,提示:
$ ./redis-trib.rb del-node 127.0.0.1:6380 f49a2bda05e81aa343adb9924775ba95a1f4236e
>>> Removing node f49a2bda05e81aa343adb9924775ba95a1f4236e from cluster 127.0.0.1:6379
/usr/local/share/gems/gems/redis-3.0.0/lib/redis.rb:182: warning: wrong element type nil at 0 (expected array)
……
/usr/local/share/gems/gems/redis-3.0.0/lib/redis.rb:182: warning: this causes ArgumentError in the next release
>>> Sending CLUSTER FORGET messages to the cluster...
>>> SHUTDOWN the node.
在这里会停顿几分钟,通知并等待被删除节点退出(exit),被删除节点在将数据写到RDB文件中后退出,所以停顿时长和写RDB文件时长有关,数据量越大时间就越长。5~6G的RAID1的SATA盘数据大概需要45秒左右。 被删除节点日志: 15577:S 06 Sep 20:06:37.774 - Accepted 10.49.126.98:14669 15577:S 06 Sep 20:06:38.741 # User requested shutdown... 15577:S 06 Sep 20:06:38.741 * Calling fsync() on the AOF file. 15577:S 06 Sep 20:06:38.742 * Saving the final RDB snapshot before exiting. 15577:S 06 Sep 20:07:19.683 * DB saved on disk 15577:S 06 Sep 20:07:19.683 * Removing the pid file. 15577:S 06 Sep 20:07:19.683 # Redis is now ready to exit, bye bye... 成功后不用再调用“CLUSTER FORGET”,否则报错: $ redis-cli -c CLUSTER FORGET aa6754a093ea4047f92cc0ea77f1859553bc5c57 (error) ERR Unknown node aa6754a093ea4047f92cc0ea77f1859553bc5c57 如果待删除节点已经不能连接,则调用CLUSTER FORGET剔除(可能需要在所有机器上执行一次FORGET): CLUSTER FORGET 注意如果是删除一个master节点,则需要先将它管理的slots的迁走,然后才可以删除它。 如果是master或slave机器不能连接,比如硬件故障导致无法启动,这个时候做不了del-node,只需要直接做CLUSTER 即可,在FORGET后,节点状态变成handshake。 !!!请注意,需要在所有node上执行一次“CLUSTER FORGET”,否则可能遇到被剔除node的总是处于handshake状态。 如果有部分node没有执行到FORGET,导致有部分node还处于fail状态,则在一些node将看到待剔除节点仍然处于handshake状态,并且nodeid在不断变化,所以需要在所有node上执行“CLUSTER FORGET”。 如果一个节点处于“:0 master,fail,noaddr”状态,执行“del-node”会报错: [ERR] No such node ID 80560d0d97a0b3fa975203350516437b58251745 这种情况下,只需要执行“CLUSTER FORGET”将其剔除即可(注意,需要在所有节点上执行一次,不然未执行的节点上可能仍然看得到“:0 master,fail,noaddr”): # redis-cli -c -p 1383 cluster nodes 80560d0d97a0b3fa975203350516437b58251745 :0 master,fail,noaddr - 1528947205054 1528947203553 0 disconnected fa7bbbf7d48389409ce05d303272078c3a6fd44f 127.0.0.1:1379 slave 689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 0 1535871825187 138 connected c1a9d1d23438241803ec97fbd765737df80f402a 127.0.0.1:1381 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1535871826189 143 connected 50003ccd5885771196e717e27011140e7d6c94e0 127.0.0.1:1385 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1535871825688 143 connected f6080015129eada3261925cc1b466f1824263358 127.0.0.1:1380 slave 4e932f2a3d80de29798660c5ea62e473e63a6630 0 1535871825388 145 connected 4e932f2a3d80de29798660c5ea62e473e63a6630 127.0.0.1:1383 myself,master - 0 0 145 connected 5458-10922 689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 127.0.0.1:1382 master - 0 1535871826490 138 connected 0-1986 1988-5457 f03b1008988acbb0f69d96252decda9adf747be9 127.0.0.1:1384 master - 0 1535871825187 143 connected 1987 10923-16383 14. master机器硬件故障 这种情况下,master机器可能无法启动,导致其上的master无法连接,master将一直处于“master,fail”状态,如果是slave则处于“slave,fail”状态。 如果是master,则会它的slave变成了master,因此只需要添加一个新的从节点作为原slave(已变成master)的slave节点。完成后,通过CLUSTER FORGET将故障的master或slave从集群中剔除即可。 !!!请注意,需要在所有node上执行一次“CLUSTER FORGET”,否则可能遇到被剔除node的总是处于handshake状态。 15. 检查节点状态 redis-trib.rb check 127.0.0.1:6380 如发现如下这样的错误: [WARNING] Node 192.168.0.11:6380 has slots in migrating state (5461). [WARNING] The following slots are open: 5461 可以使用redis命令取消slots迁移(5461为slot的ID): cluster setslot 5461 stable 需要注意,须登录到192.168.0.11:6380上执行redis的setslot子命令。 16. 变更主从关系 使用命令cluster replicate,参数为master节点ID,注意不是IP和端口,在被迁移的slave上执行该命令。 17. slots相关命令 CLUSTER ADDSLOTS slot1 [slot2] ... [slotN] CLUSTER DELSLOTS slot1 [slot2] ... [slotN] CLUSTER SETSLOT slot NODE node CLUSTER SETSLOT slot MIGRATING node CLUSTER SETSLOT slot IMPORTING node 17.1. 
迁移slots
官方参考:https://redis.io/commands/cluster-setslot。
示例:将值为8的slot从源节点A迁移到目标节点B,有如下两种方法:
在目标节点B上执行:CLUSTER SETSLOT 8 IMPORTING src-A-node-id
或
在源节点A上执行:CLUSTER SETSLOT 8 MIGRATING dst-B-node-id
上述操作只是将slot标记为迁移状态,完成迁移还需要在目标node上执行:
CLUSTER SETSLOT <slot> NODE <node-id>
其中node-id为目标节点的Node ID;取消迁移使用“CLUSTER SETSLOT <slot> STABLE”。
操作示例:
# 将值为11677的slot迁到192.168.31.3:6379
$ redis-cli -c -h 192.168.31.3 -p 6379 CLUSTER SETSLOT 11677 IMPORTING 216e0069af11eca91465394b2ad7bf1c27f5f7fe
OK
$ redis-cli -c -h 192.168.31.3 -p 6379 CLUSTER SETSLOT 11677 NODE 4e149c72aff2b6651370ead476dd70c8cf9e3e3c
OK
17.2. redis-trib.rb rebalance
当有增减节点时,可以使用命令:
redis-trib.rb rebalance 192.168.0.31:6379 --auto-weights
做一次均衡。简单点可以只指定两个参数:“192.168.0.31:6379”为集群中已知的任何一个节点,参数“--auto-weights”表示自动权重。
18. 人工主备切换
在需要升为主的slave节点上执行命令:CLUSTER FAILOVER。如果是人工发起的failover,则其它master会收到“Failover auth granted to 4291f18b5e9729e832ed15ceb6324ce5dfc2ffbe for epoch 31”,每次epoch值增一:
23038:M 06 Sep 20:31:24.815 # Failover auth granted to 4291f18b5e9729e832ed15ceb6324ce5dfc2ffbe for epoch 31
当出现下面两条日志时,表示failover完成:
23038:M 06 Sep 20:32:44.019 * FAIL message received from ea28f68438e5bb79c26a9cb2135241f11d7a50ba about 5e6ffacb2c5d5761e39aba5270fbf48f296cb5ee
23038:M 06 Sep 20:32:58.487 * Clear FAIL state for node 5e6ffacb2c5d5761e39aba5270fbf48f296cb5ee: slave is reachable again.
原master收到failover后的日志(其中10.51.147.216:7388为failover前的slave,其ID为58a40dbe01e1563773724803854406df04c62724):
35475:M 06 Sep 20:35:43.396 - DB 0: 16870482 keys (7931571 volatile) in 50331648 slots HT.
35475:M 06 Sep 20:35:43.396 - 1954 clients connected (1 slaves), 5756515544 bytes in use
35475:M 06 Sep 20:35:48.083 # Manual failover requested by slave 58a40dbe01e1563773724803854406df04c62724.
35475:M 06 Sep 20:35:48.261 # Failover auth granted to 58a40dbe01e1563773724803854406df04c62724 for epoch 32
35475:M 06 Sep 20:35:48.261 - Client closed connection
35475:M 06 Sep 20:35:48.261 # Connection with slave 10.51.147.216:7388 lost.
35475:M 06 Sep 20:35:48.278 # Configuration change detected. Reconfiguring myself as a replica of 58a40dbe01e1563773724803854406df04c62724
35475:S 06 Sep 20:35:48.280 - Client closed connection
35475:S 06 Sep 20:35:48.408 - DB 0: 16870296 keys (7931385 volatile) in 50331648 slots HT.
35475:S 06 Sep 20:35:48.408 - 1953 clients connected (0 slaves), 5722753736 bytes in use
35475:S 06 Sep 20:35:48.408 * Connecting to MASTER 10.51.147.216:7388
35475:S 06 Sep 20:35:48.408 * MASTER SLAVE sync started
35475:S 06 Sep 20:35:48.408 * Non blocking connect for SYNC fired the event.
35475:S 06 Sep 20:35:48.408 * Master replied to PING, replication can continue...
35475:S 06 Sep 20:35:48.408 * Partial resynchronization not possible (no cached master)
35475:S 06 Sep 20:35:48.459 * Full resync from master: 36beb63d32b3809039518bf4f3e4e10de227f3ee:16454238619
35475:S 06 Sep 20:35:48.493 - Client closed connection
35475:S 06 Sep 20:35:48.880 - Client closed connection
19. 查看集群信息
对应的redis命令为:cluster info,示例(括号内为各字段的含义):
127.0.0.1:6381> cluster info
cluster_state:ok(所有slots正常则显示为ok,否则为error)
cluster_slots_assigned:16384(已被分配、即已被master管理的slots数,16384为全部slots)
cluster_slots_ok:16384(处于正常状态的slots数)
cluster_slots_pfail:0(可能处于异常状态的slots数,处于这个状态并不表示有问题,仍能继续提供服务)
cluster_slots_fail:0(处于异常状态、需要修复才能服务的slots数)
cluster_known_nodes:10(集群中的节点数)
cluster_size:3(集群中master个数)
cluster_current_epoch:11(本地的当前epoch值,用于故障切换时生成独一无二的增量版本号)
cluster_my_epoch:0
cluster_stats_messages_sent:4049(通过集群消息总线发送的消息总数)
cluster_stats_messages_received:4051(通过集群消息总线收到的消息总数)
20.
禁止指定命令 KEYS命令很耗时,FLUSHDB和FLUSHALL命令可能导致误删除数据,所以线上环境最好禁止使用,可以在Redis配置文件增加如下配置: rename-command KEYS "" rename-command FLUSHDB "" rename-command FLUSHALL "" 21. 各版本配置文件 https://raw.githubusercontent.com/antirez/redis/3.0/redis.conf https://raw.githubusercontent.com/antirez/redis/3.2.9/redis.conf https://raw.githubusercontent.com/antirez/redis/4.0/redis.conf https://raw.githubusercontent.com/antirez/redis/4.0.1/redis.conf https://raw.githubusercontent.com/antirez/redis/4.0.3/redis.conf https://raw.githubusercontent.com/antirez/redis/4.0.5/redis.conf https://raw.githubusercontent.com/antirez/redis/4.0.9/redis.conf https://raw.githubusercontent.com/antirez/redis/4.0.11/redis.conf 22. 大压力下Redis参数调整要点 参数 建议最小值 说明 repl-ping-slave-period 10 每10秒ping一次 repl-timeout 60 60秒超时,也就是ping十次 cluster-node-timeout 15000 repl-backlog-size 1GB Master对slave的队列大小 appendfsync no 让系统自动刷 save 大压力下,调大参数值,以减少写RDB带来的压力: "900 20 300 200 60 200000" appendonly 对于队列,建议单独建立集群,并且设置该值为no 为何大压力下要这样调整? 最重要的原因之一Redis的主从复制,两者复制共享同一线程,虽然是异步复制的,但因为是单线程,所以也十分有限。如果主从间的网络延迟不是在0.05左右,比如达到0.6,甚至1.2等,那么情况是非常糟糕的,因此同一Redis集群一定要部署在同一机房内。 这些参数的具体值,要视具体的压力而定,而且和消息的大小相关,比如一条200~500KB的流水数据可能比较大,主从复制的压力也会相应增大,而10字节左右的消息,则压力要小一些。大压力环境中开启appendfsync是十分不可取的,容易导致整个集群不可用,在不可用之前的典型表现是QPS毛刺明显。 这么做的目的是让Redis集群尽可能的避免master正常时触发主从切换,特别是容纳的数据量很大时,和大压力结合在一起,集群会雪崩。 当Redis日志中,出现大量如下信息,即可能意味着相关的参数需要调整了: 22135:M 06 Sep 14:17:05.388 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about e438a338e9d9834a6745c12931950da87e360ca2 22135:M 06 Sep 14:17:07.551 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about d6eb06e9d118c120d3961a659972a1d0191a8652 22135:M 06 Sep 14:17:08.438 # Failover auth granted to f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f for epoch 285 (We can vote for this slave) 有投票资格的node: 1)为master 2)至少有一个slot 3)投票node的epoch不能小于node自己当前的epoch(reqEpoch 4)node没有投票过该epoch(already voted for epoch) 5)投票node不能为master(it is a master node) 6)投票node必须有一个master(I don't know its master) 7)投票node的master处于fail状态(its master is up) 22135:M 06 Sep 14:17:19.844 # Failover auth denied to 534b93af6ba45a7033dbf38c8f47cd688514125a: already voted for epoch 285 如果一个node又联系上了,则它当是一个slave,或者无slots的master时,直接清除FAIL标志;但如果是一个master,则当“(now - node->fail_time) > (server.cluster_node_timeout * CLUSTER_FAIL_UNDO_TIME_MULT)”时,也清除FAIL标志,定义在cluster.h中(cluster.h:#define CLUSTER_FAIL_UNDO_TIME_MULT 2 /* Undo fail if master is back. */) 22135:M 06 Sep 14:17:29.243 * Clear FAIL state for node d6eb06e9d118c120d3961a659972a1d0191a8652: master without slots is reachable again. 如果消息类型为fail。 22135:M 06 Sep 14:17:31.995 * FAIL message received from f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f about 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6 22135:M 06 Sep 14:17:32.496 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about d7942cfe636b25219c6d56aa72828fcfde2ee261 22135:M 06 Sep 14:17:32.968 # Failover auth granted to 938d9ae2de278938beda1d39185608b02d3b31ec for epoch 286 22135:M 06 Sep 14:17:33.177 # Failover auth granted to d9dadf3342006e2c92def3071ca0a76390be62b0 for epoch 287 22135:M 06 Sep 14:17:36.336 * Clear FAIL state for node 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6: master without slots is reachable again. 22135:M 06 Sep 14:17:36.855 * Clear FAIL state for node d7942cfe636b25219c6d56aa72828fcfde2ee261: master without slots is reachable again. 22135:M 06 Sep 14:17:38.419 * Clear FAIL state for node e438a338e9d9834a6745c12931950da87e360ca2: is reachable again and nobody is serving its slots after some time. 
22135:M 06 Sep 14:17:54.954 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about 7990d146cece7dc83eaf08b3e12cbebb2223f5f8 22135:M 06 Sep 14:17:56.697 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5 22135:M 06 Sep 14:17:57.705 # Failover auth granted to e1c202d89ffe1c61b682e28071627635974c84a7 for epoch 288 22135:M 06 Sep 14:17:57.890 * Clear FAIL state for node 7990d146cece7dc83eaf08b3e12cbebb2223f5f8: slave is reachable again. 22135:M 06 Sep 14:17:57.892 * Clear FAIL state for node fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5: master without slots is reachable again. 23. 问题排查 1) 如果最后一条日志为“16367:M 08 Jun 14:48:15.560 # Server started, Redis version 3.2.0”,节点状态始终终于fail状态,则可能是aof文件损坏了,这时可以使用工具edis-check-aof --fix进行修改,如: ../../bin/redis-check-aof --fix appendonly-6380.aof 0x a1492b9b: Expected prefix ' AOF analyzed: size=2705928192, ok_up_to=2705927067, diff=1125 This will shrink the AOF from 2705928192 bytes, with 1125 bytes, to 2705927067 bytes Continue? [y/N]: y 2) in `call': ERR Slot 16011 is already busy (Redis::CommandError) 将所有节点上的配置项cluster-config-file指定的文件删除,然后重新启;或者在所有节点上执行下FLUSHALL命令。 另外,如果使用主机名而不是IP,也可能遇到这个错误,如:“redis-trib.rb create --replicas 1 redis1:6379 redis2:6379 redis3:6379 redis4:6379 redis5:6379 redis6:6379”,可能也会得到错误“ERR Slot 16011 is already busy (Redis::CommandError)”。 3) for lack of backlog (Slave request was: 51875158284) 默认值: # redis-cli config get repl-timeout A) "repl-timeout" B) "10" # redis-cli config get client-output-buffer-limit A) "client-output-buffer-limit" B) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60" 增大: redis-cli config set "client-output-buffer-limit" "normal 0 0 0 slave 2684354560 671088640 60 pubsub 33554432 8388608 60" 4) 复制中断场景 A) master的slave缓冲区达到限制的硬或软限制大小,与参数client-output-buffer-limit相关; B) 复制时间超过repl-timeout指定的值,与参数repl-timeout相关。 slave反复循环从master复制,如果调整以上参数仍然解决不了,可以尝试删除slave上的aof和rdb文件,然后再重启进程复制,这个时候可能能正常完成复制。 5) 日志文件出现:Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis. 考虑优化以下配置项: no-appendfsync-on-rewrite值设为yes repl-backlog-size和client-output-buffer-limit调大一点 6) 日志文件出现:MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error. 考虑设置stop-writes-on-bgsave-error值为“no”。 7) Failover auth granted to 当日志大量反反复复出现下列内容时,很可能表示master和slave间同步和通讯不顺畅,导致无效的failover和状态变更,这个时候需要调大相关参数值,容忍更长的延迟,因此也特别注意集群内所有节点间的网络延迟要尽可能的小,最好达到0.02ms左右的水平,调大参数的代价是主备切换变迟钝。 Slave日志: 31019:S 06 Sep 11:07:24.169 * Connecting to MASTER 10.5.14.8:6379 31019:S 06 Sep 11:07:24.169 * MASTER SLAVE sync started 31019:S 06 Sep 11:07:24.169 # Start of election delayed for 854 milliseconds (rank #0, offset 5127277817). 31019:S 06 Sep 11:07:24.169 * Non blocking connect for SYNC fired the event. 31019:S 06 Sep 11:07:25.069 # Starting a failover election for epoch 266. 31019:S 06 Sep 11:07:29.190 * Clear FAIL state for node ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467: is reachable again and nobody is serving its slots after some time. 31019:S 06 Sep 11:07:29.191 * Master replied to PING, replication can continue... 31019:S 06 Sep 11:07:29.191 * Clear FAIL state for node f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f: is reachable again and nobody is serving its slots after some time. 
31019:S 06 Sep 11:07:29.192 * Trying a partial resynchronization (request ea2261c827fbc54135a95f707046581a55dff133:5127277818). 31019:S 06 Sep 11:07:29.192 * Successful partial resynchronization with master. 31019:S 06 Sep 11:07:29.192 * MASTER SLAVE sync: Master accepted a Partial Resynchronization. 31019:S 06 Sep 11:07:29.811 * Clear FAIL state for node e438a338e9d9834a6745c12931950da87e360ca2: is reachable again and nobody is serving its slots after some time. 31019:S 06 Sep 11:07:37.680 * FAIL message received from 5b41f7860cc800e65932e92d1d97c6c188138e56 about 3114cec541c5bcd36d712cd6c9f4c5055510e386 31019:S 06 Sep 11:07:43.710 * Clear FAIL state for node 3114cec541c5bcd36d712cd6c9f4c5055510e386: slave is reachable again. 31019:S 06 Sep 11:07:48.119 * FAIL message received from 7d61af127c17d9c19dbf9af0ac8f7307f1c96c4b about e1c202d89ffe1c61b682e28071627635974c84a7 31019:S 06 Sep 11:07:49.410 * FAIL message received from 5b41f7860cc800e65932e92d1d97c6c188138e56 about d9dadf3342006e2c92def3071ca0a76390be62b0 31019:S 06 Sep 11:07:53.352 * Clear FAIL state for node d9dadf3342006e2c92def3071ca0a76390be62b0: slave is reachable again. 31019:S 06 Sep 11:07:57.147 * Clear FAIL state for node e1c202d89ffe1c61b682e28071627635974c84a7: slave is reachable again. 31019:S 06 Sep 11:08:36.516 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about 938d9ae2de278938beda1d39185608b02d3b31ec 31019:S 06 Sep 11:08:41.900 * Clear FAIL state for node 938d9ae2de278938beda1d39185608b02d3b31ec: slave is reachable again. 31019:S 06 Sep 11:08:46.380 * FAIL message received from d7942cfe636b25219c6d56aa72828fcfde2ee261 about fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5 31019:S 06 Sep 11:08:46.531 * Marking node 7990d146cece7dc83eaf08b3e12cbebb2223f5f8 as failing (quorum reached). 31019:S 06 Sep 11:09:01.882 * Clear FAIL state for node 7990d146cece7dc83eaf08b3e12cbebb2223f5f8: master without slots is reachable again. 31019:S 06 Sep 11:09:01.883 * Clear FAIL state for node fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5: master without slots is reachable again. 31019:S 06 Sep 11:09:06.538 * FAIL message received from e438a338e9d9834a6745c12931950da87e360ca2 about d7942cfe636b25219c6d56aa72828fcfde2ee261 31019:S 06 Sep 11:09:06.538 * FAIL message received from e438a338e9d9834a6745c12931950da87e360ca2 about 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6 31019:S 06 Sep 11:09:12.555 * Clear FAIL state for node 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6: is reachable again and nobody is serving its slots after some time. 31019:S 06 Sep 11:09:12.555 * Clear FAIL state for node d7942cfe636b25219c6d56aa72828fcfde2ee261: master without slots is reachable again. 31019:S 06 Sep 11:09:15.565 * Marking node 534b93af6ba45a7033dbf38c8f47cd688514125a as failing (quorum reached). 31019:S 06 Sep 11:09:16.599 * FAIL message received from 0a92bd7472c9af3e52f9185eac1bd1bbf36146e6 about e1c202d89ffe1c61b682e28071627635974c84a7 31019:S 06 Sep 11:09:22.262 * Clear FAIL state for node 534b93af6ba45a7033dbf38c8f47cd688514125a: slave is reachable again. 31019:S 06 Sep 11:09:27.906 * Clear FAIL state for node e1c202d89ffe1c61b682e28071627635974c84a7: is reachable again and nobody is serving its slots after some time. 
31019:S 06 Sep 11:09:50.744 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about e1c202d89ffe1c61b682e28071627635974c84a7 31019:S 06 Sep 11:09:55.141 * FAIL message received from 5b41f7860cc800e65932e92d1d97c6c188138e56 about d9dadf3342006e2c92def3071ca0a76390be62b0 31019:S 06 Sep 11:09:55.362 * FAIL message received from 7d61af127c17d9c19dbf9af0ac8f7307f1c96c4b about 938d9ae2de278938beda1d39185608b02d3b31ec 31019:S 06 Sep 11:09:55.557 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about 1d07e208db56cfd7395950ca66e03589278b8e12 31019:S 06 Sep 11:09:55.578 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about 144347d5a51acf047887fe81f22e8f7705c911ec 31019:S 06 Sep 11:09:56.521 * Marking node 534b93af6ba45a7033dbf38c8f47cd688514125a as failing (quorum reached). 31019:S 06 Sep 11:09:57.996 * Clear FAIL state for node 1d07e208db56cfd7395950ca66e03589278b8e12: slave is reachable again. 31019:S 06 Sep 11:09:58.329 * FAIL message received from 5b41f7860cc800e65932e92d1d97c6c188138e56 about 0a92bd7472c9af3e52f9185eac1bd1bbf36146e6 31019:S 06 Sep 11:10:09.239 * Clear FAIL state for node 144347d5a51acf047887fe81f22e8f7705c911ec: slave is reachable again. 31019:S 06 Sep 11:10:09.812 * Clear FAIL state for node d9dadf3342006e2c92def3071ca0a76390be62b0: is reachable again and nobody is serving its slots after some time. 31019:S 06 Sep 11:10:13.549 * Clear FAIL state for node 534b93af6ba45a7033dbf38c8f47cd688514125a: slave is reachable again. 31019:S 06 Sep 11:10:13.590 * FAIL message received from 716f2e2dd9792eaf4ee486794c9797fa6e1c9650 about 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6 31019:S 06 Sep 11:10:13.591 * FAIL message received from f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f about d7942cfe636b25219c6d56aa72828fcfde2ee261 31019:S 06 Sep 11:10:14.316 * Clear FAIL state for node e1c202d89ffe1c61b682e28071627635974c84a7: is reachable again and nobody is serving its slots after some time. 31019:S 06 Sep 11:10:15.108 * Clear FAIL state for node d7942cfe636b25219c6d56aa72828fcfde2ee261: slave is reachable again. 31019:S 06 Sep 11:10:17.588 * Clear FAIL state for node 938d9ae2de278938beda1d39185608b02d3b31ec: slave is reachable again. 31019:S 06 Sep 11:10:32.622 * Clear FAIL state for node 0a92bd7472c9af3e52f9185eac1bd1bbf36146e6: slave is reachable again. 31019:S 06 Sep 11:10:32.623 * FAIL message received from 5b41f7860cc800e65932e92d1d97c6c188138e56 about 3114cec541c5bcd36d712cd6c9f4c5055510e386 31019:S 06 Sep 11:10:32.623 * Clear FAIL state for node 3114cec541c5bcd36d712cd6c9f4c5055510e386: slave is reachable again. Master日志: 31014:M 06 Sep 14:08:54.083 * Background saving terminated with success 31014:M 06 Sep 14:09:55.093 * 10000 changes in 60 seconds. Saving... 31014:M 06 Sep 14:09:55.185 * Background saving started by pid 41395 31014:M 06 Sep 14:11:00.269 # Disconnecting timedout slave: 10.15.40.9:6018 31014:M 06 Sep 14:11:00.269 # Connection with slave 10.15.40.9:6018 lost. 
41395:C 06 Sep 14:11:01.141 * DB saved on disk 41395:C 06 Sep 14:11:01.259 * RDB: 5 MB of memory used by copy-on-write 31014:M 06 Sep 14:11:01.472 * Background saving terminated with success 31014:M 06 Sep 14:11:11.525 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about 534b93af6ba45a7033dbf38c8f47cd688514125a 31014:M 06 Sep 14:11:23.039 * FAIL message received from 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6 about d78845370c98b3ce4cfc02e8d3e233a9a1d84a83 31014:M 06 Sep 14:11:23.541 * Clear FAIL state for node 534b93af6ba45a7033dbf38c8f47cd688514125a: slave is reachable again. 31014:M 06 Sep 14:11:23.813 * Slave 10.15.40.9:6018 asks for synchronization 31014:M 06 Sep 14:11:23.813 * Partial resynchronization request from 10.15.40.9:6018 accepted. Sending 46668 bytes of backlog starting from offset 5502672944. 31014:M 06 Sep 14:11:23.888 # Failover auth granted to 7d61af127c17d9c19dbf9af0ac8f7307f1c96c4b for epoch 283 31014:M 06 Sep 14:11:32.464 * FAIL message received from d6eb06e9d118c120d3961a659972a1d0191a8652 about 3114cec541c5bcd36d712cd6c9f4c5055510e386 31014:M 06 Sep 14:11:47.616 * Clear FAIL state for node d78845370c98b3ce4cfc02e8d3e233a9a1d84a83: master without slots is reachable again. 31014:M 06 Sep 14:11:55.515 * FAIL message received from d6eb06e9d118c120d3961a659972a1d0191a8652 about ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 31014:M 06 Sep 14:11:57.135 # Failover auth granted to ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 for epoch 284 31014:M 06 Sep 14:12:01.766 * Clear FAIL state for node ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467: slave is reachable again. 31014:M 06 Sep 14:12:08.753 * Clear FAIL state for node 3114cec541c5bcd36d712cd6c9f4c5055510e386: master without slots is reachable again. 31014:M 06 Sep 14:16:02.070 * 10 changes in 300 seconds. Saving... 31014:M 06 Sep 14:16:02.163 * Background saving started by pid 13832 31014:M 06 Sep 14:17:18.443 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about d6eb06e9d118c120d3961a659972a1d0191a8652 31014:M 06 Sep 14:17:18.443 # Failover auth granted to f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f for epoch 285 31014:M 06 Sep 14:17:29.272 # Connection with slave client id #40662 lost. 31014:M 06 Sep 14:17:29.273 # Failover auth denied to 534b93af6ba45a7033dbf38c8f47cd688514125a: already voted for epoch 285 31014:M 06 Sep 14:17:29.278 * Slave 10.15.40.9:6018 asks for synchronization 31014:M 06 Sep 14:17:29.278 * Partial resynchronization request from 10.15.40.9:6018 accepted. Sending 117106 bytes of backlog starting from offset 5502756264. 13832:C 06 Sep 14:17:29.850 * DB saved on disk 13832:C 06 Sep 14:17:29.970 * RDB: 7 MB of memory used by copy-on-write 31014:M 06 Sep 14:17:38.449 * FAIL message received from f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f about 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6 31014:M 06 Sep 14:17:38.449 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about d7942cfe636b25219c6d56aa72828fcfde2ee261 31014:M 06 Sep 14:17:38.449 # Failover auth denied to 938d9ae2de278938beda1d39185608b02d3b31ec: reqEpoch (286) 31014:M 06 Sep 14:17:38.449 # Failover auth granted to d9dadf3342006e2c92def3071ca0a76390be62b0 for epoch 287 31014:M 06 Sep 14:17:38.449 * Background saving terminated with success 31014:M 06 Sep 14:17:38.450 * Clear FAIL state for node d7942cfe636b25219c6d56aa72828fcfde2ee261: master without slots is reachable again. 
31014:M 06 Sep 14:17:38.450 * Clear FAIL state for node 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6: master without slots is reachable again. 31014:M 06 Sep 14:17:38.452 * Clear FAIL state for node d6eb06e9d118c120d3961a659972a1d0191a8652: slave is reachable again. 31014:M 06 Sep 14:17:54.985 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about 7990d146cece7dc83eaf08b3e12cbebb2223f5f8 31014:M 06 Sep 14:17:56.729 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5 31014:M 06 Sep 14:17:57.737 # Failover auth granted to e1c202d89ffe1c61b682e28071627635974c84a7 for epoch 288 31014:M 06 Sep 14:17:57.922 * Clear FAIL state for node fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5: master without slots is reachable again. 31014:M 06 Sep 14:17:57.923 * Clear FAIL state for node 7990d146cece7dc83eaf08b3e12cbebb2223f5f8: slave is reachable again.
Adjusting the following parameters can greatly improve the stability of a Redis cluster under heavy load (a sketch of typical settings is given right after this section).

Why tune them this way under heavy load? One of the most important reasons is master-slave replication: replication runs in the same single thread as normal request handling, and although it is asynchronous, a single thread can only push so much. If the network latency between master and slave is not around 0.05 but instead reaches 0.6 or even 1.2, the situation becomes very bad, so all nodes of one Redis cluster must be deployed in the same machine room.

The concrete values depend on the actual load and on the message size: a 200~500KB record puts considerably more pressure on replication than a message of about 10 bytes. Enabling appendfsync under heavy load is strongly discouraged; it easily makes the whole cluster unavailable, and the typical symptom right before that is pronounced QPS jitter.

The goal of the tuning is to keep the cluster from triggering a master-slave switch while the master is actually healthy; when the amount of data held is large and the load is heavy, such spurious failovers can avalanche the whole cluster.

When the Redis logs contain large numbers of entries like the following, it usually means the related parameters need adjusting:

22135:M 06 Sep 14:17:05.388 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about e438a338e9d9834a6745c12931950da87e360ca2
22135:M 06 Sep 14:17:07.551 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about d6eb06e9d118c120d3961a659972a1d0191a8652
22135:M 06 Sep 14:17:08.438 # Failover auth granted to f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f for epoch 285 (We can vote for this slave)

A node is eligible to vote only if:
1) it is a master;
2) it serves at least one slot.
The vote is granted only if the requesting node also passes all of the following checks (the strings in parentheses are the reasons logged when a request is denied):
3) the request epoch is not smaller than the voter's own current epoch (reqEpoch);
4) the voter has not already voted in this epoch (already voted for epoch);
5) the requester is not itself a master (it is a master node);
6) the requester's master is known to the voter (I don't know its master);
7) the requester's master is in FAIL state (its master is up).

22135:M 06 Sep 14:17:19.844 # Failover auth denied to 534b93af6ba45a7033dbf38c8f47cd688514125a: already voted for epoch 285

When a node becomes reachable again, the FAIL flag is cleared immediately if it is a slave or a master without slots; if it is a master with slots, the flag is cleared only when nobody is serving its slots and "(now - node->fail_time) > (server.cluster_node_timeout * CLUSTER_FAIL_UNDO_TIME_MULT)". The multiplier is defined in cluster.h (cluster.h: #define CLUSTER_FAIL_UNDO_TIME_MULT 2 /* Undo fail if master is back. */).

22135:M 06 Sep 14:17:29.243 * Clear FAIL state for node d6eb06e9d118c120d3961a659972a1d0191a8652: master without slots is reachable again.

The entries below show messages of type "fail" being processed, FAIL flags being set and later cleared:

22135:M 06 Sep 14:17:31.995 * FAIL message received from f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f about 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6
22135:M 06 Sep 14:17:32.496 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about d7942cfe636b25219c6d56aa72828fcfde2ee261
22135:M 06 Sep 14:17:32.968 # Failover auth granted to 938d9ae2de278938beda1d39185608b02d3b31ec for epoch 286
22135:M 06 Sep 14:17:33.177 # Failover auth granted to d9dadf3342006e2c92def3071ca0a76390be62b0 for epoch 287
22135:M 06 Sep 14:17:36.336 * Clear FAIL state for node 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6: master without slots is reachable again.
22135:M 06 Sep 14:17:36.855 * Clear FAIL state for node d7942cfe636b25219c6d56aa72828fcfde2ee261: master without slots is reachable again.
22135:M 06 Sep 14:17:38.419 * Clear FAIL state for node e438a338e9d9834a6745c12931950da87e360ca2: is reachable again and nobody is serving its slots after some time.
22135:M 06 Sep 14:17:54.954 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about 7990d146cece7dc83eaf08b3e12cbebb2223f5f8
22135:M 06 Sep 14:17:56.697 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5
22135:M 06 Sep 14:17:57.705 # Failover auth granted to e1c202d89ffe1c61b682e28071627635974c84a7 for epoch 288
22135:M 06 Sep 14:17:57.890 * Clear FAIL state for node 7990d146cece7dc83eaf08b3e12cbebb2223f5f8: slave is reachable again.
22135:M 06 Sep 14:17:57.892 * Clear FAIL state for node fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5: master without slots is reachable again.
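As a rough, non-authoritative sketch of the kind of knobs this section is about: the directive names below are real redis.conf settings, but the values are assumptions and must be tuned to the actual workload and message sizes discussed above.

cluster-node-timeout 30000    # assumed value; raise it so transient load spikes do not mark a healthy master as FAIL
repl-timeout 60               # assumed value; avoids "Disconnecting timedout slave" under replication pressure
repl-ping-slave-period 10     # assumed value; less frequent pings reduce false replication timeouts
repl-backlog-size 256mb       # assumed value; large enough for partial resynchronization after short breaks
appendonly no                 # per the advice above: do not rely on appendfsync under heavy load

After changing a value, it can be checked on a node with, for example, redis-cli -p 6379 config get cluster-node-timeout.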
Download location: https://github.com/eyjian/libmooon/tree/master/shell

#!/bin/bash
# Tool for counting UDP packet drops (receive errors)
# Optional argument 1: statistics interval (unit: seconds, default 10 seconds)
# Optional argument 2: whether to also print records whose drop count is 0
#                      (argument 2 only takes effect when argument 1 is also given)
#
# Results are written to a log file. The log file is stored in the same
# directory as the tool if that directory is writable; otherwise the current
# directory is used; otherwise /tmp; if even /tmp is not writable the tool
# exits with an error.
#
# Output format: date time drops
# Output example:
# 2018-09-03 17:22:49 5
# 2018-09-03 17:22:51 3
#
# A usable UDP test tool: https://iperf.fr/
flag=0
stat_seconds=10
if test $# -gt 2; then
    echo "Usage: `basename $0` [seconds] [0|1]"
    exit 1
fi
if test $# -gt 1; then
    flag=$2 # a value of 1 means records with 0 drops are also printed
fi
if test $# -gt 0; then
    stat_seconds=$1
fi

# No errors allowed in the following block
set -e

# Log file
basedir=$(dirname $(readlink -f $0))
logname=`basename $0 .sh`
logfile=$basedir/$logname.log
if test ! -w $basedir; then
    basedir=`pwd`
    logfile=$basedir/$logname.log
    if test ! -w $basedir; then
        basedir=/tmp
        logfile=$basedir/$logname.log
    fi
fi

# Backup log file
bak_logfile=$logfile.bak
if test -f $logfile; then
    rm --interactive=never $logfile
    touch $logfile
fi

# Restore normal error handling
set +e

# Which NICs to count; if left empty they are detected automatically
#ethX_array=()
#
#if test $# -eq 0; then
#    ethX_array=(`cat /proc/net/dev| awk -F[\ \:]+ '/eth/{printf("%s\n",$2);}'`)
#else
#    ethX_array=($*)
#fi

old_num_errors=0
for ((;;))
do
    # Related commands:
    # 1) packets queued on sockets:   netstat -alupt
    # 2) socket receive buffer size:  cat /proc/sys/net/core/rmem_default
    # 3) socket send buffer size:     cat /proc/sys/net/core/wmem_default
    # 4) NIC ring buffer size:        ethtool -g eth1
    # 5) ARP queue size:              cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen
    # 6) CPU load:                    mpstat -P ALL 1, vmstat 1, top, htop or uptime
    #
    # Get the number of UDP receive errors.
    # "cat /proc/net/snmp | grep Udp" is preferable to "netstat -su":
    # num_drops=`netstat -su | awk -F[\ ]+ 'BEGIN{flag=0;}{ if ($0=="Udp:") flag=1; if ((flag==1) && (match($0, "packet receive errors"))) printf("%s\n", $2); }'`
    # The second "Udp:" line of /proc/net/snmp holds the values; field 4 is InErrors.
    num_errors=`cat /proc/net/snmp | awk -F'[ ]+' 'BEGIN{ line=0; }/Udp/{ ++line; if (2==line) printf("%s\n", $4); }'`
    if test $old_num_errors -eq 0; then
        old_num_errors=$num_errors
    elif test $num_errors -ge $old_num_errors; then
        num_drops=$(($num_errors - $old_num_errors))
        old_num_errors=$num_errors # remember the counter so the next round reports per-interval drops
        if test $flag -eq 1 -o $num_drops -ne 0; then
            line="`date '+%Y-%m-%d %H:%M:%S'` $num_drops"
            # Get the log file size (5368709120 = 5 * 1024 * 1024 * 1024)
            logfile_size=`ls -l --time-style=long-iso $logfile 2>/dev/null| awk -F[\ ]+ '{ printf("%s\n", $5); }'`
            if test ! -z "$logfile_size"; then
                if test $logfile_size -gt 5368709120; then
                    echo $line | tee -a $logfile
                    mv $logfile $bak_logfile
                    rm -f $logfile
                fi
            fi
            echo $line | tee -a $logfile
        fi
    fi
    sleep $stat_seconds
done
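For reference, the counter the script reads can be checked by hand. In /proc/net/snmp the first "Udp:" line is the header and the second one holds the values, whose fourth whitespace-separated field ($4) is InErrors; a quick one-liner equivalent to what the script does:

awk '/^Udp:/ { if (++n == 2) print "UDP InErrors:", $4 }' /proc/net/snmp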
Given any one node of the cluster, this tool shows, for every node in the cluster, the physical memory currently used, the configured maximum memory and the system physical memory. Source code (available from https://github.com/eyjian/redis-tools):

#!/bin/bash
# Query the memory of all nodes in a cluster
#
# Output example:
# $ ./query_redis_cluster.sh 192.168.0.31.21:6379
# [192.168.0.31.21:6379] Used: 788.57M Max: 15.00G System: 125.56G
# [192.168.0.31.22:6380] Used: 756.98M Max: 15.00G System: 125.56G
# [192.168.0.31.23:6380] Used: 743.93M Max: 15.00G System: 125.56G
# [192.168.0.31.24:6380] Used: 21.73M Max: 15.00G System: 125.56G
# [192.168.0.31.25:6380] Used: 819.11M Max: 15.00G System: 125.56G
# [192.168.0.31.24:6379] Used: 771.70M Max: 15.00G System: 125.56G
# [192.168.0.31.26:6379] Used: 920.77M Max: 15.00G System: 125.56G
# [192.168.0.31.27:6380] Used: 889.09M Max: 15.00G System: 125.27G
# [192.168.0.31.28:6379] Used: 741.24M Max: 15.00G System: 125.56G
# [192.168.0.31.29:6380] Used: 699.55M Max: 15.00G System: 125.56G
# [192.168.0.31.27:6379] Used: 752.89M Max: 15.00G System: 125.27G
# [192.168.0.31.21:6380] Used: 716.05M Max: 15.00G System: 125.56G
# [192.168.0.31.23:6379] Used: 784.82M Max: 15.00G System: 125.56G
# [192.168.0.31.26:6380] Used: 726.40M Max: 15.00G System: 125.56G
# [192.168.0.31.25:6379] Used: 726.09M Max: 15.00G System: 125.56G
# [192.168.0.31.29:6379] Used: 844.59M Max: 15.00G System: 125.56G
# [192.168.0.31.28:6380] Used: 14.00M Max: 15.00G System: 125.56G
# [192.168.0.31.22:6379] Used: 770.13M Max: 15.00G System: 125.56G
REDIS_CLI=${REDIS_CLI:-redis-cli}
REDIS_IP=${REDIS_IP:-127.0.0.1}
REDIS_PORT=${REDIS_PORT:-6379}

function usage()
{
    echo "Usage: `basename $0` redis_node"
    echo "Example: `basename $0` 127.0.0.1:6379"
}

# with a parameter: single redis node
if test $# -ne 1; then
    usage
    exit 1
fi

eval $(echo "$1" | awk -F[\:] '{ printf("REDIS_IP=%s\nREDIS_PORT=%s\n",$1,$2) }')
if test -z "$REDIS_IP" -o -z "$REDIS_PORT"; then
    echo "Parameter error"
    usage
    exit 1
fi

# Make sure redis-cli is available
which "$REDIS_CLI" > /dev/null 2>&1
if test $? -ne 0; then
    echo "\`redis-cli\` not exists or not executable"
    exit 1
fi

redis_nodes=`$REDIS_CLI -h $REDIS_IP -p $REDIS_PORT cluster nodes | awk -F[\ \:\@] '!/ERR/{ printf("%s:%s\n",$2,$3); }'`
if test -z "$redis_nodes"; then
    # standalone: show the memory of this single node only
    $REDIS_CLI -h $REDIS_IP -p $REDIS_PORT INFO MEMORY | grep -E "used_memory_rss_human|maxmemory_human|total_system_memory_human"
else
    # cluster
    for redis_node in $redis_nodes;
    do
        if test ! -z "$redis_node"; then
            eval $(echo "$redis_node" | awk -F[\:] '{ printf("redis_node_ip=%s\nredis_node_port=%s\n",$1,$2) }')
            if test ! -z "$redis_node_ip" -a ! -z "$redis_node_port"; then
                items=(`$REDIS_CLI -h $redis_node_ip -p $redis_node_port INFO MEMORY 2>&1 | tr '\r' ' '`)
                used_memory_rss_human=0
                maxmemory_human=0
                total_system_memory_human=0

                for item in "${items[@]}"
                do
                    eval $(echo "$item" | awk -F[\:] '{ printf("name=%s\nvalue=%s\n",$1,$2) }')
                    if test "$name" = "used_memory_rss_human"; then
                        used_memory_rss_human=$value
                    elif test "$name" = "maxmemory_human"; then
                        maxmemory_human=$value
                    elif test "$name" = "total_system_memory_human"; then
                        total_system_memory_human=$value
                    fi
                done

                echo -e "[\033[1;33m${redis_node_ip}:${redis_node_port}\033[m]\tUsed: \033[0;32;32m$used_memory_rss_human\033[m\tMax: \033[0;32;32m$maxmemory_human\033[m\tSystem: \033[0;32;32m$total_system_memory_human\033[m"
            fi
        fi
    done
fi
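A usage note on the environment overrides at the top of the script: REDIS_CLI can point at a non-default redis-cli binary, so a typical invocation (the path is an assumption) looks like:

REDIS_CLI=/usr/local/redis/bin/redis-cli ./query_redis_cluster.sh 192.168.0.21:6379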
使用之前准备工作: 1)配置好与端口无关的公共redis.conf文件,和工具放在同一目录下 2)配置好与端口相关的模板redis-PORT.conf文件,也和工具放在同一目录下(部署时PORT会被替换成具体的端口号) 3)配置好组成集群的节点文件redis_cluster.nodes,也和工具放在同一目录下 redis_cluster.nodes的文件格式为每行一个组成Redis集群的节点,支持“#”打头的注释行,格式示例: 127.0.0.1 6381 127.0.0.1 6382 127.0.0.1 6383 127.0.0.1 6384 127.0.0.1 6385 127.0.0.1 6386 4)创建好安装redis的目录(可建筑批量工具mooon_ssh完成,deploy_redis_cluster.sh主要也是利用了该批量工具) 5)其它更详细的可以直接看源代码,有详细的说明。 建立将https://github.com/eyjian/redis-tools/tree/master/deploy下载到一个目录,运行deploy_redis_cluster.sh工具时,它会提示各种前置条件,比如redis-cli是否可用等。 源码(可从https://github.com/eyjian/redis-tools下载): #!/bin/bash # 源代码:https://github.com/eyjian/redis-tools # a tool to deploy a redis cluster # # 自动化部署redis集群工具, # 远程操作即可,不需登录到Redis集群中的任何机器。 # # 以root用户批量创建用户redis示例: # export H=192.168.0.5,192.168.0.6,192.168.0.7,192.168.0.8,192.168.0.9 # export U=root # export P='root^1234' # mooon_ssh -c='groupadd redis; useradd -g redis -m redis; echo "redis:redis#1234"|chpasswd' # # 批量创建redis安装目录/data/redis-4.0.11,并设置owner为用户redis,用户组为redis示例: # mooon_ssh -c='mkdir /data/redis-4.0.11;ln -s /data/redis-4.0.11 /data/redis;chown redis:redis /data/redis*' # # 可使用process_monitor.sh监控redis-server进程重启: # https://github.com/eyjian/libmooon/blob/master/shell/process_monitor.sh # 使用示例: # * * * * * /usr/local/bin/process_monitor.sh "/usr/local/redis/bin/redis-server 6379" "/usr/local/redis/bin/redis-server /usr/local/redis/conf/redis-6379.conf" # * * * * * /usr/local/bin/process_monitor.sh "/usr/local/redis/bin/redis-server 6380" "/usr/local/redis/bin/redis-server /usr/local/redis/conf/redis-6380.conf" # 可在/tmp目录找到process_monitor.sh的运行日志,当对应端口的进程不在时,5秒内即会重启对应端口的进程。 # # 运行参数: # 参数1 SSH端口 # 参数2 安装用户 # 参数3 安装用户密码 # 参数4 安装目录 # # 前置条件(可借助批量工具mooon_ssh和mooon_upload完成): # 1)安装用户已经创建好 # 2)安装用户密码已经设置好 # 3)安装目录已经创建好,并且目录的owner为安装用户 # 4)执行本工具的机器上安装好了ruby,且版本号不低于2.0.0 # 5)执行本工具的机器上安装好了redis-X.Y.Z.gem,且版本号不低于redis-3.0.0.gem # # 6)同目录下存在以下几个可执行文件: # 6.1)redis-server # 6.2)redis-cli # 6.3)redis-check-rdb # 6.4)redis-check-aof # 6.5)redis-trib.rb # # 7)同目录下存在以下两个配置文件: # 7.1)redis.conf # 7.2)redis-PORT.conf # 其中redis.conf为公共配置文件, # redis-PORT.conf为指定端口的配置文件模板, # 同时,需要将redis-PORT.conf文件中的目录和端口分别使用INSTALLDIR和REDISPORT替代,示例: # include INSTALLDIR/conf/redis.conf # pidfile INSTALLDIR/bin/redis-REDISPORT.pid # logfile INSTALLDIR/log/redis-REDISPORT.log # port REDISPORT # dbfilename dump-REDISPORT.rdb # dir INSTALLDIR/data/REDISPORT # # 其中INSTALLDIR将使用参数4的值替换, # 而REDISPORT将使用redis_cluster.nodes中的端口号替代 # # 配置文件redis_cluster.nodes,定义了安装redis的节点 # 文件格式(以“#”打头的为注释): # 每行由IP和端口号组成,两者间可以:空格、逗号、分号、或TAB符分隔 # # 依赖: # 1)mooon_ssh 远程操作多台机器批量命令工具 # 2)mooon_upload 远程操作多台机器批量上传工具 # 3)https://raw.githubusercontent.com/eyjian/libmooon # 4)libmooon又依赖libssh2(http://www.libssh2.org/) BASEDIR=$(dirname $(readlink -f $0)) REDIS_CLUSTER_NODES=$BASEDIR/redis_cluster.nodes # 批量命令工具 MOOON_SSH=mooon_ssh # 批量上传工具 MOOON_UPLOAD=mooon_upload # 创建redis集群工具 REDIS_TRIB=$BASEDIR/redis-trib.rb # redis-server REDIS_SERVER=$BASEDIR/redis-server # redis-cli REDIS_CLI=$BASEDIR/redis-cli # redis-check-aof REDIS_CHECK_AOF=$BASEDIR/redis-check-aof # redis-check-rdb REDIS_CHECK_RDB=$BASEDIR/redis-check-rdb # redis.conf REDIS_CONF=$BASEDIR/redis.conf # redis-PORT.conf REDIS_PORT_CONF=$BASEDIR/redis-PORT.conf # 全局变量 # 组成redis集群的总共节点数 num_nodes=0 # 组成redis集群的所有IP数组 redis_node_ip_array=() # 组成redis集群的所有节点数组(IP+port构造一个redis节点) redis_node_array=() # 用法 function usage() { echo -e "\033[1;33mUsage\033[m: `basename $0` \033[0;32;32mssh-port\033[m install-user \033[0;32;32minstall-user-password\033[m install-dir" echo -e 
"\033[1;33mExample\033[m: `basename $0` \033[0;32;32m22\033[m redis \033[0;32;32mredis^1234\033[m /usr/local/redis-4.0.11" } # 需要指定五个参数 if test $# -ne 4; then usage echo "" exit 1 fi ssh_port="$1" install_user="$2" install_user_password="$3" install_dir="$4" echo -e "[ssh port] \033[1;33m$ssh_port\033[m" echo -e "[install user] \033[1;33m$install_user\033[m" echo -e "[install directory] \033[1;33m$install_dir\033[m" echo "" # 检查ruby是否可用 which ruby > /dev/null 2>&1 if test $? -eq 0; then echo -e "Checking ruby OK" else echo -e "ruby \033[0;32;31mnot exists or not executable\033[m" echo "https://www.ruby-lang.org" echo -e "Exit now\n" exit 1 fi # 检查gem是否可用 which gem > /dev/null 2>&1 if test $? -eq 0; then echo -e "Checking gem OK" else echo -e "gem \033[0;32;31mnot exists or not executable\033[m" echo "https://rubygems.org/pages/download" echo -e "Exit now\n" exit 1 fi # 检查mooon_ssh是否可用 which "$MOOON_SSH" > /dev/null 2>&1 if test $? -eq 0; then echo -e "Checking $MOOON_SSH OK" else echo -e "$MOOON_SSH \033[0;32;31mnot exists or not executable\033[m" echo "There are two versions: C++ and GO:" echo "https://github.com/eyjian/libmooon/releases" echo "https://raw.githubusercontent.com/eyjian/libmooon/master/tools/mooon_ssh.cpp" echo "https://raw.githubusercontent.com/eyjian/libmooon/master/tools/mooon_ssh.go" echo -e "Exit now\n" exit 1 fi # 检查mooon_upload是否可用 which "$MOOON_UPLOAD" > /dev/null 2>&1 if test $? -eq 0; then echo -e "Checking $MOOON_UPLOAD OK" else echo -e "$MOOON_UPLOAD \033[0;32;31mnot exists or not executable\033[m" echo "There are two versions: C++ and GO:" echo "https://github.com/eyjian/libmooon/releases" echo "https://raw.githubusercontent.com/eyjian/libmooon/master/tools/mooon_upload.cpp" echo "https://raw.githubusercontent.com/eyjian/libmooon/master/tools/mooon_upload.go" echo -e "Exit now\n" exit 1 fi # 检查redis-trib.rb是否可用 which "$REDIS_TRIB" > /dev/null 2>&1 if test $? -eq 0; then echo -e "Checking $REDIS_TRIB OK" else echo -e "$REDIS_TRIB \033[0;32;31mnot exists or not executable\033[m" echo -e "Exit now\n" exit 1 fi # 检查redis-server是否可用 which "$REDIS_SERVER" > /dev/null 2>&1 if test $? -eq 0; then echo -e "Checking $REDIS_SERVER OK" else echo -e "$REDIS_SERVER \033[0;32;31mnot exists or not executable\033[m" echo -e "Exit now\n" exit 1 fi # 检查redis-cli是否可用 which "$REDIS_CLI" > /dev/null 2>&1 if test $? -eq 0; then echo -e "Checking $REDIS_CLI OK" else echo -e "$REDIS_CLI \033[0;32;31mnot exists or not executable\033[m" echo -e "Exit now\n" exit 1 fi # 检查redis-check-aof是否可用 which "$REDIS_CHECK_AOF" > /dev/null 2>&1 if test $? -eq 0; then echo -e "Checking $REDIS_CHECK_AOF OK" else echo -e "$REDIS_CHECK_AOF \033[0;32;31mnot exists or not executable\033[m" echo -e "Exit now\n" exit 1 fi # 检查redis-check-rdb是否可用 which "$REDIS_CHECK_RDB" > /dev/null 2>&1 if test $? 
-eq 0; then echo -e "Checking $REDIS_CHECK_RDB OK" else echo -e "$REDIS_CHECK_RDB \033[0;32;31mnot exists or not executable\033[m" echo -e "Exit now\n" exit 1 fi # 检查redis.conf是否可用 if test -r "$REDIS_CONF"; then echo -e "Checking $REDIS_CONF OK" else echo -e "$REDIS_CONF \033[0;32;31mnot exists or not readable\033[m" echo -e "Exit now\n" exit 1 fi # 检查redis-PORT.conf是否可用 if test -r "$REDIS_PORT_CONF"; then echo -e "Checking $REDIS_PORT_CONF OK" else echo -e "$REDIS_PORT_CONF \033[0;32;31mnot exists or not readable\033[m" echo -e "Exit now\n" exit 1 fi # 解析redis_cluster.nodes文件, # 从而得到组成redis集群的所有节点。 function parse_redis_cluster_nodes() { redis_nodes_str= redis_nodes_ip_str= while read line do # 删除前尾空格 line=`echo "$line" | xargs` if test -z "$line" -o "$line" = "#"; then continue fi # 跳过注释 begin_char=${line:0:1} if test "$begin_char" = "#"; then continue fi # 取得IP和端口 eval $(echo "$line" | awk -F[\ \:,\;\t]+ '{ printf("ip=%s\nport=%s\n",$1,$2); }') # IP和端口都必须有 if test ! -z "$ip" -a ! -z "$port"; then if test -z "$redis_nodes_ip_str"; then redis_nodes_ip_str=$ip else redis_nodes_ip_str="$redis_nodes_ip_str,$ip" fi if test -z "$redis_nodes_str"; then redis_nodes_str="$ip:$port" else redis_nodes_str="$redis_nodes_str,$ip:$port" fi fi done if test -z "$redis_nodes_ip_str"; then num_nodes=0 else # 得到IP数组redis_node_ip_array redis_node_ip_array=`echo "$redis_nodes_ip_str" | tr ',' '\n' | sort | uniq` # 得到节点数组redis_node_array redis_node_array=`echo "$redis_nodes_str" | tr ',' '\n' | sort | uniq` for redis_node in ${redis_node_array[@]}; do num_nodes=$((++num_nodes)) echo "$redis_node" done fi } # check redis_cluster.nodes if test ! -r $REDIS_CLUSTER_NODES; then echo -e "File $REDIS_CLUSTER_NODES \033[0;32;31mnot exits\033[m" echo "" echo -e "\033[0;32;32mFile format\033[m (columns delimited by space, tab, comma, semicolon or colon):" echo "IP1 port1" echo "IP2 port2" echo "" echo -e "\033[0;32;32mExample\033[m:" echo "127.0.0.1 6381" echo "127.0.0.1 6382" echo "127.0.0.1 6383" echo "127.0.0.1 6384" echo "127.0.0.1 6385" echo "127.0.0.1 6386" echo -e "Exit now\n" exit 1 else echo -e "\033[0;32;32m" parse_redis_cluster_nodes echo -e "\033[m" if test $num_nodes -lt 1; then echo -e "Checking $REDIS_CLUSTER_NODES \033[0;32;32mfailed\033[m: no any node" echo -e "Exit now\n" exit 1 else echo -e "Checking $REDIS_CLUSTER_NODES OK, the number of nodes is \033[1;33m${num_nodes}\033[m" fi fi # 确认后再继续 while true do # 组成一个redis集群至少需要六个节点 if test $num_nodes -lt 6; then echo -e "\033[0;32;32mAt least 6 nodes are required to create a redis cluster\033[m" fi # 提示是否继续 echo -en "Are you sure to continue? [\033[1;33myes\033[m/\033[1;33mno\033[m]" read -r -p " " input if test "$input" = "no"; then echo -e "Exit now\n" exit 1 elif test "$input" = "yes"; then echo "Starting to install ..." echo "" break fi done # 是否先清空安装目录再安装? clear_install_directory= while true do echo -en "Clear install directory? 
[\033[1;33myes\033[m/\033[1;33mno\033[m]" read -r -p " " clear_install_directory if test "$clear_install_directory" = "no"; then echo "" break elif test "$clear_install_directory" = "yes"; then echo "" break fi done # 安装公共的,包括可执行程序文件和公共配置文件 function install_common() { redis_ip="$1" # 检查安装目录是否存在,且有读写权限 echo "$MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c=\"test -d $install_dir && test -r $install_dir && test -w $install_dir && test -x $install_dir\"" $MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c="test -d $install_dir && test -r $install_dir && test -w $install_dir && test -x $install_dir" if test $? -ne 0; then echo "" echo -e "Directory $install_dir \033[1;33mnot exists or no (rwx) permission\033[m" echo -e "Exit now\n" exit 1 fi # 清空安装目录 if test "$clear_install_directory" = "yes"; then echo "" echo "$MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c=\"killall -q -w -u $install_user redis-server\"" $MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c="killall -q -w -u $install_user redis-server" echo "$MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c=\"rm -fr $install_dir/*\"" $MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c="rm -fr $install_dir/*" if test $? -ne 0; then echo -e "Exit now\n" exit 1 fi fi # 创建公共目录(create directory) echo "" echo "$MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c=\"cd $install_dir;mkdir -p bin conf log data\"" $MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c="cd $install_dir;mkdir -p bin conf log data" if test $? -ne 0; then echo -e "Exit now\n" exit 1 fi # 上传公共配置文件(upload configuration files) echo "" echo "$MOOON_UPLOAD -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -s=redis.conf -d=$install_dir/conf" $MOOON_UPLOAD -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -s=redis.conf -d=$install_dir/conf if test $? -ne 0; then echo -e "Exit now\n" exit 1 fi # 上传公共执行文件(upload executable files) echo "" echo "$MOOON_UPLOAD -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -s=redis-server,redis-cli,redis-check-aof,redis-check-rdb -d=$install_dir/bin" $MOOON_UPLOAD -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -s=redis-server,redis-cli,redis-check-aof,redis-check-rdb,redis-trib.rb -d=$install_dir/bin if test $? -ne 0; then echo -e "Exit now\n" exit 1 fi } # 安装节点配置文件 function install_node_conf() { redis_ip="$1" redis_port="$2" # 生成节点配置文件 cp redis-PORT.conf redis-$redis_port.conf sed -i "s|INSTALLDIR|$install_dir|g;s|REDISPORT|$redis_port|g" redis-$redis_port.conf # 创建节点数据目录(create data directory for the given node) echo "" echo "$MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c=\"cd $install_dir;mkdir -p data/$redis_port\"" $MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c="cd $install_dir;mkdir -p data/$redis_port" if test $? -ne 0; then rm -f redis-$redis_port.conf echo -e "Exit now\n" exit 1 fi # 上传节点配置文件(upload configuration files) echo "" echo "$MOOON_UPLOAD -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -s=redis-$redis_port.conf -d=$install_dir/conf" $MOOON_UPLOAD -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -s=redis-$redis_port.conf -d=$install_dir/conf if test $? 
-ne 0; then rm -f redis-$redis_port.conf echo -e "Exit now\n" exit 1 fi rm -f redis-$redis_port.conf } function start_redis_node() { redis_ip="$1" redis_port="$2" # 启动redis实例(start redis instance) echo "" echo "$MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c=\"$install_dir/bin/redis-server $install_dir/conf/redis-$redis_port.conf\"" $MOOON_SSH -h=$redis_ip -P=$ssh_port -u=$install_user -p=$install_user_password -c="nohup $install_dir/bin/redis-server $install_dir/conf/redis-$redis_port.conf > /dev/null 2>&1 &" if test $? -ne 0; then echo -e "Exit now\n" exit 1 fi } # 安装公共的,包括可执行程序文件和公共配置文件 echo "" echo -e "\033[1;33m================================\033[m" for redis_node_ip in $redis_node_ip_array; do echo -e "[\033[1;33m$redis_node_ip\033[m] Installing common ..." install_common $redis_node_ip done # 安装节点配置文件 echo "" echo -e "\033[1;33m================================\033[m" for redis_node in ${redis_node_array[@]}; do node_ip= node_port= eval $(echo "$redis_node" | awk -F[\ \:,\;\t]+ '{ printf("node_ip=%s\nnode_port=%s\n",$1,$2); }') if test -z "$node_ip" -o -z "$node_port"; then continue fi echo -e "[\033[1;33m$node_ip:$node_port\033[m] Installing node ..." install_node_conf $node_ip $node_port done # 确认后再继续 echo "" echo -e "\033[1;33m================================\033[m" while true do echo -en "Start redis? [\033[1;33myes\033[m/\033[1;33mno\033[m]" read -r -p " " input if test "$input" = "no"; then echo "" exit 1 elif test "$input" = "yes"; then echo "Starting to start redis ..." echo "" break fi done # 启动redis实例(start redis instance) for redis_node in ${redis_node_array[@]}; do eval $(echo "$redis_node" | awk -F[\ \:,\;\t]+ '{ printf("node_ip=%s\nnode_port=%s\n",$1,$2); }') if test -z "$node_ip" -o -z "$node_port"; then continue fi echo -e "[\033[1;33m$node_ip:$node_port\033[m] Starting node ..." start_redis_node $node_ip $node_port done echo "" echo -e "\033[1;33m================================\033[m" echo "Number of nodes: $num_nodes" if test $num_nodes -lt 6; then echo "Number of nodes less than 6, can not create redis cluster" echo -e "Exit now\n" exit 1 else redis_nodes_str=`echo "$redis_nodes_str" | tr ',' ' '` # 确认后再继续 echo "" while true do echo -en "Create redis cluster? [\033[1;33myes\033[m/\033[1;33mno\033[m]" read -r -p " " input if test "$input" = "no"; then echo "" exit 1 elif test "$input" = "yes"; then echo "Starting to create redis cluster with $redis_nodes_str ... ..." echo "" break fi done # 创建redis集群(create redis cluster) # redis-trib.rb create --replicas 1 $REDIS_TRIB create --replicas 1 $redis_nodes_str echo -e "Exit now\n" exit 0 fi
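To make the INSTALLDIR/REDISPORT substitution performed by install_node_conf concrete: assuming the values from the usage example (install directory /usr/local/redis-4.0.11) and a node on port 6381, the redis-PORT.conf template documented in the header comments would expand roughly as follows after the sed step; cluster-wide options such as cluster-enabled are assumed to live in the shared redis.conf.

# redis-6381.conf (generated from redis-PORT.conf)
include /usr/local/redis-4.0.11/conf/redis.conf
pidfile /usr/local/redis-4.0.11/bin/redis-6381.pid
logfile /usr/local/redis-4.0.11/log/redis-6381.log
port 6381
dbfilename dump-6381.rdb
dir /usr/local/redis-4.0.11/data/6381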
工具的作用: 1)比“cluster nodes”更为直观的显示结果 2)指出落在同一个IP上的master 3)指出落在同一个IP上的master和slave对 运行效果图: 源代码(可从https://github.com/eyjian/redis-tools下载): 点击(此处)折叠或打开 #!/bin/bash # 源码位置:https://github.com/eyjian/redis-tools # # Redis集群部署注意事项: # 在一个物理IP上部署多个redis实例时,要避免: # 1) 一对master和slave出现在同一物理IP上(影响:物理机挂掉,部分keys彻底不可用) # 2) 同一物理IP上出现多个master(影响:物理机挂掉,将导致两对切换) # # 使用示例(带一个参数): # show_redis_map.sh 192.168.0.21:6380 # # 检查Redis集群master和slave映射关系的命令行工具: # 1) 查看是否多master出现在同一IP; # 2) 查看一对master和slave出现在同一IP。 # # 当同一IP出现二个或多个master,则相应的行标星显示, # 如果一对master和slave出现在同一IP上,则相应的行标星显示。 # # 输出效果: # [01][MASTER] 192.168.0.21:6379 00cc3f37d938ee8ba672bc77b71d8e0a3881a98b # [02][MASTER] 192.168.0.22:6379 1115713e3c311166207f3a9f1445b4e32a9202d7 # [03][MASTER] 192.168.0.23:6379 5cb6946f46ccdf543e5a1efada6806f3df72b727 # [04][MASTER] 192.168.0.24:6379 b91b1309b05f0dcc1e3a2a9521b8c00702999744 # [05][MASTER] 192.168.0.25:6379 00a1ba8e5cb940ba4171e0f4415b91cea96977bc # [06][MASTER] 192.168.0.26:6379 64facb201cc5c7d8cdccb5fa211af5e1a04a9786 # [07][MASTER] 192.168.0.27:6379 f119780359c0e43d19592db01675df2f776181b1 # [08][MASTER] 192.168.0.28:6379 d374e28578967f96dcb75041e30a5a1e23693e56 # [09][MASTER] 192.168.0.29:6380 a153d2071251657004dbe77abd10e2de7f0a209a # # [01][SLAVE=>MASTER] 192.168.0.21:6380 => 192.168.0.28:6379 # [02][SLAVE=>MASTER] 192.168.0.22:6380 => 192.168.0.25:6379 # [03][SLAVE=>MASTER] 192.168.0.23:6380 => 192.168.0.24:6379 # [04][SLAVE=>MASTER] 192.168.0.24:6380 => 192.168.0.23:6379 # [05][SLAVE=>MASTER] 192.168.0.25:6380 => 192.168.0.22:6379 # [06][SLAVE=>MASTER] 192.168.0.26:6380 => 192.168.0.27:6379 # [07][SLAVE=>MASTER] 192.168.0.27:6380 => 192.168.0.29:6380 # [08][SLAVE=>MASTER] 192.168.0.28:6380 => 192.168.0.21:6379 # [09][SLAVE=>MASTER] 192.168.0.29:6379 => 192.168.0.26:6379 REDIS_CLI=${REDIS_CLI:-redis-cli} REDIS_IP=${REDIS_IP:-127.0.0.1} REDIS_PORT=${REDIS_PORT:-6379} function usage() { echo "usage: `basename $0` redis_node" echo "example: `basename $0` 127.0.0.1:6379" } # with a parameter: single redis node if test $# -ne 1; then usage exit 1 fi # 检查参数 eval $(echo "$1" | awk -F[\:] '{ printf("REDIS_IP=%s\nREDIS_PORT=%s\n",$1,$2) }') if test -z "$REDIS_IP" -o -z "$REDIS_PORT"; then echo "parameter error" usage exit 1 fi # 确保redis-cli可用 which "$REDIS_CLI" > /dev/null 2>&1 if test $? 
-ne 0; then echo -e "\`redis-cli\` not exists or not executable" exit 1 fi # master映射表,key为master的id,value为master的“ip:port” declare -A master_map=() # slave映表,key为master的id,value为slave的“ip:port” declare -A slave_map=() master_nodes_str= master_slave_maps_str= # 找出所有master masters=`$REDIS_CLI -h $REDIS_IP -p $REDIS_PORT CLUSTER NODES | awk -F[\ \@] '/master/{ printf("%s,%s\n",$1,$2); }' | sort` for master in $masters; do eval $(echo $master | awk -F[,] '{ printf("master_id=%s\nmaster_node=%s\n",$1,$2); }') master_map[$master_id]=$master_node if test -z "$master_nodes_str"; then master_nodes_str="$master_node|$master_id" else master_nodes_str="$master_node|$master_id,$master_nodes_str" fi done # 找出所有slave # “CLUSTER NODES”命令的输出格式当前有两个版本,需要awk需要根据NF的值做区分 slaves=`$REDIS_CLI -h $REDIS_IP -p $REDIS_PORT CLUSTER NODES | awk -F[\ \@] '/slave/{ if (NF==9) printf("%s,%s\n",$5,$2); else printf("%s,%s\n",$4,$2); }' | sort` for slave in $slaves; do eval $(echo $slave | awk -F[,] '{ printf("master_id=%s\nslave_node=%s\n",$1,$2); }') slave_map[$master_id]=$slave_node done for key in ${!master_map[@]} do master_node=${master_map[$key]} slave_node=${slave_map[$key]} if test -z "$master_slave_maps_str"; then master_slave_maps_str="$slave_node|$master_node" else master_slave_maps_str="$slave_node|$master_node,$master_slave_maps_str" fi done # 显示所有master index=1 old_master_node_ip= master_nodes_str=`echo "$master_nodes_str" | tr ',' '\n' | sort` for master_node_str in $master_nodes_str; do eval $(echo "$master_node_str" | awk -F[\|] '{ printf("master_node=%s\nmaster_id=%s\n", $1, $2); }') eval $(echo "$master_node" | awk -F[\:] '{ printf("master_node_ip=%s\nmaster_node_port=%s\n", $1, $2); }') tag= # 同一IP上出现多个master,标星 if test "$master_node_ip" = "$old_master_node_ip"; then tag=" (*)" fi printf "[%02d][MASTER] %-20s \033[0;32;31m%s\033[m%s\n" $index "$master_node" "$master_id" "$tag" old_master_node_ip=$master_node_ip index=$((++index)) done # 显示所有slave到master的映射 index=1 echo "" master_slave_maps_str=`echo "$master_slave_maps_str" | tr ',' '\n' | sort` for master_slave_map_str in $master_slave_maps_str; do eval $(echo "$master_slave_map_str" | awk -F[\|] '{ printf("slave_node=%s\nmaster_node=%s\n", $1, $2); }') eval $(echo "$slave_node" | awk -F[\:] '{ printf("slave_node_ip=%s\nslave_node_port=%s\n", $1, $2); }') eval $(echo "$master_node" | awk -F[\:] '{ printf("master_node_ip=%s\nmaster_node_port=%s\n", $1, $2); }') tag= # 一对master和slave出现在同一IP,标星 if test ! -z "$slave_node_ip" -a "$slave_node_ip" = "$master_node_ip"; then tag=" (*)" fi n=$(($index % 2)) if test $n -eq 0; then printf "[%02d][SLAVE=>MASTER] \033[1;33m%21s\033[m => \033[1;33m%s\033[m%s\n" $index $slave_node $master_node "$tag" else printf "[%02d][SLAVE=>MASTER] %21s => %s%s\n" $index $slave_node $master_node "$tag" fi index=$((++index)) done echo ""
在一台物理机上启动6个Redis实例,组成3主3从集群,端口号依次为:1379 ~ 1384,端口号1379、1380和1384三个为master,端口1379的进程ID为17620。现将进程17620暂停(发送SIGSTOP信号),观察集群发现故障时长,和主从切换时长。 暂停进程17620(端口1379),然后每秒查看一次集群状态 $ kill -19 17620;for ((i=0;i<10000000;++i)) do date +'[%H:%M:%S]';redis-cli -c -p 1380 cluster nodes;echo "";sleep 1; done [14:23:51]f03b1008988acbb0f69d96252decda9adf747be9 192.168.31.98:1384 master - 0 1525847030599 137 connected 1987 10923-16383c1a9d1d23438241803ec97fbd765737df80f402a 192.168.31.98:1381 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1525847031200 137 connected4e932f2a3d80de29798660c5ea62e473e63a6630 192.168.31.98:1383 slave f6080015129eada3261925cc1b466f1824263358 0 1525847031100 134 connected689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 192.168.31.98:1382 slave fa7bbbf7d48389409ce05d303272078c3a6fd44f 0 1525847030097 132 connectedfa7bbbf7d48389409ce05d303272078c3a6fd44f 192.168.31.98:1379 master - 0 1525847030799 132 connected 0-1986 1988-5457f6080015129eada3261925cc1b466f1824263358 192.168.31.98:1380 myself,master - 0 0 134 connected 5458-10922 [14:23:52] 第1秒故障还未被发现f03b1008988acbb0f69d96252decda9adf747be9 192.168.31.98:1384 master - 0 1525847031602 137 connected 1987 10923-16383c1a9d1d23438241803ec97fbd765737df80f402a 192.168.31.98:1381 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1525847031200 137 connected4e932f2a3d80de29798660c5ea62e473e63a6630 192.168.31.98:1383 slave f6080015129eada3261925cc1b466f1824263358 0 1525847031100 134 connected689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 192.168.31.98:1382 slave fa7bbbf7d48389409ce05d303272078c3a6fd44f 0 1525847031602 132 connectedfa7bbbf7d48389409ce05d303272078c3a6fd44f 192.168.31.98:1379 master - 1525847032302 1525847030799 132 connected 0-1986 1988-5457f6080015129eada3261925cc1b466f1824263358 192.168.31.98:1380 myself,master - 0 0 134 connected 5458-10922 [14:23:53] 第2秒故障还未被发现f03b1008988acbb0f69d96252decda9adf747be9 192.168.31.98:1384 master - 0 1525847033103 137 connected 1987 10923-16383c1a9d1d23438241803ec97fbd765737df80f402a 192.168.31.98:1381 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1525847032703 137 connected4e932f2a3d80de29798660c5ea62e473e63a6630 192.168.31.98:1383 slave f6080015129eada3261925cc1b466f1824263358 0 1525847032602 134 connected689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 192.168.31.98:1382 slave fa7bbbf7d48389409ce05d303272078c3a6fd44f 0 1525847033103 132 connectedfa7bbbf7d48389409ce05d303272078c3a6fd44f 192.168.31.98:1379 master - 1525847032302 1525847030799 132 connected 0-1986 1988-5457f6080015129eada3261925cc1b466f1824263358 192.168.31.98:1380 myself,master - 0 0 134 connected 5458-10922 [14:23:54] 第3秒故障还未被发现f03b1008988acbb0f69d96252decda9adf747be9 192.168.31.98:1384 master - 0 1525847033604 137 connected 1987 10923-16383c1a9d1d23438241803ec97fbd765737df80f402a 192.168.31.98:1381 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1525847034205 137 connected4e932f2a3d80de29798660c5ea62e473e63a6630 192.168.31.98:1383 slave f6080015129eada3261925cc1b466f1824263358 0 1525847034106 134 connected689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 192.168.31.98:1382 slave fa7bbbf7d48389409ce05d303272078c3a6fd44f 0 1525847033103 132 connectedfa7bbbf7d48389409ce05d303272078c3a6fd44f 192.168.31.98:1379 master - 1525847032302 1525847030799 132 connected 0-1986 1988-5457f6080015129eada3261925cc1b466f1824263358 192.168.31.98:1380 myself,master - 0 0 134 connected 5458-10922 [14:23:55] 第4秒发现故障,但未选举出新的masterf03b1008988acbb0f69d96252decda9adf747be9 192.168.31.98:1384 master - 0 1525847034606 137 connected 1987 
10923-16383c1a9d1d23438241803ec97fbd765737df80f402a 192.168.31.98:1381 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1525847034205 137 connected4e932f2a3d80de29798660c5ea62e473e63a6630 192.168.31.98:1383 slave f6080015129eada3261925cc1b466f1824263358 0 1525847034106 134 connected689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 192.168.31.98:1382 slave fa7bbbf7d48389409ce05d303272078c3a6fd44f 0 1525847034606 132 connectedfa7bbbf7d48389409ce05d303272078c3a6fd44f 192.168.31.98:1379 master,fail? - 1525847032302 1525847030799 132 connected 0-1986 1988-5457f6080015129eada3261925cc1b466f1824263358 192.168.31.98:1380 myself,master - 0 0 134 connected 5458-10922 [14:23:56] 第5秒,仍未选举出新的masterf03b1008988acbb0f69d96252decda9adf747be9 192.168.31.98:1384 master - 0 1525847036207 137 connected 1987 10923-16383c1a9d1d23438241803ec97fbd765737df80f402a 192.168.31.98:1381 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1525847035706 137 connected4e932f2a3d80de29798660c5ea62e473e63a6630 192.168.31.98:1383 slave f6080015129eada3261925cc1b466f1824263358 0 1525847035606 134 connected689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 192.168.31.98:1382 slave fa7bbbf7d48389409ce05d303272078c3a6fd44f 0 1525847036206 132 connectedfa7bbbf7d48389409ce05d303272078c3a6fd44f 192.168.31.98:1379 master,fail - 1525847032302 1525847030799 132 connected 0-1986 1988-5457f6080015129eada3261925cc1b466f1824263358 192.168.31.98:1380 myself,master - 0 0 134 connected 5458-10922 [14:23:57] 第6秒,选择出新的masterf03b1008988acbb0f69d96252decda9adf747be9 192.168.31.98:1384 master - 0 1525847036207 137 connected 1987 10923-16383c1a9d1d23438241803ec97fbd765737df80f402a 192.168.31.98:1381 slave f03b1008988acbb0f69d96252decda9adf747be9 0 1525847037212 137 connected4e932f2a3d80de29798660c5ea62e473e63a6630 192.168.31.98:1383 slave f6080015129eada3261925cc1b466f1824263358 0 1525847036606 134 connected689f7c1ae71ea294c4ad7c5d1b32ae4e78e27915 192.168.31.98:1382 master - 0 1525847036206 138 connected 0-1986 1988-5457fa7bbbf7d48389409ce05d303272078c3a6fd44f 192.168.31.98:1379 master,fail - 1525847032302 1525847030799 132 connectedf6080015129eada3261925cc1b466f1824263358 192.168.31.98:1380 myself,master - 0 0 134 connected 5458-10922 与时间有关的配置项:repl-ping-slave-period 1repl-timeout 10cluster-node-timeout 3000
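Written one directive per line, the timing-related settings used in this test determine how quickly the failure is detected and handled (the comments are interpretation, not part of the original configuration):

repl-ping-slave-period 1    # interval, in seconds, at which the master pings its slaves
repl-timeout 10             # replication is considered broken after 10 seconds without data or ping/ack
cluster-node-timeout 3000   # a node unreachable for 3000 ms is suspected (PFAIL) and flagged FAIL once a
                            # quorum agrees, which matches the "fail?" mark around the 4th second and the
                            # new master elected by the 6th second in the output above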
目录 目录 1 1. 前言 2 2. 缩略语 2 3. 配置和主题 3 3.1. 配置和主题结构 3 3.1.1. Conf 3 3.1.2. ConfImpl 3 3.1.3. Topic 3 3.1.4. TopicImpl 3 4. 线程 4 5. 消费者 5 5.1. 消费者结构 5 5.1.1. Handle 5 5.1.2. HandleImpl 5 5.1.3. ConsumeCb 6 5.1.4. EventCb 6 5.1.5. Consumer 7 5.1.6. KafkaConsumer 7 5.1.7. KafkaConsumerImpl 7 5.1.8. rd_kafka_message_t 7 5.1.9. rd_kafka_msg_s 7 5.1.10. rd_kafka_msgq_t 8 5.1.11. rd_kafka_toppar_t 8 6. 生产者 10 6.1. 生产者结构 10 6.1.1. DeliveryReportCb 11 6.1.2. PartitionerCb 11 6.1.3. Producer 11 6.1.4. ProduceImpl 11 6.2. 生产者启动过程1 11 6.3. 生产者启动过程2 12 6.4. 生产者生产过程 14 7. poll过程 15 1. 前言 librdkafka提供的异步的生产接口,异步的消费接口和同步的消息接口,没有同步的生产接口。 2. 缩略语 缩略语 缩略语全称 示例或说明 rd Rapid Development rd.h rk RdKafka toppar Topic Partition struct rd_kafka_toppar_t { }; rep Reply, struct rd_kafka_t { rd_kafka_q_t *rk_rep }; msgq Message Queue struct rd_kafka_msgq_t { }; rkb RdKafka Broker Kafka代理 rko RdKafka Operation Kafka操作 rkm RdKafka Message Kafka消息 payload 存在Kafka上的消息(或叫Log) 3. 配置和主题 3.1. 配置和主题结构 3.1.1. Conf 配置接口,配置分两种:全局的和主题的。 3.1.2. ConfImpl 配置的实现。 3.1.3. Topic 主题接口。 3.1.4. TopicImpl 主题的实现。 4. 线程 RdKafka编程涉及到三类线程: 1) 应用线程,业务代码的实现 2) Kafka Broker线程rd_kafka_broker_thread_main,负责与Broker通讯,多个 3) Kafka Handler线程rd_kafka_thread_main,每创建一个consumer或producer即会创建一个Handler线程。 5. 消费者 5.1. 消费者结构 5.1.1. Handle 定义了poll等接口,它的实现者为HandleImpl。 5.1.2. HandleImpl 实现了消费者和生产者均使用的poll等,其中poll的作用为: 1) 为生产者回调消息发送结果; 2) 为生产者和消费者回调事件。 class Handle { /** * @brief Polls the provided kafka handle for events. * * Events will trigger application provided callbacks to be called. * * The \p timeout_ms argument specifies the maximum amount of time * (in milliseconds) that the call will block waiting for events. * For non-blocking calls, provide 0 as \p timeout_ms. * To wait indefinately for events, provide -1. * * Events: * - delivery report callbacks (if an RdKafka::DeliveryCb is configured) [producer] * - event callbacks (if an RdKafka::EventCb is configured) [producer & consumer] * * @remark An application should make sure to call poll() at regular * intervals to serve any queued callbacks waiting to be called. * * @warning This method MUST NOT be used with the RdKafka::KafkaConsumer, * use its RdKafka::KafkaConsumer::consume() instead. * * @returns the number of events served. */ virtual int poll(int timeout_ms) = 0; }; 5.1.3. ConsumeCb 只针对消费者的Callback。 5.1.4. RebalanceCb 只针对消费者的Callback。 5.1.5. EventCb 消费者和生产者均可设置EventCb,如:_global_conf->set("event_cb", &_event_cb, errmsg);。 /** * @brief Event callback class * * Events are a generic interface for propagating errors, statistics, logs, etc * from librdkafka to the application. * * @sa RdKafka::Event */ class RD_EXPORT EventCb { public: /** * @brief Event callback * * @sa RdKafka::Event */ virtual void event_cb (Event &event) = 0; virtual ~EventCb() { } }; /** * @brief Event object class as passed to the EventCb callback. */ class RD_EXPORT Event { public: /** @brief Event type */ enum Type { EVENT_ERROR, /**< Event is an error condition */ EVENT_STATS, /**< Event is a statistics JSON document */ EVENT_LOG, /**< Event is a log message */ EVENT_THROTTLE /**< Event is a throttle level signaling from the broker */ }; }; 5.1.6. Consumer 简单消息者,一般不使用,而是使用KafkaConsumer。 5.1.7. KafkaConsumer 消费者和生产者均采用多重继承方式,其中KafkaConsumer为消费者接口,KafkaConsumerImpl为消费者实现。 5.1.8. KafkaConsumerImpl KafkaConsumerImpl为消费者实现。 5.1.9. rd_kafka_message_t 消息结构。 5.1.10. 
rd_kafka_msg_s 消息结构,但消息数据实际存储在rd_kafka_message_t,结构大致如下: struct rd_kafka_msg_s { rd_kafka_message_t rkm_rkmessage; struct { rd_kafka_msg_s* tqe_next; rd_kafka_msg_s** tqe_prev; int64_t rkm_timestamp; rd_kafka_timestamp_type_t rkm_tstype; }rkm_link; }; 5.1.11. rd_kafka_msgq_t 存储消息的消息队列,生产者生产的消息并不直接socket发送到brokers,而是放入了这个队列,结构大致如下: struct rd_kafka_msgq_t { struct { rd_kafka_msg_s* tqh_first; // 队首 rd_kafka_msg_s* tqh_last; // 队尾 }; // 消息个数 rd_atomic32_t rkmq_msg_cnt; // 所有消息加起来的字节数 rd_atomic64_t rkmq_msg_bytes; }; 5.1.12. rd_kafka_toppar_t Topic-Partition队列,很复杂的一个结构,部分内容如下: // Topic + Partition combination typedef struct rd_kafka_toppar_s { struct { rd_kafka_toppar_s* tqe_next; rd_kafka_toppar_s** tqe_prev; }rktp_rklink; struct { rd_kafka_toppar_s* tqe_next; rd_kafka_toppar_s** tqe_prev; }rktp_rkblink; struct { rd_kafka_toppar_s* cqe_next; rd_kafka_toppar_s* cqe_prev; }rktp_fetchlink; struct { rd_kafka_toppar_s* tqe_next; rd_kafka_toppar_s** tqe_prev; }rktp_rktlink; struct { rd_kafka_toppar_s* tqe_next; rd_kafka_toppar_s** tqe_prev; }rktp_cgrplink; rd_kafka_itopic_t* rktp_rkt; int32_t rktp_partition; int32_t rktp_leader_id; rd_kafka_broker_t* rktp_leader; rd_kafka_broker_t* rktp_next_leader; rd_refcnt_t rktp_refcnt; rd_kafka_msgq_t rktp_msgq; // application->rdkafka queue }rd_kafka_toppar_t; 6. 生产者 6.1. 生产者结构 6.1.1. DeliveryReportCb 消息已经成功递送到Broker时回调,只针对生产者有效。 6.1.2. PartitionerCb 计算分区号回调函数,只针对生产者有效。 6.1.3. Producer Producer为生产者接口,它的实现者为ProducerImpl。 6.1.4. ProduceImpl ProducerImpl为生产者的实现。 6.2. 生产者启动过程1 启动时会创建两组线程:一组Broker线程(rd_kafka_broker_thread_main,多个),实为与Broker间的网络IO线程;一组Handler线程(rd_kafka_thread_main,单个),每调用一次RdKafka::Producer::create或rd_kafka_new即创建一Handler线程。 Handler线程调用栈: (gdb) t 17 [Switching to thread 17 (Thread 0x7ff7059d3700 (LWP 16765))] #0 0x00007ff7091e6cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 (gdb) bt #0 0x00007ff7091e6cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000005b4d2f in cnd_timedwait_ms (cnd=0x1517748, mtx=0x1517720, timeout_ms=898) at tinycthread.c:501 #2 0x0000000000580e16 in rd_kafka_q_serve (rkq=0x1517720, timeout_ms=898, max_cnt=0, cb_type=RD_KAFKA_Q_CB_CALLBACK, callback=0x0, opaque=0x0) at rdkafka_queue.c:440 #3 0x000000000054ee9b in rd_kafka_thread_main (arg=0x1516df0) at rdkafka.c:1227 #4 0x00000000005b4e0f in _thrd_wrapper_function (aArg=0x15179d0) at tinycthread.c:624 #5 0x00007ff7091e2e25 in start_thread () from /lib64/libpthread.so.0 #6 0x00007ff7082d135d in clone () from /lib64/libc.so.6 6.3. 生产者启动过程2 创建网络IO线程,消费者启动过程类似,只是一个调用rd_kafka_broker_producer_serve(rkb),另一个调用rd_kafka_broker_consumer_serve(rkb)。 IO线程负责消息的收和发,发送底层调用的是sendmsg,收调用的是recvmsg(但MSVC平台调用send和recv)。 6.4. 
生产者生产过程 生产者生产的消息并不直接socket发送到brokers,而是放入队列rd_kafka_msgq_t中。Broker线程(rd_kafka_broker_thread_main)消费这个队列。 Broker线程同时监控与Broker间的网络连接,又要监控队列中是否有数据,如何实现的?这个队列和管道绑定在一起的,绑定的是管道写端(rktp->rktp_msgq_wakeup_fd = rkb->rkb_toppar_wakeup_fd; rkb->rkb_toppar_wakeup_fd=rkb->rkb_wakeup_fd[1])。 这样Broker线程即可同时监听网络数据和管道数据。 // int rd_kafka_msg_partitioner(rd_kafka_itopic_t *rkt, rd_kafka_msg_t *rkm,int do_lock) (gdb) p *rkm $7 = {rkm_rkmessage = {err = RD_KAFKA_RESP_ERR_NO_ERROR, rkt = 0x1590c10, partition = 1, payload = 0x7f48c4001260, len = 203, key = 0x7f48c400132b, key_len = 14, offset = 0, _private = 0x0}, rkm_link = {tqe_next = 0x5b5d47554245445b, tqe_prev = 0x6361667265746e69}, rkm_flags = 196610, rkm_timestamp = 1524829399009, rkm_tstype = RD_KAFKA_TIMESTAMP_CREATE_TIME, rkm_u = {producer = {ts_timeout = 16074575505526, ts_enq = 16074275505526}}} (gdb) p rkm->rkm_rkmessage $8 = {err = RD_KAFKA_RESP_ERR_NO_ERROR, rkt = 0x1590c10, partition = 1, payload = 0x7f48c4001260, len = 203, key = 0x7f48c400132b, key_len = 14, offset = 0, _private = 0x0} (gdb) p rkm->rkm_rkmessage->payload $9 = (void *) 0x7f48c4001260 (gdb) p (char*)rkm->rkm_rkmessage->payload $10 = 0x7f48c4001260 "{\"p\":\"f\",\"o\":1,\"d\":\"m\",\"d\":\"m\",\"i\":\"f2\",\"ip\":\"127.0.0.1\",\"pt\":2018,\"sc\":0,\"fc\":1,\"tc\":0,\"acc\":395,\"mcc\":395,\"cd\":\"test\",\"cmd\":\"tester\",\"cf\":\"main\",\"cp\":\"1.49.16.9"... 7. poll过程 poll的作用是触发回调,生产者即使不调用poll,消息也会发送出去,但是如果不通过poll触发回调,则不能确定消息发送状态(成功或失败等)。 消费队列rd_kafka_t->rk_rep,rk_rep为响应队列,类型为rd_kafka_q_t或rd_kafka_q_s:
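To tie sections 6 and 7 together at the API level, the following is a minimal, non-authoritative C++ sketch (broker address, topic name and payload are made-up placeholders): produce() only appends the message to the internal queue described above, the broker thread does the actual network send, and poll() is what drives the delivery-report callback.

#include <iostream>
#include <string>
#include <librdkafka/rdkafkacpp.h>

// Delivery report callback: invoked from poll() once the broker has acked (or rejected) a message.
class ExampleDeliveryReportCb : public RdKafka::DeliveryReportCb
{
public:
    void dr_cb(RdKafka::Message& message) override
    {
        if (message.err())
            std::cerr << "delivery failed: " << message.errstr() << std::endl;
        else
            std::cout << "delivered " << message.len() << " bytes to partition "
                      << message.partition() << std::endl;
    }
};

int main()
{
    std::string errstr;
    ExampleDeliveryReportCb dr_cb;

    RdKafka::Conf* conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
    conf->set("metadata.broker.list", "127.0.0.1:9092", errstr); // placeholder broker
    conf->set("dr_cb", &dr_cb, errstr);                          // register the delivery report callback

    // Creating the producer starts the handler thread (rd_kafka_thread_main)
    // and the broker IO threads (rd_kafka_broker_thread_main).
    RdKafka::Producer* producer = RdKafka::Producer::create(conf, errstr);
    RdKafka::Topic* topic = RdKafka::Topic::create(producer, "test", NULL, errstr); // placeholder topic

    std::string payload = "hello";
    // produce() is asynchronous: the message is only enqueued here.
    producer->produce(topic, RdKafka::Topic::PARTITION_UA, RdKafka::Producer::RK_MSG_COPY,
                      const_cast<char*>(payload.data()), payload.size(), NULL, NULL);

    // Without poll() the message is still sent by the broker thread,
    // but dr_cb() would never run and the delivery result would remain unknown.
    while (producer->outq_len() > 0)
        producer->poll(100);

    delete topic;
    delete producer;
    delete conf;
    return 0;
}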
// What does the following code print?
#include <iostream>
#include <string>

// typedef basic_ostream<char> ostream;
class A
{
private:
    int m1,m2;

public:
    A(int a, int b)
    {
        m1=a;
        m2=b;
    }

    operator std::string() const { return "str"; }
    operator int() const { return 2018; }
};

int main()
{
    A a(1,2);
    std::cout << a;
    return 0;
}

The answer is 2018. The inserter for std::string is a non-member function template (shown below), and template argument deduction does not consider user-defined conversions, so "operator std::string()" never makes that overload a candidate. The member function basic_ostream::operator<<(int), however, is reachable through the user-defined conversion "operator int()", so it is selected and 2018 is printed. The relevant declarations:

// Non-member function template in namespace std
/usr/include/c++/4.8.2/bits/basic_string.h:
template<typename _CharT, typename _Traits, typename _Alloc>
inline basic_ostream<_CharT, _Traits>&
operator <<(basic_ostream<_CharT, _Traits>& __os, const basic_string<_CharT, _Traits, _Alloc>& __str)
{
    return __ostream_insert(__os, __str.data(), __str.size());
}

// Member function of class basic_ostream
// (std::cout is an instance of basic_ostream in namespace std)
ostream:
__ostream_type& basic_ostream::operator<<(int __n);

// What is wrong with the following code, and how can it be fixed?
#include <iostream>
#include <string>

class A
{
public:
    int m1,m2;

public:
    A(int a, int b)
    {
        m1=a;
        m2=b;
    }

    std::ostream& operator <<(std::ostream& os)
    {
        os << m1 << m2;
        return os;
    }
};

int main()
{
    A a(1,2);
    std::cout << a;
    return 0;
}

Class basic_ostream has no member "operator<<(const A&)", and there is no global "operator<<(std::ostream&, const A&)" either. A member overloaded operator is only considered when the left operand is an object of its own class, so "std::cout << a" matches nothing and the code fails to compile.

There are two ways to fix it:
1) change "std::cout << a" into "a.operator<<(std::cout)" (equivalently "a << std::cout"); or
2) define a global inserter:
std::ostream& operator<<(std::ostream& os, const A& a)
{
    os << a.m1 << a.m2;
    return os;
}
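A small complementary check (not part of the original text) that makes the deduction argument concrete: once the conversion to std::string is performed explicitly, the non-member string inserter can deduce its arguments and "str" is printed.

#include <iostream>
#include <string>

class A
{
public:
    operator std::string() const { return "str"; }
    operator int() const { return 2018; }
};

int main()
{
    A a;
    std::cout << a << "\n";                           // prints 2018: reached via operator int()
    std::cout << static_cast<std::string>(a) << "\n"; // prints str: the string overload now deduces
    return 0;
}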
When a Linux server accumulates too many TIME_WAIT connections, the first instinct is usually to shorten the TIME_WAIT duration so that fewer of them pile up, but Linux provides no interface for that; the only way is to recompile the kernel. The default TIME_WAIT duration is about 60 seconds, defined in the kernel's include/net/tcp.h:

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
                                  * state, about 60 seconds */
#define TCP_FIN_TIMEOUT TCP_TIMEWAIT_LEN
                                 /* BSD style FIN_WAIT2 deadlock breaker.
                                  * It used to be 3min, new value is 60sec,
                                  * to combine FIN-WAIT-2 timeout with
                                  * TIME-WAIT timer. */

Note that tcp_fin_timeout is NOT the TIME_WAIT duration:

# cat /proc/sys/net/ipv4/tcp_fin_timeout
60

tcp_fin_timeout is actually the duration of the FIN_WAIT_2 state. Linux offers no knob for the TIME_WAIT duration short of changing the macro and rebuilding the kernel; Windows, by contrast, exposes it through the TcpTimedWaitDelay registry value.

RTO: Retransmission Timeout.

TIME_WAIT is a common, recurring problem. The related settings (in /etc/sysctl.conf or under /proc/sys/net/ipv4) are:
1) net.ipv4.tcp_timestamps: 1 enables TCP timestamps, used to compute the round-trip time (RTT) and to protect against sequence-number wraparound
2) net.ipv4.tcp_tw_reuse: 1 allows sockets in TIME-WAIT to be reused for new TCP connections
3) net.ipv4.tcp_tw_recycle: 1 enables fast recycling of TIME-WAIT sockets; in NAT environments this may cause SYN packets to be dropped (answered with RST)
4) net.ipv4.tcp_fin_timeout: timeout of the FIN_WAIT_2 state
5) net.ipv4.tcp_syncookies: 1 enables SYN cookies when the SYN backlog overflows, which helps against small-scale SYN floods
6) net.ipv4.tcp_max_tw_buckets: maximum number of TIME_WAIT sockets kept; beyond this number TIME_WAIT sockets are destroyed immediately and a warning is printed
7) net.ipv4.ip_local_port_range: range of local ports available for outgoing connections
8) net.ipv4.tcp_max_syn_backlog: kernel limit on a port's SYN backlog, preventing excessive kernel memory usage
9) net.ipv4.tcp_syn_retries: how many SYNs the kernel sends for a new connection before giving up; should not be greater than 255
10) net.ipv4.tcp_retries1: how many retries before giving up answering a TCP connection request; the RFC minimum is 3, which is also the default
11) net.ipv4.tcp_retries2: how many retries before dropping an active (established) TCP connection; default 15
12) net.ipv4.tcp_synack_retries: SYN/ACK retries during the three-way handshake; default 5
13) net.ipv4.tcp_max_orphans: maximum number of sockets not attached to any process; beyond this they are RESET immediately and a warning is printed
14) net.ipv4.tcp_orphan_retries: retries before an orphaned socket is discarded; default 7
15) net.ipv4.tcp_mem: memory the kernel grants to TCP, in pages: below the first value the kernel does not intervene; above the second it enters "memory pressure" mode; above the third it reports "Out of socket memory" and refuses new TCP connections
16) net.ipv4.tcp_rmem: read buffer memory per TCP connection, in bytes
17) net.ipv4.tcp_wmem: write buffer memory per TCP connection, in bytes: minimum, default and maximum (net.core.wmem_max can cap the maximum)
18) net.ipv4.tcp_keepalive_time: when keepalive is enabled, how often TCP sends keepalive probes, in seconds; default 7200 (2 hours)
19) net.ipv4.tcp_keepalive_intvl: interval between keepalive probes
20) net.ipv4.tcp_keepalive_probes: number of probes sent when the peer does not respond

In application code, TIME_WAIT behaviour can also be controlled with the SO_LINGER socket option (see the sketch below).
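A minimal C sketch of the SO_LINGER approach just mentioned (the socket here is only illustrative): setting l_onoff=1 with l_linger=0 makes close() abort the connection with an RST instead of the normal FIN sequence, so this end never enters TIME_WAIT; any unsent data is discarded, so this must be used with care.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) {
        perror("socket");
        return 1;
    }

    /* l_onoff=1 and l_linger=0: close() aborts the connection with RST,
     * so this end skips TIME_WAIT entirely (pending data is discarded). */
    struct linger ling;
    memset(&ling, 0, sizeof(ling));
    ling.l_onoff = 1;
    ling.l_linger = 0;
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &ling, sizeof(ling)) == -1)
        perror("setsockopt(SO_LINGER)");

    /* ... connect/read/write as usual ... */
    close(fd);
    return 0;
}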
结论: 待确认是否为redis的BUG,原因是进程实际占用的内存远小于配置的最大内存,所以不会是内存不够需要淘汰。 CPU百分百redis-server进程集群状态: slave 临时解决办法: 使用gdb将d.ht[0].used的值改为0 问题原因: dictGetRandomKey()过程中, 无法走到分支“if (dictSize(d) == 0) return NULL;”, 导致函数dbRandomKey()进入死循环。 版本: Redis server v=3.2.0 sha=00000000:0 malloc=jemalloc-4.0.3 bits=64 build=9894db3ef433c070 现象1:CPU百分百 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 25636 redis 20 0 38492 4096 1360 R 100.0 0.0 2578:10 redis-server 现象2:大量CLOSE_WAIT状态连接: tcp 2417 0 1.49.26.98:11382 1.49.26.98:37268 CLOSE_WAIT - tcp 2521 0 1.49.26.98:11382 1.49.26.98:35141 CLOSE_WAIT - tcp 2521 0 1.49.26.98:11382 1.49.26.98:57181 CLOSE_WAIT - 进程状态: redis 25636 30.0 0.0 38492 4096 ? Rsl 3月23 2579:55 /data/redis/bin/redis-server *:1382 [cluster] 最大内存配置(1G): maxmemory 1073741824 运行日志: 25636:S 28 Mar 00:21:24.526 - 1 clients connected (0 slaves), 1312384 bytes in use 25636:S 28 Mar 00:21:29.531 - DB 0: 1 keys (1 volatile) in 8 slots HT. 25636:S 28 Mar 00:21:29.531 - 1 clients connected (0 slaves), 1312384 bytes in use 25636:S 28 Mar 00:21:32.585 - Accepted 1.118.14.7:58132 调用栈: #0 dictGenHashFunction (key=<optimized out>, len=5) at dict.c:123 #1 0x00000000004232e6 in dictFind (d=0x7f71c2a17240, key=key@entry=0x7f71c2a15001) at dict.c:499 #2 0x000000000043a00a in dbRandomKey (db=0x7f71c2a24800) at db.c:176 #3 0x000000000043a0a2 in randomkeyCommand (c=0x7f71c2aae1c0) at db.c:355 #4 0x0000000000426b95 in call (c=c@entry=0x7f71c2aae1c0, flags=flags@entry=15) at server.c:2221 #5 0x0000000000429ba7 in processCommand (c=0x7f71c2aae1c0) at server.c:2500 #6 0x0000000000436515 in processInputBuffer (c=0x7f71c2aae1c0) at networking.c:1296 #7 0x0000000000421338 in aeProcessEvents (eventLoop=eventLoop@entry=0x7f71c2a2e050, flags=flags@entry=3) at ae.c:412 #8 0x00000000004215eb in aeMain (eventLoop=0x7f71c2a2e050) at ae.c:455 #9 0x000000000041e5df in main (argc=2, argv=0x7ffef34b2418) at server.c:4079 #0 0x00007f71c2fbc3a2 in random () from /lib64/libc.so.6 #1 0x0000000000423745 in dictGetRandomKey (d=0x7f71c2a171e0) at dict.c:646 #2 0x0000000000439fc0 in dbRandomKey (db=0x7f71c2a24800) at db.c:171 #3 0x000000000043a0a2 in randomkeyCommand (c=0x7f71c2aae1c0) at db.c:355 #4 0x0000000000426b95 in call (c=c@entry=0x7f71c2aae1c0, flags=flags@entry=15) at server.c:2221 #5 0x0000000000429ba7 in processCommand (c=0x7f71c2aae1c0) at server.c:2500 #6 0x0000000000436515 in processInputBuffer (c=0x7f71c2aae1c0) at networking.c:1296 #7 0x0000000000421338 in aeProcessEvents (eventLoop=eventLoop@entry=0x7f71c2a2e050, flags=flags@entry=3) at ae.c:412 #8 0x00000000004215eb in aeMain (eventLoop=0x7f71c2a2e050) at ae.c:455 #9 0x000000000041e5df in main (argc=2, argv=0x7ffef34b2418) at server.c:4079 #0 0x00007f71c30e17e4 in __memcmp_sse4_1 () from /lib64/libc.so.6 #1 0x0000000000424219 in dictSdsKeyCompare (privdata=<optimized out>, key1=<optimized out>, key2=<optimized out>) at server.c:445 #2 0x000000000042331d in dictFind (d=0x7f71c2a17240, key=0x7f71c2a27e73) at dict.c:504 #3 0x0000000000439494 in getExpire (db=0x7f71c2a24800, key=0x7f71c2a27e60) at db.c:824 #4 0x0000000000439c4f in expireIfNeeded (db=0x7f71c2a24800, key=0x7f71c2a27e60) at db.c:858 #5 0x000000000043a01a in dbRandomKey (db=0x7f71c2a24800) at db.c:177 #6 0x000000000043a0a2 in randomkeyCommand (c=0x7f71c2aae1c0) at db.c:355 #7 0x0000000000426b95 in call (c=c@entry=0x7f71c2aae1c0, flags=flags@entry=15) at server.c:2221 #8 0x0000000000429ba7 in processCommand (c=0x7f71c2aae1c0) at server.c:2500 #9 0x0000000000436515 in processInputBuffer (c=0x7f71c2aae1c0) at 
networking.c:1296 #10 0x0000000000421338 in aeProcessEvents (eventLoop=eventLoop@entry=0x7f71c2a2e050, flags=flags@entry=3) at ae.c:412 #11 0x00000000004215eb in aeMain (eventLoop=0x7f71c2a2e050) at ae.c:455 #12 0x000000000041e5df in main (argc=2, argv=0x7ffef34b2418) at server.c:4079 #0 dictGetRandomKey (d=<optimized out>) at dict.c:663 #1 0x0000000000439fc0 in dbRandomKey (db=0x7f71c2a24800) at db.c:171 #2 0x000000000043a0a2 in randomkeyCommand (c=0x7f71c2aae1c0) at db.c:355 #3 0x0000000000426b95 in call (c=c@entry=0x7f71c2aae1c0, flags=flags@entry=15) at server.c:2221 #4 0x0000000000429ba7 in processCommand (c=0x7f71c2aae1c0) at server.c:2500 #5 0x0000000000436515 in processInputBuffer (c=0x7f71c2aae1c0) at networking.c:1296 #6 0x0000000000421338 in aeProcessEvents (eventLoop=eventLoop@entry=0x7f71c2a2e050, flags=flags@entry=3) at ae.c:412 #7 0x00000000004215eb in aeMain (eventLoop=0x7f71c2a2e050) at ae.c:455 #8 0x000000000041e5df in main (argc=2, argv=0x7ffef34b2418) at server.c:4079 猜测: 达到最大内存,进入淘汰keys逻辑,但没有keys符合淘汰,从而死循环。 相关代码: /* Return a random key from the currently selected database. */ void randomkeyCommand(client *c) { robj *key; if ((key = dbRandomKey(c->db)) == NULL) { addReply(c,shared.nullbulk); return; } addReplyBulk(c,key); decrRefCount(key); } /* Return a random key, in form of a Redis object. * If there are no keys, NULL is returned. * * The function makes sure to return keys not already expired. */ robj *dbRandomKey(redisDb *db) { dictEntry *de; while(1) { // CPU百分百的原因,是这里死循环了 sds key; robj *keyobj; de = dictGetRandomKey(db->dict); if (de == NULL) return NULL; key = dictGetKey(de); keyobj = createStringObject(key,sdslen(key)); if (dictFind(db->expires,key)) { if (expireIfNeeded(db,keyobj)) { decrRefCount(keyobj); continue; /* search for another key. This expired. */ } } return keyobj; } } void call(client *c, int flags) { long long dirty, start, duration; int client_old_flags = c->flags; /* Sent the command to clients in MONITOR mode, only if the commands are * not generated from reading an AOF. */ if (listLength(server.monitors) && !server.loading && !(c->cmd->flags & (CMD_SKIP_MONITOR|CMD_ADMIN))) { replicationFeedMonitors(c,server.monitors,c->db->id,c->argv,c->argc); } /* Initialization: clear the flags that must be set by the command on * demand, and initialize the array for additional commands propagation. */ c->flags &= ~(CLIENT_FORCE_AOF|CLIENT_FORCE_REPL|CLIENT_PREVENT_PROP); redisOpArrayInit(&server.also_propagate); /* Call the command. */ dirty = server.dirty; start = ustime(); c->cmd->proc(c); duration = ustime()-start; dirty = server.dirty-dirty; if (dirty < 0) dirty = 0; 。。。。。。 } /* With multiplexing we need to take per-client state. * Clients are taken in a linked list. */ typedef struct client { 。。。。。。 struct redisCommand *cmd, *lastcmd; /* Last command executed. */ 。。。。。。 }; typedef void redisCommandProc(client *c); typedef int *redisGetKeysProc(struct redisCommand *cmd, robj **argv, int argc, int *numkeys); struct redisCommand { char *name; redisCommandProc *proc; int arity; char *sflags; /* Flags as string representation, one char per flag. */ int flags; /* The actual flags, obtained from the 'sflags' field. */ /* Use a function to determine keys arguments in a command line. * Used for Redis Cluster redirect. */ redisGetKeysProc *getkeys_proc; /* What keys should be loaded in background when calling this command? 
*/ int firstkey; /* The first argument that's a key (0 = no keys) */ int lastkey; /* The last argument that's a key */ int keystep; /* The step between first and last key */ long long microseconds, calls; }; /* This is our hash table structure. Every dictionary has two of this as we * implement incremental rehashing, for the old to the new table. */ typedef struct dictht { dictEntry **table; unsigned long size; unsigned long sizemask; unsigned long used; } dictht; typedef struct dict { dictType *type; void *privdata; dictht ht[2]; long rehashidx; /* rehashing not in progress if rehashidx == -1 */ int iterators; /* number of iterators currently running */ } dict; /* Return a random entry from the hash table. Useful to * implement randomized algorithms */ dictEntry *dictGetRandomKey(dict *d) { dictEntry *he, *orighe; unsigned int h; int listlen, listele; // (gdb) p *d // $1 = {type = 0x71d940 <dbDictType>, privdata = 0x0, ht = {{table = 0x7f71c2a1e480, size = 8, sizemask = 7, used = 1}, {table = 0x0, size = 0, sizemask = 0, used = 0}}, rehashidx = -1, iterators = 0} // // (gdb) p d.ht[0] // $3 = {table = 0x7f71c2a1e480, size = 8, sizemask = 7, used = 1} // (gdb) p d.ht[1] // $4 = {table = 0x0, size = 0, sizemask = 0, used = 0} // // (gdb) set variable d.ht[0].used=0 // (gdb) p d.ht[0].used // $7 = 0 // #define dictSize(d) ((d)->ht[0].used+(d)->ht[1].used) if (dictSize(d) == 0) return NULL; if (dictIsRehashing(d)) _dictRehashStep(d); if (dictIsRehashing(d)) { do { /* We are sure there are no elements in indexes from 0 * to rehashidx-1 */ h = d->rehashidx + (random() % (d->ht[0].size + d->ht[1].size - d->rehashidx)); he = (h >= d->ht[0].size) ? d->ht[1].table[h - d->ht[0].size] : d->ht[0].table[h]; } while(he == NULL); } else { do { h = random() & d->ht[0].sizemask; he = d->ht[0].table[h]; } while(he == NULL); } /* Now we found a non empty bucket, but it is a linked * list and we need to get a random element from the list. * The only sane way to do so is counting the elements and * select a random index. */ listlen = 0; orighe = he; while(he) { he = he->next; listlen++; } listele = random() % listlen; he = orighe; while(listele--) he = he->next; return he; } /* This function performs just a step of rehashing, and only if there are * no safe iterators bound to our hash table. When we have iterators in the * middle of a rehashing we can't mess with the two hash tables otherwise * some element can be missed or duplicated. * * This function is called by common lookup or update operations in the * dictionary so that the hash table automatically migrates from H1 to H2 * while it is actively used. */ static void _dictRehashStep(dict *d) { if (d->iterators == 0) dictRehash(d,1); } 进程内存(问题解决,退出死循环后才能看到,但结果和ps看到一致): # Memory used_memory:1375320 used_memory_human:1.31M used_memory_rss:4321280 used_memory_rss_human:4.12M used_memory_peak:2468448 used_memory_peak_human:2.35M total_system_memory:33453797376 total_system_memory_human:31.16G used_memory_lua:34816 used_memory_lua_human:34.00K maxmemory:1073741824 maxmemory_human:1.00G maxmemory_policy:allkeys-lru mem_fragmentation_ratio:3.14 mem_allocator:jemalloc-4.0.3
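下面用一个独立的最小示意程序帮助理解该死循环(DemoHashTable、random_entry等名字均为演示而虚构,并非Redis源码):当ht[0].used大于0但所有桶都为空时,dictGetRandomKey非rehash分支中的“do { ... } while(he == NULL)”永远取不到非空桶;而gdb把d.ht[0].used改成0后,函数直接从“if (dictSize(d) == 0) return NULL;”返回,死循环随之解除。
// g++ -o dict_loop_demo dict_loop_demo.cpp
#include <cstdio>
#include <cstdlib>

struct DemoEntry { DemoEntry* next; };

struct DemoHashTable
{
    DemoEntry** table;      // 桶数组
    unsigned long size;     // 桶个数
    unsigned long sizemask; // size-1
    unsigned long used;     // 记录的元素个数,它与桶内容不一致即是故障的触发条件
};

// 模拟dictGetRandomKey非rehash分支的取随机桶逻辑,为便于演示加了尝试次数上限
static DemoEntry* random_entry(const DemoHashTable* ht, unsigned long max_tries)
{
    if (ht->used == 0)
        return NULL; // 对应 if (dictSize(d) == 0) return NULL;
    for (unsigned long i = 0; i < max_tries; ++i)
    {
        const unsigned long h = (unsigned long)random() & ht->sizemask;
        if (ht->table[h] != NULL)
            return ht->table[h];
    }
    return NULL; // Redis中没有这个出口,表现就是死循环、CPU百分百
}

int main()
{
    DemoEntry* buckets[8] = { NULL };        // 8个桶全为空
    DemoHashTable ht = { buckets, 8, 7, 1 }; // 但used被记为1,与故障现场的gdb输出一致
    printf("used=1: entry=%p\n", (void*)random_entry(&ht, 1000000)); // 空转100万次仍取不到
    ht.used = 0; // 等价于gdb中的 set variable d.ht[0].used=0
    printf("used=0: entry=%p\n", (void*)random_entry(&ht, 1000000)); // 立即返回NULL
    return 0;
}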
问题复现步骤:
1) 输入字符串:{ "V":0.12345678 }
2) 将字符串转成cJSON对象
3) 调用cJSON_Print将cJSON对象再转成字符串
4) 再将该字符串转成cJSON对象
5) 以保留8位精度方式调用printf打印值,输出变成:0.123456
问题的原因出在cJSON的print_number函数:
static char *print_number(cJSON *item)
{
    char *str;
    double d = item->valuedouble;
    if (fabs(((double) item->valueint) - d) <= DBL_EPSILON && d <= INT_MAX && d >= INT_MIN)
    {
        str = (char*) cJSON_malloc(21); /* 2^64+1 can be represented in 21 chars. */
        if (str) sprintf(str, "%d", item->valueint);
    }
    else
    {
        str = (char*) cJSON_malloc(64); /* This is a nice tradeoff. */
        if (str)
        {
            if (fabs(floor(d) - d) <= DBL_EPSILON && fabs(d) < 1.0e60) sprintf(str, "%.0f", d);
            else if (fabs(d) < 1.0e-6 || fabs(d) > 1.0e9) sprintf(str, "%e", d);
            else sprintf(str, "%f", d);
        }
    }
    return str;
}
最后一个sprintf调用没有指定保留的精度,%f默认只保留6位小数,这就是问题的原因。
注:float的精度为6~7位有效数字,double的精度为15~16位。
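一个可行的修正思路是让最后一个sprintf输出足够多的有效数字,例如改用“%.17g”(17位有效数字足以无损还原double);据了解较新版本的cJSON采用的是先按“%1.15g”输出、回读校验不相等时再退回“%1.17g”的策略。下面的小程序仅用于对比两种格式的精度差异,并非cJSON的官方补丁:
// g++ -o precision_demo precision_demo.cpp
#include <cstdio>

int main()
{
    const double d = 0.12345678;
    char buf1[64];
    char buf2[64];
    snprintf(buf1, sizeof(buf1), "%f", d);    // %f默认只保留6位小数,精度丢失
    snprintf(buf2, sizeof(buf2), "%.17g", d); // %.17g保留17位有效数字,可无损还原
    printf("%%f    : %s\n", buf1);
    printf("%%.17g : %s\n", buf2);
    return 0;
}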
目录
目录
1. 研究目的
2. 基本概念
3. crontab
3.1. 编辑
3.1.1. “crontab -e”工作流
3.2. 问题
4. crond
4.1. /etc/crontab
1. 研究目的
更好地使用crontab,并解决使用crontab时遇到的问题。本文分析的是Paul Vixie版本的crontab和crond,一般可通过执行“man crontab”查看AUTHOR是否为“Paul Vixie”。
2. 基本概念
1) crond是一个后台守护程序,定时执行由它负责;
2) crontab是crond的命令行工具,通过它来增删改定时任务,不同用户的crontab是相互独立的。
3. crontab
crontab启动后,会首先切换当前目录,当前目录由宏CRONDIR定义(pathnames.h中):
#ifndef CRONDIR
/* CRONDIR is where cron(8) and crontab(1) both chdir
 * to; SPOOL_DIR, CRON_ALLOW, CRON_DENY, and LOG_FILE
 * are all relative to this directory. */
#define CRONDIR "/var/cron"
#endif
老版本一般为/var/cron,新版本目录为/var/spool。
接下来会检查用户是否有权限执行crontab,比如用户名密码过期了则不能执行。检查通过后,根据命令行参数分成4个命令分别执行:
1) list_cmd:对应于crontab -l;
2) delete_cmd:对应于crontab -r;
3) edit_cmd:对应于crontab -e;
4) replace_cmd:对应于crontab filepath。
3.1. 编辑
crontab默认使用宏_PATH_VI指定的编辑器,文件/usr/include/paths.h定义了_PATH_VI:
#define _PATH_VI "/usr/bin/vi"
但如果系统没有文件/usr/include/paths.h或没有定义_PATH_VI,则为/usr/ucb/vi:
#if defined(_PATH_VI)
# define EDITOR _PATH_VI
#else
# define EDITOR "/usr/ucb/vi"
#endif
除此之外,crontab还支持从环境变量VISUAL和EDITOR读取采用哪个编辑器,其中先读取VISUAL,如果没有指定再读取EDITOR。“crontab -e”的完整工作流如下:
3.1.1. “crontab -e”工作流
以用户root为例:
1) 切换当前目录为“/var/cron”;
2) 拼接文件名“tabs/username”,假设用户名为root,则为“tabs/root”(新版本文件为“cron/root”),完整的路径为:/var/cron/tabs/root。文件tabs/root的内容和命令“crontab -l”的输出相同;
3) 打开文件/var/cron/tabs/root,然后取得文件的访问时间和修改时间。如果文件不存在,则读取/dev/null的访问时间和修改时间;
4) 生成格式为“crontab.XXXXXXXXXX”的临时文件,比如:crontab.b2gvnE;
5) 修改临时文件的owner;
6) 将文件tabs/root的内容逐字符复制到临时文件中;
7) 取得编辑用的编辑器,默认为“/usr/bin/vi”;
8) fork一个子进程;
9) 通过execlp执行/usr/bin/vi;
10) 等待vi进程退出;
11) 如果vi正常退出,检查修改时间,如果有变化,则执行replace_cmd;
12) replace_cmd过程中,会加上头:
/* write a signature at the top of the file.
 *
 * VERY IMPORTANT: make sure NHEADER_LINES agrees with this code.
 */
fprintf(tmp, "# DO NOT EDIT THIS FILE - edit the master and reinstall.\n");
fprintf(tmp, "# (%s installed on %-24.24s)\n", Filename, ctime(&now));
fprintf(tmp, "# (Cron version %s -- %s)\n", CRON_VERSION, rcsid);
13) replace_cmd会创建一个新的位于当前目录(比如/var/cron或/var/spool)下的临时文件;
14) 然后复制原来的临时文件内容到新的临时文件中,并检查语法;
15) 完成后再调用rename将临时文件名改为第2步取得的正式文件名;
16) 更新文件的访问时间和修改时间。
3.2. 问题
1) “crontab -e”未退出前,已保存的修改是否生效?
2) crontab中定义的环境变量,注释是否可以在同一行,如:
STARTDATE=2017-12-18 # 开始日期
4. crond
老版本的crond,修改后需要重启进程才会生效;新版本的crond通过inotify监控文件变化,修改后不用重启即会生效。
4.1. /etc/crontab
/etc/crontab是系统级crontab文件,crond在加载用户crontab前会先加载/etc/crontab,而且/etc/crontab总是属于root用户。
有关crontab的更多内容,请浏览《cron运行原理》:http://blog.chinaunix.net/uid-20682147-id-4977039.html。
Crontab专题:http://blog.chinaunix.net/uid/20682147/cid-224920-list-1.html
两种方式:
1)直接在crontab中定义变量,如:
A=123
* * * * * echo $A > /tmp/a.txt
注意在定义变量时不能使用$引用其它变量,如下面的做法是错误的:
A=123
B=$A
2)在/etc/environment中定义变量
此文件定义变量的格式为:NAME=VALUE,和crontab一样,也不能使用$引用其它变量。
操作系统在登录时使用的第一个文件是/etc/environment,该文件中的变量用于指定所有进程的基本环境。
注意,千万不要有“PATH=$PATH:/usr/local/jdk/bin”这样的用法,这将导致系统无法启动。
技巧:
想保持多台机器的crontab一致,但变量值不完全相同,这个时候可以考虑将变量配置在/etc/environment中,这样各机器的crontab就可以相同了。
如,机器1:
A=123
机器2:
A=456
两者的crontab配置:
* * * * * echo "$A" > /x.txt
一般不建议直接修改/etc/environment,而可采取在目录/etc/profile.d下新增一个.sh文件的方式替代。但如果想在crontab中生效,则只能修改/etc/environment,经测试/etc/profile.d方式不起作用。
注意:在/etc/environment中设置的变量,在shell中并不生效,但在crontab中有效。
版本:
redis-3.2.9
部署:
5台64G内存的物理机,每台机器启动2个redis进程,组成5主5备集群,每台机器1个主1个备,并且错开互备。
问题:
发现redis进程占用内存高达40G,而且全是备进程。尝试通过重启进程的方式释放内存,但进入复制死循环,报如下所示错误:
for lack of backlog (Slave request was: 51875158284)
通过网上查找资料,修改client-output-buffer-limit和repl-timeout值,问题未能得到解决,仍然报for lack of backlog,并仍然循环复制。
将备进程的data目录移走,但保留nodes.conf文件,然后再重启,这次重启成功。采取同样方法处理其它备进程,同样成功,内存同样降到和主进程接近的10G左右。
待分析:为何备进程占用的内存是其主进程的4倍(分别为40G和10G)?除了上述方法外,是否有其它更安全可靠的释放办法?
crontab条目中包含%号时,最常见的是取时间,如:date +%d,对%需要使用\进行转义,否则不能按预期执行,正确做法为:
* * * * * echo "`date +\%d`" > /tmp/r1r.txt
而不能为
* * * * * echo "`date +%d`" > /tmp/r1r.txt
%是crontab的特殊字符,第一个未转义的%之后的内容都会被当作标准输入,这可以通过“man 5 crontab”查看到说明:
The entire command portion of the line, up to a newline or a "%" character, will be executed by /bin/sh or by the shell specified in the SHELL variable of the cronfile. A "%" character in the command, unless escaped with a backslash (\), will be changed into newline characters, and all data after the first % will be sent to the command as standard input.
示例:
$ cat /tmp/hello.txt
cat: /tmp/hello.txt: 没有那个文件或目录
$ echo -e "`crontab -l`\n* * * * * cat > /tmp/hello.txt % hello word"|crontab -
$ crontab -l|grep hello.txt
* * * * * cat > /tmp/hello.txt % hello word
$ cat /tmp/hello.txt
hello word
coredump的调用栈:
#0 0xf76f5440 in __kernel_vsyscall ()
#1 0xf73c4657 in raise () from /lib/libc.so.6
#2 0xf73c5e93 in abort () from /lib/libc.so.6
#3 0xf75fe78d in __gnu_cxx::__verbose_terminate_handler() () from /lib/libstdc++.so.6
#4 0xf75fc263 in ?? () from /lib/libstdc++.so.6
#5 0xf75fc29f in std::terminate() () from /lib/libstdc++.so.6
#6 0xf75fc2b3 in ?? () from /lib/libstdc++.so.6
#7 0xf75fbdc9 in __cxa_call_unexpected () from /lib/libstdc++.so.6
#8 0x085d8cbe in hbase::thrift2::CHBaseClient::check_and_put (this=0xede004f8, table_name="A:B", row_key="2883054611_1201423701201702062600010410", family_name="cf1", Python Exception list index out of range: column_name="pid", column_value="", row=std::map with 5 elements, check_flag=apache::hadoop::hbase::thrift2::TDurability::FSYNC_WAL) at /data/src/hbase_client.cpp:1178
原因是抛出了异常规范声明之外的异常,比如:
void f() throw (A);
void f() throw (A)
{
    ......
    throw B();
    ......
}
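下面是一个可以复现这类coredump的最小示意程序(A、B、f均为虚构的演示名字):f()的异常规范只允许抛出A,运行时却抛出了B,于是走到调用栈中看到的__cxa_call_unexpected,进而std::unexpected -> std::terminate -> abort,外层的catch(...)也接不住。注意动态异常规范在C++11中已废弃、C++17中已移除,编译时请使用较老的标准(如-std=c++03):
// g++ -std=c++03 -o unexpected_demo unexpected_demo.cpp
#include <cstdio>

struct A {};
struct B {};

// 异常规范声明只会抛出A
void f() throw (A)
{
    throw B(); // 抛出了声明之外的异常,触发std::unexpected
}

int main()
{
    try
    {
        f();
    }
    catch (...)
    {
        printf("caught\n"); // 不会执行到这里,进程在异常离开f()时即被abort
    }
    return 0;
}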
可以修改/etc/rc.d/boot.local让规则重启后也能生效,如:
/sbin/iptables -F
/sbin/iptables -A INPUT -i eth0 -p tcp --sport 80 -j ACCEPT
/sbin/iptables -A INPUT -i eth0 -p tcp -j DROP
/sbin/iptables -A INPUT -i eth0 -p udp -j DROP
iptables以链(Chain)的方式从前往后逐条判断,如果前面的规则已匹配就不会再往后继续,所以要注意顺序,一般每行对应一条规则。
-A是Append的意思,也就是追加
-I是Insert的意思,也就是插入
-F表示清除(即删除)掉已有规则,也就是清空。
查看已有的规则,执行命令:iptables -L -n
如(参数-L为list的意思,-n表示以数字方式显示IP和端口,不指定-n则显示为名称,如:http即80端口):
# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp spt:443
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp spt:80
DROP tcp -- 0.0.0.0/0 0.0.0.0/0
DROP udp -- 0.0.0.0/0 0.0.0.0/0
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
从上面可以看到:iptables有三种规则链(Chain),即INPUT、OUTPUT和FORWARD。
INPUT 用于指定输入规则,比如外部是否可以访问本机的80端口
OUTPUT 用于指定输出规则,比如本机是否可以访问外部的80端口
FORWARD 用于指定端口转发规则,比如将8080端口的数据转发到80端口
-I和-A需要指定链(Chain)名,其中-I的链名后还需要指定第几条(行)规则。
可通过-D参数删除规则,有两种删除方式,一是匹配模式,二是指定第几条(行)。
也可以通过-R参数修改已有规则,另外-L参数后也可以跟链(Chain)名,表示只列出指定链的所有规则。
-j参数后跟的是动作,即满足规则时执行的操作,可以为ACCEPT、DROP、REJECT和REDIRECT等。
在iptables的INPUT链的第一行插入一条规则(可访问其它机器的80端口):
iptables -I INPUT 1 -p tcp --sport 80 -j ACCEPT
在iptables的INPUT链尾追加一条规则(可访问其它机器的80端口):
iptables -A INPUT -p tcp --sport 80 -j ACCEPT
如果要让其它机器可以访问本机的80端口,则为:
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
插入前:
# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP tcp -- 0.0.0.0/0 0.0.0.0/0
DROP udp -- 0.0.0.0/0 0.0.0.0/0
插入:
# iptables -I INPUT 1 -p tcp --sport 80 -j ACCEPT
插入后:
# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp spt:80
DROP tcp -- 0.0.0.0/0 0.0.0.0/0
DROP udp -- 0.0.0.0/0 0.0.0.0/0
追加前:
# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP tcp -- 0.0.0.0/0 0.0.0.0/0
DROP udp -- 0.0.0.0/0 0.0.0.0/0
追加:
# iptables -A INPUT -p tcp --sport 80 -j ACCEPT
追加后(由于前面的DROP规则先匹配,这条ACCEPT将不能生效):
# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP tcp -- 0.0.0.0/0 0.0.0.0/0
DROP udp -- 0.0.0.0/0 0.0.0.0/0
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp spt:80
确定Kafka安装和启动正确,ZooKeeper可以查到所有的Brokers,但执行: kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic 遇到如下错误: java.net.SocketException: Network is unreachable at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) Error while executing topic command : replication factor: 3 larger than available brokers: 0 [2017-06-26 17:25:18,037] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: replication factor: 3 larger than available brokers: 0 这个问题可能是broker的配置文件server.properties中的配置项zookeeper.connect指定了kafka的zookeeper的根目录,如: zookeeper.connect=192.168.31.32:2181,192.168.31.33:2181/kafka 这个时候,命令行参数“--zookeeper”的值也需要带上根目录,否则就会报这个错误,正确做法是: kafka-topics.sh --create --zookeeper localhost:2181/kafka --replication-factor 3 --partitions 1 --topic my-replicated-topic
tcpcopy是一个tcp流量复制工具,当前还支持udp和mysql流量的复制。
目的:将机器10.24.110.21的4077端口流量引流到机器10.23.25.11的5000端口。
示例:将10.24.110.21:4077引流到10.23.25.11:5000
1) 线上机器:10.24.110.21
tcpcopy -x 4077-10.23.25.11:5000 -s 10.23.25.12 -c 192.168.100.x -n 1
2) 测试机器:10.23.25.11
route add -net 192.168.100.0 netmask 255.255.255.0 gw 10.23.25.12
192.168.100.0/24为虚拟IP段,tcpcopy使用它来连接测试机。
3) 辅助机器:10.23.25.12
intercept -i eth1 -F tcp and src port 5000
测试机器和辅助机器需要在同一网段,否则添加不了路由。
如果需要将多台线上机器引流到同一台测试机上呢?
关键点:需要不同的辅助端口
假设引流两台线上机到一台测试机10.23.25.11:5000,在辅助机器上启动两个不同端口的intercept进程(-p参数默认值为36524):
intercept -p 36524 -i eth1 -F tcp and src port 5000
intercept -p 36525 -i eth1 -F tcp and src port 5000
同时,测试机上需要添加两条到辅助机的路由:
route add -net 192.168.100.0 netmask 255.255.255.0 gw 10.23.25.12
route add -net 192.168.110.0 netmask 255.255.255.0 gw 10.23.25.12
两台线上机上分别启动tcpcopy引流(需要指定不同的辅助机端口):
tcpcopy -x 4077-10.23.25.11:5000 -c 192.168.100.x -n 1 -s 10.23.25.12 -f 1 -p 36524
tcpcopy -x 4077-10.23.25.11:5000 -c 192.168.110.x -n 1 -s 10.23.25.12 -f 6 -p 36525
C++11将addressof作为标准库的一部分,用于取变量和函数等的内存地址。
代码示例:
#include <cstdio>
#include <memory> // std::addressof
void f() {}
int main()
{
    int m;
    printf("%p\n", std::addressof(m)); // 一些环境非C++11可用std::__addressof
    printf("%p\n", std::addressof(f));
    return 0;
}
运行输出示例:
0x7ffc983b699c
0x4005f0
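std::addressof相对于取地址运算符&的价值在于:当类型重载了operator&时,&拿到的不一定是对象的真实地址,而std::addressof仍然可以。下面是一个示意(Evil为虚构的演示类型):
// g++ -std=c++11 -o addressof_demo addressof_demo.cpp
#include <cstdio>
#include <memory>

struct Evil
{
    // 重载了取地址运算符,对对象取&时返回的不再是真实地址
    Evil* operator&()
    {
        return nullptr;
    }
};

int main()
{
    Evil evil;
    printf("operator&:      %p\n", (void*)&evil);                // 被重载的operator&劫持,输出空指针
    printf("std::addressof: %p\n", (void*)std::addressof(evil)); // 输出对象的真实地址
    return 0;
}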
有如下一个结构体:
struct X
{
    uint32_t a;
    char* b[0];
};
sizeof(X)的值为多少呢?
关键点:维度为0的数组成员不占大小,但是它的类型参与对齐。
注:在x86_64上“char*”的align值为8,x86上为4。
那么:
#pragma pack(8)
struct X
{
    uint32_t a;
    char* b[0];
};
#pragma pack()
sizeof(X)值为8:取alignof(char*)和pack(8)中的较小值8,b需按8字节对齐,故a之后补4字节,整个结构体按8字节对齐。
#pragma pack(4)
struct X
{
    uint32_t a;
    char* b[0];
};
#pragma pack()
sizeof(X)值为4:取alignof(char*)和pack(4)中的较小值4,b紧跟在a之后即满足4字节对齐,不需要再补。
#pragma pack(1)
struct X
{
    uint32_t a;
    char* b[0];
};
#pragma pack()
按1字节对齐时,sizeof(X)值同样为4:取alignof(char*)和pack(1)中的较小值1,b不需要任何补齐。
如果结构体变成:
struct X
{
    uint32_t a;
    char b[0];
};
sizeof(X)的值又为多少呢?
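可以用下面的小程序在目标平台上实际验证(X8、X4、X1、Y等名字仅为演示、避免同名冲突,结果以实际编译器为准;零长度数组为GCC扩展):
// g++ -std=c++11 -o sizeof_demo sizeof_demo.cpp
#include <stdio.h>
#include <stdint.h>

#pragma pack(8)
struct X8 { uint32_t a; char* b[0]; };
#pragma pack()

#pragma pack(4)
struct X4 { uint32_t a; char* b[0]; };
#pragma pack()

#pragma pack(1)
struct X1 { uint32_t a; char* b[0]; };
#pragma pack()

// 对应文末的问题:数组元素类型换成char
struct Y { uint32_t a; char b[0]; };

int main()
{
    printf("pack(8): sizeof=%zu, alignof=%zu\n", sizeof(X8), alignof(X8));
    printf("pack(4): sizeof=%zu, alignof=%zu\n", sizeof(X4), alignof(X4));
    printf("pack(1): sizeof=%zu, alignof=%zu\n", sizeof(X1), alignof(X1));
    printf("char[0]: sizeof=%zu, alignof=%zu\n", sizeof(Y), alignof(Y));
    return 0;
}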
如果HBase thrift2报:“TIOError exception: Default TException”, 这个可能是因为操作的表不存在,不一定是网络或磁盘操作异常。 HBase Thrift2偷懒了,所有异常被统一成了TIOError和TIllegalArgument两个异常, 导致调用者无法区分,而且出错信息也没能很好的带过来,增加了定位工作量。 在HBase client中为如下一个继承关系: public class TableNotFoundException extends DoNotRetryIOException public class DoNotRetryIOException extends HBaseIOException public class HBaseIOException extends IOException HBase master相关日志: 2017-05-27 17:20:42,879 ERROR [thrift2-worker-7] client.AsyncProcess: Failed to get region location org.apache.hadoop.hbase.TableNotFoundException: ABCDE at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1285) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1183) at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:422) at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:371) at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:245) at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:197) at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1461) at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1017) at org.apache.hadoop.hbase.thrift2.ThriftHBaseServiceHandler.put(ThriftHBaseServiceHandler.java:243) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
通常需要将process_monitor.sh放到crontab中,以便系统重启后自动生效,也可避免process_monitor.sh被意外终止导致监控失效,crontab的时间部分一般设置为1分钟执行一次,如:* * * * *。
不用做任何修改,即可用process_monitor.sh监控各种进程。
源码下载:https://github.com/eyjian/mooon/blob/master/mooon/shell/process_monitor.sh。
使用之前,请给process_monitor.sh带上可执行权限,不带任何参数执行process_monitor.sh时显示帮助信息。
运行process_monitor.sh,需要指定两个参数:
1)参数1:被监控的对象,支持同一程序带不同参数的分别监控,典型的如java程序
2)参数2:被监控的对象不存在时,用来重拉起的脚本或命令
参数1又可分成两部分:
1)被监控对象,如java程序,不含参数部分,值需要和ps看到的完全相同,比如ps看到的是绝对路径,则也需为绝对路径;
2)参数匹配部分,一个用于区分同一程序不同进程的源自于参数的字符串。这部分是可选的,只有当被监控对象以不同参数在同一机器上同时运行时才需要指定。
建议将process_monitor.sh放到目录/usr/local/bin下,以方便使用。
示例1:监控ZooKeeper进程(假设ZooKeeper安装目录为/data/zookeeper,JDK安装目录为/usr/local/jdk)
/usr/local/bin/process_monitor.sh "/usr/local/jdk/bin/java -Dzookeeper" "/data/zookeeper/bin/zkServer.sh start"
上面的“-Dzookeeper”为参数的一部分,借助它可以区分于其它java程序,比如HDFS DataNode为“-Dproc_datanode”:
/usr/local/bin/process_monitor.sh "/usr/local/jdk/bin/java -Dproc_datanode" "/data/hadoop/sbin/hadoop-daemon.sh start datanode"
参数1无匹配部分的使用示例:
/usr/local/bin/process_monitor.sh "/usr/local/ip2location/bin/ip2location" "/usr/local/ip2location/bin/ip2location --num_worker_threads=8 --num_io_threads=2"
放在crontab中的示例:
* * * * * /usr/local/bin/process_monitor.sh "/usr/local/ip2location/bin/ip2location" "/usr/local/ip2location/bin/ip2location --num_worker_threads=8 --num_io_threads=2"
https://github.com/eyjian/mooon/releases/tag/mooon-tools mooon_ssh:批量远程命令工具,在多台机器上执行指定命令 mooon_upload:批量远程上传工具,上传单个或多个文件到单台或多台机器 mooon_download:批量远程下载工具,从指定机器下载一个或多个文件 mooon-tools-glibc2.17.tar.gz 64位版本,glibc为2.17,点击下载 mooon-tools-glibc2.4.tar.gz 32位版本,glibc2.4,常常可用于64位版本glibc2.17环境,点击下载。 建议复制到目录/usr/local/bin,或在/usr/local/bin目录下解压,以方便所有用户直接使用,而不用指定文件路径。 可以通过环境变量或参数方式指定连接远程机器的用户名、密码和IP地址或IP地址列表,但参数方式优先: 1) 环境变量H等同参数-h,用于指定远程机器的IP或IP列表,多个IP间以逗号分隔,但mooon_download只能指定一个IP 2) 环境变量U等同参数-u,用于指定连接远程机器的用户名 3) 环境变量P等同参数-p,用于指定远程机器的用户密码 4) 环境变量PORT等同参数-P,用于指定远程机器的端口号 环境变量方式和参数方式可以混合使用,即部分通过环境变量设置值,部分以参数方式指定值。 并建议,参数值尽可能使用单引号,以避免需要对值进行转义处理,除非值本身已包含了单引号。 如果使用双引号,则需要做转义,如批量kill掉java进程: mooon_ssh -c="kill \$(/usr/local/jdk/bin/jps|awk /DataNode/'{print \$1}')" 另外,低版本glibc不兼容高版本的glibc,因此glibc2.4的不能用于glibc2.17环境,64位版本也不能用于32位环境。 64位系统上查看glibc版本方法:/lib64/libc.so.6 32位系统上查看glibc版本方法:/lib/libc.so.6 参数无顺序要求,不带任何参数执行即可查看使用帮助,如: $ mooon_ssh parameter[-c]'s value not set usage: -P[22/10,65535]: remote hosts port, e.g., -P=22. You can also set environment `PORT` instead of `-P`, e.g., export PORT=1998 -c[]: command to execute remotely, e.g., -c='grep ERROR /tmp/*.log' -h[]: remote hosts separated by comma, e.g., -h='192.168.1.10,192.168.1.11'. You can also set environment `H` instead of `-h`, e.g., export H=192.168.1.10,192.168.1.11 -p[]: remote host password, e.g., -p='password'. You can also set environment `P` instead of `-p`, e.g., export P=123456 -t[60/1,65535]: timeout seconds to remote host, e.g., -t=100 -u[]: remote host user name, e.g., -u=root. You can also set environment `U` instead of `-u`, e.g., export U=zhangsan 对于整数类型的参数,均设有默认值和取值范围,如“-P[22/10,65535]”表示默认值为,取值范围为10~65535。对于字符串类型参数,如果为空中括号“[]”,则表示无默认值,否则中括号“[]”中的内容为默认值,如“-u[root]”表示参数“-u”的默认值为root。 mooon_ssh使用示例: 1) 参数方式 mooon_ssh -u=root -p='mypassword' -h=192.168.31.2,192.168.31.3 -c='whoami' 2) 环境变量方式 export U=root export P='mypassword' export H=192.168.31.2,192.168.31.3 mooon_ssh -c='whoami' 3) 混合方式 export U=root export P='mypassword' mooon_ssh -c='whoami' -h=192.168.31.2 mooon_upload和mooon_download使用方法类似。 远程批量添加一条crontab方法: mooon_ssh -c='echo -e "`crontab -l`\n* * * * * touch /tmp/x.txt" | crontab -' 完成后,crontab中将添加如下一行: * * * * * touch /tmp/x.txt
正常情况下,什么时候上报blocks,是由NameNode通过回复心跳响应的方式触发的。 一次机房搬迁中,原机房hadoop版本为2.7.2,新机房版本为2.8.0,采用先扩容再缩容的方式搬迁。由于新旧机房机型不同和磁盘数不同,操作过程搞混过hdfs-site.xml,因为两种不同的机型,hdfs-site.xml不便做到相同,导致了NameNode报大量“missing block”。 然而依据NameNode所报信息,在DataNode能找到那些被标记为“missing”的blocks。修复配置问题后,“missing block”并没有消失。结合DataNode源代码,估计是因为DataNode没有向NameNode上报blocks。 结合DataNode的源代码,发现了HDFS自带的工具triggerBlockReport,它可以强制指定的DataNode向NameNode上报块,使用方法为: hdfs dfsadmin -triggerBlockReport datanode_host:ipc_port 如:hdfs dfsadmin -triggerBlockReport 192.168.31.35:50020 正常情况下NameNode启动时,会要求DataNode上报一次blocks(通过fullBlockReportLeaseId值来控制),相关源代码如下: DataNode相关代码(BPServiceActor.java): private void offerService() throws Exception { HeartbeatResponse resp = sendHeartBeat(requestBlockReportLease); // 向NameNode发向心跳 long fullBlockReportLeaseId = resp.getFullBlockReportLeaseId(); // 心跳响应 boolean forceFullBr = scheduler.forceFullBlockReport.getAndSet(false); // triggerBlockReport强制上报仅一次有效 if (forceFullBr) { LOG.info("Forcing a full block report to " + nnAddr); } if ((fullBlockReportLeaseId != 0) || forceFullBr) { cmds = blockReport(fullBlockReportLeaseId); fullBlockReportLeaseId = 0; } } // NameNode相关代码(FSNamesystem.java): /** * The given node has reported in. This method should: * 1) Record the heartbeat, so the datanode isn't timed out * 2) Adjust usage stats for future block allocation * * If a substantial amount of time passed since the last datanode * heartbeat then request an immediate block report. * * @return an array of datanode commands * @throws IOException */ HeartbeatResponse handleHeartbeat(DatanodeRegistration nodeReg, StorageReport[] reports, long cacheCapacity, long cacheUsed, int xceiverCount, int xmitsInProgress, int failedVolumes, VolumeFailureSummary volumeFailureSummary, boolean requestFullBlockReportLease) throws IOException { readLock(); try { //get datanode commands final int maxTransfer = blockManager.getMaxReplicationStreams() - xmitsInProgress; DatanodeCommand[] cmds = blockManager.getDatanodeManager().handleHeartbeat( nodeReg, reports, blockPoolId, cacheCapacity, cacheUsed, xceiverCount, maxTransfer, failedVolumes, volumeFailureSummary); long fullBlockReportLeaseId = 0; if (requestFullBlockReportLease) { fullBlockReportLeaseId = blockManager.requestBlockReportLeaseId(nodeReg); } //create ha status final NNHAStatusHeartbeat haState = new NNHAStatusHeartbeat( haContext.getState().getServiceState(), getFSImage().getCorrectLastAppliedOrWrittenTxId()); return new HeartbeatResponse(cmds, haState, rollingUpgradeInfo, fullBlockReportLeaseId); } finally { readUnlock("handleHeartbeat"); } }
详情请参见DatanodeUtil.java中的函数idToBlockDir(File root, long blockId)。如果block文件没有放在正确的目录下,则DataNode会出现“expected block file path”日志。
// g++ -g -o block2dir block2dir.cpp
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: block2dir block_id, example: block2dir 1075840138\n");
        exit(1);
    }

    const long block_id = atol(argv[1]);
    const int d1 = (int) ((block_id >> 16) & 0x1F);
    const int d2 = (int) ((block_id >> 8) & 0x1F);
    fprintf(stderr, "subdir%d/subdir%d\n", d1, d2);
    return 0;
}
运行示例:
$ ./block2dir 1075840138
subdir0/subdir4
/**
 * @return the meta name given the block name and generation stamp.
 */
public static String getMetaName(String blockName, long generationStamp) {
    return blockName + "_" + generationStamp + Block.METADATA_EXTENSION;
}
如果DataNode的dfs.datanode.data.dir全配置成SSD类型,则执行“hdfs dfs -put /etc/hosts hdfs:///tmp/”时会报如下错误: 2017-05-04 16:08:22,545 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology 2017-05-04 16:08:22,545 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 3 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}) 2017-05-04 16:08:22,545 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]} 2017-05-04 16:08:22,545 INFO org.apache.hadoop.ipc.Server: IPC Server handler 37 on 8020, call Call#5 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.208.5.220:40701 java.io.IOException: File /tmp/in/hosts._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 5 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:506) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)
Hadoop-2.8.0分布式安装手册.pdf 目录 目录 1 1. 前言 3 2. 特性介绍 3 3. 部署 5 3.1. 机器列表 5 3.2. 主机名 5 3.2.1. 临时修改主机名 6 3.2.2. 永久修改主机名 6 3.3. 免密码登录范围 7 3.4. 修改最大可打开文件数 7 3.5. OOM相关:vm.overcommit_memory 7 4. 约定 7 4.1. 安装目录约定 7 4.2. 服务端口约定 8 4.3. 各模块RPC和HTTP端口 9 5. 工作详单 9 6. JDK安装 9 6.1. 下载安装包 9 6.2. 安装步骤 10 7. 免密码ssh2登录 10 8. Hadoop安装和配置 11 8.1. 下载安装包 11 8.2. 安装和环境变量配置 12 8.3. 修改hadoop-env.sh 12 8.4. 修改/etc/hosts 13 8.5. 修改slaves 14 8.6. 准备好各配置文件 14 8.7. 修改hdfs-site.xml 15 8.8. 修改core-site.xml 18 8.8.1. dfs.namenode.rpc-address 18 8.9. 修改mapred-site.xml 18 8.10. 修改yarn-site.xml 19 9. 启动顺序 20 10. 启动HDFS 21 10.1. 启动好zookeeper 21 10.2. 创建主备切换命名空间 21 10.3. 启动所有JournalNode 21 10.4. 格式化NameNode 21 10.5. 初始化JournalNode 22 10.6. 启动主NameNode 22 10.7. 启动备NameNode 23 10.8. 启动主备切换进程 23 10.9. 启动所有DataNode 23 10.10. 检查启动是否成功 23 10.10.1. DataNode 24 10.10.2. NameNode 24 10.11. 执行HDFS命令 24 10.11.1. 查看DataNode是否正常启动 24 10.11.2. 查看NameNode的主备状态 24 10.11.3. hdfs dfs ls 25 10.11.4. hdfs dfs -put 25 10.11.5. hdfs dfs -rm 25 10.11.6. 人工切换主备NameNode 25 10.11.7. HDFS只允许有一主一备两个NameNode 25 10.11.8. 存储均衡start-balancer.sh 26 10.11.9. 查看文件分布在哪些节点 27 10.11.10. 关闭安全模式 28 10.11.11. 删除missing blocks 28 11. 扩容和下线 28 11.1. 新增JournalNode 28 11.2. 新NameNode如何加入? 29 11.3. 扩容DataNode 30 11.4. 下线DataNode 30 11.5. 强制DataNode上报块信息 32 12. 启动YARN 32 12.1. 启动YARN 32 12.2. 执行YARN命令 33 12.2.1. yarn node -list 33 12.2.2. yarn node -status 33 12.2.3. yarn rmadmin -getServiceState rm1 34 12.2.4. yarn rmadmin -transitionToStandby rm1 34 13. 运行MapReduce程序 34 14. HDFS权限配置 35 14.1. hdfs-site.xml 35 14.2. core-site.xml 36 15. C++客户端编程 36 15.1. 示例代码 36 15.2. 运行示例 37 16. fsImage 37 17. 常见错误 39 1. 前言 当前版本的Hadoop已解决了hdfs、yarn和hbase等单点,并支持自动的主备切换。 本文的目的是为当前最新版本的Hadoop 2.8.0提供最为详细的安装说明,以帮助减少安装过程中遇到的困难,并对一些错误原因进行说明,hdfs配置使用基于QJM(Quorum Journal Manager)的HA。本文的安装只涉及了hadoop-common、hadoop-hdfs、hadoop-mapreduce和hadoop-yarn,并不包含HBase、Hive和Pig等。 NameNode存储了一个文件有哪些块,但是它并不存储这些块在哪些DataNode上,DataNode会上报有哪些块。如果在NameNode的Web上看到“missing”,是因为没有任何的DataNode上报该块,也就造成的丢失。 2. 特性介绍 版本 发版本日期 新特性 3.0.0 支持多NameNode 2.8.0 2016/1/25 2.7.1 2015/7/6 2.7.0 2015/4/21 1) 不再支持JDK6,须JDK 7+ 2) 支持文件截取(truncate) 3) 支持为每种存储类型设置配额 4) 支持文件变长块(之前一直为固定块大小,默认为64M) 5) 支持Windows Azure Storage 6) YARN认证可插拔 7) 自动共享,全局缓存YARN本地化资源(测试阶段) 8) 限制一个作业运行的Map/Reduce任务 9) 加快大量输出文件时大型作业的FileOutputCommitter速度 2.6.4 2016/2/11 2.6.3 2015/12/17 2.6.2 2015/10/28 2.6.1 2015/9/23 2.6.0 2014/11/18 1) YARN支持长时间运行的服务 2) YARN支持升级回滚 3) YARN支持应用运行在Docker容器中 2.5.2 2014/11/19 2.5.1 2014/9/12 2.5.0 2014/8/11 2.4.1 2014/6/30 2.4.0 2014/4/7 1) HDFS升级回滚 2) HDFS支持完整的https 3) YARN ResourceManager支持自动故障切换 2.2.0 2013/10/15 1) HDFS Federation 2) HDFS Snapshots 2.1.0-beta 2013/8/25 1) HDFS快照 2) 支持Windows 2.0.3-alpha 2013/2/14 1) 基于QJM的NameNode HA 2.0.0-alpha 2012/5/23 1) 人工切换的NameNode HA 2) HDFS Federation 1.0.0 2011/12/27 0.23.11 2014/6/27 0.23.10 2013/12/11 0.22.0 2011/12/10 0.23.0 2011/11/17 0.20.205.0 2011/10/17 0.20.204.0 2011/9/5 0.20.203.0 2011/5/11 0.21.0 2010/8/23 0.20.2 2010/2/26 0.20.1 2009/9/14 0.19.2 2009/7/23 0.20.0 2009/4/22 0.19.1 2009/2/24 0.18.3 2009/1/29 0.19.0 2008/11/21 0.18.2 2008/11/3 0.18.1 2008/9/17 0.18.0 2008/8/22 0.17.2 2008/8/19 0.17.1 2008/6/23 0.17.0 2008/5/20 0.16.4 2008/5/5 0.16.3 2008/4/16 0.16.2 2008/4/2 0.16.1 2008/3/13 0.16.0 2008/2/7 0.15.3 2008/1/18 0.15.2 2008/1/2 0.15.1 2007/11/27 0.14.4 2007/11/26 0.15.0 2007/10/29 0.14.3 2007/10/19 0.14.1 2007/9/4 完整请浏览:http://hadoop.apache.org/releases.html。 3. 
部署 推荐使用批量操作工具:mooon_ssh、mooon_upload和mooon_download安装部署,可以提升操作效率(https://github.com/eyjian/mooon/tree/master/mooon/tools),采用CMake编译,依赖OpenSSL(https://www.openssl.org/)和libssh2(http://www.libssh2.org)两个库,其中libssh2也依赖OpenSSL。 3.1. 机器列表 共5台机器(zookeeper部署在这5台机器上),部署如下表所示: NameNode JournalNode DataNode ZooKeeper 10.148.137.143 10.148.137.204 10.148.137.143 10.148.137.204 10.148.138.11 10.148.138.11 10.148.140.14 10.148.140.15 10.148.137.143 10.148.137.204 10.148.138.11 10.148.140.14 10.148.140.15 3.2. 主机名 机器IP 对应的主机名 10.148.137.143 hadoop-137-143 10.148.137.204 hadoop-137-204 10.148.138.11 hadoop-138-11 10.148.140.14 hadoop-140-14 10.148.140.15 hadoop-140-15 注意主机名不能有下划线,否则启动时,SecondaryNameNode节点会报如下所示的错误(取自hadoop-hadoop-secondarynamenode-VM_39_166_sles10_64.out文件): Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.8.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'. Exception in thread "main" java.lang.IllegalArgumentException: The value of property bind.address must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:971) at org.apache.hadoop.conf.Configuration.set(Configuration.java:953) at org.apache.hadoop.http.HttpServer2.initializeWebServer(HttpServer2.java:391) at org.apache.hadoop.http.HttpServer2.(HttpServer2.java:344) at org.apache.hadoop.http.HttpServer2.(HttpServer2.java:104) at org.apache.hadoop.http.HttpServer2$Builder.build(HttpServer2.java:292) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:264) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:192) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:651) 3.2.1. 临时修改主机名 命令hostname不但可以查看主机名,还可以用它来修改主机名,格式为:hostname 新主机名。 在修改之前172.25.40.171对应的主机名为VM-40-171-sles10-64,而172.25.39.166对应的主机名为VM_39_166_sles10_64。两者的主机名均带有下划线,因此需要修改。为求简单,仅将原下划线改成横线: hostname VM-40-171-sles10-64 hostname VM-39-166-sles10-64 经过上述修改后,还不够,类似于修改环境变量,还需要通过修改系统配置文件做永久修改。 3.2.2. 永久修改主机名 不同的Linux发行版本,对应的系统配置文件可能不同,SuSE 10.1是/etc/HOSTNAME: # cat /etc/HOSTNAME VM_39_166_sles10_64 将文件中的“VM_39_166_sles10_64”,改成“VM-39-166-sles10-64”。有些Linux发行版本对应的可能是/etc/hostname文件,有些可能是/etc/sysconfig/network文件。 不但所在文件不同,修改的方法可能也不一样,比如有些是名字对形式,如/etc/sysconfig/network格式为:HOSTNAME=主机名。 修改之后,需要重启网卡,以使修改生效,执行命令:/etc/rc.d/boot.localnet start(不同系统命令会有差异,这是SuSE上的方法),再次使用hostname查看,会发现主机名变了。 直接重启系统,也可以使修改生效。 注意修改主机名后,需要重新验证ssh免密码登录,方法为:ssh 用户名@新的主机名。 可以通过以下多处查看机器名: 1) hostname命令(也可以用来修改主机名,但当次仅当次会话有效) 2) cat /proc/sys/kernel/hostname 3) cat /etc/hostname或cat /etc/sysconfig/network(永久性的修改,需要重启) 4) sysctl kernel.hostname(也可以用来修改主机名,但仅重启之前有效) 3.3. 免密码登录范围 要求能通过免登录包括使用IP和主机名都能免密码登录: 1) NameNode能免密码登录所有的DataNode 2) 各NameNode能免密码登录自己 3) 各NameNode间能免密码互登录 4) DataNode能免密码登录自己 5) DataNode不需要配置免密码登录NameNode和其它DataNode。 注:免密码登录不是必须的,如果不使用hadoop-daemons.sh等需要ssh、scp的脚本。 3.4. 修改最大可打开文件数 修改文件/etc/security/limits.conf,加入以下两行: * soft nofile 102400 * hard nofile 102400 # End of file 其中102400为一个进程最大可以打开的文件个数,当与RedisServer的连接数多时,需要设定为合适的值。 修改后,需要重新登录才会生效,如果是crontab,则需要重启crontab,如:service crond restart,有些平台可能是service cron restart。 3.5. OOM相关:vm.overcommit_memory 如果“/proc/sys/vm/overcommit_memory”的值为0,则会表示开启了OOM。可以设置为1关闭OOM,设置方法请参照net.core.somaxconn完成。 4. 约定 4.1. 
安装目录约定 为便于讲解,本文约定Hadoop、JDK安装目录如下: 安装目录 版本 说明 JDK /data/jdk 1.7.0 ln -s /data/jdk1.7.0_55 /data/jdk Hadoop /data/hadoop/hadoop 2.8.0 ln -s /data/hadoop/hadoop-2.8.0 /data/hadoop/hadoop 在实际安装部署时,可以根据实际进行修改。 4.2. 服务端口约定 端口 作用 9000 fs.defaultFS,如:hdfs://172.25.40.171:9000 9001 dfs.namenode.rpc-address,DataNode会连接这个端口 50070 dfs.namenode.http-address 50470 dfs.namenode.https-address 50100 dfs.namenode.backup.address 50105 dfs.namenode.backup.http-address 50090 dfs.namenode.secondary.http-address,如:172.25.39.166:50090 50091 dfs.namenode.secondary.https-address,如:172.25.39.166:50091 50020 dfs.datanode.ipc.address 50075 dfs.datanode.http.address 50475 dfs.datanode.https.address 50010 dfs.datanode.address,DataNode的数据传输端口 8480 dfs.journalnode.rpc-address,主备NameNode以http方式从这个端口获取edit文件 8481 dfs.journalnode.https-address 8032 yarn.resourcemanager.address 8088 yarn.resourcemanager.webapp.address,YARN的http端口 8090 yarn.resourcemanager.webapp.https.address 8030 yarn.resourcemanager.scheduler.address 8031 yarn.resourcemanager.resource-tracker.address 8033 yarn.resourcemanager.admin.address 8042 yarn.nodemanager.webapp.address 8040 yarn.nodemanager.localizer.address 8188 yarn.timeline-service.webapp.address 10020 mapreduce.jobhistory.address 19888 mapreduce.jobhistory.webapp.address 2888 ZooKeeper,如果是Leader,用来监听Follower的连接 3888 ZooKeeper,用于Leader选举 2181 ZooKeeper,用来监听客户端的连接 16010 hbase.master.info.port,HMaster的http端口 16000 hbase.master.port,HMaster的RPC端口 60030 hbase.regionserver.info.port,HRegionServer的http端口 60020 hbase.regionserver.port,HRegionServer的RPC端口 8080 hbase.rest.port,HBase REST server的端口 10000 hive.server2.thrift.port 9083 hive.metastore.uris 4.3. 各模块RPC和HTTP端口 模块 RPC端口 HTTP端口 HTTPS端口 HDFS JournalNode 8485 8480 8481 HDFS NameNode 8020 50070 HDFS DataNode 50020 50075 HDFS SecondaryNameNode 50090 50091 Yarn Resource Manager 8032 8088 8090 Yarn Node Manager 8040 8042 Yarn SharedCache 8788 HMaster 16010 HRegionServer 16030 HBase thrift 9090 9095 HBase rest 8085 注:DataNode通过端口50010传输数据。 5. 工作详单 为运行Hadoop(HDFS、YARN和MapReduce)需要完成的工作详单: JDK安装 Hadoop是Java语言开发的,所以需要。 免密码登录 NameNode控制SecondaryNameNode和DataNode使用了ssh和scp命令,需要无密码执行。 Hadoop安装和配置 这里指的是HDFS、YARN和MapReduce,不包含HBase、Hive等的安装。 6. JDK安装 本文安装的JDK 1.7.0版本。 6.1. 下载安装包 JDK最新二进制安装包下载网址: http://www.oracle.com/technetwork/java/javase/downloads JDK1.7二进制安装包下载网址: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html 本文下载的是64位Linux版本的JDK1.7:jdk-7u55-linux-x64.gz。请不要安装JDK1.8版本,JDK1.8和Hadoop 2.8.0不匹配,编译Hadoop 2.8.0源码时会报很多错误。 6.2. 安装步骤 JDK的安装非常简单,将jdk-7u55-linux-x64.gz上传到Linux,然后解压,接着配置好环境变量即可(本文jdk-7u55-linux-x64.gz被上传在/data目录下): 1) 进入/data目录 2) 解压安装包:tar xzf jdk-7u55-linux-x64.gz,解压后会在生成目录/data/jdk1.7.0_55 3) 建立软件链接:ln -s /data/jdk1.7.0_55 /data/jdk 4) 修改/etc/profile或用户目录下的profile,或同等文件,配置如下所示环境变量: export JAVA_HOME=/data/jdk export CLASSPATH=$JAVA_HOME/lib/tools.jar export PATH=$JAVA_HOME/bin:$PATH 完成这项操作之后,需要重新登录,或source一下profile文件,以便环境变量生效,当然也可以手工运行一下,以即时生效。如果还不放心,可以运行下java或javac,看看命令是否可执行。如果在安装JDK之前,已经可执行了,则表示不用安装JDK。 7. 
免密码ssh2登录 以下针对的是ssh2,而不是ssh,也不包括OpenSSH。配置分两部分:一是对登录机的配置,二是对被登录机的配置,其中登录机为客户端,被登录机为服务端,也就是解决客户端到服务端的无密码登录问题。下述涉及到的命令,可以直接拷贝到Linux终端上执行,已全部验证通过,操作环境为SuSE 10.1。 第一步,修改所有被登录机上的sshd配置文件/etc/ssh2/sshd2_config: 1) (如果不以root用户运行hadoop,则跳过这一步)将PermitRootLogin值设置为yes,也就是取掉前面的注释号# 2) 将AllowedAuthentications值设置为publickey,password,也就是取掉前面的注释号# 3) 重启sshd服务:service ssh2 restart 第二步,在所有登录机上,执行以下步骤: 1) 进入到.ssh2目录:cd ~/.ssh2 2) ssh-keygen2 -t dsa -P'' -P表示密码,-P''就表示空密码,也可以不用-P参数,但这样就要敲三次回车键,用-P''就一次回车。 成功之后,会在用户的主目录下生成私钥文件id_dsa_2048_a,和公钥文件id_dsa_2048_a.pub。 3) 生成identification文件:echo "IdKey id_dsa_2048_a" >> identification,请注意IdKey后面有一个空格,确保identification文件内容如下: # cat identification IdKey id_dsa_2048_a 4) 将文件id_dsa_2048_a.pub,上传到所有被登录机的~/.ssh2目录:scp id_dsa_2048_a.pub root@192.168.0.1:/root/.ssh2,这里假设192.168.0.1为其中一个被登录机的IP。在执行scp之前,请确保192.168.0.1上有/root/.ssh2这个目录,而/root/需要修改为root用户的实际HOME目录,通常环境变量$HOME为用户主目录,~也表示用户主目录,不带任何参数的cd命令也会直接切换到用户主目录。 第三步,在所有被登录机上,执行以下步骤: 1) 进入到.ssh2目录:cd ~/.ssh2 2) 生成authorization文件:echo "Key id_dsa_2048_a.pub" >> authorization,请注意Key后面有一个空格,确保authorization文件内容如下: # cat authorization Key id_dsa_2048_a.pub 完成上述工作之后,从登录机到被登录机的ssh登录就不需要密码了。如果没有配置好免密码登录,在启动时会遇到如下错误: Starting namenodes on [172.25.40.171] 172.25.40.171: Host key not found from database. 172.25.40.171: Key fingerprint: 172.25.40.171: xofiz-zilip-tokar-rupyb-tufer-tahyc-sibah-kyvuf-palik-hazyt-duxux 172.25.40.171: You can get a public key's fingerprint by running 172.25.40.171: % ssh-keygen -F publickey.pub 172.25.40.171: on the keyfile. 172.25.40.171: warning: tcgetattr failed in ssh_rl_set_tty_modes_for_fd: fd 1: Invalid argument 或下列这样的错误: Starting namenodes on [172.25.40.171] 172.25.40.171: hadoop's password: 建议生成的私钥和公钥文件名都带上自己的IP,否则会有些混乱。 按照中免密码登录范围的说明,配置好所有的免密码登录。更多关于免密码登录说明,请浏览技术博客: 1) http://blog.chinaunix.net/uid-20682147-id-4212099.html(两个SSH2间免密码登录) 2) http://blog.chinaunix.net/uid-20682147-id-4212097.html(SSH2免密码登录OpenSSH) 3) http://blog.chinaunix.net/uid-20682147-id-4212094.html(OpenSSH免密码登录SSH2) 4) http://blog.chinaunix.net/uid-20682147-id-5520240.html(两个openssh间免密码登录) 8. Hadoop安装和配置 本部分仅包括HDFS、MapReduce和Yarn的安装,不包括HBase、Hive等的安装。 8.1. 下载安装包 Hadoop二进制安装包下载网址:http://hadoop.apache.org/releases.html#Download(或直接进入http://mirror.bit.edu.cn/apache/hadoop/common/进行下载),本文下载的是hadoop-2.8.0版本(安装包: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz,源码包:http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.0/hadoop-2.8.0-src.tar.gz)。 官方的安装说明请浏览Cluster Setup: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html。 8.2. 安装和环境变量配置 1) 将Hadoop安装包hadoop-2.8.0.tar.gz上传到/data/hadoop目录下 2) 进入/data/hadoop目录 3) 在/data/hadoop目录下,解压安装包hadoop-2.8.0.tar.gz:tar xzf hadoop-2.8.0.tar.gz 4) 建立软件链接:ln -s /data/hadoop/hadoop-2.8.0 /data/hadoop/hadoop 5) 修改用户主目录下的文件.profile(当然也可以是/etc/profile或其它同等效果的文件),设置Hadoop环境变量: export JAVA_HOME=/data/jdk export HADOOP_HOME=/data/hadoop/hadoop export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop export PATH=$HADOOP_HOME/bin:$PATH 需要重新登录以生效,或者在终端上执行:export HADOOP_HOME=/data/hadoop/hadoop也可以即时生效。 8.3. 修改hadoop-env.sh 修改所有节点上的$HADOOP_HOME/etc/hadoop/hadoop-env.sh文件,在靠近文件头部分加入:export JAVA_HOME=/data/jdk 特别说明一下:虽然在/etc/profile已经添加了JAVA_HOME,但仍然得修改所有节点上的hadoop-env.sh,否则启动时,报如下所示的错误: 10.12.154.79: Error: JAVA_HOME is not set and could not be found. 10.12.154.77: Error: JAVA_HOME is not set and could not be found. 10.12.154.78: Error: JAVA_HOME is not set and could not be found. 
10.12.154.78: Error: JAVA_HOME is not set and could not be found. 10.12.154.77: Error: JAVA_HOME is not set and could not be found. 10.12.154.79: Error: JAVA_HOME is not set and could not be found. 除JAVA_HOME之外,再添加: export HADOOP_HOME=/data/hadoop/hadoop export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop 同时,建议将下列添加到/etc/profile或~/.profile中: export JAVA_HOME=/data/jdk export HADOOP_HOME=/data/hadoop/hadoop export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 8.4. 修改/etc/hosts 为省去不必要的麻烦,建议在所有节点的/etc/hosts文件,都做如下所配置: 10.148.137.143 hadoop-137-143 # NameNode 10.148.137.204 hadoop-137-204 # NameNode 10.148.138.11 hadoop-138-11 # DataNode 10.148.140.14 hadoop-140-14 # DataNode 10.148.140.15 hadoop-140-15 # DataNode 注意不要为一个IP配置多个不同主机名,否则HTTP页面可能无法正常运作。 主机名,如VM-39-166-sles10-64,可通过hostname命令取得。由于都配置了主机名,在启动HDFS或其它之前,需要确保针对主机名进行过ssh,否则启动时,会遇到如下所示的错误: VM-39-166-sles10-64: Host key not found from database. VM-39-166-sles10-64: Key fingerprint: VM-39-166-sles10-64: xofiz-zilip-tokar-rupyb-tufer-tahyc-sibah-kyvuf-palik-hazyt-duxux VM-39-166-sles10-64: You can get a public key's fingerprint by running VM-39-166-sles10-64: % ssh-keygen -F publickey.pub VM-39-166-sles10-64: on the keyfile. VM-39-166-sles10-64: warning: tcgetattr failed in ssh_rl_set_tty_modes_for_fd: fd 1: Invalid argument 上述错误表示没有以主机名ssh过一次VM-39-166-sles10-64。按下列方法修复错误: ssh hadoop@VM-39-166-sles10-64 Host key not found from database. Key fingerprint: xofiz-zilip-tokar-rupyb-tufer-tahyc-sibah-kyvuf-palik-hazyt-duxux You can get a public key's fingerprint by running % ssh-keygen -F publickey.pub on the keyfile. Are you sure you want to continue connecting (yes/no)? yes Host key saved to /data/hadoop/.ssh2/hostkeys/key_36000_137vm_13739_137166_137sles10_13764.pub host key for VM-39-166-sles10-64, accepted by hadoop Thu Apr 17 2014 12:44:32 +0800 Authentication successful. Last login: Thu Apr 17 2014 09:24:54 +0800 from 10.32.73.69 Welcome to SuSE Linux 10 SP2 64Bit Nov 10,2010 by DIS Version v2.6.20101110 No mail. 8.5. 修改slaves 这些脚本使用到了slaves: hadoop-daemons.sh slaves.sh start-dfs.sh stop-dfs.sh yarn-daemons.sh 这些脚本都依赖无密码SSH,如果没有使用到,则可以不管slaves文件。 slaves即为HDFS的DataNode节点。当使用脚本start-dfs.sh来启动hdfs时,会使用到这个文件,以无密码登录方式到各slaves上启动DataNode。 修改主NameNode和备NameNode上的$HADOOP_HOME/etc/hadoop/slaves文件,将slaves的节点IP(也可以是相应的主机名)一个个加进去,一行一个IP,如下所示: > cat slaves 10.148.138.11 10.148.140.14 10.148.140.15 8.6. 准备好各配置文件 配置文件放在$HADOOP_HOME/etc/hadoop目录下,对于Hadoop 2.3.0、Hadoop 2.8.0和Hadoop 2.8.0版本,该目录下的core-site.xml、yarn-site.xml、hdfs-site.xml和mapred-site.xml都是空的。如果不配置好就启动,如执行start-dfs.sh,则会遇到各种错误。 可从$HADOOP_HOME/share/hadoop目录下拷贝一份到/etc/hadoop目录,然后在此基础上进行修改(以下内容可以直接拷贝执行,2.3.0版本中各default.xml文件路径不同于2.8.0版本): # 进入$HADOOP_HOME目录 cd $HADOOP_HOME cp ./share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.xml ./etc/hadoop/core-site.xml cp ./share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml ./etc/hadoop/hdfs-site.xml cp ./share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml ./etc/hadoop/yarn-site.xml cp ./share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml ./etc/hadoop/mapred-site.xml 接下来,需要对默认的core-site.xml、yarn-site.xml、hdfs-site.xml和mapred-site.xml进行适当的修改,否则仍然无法启动成功。 QJM的配置参照的官方文档: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html。 8.7. 
修改hdfs-site.xml 对hdfs-site.xml文件的修改,涉及下表中的属性: 属性名 属性值 说明 dfs.nameservices mycluster dfs.ha.namenodes.mycluster nn1,nn2 同一nameservice下,只能配置一或两个,也就是不能有nn3了 dfs.namenode.rpc-address.mycluster.nn1 hadoop-137-143:8020 dfs.namenode.rpc-address.mycluster.nn2 hadoop-137-204:8020 dfs.namenode.http-address.mycluster.nn1 hadoop-137-143:50070 dfs.namenode.http-address.mycluster.nn2 hadoop-137-204:50070 dfs.namenode.shared.edits.dir qjournal://hadoop-137-143:8485;hadoop-137-204:8485;hadoop-138-11:8485/mycluster 至少三台Quorum Journal节点配置 dfs.client.failover.proxy.provider.mycluster org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider 客户端通过它来找主NameNode dfs.ha.fencing.methods sshfence 如果配置为sshfence,当主NameNode异常时,使用ssh登录到主NameNode,然后使用fuser将主NameNode杀死,因此需要确保所有NameNode上可以使用fuser。 用来保证同一时刻只有一个主NameNode,以防止脑裂。可带用户名和端口参数,格式示例:sshfence([[username][:port]]);值还可以为shell脚本,格式示例: shell(/path/to/my/script.sh arg1 arg2 ...),如: shell(/bin/true) 如果sshd不是默认的22端口时,就需要指定。 dfs.ha.fencing.ssh.private-key-files /data/hadoop/.ssh2/id_dsa_2048_a 指定私钥,如果是OpenSSL,则值为/data/hadoop/.ssh/id_rsa dfs.ha.fencing.ssh.connect-timeout 30000 可选的配置 dfs.journalnode.edits.dir /data/hadoop/hadoop/journal 这个不要带前缀“file://”,JournalNode存储其本地状态的位置,在JouralNode机器上的绝对路径,JNs的edits和其他本地状态将被存储在此处。此处如果带前缀,则会报“Journal dir should be an absolute path” dfs.datanode.data.dir file:///data/hadoop/hadoop/data 请带上前缀“file://”,不要全配置成SSD类型,否则写文件时会遇到错误“Failed to place enough replicas” dfs.namenode.name.dir 请带上前缀“file://”,NameNode元数据存放目录,默认值为file://${hadoop.tmp.dir}/dfs/name,也就是在临时目录下,可以考虑放到数据目录下 dfs.namenode.checkpoint.dir 默认值为file://${hadoop.tmp.dir}/dfs/namesecondary,但如果没有启用SecondaryNameNode,则不需要 dfs.ha.automatic-failover.enabled true 自动主备切换 dfs.datanode.max.xcievers 4096 可选修改,类似于linux的最大可打开的文件个数,默认为256,建议设置成大一点。同时,需要保证系统可打开的文件个数足够(可通过ulimit命令查看)。该错误会导致hbase报“notservingregionexception”。 dfs.journalnode.rpc-address 0.0.0.0:8485 配置JournalNode的RPC端口号,默认为0.0.0.0:8485,可以不用修改 dfs.hosts 可选配置,但建议配置,以防止其它DataNode无意中连接进来。用于配置DataNode白名单,只有在白名单中的DataNode才能连接NameNode。dfs.hosts的值为一本地文件绝对路径,如:/data/hadoop/etc/hadoop/hosts.include dfs.hosts.exclude 正常不要填写,需要下线DataNode时用到。dfs.hosts.exclude的值为本地文件的绝对路径,文件内容为每行一个需要下线的DataNode主机名或IP地址,如:/data/hadoop/etc/hadoop/hosts.exclude dfs.namenode.num.checkpoints.retained 2 默认为2,指定NameNode保存fsImage文件的个数 dfs.namenode.num.extra.edits.retained 1000000 Edit文件保存个数 dfs.namenode.max.extra.edits.segments.retained 10000 dfs.datanode.scan.period.hours 默认为504小时 dfs.blockreport.intervalMsec DataNode向NameNode报告块信息的时间间隔,默认值为21600000毫秒 dfs.datanode.directoryscan.interval DataNode进行内存和磁盘数据集块校验,更新内存中的信息和磁盘中信息的不一致情况,默认值为21600秒 dfs.heartbeat.interval 3 向NameNode发心跳的间隔,单位:秒 详细配置可参考: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml。 8.8. 修改core-site.xml 对core-site.xml文件的修改,涉及下表中的属性: 属性名 属性值 说明 fs.defaultFS hdfs://mycluster fs.default.name hdfs://mycluster 按理应当不用填写这个参数,因为fs.defaultFS已取代它,但启动时报错: fs.defaultFS is file:/// hadoop.tmp.dir /data/hadoop/hadoop/tmp ha.zookeeper.quorum hadoop-137-143:2181,hadoop-138-11:2181,hadoop-140-14:2181 ha.zookeeper.parent-znode /mycluster/hadoop-ha io.seqfile.local.dir 默认值为${hadoop.tmp.dir}/io/local fs.s3.buffer.dir 默认值为${hadoop.tmp.dir}/s3 fs.s3a.buffer.dir 默认值为${hadoop.tmp.dir}/s3a 注意启动之前,需要将配置的目录创建好,如创建好/data/hadoop/current/tmp目录。详细可参考: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xm。 8.8.1. 
dfs.namenode.rpc-address 如果没有配置,则启动时报如下错误: Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured. 这里需要指定IP和端口,如果只指定了IP,如10.148.137.143,则启动时输出如下: Starting namenodes on [] 改成“hadoop-137-143:8020”后,则启动时输出为: Starting namenodes on [10.148.137.143] 8.9. 修改mapred-site.xml 对hdfs-site.xml文件的修改,涉及下表中的属性: 属性名 属性值 涉及范围 mapreduce.framework.name yarn 所有mapreduce节点 详细配置可参考: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml。 8.10. 修改yarn-site.xml 对yarn-site.xml文件的修改,涉及下表中的属性: 属性名 属性值 涉及范围 yarn.resourcemanager.hostname 0.0.0.0 ResourceManager NodeManager HA模式可不配置,但由于其它配置项可能有引用它,建议保持值为0.0.0.0,如果没有被引用到,则可不配置。 yarn.nodemanager.hostname 0.0.0.0 yarn.nodemanager.aux-services mapreduce_shuffle 以下为HA相关的配置,包括自动切换(可仅可在ResourceManager节点上配置) yarn.resourcemanager.ha.enabled true 启用HA yarn.resourcemanager.cluster-id yarn-cluster 可不同于HDFS的 yarn.resourcemanager.ha.rm-ids rm1,rm2 注意NodeManager要和ResourceManager一样配置 yarn.resourcemanager.hostname.rm1 hadoop-137-143 yarn.resourcemanager.hostname.rm2 hadoop-137-204 yarn.resourcemanager.webapp.address.rm1 hadoop-137-143:8088 yarn.resourcemanager.webapp.address.rm2 hadoop-137-204:8088 yarn.resourcemanager.zk-address hadoop-137-143:2181,hadoop-137-204:2181,hadoop-138-11:2181 yarn.resourcemanager.ha.automatic-failover.enable true 可不配置,因为当yarn.resourcemanager.ha.enabled为true时,它的默认值即为true 以下为NodeManager配置 yarn.nodemanager.vmem-pmem-ratio 每使用1MB物理内存,最多可用的虚拟内存数,默认值为2.1,在运行spark-sql时如果遇到“Yarn application has already exited with state FINISHED”,则应当检查NodeManager的日志,以查看是否该配置偏小原因 yarn.nodemanager.resource.cpu-vcores NodeManager总的可用虚拟CPU个数,默认值为8 yarn.nodemanager.resource.memory-mb 该节点上YARN可使用的物理内存总量,默认是8192(MB) yarn.nodemanager.pmem-check-enabled 是否启动一个线程检查每个任务正使用的物理内存量,如果任务超出分配值,则直接将其杀掉,默认是true yarn.nodemanager.vmem-check-enabled 是否启动一个线程检查每个任务正使用的虚拟内存量,如果任务超出分配值,则直接将其杀掉,默认是true 以下为ResourceManager配置 yarn.scheduler.minimum-allocation-mb 单个容器可申请的最小内存 yarn.scheduler.maximum-allocation-mb 单个容器可申请的最大内存 yarn.nodemanager.hostname如果配置成具体的IP,如10.12.154.79,则会导致每个NodeManager的配置不同。详细配置可参考: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml。 Yarn HA的配置可以参考: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html。 9. 启动顺序 Zookeeper -> JournalNode -> 格式化NameNode -> 初始化JournalNode -> 创建命名空间(zkfc) -> NameNode -> 主备切换进程 -> DataNode -> ResourceManager -> NodeManager 但请注意首次启动NameNode之前,得先做format,也请注意备NameNode的启动方法。主备切换进程的启动只需要在“创建命名空间(zkfc)”之后即可。 10. 启动HDFS 在启动HDFS之前,需要先完成对NameNode的格式化。 10.1. 启动好zookeeper ./zkServer.sh start 注意在启动其它之前先启动zookeeper。 10.2. 创建主备切换命名空间 这一步和格式化NameNode、实始化JournalNode无顺序关系。在其中一个namenode上执行:./hdfs zkfc -formatZK 成功后,将在ZooKeer上创建core-site.xml中ha.zookeeper.parent-znode指定的路径。如果有修改hdfs-site.xml中的dfs.ha.namenodes.mycluster值,则需要重新做一次formatZK,否则自动主备NameNode切换将失效。zkfc进程的日志文件将发现如下信息(假设nm1改成了nn1): Unable to determine service address for namenode 'nm1' 注意如果有修改dfs.ha.namenodes.mycluster,上层的HBase等依赖HBase的也需要重启。 10.3. 启动所有JournalNode NameNode将元数据操作日志记录在JournalNode上,主备NameNode通过记录在JouralNode上的日志完成元数据同步。 在所有JournalNode上执行(注意是两个参数,在“hdfs namenode -format”之后做这一步): ./hadoop-daemon.sh start journalnode 注意,在执行“hdfs namenode -format”之前,必须先启动好JournalNode,而format又必须在启动namenode之前。 10.4. 
格式化NameNode 注意只有新的,才需要做这一步,而且只需要在主NameNode上执行。 1) 进入$HADOOP_HOME/bin目录 2) 进行格式化:./hdfs namenode -format 如果完成有,输出包含“INFO util.ExitUtil: Exiting with status 0”,则表示格式化成功。 在进行格式化时,如果没有在/etc/hosts文件中添加主机名和IP的映射:“172.25.40.171 VM-40-171-sles10-64”,则会报如下所示错误: 14/04/17 03:44:09 WARN net.DNS: Unable to determine local hostname -falling back to "localhost" java.net.UnknownHostException: VM-40-171-sles10-64: VM-40-171-sles10-64: unknown error at java.net.InetAddress.getLocalHost(InetAddress.java:1484) at org.apache.hadoop.net.DNS.resolveLocalHostname(DNS.java:264) at org.apache.hadoop.net.DNS.(DNS.java:57) at org.apache.hadoop.hdfs.server.namenode.NNStorage.newBlockPoolID(NNStorage.java:945) at org.apache.hadoop.hdfs.server.namenode.NNStorage.newNamespaceInfo(NNStorage.java:573) at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:144) at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:845) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1256) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1370) Caused by: java.net.UnknownHostException: VM-40-171-sles10-64: unknown error at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:907) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1302) at java.net.InetAddress.getLocalHost(InetAddress.java:1479) ... 8 more 10.5. 初始化JournalNode 这一步需要在格式化NameNode之后进行! 如果是非HA转HA才需要这一步,在其中一个JournalNode上执行: ./hdfs namenode -initializeSharedEdits 此命令默认是交互式的,加上参数-force转成非交互式。 在所有JournalNode创建如下目录: mkdir -p /data/hadoop/hadoop/journal/mycluster/current 如果此步在格式化NameNode前运行,则会报错“NameNode is not formatted”。 10.6. 启动主NameNode 1) 进入$HADOOP_HOME/sbin目录 2) 启动主NameNode: ./hadoop-daemon.sh start namenode 启动时,遇到如下所示的错误,则表示NameNode不能免密码登录自己。如果之前使用IP可以免密码登录自己,则原因一般是因为没有使用主机名登录过自己,因此解决办法是使用主机名SSH一下,比如:ssh hadoop@VM_40_171_sles10_64,然后再启动。 Starting namenodes on [VM_40_171_sles10_64] VM_40_171_sles10_64: Host key not found from database. VM_40_171_sles10_64: Key fingerprint: VM_40_171_sles10_64: xofiz-zilip-tokar-rupyb-tufer-tahyc-sibah-kyvuf-palik-hazyt-duxux VM_40_171_sles10_64: You can get a public key's fingerprint by running VM_40_171_sles10_64: % ssh-keygen -F publickey.pub VM_40_171_sles10_64: on the keyfile. VM_40_171_sles10_64: warning: tcgetattr failed in ssh_rl_set_tty_modes_for_fd: fd 1: Invalid argument 10.7. 启动备NameNode 1) ./hdfs namenode -bootstrapStandby 2) ./hadoop-daemon.sh start namenode 如果没有执行第1步,直接启动会遇到如下错误: No valid image files found 或者在该NameNode日志会发现如下错误: 2016-04-08 14:08:39,745 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage java.io.IOException: NameNode is not formatted. 10.8. 启动主备切换进程 在所有NameNode上启动主备切换进程: ./hadoop-daemon.sh start zkfc 只有启动了DFSZKFailoverController进程,HDFS才能自动切换主备。 注:zkfc是zookeeper failover controller的缩写。 10.9. 启动所有DataNode 在各个DataNode上分别执行: ./hadoop-daemon.sh start datanode 如果有发现DataNode进程并没有起来,可以试试删除logs目录下的DataNode日志,再得启看看。 10.10. 检查启动是否成功 1) 使用JDK提供的jps命令,查看相应的进程是否已启动 2) 检查$HADOOP_HOME/logs目录下的log和out文件,看看是否有异常信息。 启动后nn1和nn2都处于备机状态,将nn1切换为主机: ./hdfs haadmin -transitionToActive nn1 10.10.1. DataNode 执行jps命令(注:jps是jdk中的一个命令,不是jre中的命令),可看到DataNode进程: $ jps 18669 DataNode 24542 Jps 10.10.2. NameNode 执行jps命令,可看到NameNode进程: $ jps 18669 NameNode 24542 Jps 10.11. 执行HDFS命令 执行HDFS命令,以进一步检验是否已经安装成功和配置好。关于HDFS命令的用法,直接运行命令hdfs或hdfs dfs,即可看到相关的用法说明。 10.11.1. 
查看DataNode是否正常启动 hdfs dfsadmin -report 注意如果core-site.xml中的配置项fs.default.name的值为file:///,则会报: report: FileSystem file:/// is not an HDFS file system Usage: hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] 解决这个问题,只需要将fs.default.name的值设置为和fs.defaultFS相同的值。 10.11.2. 查看NameNode的主备状态 如查看NameNode1和NameNode2分别是主还是备: $ hdfs haadmin -getServiceState nn1 standby $ hdfs haadmin -getServiceState nn2 active 10.11.3. hdfs dfs ls “hdfs dfs -ls”带一个参数,如果参数以“hdfs://URI”打头表示访问HDFS,否则相当于ls。其中URI为NameNode的IP或主机名,可以包含端口号,即hdfs-site.xml中“dfs.namenode.rpc-address”指定的值。 “hdfs dfs -ls”要求默认端口为8020,如果配置成9000,则需要指定端口号,否则不用指定端口,这一点类似于浏览器访问一个URL。示例: > hdfs dfs -ls hdfs:///172.25.40.171:9001/ 9001后面的斜杠/是和必须的,否则被当作文件。如果不指定端口号9001,则使用默认的8020,“172.25.40.171:9001”由hdfs-site.xml中“dfs.namenode.rpc-address”指定。 不难看出“hdfs dfs -ls”可以操作不同的HDFS集群,只需要指定不同的URI。 文件上传后,被存储在DataNode的data目录下(由DataNode的hdfs-site.xml中的属性“dfs.datanode.data.dir”指定),如: $HADOOP_HOME/data/current/BP-139798373-172.25.40.171-1397735615751/current/finalized/blk_1073741825 文件名中的“blk”是block,即块的意思,默认情况下blk_1073741825即为文件的一个完整块,Hadoop未对它进额外处理。 10.11.4. hdfs dfs -put 上传文件命令,示例: > hdfs dfs -put /etc/SuSE-release hdfs:///172.25.40.171:9001/ 10.11.5. hdfs dfs -rm 删除文件命令,示例: > hdfs dfs -rm hdfs://172.25.40.171:9001/SuSE-release Deleted hdfs://172.25.40.171:9001/SuSE-release 10.11.6. 人工切换主备NameNode hdfs haadmin -failover --forcefence --forceactive nn1 nn2 # 让nn2成为主NameNode 10.11.7. HDFS只允许有一主一备两个NameNode 注:hadoop-3.0版本将支持多备NameNode,类似于HBase那样。 如果试图配置三个NameNode,如: dfs.ha.namenodes.test nn1,nn2,nn3 The prefix for a given nameservice, contains a comma-separated list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE). 则运行“hdfs namenode -bootstrapStandby”时会报如下错误,表示在同一NameSpace内不能超过2个NameNode: 16/04/11 09:51:57 ERROR namenode.NameNode: Failed to start namenode. java.io.IOException: java.lang.IllegalArgumentException: Expected exactly 2 NameNodes in namespace 'test'. Instead, got only 3 (NN ids were 'nn1','nn2','nn3' at org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:425) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1454) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554) Caused by: java.lang.IllegalArgumentException: Expected exactly 2 NameNodes in namespace 'test'. Instead, got only 3 (NN ids were 'nn1','nn2','nn3' at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115) 10.11.8. 
存储均衡start-balancer.sh 示例:start-balancer.sh –t 10% 10%表示机器与机器之间磁盘使用率偏差小于10%时认为均衡,否则做均衡搬动。“start-balancer.sh”调用“hdfs start balancer”来做均衡,可以调用stop-balancer.sh停止均衡。 均衡过程非常慢,但是均衡过程中,仍能够正常访问HDFS,包括往HDFS上传文件。 [VM2016@hadoop-030 /data4/hadoop/sbin]$ hdfs balancer # 可以改为调用start-balancer.sh 16/04/08 14:26:55 INFO balancer.Balancer: namenodes = [hdfs://test] // test为HDFS的cluster名 16/04/08 14:26:55 INFO balancer.Balancer: parameters = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 0] Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved 16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.231:50010 16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.229:50010 16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.213:50010 16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.208:50010 16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.232:50010 16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.207:50010 16/04/08 14:26:56 INFO balancer.Balancer: 5 over-utilized: [192.168.1.231:50010:DISK, 192.168.1.229:50010:DISK, 192.168.1.213:50010:DISK, 192.168.1.208:50010:DISK, 192.168.1.232:50010:DISK] 16/04/08 14:26:56 INFO balancer.Balancer: 1 underutilized(未充分利用的): [192.168.1.207:50010:DISK] # 数据将移向该节点 16/04/08 14:26:56 INFO balancer.Balancer: Need to move 816.01 GB to make the cluster balanced. # 需要移动816.01G数据达到平衡 16/04/08 14:26:56 INFO balancer.Balancer: Decided to move 10 GB bytes from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK # 从192.168.1.231移动10G数据到192.168.1.207 16/04/08 14:26:56 INFO balancer.Balancer: Will move 10 GB in this iteration 16/04/08 14:32:58 INFO balancer.Dispatcher: Successfully moved blk_1073749366_8542 with size=77829046 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010 16/04/08 14:32:59 INFO balancer.Dispatcher: Successfully moved blk_1073749386_8562 with size=77829046 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.231:50010 16/04/08 14:33:34 INFO balancer.Dispatcher: Successfully moved blk_1073749378_8554 with size=77829046 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.231:50010 16/04/08 14:34:38 INFO balancer.Dispatcher: Successfully moved blk_1073749371_8547 with size=134217728 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010 16/04/08 14:34:54 INFO balancer.Dispatcher: Successfully moved blk_1073749395_8571 with size=134217728 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.231:50010 Apr 8, 2016 2:35:01 PM 0 478.67 MB 816.01 GB 10 GB 16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.213:50010 16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.229:50010 16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.232:50010 16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.231:50010 16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.208:50010 16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.207:50010 16/04/08 14:35:10 INFO balancer.Balancer: 5 over-utilized: [192.168.1.213:50010:DISK, 
192.168.1.229:50010:DISK, 192.168.1.232:50010:DISK, 192.168.1.231:50010:DISK, 192.168.1.208:50010:DISK] 16/04/08 14:35:10 INFO balancer.Balancer: 1 underutilized(未充分利用的): [192.168.1.207:50010:DISK] 16/04/08 14:35:10 INFO balancer.Balancer: Need to move 815.45 GB to make the cluster balanced. 16/04/08 14:35:10 INFO balancer.Balancer: Decided to move 10 GB bytes from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK 16/04/08 14:35:10 INFO balancer.Balancer: Will move 10 GB in this iteration 16/04/08 14:41:18 INFO balancer.Dispatcher: Successfully moved blk_1073760371_19547 with size=77829046 from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010 16/04/08 14:41:19 INFO balancer.Dispatcher: Successfully moved blk_1073760385_19561 with size=77829046 from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010 16/04/08 14:41:22 INFO balancer.Dispatcher: Successfully moved blk_1073760393_19569 with size=77829046 from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010 16/04/08 14:41:23 INFO balancer.Dispatcher: Successfully moved blk_1073760363_19539 with size=77829046 from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010 10.11.9. 查看文件分布在哪些节点 hdfs fsck hdfs:///tmp/slaves -files -locations -blocks 10.11.10. 关闭安全模式 hdfs dfsadmin -safemode leave 10.11.11. 删除missing blocks hdfs fsck -delete 11. 扩容和下线 11.1. 新增JournalNode 如果是扩容,将已有JournalNode的current目录打包到新机器的“dfs.journalnode.edits.dir”指定的相同位置下。 为保证扩容和缩容JournalNode成功,需要先将NameNode和JournalNode全停止掉,再修改配置,然后在启动JournalNode成功后(日志停留在“IPC Server listener on 8485: starting”处),再启动NameNode,否则可能遇到如下这样的错误: org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: Can't write, no segment open 找一台已有JournalNode节点,修改它的hdfs-site.xml,将新增的Journal包含进来,如在 qjournal://hadoop-030:8485;hadoop-031:8485;hadoop-032:8485/test 的基础上新增hadoop-033和hadoop-034两个JournalNode: qjournal://hadoop-030:8485;hadoop-031:8485;hadoop-032:8485;hadoop-033:8485;hadoop-034:8485/test 然后将安装目录和数据目录(hdfs-site.xml中的dfs.journalnode.edits.dir指定的目录)都复制到新的节点。 如果不复制JournalNode的数据目录,则新节点上的JournalNode可能会报错“Journal Storage Directory /data/journal/test not formatted”,将来的版本可能会实现自动同步。ZooKeeper的扩容不需要复制已有节点的data和datalog,而且也不能这样操作。 接下来,就可以在新节点上启动好JournalNode(不需要做什么初始化),并重启下NameNode。注意观察JournalNode日志,查看是否启动成功,当日志显示为以下这样的INFO级别日志则表示启动成功: 2016-04-26 10:31:11,160 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data/journal/test/current/edits_inprogress_0000000000000194269 -> /data/journal/test/current/edits_0000000000000194269-0000000000000194270 但只能出现如下的日志,才表示工作正常: 2017-05-18 15:22:42,901 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: starting 2017-05-18 15:23:27,028 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: Initializing journal in directory /data/journal/data/test 2017-05-18 15:23:27,042 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data/journal/data/test/in_use.lock acquired by nodename 15259@hadoop-40 2017-05-18 15:23:27,057 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/data/journal/data/test) 2017-05-18 15:23:27,152 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=/data/journal/data/test/current/edits_inprogress_0000000000027248811,first=0000000000027248811,last=0000000000027248811,inProgress=true,hasCorruptHeader=false) 11.2. 新NameNode如何加入? 
记得更换NameNode后,需要重新执行“hdfs zkfc -formatZK”,否则将不能自动主备切换。 当有NameNode机器损坏时,必然存在新NameNode来替代。把配置修改成指向新NameNode,然后以备机形式启动新NameNode,这样新的NameNode即加入到Cluster中: 1) ./hdfs namenode -bootstrapStandby 2) ./hadoop-daemon.sh start namenode 记启动主备切换进程DFSZKFailoverController,否则将不能自动做主备切换!!! 新的NameNode通过bootstrapStandby操作从主NameNode拉取fsImage(hadoop-091:50070为主NameNode): 17/04/24 14:25:32 INFO namenode.TransferFsImage: Opening connection to http://hadoop-091:50070/imagetransfer?getimage=1&txid=2768127&storageInfo=-63:2009831148:1492719902489:CID-5b2992bb-4dcb-4211-8070-6934f4d232a8&bootstrapstandby=true 17/04/24 14:25:32 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds 17/04/24 14:25:32 INFO namenode.TransferFsImage: Transfer took 0.01s at 28461.54 KB/s 17/04/24 14:25:32 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000002768127 size 379293 bytes. 如果没有足够多的DataNode连接到NameNode,则NameNode也会进入safe模式,下面的信息显示只有0台DataNodes连接到了NameNode。 原因有可能是因为修改了dfs.ha.namenodes.mycluster的值,导致DataNode不认识,比如将nm1改成了nn1等,这个时候还需要重新formatZK,否则自动主备切换将失效。 如果DataNode上的配置也同步修改了,但修改后未重启,则需要重启DataNode: Safe mode is ON. The reported blocks 0 needs additional 12891 blocks to reach the threshold 0.9990 of total blocks 12904. The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached. 11.3. 扩容DataNode 兼容的版本,可以跨版本扩容,比如由Hadoop-2.7.2扩容Hadoop-2.8.0。扩容方法为在新增的机器上安装和配置好DataNode,在成功启动DataNode后,在主NameNode上执行命令:bin/hdfs dfsadmin -refreshNodes,即完成扩容。 如要数据均衡到新加入的机器,执行命令:sbin/start-balancer.sh,可带参数-threshold,默认值为10,如:sbin/start-balancer.sh -threshold 5。参数-threshold的取值范围为0~100。 balancer命令可在NameNode和DataNode上执行,但最好在新增机器或空闲机器上执行。 参数-threshold的值表示节点存储使用率和集群存储使用率间的关系,如果节点的存储使用率小于集群存储的使用率,则执行balance操作。 11.4. 
下线DataNode 约束:本操作需要在主NameNode上进行,即状态为active的NameNode上进行!!!如果备NameNode也运行着,建议备的hdfs-site.xml也做同样修改,以防止下线过程中发现主备NameNode切换,或者干脆停掉备NameNode。 下线完成后,记得将hdfs-site.xml修改回来(即将dfs.hosts.exclude值恢复为空值),如果不修改回来,那被下线掉的DataNode将一直处于Decommissioned状态,同时还得做一次“/data/hadoop/bin/hdfs dfsadmin -refreshNodes”,否则被下线的DataNode一直处于Decommissioned状态。 下线后,只要配置了dfs.hosts,即使被下线的DataNode进程未停掉,也不会再连接进来,而且这是推荐的方式,以防止外部的DataNode无意中连接进来。 但在将dfs.hosts.exclude值恢复为空值之前,需要将已下线的所有DataNode进程停掉,最好还设置hdfs-site.xml中的dfs.hosts值,以限制可以连接NameNode的DataNode,不然一不小心,被下线的DataNode又连接上来了,切记!另外注意,如果有用到slaves文件,也需要slaves同步修改。 修改主NameNode的hdfs-site.xml,设置dfs.hosts.exclude的值,值为一文件的全路径,如:/home/hadoop/etc/hadoop/hosts.exclude。文件内容为需要下线(即删除)的DataNode的机器名或IP,每行一个机器名或IP(注意暂不要将下线的DataNode从slaves中剔除)。 修改完hdfs-site.xml后,在主NameNode上执行:bin/hdfs dfsadmin -refreshNodes,以刷新DataNode,下线完成后可同扩容一样做下balance。 使用命令bin/hdfs dfsadmin -report或web界面可以观察下线的DataNode退役(Decommissioning)状态。完成后,将下线的DataNode从slaves中剔除。 下线前的状态: $ hdfs dfsadmin -report Name: 192.168.31.33:50010 (hadoop-33) Hostname: hadoop-33 Decommission Status : Normal Configured Capacity: 3247462653952 (2.95 TB) DFS Used: 297339283 (283.56 MB) Non DFS Used: 165960652397 (154.56 GB) DFS Remaining: 3081204662272 (2.80 TB) DFS Used%: 0.01% DFS Remaining%: 94.88% Configured Cache Capacity: 0 (0 B) Cache Used: 0 (0 B) Cache Remaining: 0 (0 B) Cache Used%: 100.00% Cache Remaining%: 0.00% Xceivers: 1 Last contact: Wed Apr 19 18:03:33 CST 2017 下线进行中的的状态: $ hdfs dfsadmin -report Name: 192.168.31.33:50010 (hadoop-33) Hostname: hadoop-33 Decommission Status : Decommission in progress Configured Capacity: 3247462653952 (2.95 TB) DFS Used: 297339283 (283.56 MB) Non DFS Used: 165960652397 (154.56 GB) DFS Remaining: 3081204662272 (2.80 TB) DFS Used%: 0.01% DFS Remaining%: 94.88% Configured Cache Capacity: 0 (0 B) Cache Used: 0 (0 B) Cache Remaining: 0 (0 B) Cache Used%: 100.00% Cache Remaining%: 0.00% Xceivers: 16 Last contact: Thu Apr 20 09:00:48 CST 2017 下线完成后的状态: $ hdfs dfsadmin -report Name: 192.168.31.33:50010 (hadoop-33) Hostname: hadoop-33 Decommission Status : Decommissioned Configured Capacity: 1935079350272 (1.76 TB) DFS Used: 257292167968 (239.62 GB) Non DFS Used: 99063741175 (92.26 GB) DFS Remaining: 1578723441129 (1.44 TB) DFS Used%: 13.30% DFS Remaining%: 81.58% Configured Cache Capacity: 0 (0 B) Cache Used: 0 (0 B) Cache Remaining: 0 (0 B) Cache Used%: 100.00% Cache Remaining%: 0.00% Xceivers: 13 Last contact: Thu Apr 20 09:29:00 CST 2017 如果长时间处于“Decommission In Progress”状态,而不能转换成Decommissioned状态,这个时候可用“hdfs fsck”检查下。 成功下线后,还需要将该节点从slaves中删除,以及dfs.hosts.exclude中剔除,然后再做一下:bin/hdfs dfsadmin -refreshNodes。 11.5. 强制DataNode上报块信息 在扩容过程中,有可能遇到DataNode启动时未向NameNode上报block信息。正常时,NameNode都会通过心跳响应的方式告诉DataNode上报block,但当NameNode和DataNode版本不一致等时,可能会使这个机制失效。搜索DataNode的日志文件,将搜索不到上报信息日志“sent block report”。 这个时候,一旦重启NameNode,就会出现大量“missing block”。幸好HDFS提供了工具,可以直接强制DataNode上报block,方法为: hdfs dfsadmin -triggerBlockReport 192.168.31.26:50020 上述192.168.31.26为DataNode的IP地址,50020为DataNode的RPC端口。最终应当保持DataNode和NameNode版本一致,不然得每次做一下这个操作,而且可能还有其它问题存在。 12. 启动YARN 12.1. 启动YARN 如果不能自动主备切换,检查下是否有其它的ResourceManager正占用着ZooKeeper。 1) 进入$HADOOP_HOME/sbin目录 2) 在主备两台都执行:start-yarn.sh,即开始启动YARN 若启动成功,则在Master节点执行jps,可以看到ResourceManager: > jps 24689 NameNode 30156 Jps 28861 ResourceManager 在Slaves节点执行jps,可以看到NodeManager: $ jps 14019 NodeManager 23257 DataNode 15115 Jps 如果只需要单独启动指定节点上的ResourceManager,这样: ./yarn-daemon.sh start resourcemanager 对于NodeManager,则是这样: ./yarn-daemon.sh start nodemanager 12.2. 执行YARN命令 12.2.1. 
yarn node -list 列举YARN集群中的所有NodeManager,如(注意参数间的空格,直接执行yarn可以看到使用帮助): > yarn node -list Total Nodes:3 Node-Id Node-State Node-Http-Address Number-of-Running-Containers localhost:45980 RUNNING localhost:8042 0 localhost:47551 RUNNING localhost:8042 0 localhost:58394 RUNNING localhost:8042 0 12.2.2. yarn node -status 查看指定NodeManager的状态,如: > yarn node -status localhost:47551 Node Report : Node-Id : localhost:47551 Rack : /default-rack Node-State : RUNNING Node-Http-Address : localhost:8042 Last-Health-Update : 星期五 18/四月/14 01:45:41:555GMT Health-Report : Containers : 0 Memory-Used : 0MB Memory-Capacity : 8192MB CPU-Used : 0 vcores CPU-Capacity : 8 vcores 12.2.3. yarn rmadmin -getServiceState rm1 查看rm1的主备状态,即查看它是主(active)还是备(standby)。 12.2.4. yarn rmadmin -transitionToStandby rm1 将rm1从主切为备。 更多的yarn命令可以参考: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html。 13. 运行MapReduce程序 在安装目录的share/hadoop/mapreduce子目录下,有现存的示例程序: hadoop@VM-40-171-sles10-64:~/hadoop> ls share/hadoop/mapreduce hadoop-mapreduce-client-app-2.8.0.jar hadoop-mapreduce-client-jobclient-2.8.0-tests.jar hadoop-mapreduce-client-common-2.8.0.jar hadoop-mapreduce-client-shuffle-2.8.0.jar hadoop-mapreduce-client-core-2.8.0.jar hadoop-mapreduce-examples-2.8.0.jar hadoop-mapreduce-client-hs-2.8.0.jar lib hadoop-mapreduce-client-hs-plugins-2.8.0.jar lib-examples hadoop-mapreduce-client-jobclient-2.8.0.jar sources 跑一个示例程序试试: hdfs dfs -put /etc/hosts hdfs:///test/in/ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount hdfs:///test/in/ hdfs:///test/out/ 运行过程中,使用java的jps命令,可以看到yarn启动了名为YarnChild的进程。 wordcount运行完成后,结果会保存在out目录下,保存结果的文件名类似于“part-r-00000”。另外,跑这个示例程序有两个需求注意的点: 1) in目录下要有文本文件,或in即为被统计的文本文件,可以为HDFS上的文件或目录,也可以为本地文件或目录 2) out目录不能存在,程序会自动去创建它,如果已经存在则会报错。 包hadoop-mapreduce-examples-2.8.0.jar中含有多个示例程序,不带参数运行,即可看到用法: > hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount Usage: wordcount > hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar An example program must be given as the first argument. Valid program names are: aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files. aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files. bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi. dbcount: An example job that count the pageview counts from a database. distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi. grep: A map/reduce program that counts the matches of a regex in the input. join: A job that effects a join over sorted, equally partitioned datasets multifilewc: A job that counts words from several files. pentomino: A map/reduce tile laying program to find solutions to pentomino problems. pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method. randomtextwriter: A map/reduce program that writes 10GB of random textual data per node. randomwriter: A map/reduce program that writes 10GB of random data per node. secondarysort: An example defining a secondary sort to the reduce. sort: A map/reduce program that sorts the data written by the random writer. sudoku: A sudoku solver. teragen: Generate data for the terasort terasort: Run the terasort teravalidate: Checking results of terasort wordcount: A map/reduce program that counts the words in the input files. 
wordmean: A map/reduce program that counts the average length of the words in the input files. wordmedian: A map/reduce program that counts the median length of the words in the input files. wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files. 修改日志级别为DEBBUG,并打屏: export HADOOP_ROOT_LOGGER=DEBUG,console 14. HDFS权限配置 14.1. hdfs-site.xml dfs.permissions.enabled = true dfs.permissions.superusergroup = supergroup dfs.cluster.administrators = ACL-for-admins dfs.namenode.acls.enabled = true dfs.web.ugi = webuser,webgroup 14.2. core-site.xml fs.permissions.umask-mode = 022 hadoop.security.authentication = simple 安全验证规则,可为simple或kerberos 15. C++客户端编程 15.1. 示例代码 // g++ -g -o x x.cpp -L$JAVA_HOME/lib/amd64/jli -ljli -L$JAVA_HOME/jre/lib/amd64/server -ljvm -I$HADOOP_HOME/include $HADOOP_HOME/lib/native/libhdfs.a -lpthread -ldl #include "hdfs.h" #include #include #include int main(int argc, char **argv) { #if 0 hdfsFS fs = hdfsConnect("default", 0); // HA方式 const char* writePath = "hdfs://mycluster/tmp/testfile.txt"; hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY |O_CREAT, 0, 0, 0); if(!writeFile) { fprintf(stderr, "Failed to open %s for writing!\n", writePath); exit(-1); } const char* buffer = "Hello, World!\n"; tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1); if (hdfsFlush(fs, writeFile)) { fprintf(stderr, "Failed to 'flush' %s\n", writePath); exit(-1); } hdfsCloseFile(fs, writeFile); #else struct hdfsBuilder* bld = hdfsNewBuilder(); hdfsBuilderSetNameNode(bld, "default"); // HA方式 hdfsFS fs = hdfsBuilderConnect(bld); if (NULL == fs) { fprintf(stderr, "Failed to connect hdfs\n"); exit(-1); } int num_entries = 0; hdfsFileInfo* entries; if (argc entries = hdfsListDirectory(fs, "/", &num_entries); else entries = hdfsListDirectory(fs, argv[1], &num_entries); fprintf(stdout, "num_entries: %d\n", num_entries); for (int i=0; i ; ++i) { fprintf(stdout, "%s\n", entries[i].mName); } hdfsFreeFileInfo(entries, num_entries); hdfsDisconnect(fs); //hdfsFreeBuilder(bld); #endif return 0; } 15.2. 运行示例 运行之前需要设置好CLASSPATH,如果设置不当,可能会遇到不少困难,比如期望操作HDFS上的文件和目录,却变成了本地的文件和目录,如者诸于“java.net.UnknownHostException”类的错误等。 为避免出现错误,强烈建议使用命令“hadoop classpath --glob”取得正确的CLASSPATH值。 另外还需要设置好libjli.so和libjvm.so两个库的LD_LIBRARY_PATH,如: export LD_LIBRARY_PATH=$JAVA_HOME/lib/amd64/jli:$JAVA_HOME/jre/lib/amd64/server:$LD_LIBRARY_PATH 16. fsImage Hadoop提供了fsImage和Edit查看工具,分别为oiv和oev,使用示例: hdfs oiv -i fsimage_0000000000000001953 -p XML -o x.xml hdfs oev -i edits_0000000000000001054-0000000000000001055 -o x x.xml 借助工具,可以编辑修改fsImage或Edit文件,做数据修复。 主备NameNode通过QJM同步数据。QJM的数据目录由参数dfs.namenode.name.dir决定,NameNode的数据目录由dfs.journalnode.edits.dir决定。 QJM通过一致复制协议生成日志文件(即edit文件),日志文件名示例: edits_0000000000024641811-0000000000024641854 所以节点的日志文件是完全相同的,即拥有相同的MD5值,主备NameNode从QJM取日志文件,并存在自己的数据目录,因此所有QJM节点和主备NameNode上的日志文件是完全相同的。 备NameNode会定期将日志文件合并成fsImage文件,并将fsImage同步给主NameNode,因此正常情况下主备NameNode间的fsImage文件也是完全相同的。如果出现不同,有可能主备NameNode间数据出现了不一致,或者是因为备NameNode刚好生成新的fsImage但还未同步给主NameNode。 默认备NameNode一小时合并一次edit文件生成新的fsImage文件,并只保留最近两个fsImage: 下面显示开始合并edit文件,生成新的fsImage文件(为3600秒,即1小时,实际由hdfs-site.xml中的dfs.namenode.checkpoint.period决定): 2017-04-21 15:35:44,994 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 3600 seconds since the last checkpoint, which exceeds the configured interval 3600 2017-04-21 15:35:44,994 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Save namespace ... 
2017-04-21 15:35:45,022 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 24641535 下面显示删除上上一个fsImage文件fsimage_0000000000024638036 2017-04-21 15:35:45,022 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/data5/namenode/current/fsimage_0000000000024638036, cpktTxId=0000000000024638036) 下面显示向主NameNode上传新的fsImage文件花了0.142秒 2017-04-21 15:35:45,239 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with txid 24647473 to namenode at http://hadoop-030:50070 in 0.142 seconds 2017-04-21 15:36:38,528 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode hadoop-030/10.143.136.207:8020 2017-04-21 15:36:38,587 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@5854eec9 expecting start txid #24647474 备NameNode会上传最新的fsImage给主NameNode: 2017-04-21 15:35:45,119 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 0.01s at 56923.08 KB/s 下面显示已下载了最新的fsImage文件,文件名将是fsimage_0000000000024647473 2017-04-21 15:35:45,119 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000024647473 size 758002 bytes. 下面显示保留2个fsImage文件 2017-04-21 15:35:45,126 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 24641535 下面显示删除上上一个fsImage文件fsimage_0000000000024638036 2017-04-21 15:35:45,126 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/data5/namenode/current/fsimage_0000000000024638036, cpktTxId=0000000000024638036) 2017-04-21 15:35:45,236 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Purging remote journals older than txid 23641536 2017-04-21 15:35:45,236 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 23641536 下面显示删除较老的edit文件 2017-04-21 15:35:45,244 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old edit log EditLogFile(file=/data5/namenode/current/edits_0000000000023641041-0000000000023641120,first=0000000000023641041,last=0000000000023641120,inProgress=false,hasCorruptHeader=false) 17. 常见错误 1) 执行“hdfs dfs -ls”时报ConnectException 原因可能是指定的端口号9000不对,该端口号由hdfs-site.xml中的属性“dfs.namenode.rpc-address”指定,即为NameNode的RPC服务端口号。 文件上传后,被存储在DataNode的data(由DataNode的hdfs-site.xml中的属性“dfs.datanode.data.dir”指定)目录下,如: $HADOOP_HOME/data/current/BP-139798373-172.25.40.171-1397735615751/current/finalized/blk_1073741825 文件名中的“blk”是block,即块的意思,默认情况下blk_1073741825即为文件的一个完整块,Hadoop未对它进额外处理。 hdfs dfs -ls hdfs://172.25.40.171:9000 14/04/17 12:04:02 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/04/17 12:04:02 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/04/17 12:04:02 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/04/17 12:04:02 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/04/17 12:04:02 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
14/04/17 12:04:02 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.8.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'. 14/04/17 12:04:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/04/17 12:04:03 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/04/17 12:04:03 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. ls: Call From VM-40-171-sles10-64/172.25.40.171 to VM-40-171-sles10-64:9000 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused 2) Initialization failed for Block pool 可能是因为对NameNode做format之前,没有清空DataNode的data目录。 3) Incompatible clusterIDs “Incompatible clusterIDs”的错误原因是在执行“hdfs namenode -format”之前,没有清空DataNode节点的data目录。 网上一些文章和帖子说是tmp目录,它本身也是没问题的,但Hadoop 2.8.0是data目录,实际上这个信息已经由日志的“/data/hadoop/hadoop-2.8.0/data”指出,所以不能死死的参照网上的解决办法,遇到问题时多仔细观察。 从上述描述不难看出,解决办法就是清空所有DataNode的data目录,但注意不要将data目录本身给删除了。 data目录由core-site.xml文件中的属性“dfs.datanode.data.dir”指定。 2014-04-17 19:30:33,075 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data/hadoop/hadoop-2.8.0/data/in_use.lock acquired by nodename 28326@localhost 2014-04-17 19:30:33,078 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool (Datanode Uuid unassigned) service to /172.25.40.171:9001 java.io.IOException: Incompatible clusterIDs in /data/hadoop/hadoop-2.8.0/data: namenode clusterID = CID-50401d89-a33e-47bf-9d14-914d8f1c4862; datanode clusterID = CID-153d6fcb-d037-4156-b63a-10d6be224091 at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:472) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:225) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:249) at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:929) at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:900) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:274) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:220) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:815) at java.lang.Thread.run(Thread.java:744) 2014-04-17 19:30:33,081 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool (Datanode Uuid unassigned) service to /172.25.40.171:9001 2014-04-17 19:30:33,184 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool ID needed, but service not yet registered with NN java.lang.Exception: trace at org.apache.hadoop.hdfs.server.datanode.BPOfferService.getBlockPoolId(BPOfferService.java:143) at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.remove(BlockPoolManager.java:91) at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownBlockPool(DataNode.java:859) at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.shutdownActor(BPOfferService.java:350) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.cleanUp(BPServiceActor.java:619) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:837) at java.lang.Thread.run(Thread.java:744) 2014-04-17 19:30:33,184 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool (Datanode Uuid unassigned) 2014-04-17 19:30:33,184 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool ID needed, but service not yet registered with NN java.lang.Exception: trace at org.apache.hadoop.hdfs.server.datanode.BPOfferService.getBlockPoolId(BPOfferService.java:143) at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownBlockPool(DataNode.java:861) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.shutdownActor(BPOfferService.java:350) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.cleanUp(BPServiceActor.java:619) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:837) at java.lang.Thread.run(Thread.java:744) 2014-04-17 19:30:35,185 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode 2014-04-17 19:30:35,187 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0 2014-04-17 19:30:35,189 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down DataNode at localhost/127.0.0.1 ************************************************************/ 4) Inconsistent checkpoint fields SecondaryNameNode中的“Inconsistent checkpoint fields”错误原因,可能是因为没有设置好SecondaryNameNode上core-site.xml文件中的“hadoop.tmp.dir”。 2014-04-17 11:42:18,189 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Log Size Trigger :1000000 txns 2014-04-17 11:43:18,365 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint java.io.IOException: Inconsistent checkpoint fields. LV = -56 namespaceID = 1384221685 cTime = 0 ; clusterId = CID-319b9698-c88d-4fe2-8cb2-c4f440f690d4 ; blockpoolId = BP-1627258458-172.25.40.171-1397735061985. Expecting respectively: -56; 476845826; 0; CID-50401d89-a33e-47bf-9d14-914d8f1c4862; BP-2131387753-172.25.40.171-1397730036484. at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:135) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:518) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:383) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:349) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:345) at java.lang.Thread.run(Thread.java:744) 另外,也请配置好SecondaryNameNode上hdfs-site.xml中的“dfs.datanode.data.dir”为合适的值: hadoop.tmp.dir /data/hadoop/current/tmp A base for other temporary directories. 5) fs.defaultFS is file:/// 在core-site.xml中,当只填写了fs.defaultFS,而fs.default.name为默认的file:///时,会报此错误。解决方法是设置成相同的值。 6) a shared edits dir must not be specified if HA is not enabled 该错误可能是因为hdfs-site.xml中没有配置dfs.nameservices或dfs.ha.namenodes.mycluster。 7) /tmp/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. 
只需按日志中提示的,创建好相应的目录。 8) The auxService:mapreduce_shuffle does not exist 问题原因是没有配置yarn-site.xml中的“yarn.nodemanager.aux-services”,将它的值配置为mapreduce_shuffle,然后重启yarn问题即解决。记住所有yarn节点都需要修改,包括ResourceManager和NodeManager,如果NodeManager上的没有修改,仍然会报这个错误。 9) org.apache.hadoop.ipc.Client: Retrying connect to server 该问题,有可能是因为NodeManager中的yarn-site.xml和ResourceManager上的不一致,比如NodeManager没有配置yarn.resourcemanager.ha.rm-ids。 10) mapreduce.Job: Running job: job_1445931397013_0001 Hadoop提交mapreduce任务时,卡在mapreduce.Job: Running job: job_1445931397013_0001处。 问题原因可能是因为yarn的NodeManager没起来,可以用jdk的jps确认下。 该问题也有可能是因为NodeManager中的yarn-site.xml和ResourceManager上的不一致,比如NodeManager没有配置yarn.resourcemanager.ha.rm-ids。 11) Could not format one or more JournalNodes 执行“./hdfs namenode -format”时报“Could not format one or more JournalNodes”。 可能是hdfs-site.xml中的dfs.namenode.shared.edits.dir配置错误,比如重复了,如: qjournal://hadoop-168-254:8485;hadoop-168-254:8485;hadoop-168-253:8485;hadoop-168-252:8485;hadoop-168-251:8485/mycluster 修复后,重启JournalNode,问题可能就解决了。 12) org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in standby state 遇到这个错误,可能是yarn-site.xml中的yarn.resourcemanager.webapp.address配置错误,比如配置成了两个yarn.resourcemanager.webapp.address.rm1,实际应当是yarn.resourcemanager.webapp.address.rm1和yarn.resourcemanager.webapp.address.rm2。 13) No valid image files found 如果是备NameNode,执行下“hdfs namenode -bootstrapStandby”再启动。 2015-12-01 15:24:39,535 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode. java.io.FileNotFoundException: No valid image files found at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.getLatestImages(FSImageTransactionalStorageInspector.java:165) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:623) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:644) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:811) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:795) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554) 2015-12-01 15:24:39,536 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 2015-12-01 15:24:39,539 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 14) xceivercount 4097 exceeds the limit of concurrent xcievers 4096 此错误的原因是hdfs-site.xml中的配置项“dfs.datanode.max.xcievers”值4096过小,需要改大一点。该错误会导致hbase报“notservingregionexception”。 16/04/06 14:30:34 ERROR namenode.NameNode: Failed to start namenode. 15) java.lang.IllegalArgumentException: Unable to construct journal, qjournal://hadoop-030:8485;hadoop-031:8454;hadoop-032 执行“hdfs namenode -format”遇到上述错误时,是因为hdfs-site.xml中的配置dfs.namenode.shared.edits.dir配置错误,其中的hadoop-032省了“:8454”部分。 16) Bad URI 'qjournal://hadoop-030:8485;hadoop-031:8454;hadoop-032:8454': must identify journal in path component 是因为配置hdfs-site.xml中的“dfs.namenode.shared.edits.dir”时,路径少带了cluster名。 17) 16/04/06 14:48:19 INFO ipc.Client: Retrying connect to server: hadoop-032/10.143.136.211:8454. 
Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 检查hdfs-site.xml中的“dfs.namenode.shared.edits.dir”值,JournalNode默认端口是8485,不是8454,确认是否有写错。JournalNode端口由hdfs-site.xml中的配置项dfs.journalnode.rpc-address决定。 18) Exception in thread "main" org.apache.hadoop.HadoopIllegalArgumentException: Could not get the namenode ID of this node. You may run zkfc on the node other than namenode. 执行“hdfs zkfc -formatZK”遇到上面这个错误,是因为还没有执行“hdfs namenode -format”。NameNode ID是在“hdfs namenode -format”时生成的。 19) 2016-04-06 17:08:07,690 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory [DISK]file:/data3/datanode/data/ has already been used. 以非root用户启动DataNode,但启动不了,在它的日志文件中发现如下错误信息: 2016-04-06 17:08:07,707 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-418073539-10.143.136.207-1459927327462 2016-04-06 17:08:07,707 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-418073539-10.143.136.207-1459927327462 java.io.IOException: BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /data3/datanode/data/current/BP-418073539-10.143.136.207-1459927327462 继续寻找,会发现还存在如何错误提示: Invalid dfs.datanode.data.dir /data3/datanode/data: EPERM: Operation not permitted 使用命令“ls -l”检查目录/data3/datanode/data的权限设置,发现owner为root,原因是因为之前使用root启动过DataNode,将owner改过来即可解决此问题。 20) 2016-04-06 18:00:26,939 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop-031/10.143.136.208:8020 DataNode的日志文件不停地记录如下日志,是因为DataNode将作为主NameNode,但实际上10.143.136.208并没有启动,主NameNode不是它。这个并不表示DataNode没有起来,而是因为DataNode会同时和主NameNode和备NameNode建立心跳,当备NameNode没有起来时,有这些日志是正常现象。 2016-04-06 18:00:32,940 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-031/10.143.136.208:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2016-04-06 17:55:44,555 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Namenode Block pool BP-418073539-10.143.136.207-1459927327462 (Datanode Uuid 2d115d45-fd48-4e86-97b1-e74a1f87e1ca) service to hadoop-030/10.143.136.207:8020 trying to claim ACTIVE state with txid=1 “trying to claim ACTIVE state”出自于hadoop/hdfs/server/datanode/BPOfferService.java中的updateActorStatesFromHeartbeat()。 2016-04-06 17:55:49,893 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-031/10.143.136.208:8020. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) “Retrying connect to server”出自于hadoop/ipc/Client.java中的handleConnectionTimeout()和handleConnectionFailure()。 21) ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! 如果遇到这个错误,请检查NodeManager日志,如果发现有如下所示信息: WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=26665,containerID=container_1461657380500_0020_02_000001] is running beyond virtual memory limits. Current usage: 345.0 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container. 
则表示需要增大yarn-site.xmk的配置项yarn.nodemanager.vmem-pmem-ratio的值,该配置项默认值为2.1。 16/10/13 10:23:19 ERROR client.TransportClient: Failed to send RPC 7614640087981520382 to /10.143.136.231:34800: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException 16/10/13 10:23:19 ERROR cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(0,0,Map()) to AM was unsuccessful java.io.IOException: Failed to send RPC 7614640087981520382 to /10.143.136.231:34800: java.nio.channels.ClosedChannelException at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:249) at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:233) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) at io.netty.util.concurrent.DefaultPromise$LateListeners.run(DefaultPromise.java:845) at io.netty.util.concurrent.DefaultPromise$LateListenerNotifier.run(DefaultPromise.java:873) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 22) java.net.SocketException: Unresolved address 可能是在非NameNode上启动NameNode: java.net.SocketException: Call From cluster to null:0 failed on socket exception: java.net.SocketException: Unresolved address 23) should be specified as a URI in configuration files 请在dfs.namenode.name.dir、dfs.journalnode.edits.dir和dfs.datanode.data.dir配置的路径前加上前缀“file://”: common.Util: Path /home/namenode/data should be specified as a URI in configuration files. Please update hdfs configuration. 如: dfs.namenode.name.dir file:///home/namenode/data 24) Failed to place enough replicas 如果将DataNode的dfs.datanode.data.dir全配置成SSD类型,则执行“hdfs dfs -put /etc/hosts hdfs:///tmp/”时会报如下错误: 2017-05-04 16:08:22,545 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology 2017-05-04 16:08:22,545 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 3 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}) 2017-05-04 16:08:22,545 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]} 2017-05-04 16:08:22,545 INFO org.apache.hadoop.ipc.Server: IPC Server handler 37 on 8020, call Call#5 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.208.5.220:40701 java.io.IOException: File /tmp/in/hosts._COPYING_ could only be 
replicated to 0 nodes instead of minReplication (=1). There are 5 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:506) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)
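A possible workaround for error 24, sketched under the assumption that the cluster runs Hadoop 2.6 or later so the "hdfs storagepolicies" subcommand is available: either remove the [SSD] tag from dfs.datanode.data.dir, or put the target directory under a policy that accepts SSD storage instead of the default HOT policy (which only accepts DISK):
# List the storage policies supported by this HDFS version
hdfs storagepolicies -listPolicies
# Check which policy is currently in effect for the target directory
hdfs storagepolicies -getStoragePolicy -path /tmp
# Switch the directory to a policy that accepts SSD storage, e.g. ALL_SSD, then retry the upload
hdfs storagepolicies -setStoragePolicy -path /tmp -policy ALL_SSD
hdfs dfs -put /etc/hosts hdfs:///tmp/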
目录 目录 1 1. 前言 1 2. 名词 1 3. 功能 1 4. 唯一性原理 2 5. 系统结构 2 5.1. mooon-uniq-agent 2 5.2. mooon-uniq-master 2 6. 限制 3 7. 核心思想 4 8. 编译&安装 4 9. 启动&运行 5 10. 编程使用 5 1. 前言 源码位置:https://github.com/eyjian/mooon/tree/master/application/uniq_id。 mooon-uniq-id可单机运行,也可以由1~255台机器组成集群分布式运行。性能极高,单机批量取(每批限定最多100个),每秒可提供2000万个唯一ID服务,单个取每秒可以提供20万唯一ID服务。 mooon-uniq-id还具极高的可用性,只要有一台机器正常即可正常提供服务。 2. 名词 Lable 机器标签 3. 功能 mooon-uniq-id提供64位无符号整数唯一ID和类似于订单号、流水号的字符串唯一ID。 4. 唯一性原理 mooon-uniq-id生成的唯一ID通过以下公式保证: 唯一ID = 机器唯一标签 + 本机递增序列号 + 系统时间 机器唯一标签自动生成,取值从1~255,故最多支持255台机器组成集群。时间粒度为小时,单台机器一小时内的递增序列号值为536870911个,只要一小时内提供的唯一ID数不超过536870911个即是安全的。如果需求超过536870911,则可以扩展到分钟的粒度,mooon-uniq-id保留了6位供用户自由支配,这保留的6位可以用于存储分钟值。 5. 系统结构 mooon-uniq-id的实现为单进程无线程,UDP实现简单高效。采用弱主架构,这种架构下,master在一段时间内不可用,完全不影响正常服务。租期为多长,即可允许多长时间内master不可用,比如租期为7天,则7天内master不可用都不影响正常服务,agent为master的依赖非常低,因此叫弱主架构。 mooon-uniq-id由mooon-uniq-agent和mooon-uniq-master组成弱主架构,为保证可用性mooon-uniq-agent至少1台以上,而且只要有一台可以用即整体可用。 而mooon-uniq-master一般可部署2台形成主备,实际上部署一台也完全无问题,因为mooon-uniq-master只要在租期内恢复正常即可。 5.1. mooon-uniq-agent 对外提供获取唯一ID服务的是mooon-uniq-agent,至少应当部署2台,以提供必要的可用性,部署的越多可用性越高,同时每秒提供的唯一ID个数也越多,支撑的并发数也越大。 mooon-uniq-agent在能够对外服务之前,都必须先从mooon-uniq-master租约到机器标签,否则不能服务。 5.2. mooon-uniq-master mooon-uniq-master负责租约的管理,包括租约的初始化、租约的分配和租约的过期管理和回收。 默认情况下,mooon-uniq-agent每隔1分钟以心跳的形式向mooon-uniq-master续租机器标签,可以通过命令行参数“--interval”修改续租间隔,参数值的单位为秒,默认值为600即1分钟。 mooon-uniq-master将机器标签存储在MySQL表中,各机器标签的状态也存储在MySQL表中。mooon-uniq-master对MySQL有依赖,但也不是强依赖,因为只要在租期内可以成功续租或新租成功即可,而一旦集群上线后,99.999%的时间都只是续租。 在启动mooon-uniq-master之前,需要先创建好如下3个MySQL库和表: -- CREATE DATABASE IF NOT EXISTS uniq_id DEFAULT CHARSET utf8 COLLATE utf8_general_ci; -- Label资源池表 -- mooon-uniq-master第一次启动时会初始化表t_label_pool DROP TABLE IF EXISTS t_label_pool; CREATE TABLE t_label_pool ( f_label TINYINT UNSIGNED NOT NULL, PRIMARY KEY (f_label) ); -- Label在线表 -- f_label为主键,保证同一时间不可能有两个IP租赁同一个Label, -- 但一个IP可能同时持有一个或多个过期的Label,和一个当前有效的Label DROP TABLE IF EXISTS t_label_online; CREATE TABLE t_label_online ( f_label TINYINT UNSIGNED NOT NULL, f_ip CHAR(16) NOT NULL, f_time DATETIME NOT NULL, PRIMARY KEY (f_label), KEY idx_ip(f_ip), KEY idx_time(f_time) ); -- Label日志表 -- 记录租约和续租情况 DROP TABLE IF EXISTS t_label_log; CREATE TABLE t_label_log ( f_id BIGINT NOT NULL AUTO_INCREMENT, f_label TINYINT UNSIGNED NOT NULL, f_ip CHAR(16) NOT NULL, f_event TINYINT NOT NULL, f_time DATETIME NOT NULL, PRIMARY KEY (f_id), KEY idx_label(f_label), KEY idx_ip(f_ip), KEY idx_event(f_event), KEY idx_time(f_time) ); 6. 限制 ID具备唯一性,但不具备递增性。 7. 核心思想 要保证ID的唯一性,最关键是要保证同一个机器标签不能同时出现在多台机器上。mooon-uniq-id的实现引入租约,每台mooon-uniq-agent在服务之前需要先成功租约到机器标签,在租约期内独占标签,并且可以在租期内续租。 在租期后,每个机器标签还有一段冻结期,处于冻结期的机器标签不会被租约。 通过这样的一种机制保证了机器标签的安全性,只要每台mooon-uniq-agent保证序列号在该机器上唯一,那么两者相结合即保证了整体上的唯一性。 序列号总是有限的,为保证永久的唯一性,在组成唯一ID时,加上了时间共同组成唯一性。 8. 编译&安装 mooon-uniq-id依赖mooon基础库,使用git或TortoiseGit下载mooon库源码: https://github.com/eyjian/mooon/tree/master/mooon 同样,使用git或TortoiseGit下载mooon-uniq-id源码: https://github.com/eyjian/mooon/tree/master/application/uniq_id 两者均采用CMake编译,且编译方法相同,先编译mooon基础库,编译和安装步骤: 1) 进入源代码目录,执行cmake生成Makefile文件,安装目录指定为/usr/local/mooon: cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/usr/local/mooon . 2) 执行make编译源代码: make 3) 在make成功后,执行“make install”完成安装: /usr/local/mooon ├── bin ├── CMake.common ├── include │ └── mooon │ ├── net │ ├── sys │ └── utils └── lib └── libmooon.a 接着编译和安装mooon-uniq-id,方法相同,安装路径也请指定为/usr/local/mooon。 9. 
启动&运行 简单点可以放到crontab中自动拉起,比如借助process_monitor.sh(下载:https://github.com/eyjian/mooon/blob/master/mooon/shell/process_monitor.sh)的监控和自动重启功能: * * * * * /usr/local/bin/process_monitor.sh "/usr/local/uniq_id/bin/uniq_master" "/usr/local/uniq_id/bin/uniq_master --db_host=127.0.0.1 --db_port=3306 --db_user=zhangsan --db_pass='zS2018' --db_name=uniq_id" * * * * * /usr/local/bin/process_monitor.sh "/usr/local/uniq_id/bin/uniq_agent" "/usr/local/uniq_id/bin/uniq_agent --master_nodes=192.168.31.33.225.168.251:16200,192.168.31.35:16200" 10. 编程使用 可以参考uniq_test.cpp或uniq_stress.cpp使用mooon-uniq-id,如: bool polling = false; // 轮询还是随机选择mooon-uniq-agent const uint32_t timeout_milliseconds = 200; const uint8_t retry_times = 5; // 注意:需要捕获CSyscallException和CException两种异常 try { // agent_nodes为mooon-uniq-agent连接信息,多个间以逗号分隔,注意不要包含空格: // 如:192.168.1.31:6200,192.168.1.32:6200,192.168.1.33:6200 mooon::CUniqId uniq_id(agent_nodes, timeout_milliseconds, retry_times, polling); // 取交易流水号 // // %L 为取mooon-uniq-agent的机器标签,每个mooon-uniq-agent不会有相同的标签 // %Y 为YYYY格式的年,如2017 // %M 为MM格式的月,如:04或12等 // %D 为DD格式的日,如:01或31等 // %m 为mm格式的分钟,如:09或59等 // %H 为hh格式的小时,24小时制,如:03或23等 // %5S 表示取5位的序列号,不足时前补0 std::string transaction_id = uniq_id.get_transaction_id("%L%Y%M%D%m%5S"); fprintf(stdout, "%s\n", transaction_id.c_str()); // 取8字节无符号整数值 const uint64_t uniq_id = uniq_id.get_uniq_id(); union mooon::UniqID uid; uid.value = uniq_id; fprintf(stdout, "id: %" PRIu64" => %s\n", uniq_id, uid.id.str().c_str()); } catch (mooon::sys::CSyscallException& ex) { fprintf(stderr, "%s\n", ex.str().c_str()); } catch (mooon::utils::CException& ex) { fprintf(stderr, "%s\n", ex.str().c_str()); }
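To confirm that agents are actually holding label leases, the MySQL tables created in section 5.2 can be queried directly. A minimal sketch, reusing the example connection values from the section 9 crontab entries (host 127.0.0.1, port 3306, user zhangsan, database uniq_id); substitute your own:
# Show which agents currently hold a label lease ("-p" will prompt for the password)
mysql -h127.0.0.1 -P3306 -uzhangsan -p uniq_id \
    -e "SELECT f_label, f_ip, f_time FROM t_label_online ORDER BY f_time DESC;"
# Review the most recent lease and renewal events recorded by mooon-uniq-master
mysql -h127.0.0.1 -P3306 -uzhangsan -p uniq_id \
    -e "SELECT f_label, f_ip, f_event, f_time FROM t_label_log ORDER BY f_id DESC LIMIT 20;"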
# python setup.py install
Traceback (most recent call last):
  File "setup.py", line 11, in <module>
    import setuptools
  File "/home/zhangsan/setuptools-34.4.1/setuptools/__init__.py", line 12, in <module>
    import setuptools.version
  File "/home/zhangsan/setuptools-34.4.1/setuptools/version.py", line 1, in <module>
    import pkg_resources
  File "/home/zhangsan/setuptools-34.4.1/pkg_resources/__init__.py", line 72, in <module>
    import packaging.requirements
  File "/usr/local/lib/python2.7/site-packages/packaging/requirements.py", line 59, in <module>
    MARKER_EXPR = originalTextFor(MARKER_EXPR())("marker")
TypeError: __call__() takes exactly 2 arguments (1 given)
To fix this error, open the requirements.py reported in the traceback, go to line 59, and change
MARKER_EXPR = originalTextFor(MARKER_EXPR())("marker")
to:
MARKER_EXPR = originalTextFor(MARKER_EXPR)("marker")
If installing psycopg2 fails with:
Error: pg_config executable not found.
then the postgresql-devel package needs to be installed:
yum install postgresql-devel
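The manual edit described above can also be scripted. A minimal sketch, assuming the site-packages path shown in the traceback (adjust it for your installation) and that a backup is wanted before patching:
cp /usr/local/lib/python2.7/site-packages/packaging/requirements.py{,.bak}
sed -i 's/originalTextFor(MARKER_EXPR())("marker")/originalTextFor(MARKER_EXPR)("marker")/' \
    /usr/local/lib/python2.7/site-packages/packaging/requirements.py
# Verify that line 59 now reads as expected
sed -n '59p' /usr/local/lib/python2.7/site-packages/packaging/requirements.py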
官方说明: https://dev.mysql.com/doc/refman/5.7/en/mysql-real-escape-string.html 相关资料: https://dev.mysql.com/worklog/task/?id=8077 从MySQL 5.7.6版本开始,如果启用了NO_BACKSLASH_ESCAPES, 则mysql_real_escape_string()函数失败,错误码为CR_INSECURE_API_ERR, 这个时候应当使用mysql_real_escape_string_quote()替代mysql_real_escape_string()。 启用NO_BACKSLASH_ESCAPES,表示将反斜杠当作普通字符,而不是转义字符: SET sql_mode='NO_BACKSLASH_ESCAPES'; 查看当前的SQL模式: mysql> SELECT @@sql_mode; +--------------------------------------------+ | @@sql_mode | +--------------------------------------------+ | STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION | +--------------------------------------------+ 1 row in set (0.01 sec) 未启用NO_BACKSLASH_ESCAPES前(反斜杠为转义字符): mysql> SELECT '\\'\G; *************************** 1. row *************************** \: \ 1 row in set (0.00 sec) 在启用NO_BACKSLASH_ESCAPES后(反斜杠为普通字符): mysql> SET sql_mode='NO_BACKSLASH_ESCAPES'; Query OK, 0 rows affected (0.00 sec) mysql> SELECT '\\'\G; *************************** 1. row *************************** \\: \\ 1 row in set (0.00 sec) mysql> SELECT '\"'\G; *************************** 1. row *************************** \": \" 1 row in set (0.00 sec) 相关源代码: #define CR_INSECURE_API_ERR 2062 测试代码(https://github.com/eyjian/mooon/blob/master/mooon/tools/mysql_escape_test.cpp): try { // 未启用NO_BACKSLASH_ESCAPES printf("%s\n", mysql.escape_string(argv[2]).c_str()); // 启用NO_BACKSLASH_ESCAPES mysql.update("%s", "SET sql_mode='NO_BACKSLASH_ESCAPES'"); printf("%s\n", mysql.escape_string(argv[2]).c_str()); } catch (mooon::sys::CDBException& ex) { fprintf(stderr, "%s\n", ex.str().c_str()); } 运行结果: # ./mysql_escape_test 'root@127.0.0.1:3306' '0x1a' MYSQL_SERVER_VERSION: 5.7.12 0x1a db_exception://[2062]Insecure API function call: 'mysql_real_escape_string' Use instead: 'mysql_real_escape_string_quote'@/data/X/mooon/src/sys/mysql_db.cpp:166 mysql_real_escape_string_quote函数(MySQL 5.7.6引入): unsigned long mysql_real_escape_string_quote(MYSQL *mysql, char *to, const char *from, unsigned long from_length, char quote); 返回值:被转后的长度 参数to:被转后的 参数from:被转前的 参数length:from的有效字符数,不包括结尾符 参数quote:需要处理的字符,如:\, ', ", NUL (ASCII 0), \n, \r
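Whether mysql_real_escape_string() fails therefore depends on whether NO_BACKSLASH_ESCAPES is active on the connection. Not shown above, but useful in practice: the mode can be enabled globally (my.cnf) or only per session, and the two can be checked separately from the mysql client. A sketch; the mode string being restored is the one shown earlier in this note:
# Check the global and session SQL modes separately
mysql -e "SELECT @@GLOBAL.sql_mode\G"
mysql -e "SELECT @@SESSION.sql_mode\G"
# Restore the earlier mode for the current session, making backslash an escape character again
mysql -e "SET SESSION sql_mode='STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION'; SELECT @@SESSION.sql_mode\G"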
Call stack at the time of the coredump:
#0 0x081eff2c in addbyter ()
#1 0x081f05b8 in dprintf_formatf ()
#2 0x081f15cf in curl_mvsnprintf ()
#3 0x081f0079 in curl_msnprintf ()
#4 0x081ef55c in Curl_failf ()
#5 0x081fa1a3 in Curl_resolv_timeout ()
#6 0xeb8fbdd4 in ?? ()
#7 0x00000000 in ?? ()
The coredump happens because curl implements its DNS resolution timeout with SIGALRM. In a multi-threaded program, the SIGALRM-based handling results in several threads modifying the same global state, which produces the coredump. The precondition for the problem is that CURLOPT_TIMEOUT or CURLOPT_CONNECTTIMEOUT has been set to a non-zero value.
Workarounds:
1) Set CURLOPT_NOSIGNAL to 1
2) Use c-ares (pass --enable-ares to configure)
lib/curl_setup.h (in asynchronous mode, c-ares controls the DNS resolution timeout):
USE_ARES is only defined when --enable-ares was passed to configure.
#ifdef USE_ARES
# define CURLRES_ASYNCH
# define CURLRES_ARES
/* now undef the stock libc functions just to avoid them being used */
# undef HAVE_GETADDRINFO
# undef HAVE_GETHOSTBYNAME
#elif defined(USE_THREADS_POSIX) || defined(USE_THREADS_WIN32)
# define CURLRES_ASYNCH
# define CURLRES_THREADED
#else
# define CURLRES_SYNCH
#endif
lib/hostip.c (in synchronous mode, alarm() controls the DNS resolution timeout):
USE_ALARM_TIMEOUT can only be defined when CURLRES_SYNCH is defined.
#if defined(CURLRES_SYNCH) && \
    defined(HAVE_ALARM) && defined(SIGALRM) && defined(HAVE_SIGSETJMP)
/* alarm-based timeouts can only be used with all the dependencies satisfied */
#define USE_ALARM_TIMEOUT
#endif
Related source code:
lib/asyn-ares.c: Curl_resolver_getaddrinfo
lib/hostasyn.c (c-ares based asynchronous version of Curl_getaddrinfo): Curl_resolver_getaddrinfo
Cache lookup (hostip.c): fetch_addr
hostip.c: Curl_ipv4_resolve_r
hostip.c: curl_jmpenv
url.c: Curl_resolv_timeout(hostname)
multi.c: Curl_connect
transfer.c: Curl_connect
url.c: Curl_reconnect_request
multi.c: Curl_do
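Before patching any code, it can help to check whether the libcurl on a machine already uses a signal-free resolver: builds with c-ares or the threaded resolver list "AsynchDNS" among their features and do not rely on SIGALRM for DNS timeouts. A sketch using only the stock curl command:
curl -V | grep -o AsynchDNS    # prints "AsynchDNS" if the resolver is signal-free
# If the feature is missing, rebuild libcurl with c-ares as described above
./configure --enable-ares
make && make install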
点击(此处)折叠或打开 // 测试mktime和localtime_r性能及优化方法 // // 编译方法:g++ -g -o x x.cpp或g++ -O2 -o x x.cpp,两种编译方式性能基本相同。 // // 结论: // 1) 环境变量TZ和isdst均不影响localtime_r的性能(第一次调用了除外) // 2) 环境变量TZ严重影响localtime的性能 // 3) 环境变量TZ和isdst均会严重影响mktime的性能 // *4) 注意mktime的参数即是输入参数也是输出参数,它会修改isdst值 // *5) 另外需要注意localtime_r为非信号安全函数, // 不能在信号处理过程中调用,否则可能发生死锁等问题 // // 64位机器性能数据(与32位CPU不同): /* $ ./x 1000000 test: localtime ... TZ is NULL: 2457ms TZ is empty: 172ms TZ is Asia/Shanghai: 173ms test: localtime_r ... TZ is NULL and isdst=1: 125ms TZ is NULL and isdst=0: 125ms TZ is NULL and isdst=-1: 125ms TZ is NULL and isdst undefined: 125ms TZ is empty and isdst=1: 125ms TZ is empty and isdst=0: 125ms TZ is empty and isdst=-1: 125ms TZ is empty and isdst undefined: 127ms TZ is Asia/Shanghai and isdst=1: 126ms TZ is Asia/Shanghai and isdst=0: 125ms TZ is Asia/Shanghai and isdst=-1: 125ms TZ is Asia/Shanghai and isdst undefined: 125ms test: mktime ... TZ is NULL and isdst=1: 635841ms TZ is NULL and isdst=0: 2583ms TZ is NULL and isdst=-1: 2596ms TZ is NULL and isdst undefined: 2579ms TZ is empty and isdst=1: 122377ms TZ is empty and isdst=0: 229ms TZ is empty and isdst=-1: 230ms TZ is empty and isdst undefined: 229ms TZ is Asia/Shanghai and isdst=1: 122536ms TZ is Asia/Shanghai and isdst=0: 228ms TZ is Asia/Shanghai and isdst=-1: 230ms TZ is Asia/Shanghai and isdst undefined: 228ms */ // 32位机器性能数据(与64位CPU不同): /* $ ./x 1000000 test: localtime ... TZ is NULL: 1445ms TZ is empty: 252ms TZ is Asia/Shanghai: 252ms test: localtime_r ... TZ is NULL and isdst=1: 161ms TZ is NULL and isdst=0: 160ms TZ is NULL and isdst=-1: 161ms TZ is NULL and isdst undefined: 161ms TZ is empty and isdst=1: 160ms TZ is empty and isdst=0: 161ms TZ is empty and isdst=-1: 161ms TZ is empty and isdst undefined: 161ms TZ is Asia/Shanghai and isdst=1: 161ms TZ is Asia/Shanghai and isdst=0: 161ms TZ is Asia/Shanghai and isdst=-1: 161ms TZ is Asia/Shanghai and isdst undefined: 161ms test: mktime ... TZ is NULL and isdst=1: 199375ms TZ is NULL and isdst=0: 1488ms TZ is NULL and isdst=-1: 1483ms TZ is NULL and isdst undefined: 1497ms TZ is empty and isdst=1: 161057ms TZ is empty and isdst=0: 325ms TZ is empty and isdst=-1: 328ms TZ is empty and isdst undefined: 326ms TZ is Asia/Shanghai and isdst=1: 161558ms TZ is Asia/Shanghai and isdst=0: 321ms TZ is Asia/Shanghai and isdst=-1: 335ms TZ is Asia/Shanghai and isdst undefined: 328ms */ // localtime_r相关源代码: /* // The C Standard says that localtime and gmtime return the same pointer. struct tm _tmbuf; // 全局变量 struct tm * __localtime_r (t, tp) const time_t *t; struct tm *tp; { return __tz_convert (t, 1, tp); } // 非线程安全版本,用到了全局变量_tmbuf struct tm * localtime(t) const time_t *t; { return __tz_convert (t, 1, &_tmbuf); } struct tm * __tz_convert (const time_t *timer, int use_localtime, struct tm *tp) { 。。。 // 信号处理函数中调用非信号安全函数,可能造成死锁的地方 __libc_lock_lock (tzset_lock); // localtime_r未用到_tmbuf,只是localtime使用它!!! // 因此对于localtime_r,传递给tzset_internal的第一个参数总是为0(tp != &_tmpbuf), // 而对于localtime,它传递给tzset_internal的第一个参数总是为1 tzset_internal (tp == &_tmbuf && use_localtime, 1); 。。。 } // 决定性能的函数,原因是可能涉及文件操作, // 因此要想提升性能,则应当想办法避免操作文件!!! static void internal_function tzset_internal (always, explicit) int always; int explicit; { static int is_initialized; // 静态变量 const char *tz; // 对于mktime,参数always值总是为1 // 对于localtime,参数always值总是为1 // 对于localtime_r,参数always值总是为0 if (is_initialized && !always) return; // 对于localtime_r第一次调用后,后续都在这里直接返回! 
is_initialized = 1;
  tz = getenv ("TZ");
  if (tz == NULL && !explicit)
    tz = TZDEFAULT;
  if (tz && *tz == '\0')
    tz = "Universal";
  if (tz && *tz == ':')
    ++tz;
  // 如果不设置环境变量TZ,则下面这个if语句总是不成立!!!
  // 因此只有设置了环境变量TZ,才有可能在这里直接返回而不进入读文件操作__tzfile_read
  if (old_tz != NULL && tz != NULL && strcmp (tz, old_tz) == 0)
    return; // 在这里返回则可以避免走到文件操作__tzfile_read
  if (tz == NULL)
    tz = TZDEFAULT;
  tz_rules[0].name = NULL;
  tz_rules[1].name = NULL;
  // Save the value of `tz'.
  free (old_tz);
  old_tz = tz ? __strdup (tz) : NULL;
  // 读文件,性能慢的原因
  __tzfile_read (tz, 0, NULL); // Try to read a data file.
  if (__use_tzfile)
    return;
  。。。
}
*/

// mktime相关源代码:
/*
time_t mktime (struct tm *tp)
{
#ifdef _LIBC
  // POSIX.1 8.1.1 requires that whenever mktime() is called, the
  // time zone names contained in the external variable 'tzname' shall
  // be set as if the tzset() function had been called.
  __tzset ();
#endif
  // __mktime_internal会调用localtime_r,
  // isdst的取值在这里会严重影响到mktime的性能
  return __mktime_internal (tp, __localtime_r, &localtime_offset);
}

void __tzset (void)
{
  __libc_lock_lock (tzset_lock);
  // 和localtime_r一样也会调用tzset_internal
  tzset_internal (1, 1);
  if (!__use_tzfile)
    {
      // Set `tzname'.
      __tzname[0] = (char *) tz_rules[0].name;
      __tzname[1] = (char *) tz_rules[1].name;
    }
  __libc_lock_unlock (tzset_lock);
}
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static void test_localtime(int M);   // 测试localtime性能
static void test_localtime_r(int M); // 测试localtime_r性能
static void test_mktime(int M);      // 测试mktime性能

int main(int argc, char* argv[])
{
    const int M = (argc<2)? 1000000: atoi(argv[1]);

    test_localtime(M);
    printf("\n");
    test_localtime_r(M);
    printf("\n");
    test_mktime(M);
    return 0;
}

// test_localtime
void test_localtime(int M)
{
    int i;
    time_t now = time(NULL);
    struct timeval tv1, tv2;

    printf("test: localtime ...\n");
    unsetenv("TZ");
    // test1
    {
        struct tm* result1;
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { result1 = localtime(&now); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    setenv("TZ", "", 1);
    // test2
    {
        struct tm* result2;
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { result2 = localtime(&now); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    setenv("TZ", "Asia/Shanghai", 1);
    // test3
    {
        struct tm* result3;
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { result3 = localtime(&now); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
}

// test_localtime_r
void test_localtime_r(int M)
{
    int i;
    time_t now = time(NULL);
    struct timeval tv1, tv2;

    printf("test: localtime_r ...\n");
    unsetenv("TZ");
    // test1
    {
        struct tm result1 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result1_; memcpy(&result1_, &result1, sizeof(result1_)); result1_.tm_isdst = 1; localtime_r(&now, &result1_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL and isdst=1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test2
    {
        struct tm result2 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result2_; memcpy(&result2_, &result2, sizeof(result2_)); result2_.tm_isdst = 0; localtime_r(&now, &result2_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL and isdst=0: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test3
    {
        struct tm result3 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result3_; memcpy(&result3_, &result3, sizeof(result3_)); result3_.tm_isdst = -1; localtime_r(&now, &result3_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL and isdst=-1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test4
    {
        struct tm result4 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result4_; memcpy(&result4_, &result4, sizeof(result4_)); localtime_r(&now, &result4_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL and isdst undefined: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    setenv("TZ", "", 1);
    // test5
    {
        struct tm result5 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result5_; memcpy(&result5_, &result5, sizeof(result5_)); result5_.tm_isdst = 1; localtime_r(&now, &result5_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty and isdst=1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test6
    {
        struct tm result6 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result6_; memcpy(&result6_, &result6, sizeof(result6_)); result6_.tm_isdst = 0; localtime_r(&now, &result6_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty and isdst=0: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test7
    {
        struct tm result7 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result7_; memcpy(&result7_, &result7, sizeof(result7_)); result7_.tm_isdst = -1; localtime_r(&now, &result7_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty and isdst=-1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test8
    {
        struct tm result8 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result8_; memcpy(&result8_, &result8, sizeof(result8_)); localtime_r(&now, &result8_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty and isdst undefined: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    setenv("TZ", "Asia/Shanghai", 1);
    // test9
    {
        struct tm result9 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result9_; memcpy(&result9_, &result9, sizeof(result9_)); result9_.tm_isdst = 1; localtime_r(&now, &result9_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai and isdst=1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test10
    {
        struct tm result10 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result10_; memcpy(&result10_, &result10, sizeof(result10_)); result10_.tm_isdst = 0; localtime_r(&now, &result10_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai and isdst=0: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test11
    {
        struct tm result11 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result11_; memcpy(&result11_, &result11, sizeof(result11_)); result11_.tm_isdst = -1; localtime_r(&now, &result11_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai and isdst=-1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test12
    {
        struct tm result12 = { 0 };
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result12_; memcpy(&result12_, &result12, sizeof(result12_)); localtime_r(&now, &result12_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai and isdst undefined: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
}

// test_mktime
void test_mktime(int M)
{
    int i;
    time_t now = time(NULL);
    struct timeval tv1, tv2;

    printf("test: mktime ...\n");
    unsetenv("TZ");
    // test1
    {
        struct tm result1;
        localtime_r(&now, &result1);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result1_; memcpy(&result1_, &result1, sizeof(result1_)); result1_.tm_isdst = 1; mktime(&result1_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL and isdst=1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test2
    {
        struct tm result2;
        localtime_r(&now, &result2);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result2_; memcpy(&result2_, &result2, sizeof(result2_)); result2_.tm_isdst = 0; mktime(&result2_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL and isdst=0: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test3
    {
        struct tm result3;
        localtime_r(&now, &result3);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result3_; memcpy(&result3_, &result3, sizeof(result3_)); result3_.tm_isdst = -1; mktime(&result3_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL and isdst=-1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test4
    {
        struct tm result4;
        localtime_r(&now, &result4);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result4_; memcpy(&result4_, &result4, sizeof(result4_)); mktime(&result4_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is NULL and isdst undefined: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    setenv("TZ", "", 1);
    // test5
    {
        struct tm result5;
        localtime_r(&now, &result5);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result5_; memcpy(&result5_, &result5, sizeof(result5_)); result5_.tm_isdst = 1; mktime(&result5_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty and isdst=1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test6
    {
        struct tm result6;
        localtime_r(&now, &result6);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result6_; memcpy(&result6_, &result6, sizeof(result6_)); result6_.tm_isdst = 0; mktime(&result6_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty and isdst=0: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test7
    {
        struct tm result7;
        localtime_r(&now, &result7);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result7_; memcpy(&result7_, &result7, sizeof(result7_)); result7_.tm_isdst = -1; mktime(&result7_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty and isdst=-1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test8
    {
        struct tm result8;
        localtime_r(&now, &result8);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result8_; memcpy(&result8_, &result8, sizeof(result8_)); mktime(&result8_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is empty and isdst undefined: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    setenv("TZ", "Asia/Shanghai", 1);
    // test9
    {
        struct tm result9;
        localtime_r(&now, &result9);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result9_; memcpy(&result9_, &result9, sizeof(result9_)); result9_.tm_isdst = 1; mktime(&result9_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai and isdst=1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test10
    {
        struct tm result10;
        localtime_r(&now, &result10);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result10_; memcpy(&result10_, &result10, sizeof(result10_)); result10_.tm_isdst = 0; mktime(&result10_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai and isdst=0: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test11
    {
        struct tm result11;
        localtime_r(&now, &result11);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result11_; memcpy(&result11_, &result11, sizeof(result11_)); result11_.tm_isdst = -1; mktime(&result11_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai and isdst=-1: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
    // test12
    {
        struct tm result12;
        localtime_r(&now, &result12);
        gettimeofday(&tv1, NULL);
        for (i=0; i<M; ++i) { struct tm result12_; memcpy(&result12_, &result12, sizeof(result12_)); mktime(&result12_); }
        gettimeofday(&tv2, NULL);
        printf("TZ is Asia/Shanghai and isdst undefined: %ldms\n", (tv2.tv_sec-tv1.tv_sec)*1000 + (tv2.tv_usec-tv1.tv_usec)/1000);
    }
}
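从上面的测试程序和glibc源代码注释可以看出:不设置环境变量TZ时,localtime、localtime_r和mktime每次调用都可能走到读文件的__tzfile_read;而mktime在tm_isdst为-1时还需要额外判断是否夏令时。一个常见的规避写法如下(最小示例,时区以Asia/Shanghai为例,请按实际环境调整):进程启动时显式设置TZ并调用一次tzset(),并在调用mktime前明确设置tm_isdst。
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    // 启动时设置一次TZ,后续localtime_r/mktime可命中old_tz的比较而提前返回,不再反复读/etc/localtime
    setenv("TZ", "Asia/Shanghai", 1);
    tzset();

    time_t now = time(NULL);
    struct tm tm_now;
    localtime_r(&now, &tm_now);

    tm_now.tm_isdst = 0; // 中国无夏令时,明确设为0,避免mktime内部做额外判断
    time_t t = mktime(&tm_now);
    printf("%ld == %ld\n", (long)now, (long)t);
    return 0;
}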
cron和sh等可能被某些共享库hook,而这些共享库可能会触发SIGPIPE,导致crontab和shell工作异常,解决办法是程序忽略SIGPIPE或脚本中使用“trap '' SIGPIPE”。 问题描述1: shell中的ps、wc、sleep命令均工作异常,检查它们的“$?”值为141。 问题描述2: 在Crontab中仅配置如下一条命令(为简化问题的描述和定位,剔除所有其它的): */1 * * * * echo hello >> /tmp/hello.txt 也就是每分钟执行一下“echo hello >> /tmp/hello.txt”。 通过观察发现: 每次重启cron进程后,都只能连续正常工作5次,也就是可以看到“/tmp/hello.txt”新增5行“hello”。
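针对上面“程序忽略SIGPIPE”的解决办法,C/C++程序中的最小写法如下(示意):忽略后再向已关闭的管道或socket写数据时,write只会返回-1并把errno置为EPIPE,进程不会再被SIGPIPE终止。
#include <signal.h>
#include <stdio.h>

int main()
{
    struct sigaction sa;
    sa.sa_handler = SIG_IGN; // 忽略SIGPIPE
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGPIPE, &sa, NULL) != 0)
        perror("sigaction");

    // 后续的管道/socket写操作不会再因SIGPIPE导致进程退出
    return 0;
}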
nginx做反向代理时的真实IP.pdf 1. 编译 对于client -> nginx reverse proxy -> apache, 要想在程序中取得真实的IP,在执行nginx的configure时,必须指定参数“--with-http_realip_module”,示例: ./configure --prefix=/data/nginx --with-http_realip_module --with-stream --with-pcre=/tmp/X/pcre-8.32 --with-openssl=/tmp/X/openssl-1.0.2a 参数说明: --prefix= 指定安装目录,也就是make install后程序文件等的存放目录 --with-http_realip_module 使得程序可以通过环境变量HTTP_X_REAL_IP取得真实的客户端IP地址 --with-stream 表示启用TCP代理 --with-pcre= 指定依赖的pcre,注意为pcre源代码解压后的目录路径,而不是安装路径 --with-openssl= 指定依赖的openssl,注意为openssl源代码解压后的目录路径,而不是安装路径 另外,最简单的确认方法是使用nm命令查看nginx程序文件,看看是否有包含real相关的符号,对于版本nginx-1.9.4 ,可以发现存在“0809c54b t ngx_http_realip”。 2. 程序代码 测试程序代码(后续测试基于它): // g++ -g -o hello.cgi hello.cpp #include #include int main() { printf("Content-Type: text/html; charset=utf-8\r\n\r\n"); printf(" HTTP_X_FORWARDED_FOR: %s\n", getenv("HTTP_X_FORWARDED_FOR")); printf(" HTTP_X_REAL_IP: %s\n", getenv("HTTP_X_REAL_IP")); printf(" REMOTE_ADDR: %s\n", getenv("REMOTE_ADDR")); printf(" "); return 0; } 测试是在nginx自带配置文件nginx.conf上进行的修改: proxy_set_header可以添加在nginx.conf的http段,也可以是server段,还可以是location段,一级一级间是继承和覆盖关系。 3. 相关配置 示例: location / { # root html; # index index.html index.htm; proxy_pass http://20.61.28.11:80; proxy_redirect off; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; # 这个是必须的 proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; } X-Forwarded-For和X-Real-IP的区别是,如果请求时已带了X-Forwarded-For,则nginx追加方式,这样可以通过它显示转发的轨迹。 当然请求时完全可以构造假的X-Forwarded-For,在配置文件打开了X-Real-IP及编译指定了--with-http_realip_module时,环境变量HTTP_X_REAL_IP总是为真实的客户端IP。 如果是: client -> nginx reverse proxy (A) -> nginx reverse proxy (B) -> apache HTTP_X_REAL_IP又会是什么了? 4. 测试1 假设如下部署: client(10.6.81.39) -> nginx(10.6.223.44:8080) -> nginx(10.6.208.101:8080) -> apache(10.6.208.101:80) ? A 假设nginx(10.6.223.44:8080)的配置均为(在nginx默认配置上的修改部分): server { listen 8080; server_name 10.6.223.44; location / { # root html; # index index.html index.htm; proxy_pass http://10.6.208.101:8080; proxy_redirect off; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; } 假设nginx(10.6.208.101:8080)的配置均为(在nginx默认配置上的修改部分): server { listen 8080; server_name 10.6.208.101; location / { # root html; # index index.html index.htm; proxy_pass http://10.6.208.101:80; proxy_redirect off; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; } 上述测试程序输出的结果为: HTTP_X_FORWARDED_FOR: 10.6.81.39, 10.6.223.44 HTTP_X_REAL_IP: 10.6.223.44 REMOTE_ADDR: 10.6.81.39 ? B 但如果client在HTTP请求头中加入: X-FORWARDED-FOR:8.8.8.7 CLIENT-IP:8.8.8.8 X-REAL-IP:8.8.8.10 后输出结果变成: HTTP_X_FORWARDED_FOR: 8.8.8.7, 10.6.81.39, 10.6.223.44 HTTP_X_REAL_IP: 10.6.223.44 REMOTE_ADDR: 8.8.8.7 ? C 基于A,如果只nginx(10.6.223.44:8080)配置注释掉“X-Forwarded-For”,输出结果变成: HTTP_X_FORWARDED_FOR: 10.6.223.44 HTTP_X_REAL_IP: 10.6.223.44 REMOTE_ADDR: 10.6.223.44 ? D 基于A,如果只nginx(10.6.208.101:8080)配置注释掉“X-Forwarded-For”,输出结果变成: HTTP_X_FORWARDED_FOR: 10.6.81.39 HTTP_X_REAL_IP: 10.6.223.44 REMOTE_ADDR: 10.6.81.39 5. 测试2 基于测试1的配置, 当访问路径变成:client(10.6.81.39) -> nginx(10.6.208.101:8080) -> apache(10.6.208.101:80)时,程序输出结果为: HTTP_X_FORWARDED_FOR: 10.6.81.39 HTTP_X_REAL_IP: 10.6.81.39 REMOTE_ADDR: 10.6.81.39 但如果client在HTTP请求头中加入: X-FORWARDED-FOR:8.8.8.7 CLIENT-IP:8.8.8.8 X-REAL-IP:8.8.8.10 后输出结果变成: HTTP_X_FORWARDED_FOR: 8.8.8.7, 10.6.81.39 HTTP_X_REAL_IP: 10.6.81.39 REMOTE_ADDR: 8.8.8.7 从上可以看出,只配置正确使用了real-ip功能,除HTTP_X_REAL_IP外,其它内容可以被干扰,client可以篡改它们。 6. 
结论 如果正确编译和配置了nginx反向代理,当只有一层nginx反向代理时,可以通过“HTTP_X_REAL_IP”取得client的真实IP。 如果有二层nginx反向代理,则client的真实IP被包含在“HTTP_X_FORWARDED_FOR”中。 最不可信的是“REMOTE_ADDR”,它的内容完全可以被client指定!总之只要编译和配置正确,“HTTP_X_FORWARDED_FOR”总是包含了client的真实IP。
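结合上面的结论,后端程序取客户端真实IP时:单层nginx反向代理可直接取HTTP_X_REAL_IP;多层代理则需要解析HTTP_X_FORWARDED_FOR。下面是一个解析思路的示意代码(假设部署时已知各层可信代理的IP,仅为思路演示,并非正式实现):从右往左跳过可信代理IP,遇到的第一个非代理IP即认为是客户端真实IP,这样client自行伪造的内容只会出现在更左侧而被忽略。
#include <cstdio>
#include <cstdlib>
#include <set>
#include <sstream>
#include <string>
#include <vector>

static std::string trim(const std::string& s)
{
    const std::string spaces = " \t";
    std::string::size_type b = s.find_first_not_of(spaces);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(spaces);
    return s.substr(b, e - b + 1);
}

// trusted_proxies为部署时已知的各级nginx反向代理IP
static std::string real_client_ip(const char* xff, const std::set<std::string>& trusted_proxies)
{
    if (xff == NULL) return "";
    std::vector<std::string> ips;
    std::stringstream ss(xff);
    std::string item;
    while (std::getline(ss, item, ','))
        ips.push_back(trim(item));
    for (std::vector<std::string>::reverse_iterator it = ips.rbegin(); it != ips.rend(); ++it)
        if (trusted_proxies.count(*it) == 0)
            return *it; // 第一个不是可信代理的IP
    return ips.empty()? "": ips.front();
}

int main()
{
    std::set<std::string> proxies;
    proxies.insert("10.6.223.44");  // 第一层nginx
    proxies.insert("10.6.208.101"); // 第二层nginx
    // 对应测试1-B的输出“8.8.8.7, 10.6.81.39, 10.6.223.44”,结果应为10.6.81.39
    printf("%s\n", real_client_ip(getenv("HTTP_X_FORWARDED_FOR"), proxies).c_str());
    return 0;
}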
HBase Thrift2 CPU过高问题分析.pdf 目录 目录 1 1. 现象描述 1 2. 问题定位 2 3. 解决方案 5 4. 相关代码 5 1. 现象描述 外界连接9090端口均超时,但telnet端口总是成功。使用top命令观察,发现单个线程的CPU最高达99.99%,但并不总是99.9%,而是在波动。当迁走往该机器的流量后,能够访问成功,但仍然有超时,读超时比写超时多: # ./hbase_stress --hbase=110.13.136.207:9090 --test=2 --timeout=10 [2016-11-27 10:15:21/771][139756154767104/31562][ERROR][hbase_stress.cpp:302]TransportException(thrift://110.13.136.207:9090): EAGAIN (timed out) [2016-11-27 10:15:31/775][139756154767104/31562][ERROR][hbase_stress.cpp:302]TransportException(thrift://110.13.136.207:9090): EAGAIN (timed out) PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20727 zhangsan 20 0 10.843g 9.263g 26344 R 99.9 26.4 1448:00 java 20729 zhangsan 20 0 10.843g 9.263g 26344 R 99.9 26.4 1448:00 java 20730 zhangsan 20 0 10.843g 9.263g 26344 R 99.9 26.4 1449:10 java 20728 zhangsan 20 0 10.843g 9.263g 26344 R 99.8 26.4 1448:00 java 20693 zhangsan 20 0 10.843g 9.263g 26344 S 0.0 26.4 0:00.00 java 20727 zhangsan 20 0 10.843g 9.263g 26344 R 75.5 26.4 1448:06 java 20728 zhangsan 20 0 10.843g 9.263g 26344 R 75.2 26.4 1448:06 java 20729 zhangsan 20 0 10.843g 9.263g 26344 R 75.2 26.4 1448:06 java 20730 zhangsan 20 0 10.843g 9.263g 26344 R 75.2 26.4 1449:15 java 20716 zhangsan 20 0 10.843g 9.263g 26344 S 24.9 26.4 93:48.75 java 2. 问题定位 使用ps命令找出CPU最多的线程,和top显示的一致: $ ps -mp 20693 -o THREAD,tid,time | sort -rn zhangsan 18.8 19 - - - - 20730 1-00:11:23 zhangsan 18.7 19 - - - - 20729 1-00:10:13 zhangsan 18.7 19 - - - - 20728 1-00:10:13 zhangsan 18.7 19 - - - - 20727 1-00:10:13 zhangsan 16.1 19 - futex_ - - 20731 20:44:51 zhangsan 5.2 19 - futex_ - - 20732 06:46:39 然后借助jstack,发现为GC进程: "Gang worker#0 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d4000 nid=0x50f7 runnable "Gang worker#1 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d5800 nid=0x50f8 runnable "Gang worker#2 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d7800 nid=0x50f9 runnable "Gang worker#3 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d9000 nid=0x50fa runnable 使用jstat工具查看GC,情况很不乐观,问题就是有GC引起的: $ jstat -gcutil 20693 1000 100 S0 S1 E O M CCS YGC YGCT FGC FGCT GCT 0.00 99.67 100.00 100.00 98.08 94.41 42199 369.132 27084 34869.601 35238.733 0.00 99.67 100.00 100.00 98.08 94.41 42199 369.132 27084 34870.448 35239.580 0.00 99.67 100.00 100.00 98.08 94.41 42199 369.132 27084 34870.448 35239.580 0.00 99.67 100.00 100.00 98.08 94.41 42199 369.132 27084 34870.448 35239.580 $ jstat -gccapacity 20693 NGCMN NGCMX NGC S0C S1C EC OGCMN OGCMX OGC OC MCMN MCMX MC CCSMN CCSMX CCSC YGC FGC 191808.0 1107520.0 1107520.0 110720.0 110720.0 886080.0 383680.0 8094144.0 8094144.0 8094144.0 0.0 1077248.0 31584.0 0.0 1048576.0 3424.0 42199 27156 $ jstat -gcold 20693 MC MU CCSC CCSU OC OU YGC FGC FGCT GCT 31584.0 30978.7 3424.0 3232.7 8094144.0 8094144.0 42199 27174 34964.347 35333.479 $ jstat -gcoldcapacity 20693 OGCMN OGCMX OGC OC YGC FGC FGCT GCT 383680.0 8094144.0 8094144.0 8094144.0 42199 27192 34982.623 35351.755 $ jstat -gcnewcapacity 20693 NGCMN NGCMX NGC S0CMX S0C S1CMX S1C ECMX EC YGC FGC 191808.0 1107520.0 1107520.0 110720.0 110720.0 110720.0 110720.0 886080.0 886080.0 42199 27202 $ jstat -gc 20693 S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT 110720.0 110720.0 0.0 110395.9 886080.0 886080.0 8094144.0 8094144.0 31584.0 30978.7 3424.0 3232.7 42199 369.132 27206 34996.538 35365.671 $ jstat -gcnew 20693 S0C S1C S0U S1U TT MTT DSS EC EU YGC YGCT 110720.0 110720.0 0.0 110396.9 6 6 55360.0 886080.0 886080.0 42199 369.132 使用lsof显示该进程的连接数不多,完全在安全范围内,问题应当是有对象不能被回收。使用jmap查看内存详情,先看堆的使用情况: $ jmap 
-heap 20693 Attaching to process ID 20693, please wait... Debugger attached successfully. Server compiler detected. JVM version is 25.77-b03 using parallel threads in the new generation. using thread-local object allocation. Concurrent Mark-Sweep GC Heap Configuration: MinHeapFreeRatio = 40 MaxHeapFreeRatio = 70 MaxHeapSize = 9422503936 (8986.0MB) NewSize = 196411392 (187.3125MB) MaxNewSize = 1134100480 (1081.5625MB) OldSize = 392888320 (374.6875MB) NewRatio = 2 SurvivorRatio = 8 MetaspaceSize = 21807104 (20.796875MB) CompressedClassSpaceSize = 1073741824 (1024.0MB) MaxMetaspaceSize = 17592186044415 MB G1HeapRegionSize = 0 (0.0MB) Heap Usage: New Generation (Eden + 1 Survivor Space): capacity = 1020723200 (973.4375MB) used = 1020398064 (973.1274261474609MB) free = 325136 (0.3100738525390625MB) 99.96814650632022% used Eden Space: capacity = 907345920 (865.3125MB) used = 907345920 (865.3125MB) free = 0 (0.0MB) 100.0% used From Space: capacity = 113377280 (108.125MB) used = 113052144 (107.81492614746094MB) free = 325136 (0.3100738525390625MB) 99.71322649476156% used To Space: capacity = 113377280 (108.125MB) used = 0 (0.0MB) free = 113377280 (108.125MB) 0.0% used concurrent mark-sweep generation: capacity = 8288403456 (7904.4375MB) used = 8288403424 (7904.437469482422MB) free = 32 (3.0517578125E-5MB) 99.9999996139184% used 10216 interned Strings occupying 934640 bytes. 进一步查看对象的情况: $ jmap -histo 20693 num #instances #bytes class name ---------------------------------------------- 1: 72835212 2518411456 [B 2: 49827147 1993085880 java.util.TreeMap$Entry 3: 12855993 617087664 java.util.TreeMap 4: 4285217 445662568 org.apache.hadoop.hbase.client.ClientScanner 5: 4285222 377099536 org.apache.hadoop.hbase.client.Scan 6: 4284875 377069000 org.apache.hadoop.hbase.client.ScannerCallable 7: 4285528 342921344 [Ljava.util.HashMap$Node; 8: 4284880 308511360 org.apache.hadoop.hbase.client.ScannerCallableWithReplicas 9: 8570671 274261472 java.util.LinkedList 10: 4285579 205707792 java.util.HashMap 11: 4285283 205693584 org.apache.hadoop.hbase.client.RpcRetryingCaller 12: 3820914 152836560 org.apache.hadoop.hbase.filter.SingleColumnValueFilter 13: 4291904 137340928 java.util.concurrent.ConcurrentHashMap$Node 14: 8570636 137130176 java.util.TreeMap$EntrySet 15: 4285278 137128896 org.apache.hadoop.hbase.io.TimeRange 16: 8570479 137127664 java.util.concurrent.atomic.AtomicBoolean 17: 2891409 92525088 org.apache.hadoop.hbase.NoTagsKeyValue 18: 4286540 68584640 java.lang.Integer 19: 4285298 68564768 java.util.TreeMap$KeySet 20: 4285275 68564400 java.util.TreeSet 21: 4285006 68560096 java.util.HashSet 22: 4284851 68557616 java.util.HashMap$KeySet 23: 3176118 50817888 org.apache.hadoop.hbase.filter.BinaryComparator 24: 109 33607600 [Ljava.util.concurrent.ConcurrentHashMap$Node; 25: 418775 18479112 [Lorg.apache.hadoop.hbase.Cell; 26: 671443 17693224 [C 27: 418781 16751240 org.apache.hadoop.hbase.client.Result 28: 669739 16073736 java.lang.String 29: 644796 15475104 org.apache.hadoop.hbase.filter.SubstringComparator 30: 419134 10059216 java.util.LinkedList$Node 为使系统能够正常工作,先实施治标不治本的方案:监控GC,定时重启HBase Thrift2进程,然后再找出根本原因达到治本的目的。 从上面jmap的输出来看,猜测是不是因为额scanner没有被关闭导致的。而scanner没有被关闭的原因有两个:一是客户端程序问题没有关闭,也就是有内存泄漏了,二是客户端程序异常导致没机会关闭。 查看客户端源代码,确实存在openScanner的异常时未关闭。另外客户端被kill掉或断电等,也会导致无法释放,这一点是HBase Thrift2得解决的问题。 3. 解决方案 针对前面分析出的问题分别加以解决: 1) 客户端保证scanner全部释放; 2) HBase Thrift2增加自动释放长时间未操作的scanner; 3) 另外也可以使用getScannerResults替代getScannerRows来规避此问题。 补丁: https://issues.apache.org/jira/browse/HBASE-17182。 4. 
相关代码
private final Map<Integer, ResultScanner> scannerMap = new ConcurrentHashMap<Integer, ResultScanner>();

@Override
public int openScanner(ByteBuffer table, TScan scan) throws TIOError, TException {
  Table htable = getTable(table);
  ResultScanner resultScanner = null;
  try {
    resultScanner = htable.getScanner(scanFromThrift(scan));
  } catch (IOException e) {
    throw getTIOError(e);
  } finally {
    closeTable(htable);
  }
  // 将scanner放入到scannerMap中,
  // 如果客户端没有调用closeScanner,则会导致该scanner泄漏,GC无法回收该部分内存
  return addScanner(resultScanner);
}

/**
 * Assigns a unique ID to the scanner and adds the mapping to an internal HashMap.
 * @param scanner to add
 * @return Id for this Scanner
 */
private int addScanner(ResultScanner scanner) {
  int id = nextScannerId.getAndIncrement();
  scannerMap.put(id, scanner); // 将scanner放入到scannerMap中
  return id;
}

/**
 * Returns the Scanner associated with the specified Id.
 * @param id of the Scanner to get
 * @return a Scanner, or null if the Id is invalid
 */
private ResultScanner getScanner(int id) {
  return scannerMap.get(id);
}

@Override
public void closeScanner(int scannerId) throws TIOError, TIllegalArgument, TException {
  LOG.debug("scannerClose: id=" + scannerId);
  ResultScanner scanner = getScanner(scannerId);
  if (scanner == null) {
    String message = "scanner ID is invalid";
    LOG.warn(message);
    TIllegalArgument ex = new TIllegalArgument();
    ex.setMessage("Invalid scanner Id");
    throw ex;
  }
  scanner.close();          // 关闭scanner
  removeScanner(scannerId); // 从scannerMap中移除scanner
}

/**
 * Removes the scanner associated with the specified ID from the internal HashMap.
 * @param id of the Scanner to remove
 * @return the removed Scanner, or null if the Id is invalid
 */
protected ResultScanner removeScanner(int id) {
  return scannerMap.remove(id); // 从scannerMap中移除scanner
}
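针对解决方案中的“客户端保证scanner全部释放”,C++客户端可以用一个简单的RAII包装保证closeScanner总会被调用,即使中途抛异常或提前return也不会遗漏,从而避免服务端scannerMap中的ResultScanner泄漏。下面是一个示意(需-std=c++11,注释中的openScanner、getScannerRows等调用名以Thrift实际生成的客户端代码为准):
#include <functional>

class ScopedCleaner
{
public:
    explicit ScopedCleaner(std::function<void ()> cleaner)
        : _cleaner(cleaner) {}
    ~ScopedCleaner()
    {
        if (_cleaner) { try { _cleaner(); } catch (...) {} }
    }
private:
    std::function<void ()> _cleaner;
};

/*
// 用法示意(接口名以实际生成代码为准):
int32_t scanner_id = client.openScanner(table, scan);
ScopedCleaner closer([&]() { client.closeScanner(scanner_id); });
while (true)
{
    std::vector<TResult> rows;
    client.getScannerRows(rows, scanner_id, 100);
    if (rows.empty()) break;
    // 处理rows ...
}
// closer析构时自动调用closeScanner,异常路径同样生效
*/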
零停重启程序工具Huptime研究.pdf 目录 目录 1 1. 官网 1 2. 功能 1 3. 环境要求 2 4. 实现原理 2 5. SIGHUP信号处理 3 6. 重启线程 4 7. 重启目标程序 5 8. 系统调用钩子辅助 6 9. 被勾住系统调用exit 6 10. 被勾住系统调用listen 7 11. Symbol Versioning 8 12. 勾住bind等系统调用 10 13. 系统调用过程 13 14. 测试代码 13 14.1. Makefile 13 14.2. s.c 14 14.3. s.map 14 14.4. x.cpp 14 14.5. 体验方法 15 1. 官网 https://github.com/amscanne/huptime 2. 功能 零停重启目标程序,比如一个网络服务程序,不用丢失和中断任何消息实现重新启动,正在处理的消息也不会中断和丢失,重启的方法是给目标程序的进程发SIGHUP信号。 3. 环境要求 由于使用了Google牛人Tom Herbert为Linux内核打的补丁SO_REUSEPORT特性,因此要求Linux内核版本为3.9或以上,SO_REUSEPORT允许多个进程监听同一IP的同一端口。 4. 实现原理 利用SIGHUP + SO_REUSEPORT + LD_PRELOAD,通过LD_PRELOAD将自己(huptime.so)注入到目标进程空间。 使用Python脚本huptime启动时会设置LD_PRELOAD,将huptime.so注入到目标程序的进程空间。 huptime.so启动时会执行setup函数,在setup中会创建一个线程impl_restart_thread用于重启目标程序的进程,另外还会安装信号SIGHUP的处理器sighandler用于接收零重启信号SIGHUP: static void __attribute__((constructor)) setup(void) { #define likely(x) __builtin_expect (!!(x), 1) if( likely(initialized) ) // 只做一次 return; initialized = 1; #define GET_LIBC_FUNCTION(_name) \ libc._name = get_libc_function<_name>(# _name, &_name) // 初始化全局变量libc,让其指向GLIBC库的bind等 GET_LIBC_FUNCTION(bind); // libc.bind = dlsym(RTLD_NEXT, bind); // 系统的bind GET_LIBC_FUNCTION(listen); GET_LIBC_FUNCTION(accept); GET_LIBC_FUNCTION(accept4); GET_LIBC_FUNCTION(close); GET_LIBC_FUNCTION(fork); GET_LIBC_FUNCTION(dup); GET_LIBC_FUNCTION(dup2); GET_LIBC_FUNCTION(dup3); GET_LIBC_FUNCTION(exit); GET_LIBC_FUNCTION(wait); GET_LIBC_FUNCTION(waitpid); GET_LIBC_FUNCTION(syscall); GET_LIBC_FUNCTION(epoll_create); GET_LIBC_FUNCTION(epoll_create1); #undef GET_LIBC_FUNCTION impl_init(); // 安装信号SIGHUP处理器、创建重启线程等 } template static FUNC_T get_libc_function(const char* name, FUNC_T def) { char *error; FUNC_T result; /* Clear last error (if any). */ dlerror(); /* Try to get the symbol. */ result = (FUNC_T)dlsym(RTLD_NEXT, name); error = dlerror(); if( result == NULL || error != NULL ) { fprintf(stderr, "dlsym(RTLD_NEXT, \"%s\") failed: %s", name, error); result = def; } return result; } 5. SIGHUP信号处理 // 信号SIGHUP处理函数,作用是通过管道通知重启线程impl_restart_thread, // 这里其实可以考虑使用eventfd替代pipe static void* impl_restart_thread(void*); void sighandler(int signo) { /* Notify the restart thread. * We have to do this in a separate thread, because * we have no guarantees about which thread has been * interrupted in order to execute this signal handler. * Because this could have happened during a critical * section (i.e. locks held) we have no choice but to * fire the restart asycnhronously so that it too can * grab locks appropriately. */ if( restart_pipe[1] == -1 ) { /* We've already run. */ return; } while( 1 ) { char go = 'R'; int rc = write(restart_pipe[1], &go, 1); // 通知重启线程 if( rc == 0 ) { /* Wat? Try again. */ continue; } else if( rc == 1 ) { /* Done. */ libc.close(restart_pipe[1]); restart_pipe[1] = -1; break; } else if( rc { /* Go again. */ continue; } else { /* Shit. */ DEBUG("Restart pipe fubared!? Sorry."); break; } } } 6. 重启线程 void* impl_restart_thread(void* arg) { /* Wait for our signal. */ while( 1 ) { char go = 0; int rc = read(restart_pipe[0], &go, 1); // 等待SIGHUP信号 if( rc == 1 ) { /* Go. */ break; } else if( rc == 0 ) { /* Wat? Restart. */ DEBUG("Restart pipe closed?!"); break; } else if( rc { /* Keep trying. */ continue; } else { /* Real error. Let's restart. */ DEBUG("Restart pipe fubared?!"); break; } } libc.close(restart_pipe[0]); restart_pipe[0] = -1; /* See note above in sighandler(). */ impl_restart(); // 重启目标进程 return arg; } 7. 重启目标程序 void impl_restart(void) { /* Indicate that we are now exiting. 
*/ L(); // 加锁 impl_exit_start(); impl_exit_check(); U(); // 解锁 } 8. 系统调用钩子辅助 funcs_t impl = { .bind = do_bind, .listen = do_listen, .accept = do_accept_retry, .accept4 = do_accept4_retry, .close = do_close, .fork = do_fork, .dup = do_dup, .dup2 = do_dup2, .dup3 = do_dup3, .exit = do_exit, .wait = do_wait, .waitpid = do_waitpid, .syscall = (syscall_t)do_syscall, .epoll_create = do_epoll_create, .epoll_create1 = do_epoll_create1, }; funcs_t libc; // 目标程序的进程调用的实际是huptime中的do_XXX系列 9. 被勾住系统调用exit static void do_exit(int status) { if( revive_mode == TRUE ) // 如果是复活模式,也就是需要重启时 { DEBUG("Reviving..."); impl_exec(); // 调用execve重新启动目标程序 } libc.exit(status); // 调用系统的exit } 10. 被勾住系统调用listen static int do_listen(int sockfd, int backlog) { int rval = -1; fdinfo_t *info = NULL; if( sockfd { errno = EINVAL; return -1; } DEBUG("do_listen(%d, ...) ...", sockfd); L(); info = fd_lookup(sockfd); if( info == NULL || info->type != BOUND ) { U(); DEBUG("do_listen(%d, %d) => -1 (not BOUND)", sockfd, backlog); errno = EINVAL; return -1; } /* Check if we can short-circuit this. */ if( info->bound.real_listened ) { info->bound.stub_listened = 1; U(); DEBUG("do_listen(%d, %d) => 0 (stub)", sockfd, backlog); return 0; } /* Can we really call listen() ? */ if( is_exiting == TRUE ) { info->bound.stub_listened = 1; U(); DEBUG("do_listen(%d, %d) => 0 (is_exiting)", sockfd, backlog); return 0; } /* We largely ignore the backlog parameter. People * don't really use sensible values here for the most * part. Hopefully (as is default on some systems), * tcp syn cookies are enabled, and there's no real * limit for this queue and this parameter is silently * ignored. If not, then we use the largest value we * can sensibly use. */ (void)backlog; rval = libc.listen(sockfd, SOMAXCONN); if( rval { U(); DEBUG("do_listen(%d, %d) => %d", sockfd, backlog, rval); return rval; } /* We're done. */ info->bound.real_listened = 1; info->bound.stub_listened = 1; U(); DEBUG("do_listen(%d, %d) => %d", sockfd, backlog, rval); return rval; } 11. 
Symbol Versioning Huptime使用到了GCC的基于符号的版本机制Symbol Versioning。本节内容主要源自:https://blog.blahgeek.com/glibc-and-symbol-versioning/。 在linux上运行一个在其他机器上编译的可执行文件时可能会遇到错误:/lib64/libc.so.6: version ‘GLIBC_2.14’ not found (required by ./a.out),该错误的原因是GLIBC的版本偏低。 从GLIBC 2.1开始引入了Symbol Versioning机制,每个符号对应一个版本号,一个glibc库可包含一个函数的多个版本: # nm /lib64/libc.so.6|grep memcpy 000000000008ee90 i memcpy@@GLIBC_2.14 00000000000892b0 i memcpy@GLIBC_2.2.5 其中memcpy@@GLIBC_2.14为默认版本。使用Symbol Versioning可以改变一个已存在的接口: __asm__(".symver original_foo,foo@"); __asm__(".symver old_foo,foo@VERS_1.1"); __asm__(".symver old_foo1,foo@VERS_1.2"); __asm__(".symver new_foo,foo@@VERS_2.0"); 如果没有指定版本号,这个示例中的“foo@”代表了符号foo。源文件应当包含四个C函数的实现:original_foo、old_foo、old_foo1和new_foo。它的MAP文件必须在VERS_1.1、VERFS_1.2和VERS_2.0中包含foo。 也可以在自己的库中使用Symbol Versioning,如: /// @file libx-1.c @date 05/10/2015 /// @author i@BlahGeek.com __asm__(".symver foo_1, foo@@libX_1.0"); int foo_1() { return 1; } __asm__(".symver bar_1, bar@@libX_1.0"); int bar_1() { return -1; } 配套的MAP文件: libX_1.0 { global: foo; bar; local: *; }; 编译: gcc -shared -fPIC -Wl,--version-script libx-1.map libx-1.c -o lib1/libx.so 当发布新版本,希望保持兼容,可以增加一个版本号: /// @file libx.c @date 05/10/2015 /// @author i@BlahGeek.com /* old foo */ __asm__(".symver foo_1, foo@libX_1.0"); int foo_1() { return 1; } /* new foo */ __asm__(".symver foo_2, foo@@libX_2.0"); int foo_2() { return 2; } __asm__(".symver bar_1, bar@@libX_1.0"); int bar_1() { return -1; } 相应的MAP文件变成: libX_1.0 { global: foo; bar; local: *; }; libX_2.0 { global: foo; local: *; }; 设置环境变量LD_DEBUG,可以打开动态链接器的调试功能。共享库的构造和析构函数: void __attribute__((constructor(5))) init_function(void); void __attribute__((destructor(10))) fini_function(void); 括号中的数字越小优先级越高,不可以使用gcc -nostartfiles或-nostdlib。通过链接脚本可以将几个现在的共享库通过一定方式组合产生新的库: GROUP( /lib/libc.so.6 /lib/libm.so.2 ) 12. 勾住bind等系统调用 /* Exports name as aliasname in .dynsym. */ #define PUBLIC_ALIAS(name, aliasname) \ typeof(name) aliasname __attribute__ ((alias (#name))) \ __attribute__ ((visibility ("default"))); /* Exports stub_ ##name as name@version. */ #define SYMBOL_VERSION(name, version, version_ident) \ PUBLIC_ALIAS(stub_ ## name, stub_ ## name ## _ ## version_ident); \ asm(".symver stub_" #name "_" #version_ident ", " #name "@" version); /* Exports stub_ ##name as name@@ (i.e., the unversioned symbol for name). */ #define GLIBC_DEFAULT(name) \ SYMBOL_VERSION(name, "@", default_) /* Exports stub_ ##name as name@@GLIBC_MAJOR.MINOR.PATCH. */ #define GLIBC_VERSION(name, major, minor) \ SYMBOL_VERSION(name, "GLIBC_" # major "." # minor, \ glibc_ ## major ## minor) #define GLIBC_VERSION2(name, major, minor, patch) \ SYMBOL_VERSION(name, "GLIBC_" # major "." # minor "." 
# patch, \ glibc_ ## major ## minor ## patch) GLIBC_DEFAULT(bind) // 当目标程序调用bind时,实际调用的将是Huptime库中的stub_bind GLIBC_VERSION2(bind, 2, 2, 5) GLIBC_DEFAULT(listen) GLIBC_VERSION2(listen, 2, 2, 5) GLIBC_DEFAULT(accept) GLIBC_VERSION2(accept, 2, 2, 5) GLIBC_DEFAULT(accept4) GLIBC_VERSION2(accept4, 2, 2, 5) GLIBC_DEFAULT(close) GLIBC_VERSION2(close, 2, 2, 5) GLIBC_DEFAULT(fork) GLIBC_VERSION2(fork, 2, 2, 5) GLIBC_DEFAULT(dup) GLIBC_VERSION2(dup, 2, 2, 5) GLIBC_DEFAULT(dup2) GLIBC_VERSION2(dup2, 2, 2, 5) GLIBC_DEFAULT(dup3) GLIBC_VERSION2(dup3, 2, 2, 5) GLIBC_DEFAULT(exit) GLIBC_VERSION(exit, 2, 0) GLIBC_DEFAULT(wait) GLIBC_VERSION2(wait, 2, 2, 5) GLIBC_DEFAULT(waitpid) GLIBC_VERSION2(waitpid, 2, 2, 5) GLIBC_DEFAULT(syscall) GLIBC_VERSION2(syscall, 2, 2, 5) GLIBC_DEFAULT(epoll_create) GLIBC_VERSION2(epoll_create, 2, 3, 2) GLIBC_DEFAULT(epoll_create1) GLIBC_VERSION(epoll_create1, 2, 9) 对应的MAP文件: GLIBC_2.2.5 { global: bind; listen; accept; accept4; close; fork; dup; dup2; dup3; syscall; local: *; }; GLIBC_2.3.2 { global: epoll_create; local: *; }; GLIBC_2.0 { global: exit; local: *; }; GLIBC_2.9 { global: epoll_create1; local: *; }; GLIBC_DEFAULT(bind)展开 typeof(stub_bind) stub_bind_default_ __attribute__ ((alias ("stub_bind"))) __attribute__ ((visibility ("default")));; asm(".symver stub_" "bind" "_" "default_" ", " "bind" "@" "@"); // 上面这一句等效于:asm(.symver stub_bind_default_, bind@@); GLIBC_VERSION2(bind, 2, 2, 5) typeof(stub_bind) stub_bind_glibc_225 __attribute__ ((alias ("stub_bind"))) __attribute__ ((visibility ("default")));; asm(".symver stub_" "bind" "_" "glibc_225" ", " "bind" "@" "GLIBC_" "2" "." "2" "." "5"); // 上面这一句等效于:asm(.symver stub_bind_glibc_225, bind@GLIBC_2.2.5); 13. 系统调用过程 以bind为例: 目标程序的进程 -> stub_bind -> impl.bind -> do_bind -> libc.bind impl为一全局变量,impl->bind为函数指针,指向于do_bind。而libc.bind也为一函数指针,指向系统的bind。 14. 测试代码 用于体验Symbol Versioning和勾住系统函数: 1) Makefile 用于编译和测试 2) s.c 实现勾住库函数memcpy。 3) s.map s.c的MAP文件。 4) x.cpp 用于测试被勾住的memcpy程序。 14.1. Makefile all: x x: x.cpp libS.so g++ -g -o $@ $ libS.so: s.c s.map gcc -g -shared -fPIC -D_GNU_SOURCE -Wl,--version-script s.map $ clean: rm -f libS.so x test: all export LD_PRELOAD=`pwd`/libS.so;./x;export LD_PRELOAD= 14.2. s.c #include #include void* stub_memcpy(void *dst, const void *src, size_t n) { printf("stub_memcpy\n"); void* (*libc_memcpy)(void*, const void*, size_t) = dlsym(RTLD_NEXT, "memcpy"); return libc_memcpy(dst, src, n); } typeof(stub_memcpy) stub_memcpy_default_ __attribute__ ((alias ("stub_memcpy"))) __attribute__ ((visibility ("default")));; asm(".symver stub_" "memcpy" "_" "default_" ", " "memcpy" "@" "@"); 14.3. s.map libS_1.0 { global: memcpy; local: *; }; 14.4. x.cpp // Test: // export LD_PRELOAD=`pwd`/libS.so;./x;export LD_PRELOAD= #include #include int main() { char dst[100] = { '1', '2', '3', '\0' }; const char* src = "abc"; memcpy(dst, src, strlen(src)+1); printf("%s\n", dst); return 0; } 14.5. 体验方法 直接执行make test即可: $ make test export LD_PRELOAD=`pwd`/libS.so;./x;export LD_PRELOAD= stub_memcpy abc 如果不勾,则不要设置LD_PRELOAD直接执行x: $ ./x abc
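除了LD_PRELOAD方式,还可以用dlvsym直观感受同一个符号的多个版本(示意代码,库名与版本号以“nm /lib64/libc.so.6|grep memcpy”的实际输出为准):
// 编译:g++ -g -D_GNU_SOURCE -o v v.cpp -ldl
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

typedef void* (*memcpy_t)(void*, const void*, size_t);

int main()
{
    void* handle = dlopen("libc.so.6", RTLD_LAZY);
    if (NULL == handle)
    {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    memcpy_t def = (memcpy_t)dlsym(handle, "memcpy");                 // 默认版本,如memcpy@@GLIBC_2.14
    memcpy_t old = (memcpy_t)dlvsym(handle, "memcpy", "GLIBC_2.2.5"); // 指定旧版本
    printf("default memcpy: %p\nmemcpy@GLIBC_2.2.5: %p\n", (void*)def, (void*)old);

    char dst[8] = { 0 };
    if (old != NULL)
        old(dst, "abc", 4);
    printf("%s\n", dst);

    dlclose(handle);
    return 0;
}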
示例: # ls /usr/local/r3c/bin/lib /bin/ls: /usr/local/r3c/bin/lib: ????????? 查看系统字符集设置: # locale LANG=zh_CN.UTF-8 LC_CTYPE=POSIX LC_NUMERIC="zh_CN.UTF-8" 查看SecureCRT字符集设置: 问题出在LC_CTYPE,也应当设置为zh_CN.UTF-8: export LC_CTYPE=zh_CN.UTF-8 再次执行,乱码消除: # ls /usr/local/r3c/bin/lib /bin/ls: /usr/local/r3c/bin/lib: 没有那个文件或目录 对于vi或vim乱码,则需要在~/.vimrc文件中设置encoding、fileencoding或fileencodings,如: set encoding=utf8 set fileencoding=utf8 set fileencodings=utf8,gbk
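也可以用一个小程序确认终端环境实际生效的字符集(示意),用于核对LC_CTYPE是否已按预期设置:
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main()
{
    setlocale(LC_ALL, ""); // 采用环境变量中的locale设置
    printf("LC_CTYPE: %s\n", setlocale(LC_CTYPE, NULL));
    printf("charset : %s\n", nl_langinfo(CODESET)); // 期望输出UTF-8
    return 0;
}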
百度没能解决“连接到iCloud时出错”的问题,后来发现原因是“设置”这个应用被禁止访问WIFI和蜂窝网络,放开后问题即消失。
pwdx - report current working directory of a process,格式:pwdx pid 内存分析工具 valgrind valgrind辅助工具 qcachegrind 可视化查看valgrind结果 淘宝DBA团队发布的监控脚本,使用perl开发,可以完成对linux系统和MySql相关指标的实时监控 orzdba 取指定进程名的pid pidof 进程名 性能瓶颈查看: perf top -p pid 查看调用栈: pstack pid https://www.percona.com/ 查询程序执行聚合的GDB堆栈跟踪,先进性堆栈跟踪,然后将跟踪信息汇总: pt-pmp -p pid 格式化explain出来的执行计划按照tree方式输出,方便阅读: pt-visual-explain 从log文件中读取插叙语句,并用explain分析他们是如何利用索引,完成分析之后会生成一份关于索引没有被查询使用过的报告: pt-index-usage 其它: vmstat tcpdump 网络数据包分析器 objdump nm ldd strings iostat 输入/输出统计 ifstat 网络流量实时监控工具 vmstat 虚拟内存统计 sar (System Activity Reporter系统活动情况报告,最为全面的系统性能分析工具之一) iptraf 实时IP局域网监控 iftop 网络带宽监控 htop 进程监控 iotop 磁盘I/O监测工具 fuser 使用文件或文件结构识别进程 lsof 打开文件列表 dmesg slabtop free slurm 查看网络流量 byobu 类似于screen tmux 终端复用工具,类似于screen screen 在多个进程之间多路复用一个物理终端的窗口管理器 dtach 用来模拟screen的detach的功能的小工具 dstat 可以取代vmstat,iostat,netstat和ifstat这些命令的多功能产品 NetHogs 监视每个进程的网络带宽 MultiTail 同时监控多个文档、类似tail Monitorix 系统和网络监控 Arpwatch 以太网活动监控器 Suricata 网络安全监控 Nagios 网络/服务器监控 Collectl 一体化性能检测工具 mtr 网络连通性判断工具,集成了traceroute和ping socat 多功能的网络工具(Socket CAT,netcat加强版) netpipes socket操作 ab wget curl tsung 压力测试工具 siege 压力测试和评测工具 nmon 监控Linux系统性能 psacct 监视用户活动 ncdu 基于ncurses库的磁盘使用分析器 slurm 实时网络流量监控 findmnt 查找已经被挂载的文件系统 saidar 系统数据监控和统计工具 ss 可以替代netstat的网络连接查看工具(socket statistics) ccze 用不同颜色高亮日志协助管理员进行区分和查看分析 netstat 网络统计 ifconfig (ifup ifdown) Linux磁盘相关命令 sfdisk -l sfdisk -s fdisk -l dmesg |grep SCSI dmesg |grep -i raid df -h cat /proc/scsi/scsi hdparm /dev/sda mount 加载一块硬盘 mkfs 创建文件系统 /etc/fstab 文件内容mount命令输出一致 lscpu 查看CPU lspci 查看主板 lsscsi 查看SCSI卡 测速 hdparm -t /dev/sda parted parted是一个由GNU开发的功能强大的磁盘分区和分区大小调整工具。 cfdisk -Ps cfdisk是一个磁盘分区的程序,具有互动式操作界面。参数-P表示显示分区表的内容,附加参数“s”会依照磁区的顺序显示相关信息。 查看软RAID cat /proc/mdstat 一条命令取机器IP地址,不同Linux稍有不同: netstat -ie|awk /broadcast/'{print $2}' netstat -ie|awk -F '[ :]+' /cast/'{print $4}' netstat -ie|awk -F '[ :]+' /cast/'{print $3}' 查看CPU mpstat -P ALL 1 mpstat -I SUM 1 查看网卡 ethtool eth0 查看网卡统计 ethtool -S eth1 查看网卡RingBuffer大小 ethtool -g eth1 查看流量 sar -n DEV 1 # 流量信息 sar -n EDEV 1 # 错误信息 中断相关 cat /proc/interrupts 查看网卡队列 grep eth1 /proc/interrupts |awk '{print $NF}' 查看中断亲和性(以中断74为例) cat /proc/irq/74/smp_affinity /proc/irq/ 该目录下存放的是以IRQ号命名的目录,如/proc/irq/40/表示中断号为40的相关信息 /proc/irq/[irq_num]/smp_affinity 该文件存放的是CPU位掩码(十六进制),修改该文件中的值可以改变CPU和某中断的亲和性 /proc/irq/[irq_num]/smp_affinity_list 该文件存放的是CPU列表(十进制),注意CPU核心个数用表示编号从0开始,如cpu0和cpu1等
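作为上面用netstat加awk取本机IP的补充,程序里可以直接用getifaddrs枚举各网卡的IPv4地址(示意):
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>

int main()
{
    struct ifaddrs* ifa_list = NULL;
    if (getifaddrs(&ifa_list) != 0)
    {
        perror("getifaddrs");
        return 1;
    }
    for (struct ifaddrs* ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next)
    {
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue; // 只看IPv4
        char ip[INET_ADDRSTRLEN] = { 0 };
        struct sockaddr_in* addr = (struct sockaddr_in*)ifa->ifa_addr;
        inet_ntop(AF_INET, &addr->sin_addr, ip, sizeof(ip));
        printf("%s: %s\n", ifa->ifa_name, ip); // 如:eth1: 192.168.31.15
    }
    freeifaddrs(ifa_list);
    return 0;
}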
1) 配置HDFS HttpFS和WebHDFS 如果HDFS是HA方式部署的,则只能使用HttpFS,而不能用WebHDFS。 2) 安装依赖: apr-iconv-1.2.1 confuse-3.0 apr-util-1.5.4 libpng-1.6.26 apr-1.5.2 expat-2.2.0 pcre-8.38 libxml2-devel libxslt-devel sqlite-devel 。。。。。。 3) 编译安装Hue 解压Hue安装包,然后执行 make install PREFIX=/usr/local 进行安装! 可以考虑修改下Makefile.vars.priv中的INSTALL_DIR值为$(PREFIX),而不是默认的$(PREFIX)/hue, 这样改为执行: make install PREFIX=/usr/local/hue-3.11.0 带上版本号是个好习惯,安装好后再建一个软链接,如:ln -s /usr/local/hue-3.11.0 /usr/local/hue。 编译安装过程中最常遇到的是缺乏依赖库,只需要按提示进行补充然后重复继续即可。 4) 修改desktop/conf/hue.ini A) [desktop] I) 为secret_key指定一个值,如ABC123,可以不指定,但Hue Web将不能保持会话。 II) 修改http_port为Web端口,如80或8080等。 III) 建议time_zone为北京时区Asia/Shanghai B ) [[hdfs_clusters]] I) 修改fs_defaultfs的值为core-site.xml中的fs.defaultFS的值 II) logical_name值HDFS集群名 III) webhdfs_url值为http://$host:14000/webhdfs/v1,其中“$host”值需为提供HttpFS服务的IP或主机名 IV) 修改hadoop_conf_dir的值为hadoop配置目录路径 C) [[yarn_clusters]] I) 修改resourcemanager_host值为主ResourceManager的IP地址(默认为8032端口所在地址), 注意不能为备ResourceManager的IP,原因是备ResourceManager不会打开端口8032。 II) 修改logical_name值为集群名。 III) 修改resourcemanager_api_url的值,将localhost替换成ResourceManager的8088端口地址。 D) [hbase] I) 修改hbase_conf_dir为HBase的配置目录路径 II) 修改thrift_transport为HBase Thrift2 Server采用的Transport,两者必须一致。 III) 注意截止hue-3.11.0版本,只支持HBase ThriftServer,而不支持HBase Thrift2Server 因此hbase_clusters的值要配置指向ThriftServer,其中Cluster可以为其它自定义值,只是为在Web上显示, Cluster后面的值必须为HBase ThriftServer的服务地址和端口。 如果需要同时运行HBase ThriftServer和HBase Thrift2Server,请为两者指定不同的服务端口和信息端口。 E) [beeswax] 修改hive_conf_dir为Hive的配置目录路径。 5) 启动Hue 进入Hue的build/env/bin目录,然后执行supervisor即可启动Hue服务。 6) 打开Web 假设Hue安装在192.168.1.22,服务端口号为8080,则只需要在浏览器中输入:http://192.168.1.22:8080即可进入Hue Web界面。 如果是第一次运行,则必须先创建好管理员帐号才能进入。 如果遇到错误,则可以检查Hue的错误日志文件error.log来了解是什么错误。 Hue ERROR日志: 1) Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException is not allowed to impersonate (error 403) 一般是因为core-site.xml或httpfs-site.xml没配置正确。 //////////////////////////// 附)配置HDFS HttpFS和WebHDFS HDFS支持两种RESTful接口:WebHDFS和HttpFS。 WebHDFS默认端口号为50070,HttpFS默认端口号为14000。 默认启动WebHDFS而不会启动HttpFS,而HttpFS需要通过sbin/httpfs.sh来启动。 WebHDFS模式客户端和DataNode直接交互,HttpFS是一个代理模式。如果HDFS是HA方式部署的,则只能使用HttpFS模式。 HttpFS是独立的模块,有自己的配置文件httpfs-site.xml、环境配置文件httpfs-env.sh和日志配置文件httpfs-log4j.properties,需独立启动。 而WebHDFS是HDFS内置模块,无自己的配置文件、环境配置文件和日志配置文件,随HDFS而启动。 WebHDFS配置,在core-site.xml中加入以下内容: hadoop.proxyuser.$username.hosts * hadoop.proxyuser.$groupname.groups * “$username”的值为启动HDFS的用户名,“$groupname”为启动HDFS的用户组名。 HttpFS配置,在core-site.xml中加入以下内容: hadoop.proxyuser.httpfs.hosts * hadoop.proxyuser.httpfs.groups * 对于HttpFS,还需要在httpfs-site.xml中加入以下内容: httpfs.proxyuser.$username.hosts * httpfs.proxyuser.$groupname.groups * “$username”的值为启动HttpFS的用户名,“$groupname”为启动HttpFS的用户组名。 环境配置文件httpfs-env.sh可以不用修改,直接使用默认的值,当使用sbin/httpfs.sh来启动HttpFS时会屏幕输出HTTPFS_HOME等值。
编译hbase-1.2.3源代码.pdf 目录 目录 1 1. 约定 1 2. 安装jdk 1 3. 安装maven 1 4. 网络配置 2 4.1. eclipse 3 4.2. maven 3 5. 从hbase官网下载源代码包: 4 6. eclipse导入hbase源代码 4 7. 编译hbase-thrift 6 8. Problems opening an editor ... does not exist 10 9. hbase-common 11 1. 约定 确保机器可以正常访问Internet,如能正常访问https://repo.maven.apache.org等,如果是代理方式则需要设置好eclipse和maven的网络配置。 本文环境为64位版本Windows7,jre安装目录为C:\java\jdk1.8.0_111,jdk安装目录为C:\java\jre1.8.0_111。 最好将jre安装在在jdk目录下,否则编译时会遇到“Could not find artifact jdk.tools:jdk.tools:jar”错误。将jre安装在jdk目录下的目的是使得jre的上一级存在jdk的lib目录。 2. 安装jdk 略!安装好后请设置环境变量JAVA_HOME为jdk的安装目录(不是javac所在的bin目录,而是bin的上一级目录)。 3. 安装maven 从maven官网下载安装包(本文下载的是apache-maven-3.3.9-bin.zip): https://maven.apache.org/download.cgi 解压后,将maven的bin目录加入到环境变量PATH中,本文对应的目录为C:\Program Files\apache-maven-3.3.9\bin。并设置环境变量M2_HOME的值为maven的安装目录,对于本文M2_HOME值为C:\Program Files\apache-maven-3.3.9 然后设置eclipse使用外部的maven,进入eclipse的Preferences中按下图进行设置: 4. 网络配置 确保机器可以正常访问Internet,否则大量问题难以解决。如果是通过代理才能访问,则需要为eclipse和maven配置好代理。 4.1. eclipse 4.2. maven 编辑$HOME/.m2目录下的settings.xml,如果不存在该文件,则复制$MAVEN_HOME/conf目录下的settings.xml,然后再修改即可同。 MAVEN_HOME为maven的安装目录,$HOME/.m2为repository的默认目录,HOME为Windows用户目录,Windows7上假设用户名为mike则HOME为C:\Users\mike。 假设代理服务器的地址和端口分别为:proxy.oa.com和8080,则(不需要用户名或密码,则相应的值不设置即可): http-proxy true http proxy.oa.com 8080 local|127.0.0.1 https-proxy true https proxy.oa.com 8080 local|127.0.0.1 5. 从hbase官网下载源代码包: 以下网站均提供hbase源代码包下载: http://mirrors.hust.edu.cn/apache/hbase/ https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/ http://mirror.bit.edu.cn/apache/hbase/ http://apache.fayea.com/hbase/ 本文下载的是hbase-1.2.3-src.tar.gz。 6. eclipse导入hbase源代码 本文使用的eclipse版本: 将hbase-1.2.3-src.tar.gz解压,本文将其解压到目录E:\bigdata\hbase-1.2.3-src,然后以“Existing Maven Projects”方式导入: 成功导入后如下图所示: 以maven编译hbase源代码,编译整个hbase容易遇到错误,比如编译hbase-common需要安装bash,hbase-thrift、但hbase-server、hbase-client等模块不依赖bash。为简单体验,先定一个小目标:编译hbase-thrift模块: 7. 编译hbase-thrift 鼠标右击hbase-thrift,按下图进入设置界面: 设置界面如下图所示,并设置Goals为clean install -DskipTests(注意不是clean install,需要加上-DskipTests,否则即使勾选了Skip Tests也可能无效): 然后点击“Run”即开始编译! 编译过程中如遇到下面的错误,请确认是否存在目录C:\java\jre1.8.0_111/../lib,其用意是jre安装在jdk的目录下,也就是说lib需要为jdk的lib目录。 简单的做法是复制jdk的lib目录到C:\java目录下。 [ERROR] Failed to execute goal on project hbase-thrift: Could not resolve dependencies for project org.apache.hbase:hbase-thrift:jar:1.2.3: Could not find artifact jdk.tools:jdk.tools:jar:1.8 at specified path C:\java\jre1.8.0_111/../lib/tools.jar -> [Help 1] 成功后如下图所示: 在目录E:\bigdata\hbase-1.2.3-src\hbase-thrift\target下可以看到编译生成的jar文件: 然后可以编译hbase-client,如果需要编译hadoop-common则需要安装bash先,也就是得安装cygwin(https://cygwin.com/install.html)。 建议从国内镜像安装cgywin,会快很多,可用镜像: http://mirrors.163.com/cygwin/ http://www.cygwin.cn/pub/ 选择从互联网安装,在“User URL”处输入国内镜像网址。 安装好cygwin后,需将cgywin的bin目录加入到环境变量PATH中,并需要重启eclipse才会生效。如果未安装bash,则用同样方法编译hadoop-common时,会报如下错误: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.6:run (generate) on project hbase-common: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "bash": CreateProcess error=2, 系统找不到指定的文件。 -> [Help 1] 8. Problems opening an editor ... 
does not exist 在eclipse里用F3想进入某个类的某方法时,提示以下错误(Problems opening an editor Reason: [项目名] does not exist): 解决办法(目的是生成“.project”和“.classpath”两个eclipse需要的文件): 按下图所示,进入项目的根目录,以hbase的hbase-thrift为例,如hbase-thrift所在目录为E:\bigdata\hbase-1.2.3-src\hbase-thrift,注意不是E:\bigdata\hbase-1.2.3-src,然后执行:mvn eclipse:eclipse,成功后重启eclipse上述问题即解决(mvn eclipse:eclipse的作用是将maven项目转化为eclipse项目,即生成两个eclipse导入所需的配置文件,并无其他改变,也就是生成eclipse需要的.project和.classpath两个文件): 其它诸于hbase-client、hbase-common、h base-server等同样处理即可。 9. hbase-common 编译hbase-common如遇到下述问题: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hbase-common: Compilation failure: Compilation failure: [ERROR] E:\bigdata\hbase-1.2.3-src\hbase-common\target\generated-sources\java\org\apache\hadoop\hbase\package-info.java:[5,39] 错误: 非法转义符 [ERROR] E:\bigdata\hbase-1.2.3-src\hbase-common\target\generated-sources\java\org\apache\hadoop\hbase\package-info.java:[5,30] 错误: 未结束的字符串文字 [ERROR] E:\bigdata\hbase-1.2.3-src\hbase-common\target\generated-sources\java\org\apache\hadoop\hbase\package-info.java:[6,0] 错误: 需要class, interface或enum [ERROR] E:\bigdata\hbase-1.2.3-src\hbase-common\target\generated-sources\java\org\apache\hadoop\hbase\package-info.java:[6,9] 错误: 需要class, interface或enum [ERROR] -> [Help 1] 打开package-info.java: /* * Generated by src/saveVersion.sh */ @VersionAnnotation(version="1.2.3", revision="Unknown", user="mooon\mike ", date="Tue Oct 25 18:02:39 2016", url="file:///cygdrive/e/bigdata/hbase-1.2.3-src", srcChecksum="88f3dc17f75ffda6176faa649593b54e") package org.apache.hadoop.hbase; ,可以看到问题出在“mike”后多了一个换行符,正常应当是: /* * Generated by src/saveVersion.sh */ @VersionAnnotation(version="1.2.3", revision="Unknown", user="mooon\mike", date="Tue Oct 25 18:02:39 2016", url="file:///cygdrive/e/bigdata/hbase-1.2.3-src", srcChecksum="88f3dc17f75ffda6176faa649593b54e") package org.apache.hadoop.hbase; 查看hadoop-common的saveVersion.sh,部分内容如下: unset LANG unset LC_CTYPE version=$1 outputDirectory=$2 pushd . cd .. user=`whoami` date=`date` cwd=`pwd` 问题就出在whoami命令返回了mooon\mike,并且mike后跟了一个换行符导致的,因此可以如下消灭多余的换行符: unset LANG unset LC_CTYPE version=$1 outputDirectory=$2 pushd . cd .. user=`whoami|awk '{printf("%s",$1);}'` date=`date` cwd=`pwd` 再次编译,仍然报错: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hbase-common: Compilation failure [ERROR] E:\bigdata\hbase-1.2.3-src\hbase-common\target\generated-sources\java\org\apache\hadoop\hbase\package-info.java:[5,39] 错误: 非法转义符 [ERROR] -> [Help 1] 再次打开package-info.java: /* * Generated by src/saveVersion.sh */ @VersionAnnotation(version="1.2.3", revision="Unknown", user="mooon\mike", date="Tue Oct 25 17:59:21 2016", url="file:///cygdrive/e/bigdata/hbase-1.2.3-src", srcChecksum="88f3dc17f75ffda6176faa649593b54e") package org.apache.hadoop.hbase; 问题出在“mooon\mike”间的斜杠,需要将单个斜杠改成双斜杠“mooon\\mike”或者干脆去掉“mooon\”仅保留“mike”也可以。 再次修改saveVersion.sh,直接写死user: unset LANG unset LC_CTYPE version=$1 outputDirectory=$2 pushd . cd .. user=mike date=`date` cwd=`pwd` 然后再次编译hadoop-common,终于成功了:
如果在运行spark-sql时遇到如下这样的错误,可能是因为yarn-site.xml中的配置项yarn.nodemanager.vmem-pmem-ratio值偏小,它的默认值为2.1,可以尝试改大一点再试。 ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! 16/10/13 10:23:19 ERROR client.TransportClient: Failed to send RPC 7614640087981520382 to /10.143.136.231:34800: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException 16/10/13 10:23:19 ERROR cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(0,0,Map()) to AM was unsuccessful java.io.IOException: Failed to send RPC 7614640087981520382 to /10.143.136.231:34800: java.nio.channels.ClosedChannelException at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:249) at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:233) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) at io.netty.util.concurrent.DefaultPromise$LateListeners.run(DefaultPromise.java:845) at io.netty.util.concurrent.DefaultPromise$LateListenerNotifier.run(DefaultPromise.java:873) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745)
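yarn.nodemanager.vmem-pmem-ratio在yarn-site.xml中的写法大致如下(数值仅为示例,请按实际情况调整):
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4.2</value>
</property>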
基于IP的频率限制是一种常见需求,利用Redis的key过期和原子加减两个特性,可以十分简单地实现。优点是可支持海量访问的频率控制:只需要增加Redis机器,单个Redis节点(只占用一个CPU core)即可支持10万/s以上的处理。
做法是以IP作为key,key的过期时长即为限频的时间窗口,比如限制单个IP在2秒内最多访问1000次,则key过期时长为2秒。基于r3c(a Redis Cluster C++ Client)的实现大致如下:
r3c::CRedisClient redis("127.0.0.1:6379,127.0.0.1:6380");
int ret = redis.incrby(ip, 1);
if (ret > 1000) // 超过频率
{
}
else // 访问放行
{
    if (1 == ret)
        redis.expire(ip, 2); // 频率控制为2秒内1000次访问
}
完整示例:
// https://github.com/eyjian/r3c
#include <r3c/r3c.h>
#include <stdio.h>

int main()
{
    std::string ip = "127.0.0.1";
    r3c::CRedisClient redis("10.223.25.102:6379");
    r3c::set_debug_log_write(NULL);

    for (int i=0; i<100000; ++i)
    {
        // r3c基于redis的EVAL命令提供了一个带过期参数的incrby,
        // 可避免incrby和expire两次操作不是原子的、expire可能调用不成功的问题。
        int ret = redis.incrby(ip, 1);
        if (ret > 1000) // 限制单个IP每2秒最多访问1000次
        {
            printf("[OVER] 超过频率,限制访问\n");
        }
        else
        {
            if (1 == ret)
            {
                redis.expire(ip, 2); // 频率设定为2秒
                printf("[FIRST] 第一次,访问放行\n");
            }
            else
            {
                printf("[OK] 访问放行\n");
            }
        }
    }

    redis.del(ip);
    return 0;
}
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
// g++ -g -o x x.cpp -D__STDC_FORMAT_MACROS -std=c++11

int main()
{
    int64_t a = 32;

    //printf("%"PRId64"\n", a);
    printf("%" PRId64"\n", a); // 在PRId64前保留一个空格
    // 如果不保留空格,则C++11编译时将报如下警告:
    // invalid suffix on literal; C++11 requires a space between literal and identifier [-Wliteral-suffix]
    // 将PRId64换成其它宏,情况相同
    return 0;
}