Flume（二）【Flume 进阶使用】（2）-阿里云开发者社区

Flume（二）【Flume 进阶使用】（2）

2024-06-07 65 发布于黑龙江

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

本文涉及的产品

日志服务 SLS，月写入数据量 50GB 1个月

简介： Flume（二）【Flume 进阶使用】

Flume（二）【Flume 进阶使用】（1）https://developer.aliyun.com/article/1532352

3）需求实现

flume-file-flume.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
 
# 将数据流复制给所有 channel 默认就是 replicating 所以也可以不用配置
a1.sources.r1.selector.type = replicating 
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive-3.1.2/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
 
# Describe the sink
# sink 端的 avro 是一个数据发送者
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
 
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
 
# Bind the source and sink to the channel
# 一个 sink 只可以指定一个 channel，但是一个 channel 可以指定多个 sink
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

flume-hdfs.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
 
# Describe/configure the source
# source 端的 avro 是一个数据接收服务
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
 
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:9820/flume2/%Y%m%d/%H
#上传文件的前缀
a2.sinks.k1.hdfs.filePrefix = flume2-
#是否按照时间滚动文件夹
a2.sinks.k1.hdfs.round = true
#多少时间单位创建一个新的文件夹
a2.sinks.k1.hdfs.roundValue = 1
#重新定义时间单位
a2.sinks.k1.hdfs.roundUnit = hour
#是否使用本地时间戳
a2.sinks.k1.hdfs.useLocalTimeStamp = true
#积攒多少个 Event 才 flush 到 HDFS 一次
a2.sinks.k1.hdfs.batchSize = 100
#设置文件类型，可支持压缩
a2.sinks.k1.hdfs.fileType = DataStream
#多久生成一个新的文件
a2.sinks.k1.hdfs.rollInterval = 30
#设置每个文件的滚动大小大概是 128M
a2.sinks.k1.hdfs.rollSize = 134217700
#文件的滚动与 Event 数量无关
a2.sinks.k1.hdfs.rollCount = 0
 
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
 
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

flume-dir.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
 
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
 
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/flume3
 
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
 
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

4）测试

bin/flume-ng agent -c conf/ -n a3 -f job/group1/flume-dir.conf
bin/flume-ng agent -n a1 -c conf/ -f job/group1/flume-file-flumc.conf
bin/flume-ng agent -n a2 -c conf/ -f job/group1/flume-hdfs.conf

查看结果：

注意：写入本地文件时，当一段时间没有新的日志时，它仍然会创建一个新的文件，而不像 hdfs sink 即使达到了设置的间隔时间但是没有新日志产生，那么它也不会创建一个新的文件。

这个需要注意的就是 hdfs 的端口不要写错，比如我的就不是 9870 而是 8020.

4.2、负载均衡和故障转移

1）案例需求

使用 Flume1 监控一个端口，其 sink 组中的 sink 分别对接 Flume2 和 Flume3，采用 FailoverSinkProcessor，实现故障转移的功能。

2）需求分析

开启一个端口 88888 来发送数据
使用 flume-1 监听该端口，并发送到 flume-2 和 flume-3 （需要 flume-1 的 sink 为 avro sink，flume-2 和 flume-3 的 source 为 avro source），flume-2 和 flume-3 发送日志到控制台（flume-2 和 flume-3 的 sink 为 logger sink）

3）需求实现

flume-nc-flume.conf

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
 
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
 
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
 
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
 
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

flume-flume-console1.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
 
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
 
# Describe the sink
a2.sinks.k1.type = logger
 
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
 
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

flume-flume-console2.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
 
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
 
# Describe the sink
a3.sinks.k1.type = logger
 
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
 
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

4）案例测试

bin/flume-ng agent -c conf/ -n a3 -f job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a2 -f job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a1 -f job/group2/flume-nc-flume.conf

关闭 flume-flume-console1.conf 作业

我们发现，一开始我们开启三个 flume 作业，当向 netcat 输入数据时，只有 flume-flume-console1.conf 作业的控制台有日志输出，这是因为它的优先级更高，当把作业 flume-flume-console1.conf 关闭时，再次向端口 44444 发送数据，发现 flume-flume-console2.conf 作业开始输出。

如果要使用负载均衡，只需要替换上面 flume-nc-flume.conf 中：

a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

替换为：

a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.maxTimeOut = 30000

其中，backoff 代表退避，默认为 false，如果当前 sink 没有拉到数据，那么接下来一段时间就不用这个 sink 。maxTimeOut 代表最大的退避时间，因为退避默认是指数增长的（比如一个 sink 第一次没有拉到数据，需要等 1 s，第二次还没拉到，等 2s，第三次等 4s ...），默认最大值为 30 s。

4.3、聚合

1）案例需求

hadoop102 上的 Flume-1 监控文件/opt/module/group.log，
hadoop103 上的 Flume-2 监控某一个端口的数据流，
Flume-1 与 Flume-2 将数据发给 hadoop104 上的 Flume-3，Flume-3 将最终数据打印到控制台。

注意：主机只能在 hadoop104 上配，因为 avro source 在 hadoop104 上，客户端（hadoop02 和 hadoop103 的 sink）可以远程连接，但是服务端（hadoop104 的 source）只能绑定自己的端口号。

Flume（二）【Flume 进阶使用】（3）https://developer.aliyun.com/article/1532355

Flume（二）【Flume 进阶使用】（2）

3）需求实现

flume-file-flume.conf

flume-hdfs.conf

flume-dir.conf

4）测试

4.2、负载均衡和故障转移

1）案例需求

2）需求分析

3）需求实现

flume-nc-flume.conf

flume-flume-console1.conf

flume-flume-console2.conf

4）案例测试

4.3、聚合

1）案例需求

热门文章

最新文章

相关课程

相关电子书

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Flume（二）【Flume 进阶使用】（2）

3）需求实现

flume-file-flume.conf

flume-hdfs.conf

flume-dir.conf

4）测试

4.2、负载均衡和故障转移

1）案例需求

2）需求分析

3）需求实现

flume-nc-flume.conf

flume-flume-console1.conf

flume-flume-console2.conf

4）案例测试

4.3、聚合

1）案例需求

热门文章

最新文章

相关课程

相关电子书