Big Data Technology: Flume (Part 3) - Alibaba Cloud Developer Community

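3.4.2 Load Balancing and Failover

1) Requirements

Flume1 monitors a port, and the sinks in its sink group connect to Flume2 and Flume3 respectively. A FailoverSinkProcessor provides failover between the two.

2) Requirements analysis

3) Implementation steps

(1) Preparation

Create a group2 folder under /opt/module/flume/job.

[atguigu@hadoop102 job]$ mkdir group2
[atguigu@hadoop102 job]$ cd group2/

(2) Create flume-netcat-flume.conf

Configure 1 netcat source, 1 channel, and 1 sink group (2 sinks) that deliver events to flume-flume-console1 and flume-flume-console2 respectively.

Edit the Flume1 configuration file:

[atguigu@hadoop102 group2]$ vim flume-netcat-flume.conf

Add the following content: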

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink group (failover processor)
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

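Saodai's notes:

a1.sinkgroups.g1.processor.type = failover: tells Flume to use the failover processor for this sink group. When the active sink becomes unavailable, the processor automatically switches to a standby sink.

a1.sinkgroups.g1.processor.priority.k1 = 5: sets the priority of sink k1. A larger value means a higher priority, so in this example k1 is the standby sink.

a1.sinkgroups.g1.processor.priority.k2 = 10: sets the priority of sink k2. Its value is larger, so k2 is the primary sink and receives events as long as it is available.

a1.sinkgroups.g1.processor.maxpenalty = 10000: the upper limit, in milliseconds, of the backoff period applied to a failed sink. A failed sink is skipped while it is penalized and retried afterwards; it is not abandoned permanently.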
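The priority and maxpenalty settings described above can be illustrated with a small, self-contained model. This is only a sketch and does not use any Flume classes: the class name FailoverModel, its methods, and the doubling backoff that starts at an assumed 1 second are illustrative assumptions, not Flume's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of failover sink selection (illustrative, not Flume code).
class FailoverModel {
    private final Map<String, Integer> priorities = new HashMap<>();
    // Timestamp (ms) until which a failed sink is skipped
    private final Map<String, Long> penalizedUntil = new HashMap<>();
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final long maxPenaltyMs;

    FailoverModel(long maxPenaltyMs) {
        this.maxPenaltyMs = maxPenaltyMs;
    }

    void addSink(String name, int priority) {
        priorities.put(name, priority);
    }

    // Record a failure: the backoff doubles per failure (assumed 1 s base)
    // and is capped at maxPenaltyMs.
    void markFailed(String name, long nowMs) {
        int failures = failureCounts.merge(name, 1, Integer::sum);
        long backoff = Math.min(maxPenaltyMs, 1000L << Math.min(failures - 1, 20));
        penalizedUntil.put(name, nowMs + backoff);
    }

    // Pick the highest-priority sink that is not penalized; null if none is available.
    String pick(long nowMs) {
        String best = null;
        for (Map.Entry<String, Integer> entry : priorities.entrySet()) {
            long until = penalizedUntil.getOrDefault(entry.getKey(), 0L);
            if (until > nowMs) {
                continue; // still in its backoff period
            }
            if (best == null || entry.getValue() > priorities.get(best)) {
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

With k1 = 5 and k2 = 10, pick returns k2 until k2 is marked as failed, returns k1 while k2 is penalized, and returns k2 again once the penalty has expired.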

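(3) Create flume-flume-console1.conf

Configure a source that receives the output of the upstream Flume, with a sink that writes to the local console.

Edit the Flume2 configuration file:

[atguigu@hadoop102 group2]$ vim flume-flume-console1.conf

Add the following content: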

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

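(4) Create flume-flume-console2.conf

Configure a source that receives the output of the upstream Flume, with a sink that writes to the local console.

Edit the Flume3 configuration file:

[atguigu@hadoop102 group2]$ vim flume-flume-console2.conf

Add the following content:
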
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

(5) Run the configuration files

Start the agents in this order: flume-flume-console2, flume-flume-console1, flume-netcat-flume.

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

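(6) Use netcat to send content to port 44444 on the local machine

$ nc localhost 44444

(7) Check the console logs of Flume2 and Flume3

(8) Kill the agent that is currently receiving events and watch the other agent's console take over. With the priorities configured above, k2 (Flume3) has the higher priority and is active first, so kill Flume3 and observe Flume2.

Note: use jps -ml to find the Flume processes.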

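3.4.3 Aggregation

1) Requirements

Flume-1 on hadoop102 monitors the file /opt/module/group.log, and Flume-2 on hadoop103 monitors the data stream on a port. Flume-1 and Flume-2 send their data to Flume-3 on hadoop104, which prints the final data to the console.

2) Requirements analysis

Figure: multi-source aggregation case

3) Implementation steps

(1) Preparation

Distribute Flume:

[atguigu@hadoop102 module]$ xsync flume

Create a group3 folder under /opt/module/flume/job on hadoop102, hadoop103, and hadoop104.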

[atguigu@hadoop102 job]$ mkdir group3
[atguigu@hadoop103 job]$ mkdir group3
[atguigu@hadoop104 job]$ mkdir group3

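(2) Create flume1-logger-flume.conf

Configure a source that monitors the group.log file and a sink that sends the data to the next-level Flume. Edit the configuration file on hadoop102:

[atguigu@hadoop102 group3]$ vim flume1-logger-flume.conf

Add the following content:
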
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

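(3) Create flume2-netcat-flume.conf

Configure a source that monitors the data stream on port 44444 and a sink that sends the data to the next-level Flume. Edit the configuration file on hadoop103:

[atguigu@hadoop103 group3]$ vim flume2-netcat-flume.conf

Add the following content: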

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

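(4) Create flume3-flume-logger.conf

Configure a source that receives the data streams sent by flume1 and flume2; the sink writes the merged stream to the console.

Edit the configuration file on hadoop104:

[atguigu@hadoop104 group3]$ touch flume3-flume-logger.conf
[atguigu@hadoop104 group3]$ vim flume3-flume-logger.conf

Add the following content: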

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

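Saodai's notes: for a3.sources.r1.bind = hadoop104 and a3.sources.r1.port = 4141, I first assumed that two hosts and two ports would be needed. Looking back at the two upstream Flume agents, both of their sinks point to the same host and port, so only one host and one port need to be configured here.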
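The point in the note above, that one listening host and port can serve several upstream senders, can be illustrated with plain sockets. This is a hedged sketch only: it uses raw TCP on the loopback interface rather than Avro RPC, and the class name FanInDemo and its collect method are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-TCP illustration (not Avro RPC): several senders, one listening port.
class FanInDemo {
    static List<String> collect(List<String> messages) {
        List<String> received = new ArrayList<>();
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port
        try (ServerSocket listener = new ServerSocket(0, 50, loopback)) {
            int port = listener.getLocalPort();
            // Each sender opens its own connection to the same host and port
            for (String message : messages) {
                try (Socket sender = new Socket(loopback, port)) {
                    OutputStream out = sender.getOutputStream();
                    out.write((message + "\n").getBytes(StandardCharsets.UTF_8));
                    out.flush();
                }
            }
            // The single listener accepts every queued connection in turn
            for (int i = 0; i < messages.size(); i++) {
                try (Socket conn = listener.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                    received.add(in.readLine());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return received;
    }
}
```

Calling FanInDemo.collect with one message per simulated upstream agent returns all of the messages, although only one port was opened on the receiving side.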

(5) Run the configuration files

Start the agents in this order: flume3-flume-logger.conf, flume2-netcat-flume.conf, flume1-logger-flume.conf.

[atguigu@hadoop104 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
[atguigu@hadoop103 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume2-netcat-flume.conf
[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume1-logger-flume.conf

(6) On hadoop102, append content to group.log under /opt/module

[atguigu@hadoop102 module]$ echo 'hello' >> group.log

(7) Send data to port 44444 on hadoop103

[atguigu@hadoop102 flume]$ telnet hadoop103 44444

(8) Check the data on hadoop104
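For step (6), test lines can also be appended to the monitored file from a program instead of the shell. The following is a minimal sketch under that assumption; the class name LogAppender and its appendLine method are hypothetical, and the file path is passed in by the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class LogAppender {
    // Append one line to the file, creating the file if it does not exist yet.
    // APPEND adds to the end instead of truncating the existing content.
    static void appendLine(Path file, String line) {
        byte[] bytes = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        try {
            Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, LogAppender.appendLine(Paths.get("/opt/module/group.log"), "hello") adds a new line that the tail -F command of the exec source can pick up.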

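3.5 Custom Interceptor

1) Requirements

Flume collects local server logs, and logs of different types must be sent to different analysis systems.

2) Requirements analysis

In real-world development, a single server may produce many types of logs, and different types may need to go to different analysis systems. This calls for the Multiplexing structure among Flume's topologies. Multiplexing routes each event to a Channel according to the value of a particular key in the event's Header, so we need a custom Interceptor that assigns different values to that Header key for different types of events.

In this case, port data simulates the logs, and whether a line contains "atguigu" simulates the log type. The custom interceptor checks whether the data contains "atguigu" so that the events can be sent to different analysis systems (Channels).

Figure: Interceptor and Multiplexing ChannelSelector case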
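The two-stage idea in the analysis above (an interceptor tags each event's header, then a selector maps the header value to a Channel) can be sketched without any Flume classes. This is an illustrative model only: MultiplexingModel, tag, and selectChannel are hypothetical names, and the mapping from header values to channel names is supplied by the caller.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Stand-alone model of "tag the header, then route by header value".
class MultiplexingModel {
    // Header key that both stages agree on
    static final String HEADER_KEY = "type";

    // Interceptor stage: set the header value based on the body content.
    static Map<String, String> tag(byte[] body) {
        Map<String, String> headers = new HashMap<>();
        String text = new String(body, StandardCharsets.UTF_8);
        headers.put(HEADER_KEY, text.contains("atguigu") ? "first" : "second");
        return headers;
    }

    // Selector stage: look up the channel mapped to the header value.
    // Returns null when the mapping has no entry for that value.
    static String selectChannel(Map<String, String> headers, Map<String, String> mapping) {
        return mapping.get(headers.get(HEADER_KEY));
    }
}
```

With a mapping of first to c1 and second to c2, a body containing "atguigu" is routed to c1 and any other body to c2. Both stages must use the same header key, otherwise the lookup finds nothing.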

3) Implementation steps

(1) Create a Maven project and add the following dependency.

<dependency>
     <groupId>org.apache.flume</groupId>
     <artifactId>flume-ng-core</artifactId>
     <version>1.9.0</version>
</dependency>

(2) Define the TypeInterceptor class, which implements the Interceptor interface.

package com.atguigu.interceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
public class TypeInterceptor implements Interceptor {
     // Holds the events after their headers have been set
     private List<Event> addHeaderEvents;
     @Override
     public void initialize() {
         // Initialize the list that holds the events
         addHeaderEvents = new ArrayList<>();
     }
     // Intercept a single event
     @Override
     public Event intercept(Event event) {
         // 1. Get the headers of the event
         Map<String, String> headers = event.getHeaders();
         // 2. Get the body of the event (decoded as UTF-8)
         String body = new String(event.getBody(), StandardCharsets.UTF_8);
         // 3. Choose the header value based on whether the body contains "atguigu"
         if (body.contains("atguigu")) {
             // 4. Add the header
             headers.put("type", "first");
         } else {
             // 4. Add the header
             headers.put("type", "second");
         }
         return event;
     }
     // Intercept a batch of events
     @Override
     public List<Event> intercept(List<Event> events) {
         // 1. Clear the list
         addHeaderEvents.clear();
         // 2. Iterate over the events
         for (Event event : events) {
             // 3. Add the header to each event
             addHeaderEvents.add(intercept(event));
         }
         // 4. Return the result
         return addHeaderEvents;
     }
     @Override
     public void close() {
     }
     public static class Builder implements Interceptor.Builder {
         @Override
         public Interceptor build() {
             return new TypeInterceptor();
         }
         @Override
         public void configure(Context context) {
         }
     }
}

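Saodai's notes: the key (K) in headers.put("type", "first"); and headers.put("type", "second"); must be the same, because the Multiplexing configuration later routes events to different Channels based on the value stored under that key. It corresponds to a1.sources.r1.selector.header = type in the flume configuration file below.

The following code is easy to forget when implementing an interceptor. Flume uses this static nested class to construct the interceptor object: it is the $Builder part of com.atguigu.interceptor.TypeInterceptor$Builder in the flume configuration file, where $Builder refers to the interceptor's static nested class.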

     public static class Builder implements Interceptor.Builder {
         @Override
         public Interceptor build() {
             return new TypeInterceptor();
         }
         @Override 
         public void configure(Context context) {
         }
     }
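The $Builder suffix mentioned above is how Java names a nested class in its binary name (Outer$Nested). The sketch below shows, under simplified assumptions, how a class name taken from a configuration string can be turned into an object by reflection. TypeTagger and BuilderLoader are hypothetical stand-ins and do not depend on Flume; Flume's own loading code may differ in detail.

```java
// Stand-in for an interceptor with a static nested Builder (illustrative only).
class TypeTagger {
    String describe() {
        return "built via Builder";
    }

    public static class Builder {
        public TypeTagger build() {
            return new TypeTagger();
        }
    }
}

class BuilderLoader {
    // Load a builder class by its binary name, instantiate it with its
    // no-argument constructor, and call build() on it.
    static Object buildFrom(String binaryName) {
        try {
            Class<?> builderClass = Class.forName(binaryName);
            Object builder = builderClass.getDeclaredConstructor().newInstance();
            return builderClass.getMethod("build").invoke(builder);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot build from " + binaryName, e);
        }
    }
}
```

TypeTagger.Builder.class.getName() ends in TypeTagger$Builder, and passing that string to BuilderLoader.buildFrom returns a TypeTagger instance. If the nested Builder class is missing, Class.forName fails, which is why forgetting it breaks the configuration.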

(3) Edit the flume configuration file

Configure Flume1 on hadoop102 with 1 netcat source, 2 channels, and 2 avro sinks, together with the corresponding ChannelSelector and interceptor.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Interceptor and Multiplexing channel selector configuration
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.first = c1
a1.sources.r1.selector.mapping.second = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

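Saodai's notes:

a1.sources.r1.interceptors = i1 declares the interceptor's alias. Several aliases can be listed, and the lines below refer to them.

a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TypeInterceptor$Builder

specifies which interceptor to use; Flume loads it by reflection from the fully qualified class name.

a1.sources.r1.selector.type = multiplexing defines how the Source distributes Events to Channels: by the value of a key in the event header, which is why the interceptor is needed.

a1.sources.r1.selector.header = type names the key in the event header; it must match the key set by the interceptor.

a1.sources.r1.selector.mapping.first = c1

a1.sources.r1.selector.mapping.second = c2

map the header values to Channels: events whose type is first go to c1, and events whose type is second go to c2.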
