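Installing Flume 1.9.0 on Ubuntu 18.04, with Related Concepts
Introduction to Flume
Official site: http://flume.apache.org/index.html
A highly recommended Chinese translation of the documentation: https://flume.liyifeng.org/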
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log
data. It has a simple and flexible architecture based on streaming
data flows. It is robust and fault tolerant with tunable reliability
mechanisms and many failover and recovery mechanisms. It uses a simple
extensible data model that allows for online analytic application.
In short: Flume collects, aggregates, and moves large volumes of log data through a simple architecture built on streaming data flows, with tunable reliability and built-in failover and recovery mechanisms.
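Choosing a Version
The stable Flume 1.9.0 release came out in January 2019 and was the latest version at the time of writing. Another reason for choosing it is that Flume is backward-compatible within the 1.x line. From the release announcement: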
January 8, 2019 - Apache Flume 1.9.0 Released
The Apache Flume team is pleased to announce the release of Flume
1.9.0.
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of
streaming event data.
Version 1.9.0 is the eleventh Flume release as an Apache top-level
project. Flume 1.9.0 is stable, production-ready software, and is
backwards-compatible with previous versions of the Flume 1.x codeline.
Several months of active development went into this release: about 70
patches were committed since 1.8.0, representing many features,
enhancements, and bug fixes. While the full change log can be found on
the 1.9.0 release page (link below), here are a few new feature
highlights:
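Download, Installation, and Testing
Download page:
http://flume.apache.org/download.html
The file downloaded for this article is apache-flume-1.9.0-bin.tar.gz
Note: verifying the file is optional.
To verify the file, open an online checksum site: http://www.metools.info/code/c92.html
Upload the Flume 1.9.0 archive and select sha512 to compute its SHA-512 value: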
a989c50389c779dd7554c98bdba687fa982d6079d308c85ac210d3e523aa54b4b7452f38fe30d9acdac327080fe316d604b5efb0f3943cbacb4839fb2261f535
Then open the published checksum file (click the apache-flume-1.9.0-bin.tar.gz.sha512 link on the download page), which shows:
a989c50389c779dd7554c98bdba687fa982d6079d308c85ac210d3e523aa54b4b7452f38fe30d9acdac327080fe316d604b5efb0f3943cbacb4839fb2261f535
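Compare this published value with the one computed in the previous step. If the two match, the downloaded file has not been corrupted or tampered with.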
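The same comparison can be done locally instead of uploading the archive to a website. The following is a minimal sketch, assuming the GNU coreutils `sha512sum` tool that ships with Ubuntu; `verify_sha512` is a hypothetical helper name introduced here, not part of Flume.

```shell
#!/bin/sh
# verify_sha512 FILE EXPECTED_HEX   (hypothetical helper, not part of Flume)
# Returns 0 when the SHA-512 digest of FILE equals EXPECTED_HEX.
verify_sha512() {
    file=$1
    # Normalise the expected digest to lowercase, the case sha512sum prints.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # sha512sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha512sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Usage: paste the value from the published .sha512 file into PUBLISHED_SHA512.
# verify_sha512 apache-flume-1.9.0-bin.tar.gz "$PUBLISHED_SHA512" && echo "checksum OK"
```

The function only reports success or failure through its exit status, so it can be chained with `&&` in front of the extraction step.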
Installation:
Flume's runtime environment requires:
System Requirements
Java Runtime Environment - Java 1.8 or later
Memory - Sufficient memory for configurations used by sources, channels or sinks
Disk Space - Sufficient disk space for configurations used by channels or sinks
Directory Permissions - Read/Write permissions for directories used by agent
In short: JDK 1.8 or later, enough memory, enough disk space, and read/write permissions on the directories the agent uses.
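The Java requirement is easy to check from a script. The sketch below assumes the two common version-string schemes (`1.8.0_292` for Java 8 and `11.0.13` for later releases); `java_major` is a hypothetical helper name, and the sample version strings are illustrative only.

```shell
#!/bin/sh
# java_major VERSION_STRING   (hypothetical helper)
# Prints the major Java version for either numbering scheme.
java_major() {
    case $1 in
        # Legacy scheme: "1.8.0_292" means Java 8.
        1.*) v=${1#1.}; echo "${v%%.*}" ;;
        # Modern scheme: "11.0.13" means Java 11.
        *)   echo "${1%%[.+-]*}" ;;
    esac
}

java_major "1.8.0_292"   # prints 8
java_major "11.0.13"     # prints 11

# Usage against the installed JVM (java -version writes to stderr):
# ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
# [ "$(java_major "$ver")" -ge 8 ] && echo "Java version is sufficient"
```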
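Once the environment is ready, extract apache-flume-1.9.0-bin.tar.gz to the target path,
for example so that it ends up in /home/hadoop/opt/app/apache-flume-1.9.0-bin on Ubuntu.
Change to the directory that contains apache-flume-1.9.0-bin.tar.gz and run the following command to extract it: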
tar -zxf apache-flume-1.9.0-bin.tar.gz -C /home/hadoop/opt/app
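Note that `tar -C` expects the target directory to exist already. A small sketch that creates it first, assuming the same archive name and target path as above; `extract_to` is a hypothetical helper name.

```shell
#!/bin/sh
# extract_to ARCHIVE TARGET_DIR   (hypothetical helper)
# Creates TARGET_DIR if needed, then unpacks the gzip-compressed tarball into it.
extract_to() {
    mkdir -p "$2" && tar -zxf "$1" -C "$2"
}

# Usage with the paths from this article:
# extract_to apache-flume-1.9.0-bin.tar.gz /home/hadoop/opt/app
```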
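View the resulting directory structure:
Testing Flume:
Create an empty file named netcat2logger.conf in the conf directory under the installation directory:
cd /home/hadoop/opt/app/apache-flume-1.9.0-bin/conf
vi netcat2logger.conf
Add the following content: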
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 6666

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
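The configuration above defines an agent named a1. Its source listens on port 6666 and reads the data sent to that port, its channel buffers events in memory, and its sink is of type logger. For the meaning of each property, see the official documentation or leave a comment.
Run the following command from the Flume installation directory to start collecting data: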
$ bin/flume-ng agent -n a1 -c conf -f conf/netcat2logger.conf -Dflume.root.logger=INFO,console
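flume-ng agent: starts a Flume agent; ng stands for "next generation" (Flume NG), the 1.x line of Flume.
-n a1: -n sets the agent name; a1 matches the agent name used in the configuration file.
-c conf: the directory that holds Flume's configuration files.
-f conf/netcat2logger.conf: the path to the custom collection configuration file.
-Dflume.root.logger=INFO,console: sets the log level to INFO and sends log output to the console.
After running the command, the output looks like this: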
hadoop@master:~/opt/app/apache-flume-1.9.0-bin$ bin/flume-ng agent -n a1 -c conf -f conf/netcat2logger.conf -Dflume.root.logger=INFO,console
... ...
2021-11-18 10:15:07,052 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:155)] Source starting
2021-11-18 10:15:07,079 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:166)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:6666]
This shows that NetcatSource has started successfully.
Now open a new terminal and send socket data with telnet or nc.
Run telnet 127.0.0.1 6666, then type helloworld; the response OK appears.
hadoop@master:~$ telnet 127.0.0.1 6666
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
helloworld
OK
Switch back to the terminal where flume-ng was started; helloworld has been written to the console:
2021-11-18 10:22:44,799 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 77 6F 72 6C 64 0D helloworld. }
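The logger sink prints the event body as hex bytes followed by a text rendering. To make the mapping concrete, here is a small sketch that decodes the hex bytes from the log line above back into text; `hex_to_text` is a hypothetical helper name. The final byte 0D is a carriage return, most likely the one telnet sends at the end of each line, which would explain the trailing dot in the log's text rendering.

```shell
#!/bin/sh
# hex_to_text "68 65 ..."   (hypothetical helper)
# Prints the bytes named by a space-separated list of hex pairs.
hex_to_text() {
    for byte in $1; do
        # Convert the hex pair to three octal digits, then let printf emit that byte.
        printf "\\$(printf '%03o' "0x$byte")"
    done
}

# Body copied from the LoggerSink output above; the trailing CR is
# turned into a newline so the terminal shows a clean line.
hex_to_text "68 65 6C 6C 6F 77 6F 72 6C 64 0D" | tr '\r' '\n'
# prints: helloworld
```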
Flume Concepts
Configuring Sources, Channels, and Sinks
For Flume source configuration, see http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-sources
For Flume channel configuration, see
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-channels
For Flume sink configuration, see
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-sinks
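Data Flow Model
The unit of data that flows through Flume is called an event. An agent is a (JVM) process that hosts the components a flow passes through: a source, a channel, and a sink. Together these components move events from where they are collected (the source) to a destination (the sink); each such move is called a hop.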
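Reliability
In each agent, events are staged in a channel and then delivered to the next agent or to a terminal repository (for example HDFS, when the sink type is hdfs). An event is removed from the current agent's channel only after it has been stored in the next agent's channel or in the terminal repository. This is how Flume provides end-to-end reliability for message delivery.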
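Recoverability
When delivery fails, events can be recovered from the channel, because they are staged there. Flume supports a durable channel (for example, a file channel backed by the local file system). If performance matters more, a memory channel can be used instead, but events still held in a memory channel are lost and cannot be recovered if the agent process dies.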
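As an illustration of the durable option, the sketch below writes a file-channel fragment for agent a1 into a scratch directory. The property names (`type = file`, `checkpointDir`, `dataDirs`) come from the Flume user guide; the two directory paths and the `out_dir` variable are assumptions chosen for this example.

```shell
#!/bin/sh
# out_dir is a scratch location for this sketch; in practice the file
# would live in Flume's conf directory.
out_dir=$(mktemp -d)

cat > "$out_dir/file-channel.conf" <<'EOF'
# Durable channel: events are written to disk and survive an agent restart.
a1.channels = c1
a1.channels.c1.type = file
# Both paths are assumed example locations, not Flume defaults.
a1.channels.c1.checkpointDir = /home/hadoop/opt/data/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/opt/data/flume/data
EOF

cat "$out_dir/file-channel.conf"
```

Swapping these lines in for the `memory` channel of netcat2logger.conf trades some throughput for durability.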
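Multi-Agent Flow
Data can be passed across multiple agents using Avro: set the sink type of the upstream agent to avro and the source type of the downstream agent to avro, and configure the matching hostname and port on both sides. Events then flow from one agent to the next.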
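A two-hop setup of this kind might look like the sketch below, which writes only the hop-related properties to a scratch file. The agent names a1 and a2, the address 127.0.0.1, and port 4545 are assumptions for illustration; the sources, channels, and remaining sinks of each agent are omitted.

```shell
#!/bin/sh
# out_dir is a scratch location for this sketch.
out_dir=$(mktemp -d)

cat > "$out_dir/two-hop.conf" <<'EOF'
# Agent a1 (first hop): forwards events to a2 through an Avro sink.
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 127.0.0.1
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# Agent a2 (second hop): receives those events on an Avro source.
# The port must match the one the upstream sink sends to.
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
EOF

cat "$out_dir/two-hop.conf"
```

The downstream agent should be started first so that the upstream Avro sink has something to connect to.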
Consolidation
Flume can consolidate data from many locations, for example collecting the log data of hundreds of servers into a single HDFS file system.
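One way to picture consolidation is a collector agent that receives from many upstream agents on a single Avro source and writes to HDFS. The sketch below is an assumption-laden illustration, not the original article's configuration: the agent name `collector`, port 4545, and the HDFS URL are all hypothetical.

```shell
#!/bin/sh
# out_dir is a scratch location for this sketch.
out_dir=$(mktemp -d)

cat > "$out_dir/collector.conf" <<'EOF'
# Collector agent: one Avro source receives events from every upstream
# agent whose Avro sink points at this host and port.
collector.sources = r1
collector.channels = c1
collector.sinks = k1

collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1

collector.channels.c1.type = memory

# The HDFS sink writes the merged stream; the path is an assumed example.
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://namenode.example.com:8020/flume/weblogs
collector.sinks.k1.channel = c1
EOF

cat "$out_dir/collector.conf"
```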
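Multiplexing the Flow
Flume supports multiplexing the event flow to one or more destinations. This is done by defining a flow multiplexer that can either replicate an event or selectively route it to one or more channels.
The configuration format for fanning out a flow is: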
# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>
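By default a source replicates every event to all of its channels. Selective routing is configured with a multiplexing channel selector, as in the sketch below. The selector property names follow the Flume user guide; the header name `state` and the values CA and NY are hypothetical examples.

```shell
#!/bin/sh
# out_dir is a scratch location for this sketch.
out_dir=$(mktemp -d)

cat > "$out_dir/multiplexing.conf" <<'EOF'
a1.sources = r1
a1.channels = c1 c2
a1.sources.r1.channels = c1 c2

# Route each event by the value of its "state" header (assumed header name).
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CA = c1
a1.sources.r1.selector.mapping.NY = c2
# Events with any other value, or without the header, go to c1.
a1.sources.r1.selector.default = c1
EOF

cat "$out_dir/multiplexing.conf"
```

Each channel would then be drained by its own sink, so events tagged CA and NY end up at different destinations.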
Flume High Reliability
For failover, see:
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#failover-sink-processor
For load balancing, see:
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#load-balancing-sink-processor
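Both processors are configured on a sink group. The sketch below writes one fragment for each; the property names follow the user guide sections linked above, while the sink names k1 and k2, the priorities, and the penalty value are assumptions for illustration.

```shell
#!/bin/sh
# out_dir is a scratch location for this sketch.
out_dir=$(mktemp -d)

cat > "$out_dir/failover.conf" <<'EOF'
# Failover: the sink with the highest priority is used while it is healthy;
# if it fails, events go to the next one.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Upper bound, in milliseconds, on the back-off applied to a failed sink.
a1.sinkgroups.g1.processor.maxpenalty = 10000
EOF

cat > "$out_dir/load-balance.conf" <<'EOF'
# Load balancing: events are spread across the sinks in the group.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
EOF

cat "$out_dir/failover.conf" "$out_dir/load-balance.conf"
```

A sink group uses one processor type at a time, which is why the two variants are kept in separate files here.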
For a high-reliability architecture diagram and further details, see: http://www.zhuyongpeng.cn/1543.html