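Flume Architecture Overview
Flume's capabilities stem from one element of its design: the agent. An agent is a Java process that runs on a log collection node, that is, on a server node.
An agent contains three core components, source -> channel -> sink, which play roles similar to a producer, a warehouse, and a consumer.
source: the source component collects data. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
channel: after the source collects the data, it is held temporarily in the channel. In other words, the channel is the agent's buffer for collected data, and it can be backed by memory, jdbc, file, and so on.
sink: the sink component sends data to its destination. Destinations include hdfs, logger, avro, thrift, irc, file, null, HBase, solr, and custom sinks.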
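The producer / warehouse / consumer analogy can be sketched in a few lines of Python. This is only an illustration of the data flow, not how Flume is implemented: the function names, the `SENTINEL` end marker, and the use of `queue.Queue` as the channel are all assumptions made for the example.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, not a Flume concept


def source(lines, channel):
    """Producer: turn each input line into an event and put it on the channel."""
    for line in lines:
        channel.put(line)  # blocks while the channel is full
    channel.put(SENTINEL)


def sink(channel, destination):
    """Consumer: take events off the channel and deliver them to the destination."""
    while True:
        event = channel.get()
        if event is SENTINEL:
            break
        destination.append(event)


def run_agent(lines, capacity=1000):
    """Wire source -> channel -> sink and return what the sink delivered."""
    channel = queue.Queue(maxsize=capacity)  # bounded buffer between the two sides
    delivered = []
    consumer = threading.Thread(target=sink, args=(channel, delivered))
    consumer.start()
    source(lines, channel)
    consumer.join()
    return delivered
```

Because the channel is bounded, a slow sink eventually makes the source wait, which is the buffering role the channel plays between the other two components.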
The Agent Configuration File
Define the names of the agent and its components.
# a1: the user-chosen name of the agent
# Name the sources, channels, and sinks
a1.sources = r1
a1.channels = c1
a1.sinks = k1
Define the configuration of each component.
# Type of source r1
a1.sources.r1.type = netcat
# Address that r1 binds to
a1.sources.r1.bind = localhost
# Port that r1 listens on
a1.sources.r1.port = 44444
# c1 buffers events in memory
a1.channels.c1.type = memory
# Channel capacity: at most 1000 buffered events, at most 100 events per transaction
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Type of sink k1
a1.sinks.k1.type = logger
The agent connects the three components to each other by name.
# Connect the source to its channels
a1.sources.r1.channels = c1
# Connect the sink to its channel (note: "channel", with no "s")
a1.sinks.k1.channel = c1
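Since a source uses the plural key `channels` while a sink uses the singular `channel`, wiring mistakes are easy to make. The following is a minimal sketch of a checker for that, assuming the simple `key = value` layout shown above; `parse_properties` and `check_wiring` are hypothetical helper names, not part of Flume.

```python
CONFIG = """
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of wiring problems found for the given agent name."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for src in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{src}.channels", "").split()
        if not used:
            problems.append(f"source {src} has no 'channels' property")
        for name in used:
            if name not in declared:
                problems.append(f"source {src} uses undeclared channel {name}")
    for snk in props.get(f"{agent}.sinks", "").split():
        used = props.get(f"{agent}.sinks.{snk}.channel")
        if used is None:
            problems.append(f"sink {snk} has no 'channel' property")
        elif used not in declared:
            problems.append(f"sink {snk} uses undeclared channel {used}")
    return problems
```

Calling `check_wiring(parse_properties(CONFIG), "a1")` on the configuration above returns an empty list; writing `channels` on the sink line would be reported as a missing `channel` property.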
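NetCat Source: listening on a specified network port
As soon as an application writes data to this port, the source component receives it.
Sink: logger, Channel: memory
# Write the agent configuration file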
gedit /opt/flume/flume1.8.0/conf/netcat.cof
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.37.130
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
cd /opt/flume/flume1.8.0/bin
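# Start Flume with console logging so that the received data is printed to the console
/opt/flume/flume1.8.0/bin/flume-ng agent -n a1 -c /opt/flume/flume1.8.0/conf/ -f /opt/flume/flume1.8.0/conf/netcat.cof -Dflume.root.logger=DEBUG,console
Use the Windows Telnet tool to send data to the port (on Windows 10 the Telnet client has to be enabled first).
# Open CMD
telnet 192.168.37.130 44444
Type the data to send...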
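If Telnet is not available, any program that writes newline-terminated text to the port should work as well. Here is a minimal sketch in Python; `send_lines` is a hypothetical helper, and the host and port in the comment are simply the values from the configuration above.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Open one TCP connection and send each line terminated by a newline."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for line in lines:
            conn.sendall(line.encode("utf-8") + b"\n")


# Example call, assuming the agent above is already running:
#   send_lines("192.168.37.130", 44444, ["hello flume", "second event"])
```

Each line sent this way should show up as one event in the agent's console output.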
Sink: hdfs, Channel: file
# Create the directories that hold the channel's temporary data
mkdir /opt/flume/data
mkdir /opt/flume/checkpoint
# Create the agent configuration file
gedit /opt/flume/flume1.8.0/conf/netcat.cof
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.37.130
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = hdfs
# Path on HDFS where the data is saved
a1.sinks.k1.hdfs.path = hdfs://slave2:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 10 seconds; rollSize = 0 and rollCount = 0 turn off size- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# Prefix of the generated file names
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Start Hadoop
start-all.sh
# Start the Flume agent
/opt/flume/flume1.8.0/bin/flume-ng agent -n a1 -c /opt/flume/flume1.8.0/conf/ -f /opt/flume/flume1.8.0/conf/netcat.cof -Dflume.root.logger=DEBUG,console
# Open CMD
telnet 192.168.37.130 44444
Type the data to send...
The data is saved under /dataoutput on HDFS.
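The three roll settings in the HDFS sink configuration (`hdfs.rollInterval`, `hdfs.rollSize`, `hdfs.rollCount`) decide when the sink closes the current file and starts a new one, and a value of 0 switches that trigger off. The sketch below models that decision under simplifying assumptions; `should_roll` is a hypothetical function, and the exact comparisons Flume performs internally may differ.

```python
def should_roll(elapsed_seconds, bytes_written, events_written,
                roll_interval=10, roll_size=0, roll_count=0):
    """Return True when any enabled trigger has been reached.

    The defaults mirror the configuration above: roll every 10 seconds,
    with size-based and count-based rolling disabled (value 0).
    """
    if roll_interval > 0 and elapsed_seconds >= roll_interval:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and events_written >= roll_count:
        return True
    return False
```

With the settings used here, only elapsed time matters, so a file that received a single event is still closed after about 10 seconds.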
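Spooling Directory Source: watching a specified directory
As soon as an application adds a new file to the specified directory, the source component picks it up, parses its contents, and writes them to the channel. Once the whole file has been written to the channel, the source either marks the file as completed or deletes it.
Notes:
1. A file must not be opened and edited again after it has been copied into the spool directory.
2. Do not copy a file into the directory if a file with the same name has already been placed there.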
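One way to respect both restrictions is to finish writing the file somewhere else and then move it into the spool directory under a name that has not been used before. The following is a sketch under stated assumptions: `drop_into_spool` and the staging directory are hypothetical, and the staging directory is assumed to be on the same filesystem as the spool directory but outside it.

```python
import os
import shutil
import uuid


def drop_into_spool(src_path, spool_dir, staging_dir):
    """Copy src_path into spool_dir under a unique name via a rename.

    The copy (the slow part) happens in staging_dir, so the spool
    directory never contains a half-written file.
    """
    base, ext = os.path.splitext(os.path.basename(src_path))
    unique_name = f"{base}-{uuid.uuid4().hex}{ext}"  # avoids name collisions
    staged = os.path.join(staging_dir, unique_name)
    shutil.copyfile(src_path, staged)
    final = os.path.join(spool_dir, unique_name)
    os.replace(staged, final)  # the file appears in spool_dir already complete
    return final
```

Dropping the same source file twice then produces two differently named files in the spool directory instead of a name clash.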
Sink: logger, Channel: memory
# Create the directory to watch
mkdir /opt/flume/FlumeInputdata/
# Edit the agent configuration file
gedit /opt/flume/flume1.8.0/conf/netcat.cof
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/flume/FlumeInputdata
a1.sources.r1.fileHeader = true
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# Whether to delete files that have been marked as completed: never or immediate
a1.sources.r1.deletePolicy = never
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Start the Flume agent
/opt/flume/flume1.8.0/bin/flume-ng agent -n a1 -c /opt/flume/flume1.8.0/conf/ -f /opt/flume/flume1.8.0/conf/netcat.cof -Dflume.root.logger=DEBUG,console
Send data to the watched directory
vim /data1.txt
cp /data1.txt /opt/flume/FlumeInputdata/
Check the output on the console.
Sink: hdfs, Channel: file
# Edit the agent configuration file
gedit /opt/flume/flume1.8.0/conf/netcat.cof
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/flume/FlumeInputdata
a1.sources.r1.fileHeader = true
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://slave2:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Start the Flume agent
/opt/flume/flume1.8.0/bin/flume-ng agent -n a1 -c /opt/flume/flume1.8.0/conf/ -f /opt/flume/flume1.8.0/conf/netcat.cof -Dflume.root.logger=DEBUG,console
# Send the data file
cp /data1.txt /opt/flume/FlumeInputdata/
The data is saved under /dataoutput on HDFS.
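The `hdfs.filePrefix` value `%Y-%m-%d-%H-%M-%S` is expanded from a timestamp, and `hdfs.useLocalTimeStamp = true` tells the sink to take that timestamp from the agent's local clock rather than from an event header. For the six escapes used here, Python's `strftime` happens to give the same result, so the expansion can be previewed as follows. This is an approximation for these escapes only; `expand_prefix` is a hypothetical helper, and other Flume escape sequences do not necessarily match `strftime`.

```python
from datetime import datetime

FILE_PREFIX = "%Y-%m-%d-%H-%M-%S"  # value of hdfs.filePrefix in the configuration


def expand_prefix(pattern, when):
    """Expand the date and time escapes in pattern for the given local time."""
    return when.strftime(pattern)
```

For example, `expand_prefix(FILE_PREFIX, datetime(2018, 8, 10, 9, 5, 3))` gives `2018-08-10-09-05-03`, which is the kind of prefix to look for when listing /dataoutput.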
Exec Source: monitoring the output of a specified command
It takes the output of a command as its data source.
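The idea of using a command's output as a data source can be sketched as a generator that runs the command and yields each line of its standard output as one event. This is an illustration only, not Flume's implementation; `command_events` is a hypothetical name.

```python
import subprocess


def command_events(argv):
    """Run the command and yield each line of its standard output as one event."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")


# Example call on a machine where tail is available; it keeps running
# for as long as tail does:
#   for event in command_events(["tail", "-F", "/words.txt"]):
#       print(event)
```

With a long-running command such as `tail -F`, the generator never finishes on its own, so every line appended to the file becomes a new event.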
# Create the file to monitor
touch /words.txt
# Edit the agent configuration file
gedit /opt/flume/flume1.8.0/conf/netcat.cof
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
# Use the tail -F command to follow the contents of the file
a1.sources.r1.command = tail -F /words.txt
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://slave2:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Start the Flume agent
/opt/flume/flume1.8.0/bin/flume-ng agent -n a1 -c /opt/flume/flume1.8.0/conf/ -f /opt/flume/flume1.8.0/conf/netcat.cof -Dflume.root.logger=DEBUG,console
echo 'file word!' >> /words.txt
The data is saved under /dataoutput on HDFS.
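Avro Source: listening on a specified Avro port
The source receives files sent by an Avro client through the Avro port. That is, whenever an application sends a file through the Avro port, the source component obtains the contents of that file.
# Edit the agent configuration file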
gedit /opt/flume/flume1.8.0/conf/netcat.cof
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.37.130
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://slave2:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Start the Flume agent
/opt/flume/flume1.8.0/bin/flume-ng agent -n a1 -c /opt/flume/flume1.8.0/conf/ -f /opt/flume/flume1.8.0/conf/netcat.cof -Dflume.root.logger=DEBUG,console
# Send the file to the port
/opt/flume/flume1.8.0/bin/flume-ng avro-client -c /opt/flume/flume1.8.0/conf/ -H 192.168.37.130 -p 4141 -F /words.txt
The data is saved under /dataoutput on HDFS.
Reference: https://blog.csdn.net/qq_33366098/article/details/81565618