• Flume


    1、Flume

    Overview: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
    1) Data collection (crawler/log data: Flume)
    2) Data storage (HDFS/Hive/HBase (NoSQL))
    3) Data computation (MapReduce/Hive/Spark SQL/Spark Streaming/Flink)
    4) Data visualization

    2、Flume Roles

    1) source
    The data source, used to collect data; the source produces the data stream and passes it on to the channel.

    2) channel
    The transport channel, which bridges the source and the sink.

    3) sink
    The sink pulls data from the channel and delivers it to the destination.

    4) agent
    An agent is a JVM process that hosts the source, channel, and sink; within Flume, the event is the basic unit of data transfer.

    3、Flume Usage

    Simple and easy to use: you only need to write a configuration file.

    4、Flume Installation and Configuration

    1) Download Flume
    2) Upload it to the Linux machine
    3) Extract it
    tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /root/hd
    4) Rename it
    mv apache-flume-1.6.0-bin/ flume
    cp flume-env.sh.template flume-env.sh    (run inside the flume conf/ directory)
    5) Edit the configuration
    vi flume-env.sh
    export JAVA_HOME=/root/hd/jdk1.8.0_192
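
    To confirm the setup before moving on, a quick sanity check (a minimal sketch, assuming the /root/hd/flume directory created above):

    cd /root/hd/flume
    bin/flume-ng version    # should print the Flume 1.6.0 version and build information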

    5、Flume Listening on a Port

    Startup command:
    bin/flume-ng agent --conf conf/log4j.properties --name a1 --conf-file conf/flumejob_telnet.conf

    I have already hit this pitfall: I recommend that the path given after --conf be a full path, pointing at log4j.properties. My instructor taught it as just conf/, but when I ran it that way there was a problem and the output was not shown in real time.

    bin/flume-ng agent                      start an agent with the flume-ng launcher
    --conf conf/log4j.properties            the folder containing the Flume configuration
    --name a1                               the name of this agent
    --conf-file conf/flumejob_telnet.conf   the job configuration file
    -Dflume.root.logger=INFO,console        log level, written to the console

    flumejob_telnet.conf

    # sample.conf: A single-node Flume configuration
    # Name the components on this agent; naming them makes them easy to reference, and the plural (sources/sinks/channels) allows more than one of each role
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source: customize the source role here
    # This configuration is a TCP source; the type must be netcat
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    # Describe the sink: log the events
    a1.sinks.k1.type = logger
    # Use a channel which buffers events in memory (or a file); capacity 1000 events, 100 events per transaction
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel: a source can bind to multiple channels,
    # but a sink can bind to only one channel (the model shown in figure 2)
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    [root@hsiehchou121 flume]# bin/flume-ng agent \
    > --conf conf/ \
    > --name a1 \
    > --conf-file conf/flumejob_telnet.conf \
    > -Dflume.root.logger=INFO,console

    yum search telnet
    yum install telnet.x86_64
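
    With the agent running, send a test event to the netcat source (a minimal sketch, assuming the configuration above is listening on localhost:44444):

    telnet localhost 44444
    # telnet is interactive: type a line such as
    hello flume
    # the agent's console should log an Event whose body contains "hello flume"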

    6、Flume Monitoring a Local Linux File and Collecting into HDFS

    Startup command:
    bin/flume-ng agent --conf conf/log4j.properties --name a1 --conf-file conf/flumejob_hdfs.conf

    flumejob_hdfs.conf

    # Name the components on this agent (set the agent's component names)
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source: monitor a local file
    # exec runs a command to read the file; tail -F follows it in real time
    a1.sources.r1.type = exec
    # The command to execute; tail -F starts from the last 10 lines by default (see man tail)
    a1.sources.r1.command = tail -F /tmp/root/hive.log
    # Which shell runs the command; -c passes the command string to it
    # whereis bash
    # bash: /usr/bin/bash /usr/share/man/man1/bash.1.gz
    a1.sources.r1.shell = /usr/bin/bash -c
    # Describe the sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://hsiehchou121:9000/flume/%Y%m%d/%H
    # Prefix for the uploaded files
    a1.sinks.k1.hdfs.filePrefix = logs-
    # Whether to round the directory path based on time
    a1.sinks.k1.hdfs.round = true
    # Create a new directory every this many time units
    a1.sinks.k1.hdfs.roundValue = 1
    # The time unit for rounding (here: a new directory every minute)
    a1.sinks.k1.hdfs.roundUnit = minute
    # Whether to use the local timestamp
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a1.sinks.k1.hdfs.batchSize = 500
    # File type; compressed types are also supported
    a1.sinks.k1.hdfs.fileType = DataStream
    # Roll to a new file after this many seconds
    a1.sinks.k1.hdfs.rollInterval = 30
    # Roll size per file in bytes (just under 128 MB, a sensible match for the HDFS block size)
    a1.sinks.k1.hdfs.rollSize = 134217700
    # File rolling is independent of the number of events (0 disables count-based rolling)
    a1.sinks.k1.hdfs.rollCount = 0
    # Minimum block replicas; set to 1 so the rolling settings above take effect (HDFS itself already keeps 3 replicas)
    a1.sinks.k1.hdfs.minBlockReplicas = 1
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    [root@hsiehchou121 flume]# bin/flume-ng agent \
    > --conf conf/log4j.properties \
    > --name a1 \
    > --conf-file conf/flumejob_hdfs.conf
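
    To verify the pipeline, append a line to the monitored file and look for the output in HDFS (a minimal sketch, assuming the default filesystem is hdfs://hsiehchou121:9000; the date-based path below is just the %Y%m%d/%H pattern from the sink expanded for the current hour):

    echo "hive log test $(date)" >> /tmp/root/hive.log
    hdfs dfs -ls /flume/$(date +%Y%m%d)/$(date +%H)    # files with the logs- prefix should appear here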

    7、Monitoring a Directory

    flumejob_dir.conf

    # Define the component names
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = spooldir
    # The directory to monitor
    a1.sources.r1.spoolDir = /root/testdir
    # Suffix added to a file name once it has been uploaded successfully
    a1.sources.r1.fileSuffix = .COMPLETED
    # Whether to add a header containing the file's absolute path (default false)
    a1.sources.r1.fileHeader = true
    # Ignore all files ending in .tmp (still being written); do not upload them
    # The pattern matches any run of non-space characters ending in .tmp
    a1.sources.r1.ignorePattern = ([^ ]*.tmp)
    # Describe the sink: sink the data into HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://hsiehchou121:9000/flume/testdir/%Y%m%d/%H
    # Prefix for the uploaded files
    a1.sinks.k1.hdfs.filePrefix = testdir-
    # Whether to round the directory path based on time
    a1.sinks.k1.hdfs.round = true
    # Create a new directory every this many time units
    a1.sinks.k1.hdfs.roundValue = 1
    # The time unit for rounding (here: a new directory every hour)
    a1.sinks.k1.hdfs.roundUnit = hour
    # Whether to use the local timestamp
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a1.sinks.k1.hdfs.batchSize = 100
    # File type; compressed types are also supported
    a1.sinks.k1.hdfs.fileType = DataStream
    # Roll to a new file after this many seconds
    a1.sinks.k1.hdfs.rollInterval = 600
    # Roll size per file, roughly 128 MB
    a1.sinks.k1.hdfs.rollSize = 134217700
    # File rolling is independent of the number of events
    a1.sinks.k1.hdfs.rollCount = 0
    # Minimum block replicas
    a1.sinks.k1.hdfs.minBlockReplicas = 1
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    [root@hsiehchou121 flume]# bin/flume-ng agent --conf conf/log4j.properties --name a1 --conf-file conf/flumejob_dir.conf

    [root@hsiehchou121 flume]# bin/flume-ng agent \
    > --conf conf/log4j.properties \
    > --name a1 \
    > --conf-file conf/flumejob_dir.conf
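
    To test, drop a file into the monitored directory; once the upload finishes, Flume renames it with the .COMPLETED suffix (a minimal sketch, assuming /root/testdir exists and the agent above is running):

    echo "spooldir test" > /root/testdir/a.log
    ls /root/testdir                                           # after a moment: a.log.COMPLETED
    hdfs dfs -ls /flume/testdir/$(date +%Y%m%d)/$(date +%H)    # uploaded files with the testdir- prefix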

    8、Multiple Channels/Sinks

    Requirement: monitor the hive.log file and fan the data out to two channels at the same time; the sink on one channel stores to HDFS, while the sink on the other channel stores to the local filesystem. Agent a1 (flumejob_1.conf) tails the file and replicates events to two avro sinks; agent a2 (flumejob_2.conf) receives on port 4141 and writes to HDFS; agent a3 (flumejob_3.conf) receives on port 4142 and writes to a local directory.
    flumejob_1.conf

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1 k2
    a1.channels = c1 c2
    # Replicate the data stream to multiple channels
    a1.sources.r1.selector.type = replicating
    # Describe/configure the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /tmp/root/hive.log
    a1.sources.r1.shell = /bin/bash -c
    # Describe the sink
    # Send the data out on two separate ports
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = hsiehchou121
    a1.sinks.k1.port = 4141
    a1.sinks.k2.type = avro
    a1.sinks.k2.hostname = hsiehchou121
    a1.sinks.k2.port = 4142
    # Describe the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    a1.channels.c2.type = memory
    a1.channels.c2.capacity = 1000
    a1.channels.c2.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1 c2
    a1.sinks.k1.channel = c1
    a1.sinks.k2.channel = c2

    [root@hsiehchou121 flume]# bin/flume-ng agent --conf conf/log4j.properties --name a1 --conf-file conf/flumejob_1.conf

    flumejob_2.conf

    # Name the components on this agent
    a2.sources = r1
    a2.sinks = k1
    a2.channels = c1
    # Describe/configure the source
    a2.sources.r1.type = avro
    # Receive data on this port
    a2.sources.r1.bind = hsiehchou121
    a2.sources.r1.port = 4141
    # Describe the sink
    a2.sinks.k1.type = hdfs
    a2.sinks.k1.hdfs.path = hdfs://hsiehchou121:9000/flume2/%Y%m%d/%H
    # Prefix for the uploaded files
    a2.sinks.k1.hdfs.filePrefix = flume2-
    # Whether to round the directory path based on time
    a2.sinks.k1.hdfs.round = true
    # Create a new directory every this many time units
    a2.sinks.k1.hdfs.roundValue = 1
    # The time unit for rounding (here: a new directory every hour)
    a2.sinks.k1.hdfs.roundUnit = hour
    # Whether to use the local timestamp
    a2.sinks.k1.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a2.sinks.k1.hdfs.batchSize = 100
    # File type; compressed types are also supported
    a2.sinks.k1.hdfs.fileType = DataStream
    # Roll to a new file after this many seconds
    a2.sinks.k1.hdfs.rollInterval = 600
    # Roll size per file, roughly 128 MB
    a2.sinks.k1.hdfs.rollSize = 134217700
    # File rolling is independent of the number of events
    a2.sinks.k1.hdfs.rollCount = 0
    # Minimum block replicas
    a2.sinks.k1.hdfs.minBlockReplicas = 1
    # Describe the channel
    a2.channels.c1.type = memory
    a2.channels.c1.capacity = 1000
    a2.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a2.sources.r1.channels = c1
    a2.sinks.k1.channel = c1

    [root@hsiehchou121 flume]# bin/flume-ng agent --conf conf/log4j.properties --name a2 --conf-file conf/flumejob_2.conf

    flumejob_3.conf

    # Name the components on this agent
    a3.sources = r1
    a3.sinks = k1
    a3.channels = c1
    # Describe/configure the source
    a3.sources.r1.type = avro
    a3.sources.r1.bind = hsiehchou121
    a3.sources.r1.port = 4142
    # Describe the sink
    a3.sinks.k1.type = file_roll
    a3.sinks.k1.sink.directory = /root/flume2
    # Describe the channel
    a3.channels.c1.type = memory
    a3.channels.c1.capacity = 1000
    a3.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a3.sources.r1.channels = c1
    a3.sinks.k1.channel = c1

    [root@hsiehchou121 flume]# bin/flume-ng agent --conf conf/log4j.properties --name a3 --conf-file conf/flumejob_3.conf
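
    Start order and verification (a minimal sketch; the /root/flume2 output directory is an assumption required by the file_roll sink, which does not create it for you):

    mkdir -p /root/flume2                               # local output directory for agent a3
    # Start the receiving agents a2 and a3 first, then the sending agent a1,
    # so the avro sinks on ports 4141/4142 have something to connect to.
    echo "fan-out test $(date)" >> /tmp/root/hive.log
    ls /root/flume2                                     # file_roll output from agent a3
    hdfs dfs -ls /flume2/$(date +%Y%m%d)/$(date +%H)    # HDFS output from agent a2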
