An Introduction to Apache Flume

Reposted from: http://blog.163.com/guaiguai_family/blog/static/20078414520138100562883/

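Flume is a log collection system open-sourced by Cloudera. Early versions depended on ZooKeeper; the current FlumeNG drops that dependency. I never used the earlier versions, but I imagine losing the global view of the whole log collection system is a real pity. On the other hand, FlumeNG is easy to pick up and use, and it works well enough when paired with a monitoring system, so there are pros and cons :-)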

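The figure in the original post shows a common Flume deployment: servers send events to a local Flume agent, or the local Flume agent runs tail -f on the log files. The logs are forwarded to several downstream Flume agents in the same data center, which write the data to HDFS and also keep a short-lived copy on local disk for debugging.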

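Flume's configuration file is quite readable, and the official documentation describes it in detail. Structurally it has two parts: first declare which sources, channels, and sinks a Flume agent runs, then configure the properties of each source, channel, and sink and how they are wired together.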
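
The two-part structure can be sketched as a minimal agent definition. This is only an illustration: the agent name a1 and the component names r1, c1, and k1 are placeholders chosen here, not names from the original post, and the netcat source, memory channel, and logger sink were picked purely for brevity.

```properties
# Part 1: declare which components the (hypothetical) agent "a1" runs.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Part 2: configure each component...
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger

# ...and wire them together: a source lists its channels (plural),
# a sink names exactly one channel (singular).
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```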

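source: where logs come from. A source can run an external command such as tail -f, or listen on a port and accept logs in Avro, Thrift, or plain text-line format. Sources and channels have a many-to-many relationship. A source writes to its channels in either replicating (the default) or multiplexing mode; for example, the source in the log collector in the figure replicates each log event into two channels.
channel: personally I think "queue" would be a better name, to avoid confusion with what a sink does. A channel buffers and distributes logs. Roughly speaking, channels and sinks have a one-to-many relationship, and a channel can hand data to sinks in one of three ways: default, failover, and load_balance. The documentation calls this both "sink processor" and "sink group"; I find "sink group" easier to understand. A channel actually sends data to a sink group, so channel to sink group is one-to-one (which is probably why a sink group is also called a sink processor: the name refers to the process of moving events from the channel to the sinks), and sink group to sink is one-to-many. With default, a sink group may contain only one sink. With failover, events always go to the highest-priority sink first and fall back to the next-highest priority on failure. load_balance is self-explanatory: the sinks are written to in turn.
sink: handles writing logs out. Note that a sink can write to only one destination, such as a local file or a network port on a single remote host; load balancing is not done by the sink itself. That is why, in the figure, the log emitter needs one sink configured for each log collector. Flume ships with quite a few sinks: local fs, hdfs, hbase, solr, elasticsearch, avro, thrift, and so on. Commendably, the hdfs and hbase sinks both support Kerberos authentication. As you would expect from something built by Cloudera, the Hadoop integration is excellent.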
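
As a rough sketch of these relationships, the fragment below shows one source replicating into two channels, with one channel drained by a failover sink group and the other by a single sink. The agent name a1, all component names, the host names collector1 and collector2, the ports, and the directory are assumptions made up for this illustration.

```properties
# Hypothetical agent "a1".
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2 k3
a1.sinkgroups = g1

# Source to channels: copy every event into both c1 and c2.
# "replicating" is the default selector; it is spelled out here for clarity.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Channel c1 to sink group g1: k1 is preferred, k2 is the fallback.
# Each sink still writes to exactly one destination.
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1
a1.sinks.k1.port = 4545

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2
a1.sinks.k2.port = 4545

a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# A larger number means a higher priority.
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Channel c2 has a single sink, so the default processor applies
# and no sink group needs to be declared for it.
a1.sinks.k3.type = file_roll
a1.sinks.k3.channel = c2
a1.sinks.k3.sink.directory = /tmp/flume-copy
```

Switching g1 to load balancing would mean setting processor.type to load_balance, which is what the generator script later in this post does for the emitter.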

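A single Flume agent process can contain multiple sources, channels, and sinks, and the flows they form can be completely unrelated. For example, one source-channel-sink set can collect access.log and another can collect error.log, with no data exchanged between them. Multiple Flume agent processes can also run on the same machine. Note that events in memory channels within the same agent process are shared, but Flume does not take this sharing into account when estimating memory consumption.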
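
Two unrelated flows inside one agent might look like the sketch below. The agent name a1, the component names, and all file paths are hypothetical; the point is only that nothing connects the access flow to the error flow.

```properties
# One (hypothetical) agent process, two independent flows.
a1.sources = access error
a1.channels = c_access c_error
a1.sinks = k_access k_error

# Flow 1: access.log
a1.sources.access.type = exec
a1.sources.access.command = tail -F /tmp/access.log
a1.sources.access.channels = c_access
a1.channels.c_access.type = memory
a1.sinks.k_access.type = file_roll
a1.sinks.k_access.sink.directory = /tmp/flume/access
a1.sinks.k_access.channel = c_access

# Flow 2: error.log (shares no source, channel, or sink with flow 1)
a1.sources.error.type = exec
a1.sources.error.command = tail -F /tmp/error.log
a1.sources.error.channels = c_error
a1.channels.c_error.type = memory
a1.sinks.k_error.type = file_roll
a1.sinks.k_error.sink.directory = /tmp/flume/error
a1.sinks.k_error.channel = c_error
```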

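The Flume agent process checks its configuration file every thirty seconds and reloads it if it has changed. So although there is no ZooKeeper to manage configuration centrally, getting help from tools such as Puppet/Chef plus Nagios/Ganglia makes this not much of a problem.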

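Unlike Scribe, Flume does not support categories directly. Instead it lets you add headers to events, and the multiplexing channel selector can map events to different channels based on an event header, so the logs can be split apart at the end of the whole flow. If you use the hdfs sink, the HDFS file name can embed event header values, so you can split logs by category without using the multiplexing channel selector at all.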
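
Both approaches can be sketched as configuration fragments. These are fragments rather than a complete agent definition, and the agent name a1, the component names, and the category values access and error are assumptions for illustration; the header name category and the HDFS URL follow the generator script later in this post.

```properties
# Approach 1: route by the "category" event header with a multiplexing selector.
a1.sources.r1.channels = c_access c_error c_default
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = category
a1.sources.r1.selector.mapping.access = c_access
a1.sources.r1.selector.mapping.error = c_error
# Events whose header matches no mapping go here.
a1.sources.r1.selector.default = c_default

# Approach 2: a single channel and a single hdfs sink; the header value is
# substituted into the path with %{category}, so no multiplexing is needed.
# The %Y%m%d escapes require a timestamp header on each event
# (for example from the timestamp interceptor).
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode1:8020/user/gg/data/%{category}/%Y%m%d
```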

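Flume's design is quite flexible and simple. In my small test, stability was good but performance was not great (perhaps my test was not rigorous), especially with the file channel. Flume buffers events inside the JVM, a design that is neither as clever nor as efficient as Kafka's. Another concern is that, unlike Kafka, it does not treat replication as a core design element; users have to configure it explicitly at each stage of the event flow, for example by adding to each log collector a memory channel and an avro sink that writes to another log collector. Without ZooKeeper's help, this has no practical value. If the project schedule allows, I think building sinks on top of Kafka is a more efficient, convenient, and reliable log collection solution, as in LinkedIn's data pipeline architecture.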

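Below is a Perl script that generates the Flume configuration. I do not configure things by hand because the source, channel, and sink settings are largely identical in many places, and maintaining them manually gets tiring. The tail line in the script is rather long and may not display completely; it should be

tail -F -n 0 --pid `ps -o ppid= $$` $log_file | sed -e "s/^/host=`hostname --fqdn` category=$category:/"

Come to think of it, why don't these silly tools like Scribe and Flume just provide tail -f functionality themselves...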

#!/usr/bin/perl
#
# Emitter:
#   Server -> access.log -> tail -F -> Flume exec source(access) ->
#       Flume file channel(c1) -> Flume Avro Sinks with load balancing sink processor(g1: s1 s2 s3)
#
# Collector:
#   Flume Avro source with replicating channel selector(source1) ->
#       Flume memory channel(file1) -> Flume file roll sink(file1)
#       Flume memory channel(hdfs1) -> Flume HDFS sink(hdfs1)
#       Flume memory channel(hdfs2) -> Flume HDFS sink(hdfs2), in another data center
#       Flume memory channel(hbase1) -> Flume HBase sink(hbase1), the standard HBase sink uses
#                       hbase-site.xml to get server address, so can't use two HBase sinks
#                       except starting another Flume agent process.

use strict;
use warnings;
use Getopt::Long;

# Log files to tail on the emitter, keyed by category.
my %g_log_files = (
    "access" => [ qw( /tmp/access.log ) ],
);
my @g_collector_hosts = qw( collector1 collector2 collector3 );
my $g_emitter_avro_port = 3000;
my $g_emitter_thrift_port = 3001;
my $g_emitter_nc_port = 3002;
my $g_collector_avro_port = 3000;
my $g_flume_work_dir = "/tmp/flume";
my $g_data_dir = "/tmp/log-data";
my %g_hdfs_paths = (
    "hdfs1" => "hdfs://namenode1:8020/user/gg/data",
    "hdfs2" => "hdfs://namenode2:8020/user/gg/data",
);
my $g_emitter_conf = "emitter.properties";
my $g_collector_conf = "collector.properties";
my $g_overwrite_conf = 0;

# --force allows overwriting existing config files.
GetOptions("force!" => \$g_overwrite_conf);

generate_emitter_config();
generate_collector_config();

exit(0);

#######################################
sub generate_emitter_config {
    my $conf = "";
    my $sources = join(" ", sort(keys %g_log_files));
    my $sinks = join(" ", map { "s$_" } (1 .. @g_collector_hosts));

    $conf .= <<EOF;
        emitter.sources = $sources avro1 thrift1 nc1
        emitter.channels = c1
        emitter.sinks = $sinks
        emitter.sinkgroups = g1


EOF

    for my $category (sort keys %g_log_files) {
        my $log_files = $g_log_files{$category};

        for my $log_file (@$log_files) {
            # \$\$ keeps a literal $$ in the generated config, so that the
            # shell (not Perl) expands it to its own PID.
            $conf .= <<EOF;
        emitter.sources.$category.channels = c1
        emitter.sources.$category.type = exec
        emitter.sources.$category.command = tail -F -n 0 --pid `ps -o ppid= \$\$` $log_file | sed -e "s/^/host=`hostname --fqdn` category=$category:/"
        emitter.sources.$category.shell = /bin/sh -c
        emitter.sources.$category.restartThrottle = 5000
        emitter.sources.$category.restart = true
        emitter.sources.$category.logStdErr = true
        emitter.sources.$category.interceptors = i1 i2 i3
        emitter.sources.$category.interceptors.i1.type = timestamp
        emitter.sources.$category.interceptors.i2.type = host
        emitter.sources.$category.interceptors.i2.useIP = false
        emitter.sources.$category.interceptors.i3.type = static
        emitter.sources.$category.interceptors.i3.key = category
        emitter.sources.$category.interceptors.i3.value = $category

EOF
        }
    }

    $conf .= <<EOF;
        emitter.sources.avro1.channels = c1
        emitter.sources.avro1.type = avro
        emitter.sources.avro1.bind = localhost
        emitter.sources.avro1.port = $g_emitter_avro_port
        emitter.sources.avro1.interceptors = i1 i2 i3
        emitter.sources.avro1.interceptors.i1.type = timestamp
        emitter.sources.avro1.interceptors.i2.type = host
        emitter.sources.avro1.interceptors.i2.useIP = false
        emitter.sources.avro1.interceptors.i3.type = static
        emitter.sources.avro1.interceptors.i3.key = category
        emitter.sources.avro1.interceptors.i3.value = default

        emitter.sources.thrift1.channels = c1
        emitter.sources.thrift1.type = thrift
        emitter.sources.thrift1.bind = localhost
        emitter.sources.thrift1.port = $g_emitter_thrift_port
        emitter.sources.thrift1.interceptors = i1 i2 i3
        emitter.sources.thrift1.interceptors.i1.type = timestamp
        emitter.sources.thrift1.interceptors.i2.type = host
        emitter.sources.thrift1.interceptors.i2.useIP = false
        emitter.sources.thrift1.interceptors.i3.type = static
        emitter.sources.thrift1.interceptors.i3.key = category
        emitter.sources.thrift1.interceptors.i3.value = default

        emitter.sources.nc1.channels = c1
        emitter.sources.nc1.type = netcat
        emitter.sources.nc1.bind = localhost
        emitter.sources.nc1.port = $g_emitter_nc_port
        emitter.sources.nc1.max-line-length = 20480
        emitter.sources.nc1.interceptors = i1 i2 i3
        emitter.sources.nc1.interceptors.i1.type = timestamp
        emitter.sources.nc1.interceptors.i2.type = host
        emitter.sources.nc1.interceptors.i2.useIP = false
        emitter.sources.nc1.interceptors.i3.type = static
        emitter.sources.nc1.interceptors.i3.key = category
        emitter.sources.nc1.interceptors.i3.value = default


        emitter.channels.c1.type = file
        emitter.channels.c1.checkpointDir = $g_flume_work_dir/emitter-c1/checkpoint
#emitter.channels.c1.useDualCheckpoints = true
#emitter.channels.c1.backupCheckpointDir = $g_flume_work_dir/emitter-c1/checkpointBackup
        emitter.channels.c1.dataDirs = $g_flume_work_dir/emitter-c1/data


EOF

    my $i = 0;
    my $port = $g_collector_avro_port;
    my $onebox = is_one_box();
    for my $host (sort @g_collector_hosts) {
        ++$i;
        # When several collectors share one machine, give each its own port.
        $port += 1000 if $onebox;

        $conf .= <<EOF;
        emitter.sinks.s$i.channel = c1
        emitter.sinks.s$i.type = avro
        emitter.sinks.s$i.hostname = $host
        emitter.sinks.s$i.port = $port
        emitter.sinks.s$i.batch-size = 100
#emitter.sinks.s$i.reset-connection-interval = 600
        emitter.sinks.s$i.compression-type = deflate

EOF
    }

    $conf .= <<EOF;

        emitter.sinkgroups.g1.sinks = $sinks
        emitter.sinkgroups.g1.processor.type = load_balance
        emitter.sinkgroups.g1.processor.backoff = true
        emitter.sinkgroups.g1.processor.selector = round_robin

EOF

    # Strip the leading indentation used inside the heredocs.
    $conf =~ s/^[ \t]+//mg;

    die "$g_emitter_conf already exists!\n" if !$g_overwrite_conf && -e $g_emitter_conf;
    open my $fh, ">", $g_emitter_conf or die "Can't write $g_emitter_conf: $!\n";
    print $fh $conf;
    close $fh;
}

sub generate_collector_config {
    my $conf = "";
    # Each sink gets a memory channel of the same name.
    my @sinks = qw(file1 hdfs1 hdfs2 hbase1);
    my $sinks = join(" ", @sinks);

    my $port = $g_collector_avro_port;
    my $onebox = is_one_box();
    $port += 1000 if $onebox;

    $conf .= <<EOF;
        collector.sources = source1
        collector.channels = $sinks
        collector.sinks = $sinks


        collector.sources.source1.channels = $sinks
        collector.sources.source1.type = avro
        collector.sources.source1.bind = 0.0.0.0
        collector.sources.source1.port = $port
        collector.sources.source1.compression-type = deflate
        collector.sources.source1.interceptors = i1 i2 i3 i4

        collector.sources.source1.interceptors.i1.type = timestamp
        collector.sources.source1.interceptors.i1.preserveExisting = true

        collector.sources.source1.interceptors.i2.type = host
        collector.sources.source1.interceptors.i2.preserveExisting = true
        collector.sources.source1.interceptors.i2.useIP = false

        collector.sources.source1.interceptors.i3.type = static
        collector.sources.source1.interceptors.i3.preserveExisting = true
        collector.sources.source1.interceptors.i3.key = category
        collector.sources.source1.interceptors.i3.value = default

        collector.sources.source1.interceptors.i4.type = host
        collector.sources.source1.interceptors.i4.preserveExisting = false
        collector.sources.source1.interceptors.i4.useIP = false
        collector.sources.source1.interceptors.i4.hostHeader = collector


EOF

    for my $sink (@sinks) {
        $conf .= <<EOF;
        collector.channels.$sink.type = memory
        collector.channels.$sink.capacity = 10000
        collector.channels.$sink.transactionCapacity = 100
        collector.channels.$sink.byteCapacityBufferPercentage = 20
        collector.channels.$sink.byteCapacity = 0

EOF
    }

    $conf .= <<EOF;

        collector.sinks.file1.channel = file1
        collector.sinks.file1.type = file_roll
        collector.sinks.file1.sink.directory = $g_data_dir/collector-$port-file1
        collector.sinks.file1.sink.rollInterval = 3600
        collector.sinks.file1.batchSize = 100
        collector.sinks.file1.sink.serializer = text
        collector.sinks.file1.sink.serializer.appendNewline = true
#collector.sinks.file1.sink.serializer = avro_event
#collector.sinks.file1.sink.serializer.syncIntervalBytes = 2048000
#collector.sinks.file1.sink.serializer.compressionCodec = snappy

        collector.sinks.hdfs1.channel = hdfs1
        collector.sinks.hdfs1.type = hdfs
        collector.sinks.hdfs1.hdfs.path = $g_hdfs_paths{hdfs1}/%{category}/%Y%m%d/%H
        collector.sinks.hdfs1.hdfs.filePrefix = %{collector}-$port
        collector.sinks.hdfs1.hdfs.rollInterval = 600
        collector.sinks.hdfs1.hdfs.rollSize = 0
        collector.sinks.hdfs1.hdfs.rollCount = 0
        collector.sinks.hdfs1.hdfs.idleTimeout = 0
        collector.sinks.hdfs1.hdfs.batchSize = 100
        collector.sinks.hdfs1.hdfs.codeC = snappy
        collector.sinks.hdfs1.hdfs.fileType = SequenceFile
#collector.sinks.hdfs1.serializer = text
#collector.sinks.hdfs1.serializer.appendNewline = true
        collector.sinks.hdfs1.serializer = avro_event
        collector.sinks.hdfs1.serializer.syncIntervalBytes = 2048000
        collector.sinks.hdfs1.serializer.compressionCodec = null
#collector.sinks.hdfs1.serializer.compressionCodec = snappy

        collector.sinks.hdfs2.channel = hdfs2
        collector.sinks.hdfs2.type = hdfs
        collector.sinks.hdfs2.hdfs.path = $g_hdfs_paths{hdfs2}/%{category}/%Y%m%d/%H
        collector.sinks.hdfs2.hdfs.filePrefix = %{collector}-$port
        collector.sinks.hdfs2.hdfs.rollInterval = 600
        collector.sinks.hdfs2.hdfs.rollSize = 0
        collector.sinks.hdfs2.hdfs.rollCount = 0
        collector.sinks.hdfs2.hdfs.idleTimeout = 0
        collector.sinks.hdfs2.hdfs.batchSize = 100
        collector.sinks.hdfs2.hdfs.codeC = snappy
        collector.sinks.hdfs2.hdfs.fileType = SequenceFile
#collector.sinks.hdfs2.serializer = text
#collector.sinks.hdfs2.serializer.appendNewline = true
        collector.sinks.hdfs2.serializer = avro_event
        collector.sinks.hdfs2.serializer.syncIntervalBytes = 2048000
        collector.sinks.hdfs2.serializer.compressionCodec = null
#collector.sinks.hdfs2.serializer.compressionCodec = snappy

        collector.sinks.hbase1.channel = hbase1
        collector.sinks.hbase1.type = hbase
        collector.sinks.hbase1.table = log
        collector.sinks.hbase1.columnFamily = log

EOF

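    # Strip the leading indentation used inside the heredocs.
    $conf =~ s/^[ \t]+//mg;

    die "$g_collector_conf already exists!\n" if !$g_overwrite_conf && -e $g_collector_conf;
    open my $fh, ">", $g_collector_conf or die "Can't write $g_collector_conf: $!\n";
    print $fh $conf;
    close $fh;
}

# True when the collector host list contains duplicates, i.e. several
# collectors run on the same machine.
sub is_one_box {
    my %h = map { $_ => 1 } @g_collector_hosts;
    return scalar(keys %h) < scalar(@g_collector_hosts);
}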

Original article: https://www.cnblogs.com/DjangoBlog/p/3535497.html