An Introduction to Apache Flume


    Reposted from: http://blog.163.com/guaiguai_family/blog/static/20078414520138100562883/

    Flume is a log collection system open-sourced by Cloudera. Early versions depended on ZooKeeper; the current Flume NG drops that dependency. I never used the older versions, and losing the global view of the whole collection system seems a pity, but Flume NG is easy to pick up and use, and paired with a monitoring system it works well enough, so there are pros and cons either way :-)

    The figure below shows a common Flume deployment: servers either send events to a local Flume agent or let the local agent tail -f their log files; the logs are forwarded to several downstream Flume agents in the same data center, which write the data to HDFS while keeping a short-lived copy on local disk for debugging.

    (Figure: servers -> local Flume agents -> downstream collector agents -> HDFS, with a local-disk copy for debugging)

    Flume's configuration file format is easy to read, and the official documentation describes it in great detail. Structurally a config has two parts: first declare which sources, channels, and sinks a Flume agent runs, then configure each source, channel, and sink's properties and how they connect to one another.
    • source: where logs come from. A source can run an external command such as tail -f, or listen on a port and accept logs in Avro, Thrift, or text-line format. Sources and channels are many-to-many; a source writes to its channels in either replicating (the default) or multiplexing mode. In the figure above, for example, the source in each log collector replicates every event into two channels.
    • channel: personally I think "queue" would be a better name, to avoid confusion with what sinks do. A channel buffers and distributes events. Roughly speaking, channels and sinks are one-to-many. A channel hands events to sinks via one of three processor types: default, failover, or load_balance. The documentation calls this mechanism both a "sink processor" and a "sink group"; I find "sink group" easier to grasp: a channel actually sends data to a sink group, so channel-to-sink-group is one-to-one (which is presumably why a sink group is also called a sink processor: it names the process of moving events from the channel to the sinks), and sink-group-to-sink is one-to-many. With the default processor a sink group may contain only one sink; failover always writes to the highest-priority sink first and falls back to the next-highest on failure; load_balance is self-explanatory: it rotates among the sinks.
    • sink: writes logs out. Note that a single sink writes to exactly one destination, such as a local file or the network port of one remote host; load balancing is not the sink's job, which is why in the figure above the log emitter needs one sink configured per log collector. Flume ships with plenty of sinks: local filesystem, HDFS, HBase, Solr, Elasticsearch, Avro, Thrift, and more. Commendably, the HDFS and HBase sinks both support Kerberos authentication; as you would expect from something built at Cloudera, the Hadoop integration is excellent.
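    As a sketch of how the processor types above are declared (the agent name "a1" and sink names here are made up, not from this post), a failover sink group that prefers s1 and falls back to s2 looks like this in Flume's properties format:

    ```properties
    # Hypothetical agent "a1": a failover sink group over two sinks.
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = s1 s2
    a1.sinkgroups.g1.processor.type = failover
    # The higher-priority sink is tried first; on failure Flume falls back to s2.
    a1.sinkgroups.g1.processor.priority.s1 = 10
    a1.sinkgroups.g1.processor.priority.s2 = 5
    a1.sinkgroups.g1.processor.maxpenalty = 10000
    ```

    Switching processor.type to load_balance (with processor.selector = round_robin) gives the rotating behavior described above; that is what the emitter config generated by the script later in this post uses.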

    A single Flume agent process can contain multiple sources, channels, and sinks, and the flows they form need not be related: one source-channel-sink chain can collect access.log while another collects error.log, with no data exchanged between them. The same machine can also run several Flume agent processes. Note that within one agent process the events in a memory channel are shared, but Flume does not account for that sharing when estimating memory consumption.
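    A minimal sketch of two unrelated flows living in one agent process (all names hypothetical):

    ```properties
    # Hypothetical agent "a1": two flows that share a process but no data.
    a1.sources = access error
    a1.channels = c_access c_error
    a1.sinks = s_access s_error

    # access.log flow
    a1.sources.access.channels = c_access
    a1.sinks.s_access.channel = c_access

    # error.log flow
    a1.sources.error.channels = c_error
    a1.sinks.s_error.channel = c_error
    ```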

    A Flume agent process checks its configuration file every thirty seconds and reloads it if it has changed, so even though there is no ZooKeeper for centralized configuration, getting by with Puppet/Chef plus Nagios/Ganglia is not much of a problem.

    Flume does not support categories directly the way Scribe does. Instead, you attach headers to events, and a multiplexing channel selector can map events to different channels by header value, letting you split the logs apart at the far end of the flow. If you use the HDFS sink, event header values can be interpolated into the HDFS file path, so you can get per-category log splitting without a multiplexing channel selector at all.
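    For instance, a multiplexing selector keyed on the category header might be sketched as (channel names hypothetical):

    ```properties
    # Route events to channels by the value of their "category" header.
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = category
    a1.sources.r1.selector.mapping.access = c_access
    a1.sources.r1.selector.mapping.error = c_error
    a1.sources.r1.selector.default = c_other
    ```

    The alternative mentioned above, splitting at the HDFS sink instead, is what the generated collector config later in this post does via hdfs.path = .../%{category}/...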

    Flume's design is flexible and simple. In my small-scale tests it was stable but not particularly fast (possibly my testing was sloppy), especially with the file channel. Flume buffers events inside the JVM, a design that is neither as clever nor as efficient as Kafka's. Another concern is that, unlike Kafka, replication is not a core part of the design; the user has to configure it explicitly at each stage of the event flow, for instance by adding a memory channel and an Avro sink on every log collector to forward to another log collector, and without ZooKeeper's help that approach has little practical value. If project time allows, I think building sinks on top of Kafka, as in LinkedIn's data pipeline architecture, makes for a more efficient, convenient, and reliable log collection solution.

    Below is a Perl script that generates the Flume configuration. The reason not to write the config by hand is that many of the source, channel, and sink stanzas are nearly identical, and maintaining them manually gets tedious. The tail command line in it is long and may be cut off when displayed; it should read as follows (the --pid `ps -o ppid= $$` part makes tail exit when its parent process, the Flume agent's shell, dies):

    tail -F -n 0 --pid `ps -o ppid= $$` $log_file | sed -e "s/^/host=`hostname --fqdn` category=$category:/"
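    The sed stage above prefixes every line with host= and category= markers before it enters Flume. A standalone check of just that transformation, with the hostname and category hardcoded to made-up values (since `hostname --fqdn` varies by machine):

    ```shell
    # Simulate one log line passing through the sed prefix stage.
    echo 'GET /index.html 200' | sed -e 's/^/host=web01.example.com category=access:/'
    # prints: host=web01.example.com category=access:GET /index.html 200
    ```
    
    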

    Honestly, why on earth don't Scribe and Flume just build tail -f support in directly...

    #!/usr/bin/perl
    #
    # Emitter:
    #   Server -> access.log -> tail -F -> Flume exec source (access) ->
    #   Flume file channel (c1) -> Flume Avro sinks with load-balancing sink processor (g1: s1 s2 s3)
    #
    # Collector:
    #   Flume Avro source with replicating channel selector (source1) ->
    #     Flume memory channel (file1) -> Flume file roll sink (file1)
    #     Flume memory channel (hdfs1) -> Flume HDFS sink (hdfs1)
    #     Flume memory channel (hdfs2) -> Flume HDFS sink (hdfs2), in another data center
    #     Flume memory channel (hbase1) -> Flume HBase sink (hbase1); the standard HBase sink
    #       reads the server address from hbase-site.xml, so two HBase sinks can't be used
    #       without starting another Flume agent process.

    use strict;
    use warnings;
    use Getopt::Long;

    my %g_log_files = (
        "access" => [ qw( /tmp/access.log ) ],
    );
    my @g_collector_hosts = qw( collector1 collector2 collector3 );
    my $g_emitter_avro_port   = 3000;
    my $g_emitter_thrift_port = 3001;
    my $g_emitter_nc_port     = 3002;
    my $g_collector_avro_port = 3000;
    my $g_flume_work_dir = "/tmp/flume";
    my $g_data_dir       = "/tmp/log-data";
    my %g_hdfs_paths = (
        "hdfs1" => "hdfs://namenode1:8020/user/gg/data",
        "hdfs2" => "hdfs://namenode2:8020/user/gg/data",
    );
    my $g_emitter_conf   = "emitter.properties";
    my $g_collector_conf = "collector.properties";
    my $g_overwrite_conf = 0;

    GetOptions("force!" => \$g_overwrite_conf);

    generate_emitter_config();
    generate_collector_config();

    exit(0);

    #######################################
    sub generate_emitter_config {
        my $conf = "";
        my $sources = join(" ", sort(keys %g_log_files));
        my $sinks = join(" ", map { "s$_" } (1 .. @g_collector_hosts));

        $conf .= <<EOF;
    emitter.sources = $sources avro1 thrift1 nc1
    emitter.channels = c1
    emitter.sinks = $sinks
    emitter.sinkgroups = g1


EOF

        for my $category (sort keys %g_log_files) {
            my $log_files = $g_log_files{$category};

            for my $log_file (@$log_files) {
                $conf .= <<EOF;
    emitter.sources.$category.channels = c1
    emitter.sources.$category.type = exec
    emitter.sources.$category.command = tail -F -n 0 --pid `ps -o ppid= \$\$` $log_file | sed -e "s/^/host=`hostname --fqdn` category=$category:/"
    emitter.sources.$category.shell = /bin/sh -c
    emitter.sources.$category.restartThrottle = 5000
    emitter.sources.$category.restart = true
    emitter.sources.$category.logStdErr = true
    emitter.sources.$category.interceptors = i1 i2 i3
    emitter.sources.$category.interceptors.i1.type = timestamp
    emitter.sources.$category.interceptors.i2.type = host
    emitter.sources.$category.interceptors.i2.useIP = false
    emitter.sources.$category.interceptors.i3.type = static
    emitter.sources.$category.interceptors.i3.key = category
    emitter.sources.$category.interceptors.i3.value = $category

EOF
            }
        }

        $conf .= <<EOF;
    emitter.sources.avro1.channels = c1
    emitter.sources.avro1.type = avro
    emitter.sources.avro1.bind = localhost
    emitter.sources.avro1.port = $g_emitter_avro_port
    emitter.sources.avro1.interceptors = i1 i2 i3
    emitter.sources.avro1.interceptors.i1.type = timestamp
    emitter.sources.avro1.interceptors.i2.type = host
    emitter.sources.avro1.interceptors.i2.useIP = false
    emitter.sources.avro1.interceptors.i3.type = static
    emitter.sources.avro1.interceptors.i3.key = category
    emitter.sources.avro1.interceptors.i3.value = default

    emitter.sources.thrift1.channels = c1
    emitter.sources.thrift1.type = thrift
    emitter.sources.thrift1.bind = localhost
    emitter.sources.thrift1.port = $g_emitter_thrift_port
    emitter.sources.thrift1.interceptors = i1 i2 i3
    emitter.sources.thrift1.interceptors.i1.type = timestamp
    emitter.sources.thrift1.interceptors.i2.type = host
    emitter.sources.thrift1.interceptors.i2.useIP = false
    emitter.sources.thrift1.interceptors.i3.type = static
    emitter.sources.thrift1.interceptors.i3.key = category
    emitter.sources.thrift1.interceptors.i3.value = default

    emitter.sources.nc1.channels = c1
    emitter.sources.nc1.type = netcat
    emitter.sources.nc1.bind = localhost
    emitter.sources.nc1.port = $g_emitter_nc_port
    emitter.sources.nc1.max-line-length = 20480
    emitter.sources.nc1.interceptors = i1 i2 i3
    emitter.sources.nc1.interceptors.i1.type = timestamp
    emitter.sources.nc1.interceptors.i2.type = host
    emitter.sources.nc1.interceptors.i2.useIP = false
    emitter.sources.nc1.interceptors.i3.type = static
    emitter.sources.nc1.interceptors.i3.key = category
    emitter.sources.nc1.interceptors.i3.value = default


    emitter.channels.c1.type = file
    emitter.channels.c1.checkpointDir = $g_flume_work_dir/emitter-c1/checkpoint
    #emitter.channels.c1.useDualCheckpoints = true
    #emitter.channels.c1.backupCheckpointDir = $g_flume_work_dir/emitter-c1/checkpointBackup
    emitter.channels.c1.dataDirs = $g_flume_work_dir/emitter-c1/data


EOF

        my $i = 0;
        my $port = $g_collector_avro_port;
        my $onebox = is_one_box();
        for my $host (sort @g_collector_hosts) {
            ++$i;
            $port += 1000 if $onebox;

            $conf .= <<EOF;
    emitter.sinks.s$i.channel = c1
    emitter.sinks.s$i.type = avro
    emitter.sinks.s$i.hostname = $host
    emitter.sinks.s$i.port = $port
    emitter.sinks.s$i.batch-size = 100
    #emitter.sinks.s$i.reset-connection-interval = 600
    emitter.sinks.s$i.compression-type = deflate

EOF
        }

        $conf .= <<EOF;

    emitter.sinkgroups.g1.sinks = $sinks
    emitter.sinkgroups.g1.processor.type = load_balance
    emitter.sinkgroups.g1.processor.backoff = true
    emitter.sinkgroups.g1.processor.selector = round_robin

EOF

        $conf =~ s/^ +//mg;

        die "$g_emitter_conf already exists!" if !$g_overwrite_conf && -e $g_emitter_conf;
        open my $fh, ">", $g_emitter_conf or die "Can't write $g_emitter_conf: $!";
        print $fh $conf;
        close $fh;
    }

    sub generate_collector_config {
        my $conf = "";
        my @sinks = qw( file1 hdfs1 hdfs2 hbase1 );
        my $sinks = join(" ", @sinks);

        my $port = $g_collector_avro_port;
        my $onebox = is_one_box();
        $port += 1000 if $onebox;

        $conf .= <<EOF;
    collector.sources = source1
    collector.channels = $sinks
    collector.sinks = $sinks


    collector.sources.source1.channels = $sinks
    collector.sources.source1.type = avro
    collector.sources.source1.bind = 0.0.0.0
    collector.sources.source1.port = $port
    collector.sources.source1.compression-type = deflate
    collector.sources.source1.interceptors = i1 i2 i3 i4

    collector.sources.source1.interceptors.i1.type = timestamp
    collector.sources.source1.interceptors.i1.preserveExisting = true

    collector.sources.source1.interceptors.i2.type = host
    collector.sources.source1.interceptors.i2.preserveExisting = true
    collector.sources.source1.interceptors.i2.useIP = false

    collector.sources.source1.interceptors.i3.type = static
    collector.sources.source1.interceptors.i3.preserveExisting = true
    collector.sources.source1.interceptors.i3.key = category
    collector.sources.source1.interceptors.i3.value = default

    collector.sources.source1.interceptors.i4.type = host
    collector.sources.source1.interceptors.i4.preserveExisting = false
    collector.sources.source1.interceptors.i4.useIP = false
    collector.sources.source1.interceptors.i4.hostHeader = collector


EOF

        for my $sink (@sinks) {
            $conf .= <<EOF;
    collector.channels.$sink.type = memory
    collector.channels.$sink.capacity = 10000
    collector.channels.$sink.transactionCapacity = 100
    collector.channels.$sink.byteCapacityBufferPercentage = 20
    collector.channels.$sink.byteCapacity = 0

EOF
        }

        $conf .= <<EOF;

    collector.sinks.file1.channel = file1
    collector.sinks.file1.type = file_roll
    collector.sinks.file1.sink.directory = $g_data_dir/collector-$port-file1
    collector.sinks.file1.sink.rollInterval = 3600
    collector.sinks.file1.batchSize = 100
    collector.sinks.file1.sink.serializer = text
    collector.sinks.file1.sink.serializer.appendNewline = true
    #collector.sinks.file1.sink.serializer = avro_event
    #collector.sinks.file1.sink.serializer.syncIntervalBytes = 2048000
    #collector.sinks.file1.sink.serializer.compressionCodec = snappy

    collector.sinks.hdfs1.channel = hdfs1
    collector.sinks.hdfs1.type = hdfs
    collector.sinks.hdfs1.hdfs.path = $g_hdfs_paths{hdfs1}/%{category}/%Y%m%d/%H
    collector.sinks.hdfs1.hdfs.filePrefix = %{collector}-$port
    collector.sinks.hdfs1.hdfs.rollInterval = 600
    collector.sinks.hdfs1.hdfs.rollSize = 0
    collector.sinks.hdfs1.hdfs.rollCount = 0
    collector.sinks.hdfs1.hdfs.idleTimeout = 0
    collector.sinks.hdfs1.hdfs.batchSize = 100
    collector.sinks.hdfs1.hdfs.codeC = snappy
    collector.sinks.hdfs1.hdfs.fileType = SequenceFile
    #collector.sinks.hdfs1.serializer = text
    #collector.sinks.hdfs1.serializer.appendNewline = true
    collector.sinks.hdfs1.serializer = avro_event
    collector.sinks.hdfs1.serializer.syncIntervalBytes = 2048000
    collector.sinks.hdfs1.serializer.compressionCodec = null
    #collector.sinks.hdfs1.serializer.compressionCodec = snappy

    collector.sinks.hdfs2.channel = hdfs2
    collector.sinks.hdfs2.type = hdfs
    collector.sinks.hdfs2.hdfs.path = $g_hdfs_paths{hdfs2}/%{category}/%Y%m%d/%H
    collector.sinks.hdfs2.hdfs.filePrefix = %{collector}-$port
    collector.sinks.hdfs2.hdfs.rollInterval = 600
    collector.sinks.hdfs2.hdfs.rollSize = 0
    collector.sinks.hdfs2.hdfs.rollCount = 0
    collector.sinks.hdfs2.hdfs.idleTimeout = 0
    collector.sinks.hdfs2.hdfs.batchSize = 100
    collector.sinks.hdfs2.hdfs.codeC = snappy
    collector.sinks.hdfs2.hdfs.fileType = SequenceFile
    #collector.sinks.hdfs2.serializer = text
    #collector.sinks.hdfs2.serializer.appendNewline = true
    collector.sinks.hdfs2.serializer = avro_event
    collector.sinks.hdfs2.serializer.syncIntervalBytes = 2048000
    collector.sinks.hdfs2.serializer.compressionCodec = null
    #collector.sinks.hdfs2.serializer.compressionCodec = snappy

    collector.sinks.hbase1.channel = hbase1
    collector.sinks.hbase1.type = hbase
    collector.sinks.hbase1.table = log
    collector.sinks.hbase1.columnFamily = log

EOF

        $conf =~ s/^ +//mg;

        die "$g_collector_conf already exists!" if !$g_overwrite_conf && -e $g_collector_conf;
        open my $fh, ">", $g_collector_conf or die "Can't write $g_collector_conf: $!";
        print $fh $conf;
        close $fh;
    }

    sub is_one_box {
        # True when the collector host list contains duplicates, i.e. everything
        # runs on one box and the listen ports must be staggered.
        my %h = map { $_ => 1 } @g_collector_hosts;
        return keys(%h) < @g_collector_hosts;
    }