Getting Started with Flume

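1. Introduction

Apache Flume is a distributed, reliable, and highly available system for efficiently collecting, aggregating, and moving large volumes of log data from many different sources to a centralized data store. Data senders can be customized inside a logging system to collect data, and Flume can apply simple processing to that data before writing it to a variety of receivers (for example text files, HDFS, or HBase).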

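2. Core Concepts

[Figure]

As the figure shows, the basic unit of data that Flume transfers is the event. For a text file, an event is usually one line, and the event is also the basic unit of a transaction. The agent is the smallest unit that runs independently in Flume: one agent is one JVM process, and it is made up of three components, a Source, a Channel, and a Sink.

The basic components are shown in the figure below:
[Figure]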
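
The source, channel, and sink relationship described above can be pictured with a toy model. The sketch below is only an illustration in plain Java and does not use the Flume API: the class `ToyAgent` and its methods are hypothetical names, events are plain strings, and the real channels are transactional, which this model ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model of one agent: source -> channel -> sink. Not the Flume API.
final class ToyAgent {
    // The channel buffers events between the source side and the sink side.
    private final BlockingQueue<String> channel;
    // Stands in for the final destination that a sink writes to.
    private final List<String> delivered = new ArrayList<>();

    ToyAgent(int channelCapacity) {
        this.channel = new ArrayBlockingQueue<>(channelCapacity);
    }

    // Source side: hand one event to the channel; false means the channel is full.
    boolean put(String event) {
        return channel.offer(event);
    }

    // Sink side: take every buffered event out of the channel and deliver it.
    int drain() {
        int count = 0;
        String event;
        while ((event = channel.poll()) != null) {
            delivered.add(event);
            count++;
        }
        return count;
    }

    List<String> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        ToyAgent agent = new ToyAgent(2);
        agent.put("line 1");
        agent.put("line 2");
        boolean accepted = agent.put("line 3"); // channel already holds 2 events
        System.out.println("third event accepted: " + accepted);
        System.out.println("drained " + agent.drain() + " events: " + agent.delivered());
    }
}
```

The point of the model is that the channel decouples the two sides: the source can keep accepting data while the sink is slow, up to the channel's capacity.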

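3. Key Components

(1) Source: receives incoming data. Many source types are available.
The main types are shown in the figure below:
[Figure]

(2) Channel: a temporary staging area that buffers the events received by the Source until the Sink has consumed them.
The main types are shown in the figure below:
[Figure]

(3) Sink: takes events from the Channel and writes them to centralized storage (HDFS/HBase).
The main types are shown in the figure below:
[Figure]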

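(4) Flume architectures: besides the single-agent setup, more complex data flow topologies are possible.

<1> Multiple agents connected in sequence:
[Figure]

<2> Data from multiple agents consolidated into one agent:
[Figure]

<3> Multiplexing agent:
[Figure]

<4> Load balancing:
[Figure]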

4. Introductory Examples

(1) Under flume/conf/, first create a file named hello.conf

    # Declare the three components
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Define the source
    a1.sources.r1.type=netcat
    a1.sources.r1.bind=localhost
    a1.sources.r1.port=8888

    # Define the sink
    a1.sinks.k1.type=logger

    # Define the channel
    a1.channels.c1.type=memory

    # Bind the source and the sink to the channel
    a1.sources.r1.channels=c1
    a1.sinks.k1.channel=c1

In this configuration file the source type is netcat, the channel type is memory, and the sink type is logger.
Run it to test:

a) Start the Flume agent (from the Flume home directory):
bin/flume-ng agent -f conf/hello.conf -n a1 -Dflume.root.logger=INFO,console
b) Start the nc client:
nc localhost 8888
Then type hello in the nc client.
c) The Flume terminal logs the received event, whose body is hello.
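
If nc is not installed, the same test can be driven from a few lines of Java. This is a minimal sketch, assuming the agent from hello.conf is listening on localhost:8888; the class name `NetcatLineClient` and the two-second read timeout are choices made for this example. The netcat source normally acknowledges each line with OK, which is the reply the method tries to read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Plays the role of the nc client: sends one line over TCP and reads one reply line.
final class NetcatLineClient {

    // Returns the first line the server sends back, or null if nothing arrives in time.
    static String sendLine(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(2000); // give up waiting for a reply after 2 seconds
            OutputStream out = socket.getOutputStream();
            // The receiver splits events on newlines, so terminate the line explicitly.
            out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            try {
                return in.readLine();
            } catch (SocketTimeoutException e) {
                return null;
            }
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println("reply: " + sendLine("localhost", 8888, "hello"));
        } catch (IOException e) {
            System.out.println("could not reach the agent: " + e.getMessage());
        }
    }
}
```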

(2) Real-time log collection: a file named test.txt must exist under /home/txp/

        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        a1.sources.r1.type=exec
        a1.sources.r1.command=tail -F /home/txp/test.txt

        a1.sinks.k1.type=logger

        a1.channels.c1.type=memory

        a1.sources.r1.channels=c1
        a1.sinks.k1.channel=c1
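
To see events arrive with this configuration, something has to keep appending lines to the tailed file. The sketch below is one hypothetical way to generate such test data from Java; the class name `LogLineAppender` is made up for this example, and the file path is passed in rather than hard-coded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Appends newline-terminated lines to a file, the way an application log grows.
final class LogLineAppender {

    static void appendLine(Path file, String line) throws IOException {
        Files.write(file,
                (line + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,   // create the file if it is missing
                StandardOpenOption.APPEND);  // never truncate what tail already read
    }

    public static void main(String[] args) throws IOException {
        // Pass the tailed file as the first argument, e.g. /home/txp/test.txt
        Path file = Paths.get(args.length > 0 ? args[0] : "test.txt");
        for (int i = 1; i <= 3; i++) {
            appendLine(file, "test line " + i);
        }
        System.out.println("appended 3 lines to " + file.toAbsolutePath());
    }
}
```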

(3) Directory monitoring

            a1.sources = r1
            a1.channels = c1
            a1.sinks = k1

            a1.sources.r1.type=spooldir
            a1.sources.r1.spoolDir=/home/txp/spool
            a1.sources.r1.fileHeader=true

            a1.sinks.k1.type=logger

            a1.channels.c1.type=memory

            a1.sources.r1.channels=c1
            a1.sinks.k1.channel=c1

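Here spool is a directory. Create a file outside spool and then move it into spool: the source picks up the new file and the logger sink prints its contents. In other words, the agent watches the spool directory for new files.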
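
The reason for creating the file outside the directory first is that the spooling directory source expects a file to be complete and unchanged once it appears. A minimal sketch of that "write elsewhere, then move in" step is shown below; `SpoolDrop` and the staging directory are hypothetical names, and the atomic move assumes both directories are on the same file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Writes a file completely in a staging directory, then moves it into the spool directory.
final class SpoolDrop {

    static Path drop(Path stagingDir, Path spoolDir, String fileName, List<String> lines)
            throws IOException {
        Files.createDirectories(stagingDir);
        Files.createDirectories(spoolDir);
        Path staged = stagingDir.resolve(fileName);
        // Finish writing before the file becomes visible to the watcher.
        Files.write(staged, lines, StandardCharsets.UTF_8);
        // ATOMIC_MOVE makes the file appear in one step; it fails across file systems.
        return Files.move(staged, spoolDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for /home/txp so the demo is safe to run anywhere.
        Path base = Files.createTempDirectory("spool-demo");
        Path moved = drop(base.resolve("staging"), base.resolve("spool"),
                "app-1.log", Arrays.asList("first line", "second line"));
        System.out.println("file is now at " + moved);
    }
}
```

Each dropped file should also get a name that has not been used in the spool directory before, since reusing names is a common cause of errors with this source.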

(4) HDFS: store the logs in HDFS

        a1.sources = r1
        a1.channels = c1
        a1.sinks = k1

        a1.sources.r1.type = netcat
        a1.sources.r1.bind = localhost
        a1.sources.r1.port = 8888

        a1.sinks.k1.type = hdfs
        a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H/%M/%S
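        # File name prefix: events-
        a1.sinks.k1.hdfs.filePrefix = events-

        # Round the timestamp down before it is substituted into hdfs.path.
        # This controls how often a new directory is created.
        # With roundValue = 10 and roundUnit = second, the %S part of the path
        # changes every 10 seconds, so a new directory appears every 10 seconds.
        a1.sinks.k1.hdfs.round = true
        a1.sinks.k1.hdfs.roundValue = 10
        a1.sinks.k1.hdfs.roundUnit = second
        # Use the local time instead of a timestamp header on the event
        a1.sinks.k1.hdfs.useLocalTimeStamp=true

        # File rolling: start a new file after 10 seconds, 10 bytes, or 3 events,
        # whichever limit is reached first
        a1.sinks.k1.hdfs.rollInterval=10
        a1.sinks.k1.hdfs.rollSize=10
        a1.sinks.k1.hdfs.rollCount=3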

        a1.channels.c1.type=memory

        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
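
To make the effect of the rounding settings concrete, the sketch below computes which directory a given timestamp would land in. It is an approximation written for this article, not Flume's own code: `RoundedBucketPath` is a hypothetical name, the path prefix is copied from the configuration above, and the rounding is done on epoch milliseconds.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Mimics hdfs.round / roundValue / roundUnit = second for the path used above.
final class RoundedBucketPath {

    // Same layout as /flume/events/%y-%m-%d/%H/%M/%S
    private static final DateTimeFormatter LAYOUT =
            DateTimeFormatter.ofPattern("yy-MM-dd/HH/mm/ss");

    static String bucketPath(long epochMillis, int roundSeconds, ZoneId zone) {
        long step = roundSeconds * 1000L;
        // Round down to the nearest multiple of the step.
        long rounded = epochMillis - Math.floorMod(epochMillis, step);
        ZonedDateTime time = Instant.ofEpochMilli(rounded).atZone(zone);
        return "/flume/events/" + LAYOUT.format(time);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(bucketPath(now, 10, ZoneId.systemDefault()));
    }
}
```

Every event whose timestamp falls in the same 10-second window maps to the same directory, which is why small round values combined with %S in the path produce many small directories.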

(5) HBase

        a1.sources = r1
        a1.channels = c1
        a1.sinks = k1

        a1.sources.r1.type = netcat
        a1.sources.r1.bind = localhost
        a1.sources.r1.port = 8888

        a1.sinks.k1.type = hbase
        a1.sinks.k1.table = ns1:t12
        a1.sinks.k1.columnFamily = f1
        a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer

        a1.channels.c1.type=memory

        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1

(6) Chaining agents across a hop with an Avro source and an Avro sink

        #agent----a1
        a1.sources = r1
        a1.sinks= k1
        a1.channels = c1

        a1.sources.r1.type=netcat
        a1.sources.r1.bind=localhost
        a1.sources.r1.port=8888

        a1.sinks.k1.type = avro
        a1.sinks.k1.hostname=localhost
        a1.sinks.k1.port=9999

        a1.channels.c1.type=memory

        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1

        #agent----a2
        a2.sources = r2
        a2.sinks= k2
        a2.channels = c2

        a2.sources.r2.type=avro
        a2.sources.r2.bind=localhost
        a2.sources.r2.port=9999

        a2.sinks.k2.type = logger

        a2.channels.c2.type=memory

        a2.sources.r2.channels = c2
        a2.sinks.k2.channel = c2

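Start a2 first and then a1, because a1's Avro sink connects to the Avro source that a2 opens on port 9999.
Start a2: flume-ng agent -f /soft/flume/conf/avro_hop.conf -n a2 -Dflume.root.logger=INFO,console
Start a1: flume-ng agent -f /soft/flume/conf/avro_hop.conf -n a1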

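5. Getting Started with Custom Interceptors

An interceptor in Flume is attached to a Source and is applied to the events the Source reads, before they continue on toward the Sink. It can add useful information to the event headers or filter events by their content, which gives a first pass of data cleaning. The interceptors that already ship with the source code are:
Timestamp Interceptor
Host Interceptor
Static Interceptor
UUID Interceptor
Morphline Interceptor
Search and Replace Interceptor
Regex Filtering Interceptor
Regex Extractor Interceptor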

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Assumption: JSONObject comes from fastjson. AppBaseLog is the project's own
// log bean and is not shown in this article.
import com.alibaba.fastjson.JSONObject;

/**
 * Custom Flume interceptor: extracts the creation time (in milliseconds) from the
 * JSON body and stores it in the timestamp header, then adds a logType header.
 */
public class LogCollInterceptor implements Interceptor {

    // Read from the configuration, but not consulted by intercept() below.
    private final boolean preserveExisting;

    private LogCollInterceptor(boolean preserveExisting) {
        this.preserveExisting = preserveExisting;
    }

    @Override
    public void initialize() {
    }

    /**
     * Modifies events in-place.
     */
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // Handle the timestamp
        byte[] json = event.getBody();
        String jsonStr = new String(json, StandardCharsets.UTF_8);
        save(jsonStr);
        AppBaseLog log = JSONObject.parseObject(jsonStr, AppBaseLog.class);
        long time = log.getCreatedAtMs();
        headers.put(Constants.TIMESTAMP, Long.toString(time));
        save(time + "");

        // Handle the log-type header; the first matching field name wins
        // pageLog
        String logType = "";
        if (jsonStr.contains("pageId")) {
            logType = "page";
        }
        // eventLog
        else if (jsonStr.contains("eventId")) {
            logType = "event";
        }
        // usageLog
        else if (jsonStr.contains("singleUseDurationSecs")) {
            logType = "usage";
        }
        // error
        else if (jsonStr.contains("errorBrief")) {
            logType = "error";
        }
        // startup
        else if (jsonStr.contains("network")) {
            logType = "startup";
        }
        headers.put("logType", logType);
        save(logType);
        return event;
    }

    /**
     * Delegates to {@link #intercept(Event)} in a loop.
     *
     * @param events the batch of events to process
     * @return the same list, with every event modified in place
     */
    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
    }

    /**
     * Builder that Flume uses to create the interceptor from the agent configuration.
     */
    public static class Builder implements Interceptor.Builder {

        private boolean preserveExisting = Constants.PRESERVE_DFLT;

        @Override
        public Interceptor build() {
            return new LogCollInterceptor(preserveExisting);
        }

        @Override
        public void configure(Context context) {
            preserveExisting = context.getBoolean(Constants.PRESERVE, Constants.PRESERVE_DFLT);
        }
    }

    /**
     * Appends one line to a local debug file.
     */
    private void save(String log) {
        try (FileWriter fw = new FileWriter("/home/centos/l.log", true)) {
            fw.append(log).append("\n");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static class Constants {
        public static final String TIMESTAMP = "timestamp";
        public static final String PRESERVE = "preserveExisting";
        public static final boolean PRESERVE_DFLT = false;
    }

}
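
The log-type detection in the interceptor can be tried out without a running agent. The sketch below lifts that if/else chain into a standalone method so it can be tested directly; `LogTypeClassifier` is a hypothetical helper that is not part of the original class, and it keeps the same plain substring checks.

```java
// Standalone copy of the interceptor's log-type rules, for experimenting without Flume.
final class LogTypeClassifier {

    // Returns the logType header value for a JSON body; the first matching rule wins.
    static String classify(String json) {
        if (json.contains("pageId")) {
            return "page";
        } else if (json.contains("eventId")) {
            return "event";
        } else if (json.contains("singleUseDurationSecs")) {
            return "usage";
        } else if (json.contains("errorBrief")) {
            return "error";
        } else if (json.contains("network")) {
            return "startup";
        }
        return ""; // no known field name found
    }

    public static void main(String[] args) {
        System.out.println(classify("{\"pageId\":\"main\",\"createdAtMs\":1}"));
        System.out.println(classify("{\"network\":\"wifi\",\"createdAtMs\":2}"));
    }
}
```

Because the checks are substring matches on the whole body, a field value that happens to contain one of these names would also match, and the order of the checks decides the result when several names are present. To attach the interceptor itself to a source, the agent configuration would typically list it under the source's interceptors property and set its type to the fully qualified name of the nested Builder class; the package name depends on the project.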

Flume resources
    Official site: http://flume.apache.org/
    User guide: http://flume.apache.org/FlumeUserGuide.html
    Developer guide: http://flume.apache.org/FlumeDeveloperGuide.html

References:
https://www.cnblogs.com/ximengchj/p/6423689.html
https://www.cnblogs.com/zhangyinhua/p/7803486.html
https://blog.csdn.net/yuan_xw/article/details/51143698


Original article: https://www.cnblogs.com/tongxupeng/p/10259543.html