    Hadoop (Flume)

    1. Introduction to Flume

    1.1. Overview

    • Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data.
    • Flume can ingest data from many kinds of sources (files, socket packets, directories, Kafka, and so on) and deliver (sink) the collected data to many external storage systems such as HDFS, HBase, Hive, and Kafka.
    • Typical collection requirements can be met with a simple Flume configuration.
    • Flume also provides good custom extension points for special scenarios,
      so it is suitable for most day-to-day data collection work.

    1.2. How It Works

    1. The core role in a Flume system is the agent; a Flume collection system is built by connecting individual agents together.

    2. Each agent acts as a data courier and contains three components:

      1. Source: the collection component; it connects to the data source and pulls data in.

      2. Sink: the delivery component; it passes data on to the next agent or to the final storage system.

      3. Channel: the transport component; it moves data from the source to the sink.


    1.3. Flume Topologies

    Simple topology

    A single agent collecting data


    Complex topology

    Multiple agents chained in series


    2. Flume Hands-On Examples

    Example: use the telnet command to send some data to a machine over the network, then use Flume to collect the data arriving on that network port.


    2.1. Installing and Deploying Flume

    Step 1: Download, Unpack, and Edit the Configuration File

    Download URL:

    http://archive.apache.org/dist/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz

    Installing Flume is very simple: just unpack the archive. The only prerequisite is an existing Hadoop environment.

    Upload the installation package to the node where the data source lives.

    Here we install it on the third machine (node03).


    cd /export/softwares/
    tar -zxvf apache-flume-1.8.0-bin.tar.gz -C ../servers/
    cd /export/servers/apache-flume-1.8.0-bin/conf
    cp flume-env.sh.template flume-env.sh
    vim flume-env.sh
    export JAVA_HOME=/export/servers/jdk1.8.0_141
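
    As a quick sanity check that the unpacked Flume works, you can print the version (a hedged example; the exact version banner will differ by build):

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng version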
    Step 2: Write the Collection Configuration File

    Describe the collection plan, based on your data-collection requirements, in a configuration file (the file name can be anything you like).

    Create the network-collection configuration file: add a new configuration file (the collection plan) under Flume's conf directory.

    vim /export/servers/apache-flume-1.8.0-bin/conf/netcat-logger.conf

    # Name the components of this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source component r1
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = 192.168.174.
    a1.sources.r1.port = 44444
    # Describe/configure the sink component k1
    a1.sinks.k1.type = logger
    # Describe/configure the channel component; here an in-memory buffer is used
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Wire the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1


    Step 3: Start the Agent with the Configuration File

    Start a Flume agent on the appropriate node, pointing it at the collection-plan configuration file.

    First use the simplest possible example to check that the environment works, then start the agent to collect data:

    bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

    • -c conf: the directory containing Flume's own configuration files
    • -f conf/netcat-logger.conf: the collection plan we just wrote
    • -n a1: the name of this agent

    Step 4: Install Telnet for Testing

    Install the telnet client on node02 to simulate sending data:

    yum -y install telnet
    telnet node03 44444     # use telnet to simulate sending data
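
    After connecting, every line you type is sent as one event. With the logger sink, the agent console on node03 should print each event roughly as shown below (illustrative output; the exact formatting depends on the Flume version):

    # typed in the telnet session on node02
    hello flume
    OK

    # printed by the agent console on node03
    Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65    hello flume }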

    2.2. Collection Examples

    2.2.3. Collecting a Directory into HDFS


    Requirement

    A particular directory on a server keeps receiving new files; whenever a new file appears, it must be collected into HDFS.

    Approach

    Based on the requirement, first define the following three key elements:

    1. The source component (monitor a directory): spooldir
      1. It watches a directory; whenever a new file appears in it, the file's contents are collected.
      2. Once a file has been fully collected, the agent renames it by adding the suffix COMPLETED.
      3. The watched directory must never receive two files with the same name, otherwise the source reports an error and stops.
    2. The sink component (the HDFS file system): hdfs sink
    3. The channel component: either a file channel or a memory channel
    Step 1: Flume Configuration File

    cd /export/servers/apache-flume-1.8.0-bin/conf
    mkdir -p /export/servers/dirfile
    vim spooldir.conf

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    # Note: never drop two files with the same name into the watched directory
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /export/servers/dirfile
    a1.sources.r1.fileHeader = true
    # Describe the sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://node01:8020/spooldir/files/%y-%m-%d/%H%M/
    a1.sinks.k1.hdfs.filePrefix = events-
    # round* controls how often a new target directory is created (here every 10 minutes)
    a1.sinks.k1.hdfs.round = true
    a1.sinks.k1.hdfs.roundValue = 10
    a1.sinks.k1.hdfs.roundUnit = minute
    # roll* controls how the file being written to HDFS is rolled
    # by time interval (seconds)
    a1.sinks.k1.hdfs.rollInterval = 3
    # by file size (bytes)
    a1.sinks.k1.hdfs.rollSize = 20
    # by number of events
    a1.sinks.k1.hdfs.rollCount = 5
    a1.sinks.k1.hdfs.batchSize = 1
    # (setting a roll* value to 0 disables that rolling criterion)
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # output file type; the default is SequenceFile, DataStream writes plain text
    a1.sinks.k1.hdfs.fileType = DataStream
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    # maximum number of events the channel can hold
    a1.channels.c1.capacity = 1000
    # number of events moved per transaction between source/sink and channel
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    Channel parameter notes

    capacity: the maximum number of events the channel can hold
    transactionCapacity: the maximum number of events taken from the source or delivered to the sink in a single transaction
    keep-alive: how long adding an event to the channel, or removing one from it, may wait before timing out
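
    keep-alive is not set explicitly in the examples in this chapter; if you want to tune it, it is an ordinary memory-channel property. A one-line illustration (3 seconds is Flume's default, so this value is only an example):

    a1.channels.c1.keep-alive = 3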

    Step 2: Start Flume

    bin/flume-ng agent -c ./conf -f ./conf/spooldir.conf -n a1 -Dflume.root.logger=INFO,console

    Step 3: Upload Files into the Watched Directory

    Drop different files into the directory below; remember that file names must not repeat:

    cd /export/servers/dirfile
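
    For example (a rough walk-through; the date and time directories depend on when you run it):

    cd /export/servers/dirfile
    echo "hello spooldir" > test01.txt
    # after the file has been collected it is renamed with the COMPLETED suffix
    ls
    # the collected data shows up under the HDFS path configured in spooldir.conf
    hdfs dfs -ls -R /spooldir/files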

    2.2.4. Collecting a File into HDFS

    Requirement

    A business system writes its logs with log4j, for example, and the log file keeps growing; the data appended to the log file must be collected into HDFS in near real time.

    Analysis

    Based on the requirement, first define the following three key elements:

    • The source: an exec source running 'tail -F file' to monitor appends to the file
    • The sink (the HDFS file system): hdfs sink
    • The channel between source and sink: either a file channel or a memory channel

    Step 1: Define the Flume Configuration File

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim tail-file.conf
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = exec
    # the file to tail (make sure it matches the file your test script appends to in Steps 3-4)
    a1.sources.r1.command = tail -F /root/logs/test.log
    a1.sources.r1.channels = c1
    # Describe the sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H-%M/
    a1.sinks.k1.hdfs.filePrefix = events-
    a1.sinks.k1.hdfs.round = true
    a1.sinks.k1.hdfs.roundValue = 10
    a1.sinks.k1.hdfs.roundUnit = minute
    a1.sinks.k1.hdfs.rollInterval = 3
    a1.sinks.k1.hdfs.rollSize = 20
    a1.sinks.k1.hdfs.rollCount = 5
    a1.sinks.k1.hdfs.batchSize = 1
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # output file type; the default is SequenceFile, DataStream writes plain text
    a1.sinks.k1.hdfs.fileType = DataStream
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    Step 2: Start Flume

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -c conf -f conf/tail-file.conf -n a1 -Dflume.root.logger=INFO,console

    Step 3: Write a Shell Script That Keeps Appending to the File

    mkdir -p /export/servers/shells/
    cd /export/servers/shells/
    vim tail-file.sh

    #!/bin/bash
    while true
    do
      date >> /export/servers/taillogs/access_log;
      sleep 0.5;
    done

    Step 4: Run the Script

    # create the directory the script writes into
    mkdir -p /export/servers/taillogs
    # run the script
    sh /export/servers/shells/tail-file.sh
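
    While the script is running, the agent keeps rolling small files into the time-based directories configured in tail-file.conf; a quick way to check (paths vary with the current time):

    hdfs dfs -ls -R /flume/tailout | head
    hdfs dfs -cat /flume/tailout/*/*/events-* | head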

    2.2.5. Chaining Agents (Multi-Hop)


    Analysis

    The first agent collects data from a file and sends it over the network to the second agent.
    The second agent receives the data from the first agent and writes it to HDFS.

    Step 1: Install Flume on Node02

    Copy the unpacked Flume directory from node03 to node02:

    cd /export/servers
    scp -r apache-flume-1.8.0-bin/ node02:$PWD

    Step 2: Configure Flume on Node02

    Configure Flume on the node02 machine:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim tail-avro-avro-logger.conf
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /export/servers/taillogs/access_log
    a1.sources.r1.channels = c1
    # Describe the sink
    # an avro sink acts as a data sender
    a1.sinks.k1.type = avro
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hostname = node03
    a1.sinks.k1.port = 4141
    a1.sinks.k1.batch-size = 10
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    Step 3: Copy the Data-Generating Script

    Simply copy the script and data directories from node03 to node02. Run the following on node03:

    cd /export/servers
    scp -r shells/ taillogs/ node02:$PWD

    Step 4: Flume Configuration File on Node03

    Write the Flume configuration file on node03:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim avro-hdfs.conf
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    # an avro source acts as a receiving server
    a1.sources.r1.type = avro
    a1.sources.r1.channels = c1
    a1.sources.r1.bind = node03
    a1.sources.r1.port = 4141
    # Describe the sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://node01:8020/av/%y-%m-%d/%H%M/
    a1.sinks.k1.hdfs.filePrefix = events-
    a1.sinks.k1.hdfs.round = true
    a1.sinks.k1.hdfs.roundValue = 10
    a1.sinks.k1.hdfs.roundUnit = minute
    a1.sinks.k1.hdfs.rollInterval = 3
    a1.sinks.k1.hdfs.rollSize = 20
    a1.sinks.k1.hdfs.rollCount = 5
    a1.sinks.k1.hdfs.batchSize = 1
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # output file type; the default is SequenceFile, DataStream writes plain text
    a1.sinks.k1.hdfs.fileType = DataStream
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    Step 5: Start in Order

    Start the Flume process on node03:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

    Start the Flume process on node02:

    cd /export/servers/apache-flume-1.8.0-bin/
    bin/flume-ng agent -c conf -f conf/tail-avro-avro-logger.conf -n a1 -Dflume.root.logger=INFO,console

    On node02, run the shell script to generate data:

    cd /export/servers/shells
    sh tail-file.sh
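
    If the two agents are wired correctly, node03 starts writing files under the /av path configured in avro-hdfs.conf; a quick check:

    hdfs dfs -ls -R /av | head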

    3. Flume High Availability: Failover

    Having set up a single-node Flume NG, we now build a highly available Flume NG cluster with the architecture described below.

    3.1. Role Assignment

    The Flume agents and collectors are distributed as shown in the following table:

    Name         HOST     Role
    Agent1       node01   Web Server
    Collector1   node02   AgentMstr1
    Collector2   node03   AgentMstr2

    Data from Agent1 flows into both Collector1 and Collector2. Flume NG itself provides a failover mechanism that switches over and recovers automatically. In this scenario there are three log-producing servers in different machine rooms, and all of their logs must be collected into a single cluster for storage. Below we configure the Flume NG cluster.

    3.2. Installing and Configuring Node01

    Copy the Flume installation and the two data-generation directories from node03 to node01.

    Run the following on node03:

    cd /export/servers
    scp -r apache-flume-1.8.0-bin/ node01:$PWD
    scp -r shells/ taillogs/ node01:$PWD

    On node01, write the agent configuration file:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim agent.conf
    # agent1 name
    agent1.channels = c1
    agent1.sources = r1
    agent1.sinks = k1 k2
    # set sink group
    agent1.sinkgroups = g1
    # set source
    agent1.sources.r1.channels = c1
    agent1.sources.r1.type = exec
    agent1.sources.r1.command = tail -F /export/servers/taillogs/access_log
    # set channel
    agent1.channels.c1.type = memory
    agent1.channels.c1.capacity = 1000
    agent1.channels.c1.transactionCapacity = 100
    # set sink1
    agent1.sinks.k1.channel = c1
    agent1.sinks.k1.type = avro
    agent1.sinks.k1.hostname = node02
    agent1.sinks.k1.port = 52020
    # set sink2
    agent1.sinks.k2.channel = c1
    agent1.sinks.k2.type = avro
    agent1.sinks.k2.hostname = node03
    agent1.sinks.k2.port = 52020
    # set sink group
    agent1.sinkgroups.g1.sinks = k1 k2
    # set failover
    agent1.sinkgroups.g1.processor.type = failover
    agent1.sinkgroups.g1.processor.priority.k1 = 10
    agent1.sinkgroups.g1.processor.priority.k2 = 1
    agent1.sinkgroups.g1.processor.maxpenalty = 10000
    A condensed version of the same agent configuration (this variant tails /root/logs/456.log instead):

    # agent1 name
    agent1.channels = c1
    agent1.sources = r1
    agent1.sinks = k1 k2
    # set sink group
    agent1.sinkgroups = g1
    # set channel
    agent1.channels.c1.type = memory
    agent1.channels.c1.capacity = 1000
    agent1.channels.c1.transactionCapacity = 100
    agent1.sources.r1.channels = c1
    agent1.sources.r1.type = exec
    agent1.sources.r1.command = tail -F /root/logs/456.log
    # set sink1
    agent1.sinks.k1.channel = c1
    agent1.sinks.k1.type = avro
    agent1.sinks.k1.hostname = node02
    agent1.sinks.k1.port = 52020
    # set sink2
    agent1.sinks.k2.channel = c1
    agent1.sinks.k2.type = avro
    agent1.sinks.k2.hostname = node03
    agent1.sinks.k2.port = 52020
    # set sink group
    agent1.sinkgroups.g1.sinks = k1 k2
    # set failover
    agent1.sinkgroups.g1.processor.type = failover
    agent1.sinkgroups.g1.processor.priority.k1 = 10
    agent1.sinkgroups.g1.processor.priority.k2 = 1
    agent1.sinkgroups.g1.processor.maxpenalty = 10000

    3.3. Configure the Collectors on Node02 and Node03

    Edit the configuration file on node02:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim collector.conf
    # set agent name
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    # set channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # avro source: receive events sent by the upstream agent
    a1.sources.r1.type = avro
    a1.sources.r1.bind = node02
    a1.sources.r1.port = 52020
    a1.sources.r1.channels = c1
    # set sink to hdfs
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://node01:8020/flume/failover/
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.writeFormat = TEXT
    a1.sinks.k1.hdfs.rollInterval = 10
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
    An alternative, simplified collector configuration for node02 that just logs the received events to the console (useful for testing) instead of writing to HDFS:

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = avro
    a1.sources.r1.channels = c1
    a1.sources.r1.bind = node02
    a1.sources.r1.port = 52020
    # Describe the sink
    a1.sinks.k1.type = logger
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    Edit the configuration file on node03:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim collector.conf
    # set agent name
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    # set channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # avro source: receive events sent by the upstream agent
    a1.sources.r1.type = avro
    a1.sources.r1.bind = node03
    a1.sources.r1.port = 52020
    a1.sources.r1.channels = c1
    # set sink to hdfs
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://node01:8020/flume/failover/
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.writeFormat = TEXT
    a1.sinks.k1.hdfs.rollInterval = 10
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
    An alternative, simplified collector configuration for node03 that just logs the received events to the console (useful for testing) instead of writing to HDFS:

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = avro
    a1.sources.r1.channels = c1
    a1.sources.r1.bind = node03
    a1.sources.r1.port = 52020
    # Describe the sink
    a1.sinks.k1.type = logger
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    3.4. Start in Order

    Start Flume on node03:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console

    Start Flume on node02:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console

    Start Flume on node01:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -n agent1 -c conf -f conf/agent.conf -Dflume.root.logger=DEBUG,console

    On node01, run the data-generation script:

    cd /export/servers/shells
    sh tail-file.sh

    3.5. Failover Test

    Now we test the high availability (failover) of the Flume NG cluster. Scenario: we generate data on Agent1. Because Collector1 is configured with a higher priority than Collector2, Collector1 collects the data and uploads it to the storage system first. We then kill Collector1, and Collector2 takes over the collection and upload work. After that we manually restore the Flume service on Collector1 and generate data on Agent1 again; Collector1 regains the higher-priority collection role.
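
    A rough command-line version of this test (a sketch; the process lookup assumes the collector was started with flume-ng as in section 3.4, so adjust it to your environment):

    # on node02 (Collector1): find the Flume JVM and kill it
    jps -ml | grep flume
    kill -9 <flume-pid>     # <flume-pid> is whatever the previous command printed

    # data should keep arriving in HDFS while Collector2 takes over
    hdfs dfs -ls /flume/failover

    # restart Collector1; with the higher priority it takes the traffic back
    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console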

    Collector1 uploads first.

    Preview of the log content uploaded to the HDFS cluster.

    Collector1 goes down and Collector2 takes over the upload role.

    After the Collector1 service is restarted, it regains the priority upload role.

    4. Flume Load Balancing

    Load balancing spreads work that a single machine (or process) cannot handle on its own. The Load Balancing Sink Processor implements this: Agent1 acts as a routing node that distributes the events buffered in its channel across several sinks, and each sink connects to a separate downstream agent. An example configuration follows.


    Here we use three machines to demonstrate Flume load balancing.

    The three machines are planned as follows:

    node01: collects the data and sends it to node02 and node03

    node02: receives part of the data from node01

    node03: receives part of the data from node01

    Step 1: Flume configuration on node01

    node01 configuration:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim load_banlancer_client.conf
    # agent name
    a1.channels = c1
    a1.sources = r1
    a1.sinks = k1 k2
    # set sink group
    a1.sinkgroups = g1
    # set channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    a1.sources.r1.channels = c1
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /export/servers/taillogs/access_log
    # set sink1 (downstream host and port)
    a1.sinks.k1.channel = c1
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = node02
    a1.sinks.k1.port = 52020
    # set sink2 (downstream host and port)
    a1.sinks.k2.channel = c1
    a1.sinks.k2.type = avro
    a1.sinks.k2.hostname = node03
    a1.sinks.k2.port = 52020
    # set sink group
    a1.sinkgroups.g1.sinks = k1 k2
    # set load balancing with round-robin selection
    a1.sinkgroups.g1.processor.type = load_balance
    a1.sinkgroups.g1.processor.backoff = true
    a1.sinkgroups.g1.processor.selector = round_robin
    a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000

    Step 2: Flume configuration on node02

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim load_banlancer_server.conf
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = avro
    a1.sources.r1.channels = c1
    a1.sources.r1.bind = node02
    a1.sources.r1.port = 52020
    # Describe the sink
    a1.sinks.k1.type = logger
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    Step 3: Flume configuration on node03

    node03 configuration:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim load_banlancer_server.conf
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = avro
    a1.sources.r1.channels = c1
    a1.sources.r1.bind = node03
    a1.sources.r1.port = 52020
    # Describe the sink
    a1.sinks.k1.type = logger
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    Step 4: Start the Flume services

    Start the Flume service on node03:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_server.conf -Dflume.root.logger=DEBUG,console

    Start the Flume service on node02:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_server.conf -Dflume.root.logger=DEBUG,console

    Start the Flume service on node01:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_client.conf -Dflume.root.logger=DEBUG,console

    Step 5: On node01, run the script to generate data

    cd /export/servers/shells

    sh tail-file.sh

    5. Flume Example: Static Interceptor

    1. Scenario

    Two log servers, A and B, continuously produce logs of three main types: access.log, nginx.log, and web.log.

    Requirement:

    Collect access.log, nginx.log, and web.log from machines A and B, aggregate them on machine C, and then store them in HDFS.

    The required directory layout in HDFS is:

    /source/logs/access/20180101/**
    /source/logs/nginx/20180101/**
    /source/logs/web/20180101/**

    2. Scenario Analysis

    3. Data Flow Analysis

    4. Implementation

    Server A has the IP 192.168.174.100.

    Server B has the IP 192.168.174.110.

    Server C is node03.

    Collection-side configuration

    Write the Flume configuration file on node01 and node02:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim exec_source_avro_sink.conf
    # Name the components on this agent
    a1.sources = r1 r2 r3
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the sources
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /export/servers/taillogs/access.log
    a1.sources.r1.interceptors = i1
    # the static interceptor inserts a user-defined key-value pair into each event's headers
    a1.sources.r1.interceptors.i1.type = static
    a1.sources.r1.interceptors.i1.key = type
    a1.sources.r1.interceptors.i1.value = access
    a1.sources.r2.type = exec
    a1.sources.r2.command = tail -F /export/servers/taillogs/nginx.log
    a1.sources.r2.interceptors = i2
    a1.sources.r2.interceptors.i2.type = static
    a1.sources.r2.interceptors.i2.key = type
    a1.sources.r2.interceptors.i2.value = nginx
    a1.sources.r3.type = exec
    a1.sources.r3.command = tail -F /export/servers/taillogs/web.log
    a1.sources.r3.interceptors = i3
    a1.sources.r3.interceptors.i3.type = static
    a1.sources.r3.interceptors.i3.key = type
    a1.sources.r3.interceptors.i3.value = web
    # Describe the sink
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = node03
    a1.sinks.k1.port = 41414
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 20000
    a1.channels.c1.transactionCapacity = 10000
    # Bind the sources and sink to the channel
    a1.sources.r1.channels = c1
    a1.sources.r2.channels = c1
    a1.sources.r3.channels = c1
    a1.sinks.k1.channel = c1

    Server-side configuration

    Write the Flume configuration file on node03:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim avro_source_hdfs_sink.conf
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # define the source
    a1.sources.r1.type = avro
    a1.sources.r1.bind = node03
    a1.sources.r1.port = 41414
    # add a timestamp interceptor
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
    # define the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 20000
    a1.channels.c1.transactionCapacity = 10000
    # define the sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://node01:8020/source/logs/%{type}/%Y%m%d
    a1.sinks.k1.hdfs.filePrefix = events
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.writeFormat = Text
    # use the local time for the time-based escape sequences
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # do not roll files by event count
    a1.sinks.k1.hdfs.rollCount = 0
    # roll files by time (seconds)
    a1.sinks.k1.hdfs.rollInterval = 30
    # roll files by size (bytes)
    a1.sinks.k1.hdfs.rollSize = 10485760
    # number of events written to HDFS per batch
    a1.sinks.k1.hdfs.batchSize = 10000
    # number of threads Flume uses for HDFS operations (create, write, ...)
    a1.sinks.k1.hdfs.threadsPoolSize = 10
    # timeout for HDFS operations (ms)
    a1.sinks.k1.hdfs.callTimeout = 30000
    # wire source, channel and sink together
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
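
    Because the static interceptors tag every event with a type header and the sink path uses %{type}, each log type should end up in its own directory once data is flowing; for example (the date directory is whatever %Y%m%d resolves to):

    hdfs dfs -ls /source/logs
    hdfs dfs -ls /source/logs/access
    hdfs dfs -cat /source/logs/access/*/events* | head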

    Data-generation script on the collection side

    Write a shell script on node01 and node02 to simulate data generation:

    cd /export/servers/shells
    vim server.sh

    #!/bin/bash
    while true
    do
      date >> /export/servers/taillogs/access.log;
      date >> /export/servers/taillogs/web.log;
      date >> /export/servers/taillogs/nginx.log;
      sleep 0.5;
    done

    Start the services in order

    Start Flume on node03 to collect the data:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -c conf -f conf/avro_source_hdfs_sink.conf -name a1 -Dflume.root.logger=DEBUG,console

    Start Flume on node01 and node02 to monitor the log files:

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -c conf -f conf/exec_source_avro_sink.conf -name a1 -Dflume.root.logger=DEBUG,console

    Start the data-generation script on node01 and node02:

    cd /export/servers/shells
    sh server.sh

    5. Project result screenshots

    6. Flume Example 2: Custom Interceptor

    Requirements:

    After the data is collected, use a custom Flume interceptor to drop the fields that are not needed and to hash (MD5) the specified first field before the data is saved to HDFS.

    Comparison of the raw data and the processed data:

    Figure 1: contents of the original file

    Figure 2: the processed data collected on HDFS

    Implementation steps

    Step 1: Create a Maven Java project and add the dependencies

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>cn.le.cloud</groupId>
        <artifactId>example-flume-intercepter</artifactId>
        <version>1.0-SNAPSHOT</version>

        <dependencies>
            <dependency>
                <groupId>org.apache.flume</groupId>
                <artifactId>flume-ng-sdk</artifactId>
                <version>1.8.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.flume</groupId>
                <artifactId>flume-ng-core</artifactId>
                <version>1.8.0</version>
            </dependency>
        </dependencies>

        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.0</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                        <encoding>UTF-8</encoding>
                        <!-- <verbal>true</verbal> -->
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>3.1.1</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <filters>
                                    <filter>
                                        <artifact>*:*</artifact>
                                        <excludes>
                                            <exclude>META-INF/*.SF</exclude>
                                            <exclude>META-INF/*.DSA</exclude>
                                            <exclude>META-INF/*.RSA</exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                                <transformers>
                                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                        <mainClass></mainClass>
                                    </transformer>
                                </transformers>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>

    Step 2: Write the custom Flume interceptor

    package cn.le.iterceptor;

    import com.google.common.base.Charsets;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import static cn.le.iterceptor.CustomParameterInterceptor.Constants.*;

    public class CustomParameterInterceptor implements Interceptor {

        /** Separator between the fields of each line. */
        private final String fields_separator;
        /** Indexes (after splitting) of the fields to keep. */
        private final String indexs;
        /** Separator between the index values. */
        private final String indexs_separator;
        /** Index of the field that must be encrypted (MD5). */
        private final String encrypted_field_index;

        public CustomParameterInterceptor(String fields_separator, String indexs,
                                          String indexs_separator, String encrypted_field_index) {
            String f = fields_separator.trim();
            String i = indexs_separator.trim();
            this.indexs = indexs;
            this.encrypted_field_index = encrypted_field_index.trim();
            if (!f.equals("")) {
                f = UnicodeToString(f);
            }
            this.fields_separator = f;
            if (!i.equals("")) {
                i = UnicodeToString(i);
            }
            this.indexs_separator = i;
        }

        /**
         * Converts escaped unicode sequences in the configured separators
         * (for example \u0009 for a tab or \u002c for a comma) into the real character.
         */
        public static String UnicodeToString(String str) {
            Pattern pattern = Pattern.compile("(\\\\u(\\p{XDigit}{4}))");
            Matcher matcher = pattern.matcher(str);
            char ch;
            while (matcher.find()) {
                ch = (char) Integer.parseInt(matcher.group(2), 16);
                str = str.replace(matcher.group(1), ch + "");
            }
            return str;
        }

        /*
         * Interception logic for a single event: keep only the configured fields
         * and MD5-hash the field at encrypted_field_index.
         */
        public Event intercept(Event event) {
            if (event == null) {
                return null;
            }
            try {
                String line = new String(event.getBody(), Charsets.UTF_8);
                String[] fields_spilts = line.split(fields_separator);
                String[] indexs_split = indexs.split(indexs_separator);
                String newLine = "";
                for (int i = 0; i < indexs_split.length; i++) {
                    int parseInt = Integer.parseInt(indexs_split[i]);
                    // hash the field that must be encrypted
                    if (!"".equals(encrypted_field_index) && encrypted_field_index.equals(indexs_split[i])) {
                        newLine += StringUtils.GetMD5Code(fields_spilts[parseInt]);
                    } else {
                        newLine += fields_spilts[parseInt];
                    }
                    if (i != indexs_split.length - 1) {
                        newLine += fields_separator;
                    }
                }
                event.setBody(newLine.getBytes(Charsets.UTF_8));
                return event;
            } catch (Exception e) {
                return event;
            }
        }

        /*
         * Interception logic for a batch of events.
         */
        public List<Event> intercept(List<Event> events) {
            List<Event> out = new ArrayList<Event>();
            for (Event event : events) {
                Event outEvent = intercept(event);
                if (outEvent != null) {
                    out.add(outEvent);
                }
            }
            return out;
        }

        public void initialize() {
            // nothing to do
        }

        public void close() {
            // nothing to do
        }

        /**
         * Factory class for the interceptor. The Flume configuration references this
         * Builder, which reads the interceptor's custom parameters from the configuration:
         * field separator, field indexes, index separator, encrypted field index, and so on.
         */
        public static class Builder implements Interceptor.Builder {
            /** Separator between the fields of each line. */
            private String fields_separator;
            /** Indexes of the fields to keep. */
            private String indexs;
            /** Separator between the index values. */
            private String indexs_separator;
            /** Index of the field to encrypt. */
            private String encrypted_field_index;

            public void configure(Context context) {
                fields_separator = context.getString(FIELD_SEPARATOR, DEFAULT_FIELD_SEPARATOR);
                indexs = context.getString(INDEXS, DEFAULT_INDEXS);
                indexs_separator = context.getString(INDEXS_SEPARATOR, DEFAULT_INDEXS_SEPARATOR);
                encrypted_field_index = context.getString(ENCRYPTED_FIELD_INDEX, DEFAULT_ENCRYPTED_FIELD_INDEX);
            }

            public Interceptor build() {
                return new CustomParameterInterceptor(fields_separator, indexs, indexs_separator, encrypted_field_index);
            }
        }

        /**
         * Configuration keys and their default values.
         */
        public static class Constants {
            public static final String FIELD_SEPARATOR = "fields_separator";
            public static final String DEFAULT_FIELD_SEPARATOR = " ";
            public static final String INDEXS = "indexs";
            public static final String DEFAULT_INDEXS = "0";
            public static final String INDEXS_SEPARATOR = "indexs_separator";
            public static final String DEFAULT_INDEXS_SEPARATOR = ",";
            public static final String ENCRYPTED_FIELD_INDEX = "encrypted_field_index";
            public static final String DEFAULT_ENCRYPTED_FIELD_INDEX = "";
            public static final String PROCESSTIME = "processTime";
            public static final String DEFAULT_PROCESSTIME = "a";
        }

        /**
         * Utility class: MD5-hash a string.
         */
        public static class StringUtils {
            private final static String[] strDigits = { "0", "1", "2", "3", "4", "5",
                    "6", "7", "8", "9", "a", "b", "c", "d", "e", "f" };

            // convert one byte to its two-character hex representation
            private static String byteToArrayString(byte bByte) {
                int iRet = bByte;
                if (iRet < 0) {
                    iRet += 256;
                }
                int iD1 = iRet / 16;
                int iD2 = iRet % 16;
                return strDigits[iD1] + strDigits[iD2];
            }

            // convert one byte to its decimal string representation (unused helper)
            private static String byteToNum(byte bByte) {
                int iRet = bByte;
                if (iRet < 0) {
                    iRet += 256;
                }
                return String.valueOf(iRet);
            }

            // convert a byte array to a hex string
            private static String byteToString(byte[] bByte) {
                StringBuffer sBuffer = new StringBuffer();
                for (int i = 0; i < bByte.length; i++) {
                    sBuffer.append(byteToArrayString(bByte[i]));
                }
                return sBuffer.toString();
            }

            public static String GetMD5Code(String strObj) {
                String resultString = null;
                try {
                    MessageDigest md = MessageDigest.getInstance("MD5");
                    // md.digest() returns the hash as a byte array
                    resultString = byteToString(md.digest(strObj.getBytes()));
                } catch (NoSuchAlgorithmException ex) {
                    ex.printStackTrace();
                }
                return resultString;
            }
        }
    }

    Step 3: Package the project and upload it to the server

    Build the interceptor into a jar and place it in Flume's lib directory.
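
    A minimal sketch of this step, assuming the Maven project above; the jar name follows from the artifactId and version in the pom, so adjust it to whatever your build actually produces:

    # on the development machine
    mvn clean package

    # copy the shaded jar into Flume's lib directory on node03
    scp target/example-flume-intercepter-1.0-SNAPSHOT.jar node03:/export/servers/apache-flume-1.8.0-bin/lib/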


    Step 4: Write the Flume configuration file

    On the third machine (node03), write the Flume configuration file:

    cd /export/servers/apache-flume-1.8.0-bin/conf
    vim spool-interceptor-hdfs.conf
    a1.channels = c1
    a1.sources = r1
    a1.sinks = s1
    # channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 100000
    a1.channels.c1.transactionCapacity = 50000
    # source
    a1.sources.r1.channels = c1
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /export/servers/intercept
    a1.sources.r1.batchSize = 50
    a1.sources.r1.inputCharset = UTF-8
    a1.sources.r1.interceptors = i1 i2
    a1.sources.r1.interceptors.i1.type = cn.le.iterceptor.CustomParameterInterceptor$Builder
    a1.sources.r1.interceptors.i1.fields_separator = \u0009
    a1.sources.r1.interceptors.i1.indexs = 0,1,3,5,6
    a1.sources.r1.interceptors.i1.indexs_separator = \u002c
    a1.sources.r1.interceptors.i1.encrypted_field_index = 0
    a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
    # sink
    a1.sinks.s1.channel = c1
    a1.sinks.s1.type = hdfs
    a1.sinks.s1.hdfs.path = hdfs://node01:8020/flume/intercept/%Y%m%d
    a1.sinks.s1.hdfs.filePrefix = event
    a1.sinks.s1.hdfs.fileSuffix = .log
    a1.sinks.s1.hdfs.rollSize = 10485760
    a1.sinks.s1.hdfs.rollInterval = 20
    a1.sinks.s1.hdfs.rollCount = 0
    a1.sinks.s1.hdfs.batchSize = 1500
    a1.sinks.s1.hdfs.round = true
    a1.sinks.s1.hdfs.roundUnit = minute
    a1.sinks.s1.hdfs.threadsPoolSize = 25
    a1.sinks.s1.hdfs.useLocalTimeStamp = true
    a1.sinks.s1.hdfs.minBlockReplicas = 1
    a1.sinks.s1.hdfs.fileType = DataStream
    a1.sinks.s1.hdfs.writeFormat = Text
    a1.sinks.s1.hdfs.callTimeout = 60000
    a1.sinks.s1.hdfs.idleTimeout = 60

    Step 5: Upload the test data

    Put the test data into the /export/servers/intercept directory; create the directory if it does not exist:

    mkdir -p /export/servers/intercept

    The test data looks like this:

    13601249301 100 200 300 400 500 600 700
    13601249302 100 200 300 400 500 600 700
    13601249303 100 200 300 400 500 600 700
    13601249304 100 200 300 400 500 600 700
    13601249305 100 200 300 400 500 600 700
    13601249306 100 200 300 400 500 600 700
    13601249307 100 200 300 400 500 600 700
    13601249308 100 200 300 400 500 600 700
    13601249309 100 200 300 400 500 600 700
    13601249310 100 200 300 400 500 600 700
    13601249311 100 200 300 400 500 600 700
    13601249312 100 200 300 400 500 600 700
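
    Given the interceptor settings above (tab as the field separator, indexs = 0,1,3,5,6, encrypted_field_index = 0), each output record should contain only fields 0, 1, 3, 5 and 6 of the input line, with field 0 replaced by its MD5 hash; schematically, with the hash shown as a placeholder and fields separated by the tab character:

    <md5-of-13601249301>  100  300  500  600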

    Step 6: Start Flume

    cd /export/servers/apache-flume-1.8.0-bin
    bin/flume-ng agent -c conf -f conf/spool-interceptor-hdfs.conf -name a1 -Dflume.root.logger=DEBUG,console
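
    Once the agent has drained the spool directory, the processed records can be inspected on HDFS (the date directory is whatever %Y%m%d resolves to on the day you run it):

    hdfs dfs -ls /flume/intercept
    hdfs dfs -cat /flume/intercept/*/event*.log | head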

    The remaining listings implement a custom Flume source that polls data from a MySQL table: the source class MySqlSource and its JDBC helper class QueryMySql.
    1. package cn.le.flumesource;
    2. import org.apache.flume.Context;
    3. import org.apache.flume.Event;
    4. import org.apache.flume.EventDeliveryException;
    5. import org.apache.flume.PollableSource;
    6. import org.apache.flume.conf.Configurable;
    7. import org.apache.flume.event.SimpleEvent;
    8. import org.apache.flume.source.AbstractSource;
    9. import org.slf4j.Logger;
    10. import java.util.ArrayList;
    11. import java.util.HashMap;
    12. import java.util.List;
    13. import static org.slf4j.LoggerFactory.*;
    14. public class MySqlSource extends AbstractSource implements Configurable, PollableSource {
    15. //打印日志
    16. private static final Logger LOG = getLogger(MySqlSource.class);
    17. //定义sqlHelper
    18. private QueryMySql sqlSourceHelper;
    19. @Override
    20. public long getBackOffSleepIncrement() {
    21. return 0;
    22. }
    23. @Override
    24. public long getMaxBackOffSleepInterval() {
    25. return 0;
    26. }
    27. @Override
    28. public void configure(Context context) {
    29. //初始化
    30. sqlSourceHelper = new QueryMySql(context);
    31. }
    32. @Override
    33. public PollableSource.Status process() throws EventDeliveryException {
    34. try {
    35. //查询数据表
    36. List<List<Object>> result = sqlSourceHelper.executeQuery();
    37. //存放event的集合
    38. List<Event> events = new ArrayList<>();
    39. //存放event头集合
    40. HashMap<String, String> header = new HashMap<>();
    41. //如果有返回数据,则将数据封装为event
    42. if (!result.isEmpty()) {
    43. List<String> allRows = sqlSourceHelper.getAllRows(result);
    44. Event event = null;
    45. for (String row : allRows) {
    46. event = new SimpleEvent();
    47. event.setBody(row.getBytes());
    48. event.setHeaders(header);
    49. events.add(event);
    50. }
    51. //将event写入channel
    52. this.getChannelProcessor().processEventBatch(events);
    53. //更新数据表中的offset信息
    54. sqlSourceHelper.updateOffset2DB(result.size());
    55. }
    56. //等待时长
    57. Thread.sleep(sqlSourceHelper.getRunQueryDelay());
    58. return Status.READY;
    59. } catch (InterruptedException e) {
    60. LOG.error("Error procesing row", e);
    61. return Status.BACKOFF;
    62. }
    63. }
    64. @Override
    65. public synchronized void stop() {
    66. LOG.info("Stopping sql source {} ...", getName());
    67. try {
    68. //关闭资源
    69. sqlSourceHelper.close();
    70. } finally {
    71. super.stop();
    72. }
    73. }
    74. }
    1. package cn.le.flumesource;
    2. import org.apache.flume.Context;
    3. import org.apache.flume.conf.ConfigurationException;
    4. import org.apache.http.ParseException;
    5. import org.slf4j.Logger;
    6. import org.slf4j.LoggerFactory;
    7. import java.sql.*;
    8. import java.util.ArrayList;
    9. import java.util.List;
    10. import java.util.Properties;
    11. public class QueryMySql {
    12. private static final Logger LOG = LoggerFactory.getLogger(QueryMySql.class);
    13. private int runQueryDelay, //两次查询的时间间隔
    14. startFrom, //开始id
    15. currentIndex, //当前id
    16. recordSixe = 0, //每次查询返回结果的条数
    17. maxRow; //每次查询的最大条数
    18. private String table, //要操作的表
    19. columnsToSelect, //用户传入的查询的列
    20. customQuery, //用户传入的查询语句
    21. query, //构建的查询语句
    22. defaultCharsetResultSet;//编码集
    23. //上下文,用来获取配置文件
    24. private Context context;
    25. //为定义的变量赋值(默认值),可在flume任务的配置文件中修改
    26. private static final int DEFAULT_QUERY_DELAY = 10000;
    27. private static final int DEFAULT_START_VALUE = 0;
    28. private static final int DEFAULT_MAX_ROWS = 2000;
    29. private static final String DEFAULT_COLUMNS_SELECT = "*";
    30. private static final String DEFAULT_CHARSET_RESULTSET = "UTF-8";
    31. private static Connection conn = null;
    32. private static PreparedStatement ps = null;
    33. private static String connectionURL, connectionUserName, connectionPassword;
    34. //加载静态资源
    35. static {
    36. Properties p = new Properties();
    37. try {
    38. p.load(QueryMySql.class.getClassLoader().getResourceAsStream("jdbc.properties"));
    39. connectionURL = p.getProperty("dbUrl");
    40. connectionUserName = p.getProperty("dbUser");
    41. connectionPassword = p.getProperty("dbPassword");
    42. Class.forName(p.getProperty("dbDriver"));
    43. } catch (Exception e) {
    44. LOG.error(e.toString());
    45. }
    46. }
    47. //获取JDBC连接
    48. private static Connection InitConnection(String url, String user, String pw) {
    49. try {
    50. Connection conn = DriverManager.getConnection(url, user, pw);
    51. if (conn == null)
    52. throw new SQLException();
    53. return conn;
    54. } catch (SQLException e) {
    55. e.printStackTrace();
    56. }
    57. return null;
    58. }
    59. //构造方法
    60. QueryMySql(Context context) throws ParseException {
    61. //初始化上下文
    62. this.context = context;
    63. //有默认值参数:获取flume任务配置文件中的参数,读不到的采用默认值
    64. this.columnsToSelect = context.getString("columns.to.select", DEFAULT_COLUMNS_SELECT);
    65. this.runQueryDelay = context.getInteger("run.query.delay", DEFAULT_QUERY_DELAY);
    66. this.startFrom = context.getInteger("start.from", DEFAULT_START_VALUE);
    67. this.defaultCharsetResultSet = context.getString("default.charset.resultset", DEFAULT_CHARSET_RESULTSET);
    68. //无默认值参数:获取flume任务配置文件中的参数
    69. this.table = context.getString("table");
    70. this.customQuery = context.getString("custom.query");
    71. connectionURL = context.getString("connection.url");
    72. connectionUserName = context.getString("connection.user");
    73. connectionPassword = context.getString("connection.password");
    74. conn = InitConnection(connectionURL, connectionUserName, connectionPassword);
    75. //校验相应的配置信息,如果没有默认值的参数也没赋值,抛出异常
    76. checkMandatoryProperties();
    77. //获取当前的id
    78. currentIndex = getStatusDBIndex(startFrom);
    79. //构建查询语句
    80. query = buildQuery();
    81. }
    82. //校验相应的配置信息(表,查询语句以及数据库连接的参数)
    83. private void checkMandatoryProperties() {
    84. if (table == null) {
    85. throw new ConfigurationException("property table not set");
    86. }
    87. if (connectionURL == null) {
    88. throw new ConfigurationException("connection.url property not set");
    89. }
    90. if (connectionUserName == null) {
    91. throw new ConfigurationException("connection.user property not set");
    92. }
    93. if (connectionPassword == null) {
    94. throw new ConfigurationException("connection.password property not set");
    95. }
    96. }
    97. //构建sql语句
    98. private String buildQuery() {
    99. String sql = "";
    100. //获取当前id
    101. currentIndex = getStatusDBIndex(startFrom);
    102. LOG.info(currentIndex + "");
    103. if (customQuery == null) {
    104. sql = "SELECT " + columnsToSelect + " FROM " + table;
    105. } else {
    106. sql = customQuery;
    107. }
    108. StringBuilder execSql = new StringBuilder(sql);
    109. //以id作为offset
    110. if (!sql.contains("where")) {
    111. execSql.append(" where ");
    112. execSql.append("id").append(">").append(currentIndex);
    113. return execSql.toString();
    114. } else {
    115. int length = execSql.toString().length();
    116. return execSql.toString().substring(0, length - String.valueOf(currentIndex).length()) + currentIndex;
    117. }
    118. }
    119. //执行查询
    120. List<List<Object>> executeQuery() {
    121. try {
    122. //每次执行查询时都要重新生成sql,因为id不同
    123. customQuery = buildQuery();
    124. //存放结果的集合
    125. List<List<Object>> results = new ArrayList<>();
    126. if (ps == null) {
    127. //
    128. ps = conn.prepareStatement(customQuery);
    129. }
    130. ResultSet result = ps.executeQuery(customQuery);
    131. while (result.next()) {
    132. //存放一条数据的集合(多个列)
    133. List<Object> row = new ArrayList<>();
    134. //将返回结果放入集合
    135. for (int i = 1; i <= result.getMetaData().getColumnCount(); i++) {
    136. row.add(result.getObject(i));
    137. }
    138. results.add(row);
    139. }
    140. LOG.info("execSql:" + customQuery + " resultSize:" + results.size());
    141. return results;
    142. } catch (SQLException e) {
    143. LOG.error(e.toString());
    144. // 重新连接
    145. conn = InitConnection(connectionURL, connectionUserName, connectionPassword);
    146. }
    147. return null;
    148. }
    149. //将结果集转化为字符串,每一条数据是一个list集合,将每一个小的list集合转化为字符串
    150. List<String> getAllRows(List<List<Object>> queryResult) {
    151. List<String> allRows = new ArrayList<>();
    152. if (queryResult == null || queryResult.isEmpty())
    153. return allRows;
    154. StringBuilder row = new StringBuilder();
    155. for (List<Object> rawRow : queryResult) {
    156. Object value = null;
    157. for (Object aRawRow : rawRow) {
    158. value = aRawRow;
    159. if (value == null) {
    160. row.append(",");
    161. } else {
    162. row.append(aRawRow.toString()).append(",");
    163. }
    164. }
    165. allRows.add(row.toString());
    166. row = new StringBuilder();
    167. }
    168. return allRows;
    169. }
    170. //更新offset元数据状态,每次返回结果集后调用。必须记录每次查询的offset值,为程序中断续跑数据时使用,以id为offset
    171. void updateOffset2DB(int size) {
    172. //以source_tab做为KEY,如果不存在则插入,存在则更新(每个源表对应一条记录)
    173. String sql = "insert into flume_meta(source_tab,currentIndex) VALUES('"
    174. + this.table
    175. + "','" + (recordSixe += size)
    176. + "') on DUPLICATE key update source_tab=values(source_tab),currentIndex=values(currentIndex)";
    177. LOG.info("updateStatus Sql:" + sql);
    178. execSql(sql);
    179. }
    180. //执行sql语句
    181. private void execSql(String sql) {
    182. try {
    183. ps = conn.prepareStatement(sql);
    184. LOG.info("exec::" + sql);
    185. ps.execute();
    186. } catch (SQLException e) {
    187. e.printStackTrace();
    188. }
    189. }
    190. //获取当前id的offset
    191. private Integer getStatusDBIndex(int startFrom) {
    192. //从flume_meta表中查询出当前的id是多少
    193. String dbIndex = queryOne("select currentIndex from flume_meta where source_tab='" + table + "'");
    194. if (dbIndex != null) {
    195. return Integer.parseInt(dbIndex);
    196. }
    197. //如果没有数据,则说明是第一次查询或者数据表中还没有存入数据,返回最初传入的值
    198. return startFrom;
    199. }
    200. //查询一条数据的执行语句(当前id)
    201. private String queryOne(String sql) {
    202. ResultSet result = null;
    203. try {
    204. ps = conn.prepareStatement(sql);
    205. result = ps.executeQuery();
    206. while (result.next()) {
    207. return result.getString(1);
    208. }
    209. } catch (SQLException e) {
    210. e.printStackTrace();
    211. }
    212. return null;
    213. }
    214. //关闭相关资源
    215. void close() {
    216. try {
    217. ps.close();
    218. conn.close();
    219. } catch (SQLException e) {
    220. e.printStackTrace();
    221. }
    222. }
    223. int getCurrentIndex() {
    224. return currentIndex;
    225. }
    226. void setCurrentIndex(int newValue) {
    227. currentIndex = newValue;
    228. }
    229. int getRunQueryDelay() {
    230. return runQueryDelay;
    231. }
    232. String getQuery() {
    233. return query;
    234. }
    235. String getConnectionURL() {
    236. return connectionURL;
    237. }
    238. private boolean isCustomQuerySet() {
    239. return (customQuery != null);
    240. }
    241. Context getContext() {
    242. return context;
    243. }
    244. public String getConnectionUserName() {
    245. return connectionUserName;
    246. }
    247. public String getConnectionPassword() {
    248. return connectionPassword;
    249. }
    250. String getDefaultCharsetResultSet() {
    251. return defaultCharsetResultSet;
    252. }
    253. }
    Original post: https://www.cnblogs.com/leccoo/p/11384186.html