• Flume


    1、Flume

    Overview: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
    1) Data collection (crawler/log data: Flume)
    2) Data storage (HDFS/Hive/HBase (NoSQL))
    3) Data computation (MapReduce/Hive/Spark SQL/Spark Streaming/Flink)
    4) Data visualization

    2、Flume Roles

    1) source
    The data source, used to collect data; the source produces the data stream and passes it on to the channel.

    2) channel
    The transport channel, which bridges the source and the sink.

    3) sink
    The sink pulls data from the channel and delivers it to the destination.

    4) agent
    An agent is a JVM process that hosts the source, channel, and sink; within Flume, the event is the basic unit of data transfer.

    3、Flume Usage

    Simple and easy to use: you only need to write a configuration file.

    4、Flume Installation and Configuration

    1) Download Flume
    2) Upload it to the Linux machine
    3) Extract it
    tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /root/hd
    4) Rename it
    mv apache-flume-1.6.0-bin/ flume
    cp flume-env.sh.template flume-env.sh    (run inside the flume conf/ directory)
    5) Edit the configuration
    vi flume-env.sh
    export JAVA_HOME=/root/hd/jdk1.8.0_192
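
    To confirm the setup before moving on, a quick sanity check (a minimal sketch, assuming the /root/hd/flume directory created above):

    cd /root/hd/flume
    bin/flume-ng version    # should print the Flume 1.6.0 version and build information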

    5、Flume Listening on a Port

    Startup command:
    bin/flume-ng agent --conf conf/log4j.properties --name a1 --conf-file conf/flumejob_telnet.conf

    I have already hit this pitfall: I recommend that the path given after --conf be a full path, pointing at log4j.properties. My instructor taught it as just conf/, but when I ran it that way there was a problem and the output was not shown in real time.

    bin/flume-ng agent                      start an agent with the flume-ng launcher
    --conf conf/log4j.properties            the folder containing the Flume configuration
    --name a1                               the name of this agent
    --conf-file conf/flumejob_telnet.conf   the job configuration file
    -Dflume.root.logger=INFO,console        log level, written to the console

    flumejob_telnet.conf

    # sample.conf: A single-node Flume configuration
    # Name the components on this agent; naming them makes them easy to reference, and the plural (sources/sinks/channels) allows more than one of each role
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source: customize the source role here
    # This configuration is a TCP source; the type must be netcat
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    # Describe the sink: log the events
    a1.sinks.k1.type = logger
    # Use a channel which buffers events in memory (or a file); capacity 1000 events, 100 events per transaction
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel: a source can bind to multiple channels,
    # but a sink can bind to only one channel (the model shown in figure 2)
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    [root@hsiehchou121 flume]# bin/flume-ng agent \
    > --conf conf/ \
    > --name a1 \
    > --conf-file conf/flumejob_telnet.conf \
    > -Dflume.root.logger=INFO,console

    yum search telnet
    yum install telnet.x86_64
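
    With the agent running, send a test event to the netcat source (a minimal sketch, assuming the configuration above is listening on localhost:44444):

    telnet localhost 44444
    # telnet is interactive: type a line such as
    hello flume
    # the agent's console should log an Event whose body contains "hello flume"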

    6、Flume Monitoring a Local Linux File and Collecting into HDFS

    Startup command:
    bin/flume-ng agent --conf conf/log4j.properties --name a1 --conf-file conf/flumejob_hdfs.conf

    flumejob_hdfs.conf

    # Name the components on this agent (set the agent's component names)
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source: monitor a local file
    # exec runs a command to read the file; tail -F follows it in real time
    a1.sources.r1.type = exec
    # The command to execute; tail -F starts from the last 10 lines by default (see man tail)
    a1.sources.r1.command = tail -F /tmp/root/hive.log
    # Which shell runs the command; -c passes the command string to it
    # whereis bash
    # bash: /usr/bin/bash /usr/share/man/man1/bash.1.gz
    a1.sources.r1.shell = /usr/bin/bash -c
    # Describe the sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://hsiehchou121:9000/flume/%Y%m%d/%H
    # Prefix for the uploaded files
    a1.sinks.k1.hdfs.filePrefix = logs-
    # Whether to round the directory path based on time
    a1.sinks.k1.hdfs.round = true
    # Create a new directory every this many time units
    a1.sinks.k1.hdfs.roundValue = 1
    # The time unit for rounding (here: a new directory every minute)
    a1.sinks.k1.hdfs.roundUnit = minute
    # Whether to use the local timestamp
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a1.sinks.k1.hdfs.batchSize = 500
    # File type; compressed types are also supported
    a1.sinks.k1.hdfs.fileType = DataStream
    # Roll to a new file after this many seconds
    a1.sinks.k1.hdfs.rollInterval = 30
    # Roll size per file in bytes (just under 128 MB, a sensible match for the HDFS block size)
    a1.sinks.k1.hdfs.rollSize = 134217700
    # File rolling is independent of the number of events (0 disables count-based rolling)
    a1.sinks.k1.hdfs.rollCount = 0
    # Minimum block replicas; set to 1 so the rolling settings above take effect (HDFS itself already keeps 3 replicas)
    a1.sinks.k1.hdfs.minBlockReplicas = 1
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    [root@hsiehchou121 flume]# bin/flume-ng agent \
    > --conf conf/log4j.properties \
    > --name a1 \
    > --conf-file conf/flumejob_hdfs.conf
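
    To verify the pipeline, append a line to the monitored file and look for the output in HDFS (a minimal sketch, assuming the default filesystem is hdfs://hsiehchou121:9000; the date-based path below is just the %Y%m%d/%H pattern from the sink expanded for the current hour):

    echo "hive log test $(date)" >> /tmp/root/hive.log
    hdfs dfs -ls /flume/$(date +%Y%m%d)/$(date +%H)    # files with the logs- prefix should appear here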

    7、Monitoring a Directory

    flumejob_dir.conf

    # Define the component names
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = spooldir
    # The directory to monitor
    a1.sources.r1.spoolDir = /root/testdir
    # Suffix added to a file name once it has been uploaded successfully
    a1.sources.r1.fileSuffix = .COMPLETED
    # Whether to add a header containing the file's absolute path (default false)
    a1.sources.r1.fileHeader = true
    # Ignore all files ending in .tmp (still being written); do not upload them
    # The pattern matches any run of non-space characters ending in .tmp
    a1.sources.r1.ignorePattern = ([^ ]*.tmp)
    # Describe the sink: sink the data into HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://hsiehchou121:9000/flume/testdir/%Y%m%d/%H
    # Prefix for the uploaded files
    a1.sinks.k1.hdfs.filePrefix = testdir-
    # Whether to round the directory path based on time
    a1.sinks.k1.hdfs.round = true
    # Create a new directory every this many time units
    a1.sinks.k1.hdfs.roundValue = 1
    # The time unit for rounding (here: a new directory every hour)
    a1.sinks.k1.hdfs.roundUnit = hour
    # Whether to use the local timestamp
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a1.sinks.k1.hdfs.batchSize = 100
    # File type; compressed types are also supported
    a1.sinks.k1.hdfs.fileType = DataStream
    # Roll to a new file after this many seconds
    a1.sinks.k1.hdfs.rollInterval = 600
    # Roll size per file, roughly 128 MB
    a1.sinks.k1.hdfs.rollSize = 134217700
    # File rolling is independent of the number of events
    a1.sinks.k1.hdfs.rollCount = 0
    # Minimum block replicas
    a1.sinks.k1.hdfs.minBlockReplicas = 1
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    [root@hsiehchou121 flume]# bin/flume-ng agent --conf conf/log4j.properties --name a1 --conf-file conf/flumejob_dir.conf

    [root@hsiehchou121 flume]# bin/flume-ng agent \
    > --conf conf/log4j.properties \
    > --name a1 \
    > --conf-file conf/flumejob_dir.conf
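
    To test, drop a file into the monitored directory; once the upload finishes, Flume renames it with the .COMPLETED suffix (a minimal sketch, assuming /root/testdir exists and the agent above is running):

    echo "spooldir test" > /root/testdir/a.log
    ls /root/testdir                                           # after a moment: a.log.COMPLETED
    hdfs dfs -ls /flume/testdir/$(date +%Y%m%d)/$(date +%H)    # uploaded files with the testdir- prefix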

    8、Multiple Channels/Sinks

    Requirement: monitor the hive.log file and fan the data out to two channels at the same time; the sink on one channel stores to HDFS, while the sink on the other channel stores to the local filesystem. Agent a1 (flumejob_1.conf) tails the file and replicates events to two avro sinks; agent a2 (flumejob_2.conf) receives on port 4141 and writes to HDFS; agent a3 (flumejob_3.conf) receives on port 4142 and writes to a local directory.
    flumejob_1.conf

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1 k2
    a1.channels = c1 c2
    # Replicate the data stream to multiple channels
    a1.sources.r1.selector.type = replicating
    # Describe/configure the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /tmp/root/hive.log
    a1.sources.r1.shell = /bin/bash -c
    # Describe the sink
    # Send the data out on two separate ports
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = hsiehchou121
    a1.sinks.k1.port = 4141
    a1.sinks.k2.type = avro
    a1.sinks.k2.hostname = hsiehchou121
    a1.sinks.k2.port = 4142
    # Describe the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    a1.channels.c2.type = memory
    a1.channels.c2.capacity = 1000
    a1.channels.c2.transactionCapacity = 100
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1 c2
    a1.sinks.k1.channel = c1
    a1.sinks.k2.channel = c2

    [root@hsiehchou121 flume]# bin/flume-ng agent --conf conf/log4j.properties --name a1 --conf-file conf/flumejob_1.conf

    flumejob_2.conf

    # Name the components on this agent
    a2.sources = r1
    a2.sinks = k1
    a2.channels = c1
    # Describe/configure the source
    a2.sources.r1.type = avro
    # Receive data on this port
    a2.sources.r1.bind = hsiehchou121
    a2.sources.r1.port = 4141
    # Describe the sink
    a2.sinks.k1.type = hdfs
    a2.sinks.k1.hdfs.path = hdfs://hsiehchou121:9000/flume2/%Y%m%d/%H
    # Prefix for the uploaded files
    a2.sinks.k1.hdfs.filePrefix = flume2-
    # Whether to round the directory path based on time
    a2.sinks.k1.hdfs.round = true
    # Create a new directory every this many time units
    a2.sinks.k1.hdfs.roundValue = 1
    # The time unit for rounding (here: a new directory every hour)
    a2.sinks.k1.hdfs.roundUnit = hour
    # Whether to use the local timestamp
    a2.sinks.k1.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a2.sinks.k1.hdfs.batchSize = 100
    # File type; compressed types are also supported
    a2.sinks.k1.hdfs.fileType = DataStream
    # Roll to a new file after this many seconds
    a2.sinks.k1.hdfs.rollInterval = 600
    # Roll size per file, roughly 128 MB
    a2.sinks.k1.hdfs.rollSize = 134217700
    # File rolling is independent of the number of events
    a2.sinks.k1.hdfs.rollCount = 0
    # Minimum block replicas
    a2.sinks.k1.hdfs.minBlockReplicas = 1
    # Describe the channel
    a2.channels.c1.type = memory
    a2.channels.c1.capacity = 1000
    a2.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a2.sources.r1.channels = c1
    a2.sinks.k1.channel = c1

    [root@hsiehchou121 flume]# bin/flume-ng agent --conf conf/log4j.properties --name a2 --conf-file conf/flumejob_2.conf

    flumejob_3.conf

    # Name the components on this agent
    a3.sources = r1
    a3.sinks = k1
    a3.channels = c1
    # Describe/configure the source
    a3.sources.r1.type = avro
    a3.sources.r1.bind = hsiehchou121
    a3.sources.r1.port = 4142
    # Describe the sink
    a3.sinks.k1.type = file_roll
    a3.sinks.k1.sink.directory = /root/flume2
    # Describe the channel
    a3.channels.c1.type = memory
    a3.channels.c1.capacity = 1000
    a3.channels.c1.transactionCapacity = 100
    # Bind the source and sink to the channel
    a3.sources.r1.channels = c1
    a3.sinks.k1.channel = c1

    [root@hsiehchou121 flume]# bin/flume-ng agent --conf conf/log4j.properties --name a3 --conf-file conf/flumejob_3.conf
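
    Start order and verification (a minimal sketch; the /root/flume2 output directory is an assumption required by the file_roll sink, which does not create it for you):

    mkdir -p /root/flume2                               # local output directory for agent a3
    # Start the receiving agents a2 and a3 first, then the sending agent a1,
    # so the avro sinks on ports 4141/4142 have something to connect to.
    echo "fan-out test $(date)" >> /tmp/root/hive.log
    ls /root/flume2                                     # file_roll output from agent a3
    hdfs dfs -ls /flume2/$(date +%Y%m%d)/$(date +%H)    # HDFS output from agent a2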
