• [imooc Hands-On] Spark Streaming Real-Time Stream Processing Project, Notes 16 (Upgraded Inscription Edition)


    Inscription Level 1:

    Linux crontab
    Reference site: http://tool.lu/crontab
    crontab expression that runs once every minute: */1 * * * *

    crontab -e
    */1 * * * * /home/hadoop/data/project/log_generator.sh


    Feed the logs produced by the Python log generator into Flume
    streaming_project.conf

    Component selection: access.log ==> console output
    source: exec
    channel: memory
    sink: logger


    exec-memory-logger.sources = exec-source
    exec-memory-logger.sinks = logger-sink
    exec-memory-logger.channels = memory-channel

    exec-memory-logger.sources.exec-source.type = exec
    exec-memory-logger.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access.log
    exec-memory-logger.sources.exec-source.shell = /bin/sh -c

    exec-memory-logger.channels.memory-channel.type = memory

    exec-memory-logger.sinks.logger-sink.type = logger

    exec-memory-logger.sources.exec-source.channels = memory-channel
    exec-memory-logger.sinks.logger-sink.channel = memory-channel

    flume-ng agent \
    --name exec-memory-logger \
    --conf $FLUME_HOME/conf \
    --conf-file /home/hadoop/data/project/streaming_project.conf \
    -Dflume.root.logger=INFO,console


    Logs ==> Flume ==> Kafka
    Start ZooKeeper: ./zkServer.sh start
    Start the Kafka server: kafka-server-start.sh -daemon /home/hadoop/app/kafka_2.11-0.9.0.0/config/server.properties
    Modify the Flume configuration so that the Flume sink writes the data to Kafka

    streaming_project2.conf
    exec-memory-kafka.sources = exec-source
    exec-memory-kafka.sinks = kafka-sink
    exec-memory-kafka.channels = memory-channel

    exec-memory-kafka.sources.exec-source.type = exec
    exec-memory-kafka.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access.log
    exec-memory-kafka.sources.exec-source.shell = /bin/sh -c

    exec-memory-kafka.channels.memory-channel.type = memory

    exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
    exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
    exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic
    exec-memory-kafka.sinks.kafka-sink.batchSize = 5
    exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1

    exec-memory-kafka.sources.exec-source.channels = memory-channel
    exec-memory-kafka.sinks.kafka-sink.channel = memory-channel

    flume-ng agent \
    --name exec-memory-kafka \
    --conf $FLUME_HOME/conf \
    --conf-file /home/hadoop/data/project/streaming_project2.conf \
    -Dflume.root.logger=INFO,console

    kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic streamingtopic

    Data cleansing: extract only the fields we need from the raw log

    The cleansed output looks like this:
    ClickLog(46.30.10.167,20171022151701,128,200,-)
    ClickLog(143.132.168.72,20171022151701,131,404,-)
    ClickLog(10.55.168.87,20171022151701,131,500,-)
    ClickLog(10.124.168.29,20171022151701,128,404,-)
    ClickLog(98.30.87.143,20171022151701,131,404,-)
    ClickLog(55.10.29.132,20171022151701,146,404,http://www.baidu.com/s?wd=Storm实战)
    ClickLog(10.87.55.30,20171022151701,130,200,http://www.baidu.com/s?wd=Hadoop基础)
    ClickLog(156.98.29.30,20171022151701,146,500,https://www.sogou.com/web?query=大数据面试)
    ClickLog(10.72.87.124,20171022151801,146,500,-)
    ClickLog(72.124.167.156,20171022151801,112,404,-)

    Once cleansing is done, the logs contain only the entries for the hands-on courses
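The cleansing step could be sketched in plain Scala roughly as follows. The raw line layout is an assumption, since these notes do not show it: tab-separated fields for IP, time (`yyyy-MM-dd HH:mm:ss`), request (such as `GET /class/128.html HTTP/1.1`), status code, and referer. `LogCleaner` and `parseLine` are hypothetical names, and treating URLs under `/class/` as the hands-on courses is likewise assumed.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Same fields as the sample output above.
case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)

object LogCleaner {
  // Assumed input pattern and the compact pattern seen in the sample output.
  private val SourceFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  private val TargetFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Returns Some(ClickLog) for a hands-on course request, None for anything else
  // (wrong field count, other URLs, or values that fail to parse).
  def parseLine(line: String): Option[ClickLog] = {
    val fields = line.split("\t")
    if (fields.length != 5) None
    else Try {
      val Array(ip, rawTime, request, status, referer) = fields
      // The URL is the second space-separated token of the request field.
      val url = request.split(" ")(1)
      if (!url.startsWith("/class/")) None
      else {
        val courseId = url.stripPrefix("/class/").takeWhile(_.isDigit).toInt
        val time = LocalDateTime.parse(rawTime, SourceFormat).format(TargetFormat)
        Some(ClickLog(ip, time, courseId, status.toInt, referer))
      }
    }.toOption.flatten
  }
}
```

In the streaming job, a function like this would typically be applied to each log line with `flatMap`, so that the `None` results are simply dropped.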


    One more note: hopefully your machine's specs are not too low, given everything that has to run:
    Hadoop/ZK/HBase/Spark Streaming/Flume/Kafka
    hadoop000: 8 cores, 8 GB

    Inscription Level 2:

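    Using the scheduling tool (https://tool.lu/crontab):

    Linux crontab scheduling

    The command is: crontab -e

    Then add this schedule in the editor: */1 * * * *    // the "1" means every 1 minute

    vi log_generator.sh  // put the command to run in it: python /home/hadoop/data/project/generate_log.py

    To verify that logs are being written, run this from the project directory in a second terminal: tail -200f logs/access.log

    After running the log_generator.sh script, new data shows up in the second terminal as well

    chmod u+x log_generator.sh   // add execute permission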

    Collecting the logs in real time with Flume:

    streaming_project.conf (exec-memory-logger): print to the console first as a test

    exec source:

    type: exec

    command: tail -F /path/

    shell: /bin/sh -c

    Integrating Flume with Kafka:

    streaming_project2.conf (exec-memory-kafka):

    type: org.apache.flume.sink.kafka.KafkaSink

    brokerList, topic, requiredAcks, batchSize

    Start ZooKeeper, start Kafka, and confirm in a terminal that Kafka can consume what the producer sends

    Create the project with spark and utils folders

    ImoocStatStreamingApp

    KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)

    Confirm from the code that the messages sent by the producer can be consumed from Kafka
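In the receiver-based Kafka integration for Spark Streaming (the 0.8 connector), `createStream` takes its topics as a map from topic name to the number of consumer threads. A minimal sketch of building that map from a comma-separated argument is shown below in plain Scala, so it runs without Spark. `TopicArgs` and `toTopicMap` are hypothetical names, and the group id `test` in the commented call is only a placeholder.

```scala
object TopicArgs {
  // Turns "topicA,topicB" plus a thread count into Map(topicA -> n, topicB -> n).
  def toTopicMap(topics: String, numThreads: Int): Map[String, Int] =
    topics.split(",").map(_.trim).filter(_.nonEmpty).map(_ -> numThreads).toMap
}

// With Spark and the Kafka connector on the classpath, the map would then be used like this:
//   val topicMap = TopicArgs.toTopicMap("streamingtopic", 1)
//   val messages = KafkaUtils.createStream(ssc, "hadoop000:2181", "test", topicMap)
//   messages.map(_._2).count().print()   // records are (key, value) pairs; _._2 is the log line
```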

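    Creating the ClickLog class (similar to a JavaBean in Java):

    /**
      * Cleansed log record
      * @param ip  IP address of the request
      * @param time  time of the request
      * @param courseId  ID of the hands-on course that was accessed
      * @param statusCode  status code of the request
      * @param referer  referer of the request
      */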
    case class ClickLog(ip:String, time:String, courseId:Int, statusCode:Int, referer:String)

    Append one more line to the Flume launch command (as a reminder; no spaces around the "="):

    -Dflume.root.logger=INFO,console

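    Developing the time conversion utility:

    import java.util.Date
    // The commons-lang3 package is assumed here (its FastDateFormat supports parse)
    import org.apache.commons.lang3.time.FastDateFormat

    /**
      * Date and time utility
      */
    object DateUtils {
      // Format of the incoming time string
      val YYYYMMDDHHMMSS_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
      // Compact format we want to produce
      val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMddHHmmss")
      // Parse the incoming time string into epoch milliseconds
      def getTime(time: String): Long = {
        YYYYMMDDHHMMSS_FORMAT.parse(time).getTime
      }
      // Reformat the incoming time string into the target format
      def parseToMinute(time: String): String = {
        TARGET_FORMAT.format(new Date(getTime(time)))
      }
      def main(args: Array[String]): Unit = {
        println(parseToMinute("2017-10-22 14:46:01")) // prints 20171022144601
      }
    }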

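    The idea behind the time conversion [new Date(milliseconds as a Long) does the conversion]:

    1. First define two formats: (a) the format of the original time string, (b) the format you want to end up with

    2. Use the original format to parse the time string into milliseconds (a Long), then use the target format to format the Date built from those milliseconds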
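The same two steps can also be sketched with `java.time` from the JDK, whose `DateTimeFormatter` is immutable and thread-safe, so no extra dependency is needed. This swaps `java.time` in for FastDateFormat and is not the course's code; `TimeConvert`, `toMillis`, `fromMillis`, and `convert` are hypothetical names, and the system default time zone is assumed on both sides.

```scala
import java.time.{Instant, LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

object TimeConvert {
  private val SourceFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  private val TargetFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")
  private val Zone = ZoneId.systemDefault()

  // Step 1: parse the original string into epoch milliseconds (a Long).
  def toMillis(time: String): Long =
    LocalDateTime.parse(time, SourceFormat).atZone(Zone).toInstant.toEpochMilli

  // Step 2: format those milliseconds with the target pattern.
  def fromMillis(millis: Long): String =
    TargetFormat.format(Instant.ofEpochMilli(millis).atZone(Zone))

  // Both steps chained together.
  def convert(time: String): String = fromMillis(toMillis(time))
}
```

For example, `TimeConvert.convert("2017-10-22 14:46:01")` yields `20171022144601`.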

    Inscription Level 3:

    For the log time conversion, FastDateFormat replaces SimpleDateFormat to avoid the latter's thread-safety problem (requires the commons-lang dependency):

    private String initDate() {
        Date d = new Date();
        FastDateFormat fdf = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss");
        return fdf.format(d);
    }
    

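    Special ways to view file contents:
      You are probably already familiar with the basic cat, less, and more; for more specific needs:
    1. To see only the first 5 lines of a file, use the head command, e.g.:
    head -5 /etc/passwd
    2. To see the last 10 lines of a file, use the tail command, e.g.:
    tail -10 /etc/passwd
    tail -f /var/log/messages
    The -f option makes tail keep reading newly appended content, which gives a real-time monitoring effect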

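    map(_._2) is equivalent to map(t => t._2) // t is a tuple with at least 2 elements
    map(t => (t._2, t)) // returns a new tuple: the second element first, followed by the original tuple; the shorthand map(_._2, _) does not work here, because each underscore stands for a separate parameter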
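Both forms can be tried on an ordinary Scala collection. A minimal sketch with made-up sample pairs follows; the `TupleMapDemo` name and the data are illustrative only.

```scala
object TupleMapDemo {
  // Sample (key, value) pairs.
  val pairs = List(("k1", "line one"), ("k2", "line two"))

  // Placeholder syntax and the explicit lambda select the same element.
  val valuesShort: List[String] = pairs.map(_._2)
  val valuesLong: List[String] = pairs.map(t => t._2)

  // Using the tuple twice needs a named parameter.
  val valueWithPair: List[(String, (String, String))] = pairs.map(t => (t._2, t))
}
```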

  • Original article: https://www.cnblogs.com/kkxwz/p/8400548.html