• Spark Streaming: Hands-On Project


    12-1 -Course Outline

    Hands-on project

    Requirements

    Overview of web access logs

    Feature development and local run

    Running in the production environment

    12-2 -Requirements

    Page views of the hands-on courses from the start of today up to now

    Page views of the hands-on courses referred from search engines, from the start of today up to now

    12-3 -Introduction to User Behavior Logs

    Why record user access behavior logs?

    Page views of the site

    Site stickiness

    Recommendations

    The value of user behavior log analysis:

    the eyes of the site

    the nerves of the site

    the brain of the site

    12-4 -Python Log Generator: Producing Access URLs and IP Addresses

    12-5 -Python Log Generator: Producing Referers and Status Codes

     

     

     

    12-6 -Python Log Generator: Producing Access Timestamps

    12-7 -Testing the Python Log Generator on the Server and Writing Logs to a File
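    The notes do not reproduce the generator source itself, so below is a minimal sketch of what a log generator covering 12-4 through 12-7 could look like. The sample URLs, IP segments, referers, status codes, the file name generate_log.py and the output path are illustrative assumptions, not the course's actual values; only the tab-separated field order (ip, time, request, status code, referer) is what the Spark code later in these notes expects.

    # generate_log.py -- illustrative sketch of a web-access-log generator (assumed name and path).
    # Output format (tab-separated): ip  time  "GET <url> HTTP/1.1"  status_code  referer
    import random
    import time

    # assumed sample data
    url_paths = ["class/112.html", "class/128.html", "class/130.html", "class/146.html", "learn/821", "course/list"]
    ip_slices = [132, 156, 124, 10, 29, 167, 143, 187, 30, 46, 55, 63, 72, 98, 168]
    http_referers = [
        "https://www.baidu.com/s?wd={query}",
        "https://www.sogou.com/web?query={query}",
        "https://cn.bing.com/search?q={query}",
        "https://search.yahoo.com/search?p={query}"
    ]
    search_keywords = ["Spark SQL实战", "Hadoop基础", "Storm实战", "Spark Streaming实战", "大数据面试"]
    status_codes = ["200", "404", "500"]


    def sample_url():
        return random.sample(url_paths, 1)[0]


    def sample_ip():
        return ".".join(str(x) for x in random.sample(ip_slices, 4))


    def sample_referer():
        # most requests are direct accesses, marked with "-"
        if random.uniform(0, 1) > 0.2:
            return "-"
        referer = random.sample(http_referers, 1)[0]
        return referer.format(query=random.sample(search_keywords, 1)[0])


    def sample_status_code():
        return random.sample(status_codes, 1)[0]


    def generate_log(count=100):
        time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        # assumed output path: the file that Flume tails in 12-9
        with open("/home/hadoop/data/project/logs/access.log", "w+") as f:
            for _ in range(count):
                f.write("{ip}\t{t}\t\"GET /{url} HTTP/1.1\"\t{code}\t{referer}\n".format(
                    ip=sample_ip(), t=time_str, url=sample_url(),
                    code=sample_status_code(), referer=sample_referer()))


    if __name__ == "__main__":
        generate_log(100)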

    12-8 -Producing a Batch of Data Every Minute with a Scheduler

    Linux crontab

    https://tool.lu/crontab

    Crontab expression for running once every minute: */1 * * * *

    crontab -e

    */1 * * * * /home/hadoop/data/project/log_generator.sh
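    The content of log_generator.sh is not shown in the notes; presumably it is a small wrapper that runs the Python generator, along the lines of the hypothetical sketch below (script and generator paths are assumptions):

    #!/bin/bash
    # hypothetical wrapper invoked by crontab every minute
    python /home/hadoop/data/project/generate_log.py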

    12-9 -Collecting Log Data in Real Time with Flume

    Wire up the Flume & Kafka & Spark Streaming pipeline

    Feed the logs produced by the Python log generator into Flume

    streaming_project.conf

    Component selection: access.log ==> console output

    source:  exec

    channel: memory

    sink:    logger

    For details see: http://flume.apache.org/

    exec-memory-logger.sources = exec-source
    exec-memory-logger.sinks = logger-sink
    exec-memory-logger.channels = memory-channel

    exec-memory-logger.sources.exec-source.type = exec
    exec-memory-logger.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access.log
    exec-memory-logger.sources.exec-source.shell = /bin/sh -c

    exec-memory-logger.channels.memory-channel.type = memory

    exec-memory-logger.sinks.logger-sink.type = logger

    exec-memory-logger.sources.exec-source.channels = memory-channel
    exec-memory-logger.sinks.logger-sink.channel = memory-channel

     

    Start the Flume agent
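    The notes do not show the start command; a typical invocation for the agent defined above would look something like the following (install paths and the config file location are assumptions):

    flume-ng agent \
      --name exec-memory-logger \
      --conf $FLUME_HOME/conf \
      --conf-file /home/hadoop/data/project/streaming_project.conf \
      -Dflume.root.logger=INFO,console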

    12-10 -Feeding the Real-Time Log Data into Kafka and Testing with Console Output

    Logs ==> Flume ==> Kafka

    1. Start ZooKeeper

    ./zkServer.sh start

    2. Start the Kafka server

    ./kafka-server-start.sh -daemon /home/hadoop/app/kafka_2.11-0.9.0.0/config/server.properties

    3. Modify the Flume configuration so that Flume sinks the data to Kafka

    exec-memory-kafka.sources = exec-source
    exec-memory-kafka.sinks = kafka-sink
    exec-memory-kafka.channels = memory-channel

    exec-memory-kafka.sources.exec-source.type = exec
    exec-memory-kafka.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access.log
    exec-memory-kafka.sources.exec-source.shell = /bin/sh -c

    exec-memory-kafka.channels.memory-channel.type = memory

    exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
    # The Kafka sink also needs a broker list and a topic; the values below are
    # placeholders (property names as in the Flume 1.6 Kafka sink).
    exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
    exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic

    exec-memory-kafka.sources.exec-source.channels = memory-channel
    exec-memory-kafka.sinks.kafka-sink.channel = memory-channel
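    To verify the Flume ==> Kafka leg, one would typically create the topic, start this agent, and watch the topic with a console consumer. The commands below are a sketch for Kafka 0.9: the topic name matches the placeholder used in the sink configuration above, and the config file name streaming_project2.conf is an assumption.

    ./kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 1 --partitions 1 --topic streamingtopic

    flume-ng agent --name exec-memory-kafka --conf $FLUME_HOME/conf \
      --conf-file /home/hadoop/data/project/streaming_project2.conf -Dflume.root.logger=INFO,console

    ./kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic streamingtopic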

     

     

    12-11 -Consuming the Kafka Data with Spark Streaming

    Wire up the Flume & Kafka & Spark Streaming pipeline

    Process the data coming from Kafka in the Spark application

    Source code: https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/spark/ImoocStatStreamingApp.scala

    Source:

    package com.imooc.spark.project.spark

    import com.imooc.spark.project.dao.{CourseClickCountDAO, CourseSearchClickCountDAO}
    import com.imooc.spark.project.domain.{ClickLog, CourseClickCount, CourseSearchClickCount}
    import com.imooc.spark.project.utils.DateUtils
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    import scala.collection.mutable.ListBuffer

    /**
     * Process the data coming from Kafka with Spark Streaming
     */
    object ImoocStatStreamingApp {

      def main(args: Array[String]): Unit = {

        if (args.length != 4) {
          println("Usage: ImoocStatStreamingApp <zkQuorum> <group> <topics> <numThreads>")
          System.exit(1)
        }

        val Array(zkQuorum, groupId, topics, numThreads) = args

        val sparkConf = new SparkConf().setAppName("ImoocStatStreamingApp") //.setMaster("local[5]")
        val ssc = new StreamingContext(sparkConf, Seconds(60))

        val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
        val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
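        // Test step one (not spelled out in the notes): print the number of records
        // in each batch to confirm that data is flowing from Kafka before adding logic.
        messages.map(_._2).count().print()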

        ssc.start()
        ssc.awaitTermination()
      }
    }

     

    12-12 -Data Cleansing with Spark Streaming

    Cleanse the click data produced in real time according to the requirements

    Data cleansing: extract only the fields we need from the raw log

    Handling the time field: create a date/time utility class

    Source code: https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/utils/DateUtils.scala

    Source:

    package com.imooc.spark.project.utils

    import java.util.Date

    import org.apache.commons.lang3.time.FastDateFormat

    /**
     * Date/time utility class
     */
    object DateUtils {

      val YYYYMMDDHHMMSS_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
      val TARGE_FORMAT = FastDateFormat.getInstance("yyyyMMddHHmmss")

      def getTime(time: String) = {
        YYYYMMDDHHMMSS_FORMAT.parse(time).getTime
      }

      def parseToMinute(time: String) = {
        TARGE_FORMAT.format(new Date(getTime(time)))
      }

      def main(args: Array[String]): Unit = {
        println(parseToMinute("2017-10-22 14:46:01"))
      }
    }

    Source code: https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/spark/ImoocStatStreamingApp.scala

    // Test step two: data cleansing
    val logs = messages.map(_._2)
    val cleanData = logs.map(line => {
      // the generator writes tab-separated fields, so split on "\t"
      val infos = line.split("\t")

      // infos(2) = "GET /class/130.html HTTP/1.1"
      // url      = /class/130.html
      val url = infos(2).split(" ")(1)
      var courseId = 0

      // extract the course id of hands-on course pages
      if (url.startsWith("/class")) {
        val courseIdHTML = url.split("/")(2)
        courseId = courseIdHTML.substring(0, courseIdHTML.lastIndexOf(".")).toInt
      }

      ClickLog(infos(0), DateUtils.parseToMinute(infos(1)), courseId, infos(3).toInt, infos(4))
    }).filter(clicklog => clicklog.courseId != 0)

    Model class for the cleansed log

    package com.imooc.spark.project.domain

    /**
     * Cleansed log record
     * @param ip         client IP of the access
     * @param time       time of the access
     * @param courseId   hands-on course id of the access
     * @param statusCode HTTP status code of the access
     * @param referer    referer of the access
     */
    case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)

     

    One more note: don't run this on a machine with too little capacity

    Hadoop / ZooKeeper / HBase / Spark Streaming / Flume / Kafka

    hadoop001: 8 cores, 8 GB RAM

     

    12-13 -Feature One: Requirement Analysis and Choosing the Result Store

    Feature 1: page views of the hands-on courses from the start of today up to now

    yyyyMMdd courseid

    Use a database to store our statistics

    Spark Streaming writes the statistics into the database

    The visualization front end queries the database by yyyyMMdd + courseid and displays the results

    Which database should we choose to store the statistics?

    RDBMS: MySQL, Oracle...

    day        course_id   click_count
    20171111   1           10
    20171111   2           10

    When the next batch of data arrives:

    for (20171111, 1) ==> click_count + the count of the next batch ==> write back to the database

    NoSQL: HBase, Redis...

    HBase: a single API call (an atomic increment) handles this, which is very convenient

    for (20171111, 1) ==> click_count + the count of the next batch

    This is one of the reasons this course chooses HBase

    Prerequisites:

    HDFS

    Step 1: start Hadoop (HDFS)

    $ sbin/start-dfs.sh

    Step 2: start HBase

    $ bin/start-hbase.sh

    Detailed HBase shell commands: http://www.cnblogs.com/nexiyi/p/hbase_shell.html

    Step 3: create the table

    create 'imooc_course_clickcount','info'

    Step 4: rowkey design

    day_courseid
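    Once data has been written, the results can be inspected from the hbase shell; a simple check against the table created in step 3 is:

    scan 'imooc_course_clickcount'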

     

    12-14 -Feature One: Defining the DAO Layer Methods

    How to operate HBase from Scala

    Step 1: create the model

    Source code: https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/domain/CourseClickCount.scala

    Source:

    package com.imooc.spark.project.domain

    /**
     * Entity class for hands-on course click counts
     * @param day_course  corresponds to the HBase rowkey, e.g. 20171111_1
     * @param click_count total number of accesses for 20171111_1
     */
    case class CourseClickCount(day_course: String, click_count: Long)

    Step 2: create the DAO

    Source code:

    https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/dao/CourseClickCountDAO.scala

    Source:

    package com.imooc.spark.project.dao

    import com.imooc.spark.project.domain.CourseClickCount
    import com.imooc.spark.project.utils.HBaseUtils
    import org.apache.hadoop.hbase.client.Get
    import org.apache.hadoop.hbase.util.Bytes

    import scala.collection.mutable.ListBuffer

    /**
     * Hands-on course click counts - data access layer
     */
    object CourseClickCountDAO {

      val tableName = "imooc_course_clickcount"
      val cf = "info"
      val qualifer = "click_count"

      /**
       * Save data to HBase
       * @param list collection of CourseClickCount
       */
      def save(list: ListBuffer[CourseClickCount]): Unit = {
        val table = HBaseUtils.getInstance().getTable(tableName)

        for (ele <- list) {
          table.incrementColumnValue(Bytes.toBytes(ele.day_course),
            Bytes.toBytes(cf),
            Bytes.toBytes(qualifer),
            ele.click_count)
        }
      }

      /**
       * Query the value by rowkey
       */
      def count(day_course: String): Long = {
        val table = HBaseUtils.getInstance().getTable(tableName)

        val get = new Get(Bytes.toBytes(day_course))
        val value = table.get(get).getValue(cf.getBytes, qualifer.getBytes)

        if (value == null) {
          0L
        } else {
          Bytes.toLong(value)
        }
      }
    }

    12-15 -Feature One: Implementing the DAO Layer Methods

    Source code:

    https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/dao/CourseClickCountDAO.scala

    Source:

    package com.imooc.spark.project.dao

    import com.imooc.spark.project.domain.CourseClickCount
    import com.imooc.spark.project.utils.HBaseUtils
    import org.apache.hadoop.hbase.client.Get
    import org.apache.hadoop.hbase.util.Bytes

    import scala.collection.mutable.ListBuffer

    /**
     * Hands-on course click counts - data access layer
     */
    object CourseClickCountDAO {

      val tableName = "imooc_course_clickcount"
      val cf = "info"
      val qualifer = "click_count"

      /**
       * Save data to HBase
       * @param list collection of CourseClickCount
       */
      def save(list: ListBuffer[CourseClickCount]): Unit = {
        val table = HBaseUtils.getInstance().getTable(tableName)

        for (ele <- list) {
          table.incrementColumnValue(Bytes.toBytes(ele.day_course),
            Bytes.toBytes(cf),
            Bytes.toBytes(qualifer),
            ele.click_count)
        }
      }

      /**
       * Query the value by rowkey
       */
      def count(day_course: String): Long = {
        val table = HBaseUtils.getInstance().getTable(tableName)

        val get = new Get(Bytes.toBytes(day_course))
        val value = table.get(get).getValue(cf.getBytes, qualifer.getBytes)

        if (value == null) {
          0L
        } else {
          Bytes.toLong(value)
        }
      }

      def main(args: Array[String]): Unit = {
        val list = new ListBuffer[CourseClickCount]
        list.append(CourseClickCount("20171111_8", 8))
        list.append(CourseClickCount("20171111_9", 9))
        list.append(CourseClickCount("20171111_1", 100))

        save(list)

        println(count("20171111_8") + " : " + count("20171111_9") + " : " + count("20171111_1"))
      }
    }

    12-16 -Feature One: Developing the HBase Utility Class

    Written in Java

    Source code:

    https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/imooc_web/src/main/java/com/imooc/utils/HBaseUtils.java

    Source:

    package com.imooc.utils;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.Filter;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * HBase utility class
     */
    public class HBaseUtils {

        HBaseAdmin admin = null;
        Configuration conf = null;

        /**
         * Private constructor: load the required configuration
         */
        private HBaseUtils() {
            conf = new Configuration();
            conf.set("hbase.zookeeper.quorum", "hadoop000:2181");
            conf.set("hbase.rootdir", "hdfs://hadoop000:8020/hbase");

            try {
                admin = new HBaseAdmin(conf);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        private static HBaseUtils instance = null;

        public static synchronized HBaseUtils getInstance() {
            if (null == instance) {
                instance = new HBaseUtils();
            }
            return instance;
        }

        /**
         * Get an HTable instance by table name
         */
        public HTable getTable(String tableName) {
            HTable table = null;
            try {
                table = new HTable(conf, tableName);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return table;
        }

        /**
         * Query HBase records by table name and rowkey-prefix condition
         */
        public Map<String, Long> query(String tableName, String condition) throws Exception {
            Map<String, Long> map = new HashMap<>();

            HTable table = getTable(tableName);
            String cf = "info";
            String qualifier = "click_count";

            Scan scan = new Scan();
            Filter filter = new PrefixFilter(Bytes.toBytes(condition));
            scan.setFilter(filter);

            ResultScanner rs = table.getScanner(scan);
            for (Result result : rs) {
                String row = Bytes.toString(result.getRow());
                long clickCount = Bytes.toLong(result.getValue(cf.getBytes(), qualifier.getBytes()));
                map.put(row, clickCount);
            }
            return map;
        }

        public static void main(String[] args) throws Exception {
            Map<String, Long> map = HBaseUtils.getInstance().query("imooc_course_clickcount", "20171022");
            for (Map.Entry<String, Long> entry : map.entrySet()) {
                System.out.println(entry.getKey() + " : " + entry.getValue());
            }
        }
    }

    12-17 -Feature One: Writing the Spark Streaming Results into HBase

    Source code:

    https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/spark/ImoocStatStreamingApp.scala

    Source:

    // Test step three: count the page views of hands-on courses so far today
    cleanData.map(x => {
      // HBase rowkey design: 20171111_88
      (x.time.substring(0, 8) + "_" + x.courseId, 1)
    }).reduceByKey(_ + _).foreachRDD(rdd => {
      rdd.foreachPartition(partitionRecords => {
        val list = new ListBuffer[CourseClickCount]

        partitionRecords.foreach(pair => {
          list.append(CourseClickCount(pair._1, pair._2))
        })

        CourseClickCountDAO.save(list)
      })
    })

    12-18 -Feature Two: Requirement Analysis, HBase Design & HBase Data Access Layer

    Feature: count the page views of hands-on courses referred from search engines, from the start of today up to now

    Feature two = feature one + restricted to traffic referred from search engines

    HBase table design

    create 'imooc_course_search_clickcount','info'

    Rowkey design: again driven by the business requirement

    day_search engine host_course id, e.g. 20171111_www.baidu.com_1

    Step 1: create the model

    Source code:

    https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/domain/CourseSearchClickCount.scala

    Source:

    package com.imooc.spark.project.domain

    /**
     * Entity class for click counts of hands-on courses referred from search engines
     * @param day_search_course corresponds to the HBase rowkey, e.g. 20171111_www.baidu.com_1
     * @param click_count       total number of accesses for that rowkey
     */
    case class CourseSearchClickCount(day_search_course: String, click_count: Long)

    Step 2: the DAO layer

    Source code:

    https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/dao/CourseSearchClickCountDAO.scala

    Source:

    package com.imooc.spark.project.dao

    import com.imooc.spark.project.domain.{CourseClickCount, CourseSearchClickCount}
    import com.imooc.spark.project.utils.HBaseUtils
    import org.apache.hadoop.hbase.client.Get
    import org.apache.hadoop.hbase.util.Bytes

    import scala.collection.mutable.ListBuffer

    /**
     * Click counts of hands-on courses referred from search engines - data access layer
     */
    object CourseSearchClickCountDAO {

      val tableName = "imooc_course_search_clickcount"
      val cf = "info"
      val qualifer = "click_count"

      /**
       * Save data to HBase
       *
       * @param list collection of CourseSearchClickCount
       */
      def save(list: ListBuffer[CourseSearchClickCount]): Unit = {
        val table = HBaseUtils.getInstance().getTable(tableName)

        for (ele <- list) {
          table.incrementColumnValue(Bytes.toBytes(ele.day_search_course),
            Bytes.toBytes(cf),
            Bytes.toBytes(qualifer),
            ele.click_count)
        }
      }

      /**
       * Query the value by rowkey
       */
      def count(day_search_course: String): Long = {
        val table = HBaseUtils.getInstance().getTable(tableName)

        val get = new Get(Bytes.toBytes(day_search_course))
        val value = table.get(get).getValue(cf.getBytes, qualifer.getBytes)

        if (value == null) {
          0L
        } else {
          Bytes.toLong(value)
        }
      }

      def main(args: Array[String]): Unit = {
        val list = new ListBuffer[CourseSearchClickCount]
        list.append(CourseSearchClickCount("20171111_www.baidu.com_8", 8))
        list.append(CourseSearchClickCount("20171111_cn.bing.com_9", 9))

        save(list)

        println(count("20171111_www.baidu.com_8") + " : " + count("20171111_cn.bing.com_9"))
      }
    }

    12-19 -Feature Two: Implementation and Local Testing

    Source code:

    https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/project/spark/ImoocStatStreamingApp.scala

    Source:

    // Test step four: count the page views of hands-on courses referred from search engines so far today
    cleanData.map(x => {
      /**
       * https://www.sogou.com/web?query=Spark SQL实战
       *
       * ==>
       *
       * https:/www.sogou.com/web?query=Spark SQL实战
       */
      // collapsing "//" to "/" makes the host the element at index 1 after splitting on "/"
      val referer = x.referer.replaceAll("//", "/")
      val splits = referer.split("/")
      var host = ""
      if (splits.length > 2) {
        host = splits(1)
      }

      (host, x.courseId, x.time)
    }).filter(_._1 != "").map(x => {
      (x._3.substring(0, 8) + "_" + x._1 + "_" + x._2, 1)
    }).reduceByKey(_ + _).foreachRDD(rdd => {
      rdd.foreachPartition(partitionRecords => {
        val list = new ListBuffer[CourseSearchClickCount]

        partitionRecords.foreach(pair => {
          list.append(CourseSearchClickCount(pair._1, pair._2))
        })

        CourseSearchClickCountDAO.save(list)
      })
    })

    12-20 -Running the Project in the Server Environment

    Run the project in the server environment

    Build and package:

    mvn clean package -DskipTests

    Solution (in pom.xml):

    <!--
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    -->

    Run

    Errors came up

    Things to note when submitting the job (see the example command after this list):

    1. Using --packages

    2. Using --jars
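    As a sketch only, a submit command along these lines would cover both points: --packages pulls the Spark Streaming Kafka integration, and --jars ships the HBase client jars. The main class comes from the code above; the package coordinates, jar locations, application jar name and the four arguments (zkQuorum group topics numThreads) are assumptions that must match the local installation.

    spark-submit \
      --master local[5] \
      --class com.imooc.spark.project.spark.ImoocStatStreamingApp \
      --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
      --jars $(echo /home/hadoop/app/hbase/lib/*.jar | tr ' ' ',') \
      /home/hadoop/lib/sparktrain-1.0.jar \
      hadoop000:2181 test streamingtopic 1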
