• Website Traffic Analysis Project, Day 02


    1. Data Collection with Flume

    • Taildir Source

    It combines the functionality of the exec and spooldir sources and adds checkpoint-based resume, so collection can continue from where it left off after a restart. It is only available in Flume 1.7 and later. It can monitor a directory and collect, in real time, the files whose names match a regular expression.
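
    As an illustration only, a minimal Taildir source configuration might look like the sketch below. The agent name a1, channel c1, file group f1, position-file location and monitored path are all assumptions for the example, not values from this project:

    # Taildir source: tail files matched by a regex, with a position file for resume after restart
    a1.sources = r1
    a1.channels = c1
    a1.channels.c1.type = memory

    a1.sources.r1.type = TAILDIR
    # positionFile records the read offset of each tailed file (this enables resume)
    a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
    a1.sources.r1.filegroups = f1
    # regex applied to file names under the monitored directory
    a1.sources.r1.filegroups.f1 = /var/log/nginx/.*log.*
    a1.sources.r1.channels = c1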

    • Note 1: if rolling is driven only by file size and a file never reaches the size threshold, it stays in the temporary (.tmp) state forever.

    Workaround 1: stop Flume (not recommended).

    Workaround 2: set the hdfs.idleTimeout parameter. If a file does not meet the roll condition but has had no activity for the configured idle time, it is rolled anyway.

    • Note 2: whichever roll condition is configured, once it is met the file has only been written to the first DataNode; copying it to the remaining replicas also takes time. By default Flume only treats the write as finished when every replica has been written, so until then it believes the write has not succeeded and keeps re-sending data.

    Workaround: set hdfs.minBlockReplicas=1. Flume then only checks for a single replica: as soon as one copy has been written successfully the write is considered complete, and the remaining replication is handled by HDFS itself.
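
    A hedged sketch of the corresponding HDFS sink settings follows (the sink name k1, the HDFS path and the concrete roll values are illustrative assumptions, not values taken from this project):

    a1.sinks = k1
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /weblog/flume/%Y-%m-%d/
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # write plain text instead of SequenceFile
    a1.sinks.k1.hdfs.fileType = DataStream
    # roll by size only: 0 disables time-based and event-count-based rolling
    a1.sinks.k1.hdfs.rollInterval = 0
    a1.sinks.k1.hdfs.rollCount = 0
    a1.sinks.k1.hdfs.rollSize = 134217728
    # roll a quiet file anyway after 60s of inactivity, so it never stays .tmp forever
    a1.sinks.k1.hdfs.idleTimeout = 60
    # treat the write as successful once one replica is written; HDFS replicates the rest
    a1.sinks.k1.hdfs.minBlockReplicas = 1

    With size-only rolling (rollInterval and rollCount set to 0), hdfs.idleTimeout is what prevents a low-traffic file from staying in the temporary state indefinitely.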

    2. Module Development: Data Preprocessing

    Define rules based on the analysis goals, filter out records that do not conform to them, and clean out meaningless data.

    a. Implementation: MapReduce (MR)

    • There are usually quite a few fields, so we wrap them in a JavaBean; the JavaBean then has to implement Hadoop's Writable serialization interface.
      • Relationship between Writable and Java's Serializable: Serializable is relatively heavyweight, which is precisely why Writable was introduced as a lighter serialization mechanism dedicated to Hadoop (the WebLogBean class below is such an implementation). Note: both of them only serialize Java objects, whereas Avro can be used from any language on any platform.
    • Override the JavaBean's toString method. '\001' is the best field separator because it is Hive's default delimiter, which makes it easy to load the output into Hive later.

    b. Clickstream model data

      A clickstream is the trail a user leaves while continuously browsing a site: it focuses on the whole flow of the user's visit, where each visit consists of a series of click actions. Website logs, on the other hand, are site-oriented; they contain user behaviour data, server response data and many other kinds of log information, and by analysing them we can derive the users' clickstream data.

    • Clickstream pageviews model: focuses on identifying each user session and, within each session, how many steps the visit contains and how long the user stayed on each step.

    1) Find all access records of the user in the full access log
    2) Sort the user's records in ascending order of time
    3) Compute the time gap between each pair of consecutive records and compare it with 30 minutes
    4) If the gap is less than 30 minutes, the record continues the same session
    5) If the gap is more than 30 minutes, the record starts a new session
    6) The gap between two consecutive records gives the stay time of the earlier step
    7) For the last step of a session, and for sessions with only one step, the business rule assigns a default page stay time of 60 s
    For example, hits at 06:49:18, 06:55:10 and 07:40:00 form two sessions: the first two records are about 6 minutes apart, while the 45-minute gap before the third record starts a new session.

    • Clickstream Visit model: focuses on how each session starts and ends. For example, for a given session: the entry page and entry time, the page from which the user left and the leave time, and how many pages were visited in total during the session.

    1) Build on top of the pageviews model
    2) Within each session, sort all access records in ascending order of time
    3) The time and page of the first record are the entry time and entry page
    4) By business convention, the time and page of the last record are taken as the leave time and leave page

    c. Data cleaning: analysis and implementation

    194.237.142.21 - - [01/Nov/2018:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
    183.49.46.228 - - [01/Nov/2018:06:49:23 +0000] "-" 400 0 "-" "-"
    163.177.71.12 - - [01/Nov/2018:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    163.177.71.12 - - [01/Nov/2018:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    101.226.68.137 - - [01/Nov/2018:06:49:42 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    101.226.68.137 - - [01/Nov/2018:06:49:45 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    60.208.6.156 - - [01/Nov/2018:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    222.68.172.190 - - [01/Nov/2018:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    222.68.172.190 - - [01/Nov/2018:06:50:08 +0000] "-" 400 0 "-" "-"
    183.195.232.138 - - [01/Nov/2018:06:50:16 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    183.195.232.138 - - [01/Nov/2018:06:50:16 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    66.249.66.84 - - [01/Nov/2018:06:50:28 +0000] "GET /page/6/ HTTP/1.1" 200 27777 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    221.130.41.168 - - [01/Nov/2018:06:50:37 +0000] "GET /feed/ HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    157.55.35.40 - - [01/Nov/2018:06:51:13 +0000] "GET /robots.txt HTTP/1.1" 200 150 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
    50.116.27.194 - - [01/Nov/2018:06:51:35 +0000] "POST /wp-cron.php?doing_wp_cron=1379487095.2510800361633300781250 HTTP/1.0" 200 0 "-" "WordPress/3.6; http://blog.fens.me"
    58.215.204.118 - - [01/Nov/2018:06:51:35 +0000] "GET /nodejs-socketio-chat/ HTTP/1.1" 200 10818 "http://www.google.com/url?sa=t&rct=j&q=nodejs%20%E5%BC%82%E6%AD%A5%E5%B9%BF%E6%92%AD&source=web&cd=1&cad=rja&ved=0CCgQFjAA&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%6e%6f%64%65%6a%73%2d%73%6f%63%6b%65%74%69%6f%2d%63%68%61%74%2f&ei=rko5UrylAefOiAe7_IGQBw&usg=AFQjCNG6YWoZsJ_bSj8kTnMHcH51hYQkAA&bvm=bv.52288139,d.aGc" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.215.204.118 - - [01/Nov/2018:06:51:36 +0000] "GET /wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.215.204.118 - - [01/Nov/2018:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.215.204.118 - - [01/Nov/2018:06:51:36 +0000] "GET /wp-includes/js/comment-reply.min.js?ver=3.6 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.215.204.118 - - [01/Nov/2018:06:51:36 +0000] "GET /wp-content/uploads/2013/08/chat.png HTTP/1.1" 200 48968 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.215.204.118 - - [01/Nov/2018:06:51:36 +0000] "GET /wp-content/uploads/2013/08/chat2.png HTTP/1.1" 200 59852 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.215.204.118 - - [01/Nov/2018:06:51:37 +0000] "GET /wp-content/uploads/2013/08/socketio.png HTTP/1.1" 200 80493 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.248.178.212 - - [01/Nov/2018:06:51:37 +0000] "GET /nodejs-grunt-intro/ HTTP/1.1" 200 51770 "http://blog.fens.me/series-nodejs/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
    58.248.178.212 - - [01/Nov/2018:06:51:40 +0000] "GET /wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 200 7200 "http://blog.fens.me/nodejs-grunt-intro/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
    58.248.178.212 - - [01/Nov/2018:06:51:40 +0000] "GET /wp-includes/js/comment-reply.min.js?ver=3.6 HTTP/1.1" 200 786 "http://blog.fens.me/nodejs-grunt-intro/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
    58.248.178.212 - - [01/Nov/2018:06:51:40 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 200 45307 "http://blog.fens.me/nodejs-grunt-intro/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
    58.248.178.212 - - [01/Nov/2018:06:51:40 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 200 93128 "http://blog.fens.me/nodejs-grunt-intro/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
    58.248.178.212 - - [01/Nov/2018:06:51:40 +0000] "GET /wp-includes/js/comment-reply.min.js?ver=3.6 HTTP/1.1" 200 786 "http://blog.fens.me/nodejs-grunt-intro/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
    
    • The data after cleaning
    false194.237.142.21-2018-11-01 06:49:18/wp-content/uploads/2013/07/rstudio-git3.png3040"-""Mozilla/4.0(compatible;)"
    false163.177.71.12-2018-11-01 06:49:33/20020"-""DNSPod-Monitor/1.0"
    false163.177.71.12-2018-11-01 06:49:36/20020"-""DNSPod-Monitor/1.0"
    false101.226.68.137-2018-11-01 06:49:42/20020"-""DNSPod-Monitor/1.0"
    false101.226.68.137-2018-11-01 06:49:45/20020"-""DNSPod-Monitor/1.0"
    false60.208.6.156-2018-11-01 06:49:48/wp-content/uploads/2013/07/rcassandra.png200185524"http://cos.name/category/software/packages/""Mozilla/5.0(WindowsNT6.1)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.66Safari/537.36"
    false222.68.172.190-2018-11-01 06:49:57/images/my.jpg20019939"http://www.angularjs.cn/A00n""Mozilla/5.0(WindowsNT6.1)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.66Safari/537.36"
    false183.195.232.138-2018-11-01 06:50:16/20020"-""DNSPod-Monitor/1.0"
    false183.195.232.138-2018-11-01 06:50:16/20020"-""DNSPod-Monitor/1.0"
    false66.249.66.84-2018-11-01 06:50:28/page/6/20027777"-""Mozilla/5.0(compatible;Googlebot/2.1;+http://www.google.com/bot.html)"
    false221.130.41.168-2018-11-01 06:50:37/feed/3040"-""Mozilla/5.0(WindowsNT6.1)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.66Safari/537.36"
    false157.55.35.40-2018-11-01 06:51:13/robots.txt200150"-""Mozilla/5.0(compatible;bingbot/2.0;+http://www.bing.com/bingbot.htm)"
    false50.116.27.194-2018-11-01 06:51:35/wp-cron.php?doing_wp_cron=1379487095.25108003616333007812502000"-""WordPress/3.6;http://blog.fens.me"
    false58.215.204.118-2018-11-01 06:51:35/nodejs-socketio-chat/20010818"http://www.google.com/url?sa=t&rct=j&q=nodejs%20%E5%BC%82%E6%AD%A5%E5%B9%BF%E6%92%AD&source=web&cd=1&cad=rja&ved=0CCgQFjAA&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%6e%6f%64%65%6a%73%2d%73%6f%63%6b%65%74%69%6f%2d%63%68%61%74%2f&ei=rko5UrylAefOiAe7_IGQBw&usg=AFQjCNG6YWoZsJ_bSj8kTnMHcH51hYQkAA&bvm=bv.52288139,d.aGc""Mozilla/5.0(WindowsNT5.1;rv:23.0)Gecko/20100101Firefox/23.0"
    false58.215.204.118-2018-11-01 06:51:36/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.13040"http://blog.fens.me/nodejs-socketio-chat/""Mozilla/5.0(WindowsNT5.1;rv:23.0)Gecko/20100101Firefox/23.0"
    false58.215.204.118-2018-11-01 06:51:35/wp-includes/js/jquery/jquery.js?ver=1.10.23040"http://blog.fens.me/nodejs-socketio-chat/""Mozilla/5.0(WindowsNT5.1;rv:23.0)Gecko/20100101Firefox/23.0"
    false58.215.204.118-2018-11-01 06:51:36/wp-includes/js/comment-reply.min.js?ver=3.63040"http://blog.fens.me/nodejs-socketio-chat/""Mozilla/5.0(WindowsNT5.1;rv:23.0)Gecko/20100101Firefox/23.0"
    false58.215.204.118-2018-11-01 06:51:36/wp-content/uploads/2013/08/chat.png20048968"http://blog.fens.me/nodejs-socketio-chat/""Mozilla/5.0(WindowsNT5.1;rv:23.0)Gecko/20100101Firefox/23.0"
    false58.215.204.118-2018-11-01 06:51:36/wp-content/uploads/2013/08/chat2.png20059852"http://blog.fens.me/nodejs-socketio-chat/""Mozilla/5.0(WindowsNT5.1;rv:23.0)Gecko/20100101Firefox/23.0"
    false58.215.204.118-2018-11-01 06:51:37/wp-content/uploads/2013/08/socketio.png20080493"http://blog.fens.me/nodejs-socketio-chat/""Mozilla/5.0(WindowsNT5.1;rv:23.0)Gecko/20100101Firefox/23.0"
    false58.248.178.212-2018-11-01 06:51:37/nodejs-grunt-intro/20051770"http://blog.fens.me/series-nodejs/""Mozilla/4.0(compatible;MSIE8.0;WindowsNT5.1;Trident/4.0;.NETCLR1.1.4322;.NETCLR2.0.50727;.NETCLR3.0.04506.30;.NETCLR3.0.4506.2152;.NETCLR3.5.30729;MDDR;InfoPath.2;.NET4.0C)"
    false58.248.178.212-2018-11-01 06:51:40/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.12007200"http://blog.fens.me/nodejs-grunt-intro/""Mozilla/4.0(compatible;MSIE8.0;WindowsNT5.1;Trident/4.0;.NETCLR1.1.4322;.NETCLR2.0.50727;.NETCLR3.0.04506.30;.NETCLR3.0.4506.2152;.NETCLR3.5.30729;MDDR;InfoPath.2;.NET4.0C)"
    false58.248.178.212-2018-11-01 06:51:40/wp-includes/js/comment-reply.min.js?ver=3.6200786"http://blog.fens.me/nodejs-grunt-intro/""Mozilla/4.0(compatible;MSIE8.0;WindowsNT5.1;Trident/4.0;.NETCLR1.1.4322;.NETCLR2.0.50727;.NETCLR3.0.04506.30;.NETCLR3.0.4506.2152;.NETCLR3.5.30729;MDDR;InfoPath.2;.NET4.0C)"
    false58.248.178.212-2018-11-01 06:51:40/wp-includes/js/jquery/jquery.js?ver=1.10.220045307"http://blog.fens.me/nodejs-grunt-intro/""Mozilla/4.0(compatible;MSIE8.0;WindowsNT5.1;Trident/4.0;.NETCLR1.1.4322;.NETCLR2.0.50727;.NETCLR3.0.04506.30;.NETCLR3.0.4506.2152;.NETCLR3.5.30729;MDDR;InfoPath.2;.NET4.0C)"
    false58.248.178.212-2018-11-01 06:51:40/wp-includes/js/jquery/jquery.js?ver=1.10.220093128"http://blog.fens.me/nodejs-grunt-intro/""Mozilla/4.0(compatible;MSIE8.0;WindowsNT5.1;Trident/4.0;.NETCLR1.1.4322;.NETCLR2.0.50727;.NETCLR3.0.04506.30;.NETCLR3.0.4506.2152;.NETCLR3.5.30729;MDDR;InfoPath.2;.NET4.0C)"
    
    • Implementation code
        <repositories>
            <repository>
                <id>cloudera</id>
                <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
            </repository>
        </repositories>
        <dependencies>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>2.6.0-mr1-cdh5.14.0</version>
    
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>2.6.0-cdh5.14.0</version>
    
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
                <version>2.6.0-cdh5.14.0</version>
    
            </dependency>
    
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-mapreduce-client-core</artifactId>
                <version>2.6.0-cdh5.14.0</version>
    
            </dependency>
            <!-- https://mvnrepository.com/artifact/junit/junit -->
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>4.11</version>
                <scope>test</scope>
            </dependency>
            <dependency>
                <groupId>org.testng</groupId>
                <artifactId>testng</artifactId>
                <version>RELEASE</version>
                <scope>test</scope>
            </dependency>
    
        </dependencies>
    
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.0</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                        <encoding>UTF-8</encoding>
                        <!--    <verbal>true</verbal>-->
                    </configuration>
                </plugin>
                <!-- Packaging step: bundle all dependent jars into the resulting jar -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.4.3</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <minimizeJar>true</minimizeJar>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    package com.itheima.pojo;
    
    
    import org.apache.hadoop.io.Writable;
    
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    
    /**
     * Layer that maps directly onto the external data; the structure should match
     * the external data source as closely as possible.
     * Terminology: source-conformed (ODS) table
     * @author
     *
     */
    public class WebLogBean implements Writable {
    
        private boolean valid = true;// whether the record is valid
        private String remote_addr;// client ip address
        private String remote_user;// client user name; "-" means not provided
        private String time_local;// access time and time zone
        private String request;// requested url and http protocol
        private String status;// request status; 200 means success
        private String body_bytes_sent;// size of the body sent to the client
        private String http_referer;// the page the request was linked from
        private String http_user_agent;// client browser information
    
    
        public void set(boolean valid,String remote_addr, String remote_user, String time_local, String request, String status, String body_bytes_sent, String http_referer, String http_user_agent) {
            this.valid = valid;
            this.remote_addr = remote_addr;
            this.remote_user = remote_user;
            this.time_local = time_local;
            this.request = request;
            this.status = status;
            this.body_bytes_sent = body_bytes_sent;
            this.http_referer = http_referer;
            this.http_user_agent = http_user_agent;
        }
    
        public String getRemote_addr() {
            return remote_addr;
        }
    
        public void setRemote_addr(String remote_addr) {
            this.remote_addr = remote_addr;
        }
    
        public String getRemote_user() {
            return remote_user;
        }
    
        public void setRemote_user(String remote_user) {
            this.remote_user = remote_user;
        }
    
        public String getTime_local() {
            return this.time_local;
        }
    
        public void setTime_local(String time_local) {
            this.time_local = time_local;
        }
    
        public String getRequest() {
            return request;
        }
    
        public void setRequest(String request) {
            this.request = request;
        }
    
        public String getStatus() {
            return status;
        }
    
        public void setStatus(String status) {
            this.status = status;
        }
    
        public String getBody_bytes_sent() {
            return body_bytes_sent;
        }
    
        public void setBody_bytes_sent(String body_bytes_sent) {
            this.body_bytes_sent = body_bytes_sent;
        }
    
        public String getHttp_referer() {
            return http_referer;
        }
    
        public void setHttp_referer(String http_referer) {
            this.http_referer = http_referer;
        }
    
        public String getHttp_user_agent() {
            return http_user_agent;
        }
    
        public void setHttp_user_agent(String http_user_agent) {
            this.http_user_agent = http_user_agent;
        }
    
        public boolean isValid() {
            return valid;
        }
    
        public void setValid(boolean valid) {
            this.valid = valid;
        }
    
        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            sb.append(this.valid);
            sb.append("01").append(this.getRemote_addr());
            sb.append("01").append(this.getRemote_user());
            sb.append("01").append(this.getTime_local());
            sb.append("01").append(this.getRequest());
            sb.append("01").append(this.getStatus());
            sb.append("01").append(this.getBody_bytes_sent());
            sb.append("01").append(this.getHttp_referer());
            sb.append("01").append(this.getHttp_user_agent());
            return sb.toString();
        }
    
        @Override
        public void readFields(DataInput in) throws IOException {
            this.valid = in.readBoolean();
            this.remote_addr = in.readUTF();
            this.remote_user = in.readUTF();
            this.time_local = in.readUTF();
            this.request = in.readUTF();
            this.status = in.readUTF();
            this.body_bytes_sent = in.readUTF();
            this.http_referer = in.readUTF();
            this.http_user_agent = in.readUTF();
    
        }
    
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeBoolean(this.valid);
            out.writeUTF(null==remote_addr?"":remote_addr);
            out.writeUTF(null==remote_user?"":remote_user);
            out.writeUTF(null==time_local?"":time_local);
            out.writeUTF(null==request?"":request);
            out.writeUTF(null==status?"":status);
            out.writeUTF(null==body_bytes_sent?"":body_bytes_sent);
            out.writeUTF(null==http_referer?"":http_referer);
            out.writeUTF(null==http_user_agent?"":http_user_agent);
    
        }
    
    }
    package com.itheima;
    
    import com.itheima.pojo.WebLogBean;
    
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Locale;
    import java.util.Set;
    
    public class WebLogParser {
    
        public static SimpleDateFormat df1 = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        public static SimpleDateFormat df2 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
    
        public static WebLogBean parser(String line) {
            WebLogBean webLogBean = new WebLogBean();
            String[] arr = line.split(" ");
            if (arr.length > 11) {
                webLogBean.setRemote_addr(arr[0]);
                webLogBean.setRemote_user(arr[1]);
                String time_local = formatDate(arr[3].substring(1));
                if(null==time_local || "".equals(time_local)) time_local="-invalid_time-";
                webLogBean.setTime_local(time_local);
                webLogBean.setRequest(arr[6]);
                webLogBean.setStatus(arr[8]);
                webLogBean.setBody_bytes_sent(arr[9]);
                webLogBean.setHttp_referer(arr[10]);
    
                // if the user agent contains multiple tokens, concatenate them
                // (note: the original spaces inside the user agent string are not preserved)
                if (arr.length > 12) {
                    StringBuilder sb = new StringBuilder();
                    for(int i=11;i<arr.length;i++){
                        sb.append(arr[i]);
                    }
                    webLogBean.setHttp_user_agent(sb.toString());
                } else {
                    webLogBean.setHttp_user_agent(arr[11]);
                }
    
                if (Integer.parseInt(webLogBean.getStatus()) >= 400) {// status >= 400 indicates an HTTP error
                    webLogBean.setValid(false);
                }
                
                if("-invalid_time-".equals(webLogBean.getTime_local())){
                    webLogBean.setValid(false);
                }
            } else {
                webLogBean=null;
            }
    
            return webLogBean;
        }
    
        public static void filtStaticResource(WebLogBean bean, Set<String> pages) {
            if (!pages.contains(bean.getRequest())) {
                bean.setValid(false);
            }
        }
        // convert the log time format (dd/MMM/yyyy:HH:mm:ss) to yyyy-MM-dd HH:mm:ss
        public static String formatDate(String time_local) {
            try {
                return df2.format(df1.parse(time_local));
            } catch (ParseException e) {
                return null;
            }
    
        }
    
    }
    package com.itheima;
    
    import com.itheima.pojo.WebLogBean;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    
    /**
     * Process the raw logs: filter out the real page-view requests, convert the time format,
     * fill default values for missing fields, and mark each record as valid or invalid
     */
    
    public class WeblogPreProcess {
    
        static class WeblogPreProcessMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            // stores the set of meaningful site urls used for filtering
            Set<String> pages = new HashSet<String>();
            Text k = new Text();
            NullWritable v = NullWritable.get();
    
            /**
             * Load the set of useful site urls (normally from an external config file)
             * into the map task's memory; it is used to filter the log records
             */
            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                pages.add("/about");
                pages.add("/black-ip-list/");
                pages.add("/cassandra-clustor/");
                pages.add("/finance-rhive-repurchase/");
                pages.add("/hadoop-family-roadmap/");
                pages.add("/hadoop-hive-intro/");
                pages.add("/hadoop-zookeeper-intro/");
                pages.add("/hadoop-mahout-roadmap/");
    
            }
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    
                String line = value.toString();
                WebLogBean webLogBean = WebLogParser.parser(line);
                if (webLogBean != null) {
                    // filter out static resources such as js / images / css
                    WebLogParser.filtStaticResource(webLogBean, pages);
                    /* if (!webLogBean.isValid()) return; */
                    k.set(webLogBean.toString());
                    context.write(k, v);
                }
            }
    
        }
    
        public static void main(String[] args) throws Exception {
    
            // local test paths (Windows); on a cluster use the commented args[0]/args[1] version below
            String inPath = "D:\\hadoop_data\\weblog\\";
            String outpath = "D:\\hadoop_output\\weblog\\weboutput";
    
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
    
            job.setJarByClass(WeblogPreProcess.class);
    
            job.setMapperClass(WeblogPreProcessMapper.class);
    
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
    
    //         FileInputFormat.setInputPaths(job, new Path(args[0]));
    //         FileOutputFormat.setOutputPath(job, new Path(args[1]));
            FileInputFormat.setInputPaths(job, new Path(inPath));
            FileOutputFormat.setOutputPath(job, new Path(outpath));
    
            job.setNumReduceTasks(0);
    
            boolean res = job.waitForCompletion(true);
            System.exit(res?0:1);
    
        }
    
    } 
    • Clickstream pageviews model
    package com.itheima.pageviews;
    
    import org.apache.hadoop.io.Writable;
    
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    
    public class PageViewsBean implements Writable {
    
        private String session;//session id
        private String remote_addr;
        private String timestr;
        private String request;
        private int step;
        private String staylong;
        private String referal;
        private String useragent;
        private String bytes_send;
        private String status;
    
        public void set(String session, String remote_addr, String useragent, String timestr, String request, int step, String staylong, String referal, String bytes_send, String status) {
            this.session = session;
            this.remote_addr = remote_addr;
            this.useragent = useragent;
            this.timestr = timestr;
            this.request = request;
            this.step = step;
            this.staylong = staylong;
            this.referal = referal;
            this.bytes_send = bytes_send;
            this.status = status;
        }
    
        public String getSession() {
            return session;
        }
    
        public void setSession(String session) {
            this.session = session;
        }
    
        public String getRemote_addr() {
            return remote_addr;
        }
    
        public void setRemote_addr(String remote_addr) {
            this.remote_addr = remote_addr;
        }
    
        public String getTimestr() {
            return timestr;
        }
    
        public void setTimestr(String timestr) {
            this.timestr = timestr;
        }
    
        public String getRequest() {
            return request;
        }
    
        public void setRequest(String request) {
            this.request = request;
        }
    
        public int getStep() {
            return step;
        }
    
        public void setStep(int step) {
            this.step = step;
        }
    
        public String getStaylong() {
            return staylong;
        }
    
        public void setStaylong(String staylong) {
            this.staylong = staylong;
        }
    
        public String getReferal() {
            return referal;
        }
    
        public void setReferal(String referal) {
            this.referal = referal;
        }
    
        public String getUseragent() {
            return useragent;
        }
    
        public void setUseragent(String useragent) {
            this.useragent = useragent;
        }
    
        public String getBytes_send() {
            return bytes_send;
        }
    
        public void setBytes_send(String bytes_send) {
            this.bytes_send = bytes_send;
        }
    
        public String getStatus() {
            return status;
        }
    
        public void setStatus(String status) {
            this.status = status;
        }
    
        @Override
        public void readFields(DataInput in) throws IOException {
            this.session = in.readUTF();
            this.remote_addr = in.readUTF();
            this.timestr = in.readUTF();
            this.request = in.readUTF();
            this.step = in.readInt();
            this.staylong = in.readUTF();
            this.referal = in.readUTF();
            this.useragent = in.readUTF();
            this.bytes_send = in.readUTF();
            this.status = in.readUTF();
    
        }
    
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(session);
            out.writeUTF(remote_addr);
            out.writeUTF(timestr);
            out.writeUTF(request);
            out.writeInt(step);
            out.writeUTF(staylong);
            out.writeUTF(referal);
            out.writeUTF(useragent);
            out.writeUTF(bytes_send);
            out.writeUTF(status);
    
        }
    
    }
    package com.itheima.pageviews;
    
    
    
    
    import com.itheima.pojo.WebLogBean;
    import org.apache.commons.beanutils.BeanUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.*;
    
    
    
    /**
     * Derive the clickstream pageviews model from the cleaned logs.
     *
     * Input: the result data of the cleaning job.
     *
     * Separates the records into sessions and assigns each visit (session) a session id (random UUID),
     * lays out every page accessed within a session (request time, url, stay time, and the page's
     * step number within the session), and keeps referral_url, body_bytes_send and useragent.
     */
    public class ClickStreamPageView {
    
        static class ClickStreamMapper extends Mapper<LongWritable, Text, Text, WebLogBean> {
    
            Text k = new Text();
            WebLogBean v = new WebLogBean();
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    
                String line = value.toString();
    
                String[] fields = line.split("\001");
                if (fields.length < 9) return;
                // set the split fields into the WebLogBean
                //fields[0].equals("true")
                v.set("true".equals(fields[0]) ? true : false, fields[1], fields[2], fields[3], fields[4], fields[5], fields[6], fields[7], fields[8]);
                // only valid records go into the subsequent processing
                if (v.isValid()) {
                    // the ip address is used here to identify the user
                    k.set(v.getRemote_addr());
                    context.write(k, v);
                }
            }
        }
    
        static class ClickStreamReducer extends Reducer<Text, WebLogBean, NullWritable, Text> {
    
            Text v = new Text();
    
            @Override
            protected void reduce(Text key, Iterable<WebLogBean> values, Context context) throws IOException, InterruptedException {
                ArrayList<WebLogBean> beans = new ArrayList<WebLogBean>();
    
    //            for (WebLogBean b : values) {
    //                beans.add(b);
    //            }
    
                // first copy out all of this user's access records so they can be sorted by time
                try {
                    for (WebLogBean bean : values) {
                        WebLogBean webLogBean = new WebLogBean();
                        try {
                            BeanUtils.copyProperties(webLogBean, bean);
                        } catch(Exception e) {
                            e.printStackTrace();
                        }
                        beans.add(webLogBean);
                    }
    
    
    
                    // sort the beans in chronological order
                    Collections.sort(beans, new Comparator<WebLogBean>() {
    
                        @Override
                        public int compare(WebLogBean o1, WebLogBean o2) {
                            try {
                                Date d1 = toDate(o1.getTime_local());
                                Date d2 = toDate(o2.getTime_local());
                                if (d1 == null || d2 == null)
                                    return 0;
                                return d1.compareTo(d2);
                            } catch (Exception e) {
                                e.printStackTrace();
                                return 0;
                            }
                        }
    
                    });
    
                    /**
                     * The logic below identifies the individual visits within the ordered beans and
                     * numbers the pages of each visit with a sequential step.
                     * Core idea: compare the time gap between adjacent records; if the gap is
                     * less than 30 minutes the two records belong to the same session,
                     * otherwise they belong to different sessions.
                     */
    
                    int step = 1;
                    String session = UUID.randomUUID().toString();
                    for (int i = 0; i < beans.size(); i++) {
                        WebLogBean bean = beans.get(i);
                        // if there is only one record, output it directly
                        if (1 == beans.size()) {

                            // assign the default stay time of 60s
                            v.set(session+"\001"+key.toString()+"\001"+bean.getRemote_user() + "\001" + bean.getTime_local() + "\001" + bean.getRequest() + "\001" + step + "\001" + (60) + "\001" + bean.getHttp_referer() + "\001" + bean.getHttp_user_agent() + "\001" + bean.getBody_bytes_sent() + "\001"
                                    + bean.getStatus());
                            context.write(NullWritable.get(), v);
                            session = UUID.randomUUID().toString();
                            break;
                        }

                        // with more than one record, skip the first one; it is output when the second is processed
                        if (i == 0) {
                            continue;
                        }

                        // time gap between this record and the previous one
                        long timeDiff = timeDiff(toDate(bean.getTime_local()), toDate(beans.get(i - 1).getTime_local()));
                        // if the gap is less than 30 minutes, output the previous page view (same session)
                        if (timeDiff < 30 * 60 * 1000) {

                            v.set(session+"\001"+key.toString()+"\001"+beans.get(i - 1).getRemote_user() + "\001" + beans.get(i - 1).getTime_local() + "\001" + beans.get(i - 1).getRequest() + "\001" + step + "\001" + (timeDiff / 1000) + "\001" + beans.get(i - 1).getHttp_referer() + "\001"
                                    + beans.get(i - 1).getHttp_user_agent() + "\001" + beans.get(i - 1).getBody_bytes_sent() + "\001" + beans.get(i - 1).getStatus());
                            context.write(NullWritable.get(), v);
                            step++;
                        } else {

                            // if the gap is more than 30 minutes, output the previous page view and reset step to start a new visit
                            v.set(session+"\001"+key.toString()+"\001"+beans.get(i - 1).getRemote_user() + "\001" + beans.get(i - 1).getTime_local() + "\001" + beans.get(i - 1).getRequest() + "\001" + (step) + "\001" + (60) + "\001" + beans.get(i - 1).getHttp_referer() + "\001"
                                    + beans.get(i - 1).getHttp_user_agent() + "\001" + beans.get(i - 1).getBody_bytes_sent() + "\001" + beans.get(i - 1).getStatus());
                            context.write(NullWritable.get(), v);
                            // after outputting the previous record, reset the step counter and start a new session id
                            step = 1;
                            session = UUID.randomUUID().toString();
                        }

                        // if this is the last record, output it directly as well
                        if (i == beans.size() - 1) {
                            // assign the default stay time of 60s
                            v.set(session+"\001"+key.toString()+"\001"+bean.getRemote_user() + "\001" + bean.getTime_local() + "\001" + bean.getRequest() + "\001" + step + "\001" + (60) + "\001" + bean.getHttp_referer() + "\001" + bean.getHttp_user_agent() + "\001" + bean.getBody_bytes_sent() + "\001" + bean.getStatus());
                            context.write(NullWritable.get(), v);
                        }
                    }
    
                } catch (ParseException e) {
                    e.printStackTrace();
    
                }
    
            }
    
            private String toStr(Date date) {
                SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
                return df.format(date);
            }
    
            private Date toDate(String timeStr) throws ParseException {
                SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
                return df.parse(timeStr);
            }
    
            private long timeDiff(String time1, String time2) throws ParseException {
                Date d1 = toDate(time1);
                Date d2 = toDate(time2);
                return d1.getTime() - d2.getTime();
    
            }
    
            private long timeDiff(Date time1, Date time2) throws ParseException {
    
                return time1.getTime() - time2.getTime();
    
            }
    
        }
    
        public static void main(String[] args) throws Exception {
    
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
    
            job.setJarByClass(ClickStreamPageView.class);
    
            job.setMapperClass(ClickStreamMapper.class);
            job.setReducerClass(ClickStreamReducer.class);
    
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(WebLogBean.class);
    
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
    
    //        FileInputFormat.setInputPaths(job, new Path(args[0]));
    //        FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
            FileInputFormat.setInputPaths(job, new Path("D:\\hadoop_output\\weblog\\weboutput"));
            FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop_output\\weblog\\pageviews"));
    
            job.waitForCompletion(true);
    
        }
    
    }
    • Clickstream visits model
    package com.itheima.visits;
    
    import org.apache.hadoop.io.Writable;
    
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    
    public class VisitBean implements Writable {
    
        private String session;
        private String remote_addr;
        private String inTime;
        private String outTime;
        private String inPage;
        private String outPage;
        private String referal;
        private int pageVisits;
    
        public void set(String session, String remote_addr, String inTime, String outTime, String inPage, String outPage, String referal, int pageVisits) {
            this.session = session;
            this.remote_addr = remote_addr;
            this.inTime = inTime;
            this.outTime = outTime;
            this.inPage = inPage;
            this.outPage = outPage;
            this.referal = referal;
            this.pageVisits = pageVisits;
        }
    
        public String getSession() {
            return session;
        }
    
        public void setSession(String session) {
            this.session = session;
        }
    
        public String getRemote_addr() {
            return remote_addr;
        }
    
        public void setRemote_addr(String remote_addr) {
            this.remote_addr = remote_addr;
        }
    
        public String getInTime() {
            return inTime;
        }
    
        public void setInTime(String inTime) {
            this.inTime = inTime;
        }
    
        public String getOutTime() {
            return outTime;
        }
    
        public void setOutTime(String outTime) {
            this.outTime = outTime;
        }
    
        public String getInPage() {
            return inPage;
        }
    
        public void setInPage(String inPage) {
            this.inPage = inPage;
        }
    
        public String getOutPage() {
            return outPage;
        }
    
        public void setOutPage(String outPage) {
            this.outPage = outPage;
        }
    
        public String getReferal() {
            return referal;
        }
    
        public void setReferal(String referal) {
            this.referal = referal;
        }
    
        public int getPageVisits() {
            return pageVisits;
        }
    
        public void setPageVisits(int pageVisits) {
            this.pageVisits = pageVisits;
        }
    
        @Override
        public void readFields(DataInput in) throws IOException {
            this.session = in.readUTF();
            this.remote_addr = in.readUTF();
            this.inTime = in.readUTF();
            this.outTime = in.readUTF();
            this.inPage = in.readUTF();
            this.outPage = in.readUTF();
            this.referal = in.readUTF();
            this.pageVisits = in.readInt();
    
        }
    
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(session);
            out.writeUTF(remote_addr);
            out.writeUTF(inTime);
            out.writeUTF(outTime);
            out.writeUTF(inPage);
            out.writeUTF(outPage);
            out.writeUTF(referal);
            out.writeInt(pageVisits);
    
        }
    
        @Override
        public String toString() {
            return session + "\001" + remote_addr + "\001" + inTime + "\001" + outTime + "\001" + inPage + "\001" + outPage + "\001" + referal + "\001" + pageVisits;
        }
    }
    package com.itheima.visits;
    
    
    import com.itheima.pageviews.PageViewsBean;
    import org.apache.commons.beanutils.BeanUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    
    
    
    /**
     * Input: the result data of the pageviews model.
     * Derives the visit model from the pageviews result data:
     * sessionid  start-time   out-time   start-page   out-page   pagecounts  ......
     *
     * @author
     *
     */
    public class ClickStreamVisit {
    
    // use the session id as the key and send the records to the reducer
        static class ClickStreamVisitMapper extends Mapper<LongWritable, Text, Text, PageViewsBean> {
    
            PageViewsBean pvBean = new PageViewsBean();
            Text k = new Text();
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    
                String line = value.toString();
                String[] fields = line.split("\001");
                int step = Integer.parseInt(fields[5]);
                //(String session, String remote_addr, String timestr, String request, int step, String staylong, String referal, String useragent, String bytes_send, String status)
                //299d6b78-9571-4fa9-bcc2-f2567c46df3472.46.128.140-2013-09-18 07:58:50/hadoop-zookeeper-intro/160"https://www.google.com/""Mozilla/5.0"14722200
                pvBean.set(fields[0], fields[1], fields[2], fields[3],fields[4], step, fields[6], fields[7], fields[8], fields[9]);
                k.set(pvBean.getSession());
                context.write(k, pvBean);
            }
        }
    
        static class ClickStreamVisitReducer extends Reducer<Text, PageViewsBean, NullWritable, VisitBean> {
    
            @Override
            protected void reduce(Text session, Iterable<PageViewsBean> pvBeans, Context context) throws IOException, InterruptedException {
    
                // sort the pvBeans by step
                ArrayList<PageViewsBean> pvBeansList = new ArrayList<PageViewsBean>();
                for (PageViewsBean pvBean : pvBeans) {
                    PageViewsBean bean = new PageViewsBean();
                    try {
                        BeanUtils.copyProperties(bean, pvBean);
                        pvBeansList.add(bean);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
    
                Collections.sort(pvBeansList, new Comparator<PageViewsBean>() {
    
                    @Override
                    public int compare(PageViewsBean o1, PageViewsBean o2) {
    
                        return o1.getStep() > o2.getStep() ? 1 : -1;
                    }
                });
    
                // take the first and last pageview records of this visit and put them into a VisitBean
                VisitBean visitBean = new VisitBean();
                // first record of the visit
                visitBean.setInPage(pvBeansList.get(0).getRequest());
                visitBean.setInTime(pvBeansList.get(0).getTimestr());
                // last record of the visit
                visitBean.setOutPage(pvBeansList.get(pvBeansList.size() - 1).getRequest());
                visitBean.setOutTime(pvBeansList.get(pvBeansList.size() - 1).getTimestr());
                // number of pages visited in this visit
                visitBean.setPageVisits(pvBeansList.size());
                // visitor's ip
                visitBean.setRemote_addr(pvBeansList.get(0).getRemote_addr());
                // referal of this visit
                visitBean.setReferal(pvBeansList.get(0).getReferal());
                visitBean.setSession(session.toString());
    
                context.write(NullWritable.get(), visitBean);
    
            }
    
        }
    
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
    
            job.setJarByClass(ClickStreamVisit.class);
    
            job.setMapperClass(ClickStreamVisitMapper.class);
            job.setReducerClass(ClickStreamVisitReducer.class);
    
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(PageViewsBean.class);
    
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(VisitBean.class);
            
            
    //        FileInputFormat.setInputPaths(job, new Path(args[0]));
    //        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileInputFormat.setInputPaths(job, new Path("D:\\hadoop_output\\weblog\\pageviews"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop_output\\weblog\\visits"));
            
            boolean res = job.waitForCompletion(true);
            System.exit(res?0:1);
    
        }
    
    }