• Website Traffic Log Analysis (Module Development: Data Preprocessing)


    Data preprocessing

    Preprocessing means operating on the collected data before it is formally analyzed.

    • Reason: however the data was collected, it is usually not suitable for direct analysis because the records differ in format and regularity.
    • Goal: turn dirty, irregularly formatted data into clean, uniformly structured data through cleaning and preprocessing.
    • Technology: MapReduce

    Programming approach for preprocessing

    When writing MapReduce code, always be clear about what the key is, because MapReduce attaches several default behaviors to the key (a sketch of the default partitioner follows this list).

    Partitioning ----> hash of the key % number of reduce tasks
    Grouping     ----> records with the same key form one group
    Sorting      ----> keys are sorted in lexicographic order
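
    For reference, the default partitioning rule can be written out explicitly; the sketch below is behaviorally equivalent to Hadoop's built-in HashPartitioner (the class name KeyHashPartitioner is only illustrative):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    
    // equivalent to the default rule: partition = hash of the key % number of reduce tasks;
    // the sign bit is masked off so the result is never negative
    public class KeyHashPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }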
    

    MapReduce programming tips

    • When several attributes have to be passed around together, it is common to create a JavaBean to carry the data; the bean must implement Hadoop's serialization interface, Writable.
    • Deliberately override toString() and separate the fields with \001, which makes loading the results into the warehouse easy later on (\001 is Hive's default field delimiter); see the sketch after this list.
    • Records that are invalid for this analysis are logically deleted by setting a flag field rather than being physically removed.
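
    A minimal, purely illustrative sketch of the delimiter idea (the sample field values are made up): fields are joined with \001 when the bean is turned into text and split back apart on the consumer side; unlike a comma or a space, \001 virtually never occurs inside the field values themselves.

    public class DelimiterDemo {
        public static void main(String[] args) {
            // illustrative only: build a \001-separated record and split it back
            String record = String.join("\001", "true", "194.237.142.21", "-", "2013-09-18 06:49:18");
            String[] fields = record.split("\001");
            System.out.println(fields.length);   // prints 4, one element per joined field
        }
    }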

    Clickstream model overview

    The clickstream model is a business model: it does not exist as such in the raw data and is built up from a collection of business metrics.
    What the clickstream model describes is the continuous trajectory of a user's visit to the website; it is a "line", not isolated points.

    • Differences between the clickstream model and the raw log data

      • The raw access log looks at user behavior from the website's point of view; records are appended in time order and form scattered points.
      • The clickstream model looks at the same behavior from the user's point of view; the data forms one continuous trajectory.
      • The clickstream model can be derived by reorganizing the raw log data.

    Session

    The common industry convention is to decide whether two consecutive records belong to the same session by checking whether the time gap between them is within 30 minutes.

    If the gap is less than 30 minutes, the records belong to the same session.

    If the gap is more than 30 minutes, a new session begins.

    The so-called clickstream model is therefore the continuous access trajectory within a single session.
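
    The session-cutting rule itself can be sketched independently of MapReduce. A minimal illustration, assuming one user's access timestamps (in milliseconds) are already sorted in ascending order:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;
    
    public class SessionCutSketch {
        // assign a session id to every timestamp: a gap of 30 minutes or more starts a new session
        public static List<String> assignSessions(List<Long> sortedTimes) {
            List<String> sessionIds = new ArrayList<>();
            String session = UUID.randomUUID().toString();
            for (int i = 0; i < sortedTimes.size(); i++) {
                if (i > 0 && sortedTimes.get(i) - sortedTimes.get(i - 1) >= 30 * 60 * 1000L) {
                    session = UUID.randomUUID().toString();   // new visit begins
                }
                sessionIds.add(session);
            }
            return sessionIds;
        }
    }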

    Code

    pom.xml

        <dependencies>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>2.7.5</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>2.7.5</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
                <version>2.7.5</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-mapreduce-client-core</artifactId>
                <version>2.7.5</version>
            </dependency>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>RELEASE</version>
            </dependency>
            <dependency>
                <groupId>pers.hwj</groupId>
                <artifactId>preprocess</artifactId>
                <version>1.0-SNAPSHOT</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                        <encoding>UTF-8</encoding>
                        <!--    <verbal>true</verbal>-->
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.4.3</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <minimizeJar>true</minimizeJar>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
    
            </plugins>
        </build>
    

    log4j.properties

    log4j.rootLogger=debug, stdout, R 
    
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender 
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 
    
    #log4j.appender.stdout.layout.ConversionPattern=%5p - %m%n
    log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
    
    log4j.appender.R=org.apache.log4j.RollingFileAppender 
    log4j.appender.R.File=log4j.log 
    
    log4j.appender.R.MaxFileSize=100KB 
    log4j.appender.R.MaxBackupIndex=1 
    
    log4j.appender.R.layout=org.apache.log4j.PatternLayout 
    #log4j.appender.R.layout.ConversionPattern=%p %t %c - %m%n
    log4j.appender.R.layout.ConversionPattern=%d %p [%c] - %m%n
    
    log4j.logger.com.codefutures=DEBUG 
    

    preprocess module

    WebLogBean

    package pers.hwj;
    
    import org.apache.hadoop.io.Writable;
    
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    
    /**
     * @Author hwj
     * @Date 2020/8/6 14:32
     * @Desc: one private field for each column of the website traffic log
     **/
    /*
    Records that are clearly invalid get a flag field and are logically deleted.
    The bean has to be serialized, so it must implement Hadoop's Writable interface.
    Main steps:
    1. define the fields
    2. getters and setters
    3. toString
    4. serialization and deserialization
     */
    public class WebLogBean implements Writable {
        private boolean valid = true; // whether this record is valid
        private String remote_ip; // client IP address
        private String remote_user; // client user name
        private String time_local; // access time and time zone
        private String request; // requested URL
        private String status; // HTTP response status code
        private String body_bytes_sent; // size of the response body sent to the client
        private String http_referer; // the page the visitor came from
        private String http_user_agent; // details of the client browser
    
        public void set(boolean valid,String remote_ip, String remote_user, String time_local, String request, String status, String body_bytes_sent, String http_referer, String http_user_agent) {
            this.valid = valid;
            this.remote_ip = remote_ip;
            this.remote_user = remote_user;
            this.time_local = time_local;
            this.request = request;
            this.status = status;
            this.body_bytes_sent = body_bytes_sent;
            this.http_referer = http_referer;
            this.http_user_agent = http_user_agent;
        }
        public boolean isValid() {
            return valid;
        }
    
        public void setValid(boolean valid) {
            this.valid = valid;
        }
    
        public String getRemote_ip() {
            return remote_ip;
        }
    
        public void setRemote_ip(String remote_ip) {
            this.remote_ip = remote_ip;
        }
    
        public String getRemote_user() {
            return remote_user;
        }
    
        public void setRemote_user(String remote_user) {
            this.remote_user = remote_user;
        }
    
        public String getTime_local() {
            return time_local;
        }
    
        public void setTime_local(String time_local) {
            this.time_local = time_local;
        }
    
        public String getRequest() {
            return request;
        }
    
        public void setRequest(String request) {
            this.request = request;
        }
    
        public String getStatus() {
            return status;
        }
    
        public void setStatus(String status) {
            this.status = status;
        }
    
        public String getBody_bytes_sent() {
            return body_bytes_sent;
        }
    
        public void setBody_bytes_sent(String body_bytes_sent) {
            this.body_bytes_sent = body_bytes_sent;
        }
    
        public String getHttp_referer() {
            return http_referer;
        }
    
        public void setHttp_referer(String http_referer) {
            this.http_referer = http_referer;
        }
    
        public String getHttp_user_agent() {
            return http_user_agent;
        }
    
        public void setHttp_user_agent(String http_user_agent) {
            this.http_user_agent = http_user_agent;
        }
    
        @Override
        public String toString() {
            StringBuilder stringBuilder = new StringBuilder();
            stringBuilder.append(valid);
            // \001 is Hive's default field delimiter, which makes the later data loading convenient
            stringBuilder.append("\001").append(remote_ip);
            stringBuilder.append("\001").append(remote_user);
            stringBuilder.append("\001").append(time_local);
            stringBuilder.append("\001").append(request);
            stringBuilder.append("\001").append(status);
            stringBuilder.append("\001").append(body_bytes_sent);
            stringBuilder.append("\001").append(http_referer);
            stringBuilder.append("\001").append(http_user_agent);
            return stringBuilder.toString();
        }
    
        // serialization method
    
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeBoolean(valid);
            out.writeUTF(null==remote_ip?"":remote_ip);
            out.writeUTF(null==remote_user?"":remote_user);
            out.writeUTF(null==time_local?"":time_local);
            out.writeUTF(null==request?"":request);
            out.writeUTF(null==status?"":status);
            out.writeUTF(null==body_bytes_sent?"":body_bytes_sent);
            out.writeUTF(null==http_referer?"":http_referer);
            out.writeUTF(null==http_user_agent?"":http_user_agent);
        }
    
        // deserialization method
        @Override
        public void readFields(DataInput in) throws IOException {
            this.valid=in.readBoolean();
            this.remote_ip=in.readUTF();
            this.remote_user=in.readUTF();
            this.time_local=in.readUTF();
            this.request=in.readUTF();
            this.status=in.readUTF();
            this.body_bytes_sent=in.readUTF();
            this.http_referer=in.readUTF();
            this.http_user_agent=in.readUTF();
        }
    
    
    }
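
    A quick way to sanity-check the Writable implementation above is a round trip through DataOutputStream and DataInputStream; a minimal sketch (the test class name and the sample field values are only illustrative):

    package pers.hwj;
    
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    
    public class WebLogBeanRoundTrip {
        public static void main(String[] args) throws IOException {
            WebLogBean in = new WebLogBean();
            in.set(true, "194.237.142.21", "-", "2013-09-18 06:49:18",
                    "/hadoop-hive-intro/", "200", "14722", "\"-\"", "\"Mozilla/5.0\"");
    
            // serialize with write(), then read the bytes back with readFields()
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            in.write(new DataOutputStream(buf));
    
            WebLogBean out = new WebLogBean();
            out.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            System.out.println(out);   // prints the \001-separated record rebuilt from the bytes
        }
    }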
    
    

    WebLogMain

    package pers.hwj;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import java.io.IOException;
    
    /**
     * @Author hwj
     * @Date 2020/8/6 14:29
     * @Desc: process the raw log: filter out the real pv requests, convert the time format,
     * fill missing fields with default values, and mark every record as valid or invalid
     **/
    /*
    k1              v1
    byte offset     line content
    k2              v2
    record text     null
     */
    public class WebLogMain {
        // describe the job and submit it to the cluster to run
        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
            //  Configuration encapsulates the client / server side configuration
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
            job.setJarByClass(WebLogMain.class);
    
            // specify the map-phase processing class
            job.setMapperClass(WebLogMapper.class);
    
            // map-only job: the reduce phase is disabled
            job.setNumReduceTasks(0);
    
            // key / value types emitted by the map phase (the mapper writes Text keys and NullWritable values)
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
    
            // input and output paths (backslashes must be escaped in Java string literals)
    //        FileInputFormat.setInputPaths(job,new Path("file:///G:\\input"));
            FileInputFormat.setInputPaths(job,new Path("E:\\Big_Data_Files\\企业级网站流量运营分析系统开发实战\\网站流日志分析资料\\day2资料\\代码\\数据预处理数据\\weblog\\input"));
            FileOutputFormat.setOutputPath(job,new Path("E:\\Big_Data_Files\\opt"));
    
            // submit this job to the yarn cluster and wait for completion
            boolean res=job.waitForCompletion(true);
            System.exit(res?0:1);
        }
        }
    }
    
    

    WebLogMapper

    package pers.hwj;
    
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;
    
    /**
     * @Author hwj
     * @Date 2020/8/6 14:37
     * @Desc: parse each raw log line into a WebLogBean and flag invalid records
     **/
    /*
    k1  byte offset            LongWritable
    v1  line text              Text
    k2  bean rendered as text  Text
    v2  NULL                   NullWritable
    LongWritable, Text, Text, NullWritable
     */
    /*
    1. Split the line text into the bean fields to build k2.
    2. Write (k2, v2) to the context.
    ** The following kinds of records are flagged as invalid **
    1. Requests that are not for one of the listed pages (possibly crawlers)
    2. Requests whose HTTP status code is >= 400
    3. Records whose time field is empty
     */
    public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        // time format conversion
        public static SimpleDateFormat df1 = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        public static SimpleDateFormat df2 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
        // the set of site urls that count as real page requests
        Set<String> pages = new HashSet<String>();
        Text k = new Text();
        NullWritable v = NullWritable.get();
    
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            pages.add("/about");
            pages.add("/black-ip-list/");
            pages.add("/cassandra-clustor/");
            pages.add("/finance-rhive-repurchase/");
            pages.add("/hadoop-family-roadmap/");
            pages.add("/hadoop-hive-intro/");
            pages.add("/hadoop-zookeeper-intro/");
            pages.add("/hadoop-mahout-roadmap/");
    
        }
    
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String text = value.toString();
            String[] split = text.split(" ");
            WebLogBean logBean = new WebLogBean();
            // only lines with more than 11 space-separated fields are complete enough to parse
            if(split.length>11) {
                logBean.setRemote_ip(split[0]);
                logBean.setRemote_user(split[1]);
                String time_local=formatDate(split[3].substring(1));
                if(time_local==null||time_local.equals("")){
                    time_local="-invalid_time-";
                }
                logBean.setTime_local(time_local);
                logBean.setRequest(split[6]);
                logBean.setStatus(split[8]);
                logBean.setBody_bytes_sent(split[9]);
                logBean.setHttp_referer(split[10]);
                // if the user agent was split into several tokens, join them back together
                if(split.length>12){
                    StringBuilder stringBuilder = new StringBuilder();
                    for(int i=11;i<split.length;i++) {
                        stringBuilder.append(split[i]);
                    }
                    logBean.setHttp_user_agent(stringBuilder.toString());
                }else{
                    logBean.setHttp_user_agent(split[11]);
                }
                // clearly invalid records are logically deleted by clearing the valid flag
                if(Integer.parseInt(logBean.getStatus())>=400){
                    logBean.setValid(false);
                }
                if(logBean.getTime_local().equals("-invalid_time-")){
                    logBean.setValid(false);
                }
            }else{
                logBean=null;
            }
            if (logBean != null) {
                // mark static resources such as js / images / css as invalid
                filtStaticResource(logBean, pages);
                /* if (!logBean.isValid()) return; */
                k.set(logBean.toString());
                context.write(k, v);
            }
        }
        // convert the access-log time format into yyyy-MM-dd HH:mm:ss
        public static String formatDate(String time_local) {
            try {
                return df2.format(df1.parse(time_local));
            } catch (ParseException e) {
                return null;
            }
        }
        public static void filtStaticResource(WebLogBean bean, Set<String> pages) {
            if (!pages.contains(bean.getRequest())) {
                bean.setValid(false);
            }
        }
    }
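
    The time conversion done by formatDate can be checked in isolation with the same two SimpleDateFormat patterns; a small sketch (the sample timestamp is only illustrative):

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Locale;
    
    public class FormatDateDemo {
        public static void main(String[] args) throws ParseException {
            // source pattern as it appears in the access log, e.g. 18/Sep/2013:06:49:18
            SimpleDateFormat df1 = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
            // target pattern expected by the downstream clickstream jobs
            SimpleDateFormat df2 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
            System.out.println(df2.format(df1.parse("18/Sep/2013:06:49:18")));
            // prints: 2013-09-18 06:49:18
        }
    }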
    
    

    pageviews module

    ClickStreamPageView

    
    
    import org.apache.commons.beanutils.BeanUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import pers.hwj.WebLogBean;
    
    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.*;
    
    /**
     * Derive the clickstream pageviews model from the cleaned log data.
     * 
     * The input is the result data produced by the preprocessing job.
     * 
     * Separate out each visit (session) and give it a session-id (a random UUID).
     * For every page viewed in a visit, work out the request time, url, stay time
     * and the page's step number within that session.
     * Keep referral_url, body_bytes_send and useragent.
     * 
     * @author
     */
    public class ClickStreamPageView {
    
    	static class ClickStreamMapper extends Mapper<LongWritable, Text, Text, WebLogBean> {
    
    		Text k = new Text();
    		WebLogBean v = new WebLogBean();
    
    		@Override
    		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    			// read one preprocessed line of text
    			String line = value.toString();
    			// split it on the \001 delimiter
    			String[] fields = line.split("\001");
    			if (fields.length < 9) return;
    			// set the fields of this request into the WebLogBean
    			v.set("true".equals(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6], fields[7], fields[8]);
    
    			// only valid records enter the later processing
    			if (v.isValid()) {
    				// the IP address is used here to identify the user
    				k.set(v.getRemote_ip());
    				context.write(k, v);
    			}
    		}
    		}
    	}
    // records of the same user are grouped together for analysis
    	static class ClickStreamReducer extends Reducer<Text, WebLogBean, NullWritable, Text> {
    		Text v = new Text();
    
    		@Override
    		protected void reduce(Text key, Iterable<WebLogBean> values, Context context) throws IOException, InterruptedException {
    			ArrayList<WebLogBean> beans = new ArrayList<WebLogBean>();
    //			for (WebLogBean b : values) {
    //				beans.add(b);
    //			}
    			// first copy this user's access records out of the iterator so they can be sorted by time
    			try {
    				for (WebLogBean bean : values) {
    					WebLogBean webLogBean = new WebLogBean();
    					try {
    						BeanUtils.copyProperties(webLogBean, bean);
    					} catch(Exception e) {
    						e.printStackTrace();
    					}
    					beans.add(webLogBean);
    				}
    
    				// sort the beans in chronological order
    				Collections.sort(beans, new Comparator<WebLogBean>() {
    
    					@Override
    					public int compare(WebLogBean o1, WebLogBean o2) {
    						try {
    							Date d1 = toDate(o1.getTime_local());
    							Date d2 = toDate(o2.getTime_local());
    							if (d1 == null || d2 == null)
    								return 0;
    							return d1.compareTo(d2);
    						} catch (Exception e) {
    							e.printStackTrace();
    							return 0;
    						}
    					}
    
    				});
    
    				/**
    				 * The logic below: identify the separate visits from the time-ordered beans
    				 * and number the pages inside one visit with a step counter.
    				 * Core idea:
    				 * compare the time gap between two adjacent records; if the gap is < 30 minutes
    				 * the two records belong to the same session, otherwise they belong to different sessions.
    				 */
    				
    				int step = 1;
    				// session identifier
    				String session = UUID.randomUUID().toString();
    				for (int i = 0; i < beans.size(); i++) {
    					WebLogBean bean = beans.get(i);
    					// if there is only one record, output it directly
    					if (1 == beans.size()) {
    						// use a default stay time of 60s
    						v.set(session+"\001"+key.toString()+"\001"+bean.getRemote_user() + "\001" + bean.getTime_local()
    								+ "\001" + bean.getRequest() + "\001" + step + "\001" + (60) + "\001"
    								+ bean.getHttp_referer() + "\001" + bean.getHttp_user_agent() + "\001"
    								+ bean.getBody_bytes_sent() + "\001" + bean.getStatus());
    						context.write(NullWritable.get(), v);
    						session = UUID.randomUUID().toString();
    						break;
    					}
    
    					// with more than one record, skip the first one and output it while handling the second
    					if (i == 0) {
    						continue;
    					}
    					// time gap between this record and the previous one
    					long timeDiff = timeDiff(toDate(bean.getTime_local()), toDate(beans.get(i - 1).getTime_local()));
    					// if the gap is < 30 minutes, output the previous page view as part of the current session
    					if (timeDiff < 30 * 60 * 1000) {
    						
    						v.set(session+"\001"+key.toString()+"\001"+beans.get(i - 1).getRemote_user() + "\001" + beans.get(i - 1).getTime_local() + "\001" + beans.get(i - 1).getRequest() + "\001" + step + "\001" + (timeDiff / 1000) + "\001" + beans.get(i - 1).getHttp_referer() + "\001"
    								+ beans.get(i - 1).getHttp_user_agent() + "\001" + beans.get(i - 1).getBody_bytes_sent() + "\001" + beans.get(i - 1).getStatus());
    						context.write(NullWritable.get(), v);
    						step++;
    					} else {
    						// if the gap is > 30 minutes, output the previous page view and reset step to start a new visit
    						v.set(session+"\001"+key.toString()+"\001"+beans.get(i - 1).getRemote_user() + "\001" + beans.get(i - 1).getTime_local() + "\001" + beans.get(i - 1).getRequest() + "\001" + (step) + "\001" + (60) + "\001" + beans.get(i - 1).getHttp_referer() + "\001"
    								+ beans.get(i - 1).getHttp_user_agent() + "\001" + beans.get(i - 1).getBody_bytes_sent() + "\001" + beans.get(i - 1).getStatus());
    						context.write(NullWritable.get(), v);
    						// after outputting the previous record, reset the step counter
    						step = 1;
    						session = UUID.randomUUID().toString();
    					}
    
    					// if this is the last record, output it directly
    					if (i == beans.size() - 1) {
    						// use a default stay time of 60s
    						v.set(session+"\001"+key.toString()+"\001"+bean.getRemote_user() + "\001" + bean.getTime_local() + "\001" + bean.getRequest() + "\001" + step + "\001" + (60) + "\001" + bean.getHttp_referer() + "\001" + bean.getHttp_user_agent() + "\001" + bean.getBody_bytes_sent() + "\001" + bean.getStatus());
    						context.write(NullWritable.get(), v);
    					}
    				}
    
    			} catch (ParseException e) {
    				e.printStackTrace();
    
    			}
    
    		}
    
    		private String toStr(Date date) {
    			SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
    			return df.format(date);
    		}
    
    		private Date toDate(String timeStr) throws ParseException {
    			SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
    			return df.parse(timeStr);
    		}
    
    		private long timeDiff(String time1, String time2) throws ParseException {
    			Date d1 = toDate(time1);
    			Date d2 = toDate(time2);
    			return d1.getTime() - d2.getTime();
    		}
    
    		private long timeDiff(Date time1, Date time2) throws ParseException {
    
    			return time1.getTime() - time2.getTime();
    
    		}
    	}
    
    	public static void main(String[] args) throws Exception {
    
    		Configuration conf = new Configuration();
    		Job job = Job.getInstance(conf);
    
    		job.setJarByClass(ClickStreamPageView.class);
    
    		job.setMapperClass(ClickStreamMapper.class);
    		job.setReducerClass(ClickStreamReducer.class);
    
    		job.setMapOutputKeyClass(Text.class);
    		job.setMapOutputValueClass(WebLogBean.class);
    
    		job.setOutputKeyClass(NullWritable.class);
    		job.setOutputValueClass(Text.class);
    
    //		FileInputFormat.setInputPaths(job, new Path(args[0]));
    //		FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    		FileInputFormat.setInputPaths(job, new Path("E:\\Big_Data_Files\\opt"));
    		FileOutputFormat.setOutputPath(job, new Path("E:\\Big_Data_Files\\oppt"));
    
    		job.waitForCompletion(true);
    	}
    
    }
    
    

    PageViewsBean

    import org.apache.hadoop.io.Writable;
    
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    
    public class PageViewsBean implements Writable {
    
    	private String session;
    	private String remote_addr;
    	private String timestr;
    	private String request;
    	private int step;
    	private String staylong;
    	private String referal;
    	private String useragent;
    	private String bytes_send;
    	private String status;
    
    	public void set(String session, String remote_addr, String useragent, String timestr, String request, int step, String staylong, String referal, String bytes_send, String status) {
    		this.session = session;
    		this.remote_addr = remote_addr;
    		this.useragent = useragent;
    		this.timestr = timestr;
    		this.request = request;
    		this.step = step;
    		this.staylong = staylong;
    		this.referal = referal;
    		this.bytes_send = bytes_send;
    		this.status = status;
    	}
    
    	public String getSession() {
    		return session;
    	}
    
    	public void setSession(String session) {
    		this.session = session;
    	}
    
    	public String getRemote_addr() {
    		return remote_addr;
    	}
    
    	public void setRemote_addr(String remote_addr) {
    		this.remote_addr = remote_addr;
    	}
    
    	public String getTimestr() {
    		return timestr;
    	}
    
    	public void setTimestr(String timestr) {
    		this.timestr = timestr;
    	}
    
    	public String getRequest() {
    		return request;
    	}
    
    	public void setRequest(String request) {
    		this.request = request;
    	}
    
    	public int getStep() {
    		return step;
    	}
    
    	public void setStep(int step) {
    		this.step = step;
    	}
    
    	public String getStaylong() {
    		return staylong;
    	}
    
    	public void setStaylong(String staylong) {
    		this.staylong = staylong;
    	}
    
    	public String getReferal() {
    		return referal;
    	}
    
    	public void setReferal(String referal) {
    		this.referal = referal;
    	}
    
    	public String getUseragent() {
    		return useragent;
    	}
    
    	public void setUseragent(String useragent) {
    		this.useragent = useragent;
    	}
    
    	public String getBytes_send() {
    		return bytes_send;
    	}
    
    	public void setBytes_send(String bytes_send) {
    		this.bytes_send = bytes_send;
    	}
    
    	public String getStatus() {
    		return status;
    	}
    
    	public void setStatus(String status) {
    		this.status = status;
    	}
    
    	@Override
    	public void readFields(DataInput in) throws IOException {
    		this.session = in.readUTF();
    		this.remote_addr = in.readUTF();
    		this.timestr = in.readUTF();
    		this.request = in.readUTF();
    		this.step = in.readInt();
    		this.staylong = in.readUTF();
    		this.referal = in.readUTF();
    		this.useragent = in.readUTF();
    		this.bytes_send = in.readUTF();
    		this.status = in.readUTF();
    
    	}
    
    	@Override
    	public void write(DataOutput out) throws IOException {
    		out.writeUTF(session);
    		out.writeUTF(remote_addr);
    		out.writeUTF(timestr);
    		out.writeUTF(request);
    		out.writeInt(step);
    		out.writeUTF(staylong);
    		out.writeUTF(referal);
    		out.writeUTF(useragent);
    		out.writeUTF(bytes_send);
    		out.writeUTF(status);
    
    	}
    
    }
    
    

    visits module

    ClickStreamVisit

    import org.apache.commons.beanutils.BeanUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    
    
    /**
     * Input: the result data of the pageviews model.
     * From the pageviews result data, further derive the visit model:
     * sessionid  start-time   out-time   start-page   out-page   pagecounts  ......
     * 
     * @author
     *
     */
    public class ClickStreamVisit {
    
    	// use the session as the key and send the data to the reducer
    	static class ClickStreamVisitMapper extends Mapper<LongWritable, Text, Text, PageViewsBean> {
    
    		PageViewsBean pvBean = new PageViewsBean();
    		Text k = new Text();
    
    		@Override
    		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    
    			String line = value.toString();
    			String[] fields = line.split("\001");
    			int step = Integer.parseInt(fields[5]);
    			//(String session, String remote_addr, String timestr, String request, int step, String staylong, String referal, String useragent, String bytes_send, String status)
    			//example input (fields separated by \001): 299d6b78-9571-4fa9-bcc2-f2567c46df34 | 72.46.128.140 | - | 2013-09-18 07:58:50 | /hadoop-zookeeper-intro/ | 1 | 60 | "https://www.google.com/" | "Mozilla/5.0" | 14722 | 200
    			pvBean.set(fields[0], fields[1], fields[2], fields[3],fields[4], step, fields[6], fields[7], fields[8], fields[9]);
    			k.set(pvBean.getSession());
    			context.write(k, pvBean);
    
    		}
    
    	}
    
    	static class ClickStreamVisitReducer extends Reducer<Text, PageViewsBean, NullWritable, VisitBean> {
    
    		@Override
    		protected void reduce(Text session, Iterable<PageViewsBean> pvBeans, Context context) throws IOException, InterruptedException {
    
    			// sort the pvBeans of this session by step
    			ArrayList<PageViewsBean> pvBeansList = new ArrayList<PageViewsBean>();
    			for (PageViewsBean pvBean : pvBeans) {
    				PageViewsBean bean = new PageViewsBean();
    				try {
    					BeanUtils.copyProperties(bean, pvBean);
    					pvBeansList.add(bean);
    				} catch (Exception e) {
    					e.printStackTrace();
    				}
    			}
    
    			Collections.sort(pvBeansList, new Comparator<PageViewsBean>() {
    
    				@Override
    				public int compare(PageViewsBean o1, PageViewsBean o2) {
    
    					return o1.getStep() > o2.getStep() ? 1 : -1;
    				}
    			});
    
    			// take the first and last pageview records of this visit and fill them into the VisitBean
    			VisitBean visitBean = new VisitBean();
    			// entry record of the visit
    			visitBean.setInPage(pvBeansList.get(0).getRequest());
    			visitBean.setInTime(pvBeansList.get(0).getTimestr());
    			// exit record of the visit
    			visitBean.setOutPage(pvBeansList.get(pvBeansList.size() - 1).getRequest());
    			visitBean.setOutTime(pvBeansList.get(pvBeansList.size() - 1).getTimestr());
    			// number of pages viewed during the visit
    			visitBean.setPageVisits(pvBeansList.size());
    			// visitor's ip
    			visitBean.setRemote_addr(pvBeansList.get(0).getRemote_addr());
    			// referral of this visit
    			visitBean.setReferal(pvBeansList.get(0).getReferal());
    			visitBean.setSession(session.toString());
    
    			context.write(NullWritable.get(), visitBean);
    
    		}
    
    	}
    
    	public static void main(String[] args) throws Exception {
    		Configuration conf = new Configuration();
    		Job job = Job.getInstance(conf);
    
    		job.setJarByClass(ClickStreamVisit.class);
    
    		job.setMapperClass(ClickStreamVisitMapper.class);
    		job.setReducerClass(ClickStreamVisitReducer.class);
    
    		job.setMapOutputKeyClass(Text.class);
    		job.setMapOutputValueClass(PageViewsBean.class);
    
    		job.setOutputKeyClass(NullWritable.class);
    		job.setOutputValueClass(VisitBean.class);
    		
    		
    //		FileInputFormat.setInputPaths(job, new Path(args[0]));
    //		FileOutputFormat.setOutputPath(job, new Path(args[1]));
    		FileInputFormat.setInputPaths(job, new Path("E:\\Big_Data_Files\\oppt"));
    		FileOutputFormat.setOutputPath(job, new Path("E:\\Big_Data_Files\\opppt"));
    		
    		boolean res = job.waitForCompletion(true);
    		System.exit(res?0:1);
    
    	}
    
    }
    
    

    VisitBean

    import org.apache.hadoop.io.Writable;
    
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    
    public class VisitBean implements Writable {
    
    	private String session;
    	private String remote_addr;
    	private String inTime;
    	private String outTime;
    	private String inPage;
    	private String outPage;
    	private String referal;
    	private int pageVisits;
    
    	public void set(String session, String remote_addr, String inTime, String outTime, String inPage, String outPage, String referal, int pageVisits) {
    		this.session = session;
    		this.remote_addr = remote_addr;
    		this.inTime = inTime;
    		this.outTime = outTime;
    		this.inPage = inPage;
    		this.outPage = outPage;
    		this.referal = referal;
    		this.pageVisits = pageVisits;
    	}
    
    	public String getSession() {
    		return session;
    	}
    
    	public void setSession(String session) {
    		this.session = session;
    	}
    
    	public String getRemote_addr() {
    		return remote_addr;
    	}
    
    	public void setRemote_addr(String remote_addr) {
    		this.remote_addr = remote_addr;
    	}
    
    	public String getInTime() {
    		return inTime;
    	}
    
    	public void setInTime(String inTime) {
    		this.inTime = inTime;
    	}
    
    	public String getOutTime() {
    		return outTime;
    	}
    
    	public void setOutTime(String outTime) {
    		this.outTime = outTime;
    	}
    
    	public String getInPage() {
    		return inPage;
    	}
    
    	public void setInPage(String inPage) {
    		this.inPage = inPage;
    	}
    
    	public String getOutPage() {
    		return outPage;
    	}
    
    	public void setOutPage(String outPage) {
    		this.outPage = outPage;
    	}
    
    	public String getReferal() {
    		return referal;
    	}
    
    	public void setReferal(String referal) {
    		this.referal = referal;
    	}
    
    	public int getPageVisits() {
    		return pageVisits;
    	}
    
    	public void setPageVisits(int pageVisits) {
    		this.pageVisits = pageVisits;
    	}
    
    	@Override
    	public void readFields(DataInput in) throws IOException {
    		this.session = in.readUTF();
    		this.remote_addr = in.readUTF();
    		this.inTime = in.readUTF();
    		this.outTime = in.readUTF();
    		this.inPage = in.readUTF();
    		this.outPage = in.readUTF();
    		this.referal = in.readUTF();
    		this.pageVisits = in.readInt();
    
    	}
    
    	@Override
    	public void write(DataOutput out) throws IOException {
    		out.writeUTF(session);
    		out.writeUTF(remote_addr);
    		out.writeUTF(inTime);
    		out.writeUTF(outTime);
    		out.writeUTF(inPage);
    		out.writeUTF(outPage);
    		out.writeUTF(referal);
    		out.writeInt(pageVisits);
    
    	}
    
    	@Override
    	public String toString() {
    		return session + "01" + remote_addr + "01" + inTime + "01" + outTime + "01" + inPage + "01" + outPage + "01" + referal + "01" + pageVisits;
    	}
    }
    
    