• System Design and JavaScript Notes: User Behavior Analysis Research, Data Collection


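    1.1 The importance of user behavior analysis

      Anyone who has built a website probably already has a clear sense of why user behavior analysis matters. I had planned to give my own view, but I am a technical person and find it hard to argue its importance convincingly in terms of business value, so I dropped that idea. When I first saw Baidu Tongji and Google Analytics I was immediately struck by them and wanted to build a system like that myself; that is what has kept me studying user behavior analysis.

      I suspect the concept of "user behavior analysis" is still unfamiliar to many readers, so I link the Baidu Baike entry here as a starting point, in the hope that more like-minded people will join me in studying this topic. The Baidu Baike address is: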

      http://baike.baidu.com/view/2330219.htm

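      That is enough preamble; on to the main topic.

    1.2     Designing a good data collection system

      For a large website, response speed is an important measure of quality. Below I cite statistics from authoritative sources to show how much response speed matters:

      User behavior analysis depends on collecting user data accurately, which means adding collection code to the site's pages. Poorly written collection code will degrade the site's performance and, in the worst case, its stability. It is therefore very important to design a log collection program that is fast, secure, and loosely coupled to the host page.

      Here is the collection scheme I propose; the details are as follows:

      I am a Java programmer, and the web application servers I use most are Tomcat, JBoss, WebLogic and so on. Why not use one of these familiar servers here, instead of Apache or nginx, which do comparatively little? The reason is simple: Apache and nginx are faster and more lightweight. This comes from my experience building websites. The server side of a large website is complex, but most share one principle: when a user's request reaches the server side, the server first inspects it, and if it is a request for a static resource (an image, text that does not change, and so on) it is handed straight to the corresponding static resource server cluster, which is faster. Those static resource servers are almost always clusters of lightweight web servers such as Apache or nginx.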

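    1.3    The collection system: server side

      For local development I will not set up a cluster; interested readers can find material on that online. I will just run a single Apache server.

      The server side is very simple: only the configuration file under Apache's conf directory needs changing (note: my development platform is Windows 7). The configuration is: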

    <IfModule log_config_module>
        LogFormat "%h %l %u %t [%{%Y-%m-%d %T}t] \"%r\" [%q] [%U] %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
        LogFormat "%h %l %u %t [%{%Y-%m-%d %T}t] \"%r\" [%q] [%U] %>s %b" common
    
        <IfModule logio_module>
          # You need to enable mod_logio.c to use %I and %O
          LogFormat "%h %l %u %t [%{%Y-%m-%d %T}t] \"%r\" [%q] [%U] %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %I %O" combinedio
        </IfModule>
    </IfModule>

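    Add the following files to the htdocs folder:

    1)         a.gif (a 1x1-pixel transparent image)

    2)         click.html (used to record click logs)

    3)         error.html (used to record error logs)

    Start the Apache server and enter the following address in a browser: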

    http://127.0.0.1/a.gif?name=sharpxiajun&msg=test

    Find the file 2012_06_26.access.log in the logs folder and open it; it contains the following entry:

    127.0.0.1 - - [26/Jun/2012:11:37:07 +0800] [2012-06-26 11:37:07] "GET /a.gif?name=sharpxiajun&msg=test HTTP/1.1" [?name=sharpxiajun&msg=test] [/a.gif] 200 43

    The request has been recorded in full.
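    To make the layout of such an entry concrete, here is a minimal sketch that splits one line of this custom format into fields. The names `ACCESS_LINE`, `parseQueryString` and `parseAccessLine` are hypothetical and not part of the original scheme, and the sketch assumes the request line and query string contain no escaped quotes or unusual brackets.

```javascript
// Field order follows the LogFormat above:
// %h %l %u %t [%{%Y-%m-%d %T}t] "%r" [%q] [%U] %>s %b
var ACCESS_LINE = /^(\S+) (\S+) (\S+) \[([^\]]+)\] \[([^\]]+)\] "([^"]*)" \[(.*)\] \[([^\]]*)\] (\d{3}) (\S+)/;

// Hypothetical helper: turn "?a=1&b=2" into { a: '1', b: '2' }.
function parseQueryString(query) {
    var out = {};
    var text = query.replace(/^\?/, '');
    if (!text) { return out; }
    text.split('&').forEach(function (pair) {
        var eq = pair.indexOf('=');
        var key = eq === -1 ? pair : pair.slice(0, eq);
        var raw = eq === -1 ? '' : pair.slice(eq + 1);
        // Fall back to the raw text if the value is not valid percent-encoding.
        try { out[key] = decodeURIComponent(raw); } catch (e) { out[key] = raw; }
    });
    return out;
}

// Hypothetical helper: returns null when the line does not match the format.
function parseAccessLine(line) {
    var m = ACCESS_LINE.exec(line);
    if (!m) { return null; }
    return {
        host: m[1],
        time: m[5],                       // the [%{%Y-%m-%d %T}t] field
        request: m[6],                    // method, URL and protocol
        params: parseQueryString(m[7]),   // the [%q] field
        path: m[8],                       // the [%U] field
        status: parseInt(m[9], 10),
        bytes: m[10] === '-' ? 0 : parseInt(m[10], 10)
    };
}

var sample = '127.0.0.1 - - [26/Jun/2012:11:37:07 +0800] [2012-06-26 11:37:07] "GET /a.gif?name=sharpxiajun&msg=test HTTP/1.1" [?name=sharpxiajun&msg=test] [/a.gif] 200 43';
var entry = parseAccessLine(sample);
console.log(entry.path, entry.params.name, entry.params.msg);
```

    Logging `%q` and `%U` as separate bracketed fields is what makes this easy: the parameters can be read without re-parsing the request line.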

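    1.4     The collection system: client side

      The core of the collection system is the client-side collection script. Here I post the complete script and the test pages; I will go through the code in detail in later posts.

      The script records page-view logs and can also record click logs, although click logs usually carry business meaning, so users need to define them according to their own needs. The code is as follows: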

    up_beacon.js:

    (function(window,document,undefined){
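        // upLogger is the object the collection script exposes to the page
        if (window.upLogger){// already installed: return to avoid installing twice
            return;
        }
        var upBeaconUtil ={// logging utility object
            jsName:'up_beacon.js',// script file name
            defaultVer:20120607,// default version (a date)
            getVersion:function(){// read the version from the script URL's query string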
                var e = this.jsName;
                var a = new RegExp(e + "(\\?(.*))?$");
                var d = document.getElementsByTagName("script");
                for (var i = 0;i < d.length;i++){
                    var b = d[i];
                    if (b.src && b.src.match(a)){
                        var z = b.src.match(a)[2];
                        if (z && (/^[a-zA-Z0-9]+$/).test(z)){
                             return z;
                        }
                    }
                }
                return this.defaultVer;
            },
            setCookie:function(sName,sValue,oExpires,sPath,sDomain,bSecure){// set a cookie; oExpires is a lifetime in days
                var currDate = new Date(),
                    sExpires = typeof oExpires == 'undefined'?'':';expires=' + new Date(currDate.getTime() + (oExpires * 24 * 60 * 60* 1000)).toUTCString();
                document.cookie = sName + '=' + sValue + sExpires + ((sPath == null)?'':(' ;path=' + sPath)) + ((sDomain == null)?'':(' ;domain=' + sDomain)) + ((bSecure == true)?' ; secure':'');
            },
            getCookie:function(sName){// read a cookie; returns '-' when it is not set
                var regRes = document.cookie.match(new RegExp("(^| )" + sName + "=([^;]*)(;|$)"));
                return (regRes != null)?unescape(regRes[2]):'-';
            },
            getRand:function(){// generate a unique ID for this page view
                var currDate = new Date();
                var randId = currDate.getTime() + '-';    
                for (var i = 0;i < 32;i++)
                {
                    randId += Math.floor(Math.random() * 10);    
                }
                return randId;
            },
            parseError:function(obj){
                var retVal = '';
                for (var key in obj){
                    retVal += key + '=' + obj[key] + ';';    
                }
                return retVal;
            },
            getParam:function(obj,flag){// convert a parameter into a string
                var retVal = null;
                if (obj){
                    if (upBeaconUtil.isString(obj) || upBeaconUtil.isNumber(obj)){
                        retVal = obj;    
                    }else{
                        if (upBeaconUtil.isObject(obj)){
                            var tmpStr = '';
                            for (var key in obj){
                                if (obj[key] != null && obj[key] != undefined){
                                    var tmpObj = obj[key];
                                    if (upBeaconUtil.isArray(tmpObj)){
                                        tmpObj = tmpObj.join(',');    
                                    }else{
                                        if (upBeaconUtil.isDate(tmpObj)){
                                            tmpObj = tmpObj.getTime();    
                                        }
                                    }
                                    tmpStr += key + '=' + tmpObj + '&';
                                }
                            }
                            tmpStr = tmpStr.substring(0,tmpStr.length - 1);
                            retVal = tmpStr;
                        }else{
                            if (upBeaconUtil.isArray(obj)){
                                if (obj.length && obj.length > 0){
                                    retVal = obj.join(',');
                                }
                            }else{
                                retVal = obj.toString();    
                            }
                        }
                    }
                }
                
                if (!retVal){
                    retVal = '-';    
                }
                
                if (flag){
                    retVal = encodeURIComponent(retVal);
                    retVal = this.base64encode(retVal);
                }
                return retVal;
            },
            base64encode: function(G) {// Base64 encoding (an encoding, not encryption)
                var A = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
                var C, E, z;
                var F, D, B;
                z = G.length;
                E = 0;
                C = "";
                while (E < z) {
                    F = G.charCodeAt(E++) & 255;
                    if (E == z) {
                        C += A.charAt(F >> 2);
                        C += A.charAt((F & 3) << 4);
                        C += "==";
                        break
                    }
                    D = G.charCodeAt(E++);
                    if (E == z) {
                        C += A.charAt(F >> 2);
                        C += A.charAt(((F & 3) << 4) | ((D & 240) >> 4));
                        C += A.charAt((D & 15) << 2);
                        C += "=";
                        break
                    }
                    B = G.charCodeAt(E++);
                    C += A.charAt(F >> 2);
                    C += A.charAt(((F & 3) << 4) | ((D & 240) >> 4));
                    C += A.charAt(((D & 15) << 2) | ((B & 192) >> 6));
                    C += A.charAt(B & 63)
                }
                return C
            },
            getDomain:function(){// get the site's host name
                return location.hostname;
            },
            isString:function(obj){// is obj a string?
                return (obj != null) && (obj != undefined) && (typeof obj == 'string') && (obj.constructor == String);
            },
            isNumber:function(obj){// is obj a number?
                return (typeof obj == 'number') && (obj.constructor == Number);
            },
            isDate:function(obj){// is obj a Date?
                return obj && (typeof obj == 'object') && (obj.constructor == Date);
            },
            isArray:function(obj){// is obj an array?
                return obj && (typeof obj == 'object') && (obj.constructor == Array);
            },
            isObject:function(obj){// is obj a plain object?
                return obj && (typeof obj == 'object') && (obj.constructor == Object);
            },
            trim:function(str){// strip leading and trailing whitespace (the g flag is needed to trim both ends)
                return str.replace(/(^\s*)|(\s*$)/g, "");
            }
        },
        beacon_vist_num = isNaN(beacon_vist_num = +upBeaconUtil.getCookie('up_beacon_vist_count')) ? 1:beacon_vist_num + 1;// read the visit count from the cookie and increment it
        upBeaconUtil.setCookie('up_beacon_vist_count',beacon_vist_num);// store the new visit count
        var setUpBeaconId = function(){
            var sUpBeaconId = upBeaconUtil.trim(upBeaconUtil.getCookie('up_beacon_id'));
            if (sUpBeaconId == undefined || sUpBeaconId == null || sUpBeaconId == '' || sUpBeaconId == '-'){
                upBeaconUtil.setCookie('up_beacon_id',(upBeaconUtil.getDomain() + '.' + (new Date()).getTime()));
            }        
        }(),
        beaconMethod = {
            uvId:'up_beacon_id',// name of the cookie set by setUpBeaconId above
            memId:'up_dw_track'    ,
            beaconUrl:'127.0.0.1/a.gif',// URL that records page-view logs
            errorUrl:'127.0.0.1/error.html',// URL that records error logs
            clickUrl:'127.0.0.1/click.html',// URL that records click logs
            pageId:typeof _beacon_pageid != 'undefined'?_beacon_pageid:(_beacon_pageid = upBeaconUtil.getRand()),// generate the pageId (unique ID of this page view)
            protocol:function(){// protocol prefix of the request, e.g. http://
                var reqHeader = location.protocol;
                if ('file:' === reqHeader){
                    reqHeader = 'http:';    
                }
                return reqHeader + '//';
            },
            tracking:function(){// record a page-view log (public entry point)
                this.beaconLog();
            },
            getRefer:function(){// get the referring page
                var reqRefer = document.referrer;
                reqRefer == location.href && (reqRefer = '');
                try{
                    reqRefer = '' == reqRefer ? opener.location:reqRefer;
                    reqRefer = '' == reqRefer ? '-':reqRefer;
                }catch(e){
                    reqRefer = '-';
                }
                return reqRefer;
            },
            beaconLog:function(){// record a page-view log
                try{
                    var httpHeadInd = document.URL.indexOf('://'),
                        httpUrlContent = '{' + upBeaconUtil.getParam(document.URL.substring(httpHeadInd + 2)) + '}',
                        hisPageUrl = '{' + upBeaconUtil.getParam(this.getRefer()) + '}',
                        ptId = upBeaconUtil.getCookie(this.memId),
                        cId = upBeaconUtil.getCookie(this.uvId),
                        btsVal = upBeaconUtil.getCookie('b_t_s'),
                        beanconMObj = {};
                    var btsFlag = btsVal == '-' || btsVal.indexOf('s') == -1;
                    if (ptId != '-'){
                        beanconMObj.memId = ptId;    
                    }
                    if (btsFlag){
                        beanconMObj.subIsNew = 1;
                        upBeaconUtil.setCookie('b_t_s',btsVal == '-' ? 's' : (btsVal + 's'),10000,'/');
                    }else{
                        beanconMObj.subIsNew = 0;    
                    }
                    var logParams = '{' + upBeaconUtil.getParam(beanconMObj) + '}',
                        logPageId = this.pageId,
                        logTitle = document.title;
                    if (logTitle.length > 25){
                        logTitle = logTitle.substring(0,25);
                    }
                    logTitle = encodeURIComponent(logTitle);
                    var logCharset = (navigator.userAgent.indexOf('MSIE') != -1) ? document.charset : document.characterSet,
                        logQuery = '{' + upBeaconUtil.getParam({
                            pageId:logPageId,
                            title:logTitle,
                            charset:logCharset,
                            sr:(window.screen.width + '*' + window.screen.height)
                        }) + '}';
                    var sparam = {
                        logUrl:httpUrlContent,
                        logHisRefer:hisPageUrl,
                        logParams:logParams,
                        logQuery:logQuery
                    };
                    this.sendRequest(this.beaconUrl,sparam);
                }catch(ex){
                    this.sendError(ex);    
                }
            },
            clickLog:function(sparam){// record a click log
                try{
                    // get the pageId
                    var clickPageId = this.pageId;
                    if (!clickPageId){// pageId is empty: generate a new one
                        this.pageId = upBeaconUtil.getRand();
                        clickPageId    = this.pageId;
                    }
                    var clickAuthId = this.authId;// authId uniquely identifies a particular website
                    if (!clickAuthId){
                        clickAuthId = '-';
                    }
                    if (upBeaconUtil.isObject(sparam)){// the argument is a plain JavaScript object
                        sparam.pageId = clickPageId;
                        sparam.authId = clickAuthId;
                    }else{
                        if (upBeaconUtil.isString(sparam) && sparam.indexOf('=') > 0){// the argument is a key=value string
                            sparam += '&pageId=' + clickPageId + "&authId=" + clickAuthId;
                        }else{
                            if (upBeaconUtil.isArray(sparam)){// the argument is an array
                                sparam.push("pageId=" + clickPageId);
                                sparam.push("authId=" + clickAuthId);
                                sparam = sparam.join('&');// convert the array to a string
                            }else{// any other type
                                sparam = {pageId:clickPageId,authId:clickAuthId};
                            }
                        }
                    }
                    this.sendRequest(this.clickUrl, sparam);// send the click log
                }catch(ex){
                    this.sendError(ex);        
                }
            },
            sendRequest:function(url,params){// send a log request
                var urlParam = '',currDate = new Date();
                try{
                    if (params){
                        urlParam = upBeaconUtil.getParam(params,false);
                        urlParam = (urlParam == '')?urlParam:(urlParam + '&');
                    }
                    var tmpUrlParam = 'ver=' + upBeaconUtil.getVersion() + '&time=' + currDate.getTime();
                    url = this.protocol() + url + '?' + urlParam + tmpUrlParam;
                    
                    var logImage = new Image();
                    logImage.onload = function(){
                        logImage = null;    
                    }
                    logImage.src = url;
                }catch(e){
                    this.sendError(e);
                }
            },
            sendError:function(ex){// send an error log
                var errURIParams = upBeaconUtil.parseError(ex),
                    errURL = this.errorUrl + '?type=send&exception=' + encodeURIComponent(errURIParams.toString()),
                    errImage = new Image();
                errImage.onload = function(){
                    errImage = null;    
                };
                errImage.src = this.protocol() + errURL;
            }
        };
        beaconMethod.tracking();
        window.upLogger = beaconMethod;// expose the upLogger object on window
    })(window,document);
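    The technique at the heart of sendRequest is the image beacon: the data travels in the query string of an image request, so the web server's access log records it with no server-side code. The following is a minimal standalone sketch of that idea. `buildBeaconUrl` and `sendImageBeacon` are hypothetical names that do not appear in up_beacon.js, and unlike the script above this sketch percent-encodes every key and value.

```javascript
// Hypothetical helper (not part of up_beacon.js): serialise a flat object
// into a query string and append it to the beacon endpoint.
function buildBeaconUrl(endpoint, params, now) {
    var pairs = [];
    for (var key in params) {
        // Skip inherited properties and null/undefined values.
        if (Object.prototype.hasOwnProperty.call(params, key) && params[key] != null) {
            pairs.push(encodeURIComponent(key) + '=' + encodeURIComponent(String(params[key])));
        }
    }
    // A timestamp makes every URL unique so the browser cannot answer from cache.
    pairs.push('time=' + now);
    return endpoint + '?' + pairs.join('&');
}

// Hypothetical sender: in a browser, assigning src triggers the GET request.
function sendImageBeacon(url) {
    if (typeof Image === 'undefined') {
        return false; // not running in a browser, so nothing is sent
    }
    var img = new Image();
    // Release the reference once the request finishes, whether it succeeds or fails.
    img.onload = img.onerror = function () { img = null; };
    img.src = url;
    return true;
}

var beaconUrl = buildBeaconUrl(
    'http://127.0.0.1/a.gif',
    { type: 'MyTest', clickTarget: 'testClickBtn' },
    1340676121252
);
console.log(beaconUrl);
```

    In a page one would call `sendImageBeacon(beaconUrl)`; passing the timestamp in as an argument keeps `buildBeaconUrl` deterministic and easy to check.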

    The install_up_beacon.js file is the one exposed to outside pages:

    (function(window,document,undefined){
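        /* Installer for the collection script */
        // upLogger is the object the collection script exposes to the page
        if (window.upLogger){// already installed: return to avoid installing twice
            return;
        }
        var cookieUtil = {// cookie helper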
            setCookie:function(sName,sValue,oExpires,sPath,sDomain,bSecure){
                var currDate = new Date(),
                    sExpires = typeof oExpires == 'undefined'?'':';expires=' + new Date(currDate.getTime() + (oExpires * 24 * 60 * 60* 1000)).toUTCString();
                document.cookie = sName + '=' + sValue + sExpires + ((sPath == null)?'':(' ;path=' + sPath)) + ((sDomain == null)?'':(' ;domain=' + sDomain)) + ((bSecure == true)?' ; secure':'');
            },
            getCookie:function(sName){
                var regRes = document.cookie.match(new RegExp("(^| )" + sName + "=([^;]*)(;|$)"));
                return (regRes != null)?unescape(regRes[2]):'-';
            }        
        };
        var btsVal = cookieUtil.getCookie('b_t_s'),// the b_t_s cookie (1) marks whether the collection script has already been installed and (2) records the script's validity period
            startTime = 0,
            intervalTime = 3 * 24 * 60 * 60 * 1000,
            currIntervalTime = new Date().getTime() - 1200000000000,
            domainHead = (document.URL.substring(0,document.URL.indexOf('://'))) + '://';
        if (btsVal != '-' && btsVal.indexOf('t') != -1){
            var getBtsTime = btsVal.substring(btsVal.indexOf('t') + 1,btsVal.indexOf('x')),
                getCurrInterVal = currIntervalTime - getBtsTime;
            if (getCurrInterVal > intervalTime){
                startTime = currIntervalTime;
                cookieUtil.setCookie('b_t_s',btsVal.replace('t' + getBtsTime + 'x', 't' + currIntervalTime + 'x'), 10000, '/');
            }else{
                startTime = getBtsTime;
            }
        }else{
            if (btsVal == '-'){
                cookieUtil.setCookie('b_t_s','t' + currIntervalTime + 'x', 10000, '/');    
            }else{
                cookieUtil.setCookie('b_t_s',btsVal + 't' + currIntervalTime + 'x', 10000, '/');        
            }
            startTime = currIntervalTime;
        }
        document.write('<script src="' + domainHead + '127.0.0.1/up_beacon.js?' + startTime + '"><\/script>');// install the collection script
    })(window,document);
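    The branch logic of the installer can be hard to follow because it is interleaved with cookie writes. Below is a sketch of the same decision as a pure function, under the assumption that the b_t_s value consists of optional flags such as 's' plus one `t<number>x` segment. `resolveStartTime`, `EPOCH_OFFSET` and `REFRESH_INTERVAL` are hypothetical names introduced only for illustration.

```javascript
var EPOCH_OFFSET = 1200000000000;                // constant subtracted in install_up_beacon.js
var REFRESH_INTERVAL = 3 * 24 * 60 * 60 * 1000;  // three days in milliseconds

// Hypothetical pure function mirroring the installer's branches.
// btsVal is the current b_t_s cookie value ('-' when absent); nowMs is the current time in ms.
function resolveStartTime(btsVal, nowMs) {
    var current = nowMs - EPOCH_OFFSET;
    var match = /t(\d+)x/.exec(btsVal);
    if (!match) {
        // No timestamp segment yet: append one, keeping existing flags such as 's'.
        return { startTime: current, cookie: (btsVal === '-' ? '' : btsVal) + 't' + current + 'x' };
    }
    var stored = parseInt(match[1], 10);
    if (current - stored > REFRESH_INTERVAL) {
        // Older than three days: refresh the segment so a new script URL is requested.
        return { startTime: current, cookie: btsVal.replace(match[0], 't' + current + 'x') };
    }
    // Still fresh: reuse the stored value so the browser can use its cached script.
    return { startTime: stored, cookie: btsVal };
}

var fresh = resolveStartTime('st140675524644x', 1340676114790);
console.log(fresh.startTime);
```

    Because startTime becomes the query string of the up_beacon.js URL, the browser can cache the script for up to three days and is forced to fetch it again once the segment is refreshed.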

    The test pages follow.

    The first test page, testbeacon.html:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>beacon test page</title>
    </head>
    <script type="text/javascript" src="install_up_beacon.js"></script>
    <body>
    <h1>Log test</h1>
    <input type="button" value="Click Button" id="clickBtn" name="clickBtn" onclick="clickLog('testClickBtn','MyTest')"/>
    </body>
    </html>
    <script type="text/javascript">
    // user behavior tracking code
    function recordStaticLogerr(authId,type,msg){
        if (window.upLogger){
            upLogger.authId = authId;
            upLogger.clickLog('type=' + type + '&clickTarget=' + msg);    
        }
    }
    
    // record a click log
    function clickLog(clog_msg,clog_type){
        var clog_authId    = 'sharpxiajun';
        recordStaticLogerr(clog_authId,clog_type,clog_msg);    
    }
    </script>

    The second test page, parent.html:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>parent html</title>
    </head>
    
    <body>
    <a href="testbeacon.html" target="_self">child.html</a>
    </body>
    </html>

    1.5     Test results

    Test addresses:

    http://localhost/testbeacon.html

    http://localhost/parent.html

    Inspecting the cookies gives the result shown in the figure below:

    The log entries are as follows:

    127.0.0.1 - - [26/Jun/2012:10:01:52 +0800] [2012-06-26 10:01:52] "GET /parent.html HTTP/1.1" [] [/parent.html] 304 -
    127.0.0.1 - - [26/Jun/2012:10:01:54 +0800] [2012-06-26 10:01:54] "GET /testbeacon.html HTTP/1.1" [] [/testbeacon.html] 304 -
    127.0.0.1 - - [26/Jun/2012:10:01:54 +0800] [2012-06-26 10:01:54] "GET /install_up_beacon.js HTTP/1.1" [] [/install_up_beacon.js] 304 -
    127.0.0.1 - - [26/Jun/2012:10:01:54 +0800] [2012-06-26 10:01:54] "GET /up_beacon.js?140675524644 HTTP/1.1" [?140675524644] [/up_beacon.js] 304 -
    127.0.0.1 - - [26/Jun/2012:10:01:54 +0800] [2012-06-26 10:01:54] "GET /a.gif?logUrl={/localhost/testbeacon.html}&logHisRefer={http://localhost/parent.html}&logParams={subIsNew=0}&logQuery={pageId=1340676114790-42900296489937289847295051780050&title=beacon%20test%20page&charset=UTF-8&sr=1280*1024}&ver=140675524644&time=1340676114791 HTTP/1.1" [?logUrl={/localhost/testbeacon.html}&logHisRefer={http://localhost/parent.html}&logParams={subIsNew=0}&logQuery={pageId=1340676114790-42900296489937289847295051780050&title=beacon%20test%20page&charset=UTF-8&sr=1280*1024}&ver=140675524644&time=1340676114791] [/a.gif] 200 43
    127.0.0.1 - - [26/Jun/2012:10:02:01 +0800] [2012-06-26 10:02:01] "GET /click.html?type=MyTest&clickTarget=testClickBtn&pageId=1340676114790-42900296489937289847295051780050&authId=sharpxiajun&ver=140675524644&time=1340676121252 HTTP/1.1" [?type=MyTest&clickTarget=testClickBtn&pageId=1340676114790-42900296489937289847295051780050&authId=sharpxiajun&ver=140675524644&time=1340676121252] [/click.html] 200 310

    As you can see, every request has been recorded. All that remains is to analyze the information in these log files carefully.

     PDF download:
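    As a first step toward that analysis, here is a sketch that unpacks the query string of the a.gif entry above. The `{...}` groups contain their own `&` separators, so a plain split on `&` would cut them apart. `splitTopLevel`, `parsePairs` and `parseBeaconQuery` are hypothetical names, and the sketch assumes the braces in the logged values are balanced.

```javascript
// Hypothetical helper: split on '&' only when outside {...} groups.
function splitTopLevel(query) {
    var parts = [], depth = 0, start = 0;
    for (var i = 0; i < query.length; i++) {
        var ch = query.charAt(i);
        if (ch === '{') {
            depth++;
        } else if (ch === '}') {
            depth--;
        } else if (ch === '&' && depth === 0) {
            parts.push(query.slice(start, i));
            start = i + 1;
        }
    }
    parts.push(query.slice(start));
    return parts;
}

// Hypothetical helper: "a=1&b=2" -> { a: '1', b: '2' }, decoding each value once.
function parsePairs(text) {
    var out = {};
    if (!text) { return out; }
    text.split('&').forEach(function (pair) {
        var eq = pair.indexOf('=');
        if (eq === -1) { return; }
        var raw = pair.slice(eq + 1), value;
        try { value = decodeURIComponent(raw); } catch (e) { value = raw; }
        out[pair.slice(0, eq)] = value;
    });
    return out;
}

// Hypothetical helper: build a structured record from the logged [%q] field.
function parseBeaconQuery(query) {
    var result = {};
    splitTopLevel(query.replace(/^\?/, '')).forEach(function (part) {
        var eq = part.indexOf('=');
        if (eq === -1) { return; }
        var key = part.slice(0, eq), value = part.slice(eq + 1);
        if (value.charAt(0) === '{' && value.charAt(value.length - 1) === '}') {
            var inner = value.slice(1, -1);
            // logQuery and logParams hold key=value pairs; the other groups hold a bare URL.
            result[key] = (key === 'logQuery' || key === 'logParams') ? parsePairs(inner) : inner;
        } else {
            result[key] = value;
        }
    });
    return result;
}

var logged = '?logUrl={/localhost/testbeacon.html}&logHisRefer={http://localhost/parent.html}&logParams={subIsNew=0}&logQuery={pageId=1340676114790-42900296489937289847295051780050&title=beacon%20test%20page&charset=UTF-8&sr=1280*1024}&ver=140675524644&time=1340676114791';
var record = parseBeaconQuery(logged);
console.log(record.logUrl, record.logQuery.title, record.logQuery.sr);
```

    The pageId recovered this way also appears in the click.html entry, which is what lets a click be joined back to the page view it came from.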

    https://files.cnblogs.com/sharpxiajun/%E7%94%A8%E6%88%B7%E8%A1%8C%E4%B8%BA%E5%88%86%E6%9E%90%E7%A0%94%E7%A9%B6%E4%B9%8B%E6%95%B0%E6%8D%AE%E9%87%87%E9%9B%86.pdf

     

  • Original article: https://www.cnblogs.com/sharpxiajun/p/2563509.html