• Study notes: Pig basics


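    I. Introduction to Pig

     1. Origins

    One drawback of MapReduce is its long development cycle. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the results is a time-consuming process. In fact, Yahoo designed Pig precisely so that its researchers and engineers could mine large datasets more conveniently.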

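    2. Basics

    Pig provides a scripting language (Pig Latin) for exploring large datasets.

    Its main advantage is that a few lines of Pig code typed at the console are enough to process terabytes of data.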

    II. Pig exercise

    The file below is the access log of a website. Use Pig to compute the number of hits from each IP address.

    1. Data source

    119.146.220.12 - - [31/Jan/2012:23:59:44 +0800] "POST /forum.php?mod=post&action=reply&fid=53&tid=69&extra=page%3D1&replysubmit=yes&infloat=yes&handlekey=fastpost&inajax=1 HTTP/1.1" 200 397 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:45 +0800] "GET /forum.php?mod=viewthread&tid=69&viewpid=677&from=&inajax=1&ajaxtarget=post_new HTTP/1.1" 200 4794 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:45 +0800] "GET /static/js/common_extra.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/common.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_forum_forumdisplay.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /forum.php?mod=forumdisplay&fid=53&page=1 HTTP/1.1" 200 49334 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_widthauto.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/forum.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:49 +0800] "GET /static/js/seditor.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:49 +0800] "GET /home.php?mod=spacecp&ac=pm&op=checknewpm&rand=1328025588 HTTP/1.1" 200 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    220.181.94.221 - - [31/Jan/2012:23:59:49 +0800] "GET /home.php?mod=spacecp&ac=pm&op=showmsg&handlekey=showmsg_11&touid=11&pmid=0&daterange=2&pid=77&tid=26 HTTP/1.1" 200 10074 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
    119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_common.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:51 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:52 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /static/js/smilies.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
    119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /data/cache/common_smilies_var.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
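Each log line starts with the client IP, followed by further space-separated fields, so splitting on a single space and keeping the first field is enough to recover the IP. The snippet below is a minimal Python sketch of that idea for checking a line locally; it is not part of the Pig job, and the function name `extract_ip` is a hypothetical helper introduced only for illustration.

```python
def extract_ip(line: str) -> str:
    """Return the first space-delimited field of an access-log line."""
    # Split only once: everything after the first space is ignored here.
    return line.split(" ", 1)[0]


# One line copied verbatim from the sample log above.
sample = (
    '119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] '
    '"GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 '
    '"http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" '
    '"Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"'
)

print(extract_ip(sample))
```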

    2. Pig commands

    -- Load the access log from HDFS, split each line on spaces, and keep only the first field (the IP)
    records = LOAD 'hdfs://hadoop:9000/class7/input/website_log.txt' USING PigStorage(' ') AS (ip:chararray);
    
    -- Group by IP and count the hits for each IP
    records_b = GROUP records BY ip;
    records_c = FOREACH records_b GENERATE group, COUNT(records) AS click;
    
    -- Sort by hit count in descending order and keep the 10 IPs with the most hits
    records_d = ORDER records_c BY click DESC;
    top10 = LIMIT records_d 10;
    
    -- Store the result in the class7 directory on HDFS
    STORE top10 INTO 'hdfs://hadoop:9000/class7/out';
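To sanity-check the job's output on a small sample, the same group, count, sort, and limit steps can be mirrored locally. The following is a rough Python sketch under stated assumptions, not the Pig implementation: `top_ips` is a hypothetical helper, blank lines are skipped as a choice made here, and the order of IPs with equal counts is not meant to match Pig's.

```python
from collections import Counter


def top_ips(lines, n=10):
    """Count hits per IP (first space-delimited field) and return the top n."""
    # Equivalent in spirit to LOAD ... AS (ip) followed by GROUP BY ip and COUNT.
    ips = (line.split(" ", 1)[0] for line in lines if line.strip())
    counts = Counter(ips)
    # Equivalent in spirit to ORDER ... BY click DESC followed by LIMIT n.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical input lines, used only to exercise the function.
demo = ["1.1.1.1 - -", "2.2.2.2 - -", "1.1.1.1 - -"]
for ip, clicks in top_ips(demo):
    print(ip, clicks)
```

Applied to the 20 sample lines in the data source section, this approach would report 19 hits for 119.146.220.12 and 1 hit for 220.181.94.221.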
  • Original source: https://www.cnblogs.com/FrankZhou2017/p/9145419.html