• Using nested foreach


    Inside a foreach, each record of the data stream can be processed with relational operations, and generate finally returns data to the outside. Note, however, that relational operators cannot be applied to expressions; an expression must first be extracted into a relation.

    Only distinct, filter, limit, and order are supported inside a foreach, and the last statement must be a generate.
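    For instance, filter works the same way inside the block. A minimal sketch (the trades relation and its price field are hypothetical, not part of the examples below):

    trades = load 'trades' as (symbol:chararray, price:double);
    grpd_t = group trades by symbol;
    big    = foreach grpd_t {
               -- filter is one of the four operators allowed inside foreach
               expensive = filter trades by price > 100.0;
               generate group, COUNT(expensive);
    };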

    A foreach processes data one record at a time. For a grouped relation, each record is one group, so a nested foreach receives one group's subset per invocation.

    That is why a nested foreach is normally preceded by a group on the relation;

    in other words, after grouping, one group's subset comes in at a time, so statistics and similar operations can be performed per group. (A demonstration appears at the end.)

    For example, counting the number of distinct symbols per exchange:

    --distinct_symbols.pig

    daily   = load 'NYSE_daily' as (exchange, symbol); -- not interested in other fields

    grpd    = group daily by exchange;

    uniqcnt = foreach grpd {

              sym      = daily.symbol;

              uniq_sym = distinct sym;

              generate group, COUNT(uniq_sym);

    };

    The foreach takes each group of grpd in turn; daily.symbol, an expression, is projected into the relation sym so that a relational operator (distinct) can process it.
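    order and limit combine the same way; here is a hedged sketch of a per-group top-3 (the volume column is an assumption — the load above deliberately ignores NYSE_daily's other fields):

    daily3 = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, volume:long);
    grpd3  = group daily3 by exchange;
    top3   = foreach grpd3 {
               sorted = order daily3 by volume desc;  -- sort this group's bag
               top    = limit sorted 3;               -- keep the 3 largest records
               generate group, flatten(top);          -- one output row per kept record
    };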

     

     

    Two deduplication examples using nested foreach

    1. Counting the distinct values of two columns separately: extract two sub-bags inside the group, deduplicate each, then count.

    --double_distinct.pig

    divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray);

    grpd = group divs all;

    uniq = foreach grpd {

               exchanges      = divs.exchange;   -- this 'divs' is the $1 column of grpd (a bag)

               uniq_exchanges = distinct exchanges;

               symbols        = divs.symbol;

               uniq_symbols   = distinct symbols;

               generate COUNT(uniq_exchanges), COUNT(uniq_symbols);

    };
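    Since group divs all collapses everything into a single group, uniq holds exactly one tuple, and the unnamed COUNT columns appear in the schema as bare longs (the same pattern seen in the tests below):

    describe uniq;
    -- uniq: {long,long}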

     

    Another approach uses limit to return a single record. Here log data is grouped by hour to count cookies:

    (1)   Deduplicate the (hour, cookie) pairs by cookie;

    (2)   group by hour;

    (3)   count within each hour group;

    (4)   sum the per-group counts to get the final result.

    The full code follows:

    showlogs = LOAD '$inpath' USING PigStorage(',');

    necessaryfields = FOREACH showlogs GENERATE (chararray)$1 AS fhour, $13 AS fallyesid, $19 AS fip, $20 AS fuseragent, CONCAT((chararray)$19, (chararray)$20) AS fipuseragent;

    -- 1. Build hour + cookie: hourcookie

    hourcookie = FOREACH necessaryfields GENERATE $0 AS hour, $1 AS cookie;

     

    -- 2. Deduplicate hourcookie by cookie: dist_hour_cookie

    ta = GROUP hourcookie BY cookie;

    dist_hour_cookie = FOREACH ta {

           temp = LIMIT hourcookie 1;   -- keep any one record from this cookie's bag

           GENERATE FLATTEN(temp);

    };
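    Note that this is not the same as a plain DISTINCT on the pairs: the group-by-cookie plus limit 1 idiom keeps one arbitrary (hour, cookie) row per cookie, while DISTINCT keeps every distinct combination, so a cookie active in several hours would survive multiple times:

    -- NOT equivalent: one row per distinct (hour, cookie) pair, not per cookie
    pair_distinct = DISTINCT hourcookie;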

     

    -- 3. Group dist_hour_cookie by hour: dist_hour_cookie_grp

    dist_hour_cookie_grp = GROUP dist_hour_cookie BY $0;

     

    -- 4. Count each group of dist_hour_cookie_grp: dist_hour_cookie_grp_count

    dist_hour_cookie_grp_count = FOREACH dist_hour_cookie_grp {

           GENERATE group AS hour, COUNT(dist_hour_cookie) AS cookie_count;

    };

     

    -- 5. Sum the per-hour counts

    sumtemp = GROUP dist_hour_cookie_grp_count all;

    total_cookie = FOREACH sumtemp GENERATE SUM(dist_hour_cookie_grp_count.cookie_count);

    dump total_cookie;
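    To run the script, the $inpath parameter has to be supplied on the command line, e.g. (the script filename here is a placeholder):

    pig -param inpath=/path/to/showlogs.csv hourly_cookie_count.pig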

     

    Regarding nested foreach processing one record at a time: I ran a test to confirm that, after a group, the foreach is handed one group's subset per record.


    The data is grouped by guid, so nearly every record forms its own group.
    Within each group we deduplicate and count,
    and the final generate outputs both group and the raw guid bag for comparison, which exposes the duplicated records:
    my_data = foreach origin_cleaned_data generate wizad_ad_id,guid,os_version,log_type;
    test_data = limit my_data 100;
    g_log = group test_data by guid;
    uniq = foreach g_log{
            guid = test_data.guid;
            os_v = test_data.os_version;
            uniq_guid = distinct guid;
            generate group,COUNT(uniq_guid),COUNT(os_v),guid;
            };
    dump uniq;
    describe uniq;


    Results:
    Schema of g_log: {group: chararray,test_data: {wizad_ad_id: chararray,guid: chararray,os_version: chararray,log_type: chararray}}
    Output:
    (351794060670802,1,1,{(351794060670802)})
    (352246063893286,1,1,{(352246063893286)})
    (352274018390729,1,1,{(352274018390729)})
    (352315053649659,1,1,{(352315053649659)})
    ......
    (354710052256050,1,1,{(354710052256050)})
    (355065053261297,1,1,{(355065053261297)})
    (861202021584958,1,1,{(861202021584958)})
    (861276027634215,1,1,{(861276027634215)})
    (861288000290493,1,1,{(861288000290493)})
    (861372020081247,1,3,{(861372020081247),(861372020081247),(861372020081247)})
    (862011024062881,1,1,{(862011024062881)})
    (862040020713619,1,1,{(862040020713619)})
    (862055100027987,1,1,{(862055100027987)})
    (862106010206458,1,1,{(862106010206458)})
    (862191016593489,1,1,{(862191016593489)})
    (862283020830914,1,1,{(862283020830914)})
    (862324016545965,1,2,{(862324016545965),(862324016545965)})
    (862565010397387,1,1,{(862565010397387)})
    (862620028211136,1,1,{(862620028211136)})
    (862663027090333,1,1,{(862663027090333)})
    (862703020059792,1,1,{(862703020059792)})
    (862910026533684,1,1,{(862910026533684)})
    (862966020482112,1,1,{(862966020482112)})
    (863077025294442,1,1,{(863077025294442)})
    (863139026459463,1,1,{(863139026459463)})
    ......
    uniq: {group: chararray,long,long,guid: {guid: chararray}}


    From this we can see:
    Because grouping is by guid, the group key and the guid bag largely mirror each other, but since some guids repeat, a few records carry extra tuples in the bag.
    If uniq_guid were emitted instead, i.e. generate group,COUNT(uniq_guid),COUNT(os_v),uniq_guid; then the key and the bag would match exactly.
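    For reference, that modified foreach would read:

    uniq = foreach g_log{
            guid = test_data.guid;
            os_v = test_data.os_version;
            uniq_guid = distinct guid;
            generate group,COUNT(uniq_guid),COUNT(os_v),uniq_guid;
            };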


    For comparison, drop group and guid from the generate, i.e. generate COUNT(uniq_guid),COUNT(os_v);
    Output:
    (1,1)
    (1,1)
    (1,1)
    (1,1)
    (1,1)
    .......


    If we instead group by log_type (2 = click logs, 1 = show logs),
    a limit of 100 happens to capture only type-1 records:
    g_log = group test_data by log_type;
    uniq = foreach g_log{
            guid = test_data.guid;
            os_v = test_data.os_version; 
            uniq_guid = distinct guid;
            generate COUNT(uniq_guid),COUNT(os_v),guid;
            };

    The resulting schema and output:
    uniq: {long,long,guid: {guid: chararray}}
    (97,100,{(351794060670802),(352246063893286),(352274018390729),(352315053649659),(352316055186377),(352903052496931),(352956061574924),(353721059706006),(354710052256050),(355065053261297),(355310044316523),(355431800349753),(355594050655870),(355868054005229),(356405053894342),(356521051392830),(356524057953100),(356845052409198),(356988053301574),(357070009612856),(357116040052477),(357747051110463),(357784057979674),(358197058147473),(358373048304967),(358585052003587),(358968041526969),(359092058109426),(359357054529855),(359786051477264),(359899046896710),(860173013415348),(860570023370282),(860602020630317),(860813022906346),(860892020832126),(861022007037726),(861060010353755),(861118000103844),(861133029749824),(861202021584958),(861276027634215),(861288000290493),(861372020081247),(861372020081247),(861372020081247),(862011024062881),(862040020713619),(862055100027987),(862106010206458),(862191016593489),(862283020830914),(862324016545965),(862324016545965),(862565010397387),(862620028211136),(862663027090333),(862703020059792),(862910026533684),(862966020482112),(863077025294442),(863139026459463),(863150020224084),(863151028063706),(863177029375831),(863235016133314),(863343023501554),(863427022570130),(863431020260215),(863735010063193),(863777020966957),(863827011094952),(864260024902038),(864264021430787),(864264021548489),(864299028966482),(864301029532170),(864375022648902),(864500025549803),(864505000231848),(864573012942174),(864789028603903),(864958020327031),(864989010679594),(865030012667169),(865316020164285),(865369022512415),(865369029331074),(865407016261415),(866805010343957),(867064010125951),(867064016986521),(867163010209227),(867264010644205),(867739012450792),(868201005627091),(868629010372156),(868880017937570),(869226012126760),(869642009320895)})


    Removing guid from generate COUNT(uniq_guid),COUNT(os_v),guid; yields:
    (97,100)
    uniq: {long,long}



    In addition, sample draws records according to the overall distribution of the log data, so every log_type value appears in the sample.

    Each execution samples differently, however, so repeated runs produce different counts:
    my_data = foreach origin_cleaned_data generate wizad_ad_id,guid,os_version,log_type;
    test_data = sample my_data 0.01;
    g_log = group test_data by log_type;
    uniq = foreach g_log{
            guid = test_data.guid;
            os_v = test_data.os_version;
            uniq_guid = distinct guid;
            generate group, COUNT(uniq_guid), COUNT(os_v);
            };
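    Because each run of sample draws a fresh random subset, one way to make the numbers repeatable (a sketch, not from the original pipeline) is to store the sample once and re-load the snapshot in later runs:

    store test_data into 'sample_snapshot';
    -- later runs re-load the materialized sample instead of re-sampling:
    test_data = load 'sample_snapshot' as (wizad_ad_id:chararray, guid:chararray, os_version:chararray, log_type:chararray);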



    Schema of g_log: {group: chararray,test_data: {wizad_ad_id: chararray,guid: chararray,os_version: chararray,log_type: chararray}}


    The output is:
    (1,13455,14246)
    (2,74,74)
    uniq: {group: chararray,long,long}


    Dropping group from the generate, i.e. generate COUNT(uniq_guid), COUNT(os_v); yields:
    (13407,14178)
    (73,73)
    uniq: {long,long}



  • Original article: https://www.cnblogs.com/cl1024cl/p/6205440.html