• Hadoop MapReduce: a step-by-step walkthrough of the wordcount job


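    Split the input files

    With the default TextInputFormat, every line of a file becomes one <key, value> pair: the key is the byte offset at which the line starts, and the value is the text of the line.

    File 1:                          Split result:
    hello world                      <0, "hello world">
    this is wordcount                <12, "this is wordcount">

    File 2:
    hello china                      <0, "hello china">
    hello IT                         <12, "hello IT">

    The second line of each file starts at offset 12 because the first line is 11 characters long and is followed by one newline character.

    The test files are tiny, far smaller than an HDFS block, so each file normally forms a single split.

    The MapReduce framework performs this splitting itself; no user code is involved.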
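The offsets above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not Hadoop's own record reader: the function name `line_records` is made up for illustration, and the files are assumed to be newline-terminated byte strings.

```python
def line_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value excludes the line ending.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        # The newline byte counts toward the offset of the next line.
        offset += len(raw)


file1 = b"hello world\nthis is wordcount\n"
file2 = b"hello china\nhello IT\n"

records1 = list(line_records(file1))
records2 = list(line_records(file2))
print(records1)  # [(0, 'hello world'), (12, 'this is wordcount')]
print(records2)  # [(0, 'hello china'), (12, 'hello IT')]
```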

     

    Next, each <key, value> pair is handed to the user-defined map method, which emits new <key, value> pairs:

    <0, "hello world">           map()  ->  <hello,1> <world,1>
    <12, "this is wordcount">    map()  ->  <this,1> <is,1> <wordcount,1>
    <0, "hello china">           map()  ->  <hello,1> <china,1>
    <12, "hello IT">             map()  ->  <hello,1> <IT,1>
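As a rough illustration of what the map step does, here is a plain-Python stand-in for a wordcount mapper. The name `wordcount_map` is hypothetical, and splitting on whitespace is an assumption about how words are tokenized; a real Hadoop mapper would write its pairs to the framework's context instead of returning a list.

```python
def wordcount_map(offset, line):
    """Emit one (word, 1) pair per whitespace-separated token.

    The offset key is ignored: wordcount only cares about the line text.
    """
    return [(word, 1) for word in line.split()]


records = [(0, "hello world"), (12, "this is wordcount")]
map_output = [pair
              for offset, line in records
              for pair in wordcount_map(offset, line)]
print(map_output)
# [('hello', 1), ('world', 1), ('this', 1), ('is', 1), ('wordcount', 1)]
```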

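    Between map() and reduce() there is a shuffle phase, which sorts the output of each map task by key:

    Map task for file 1:
    <hello,1> <world,1>                 shuffle  ->  <hello,1>
    <this,1> <is,1> <wordcount,1>                    <is,1>
                                                     <this,1>
                                                     <wordcount,1>
                                                     <world,1>

    Map task for file 2:
    <hello,1> <china,1>                 shuffle  ->  <IT,1>
    <hello,1> <IT,1>                                 <china,1>
                                                     <hello,1>
                                                     <hello,1>

    Text keys are compared byte by byte, so the uppercase IT sorts ahead of the lowercase words.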
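The sort-by-key part of the shuffle can be sketched in plain Python. This covers only the ordering, not partitioning or the transfer of data between nodes, and `sort_by_key` is a made-up helper name. It assumes keys are ordered by their UTF-8 bytes, under which uppercase ASCII letters come before lowercase ones.

```python
def sort_by_key(pairs):
    """Sort (key, value) pairs by the UTF-8 bytes of the key.

    sorted() is stable, so pairs with equal keys keep their emit order.
    """
    return sorted(pairs, key=lambda kv: kv[0].encode("utf-8"))


file1_map_output = [("hello", 1), ("world", 1), ("this", 1), ("is", 1), ("wordcount", 1)]
file2_map_output = [("hello", 1), ("china", 1), ("hello", 1), ("IT", 1)]

sorted1 = sort_by_key(file1_map_output)
sorted2 = sort_by_key(file2_map_output)
print(sorted1)
# [('hello', 1), ('is', 1), ('this', 1), ('wordcount', 1), ('world', 1)]
print(sorted2)
# [('IT', 1), ('china', 1), ('hello', 1), ('hello', 1)]
```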

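    Grouping: values that share a key are collected into one list.

    Map task for file 1:
    <hello,1>            ->  <hello,list(1)>
    <is,1>               ->  <is,list(1)>
    <this,1>             ->  <this,list(1)>
    <wordcount,1>        ->  <wordcount,list(1)>
    <world,1>            ->  <world,list(1)>

    Map task for file 2:
    <IT,1>               ->  <IT,list(1)>
    <china,1>            ->  <china,list(1)>
    <hello,1>
    <hello,1>            ->  <hello,list(1,1)>

    Merging the groups from both map tasks gives the input to reduce():

    <IT,list(1)>
    <china,list(1)>
    <hello,list(1,1,1)>
    <is,list(1)>
    <this,list(1)>
    <wordcount,list(1)>
    <world,list(1)>

    If a combiner is enabled, the two hello values from file 2 are already summed on the map side, and the reducer receives <hello,list(1,2)> instead.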
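The grouping and merging can be mimicked in plain Python as follows. This is a simplified model with no combiner and a single reducer; `group_sorted` and `merge_groups` are illustrative names, not Hadoop APIs.

```python
from itertools import groupby
from operator import itemgetter


def group_sorted(pairs):
    """Collapse key-sorted (key, value) pairs into (key, [values]) groups."""
    return [(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]


def merge_groups(*grouped_outputs):
    """Merge the groups of several map tasks into one key-sorted list."""
    merged = {}
    for groups in grouped_outputs:
        for key, values in groups:
            merged.setdefault(key, []).extend(values)
    return sorted(merged.items(), key=lambda kv: kv[0].encode("utf-8"))


sorted1 = [("hello", 1), ("is", 1), ("this", 1), ("wordcount", 1), ("world", 1)]
sorted2 = [("IT", 1), ("china", 1), ("hello", 1), ("hello", 1)]

grouped2 = group_sorted(sorted2)
reduce_input = merge_groups(group_sorted(sorted1), grouped2)
print(grouped2)      # [('IT', [1]), ('china', [1]), ('hello', [1, 1])]
print(reduce_input)  # 'hello' now carries [1, 1, 1]
```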

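    These grouped <key, value> pairs are then passed to the user's reduce() method, which sums each list and emits the final <key, value> pairs. Together they form the wordcount output:

    <IT,1>
    <china,1>
    <hello,3>
    <is,1>
    <this,1>
    <wordcount,1>
    <world,1>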
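The reduce step for wordcount amounts to summing each list of values. A minimal stand-in, assuming the grouped input of the two sample files and a hypothetical function name `wordcount_reduce`:

```python
def wordcount_reduce(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


reduce_input = [
    ("IT", [1]), ("china", [1]), ("hello", [1, 1, 1]), ("is", [1]),
    ("this", [1]), ("wordcount", [1]), ("world", [1]),
]
result = [wordcount_reduce(key, values) for key, values in reduce_input]
for word, count in result:
    print(word, count)
```

The sum gives the same total whether the reducer receives raw ones or partial sums from a combiner, which is why the wordcount reducer can also serve as the combiner.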

  • Original article: https://www.cnblogs.com/pickKnow/p/10767222.html