• Understanding MapReduce


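    1. Write a WordCount program in Python and submit the job

    Program

    WordCount

    Input

    A text file containing a large number of words

    Output

    Each word in the file with its number of occurrences (frequency), sorted alphabetically by word, one word and its count per line, with whitespace between the word and the count

    1. Write the map function and the reduce function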
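      #!/usr/bin/env python
      # map function (mapper.py): emit "word<TAB>1" for every word on stdin
      import sys

      for line in sys.stdin:
          line = line.strip()
          words = line.split()
          for word in words:
              print('%s\t%s' % (word, 1))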
      #!/usr/bin/env python
      # reduce function (reducer.py): sum the counts of each word; input must be sorted by word
      import sys

      current_word = None
      current_count = 0

      for line in sys.stdin:
          line = line.strip()
          try:
              word, count = line.split('\t', 1)
              count = int(count)
          except ValueError:
              continue  # skip lines without a tab or with a non-numeric count
          if current_word == word:
              current_count += count
          else:
              if current_word is not None:
                  print('%s\t%s' % (current_word, current_count))
              current_count = count
              current_word = word

      # flush the last word; nothing is printed when the input was empty
      if current_word is not None:
          print('%s\t%s' % (current_word, current_count))
    2. Make both scripts executable
      chmod a+x /home/hadoop/wc/mapper.py
      chmod a+x /home/hadoop/wc/reducer.py
    3. Test the code locally
      echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py
      echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py | sort -k1,1 | /home/hadoop/wc/reducer.py
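To see what the mapper, sort, and reducer pipeline computes without touching the shell, the same three stages can be imitated in a single Python process. This is only a sketch: the function names `map_words`, `reduce_counts`, and `word_count` are made up for illustration, and Python's `sorted` stands in for `sort -k1,1` (which may order words differently depending on the locale).

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reducer step: sum the counts of consecutive pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort (standing in for the shuffle phase), and reduce in order."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    # Same sample sentence as the echo test above.
    for word, count in word_count(["foo foo quux labs foo bar quux"]):
        print("%s\t%d" % (word, count))
```

The reducer only has to compare neighbouring keys because the sort guarantees that identical words arrive together, which is the same assumption reducer.py relies on.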

    4. Run on HDFS
      1. Upload the previously crawled text file to HDFS
      2. Submit the job with the Hadoop Streaming command
    5. View the results
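Steps 4 and 5 are not spelled out above, so the following is a hedged sketch of the commands they usually involve, assembled as argument lists in Python. The streaming jar location, the local file name, and the HDFS directories are placeholders, not values from this post; the helper `build_job_commands` is hypothetical. The sketch only builds and prints the commands; running them requires a working Hadoop installation.

```python
def build_job_commands(streaming_jar, local_input, hdfs_input_dir, hdfs_output_dir,
                       mapper="/home/hadoop/wc/mapper.py",
                       reducer="/home/hadoop/wc/reducer.py"):
    """Return argv lists for the upload, the job submission, and viewing the result."""
    mapper_name = mapper.rsplit("/", 1)[-1]
    reducer_name = reducer.rsplit("/", 1)[-1]
    return [
        # Step 4.1: create the input directory and upload the text file.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_input_dir],
        ["hdfs", "dfs", "-put", local_input, hdfs_input_dir],
        # Step 4.2: submit the streaming job; -files ships both scripts with the job
        # and has to come before the streaming options such as -mapper.
        ["hadoop", "jar", streaming_jar,
         "-files", "%s,%s" % (mapper, reducer),
         "-mapper", mapper_name,
         "-reducer", reducer_name,
         "-input", hdfs_input_dir,
         "-output", hdfs_output_dir],
        # Step 5: print the reducer output files.
        ["hdfs", "dfs", "-cat", hdfs_output_dir + "/part-*"],
    ]


if __name__ == "__main__":
    # All four arguments below are placeholders; substitute your own paths.
    commands = build_job_commands("/path/to/hadoop-streaming.jar", "words.txt",
                                  "/user/hadoop/wc/input", "/user/hadoop/wc/output")
    for argv in commands:
        print(" ".join(argv))
```

Note that the output directory must not exist before the job is submitted, otherwise Hadoop refuses to start the job.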

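    2. Process a weather dataset with MapReduce

    Write a program that finds the daily maximum and minimum temperatures, and the maximum and minimum temperatures over the whole interval

    1. The weather dataset can be downloaded from: ftp://ftp.ncdc.noaa.gov/pub/data/noaa
    2. Download a different year and month of data according to the last three digits of your student ID (for example, student 201506110136 downloads the 2013 data beginning with 6; adapt slightly depending on what data is actually available)
    3. Decompress the dataset and save it to a text file
    4. Parse the weather data format
    5. Write the map function and the reduce function
    6. Make both scripts executable
    7. Test the code locally
    8. Run on HDFS
      1. Upload the text file prepared above to HDFS
      2. Submit the job with the Hadoop Streaming command
    9. View the results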
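Step 4 (parsing the record format) can be sketched as a small function. It uses the same fixed-width slices as the mapper in this post: characters 15 to 22 for the date and 87 to 91 for the signed temperature. Two things are assumptions not stated in the post: that 9999 marks a missing reading, and that temperatures are stored in tenths of a degree Celsius. The name `parse_record` and the sample line are made up for illustration.

```python
MISSING = 9999  # assumed marker for "no temperature recorded"


def parse_record(line):
    """Return (date, temperature) for one fixed-width record, or None to skip it.

    date is the YYYYMMDD string at line[15:23]; temperature is the signed
    integer at line[87:92] (assumed to be tenths of a degree Celsius).
    """
    if len(line) < 92:
        return None  # too short to contain the temperature field
    date = line[15:23]
    try:
        temperature = int(line[87:92])
    except ValueError:
        return None  # temperature field is not a number
    if abs(temperature) == MISSING:
        return None  # missing reading, would distort max/min
    return date, temperature


if __name__ == "__main__":
    # Synthetic line: only the date and temperature columns carry real values.
    sample = "0" * 15 + "20090401" + "0" * 64 + "-0056" + "1"
    print(parse_record(sample))
```

Filtering missing readings before the reduce step matters because a value such as 9999 would otherwise win every maximum comparison.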
    cd /usr/hadoop
    sudo mkdir qx
    cd /usr/hadoop/qx
    
    wget -P data -r -c "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2009/4*"
    zcat data/ftp.ncdc.noaa.gov/pub/data/noaa/2009/4*.gz >qxdata.txt
    chmod a+x /home/hadoop/qx/mapper.py
    chmod a+x /home/hadoop/qx/reducer.py
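    #!/usr/bin/env python
    # map function (mapper.py): emit "date<TAB>temperature" for each record
    import sys

    for line in sys.stdin:
        line = line.strip()
        if len(line) < 92:
            continue  # skip records too short to hold the temperature field
        print('%s\t%d' % (line[15:23], int(line[87:92])))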
    #!/usr/bin/env python
    # reduce function (reducer.py): keep the highest temperature for each date; input must be sorted by date
    import sys

    current_date = None
    current_temperature = 0

    for line in sys.stdin:
        line = line.strip()
        try:
            date, temperature = line.split('\t', 1)
            temperature = int(temperature)
        except ValueError:
            continue  # skip lines without a tab or with a non-numeric temperature

        if current_date == date:
            if current_temperature < temperature:
                current_temperature = temperature
        else:
            if current_date is not None:
                print('%s\t%d' % (current_date, current_temperature))
            current_temperature = temperature
            current_date = date

    # flush the last date; nothing is printed when the input was empty
    if current_date is not None:
        print('%s\t%d' % (current_date, current_temperature))
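The reducer above reports only the highest temperature per date, while the task also asks for the daily minimum and for the extremes over the whole interval. One possible way to cover all four values is sketched below. It is not the author's code: the function names are hypothetical, the sample lines are invented, and the input is assumed to be "date<TAB>temperature" lines already sorted by date, as the mapper and `sort` would produce.

```python
from itertools import groupby


def parse_pairs(lines):
    """Turn 'date<TAB>temperature' lines into (date, int) pairs, skipping bad lines."""
    for line in lines:
        parts = line.strip().split("\t", 1)
        if len(parts) != 2:
            continue
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue


def daily_extremes(lines):
    """Yield (date, lowest, highest) per date; input must be sorted by date."""
    for date, group in groupby(parse_pairs(lines), key=lambda pair: pair[0]):
        temperatures = [temperature for _, temperature in group]
        yield date, min(temperatures), max(temperatures)


def overall_extremes(daily):
    """Return (lowest, highest) across all (date, lowest, highest) rows, or None."""
    rows = list(daily)
    if not rows:
        return None
    return min(row[1] for row in rows), max(row[2] for row in rows)


if __name__ == "__main__":
    # Invented readings, not taken from the dataset.
    sample = ["20090401\t50", "20090401\t-12", "20090401\t131",
              "20090402\t88", "20090402\t20"]
    for date, lowest, highest in daily_extremes(sample):
        print("%s\t%d\t%d" % (date, lowest, highest))
    print(overall_extremes(daily_extremes(sample)))
```

In a real streaming job with several reducers, each reducer would see only some of the dates, so the interval extremes would need a second pass over the per-day output.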
  • Original post: https://www.cnblogs.com/ming-z/p/9022006.html