• Implementing a Hadoop word-count script in Python


    Map stage

    # -*- coding: utf-8 -*-
    import sys

    # Read raw text from stdin and emit one "word<TAB>1" record per word.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print("%s\t%s" % (word, 1))
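
To check the mapper logic without a cluster, the same split-and-emit step can be wrapped in a function and fed a sample line. This is only a minimal sketch: the function name `map_line` and the sample text are assumptions made for illustration and are not part of the original script.

```python
def map_line(line):
    """Return the "word<TAB>1" records the mapper would print for one line."""
    return ["%s\t%s" % (word, 1) for word in line.strip().split()]


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a line read from stdin.
    for record in map_line("hello hadoop hello\n"):
        print(record)
```

Each word produces its own record, duplicates included; the summing is left entirely to the reduce stage.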

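    Reduce stage

    # -*- coding: utf-8 -*-
    import sys

    current_word = None
    current_count = 0

    # Input arrives sorted by key, so identical words are adjacent.
    for line in sys.stdin:
        try:
            word, count = line.rstrip('\n').split('\t', 1)
            count = int(count)
        except ValueError:
            # Skip lines with no tab or with a non-numeric count.
            continue
        if current_word == word:
            current_count += count
        else:
            # A new word begins: flush the total for the previous one.
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_word = word
            current_count = count

    # Flush the last word; this also works when the final line was skipped.
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))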
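
Hadoop Streaming sorts the mapper output by key before handing it to the reducer, which is why the reducer only has to compare each word with the previous one. The sketch below imitates that sort-then-reduce step locally. Note that it swaps in `itertools.groupby` as an alternative to the manual `current_word` bookkeeping; the helper name `reduce_sorted` and the sample records are assumptions for illustration.

```python
from itertools import groupby


def reduce_sorted(records):
    """Sum counts per word from "word<TAB>count" records sorted by word."""
    pairs = []
    for record in records:
        word, _, count = record.rstrip("\n").partition("\t")
        try:
            pairs.append((word, int(count)))
        except ValueError:
            continue  # skip records with a missing or non-numeric count
    # groupby merges only adjacent equal keys, so the input must be sorted.
    return [(word, sum(c for _, c in group))
            for word, group in groupby(pairs, key=lambda p: p[0])]


if __name__ == "__main__":
    # Hypothetical mapper output, sorted here to imitate the shuffle phase.
    mapper_output = ["hello\t1", "hadoop\t1", "hello\t1"]
    for word, total in reduce_sorted(sorted(mapper_output)):
        print("%s\t%s" % (word, total))
```

If the `sorted` call is dropped, the two `hello` records are no longer adjacent and would be reported as separate groups, which is the same failure the streaming reducer would show on unsorted input.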
  • Original article: https://www.cnblogs.com/syw-home/p/13578296.html