• Word-frequency counting with PySpark and returning the top N


    Part I: Count word frequencies and return the top N

    Text data to count:

    what do you do
    how do you do
    how do you do
    how are you
    
    from operator import add

    from pyspark import SparkContext


    def sort_t():
        sc = SparkContext(appName="testWC")
        data = sc.parallelize(["what do you do", "how do you do", "how do you do", "how are you"])
        # Split lines into words, count each word, sort by count (descending),
        # and take the three most frequent words.
        result = (data.flatMap(lambda x: x.split(" "))
                  .map(lambda x: (x, 1))
                  .reduceByKey(add)
                  .sortBy(lambda x: x[1], False)
                  .take(3))
        for k, v in result:
            print("%s %s" % (k, v))
        sc.stop()


    if __name__ == '__main__':
        sort_t()
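
As a quick local cross-check of what the Spark job should print, the same split, count, sort, and take steps can be sketched in plain Python with `collections.Counter`. This is not part of the original post: `top_n_words` is a hypothetical helper name, and the sketch assumes the input fits in memory.

```python
from collections import Counter


def top_n_words(lines, n):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # Same tokenization as the Spark job: split on single spaces.
        counts.update(line.split(" "))
    # most_common sorts by count in descending order.
    return counts.most_common(n)


lines = ["what do you do", "how do you do", "how do you do", "how are you"]
for word, count in top_n_words(lines, 3):
    print("%s %s" % (word, count))
```

For the four sample lines, the three most frequent words are `do` (6), `you` (4), and `how` (3), which is what the Spark version should also report.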

     

    Part II: Sort the data and return the top N

    Sample data, numbers_data.txt:

    15561
    112
    -40
    51467112
    234
    8561
    112
    -34
    53467111 121
    2345 789 34
    14561 -21
    12112 101 100
    -4 23
    51467111
    2434
    15567
    132
    -14
    51467111
    237
    

      

    from pyspark import SparkContext


    def solve():
        sc = SparkContext(appName="Sort_test_example")
        lines = sc.textFile("../input/numbers_data.txt")
        # Split each line on whitespace (split() also tolerates blank lines),
        # key every number by its integer value, sort by key in descending
        # order, and keep the first three pairs.
        results = (lines.flatMap(lambda x: x.split())
                   .map(lambda x: (int(x), 1))
                   .sortByKey(ascending=False)
                   .take(3))
        for key, value in results:
            print(key)
        sc.stop()


    if __name__ == '__main__':
        solve()

    Note: when there are ties, the multiple tied numbers are returned.
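
One possible reading of the note above is that every value tied with the N-th distinct value should be returned, whereas `take(3)` on its own returns at most three elements. The following is a minimal plain-Python sketch of that reading. It is not from the original post: `top_n_with_ties` is a hypothetical helper, and the tie rule (cut off at the N-th largest distinct value) is an assumption.

```python
def top_n_with_ties(values, n):
    """Return all values >= the n-th largest distinct value, in descending order."""
    if n <= 0:
        return []
    distinct_desc = sorted(set(values), reverse=True)
    if not distinct_desc:
        return []
    # Cutoff is the n-th largest distinct value (or the smallest value
    # if there are fewer than n distinct values).
    cutoff = distinct_desc[:n][-1]
    return sorted((v for v in values if v >= cutoff), reverse=True)


# 9 appears twice, so asking for the top 2 returns three numbers.
print(top_n_with_ties([5, 9, 9, 1, 7], 2))
```

In PySpark the same idea could probably be expressed by computing the cutoff with `distinct()` and `top(n)` on the RDD, then applying `filter` to keep every element at or above it.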

  • Original post: https://www.cnblogs.com/SeaSky0606/p/7762703.html