• ipyparallel WordCount implementation


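In ipyparallel, multiple engines can work on one task at the same time to speed up processing. A cluster is exposed through views, of which there are two kinds: the direct view (DirectView) and the load-balanced view (LoadBalancedView). A direct view addresses all engines, or an explicitly chosen subset of them, while a load-balanced view hides a set of engines behind a scheduler so that they look like a single engine. The basic approach to parallelizing with ipyparallel is to split the input data, distribute the pieces across the engines, process each piece locally, and then merge the partial results into the final result, much like MapReduce.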
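The split, process, merge idea can be tried out without a running cluster. The following is a minimal local sketch, not ipyparallel code: split_into_chunks, count_words and merge are hypothetical helpers introduced here for illustration, and the contiguous, near-equal chunking only approximates what a real scatter call does.

```python
def split_into_chunks(items, n):
    """Split items into n contiguous chunks of near-equal size (hypothetical helper)."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks receive one additional item
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """Map step: count lower-cased words in a list of lines."""
    freqs = {}
    for line in lines:
        for word in line.split():
            key = word.lower()
            freqs[key] = freqs.get(key, 0) + 1
    return freqs


def merge(partials):
    """Reduce step: add up the per-chunk dictionaries."""
    total = {}
    for partial in partials:
        for word, count in partial.items():
            total[word] = total.get(word, 0) + count
    return total


lines = ["the quick brown fox", "The lazy dog", "the end"]
chunks = split_into_chunks(lines, 2)                  # stands in for the engines' shares
partials = [count_words(chunk) for chunk in chunks]   # would run on the engines
result = merge(partials)                              # runs on the client
```

On a real cluster the list comprehension in the middle is the only part that runs remotely; splitting and merging stay on the client.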
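Below is a parallel word count built on ipyparallel. It first reads the sentences from a file, then uses the scatter method of dview to split the lines into n chunks, one per engine, where n is the number of engines. Each engine counts word frequencies in its own chunk, and finally the results from all engines are merged.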
    #!/usr/bin/env python
    # coding: utf-8
    
    import time
    
    from itertools import repeat
    from ipyparallel import Client, Reference
    
    # Serial word count over a single string
    def wordfreq(text):
        """Return a dictionary of words and word counts in a string."""
        freqs = {}
        for word in text.split():
            lword = word.lower()
            freqs[lword] = freqs.get(lword, 0) + 1
        return freqs
    # Print the n most frequent words together with their counts
    def print_wordfreq(freqs, n=10):
        """Print the n most common words and counts in the freqs dict."""
        # sorted() returns a list, so this also works on Python 3, where zip() is lazy
        items = sorted(((count, word) for word, count in freqs.items()), reverse=True)
        for (count, word) in items[:n]:
            print(word, count)
    
    # Per-engine word count: takes a list of lines and returns a {word: count} dict
    def myword_freq(texts):
        freqs = {}
        for line in texts:  # named line rather than str to avoid shadowing the built-in
            for word in line.split():
                lword = word.lower()
                freqs[lword] = freqs.get(lword, 0) + 1
        return freqs
    # Parallel word count: scatter the lines across the engines, run myword_freq
    # on every engine, then merge the per-engine {word: count} dicts
    def myPwordfreq(view,lines):
        # Distribute the lines evenly across the engines
        view.scatter('texts',lines,flatten=True)
        ar=view.apply(myword_freq,Reference('texts'))
        freqs_list=ar.get()
        # Merge the per-engine results (the reduce step)
        word_set=set()
        for f in freqs_list:
            word_set.update(f.keys())
        freqs=dict(zip(word_set,repeat(0)))
        for f in freqs_list:
            for word,count in f.items():
                freqs[word]+=count
        return freqs
    
    if __name__ == '__main__':
        # Create a Client and View
        rc = Client()
    
        dview = rc[:]
        # Run the serial version
        print("Serial word frequency count:")
        with open('lines.txt') as f:
            text = f.read()
        tic = time.time()
        freqs = wordfreq(text)
        toc = time.time()
        print_wordfreq(freqs, 10)
        print("Took %.3f s to calculate"%(toc-tic))
        # The parallel version
        print("\nParallel word frequency count:")
        lines=text.splitlines()
        tic=time.time()
        pfreqs=myPwordfreq(dview,lines)
        toc=time.time()
        print_wordfreq(pfreqs)
        print("Took %.3f s to calculate"%(toc-tic))
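The merge loop in myPwordfreq and the sorting in print_wordfreq can also be written with collections.Counter from the standard library. This is an alternative sketch, not the listing above: merge_counts and top_words are hypothetical names, and words with equal counts may come out in a different order than print_wordfreq produces.

```python
from collections import Counter


def merge_counts(freqs_list):
    """Add up a list of {word: count} dicts, e.g. the per-engine results."""
    total = Counter()
    for freqs in freqs_list:
        total.update(freqs)  # update() adds counts instead of overwriting them
    return total


def top_words(freqs, n=10):
    """Return the n most frequent (word, count) pairs."""
    return Counter(freqs).most_common(n)


# Stand-in for the list returned by ar.get() in myPwordfreq
partials = [{"the": 2, "fox": 1}, {"the": 1, "dog": 1}]
merged = merge_counts(partials)
top = top_words(merged, 2)
```

A Counter is a dict subclass, so the merged result can be passed to print_wordfreq unchanged.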



  • Original article: https://www.cnblogs.com/zhoudayang/p/5086426.html