The example in this post is a homework assignment from the Coursera course Web Intelligence and Big Data.
Problem description:
Download data files bundled as a .zip file from hw3data.zip
Each file in this archive contains entries that look like:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.
that represent bibliographic information about publications, formatted as follows:
paper-id:::author1::author2::…. ::authorN:::title
Your task is to compute how many times every term occurs across titles, for each author.
For example, for the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.
Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a", can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only letters and numbers can be part of a title term. Thus, “program” and “program.” should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.
The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python.
These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively.
I strongly recommend mincemeat.py, which is much faster than octo.py, even though the latter was covered first in the lecture video as an example. Both are very similar.
Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author.
Note: There is no need to submit the code; I assume you will experiment using octo.py to learn how to program using map-reduce. Of course, you can always write a serial program for the task at hand, but then you won’t learn anything about map-reduce.
Lastly, please note that octo.py is a rather inefficient implementation of map-reduce. Some of you might want to delve into the code to figure out exactly why. At the same time, this inefficiency is likely to amplify any errors you make in formulating the map and reduce functions for the task at hand. So if your code starts taking too long, say more than an hour to run, there is probably something wrong.
Several source files to be analyzed are provided; each line has the following form:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.
which represents:
paper-id:::author1::author2::…. ::authorN:::title
The task is to compute, for each author, how many times each term occurs across the titles of that author's papers. For example:
For the author Alberto Pettorossi the result is: program:3, transformation:2, transforming:2, using:2, programs:2, logic:2.
Note: a paper id can correspond to multiple authors, and each author corresponds to many terms. Terms must not include stop words; single-letter words are ignored, and hyphens are likewise dropped.
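Before the full program, here is a minimal sketch of how one input line can be split on ':::' and '::' and how a title can be normalized. normalize_title and the three-word stop list are hypothetical stand-ins (the real list lives in stopwords.py); also note that, following the assignment wording, this sketch splits a hyphenated word such as "Modula-2" into separate tokens, whereas the full script below keeps the cleaned word as a single token.

def normalize_title(title, stop_words):
    # keep letters and digits, turn '-' into a space, drop other punctuation,
    # then discard stop words and single-character tokens
    terms = []
    for raw in title.split():
        cleaned = ''.join(c if c.isalnum() else (' ' if c == '-' else '') for c in raw)
        for term in cleaned.lower().split():
            if len(term) > 1 and term not in stop_words:
                terms.append(term)
    return terms

line = ('journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::'
        'Programmer-Defined Control Abstractions in Modula-2.')
paper_id, authors, title = line.split(':::')
print(authors.split('::'))     # ['Michele Di Santo', 'Libero Nigro', 'Wilma Russo']
print(normalize_title(title, set(['in', 'of', 'the'])))
# ['programmer', 'defined', 'control', 'abstractions', 'modula']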
The full source code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob

import mincemeat

# adjust this path to wherever hw3data was unzipped
text_files = glob.glob('E:\\hw3data\\*')

def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

# the data source maps each file name to that file's full contents
source = dict((file_name, file_contents(file_name)) for file_name in text_files)

# setup map and reduce functions
def mapfn(key, value):
    # the stop word list is inlined because mincemeat ships mapfn to the workers
    # by itself; the function cannot rely on names defined outside of it
    stop_words = [
        'all', 'herself', 'should', 'to', 'only', 'under', 'do', 'weve', 'very',
        'cannot', 'werent', 'yourselves', 'him', 'did', 'these', 'she', 'havent',
        'where', 'whens', 'up', 'are', 'further', 'what', 'heres', 'above',
        'between', 'youll', 'we', 'here', 'hers', 'both', 'my', 'ill', 'against',
        'arent', 'thats', 'from', 'would', 'been', 'whos', 'whom', 'themselves',
        'until', 'more', 'an', 'those', 'me', 'myself', 'theyve', 'this', 'while',
        'theirs', 'didnt', 'theres', 'ive', 'is', 'it', 'cant', 'itself', 'im',
        'in', 'id', 'if', 'same', 'how', 'shouldnt', 'after', 'such', 'wheres',
        'hows', 'off', 'i', 'youre', 'well', 'so', 'the', 'yours', 'being',
        'over', 'isnt', 'through', 'during', 'hell', 'its', 'before', 'wed',
        'had', 'lets', 'has', 'ought', 'then', 'them', 'they', 'not', 'nor',
        'wont', 'theyre', 'each', 'shed', 'because', 'doing', 'some', 'shes',
        'our', 'ourselves', 'out', 'for', 'does', 'be', 'by', 'on', 'about',
        'wouldnt', 'of', 'could', 'youve', 'or', 'own', 'whats', 'dont', 'into',
        'youd', 'yourself', 'down', 'doesnt', 'theyd', 'couldnt', 'your', 'her',
        'hes', 'there', 'hed', 'their', 'too', 'was', 'himself', 'that', 'but',
        'hadnt', 'shant', 'with', 'than', 'he', 'whys', 'below', 'were', 'and',
        'his', 'wasnt', 'am', 'few', 'mustnt', 'as', 'shell', 'at', 'have',
        'any', 'again', 'hasnt', 'theyll', 'no', 'when', 'other', 'which',
        'you', 'who', 'most', 'ours ', 'why', 'having', 'once', 'a', '-', '.', ',']
    for line in value.splitlines():
        # each line looks like  paper-id:::author1::author2::...:::title
        word = line.split(':::')
        authors = word[1].split('::')
        title = word[2]
        for author in authors:
            for term in title.split():
                if term not in stop_words:
                    if term.isalnum():
                        yield author, term.lower()
                    elif len(term) > 1:
                        # the term contains punctuation: keep letters and digits,
                        # turn '-' into a space, and drop everything else
                        temp = ''
                        for ichar in term:
                            if ichar.isalpha() or ichar.isdigit():
                                temp += ichar
                            elif ichar == '-':
                                temp += ' '
                        yield author, temp.lower()

def reducefn(key, value):
    # key is an author, value is the list of all terms emitted for that author
    terms = value
    result = {}
    for term in terms:
        if term in result:
            result[term] += 1
        else:
            result[term] = 1
    return result

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")
#print results

# write one line per author:  author : term1:freq1,term2:freq2,...
result_file = open('hw3_result', 'w')
for result in results:
    result_file.write(result + ' : ')
    for term in results[result]:
        result_file.write(term + ':' + str(results[result][term]) + ',')
    result_file.write('\n')
result_file.close()
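The script above only starts the mincemeat server; the map and reduce work is done by worker processes that connect to it. Per the usage shown in mincemeat.py's documentation, a worker is started in another console with python mincemeat.py -p changeme localhost (replace localhost with the server's address to add workers on other machines). Once the job finishes and hw3_result has been written, the homework questions about an author's most frequent title terms can be answered with a small helper like the one below; top_terms is a hypothetical function that simply parses the "author : term:count,..." lines produced by the output loop above.

def top_terms(author_name, result_path='hw3_result', n=10):
    # find the author's line, parse the "term:count" pairs, return the n most frequent
    with open(result_path) as f:
        for line in f:
            author, sep, rest = line.partition(' : ')
            if not sep or author != author_name:
                continue
            counts = {}
            for pair in rest.strip().rstrip(',').split(','):
                term, _, freq = pair.rpartition(':')
                counts[term] = int(freq)
            return sorted(counts.items(), key=lambda kv: -kv[1])[:n]
    return []

print(top_terms('Alberto Pettorossi'))
# expected to start with something like [('program', 3), ('transformation', 2), ...]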
mapfn is the Map part: it reads the contents of the input files, extracts (author, term) pairs, and passes them on to the Reduce part. reducefn is the Reduce part: for each author it counts the frequency of every term.
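The shuffle step between the two functions is handled by mincemeat itself: every (author, term) pair emitted by mapfn is grouped by author before reducefn is called. Here is a quick serial check of that flow, assuming mapfn and reducefn from the script above are already defined in the session (no server or workers involved):

from collections import defaultdict

sample = ('journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::'
          'Programmer-Defined Control Abstractions in Modula-2.')

grouped = defaultdict(list)
for author, term in mapfn('sample', sample):
    grouped[author].append(term)          # mimic the group-by-key (shuffle) step

for author in grouped:
    print(author + ' -> ' + str(reducefn(author, grouped[author])))
# each author gets e.g. {'control': 1, 'abstractions': 1, 'programmer defined': 1, 'modula 2': 1}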
A screenshot of the program run for the course:
Attachments:
[1] Source documents to be processed: http://pan.baidu.com/share/link?shareid=475665&uk=1812733245