【Study】Counting and Sorting Large Files (Repost)

This post records what I learned from Chen Shuo's algorithm and code for the problem below.

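The problem is as follows:

There are 10 files of 1 GB each. Every line of every file holds a user query (generate the data randomly yourself), and queries may repeat in any file. Sort the queries by frequency.

(The point here is that the files are large, so ten 1 GB files or one 10 GB file work on the same principle.)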

Chen Shuo's code is here:

https://gist.github.com/4009225

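It is a beautiful piece of code; both the approach and the implementation are well worth reading.

【Solution】

The basic procedure: keep reading the input and accumulate preliminary counts in memory. When a memory limit is reached, write the counts out, sending each query to one of 10 files chosen by its hash value. Once all input has been read, sort the queries in each of those 10 files by count, then run a 10-way merge to produce the final result.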

shuffle

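Input files are passed on the command line and read line by line into a hash map that accumulates <query, count> as it goes. When the map reaches a set size (10*1000*1000 entries, chosen with memory capacity in mind), its contents are written out: each entry goes to file number hash(query) % 10 of the 10 files, which guarantees that identical queries always end up in the same file. This repeats until the input is exhausted. So if the input totals 10 GB, each shard file should be under 1 GB, provided the hash spreads queries fairly evenly (identical queries have already been partly merged), and each file can then be processed in memory on its own. Note that although identical queries are in the same file, they may still appear in several places in it, for example: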

query1 10
query2 5
query3 3
query1 3
query4 3
query2 7
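
The shuffle step can be sketched roughly as follows. This is a minimal illustration, not Chen Shuo's actual code: the names `shuffle` and `flush_counts` are hypothetical, in-memory string streams stand in for the 10 shard files, and the tab-separated record format is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: write every <query, count> pair to the shard chosen
// by hash(query) % shards.size(), then empty the map.
void flush_counts(std::unordered_map<std::string, int64_t>& counts,
                  std::vector<std::ostringstream>& shards) {
  assert(!shards.empty());
  std::hash<std::string> hasher;
  for (const auto& kv : counts) {
    std::size_t idx = hasher(kv.first) % shards.size();
    // Assumed record format: query, a tab, then its partial count.
    shards[idx] << kv.first << '\t' << kv.second << '\n';
  }
  counts.clear();
}

// Read queries line by line, count them in memory, and flush whenever the
// map holds max_entries distinct queries (a stand-in for the memory limit).
void shuffle(std::istream& in, std::vector<std::ostringstream>& shards,
             std::size_t max_entries) {
  std::unordered_map<std::string, int64_t> counts;
  std::string line;
  while (std::getline(in, line)) {
    ++counts[line];  // operator[] creates a missing count as 0
    if (counts.size() >= max_entries) {
      flush_counts(counts, shards);
    }
  }
  flush_counts(counts, shards);  // write whatever is left
}
```

Because the shard index depends only on the query text, every partial count for a given query lands in the same shard, even when it is flushed several times.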

reduce

Within each file, merge the entries for identical queries, then sort the queries by count.
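
A sketch of the reduce step for a single shard, under the same assumptions as before: `reduce_shard` is a hypothetical name, the shard is read from any input stream, and records are assumed to be tab-separated "query, count" lines.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: read "query<TAB>count" lines from one shard, merge
// duplicates, and return <count, query> pairs sorted by descending count.
std::vector<std::pair<int64_t, std::string>> reduce_shard(std::istream& shard) {
  std::unordered_map<std::string, int64_t> totals;
  std::string line;
  while (std::getline(shard, line)) {
    std::size_t tab = line.rfind('\t');
    assert(tab != std::string::npos);  // assumed record format
    totals[line.substr(0, tab)] += std::stoll(line.substr(tab + 1));
  }

  std::vector<std::pair<int64_t, std::string>> sorted;
  sorted.reserve(totals.size());
  for (const auto& kv : totals) {
    sorted.emplace_back(kv.second, kv.first);  // count first, so it drives the sort
  }
  // std::pair compares first, then second; greater<> gives descending order.
  std::sort(sorted.begin(), sorted.end(),
            std::greater<std::pair<int64_t, std::string>>());
  return sorted;
}
```

Fed the sample lines shown in the shuffle section, this would combine the two query1 entries into 13 and the two query2 entries into 12 before sorting.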

merge

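The 10 sorted files are merged into the final result. The merge uses a heap of 10 elements; compared with repeatedly merging files two at a time, every record is read only once, which greatly reduces the time spent reading files.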
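
The heap-based merge might look like the sketch below. It is an illustration only: `merge_sorted` and `Record` are hypothetical names, and in-memory vectors stand in for the sorted count files, which the real program would read sequentially.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<int64_t, std::string> Record;  // <count, query>

// Hypothetical k-way merge: every input is already sorted by descending count.
std::vector<Record> merge_sorted(const std::vector<std::vector<Record>>& inputs) {
  // Heap entry: the current record of one input plus that input's index.
  typedef std::pair<Record, std::size_t> Entry;
  std::priority_queue<Entry> heap;  // max-heap: largest count on top
  std::vector<std::size_t> next(inputs.size(), 0);

  // Seed the heap with the first record of every non-empty input.
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    if (!inputs[i].empty()) {
      heap.push(Entry(inputs[i][0], i));
      next[i] = 1;
    }
  }

  std::vector<Record> out;
  while (!heap.empty()) {
    Entry top = heap.top();
    heap.pop();
    // Counts in the output never increase.
    assert(out.empty() || out.back().first >= top.first.first);
    out.push_back(top.first);
    std::size_t i = top.second;
    if (next[i] < inputs[i].size()) {
      heap.push(Entry(inputs[i][next[i]], i));  // refill from the same input
      ++next[i];
    }
  }
  return out;
}
```

The heap never holds more than one record per input, so with 10 inputs each output record costs only a handful of comparisons.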

【Running】

The program runs only on Linux and needs boost. On Ubuntu, install boost first:

apt-get install libboost-dev

Then compile. The program uses C++0x features, so -std=c++0x is required:

g++ sort.cpp -o sort -std=c++0x

Before running, prepare the input data. Here it is generated randomly with Lua (https://gist.github.com/4045503):
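-- updated version, use a table thus no gc involved
local file = io.open("file.txt", "w")
local t = {}
for i = 1, 500000000 do
  local n = i % math.random(10000)
  -- one query per line (the trailing \n is an assumption; the reposted text shows a space)
  local str = string.format("This is a number %d\n", n)
  table.insert(t, str)
  -- flush every 10000 lines so the buffer table stays small
  if i % 10000 == 0 then
    file:write(table.concat(t))
    t = {}
  end
end
file:close()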

OK, now run it:

sort file.txt
The result:

$ time sort file.txt
processing file.txt
shuffling done
reading shard-00000-of-00010
writing count-00000-of-00010
reading shard-00001-of-00010
writing count-00001-of-00010
reading shard-00002-of-00010
writing count-00002-of-00010
reading shard-00003-of-00010
writing count-00003-of-00010
reading shard-00004-of-00010
writing count-00004-of-00010
reading shard-00005-of-00010
writing count-00005-of-00010
reading shard-00006-of-00010
writing count-00006-of-00010
reading shard-00007-of-00010
writing count-00007-of-00010
reading shard-00008-of-00010
writing count-00008-of-00010
reading shard-00009-of-00010
writing count-00009-of-00010
reducing done
merging done

real 19m18.805s
user 14m20.726s
sys 1m37.758s

On my 32-bit Ubuntu 11.10 virtual machine, allocated 1 GB of memory and one 2.5 GHz CPU core, processing a 15 GB file took about 19 minutes.

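【Lessons】
- Distributing queries to different files by hash value guarantees that identical queries land in the same file. Beautiful.
- The 10-way merge is done with a max (or min) heap, which cuts down on file reads and writes. Beautiful.
- Small classes such as LocalSink, Shuffler, and Source encapsulate and decouple specific tasks; the structure is very clean.
- Some things I was not familiar with:
  
  __gnu_cxx::__sso_string: GNU's short string optimization (the source post links to further details)
  
  boost::function, boost::bind
  
  With map's operator[], a missing key is inserted with a value-initialized value; for int that is 0
  
  The C++0x range-based for: for (auto kv : queries)
  
  boost::noncopyable: classes that must not be copied inherit from it
  
  std::hash<string>(): constructs a hash functor for string
  
  boost::ptr_vector: boost provides a pointer version of each container, which is more efficient than a plain vector<shared_ptr<T>>
  
  unlink: deletes a file
  
  std::unordered_map<string, int64_t> queries(read_shard(i, nbuckets)): relies on move semantics; otherwise it would be very inefficient
  
  std::pair defines operator<, which compares the first element first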
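
A few of the standard-library points in this list can be tried out with a short sketch. The function names `make_counts`, `total_count`, and `demo` are hypothetical and only exist for illustration; the boost items are left out so the example needs nothing beyond the standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Returning the map by value is cheap: the result is moved (or the copy is
// elided) rather than copied element by element.
std::unordered_map<std::string, int64_t> make_counts() {
  std::unordered_map<std::string, int64_t> counts;
  ++counts["query1"];  // operator[] creates the entry as 0, then it becomes 1
  ++counts["query1"];
  ++counts["query2"];
  return counts;
}

// Range-based for over the <query, count> entries.
int64_t total_count(const std::unordered_map<std::string, int64_t>& counts) {
  int64_t total = 0;
  for (const auto& kv : counts) {
    total += kv.second;
  }
  return total;
}

void demo() {
  std::unordered_map<std::string, int64_t> counts(make_counts());
  assert(counts.at("query1") == 2);
  assert(total_count(counts) == 3);

  typedef std::pair<int64_t, std::string> Record;
  assert(Record(3, "zzz") < Record(5, "aaa"));  // the count decides first
  assert(Record(3, "aaa") < Record(3, "bbb"));  // the query breaks the tie
}
```

Putting the count first in the pair is what lets a plain sort or heap order records by frequency without a custom comparator.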
Original post: https://www.cnblogs.com/helloWaston/p/4545660.html
