Problem statement:
Suppose the program is invoked as follows.
1.
./smp2.1 a1.txt a2.txt a3.txt a4.txt a5.txt a6.txt a7.txt
Each file (a0.txt, for example) holds a large number of unsorted int values. For each file,
start one thread to sort it.
The sorted files are stored as
a1.txt.sorted … a7.txt.sorted
2.
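Next comes the merge step: adjacent pairs of files are merged.
Each merge starts a new thread and produces one larger sorted file.
The assignment hints at using temp files so that no intermediate files are left behind. I planned to use C++ stream I/O,
but I do not know whether it has a matching temp-file facility that deletes the file automatically when it is closed.
For now the program creates the intermediate files and deletes them at the end, leaving a single sorted output file.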
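On the temp-file question: as far as I know, the standard fstream classes have no self-deleting temporary file, but the C library function std::tmpfile() from <cstdio> opens an unnamed file that is removed when it is closed. The sketch below only shows that round trip. The helper name RoundTripThroughTmpFile is made up for illustration and is not part of the assignment code. Because the file has no name, a later merge thread would have to receive the FILE* itself instead of a file name, so this would not drop into the design above unchanged.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helper: writes the values to an anonymous temporary file,
// reads them back, and returns what was read.
std::vector<int> RoundTripThroughTmpFile(const std::vector<int> &values)
{
    std::vector<int> result;
    std::FILE *tmp = std::tmpfile();   // opened for update, no visible name
    if (tmp == NULL)
        return result;                 // could not create a temporary file

    for (std::size_t i = 0; i < values.size(); ++i)
        std::fprintf(tmp, "%d\n", values[i]);

    std::rewind(tmp);                  // switch from writing to reading
    int v;
    while (std::fscanf(tmp, "%d", &v) == 1)
        result.push_back(v);

    std::fclose(tmp);                  // the temporary file is deleted here
    return result;
}
```

After fclose() returns, nothing is left in the directory, which is the behaviour the assignment hint seems to ask for.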
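Within each initial file (a0.txt, a1.txt, …) there are no duplicate numbers, but different files may
share numbers; the assignment requires removing duplicates so that each value is kept only once.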
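The assignment seems to contain an error: it says each number fits in an int, while the number of lines in a file fits in an unsigned int.
However, the provided gen.c, which generates the files, produces line counts that fit in an int, that is,
at most RAND_MAX + 1 = 2^31. The random numbers are drawn from the range 0 to 2^31 - 1.
Since the result must contain no duplicates, the line count of any file produced during merging never exceeds 2^31, i.e. 2G.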
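Could many threads working at once exhaust memory? The assignment suggests qsort for the first step, i.e. an in-memory sort. If there are
many input files, and therefore many threads running simultaneously, will memory run out?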
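For the merge stage the question can at least be bounded with simple arithmetic, because every merge thread holds three fixed-size buffers. The function below is only a back-of-the-envelope sketch; its name and parameters are made up for illustration. It does not cover the sort stage, where each thread loads a whole input file and the memory use depends on the file sizes.

```cpp
#include <cstddef>

// Hypothetical estimate: upper bound on buffer memory used by the first merge
// level. Every pair of input files gets one thread, and each thread holds
// two input buffers and one output buffer of buffer_len elements each.
unsigned long long MergeLevelBufferBytes(std::size_t num_files,
                                         std::size_t buffer_len,
                                         std::size_t elem_size)
{
    const unsigned long long threads = num_files / 2;   // an odd file waits
    const unsigned long long buffers_per_thread = 3;    // 2 inputs + 1 output
    return threads * buffers_per_thread * buffer_len * elem_size;
}
```

With seven files, buffers of 5 * 1024 * 1024 elements, and 4-byte ints, this gives 188743680 bytes, i.e. 180 MiB for the first merge level. The number grows linearly with the number of input files, so a very long command line would call for smaller buffers or a cap on concurrent threads.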
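The assignment also constrains the merge order: it follows the order of the input files, so a1.txt must be merged with a2.txt,
not the order in which threads finish (threads for small files finish first). Likewise tmp1 must be merged with tmp2.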
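Finally, because the files keep growing, the merge follows the external-sort approach with two input buffers and one output buffer.
The three buffers have the same size. When the output buffer is full, it is written to the output file. When an input buffer has been
consumed, more data is read into it and the merge continues. If the corresponding input file has no data left (that file is finished),
the remaining data (the output buffer, the other input buffer, and what is left in the other input file) is written to the output file.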
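The merge flow is controlled with two queues: one is the current queue, and the other is built up for the next round; the two are exchanged with swap.
Each step pops two files from the current queue, merges them, and pushes the name of the resulting file onto the other queue. When the current queue
has been processed, one level of merging is complete. Swap the queues and continue with the next level until only one file remains
in the current queue.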
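The scheduling can be looked at in isolation, without threads or file I/O. The sketch below only computes which files would be merged at which level. MergeStep and PlanMerges are hypothetical names introduced for this illustration; the temp0, temp1, … naming follows the program's own convention.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct MergeStep {               // hypothetical record of one planned merge
    std::string left, right, out;
    int level;
};

// Returns the merges in the order the two-queue scheme would issue them.
std::vector<MergeStep> PlanMerges(const std::vector<std::string> &files)
{
    std::queue<std::string> q[2];
    int now = 0, other = 1;
    for (std::size_t i = 0; i < files.size(); ++i)
        q[now].push(files[i]);

    std::vector<MergeStep> plan;
    int merge_times = 0;
    for (int level = 1; q[now].size() > 1; ++level) {
        while (q[now].size() >= 2) {
            MergeStep step;
            step.left = q[now].front();  q[now].pop();
            step.right = q[now].front(); q[now].pop();
            std::ostringstream name;
            name << "temp" << merge_times++;
            step.out = name.str();
            step.level = level;
            q[other].push(step.out);     // result takes part in the next level
            plan.push_back(step);
        }
        if (!q[now].empty()) {           // odd file out: carry it forward
            q[other].push(q[now].front());
            q[now].pop();
        }
        std::swap(now, other);           // the next level becomes current
    }
    return plan;
}
```

Each merge replaces two files with one, so n input files always need exactly n - 1 merges. For seven inputs the plan has six steps, and the unpaired seventh file is carried to the second level, where it is merged with the last temp file of the first level.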
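One problem showed up: the threads ran normally and the sorting was correct, but the file names in the final output were wrong.
It turned out to be a resource-management problem. In a single-threaded run there is certainly no problem, but with multiple threads it goes wrong.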
while(queue_files[now].size() != 1) {
    int l = 0;
    while(queue_files[now].size() >= 2) {
        file_name1 = queue_files[now].front();
        queue_files[now].pop();
        file_name2 = queue_files[now].front();
        queue_files[now].pop();
        string s;
        Int2String(s, merge_times);
        queue_files[other].push(string("temp")+s);
        args_thread.file_name1 = file_name1.c_str();
        args_thread.file_name2 = file_name2.c_str();
        args_thread.merge_times = merge_times++;
        pthread_create(&tid[l++], NULL, MergeNumOf2Files, (void *)&args_thread);
    }
    if (!queue_files[now].empty()) {
        file_name = queue_files[now].front();
        queue_files[now].pop();
        queue_files[other].push(file_name);
    }
    for (int i = 0; i < l; i++)
        pthread_join(tid[i], NULL);
    swap(now, other);
}
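Note file_name1 and file_name2: they own the string storage, and only a pointer to it is passed to the thread. The problem is that while one thread
is still running, the while loop moves on, and the storage previously held by file_name1 is in effect released (that is my understanding). It may then
be reused for other data while the earlier thread's char pointer still points at that address, and that is where things go wrong.
The thread then prints something like a3.txt.sortex, or an even stranger name.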
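One way to avoid the dangling pointer is to give every thread its own argument object that owns copies of the names, and to keep those objects alive until the thread has been joined. The sketch below shows only that ownership pattern. MergeArgs, RecordNames, and RunPairs are hypothetical names, and the thread body merely records the names it received instead of merging files.

```cpp
#include <cstddef>
#include <pthread.h>
#include <string>
#include <vector>

struct MergeArgs {               // hypothetical replacement for ArgSet
    std::string file_name1;      // owns its copy of the name
    std::string file_name2;
    std::string seen;            // filled in by the thread, for checking
};

void *RecordNames(void *p)       // stand-in for the real merge thread
{
    MergeArgs *args = static_cast<MergeArgs *>(p);
    args->seen = args->file_name1 + "+" + args->file_name2;
    return NULL;
}

// Starts one thread per adjacent pair and returns what each thread saw.
std::vector<std::string> RunPairs(const std::vector<std::string> &files)
{
    const std::size_t pairs = files.size() / 2;
    std::vector<MergeArgs> args(pairs);   // sized once, never reallocated
    std::vector<pthread_t> tid(pairs);

    for (std::size_t i = 0; i < pairs; ++i) {
        args[i].file_name1 = files[2 * i];
        args[i].file_name2 = files[2 * i + 1];
        pthread_create(&tid[i], NULL, RecordNames, &args[i]);
    }
    for (std::size_t i = 0; i < pairs; ++i)
        pthread_join(tid[i], NULL);       // args[i] outlives its thread

    std::vector<std::string> result;
    for (std::size_t i = 0; i < pairs; ++i)
        result.push_back(args[i].seen);
    return result;
}
```

The important details are that the vector is sized before any thread starts, so no element moves while a thread holds a pointer to it, and that no slot is written again until after the join.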
The original assignment reads as follows:
//one example
[Part 1] User enters the following file names at the command line:
./mp2.1 a1.txt a2.txt a3.txt a4.txt a5.txt a6.txt a7.txt
[Part 2] Sort the numbers contained in each of the files to form
the following new files, without altering the original files:
"a1.txt.sorted" is a sorted copy of "a1.txt"
"a2.txt.sorted" is a sorted copy of "a2.txt"
"a3.txt.sorted" is a sorted copy of "a3.txt"
"a4.txt.sorted" is a sorted copy of "a4.txt"
"a5.txt.sorted" is a sorted copy of "a5.txt"
"a6.txt.sorted" is a sorted copy of "a6.txt"
"a7.txt.sorted" is a sorted copy of "a7.txt"
And each thread prints the total number of integers encountered as
they terminate:
This worker thread writes XXXXX lines to "YYYYY".
Therefore your output will be similar to -
This worker thread writes 10 lines to "a7.txt.sorted".
This worker thread writes 20 lines to "a3.txt.sorted".
This worker thread writes 30 lines to "a2.txt.sorted".
This worker thread writes 40 lines to "a4.txt.sorted".
This worker thread writes 100000 lines to "a5.txt.sorted".
This worker thread writes 2000000 lines to "a6.txt.sorted".
This worker thread writes 30000000 lines to "a1.txt.sorted".
It's important to note that worker threads may exit in a different order than
the order they were created. This is because they're running as threads,
in parallel. Therefore, small files will sort very quickly while multi-million
line files may take a few seconds to sort.
[Part 3] Start merging the files, while still maintaining the sorted
order:
a1.txt.sorted a3.txt.sorted a5.txt.sorted a7.txt.sorted
v a2.txt.sorted v a4.txt.sorted v a6.txt.sorted v /* 7 threads to sort */
| v | v | v |
v - v | v | v
\---------/ \--------/ \--------/ | /* 3 threads to merge the six files
v v v v that can be merged at this layer */
temp1 temp2 temp3 |
v v v v
| | | |
\-------------------/ \---------------/ /* 2 threads to merge the four files
v v that can be merged at this layer */
temp4 temp5
v v
| |
\-------------------------------------/ /* 1 thread to merge the two files
v that can be merged at this layer */
sorted.txt
Each merge thread will display the name of the files merged and the name
of the new file created with the total number of lines in it. Therefore,
your output will look similar to -
Merged 100 lines and 1000 lines into 1050 lines.
Merged 10000 lines and 300 lines into 10300 lines.
Merged 10 lines and 800 lines into 801 lines.
Merged 1050 lines and 10300 lines into 10345 lines.
Merged 801 lines and 1 lines into 802 lines.
Merged 10345 lines and 802 lines into 11111 lines.
At the end of the execution of the program, your directory must
only contain the following NEWLY created files:
a1.txt.sorted, a2.txt.sorted, ..., a8.txt.sorted, and sorted.txt
As well as the unmodified original files:
a1.txt, a2.txt, ..., a8.txt
// Test run
allen:~/study/system_programming/uiuc_assignment/smp2$ ./mp2.1 a2.txt a3.txt a4.txt a5.txt a6.txt a7.txt
The worker thread writes 66797lines to a3.txt.sorted
The worker thread writes 114550lines to a4.txt.sorted
The worker thread writes 103026lines to a5.txt.sorted
The worker thread writes 150172lines to a2.txt.sorted
The worker thread writes 118376lines to a7.txt.sorted
The worker thread writes 174951lines to a6.txt.sorted
Merged file a2.txt.sorted 150172 lines and a3.txt.sorted 66797 lines into the file temp0 with 216967 lines!
Merged file a4.txt.sorted 114550 lines and a5.txt.sorted 103026 lines into the file temp1 with 217572 lines!
Merged file a6.txt.sorted 174951 lines and a7.txt.sorted 118376 lines into the file temp2 with 293318 lines!
Merged file temp0 216967 lines and temp1 217572 lines into the file temp3 with 434515 lines!
Merged file temp3 434515 lines and temp2 293318 lines into the file temp4 with 727791 lines!
allen:~/study/system_programming/uiuc_assignment/smp2$ ls
a0.txt2 a12.txt a17.txt a1.txt_bak a3.txt a5.txt.sorted a8.txt gen2.py mergea0a1.txt2 README.pdf tags
a0.txt2.sorted a13.txt a18.txt a1.txt.sorted a3.txt.sorted a6.txt a9.txt gen.c mergea3a6.txt smp2.zip
a0.txt.sorted a14.txt a19.txt a2.txt a4.txt a6.txt.sorted ge2.c gen.py mp2.1 sorted.txt
a10.txt a15.txt a1.txt2 a2.txt2.sorted a4.txt.sorted a7.txt gen Makefile mp2.1.c sortok.py
a11.txt a16.txt a1.txt2.sorted a2.txt.sorted a5.txt a7.txt.sorted gen2 mergea0a1.txt mp2.1.cc sortok_unique.py
allen:~/study/system_programming/uiuc_assignment/smp2$ wc -l a2.txt a3.txt a4.txt a5.txt a6.txt a7.txt
150172 a2.txt
66797 a3.txt
114550 a4.txt
103026 a5.txt
174951 a6.txt
118376 a7.txt
727872 total
allen:~/study/system_programming/uiuc_assignment/smp2$ ./sortok_unique.py sorted.txt
OK! The file is sorted without duplicate num
allen:~/study/system_programming/uiuc_assignment/smp2$ wc -l sorted.txt
727791 sorted.txt
/*
   CS 241, Spring 2009
*/

#include <stdio.h>    /* Standard buffered input/output */
#include <stdlib.h>   /* Standard library functions */
#include <string.h>   /* String operations */
#include <pthread.h>  /* Thread related functions */

#include <algorithm>  /* sort, copy, swap */
#include <iostream>
#include <fstream>
#include <sstream>
#include <iterator>
#include <vector>
#include <string>
#include <queue>

// Note: defined after <string> is included, so it cannot affect that header.
#define _GLIBCXX_FULLY_DYNAMIC_STRING
using namespace std;
// 5M ints = 20 MB per buffer. Each merge allocates three such vectors:
// two as input buffers and one as the output buffer.
const int MaxLen = 5 * 1024 * 1024;
//const int MaxLen = 5 * 1024; //small buffer for test
struct ArgSet {
    const char *file_name1;
    const char *file_name2;
    int merge_times;
    bool unique_merge;
};

struct LineNumInfo {
    int file1_line_num;
    int file2_line_num;
    int file_out_num;
};
/*
 * Step 1: sort a single file. All numbers in one file are assumed
 * to fit in memory at once.
 */
template <typename T>
void SortNumOfOneFile(const char *file_name)
{
    //get input data to vec
    ifstream data_file(file_name);
    istream_iterator<T> data_begin(data_file);
    istream_iterator<T> data_end;
    vector<T> vec(data_begin, data_end);
    data_file.close();

    //sort vec
    sort(vec.begin(), vec.end());

    //write result to file_name.sorted, first get out file name
    string out_file_name = string(file_name) + string(".sorted");
    ofstream out_file(out_file_name.c_str());
    copy(vec.begin(), vec.end(), ostream_iterator<T>(out_file, "\n"));
    out_file.close();

    cout << "The worker thread writes " << vec.size()
         << "lines to " << out_file_name << endl;
}

// pthread entry point: f is the input file name (a char *).
void *SortNumOfOneFile(void *f)
{
    char *file_name = (char *) f;
    SortNumOfOneFile<int>(file_name);
    return NULL;
}

/*
 * Step 2: merge the sorted files further; the files grow with every level.
 * This uses the external-sort merge: two fixed-size input buffers and one
 * output buffer. Numbers from the two files are read into the input buffers
 * and merged into the output buffer. When the output buffer is full it is
 * written to the output file; when an input buffer is empty, more numbers
 * are read from the corresponding input file.
 *
 * As the assignment requires, files are merged in the order of the input
 * files, which is not necessarily the order in which the previous level finished.
 */

// Reads up to num numbers from a file stream into vec
// and returns how many were read.
template <typename T>
int ReadFile(ifstream &data_file, int num, T &vec)
{
    int i = 0;
    typename T::value_type val;
    while (data_file >> val) {
        vec[i++] = val;
        if (i == num)
            break;
    }

    return i;
}

// Writes the first num elements of vec_out to out_file, one per line.
template <typename T>
void WriteToOutputBuffer(ofstream &out_file, const T &vec_out, int num)
{
    typedef typename T::value_type ValueType;
    copy(vec_out.begin(), vec_out.begin() + num, ostream_iterator<ValueType>(out_file, "\n"));
}

/*
 * vec1, vec2: input buffers
 * vec_out: output buffer
 * s1: start index in vec1; num1: number of valid values in vec1 (s2, num2 likewise)
 * cur_out: cursor into the output buffer
 * unique = true: drop duplicates (a value present in both input files is kept once)
 */
template <typename T>
void Merge2Vec(T &vec1, T &vec2, T &vec_out,
               int s1, int num1, int s2, int num2,
               int cur_out, ofstream &out_file,
               ifstream &data_file1, ifstream &data_file2,
               LineNumInfo &line_num_info,
               bool unique)
{
    int i = s1;
    int j = s2;
    int end1 = i + num1;
    int end2 = j + num2;
    int num;
    // The initial cur_out must be < MaxLen.
    // With duplicate removal, i and j may reach end1 and end2 at the same time.
    while (1) {
        // An input may be empty on entry, so compare only when both have data.
        if (i < end1 && j < end2) {
            if (!unique) {
                if (vec1[i] <= vec2[j])
                    vec_out[cur_out++] = vec1[i++];
                else
                    vec_out[cur_out++] = vec2[j++];
            } else { // drop duplicates
                if (vec1[i] < vec2[j]) {
                    vec_out[cur_out++] = vec1[i++];
                } else if (vec1[i] == vec2[j]) {
                    vec_out[cur_out++] = vec1[i++];
                    j++;
                } else {
                    vec_out[cur_out++] = vec2[j++];
                }
            }
        }

        if (cur_out == MaxLen) {
            //cout << "Output full!" << endl;
            WriteToOutputBuffer(out_file, vec_out, MaxLen);
            cur_out = 0;
            line_num_info.file_out_num += MaxLen;
        }

        if (i == end1) {
            // Either the last read of file 1 did not fill the buffer, so file 1 is
            // finished, or the buffer was full but nothing is left (num == 0).
            // Only buffer 2 and the rest of file 2 remain to be written.
            if (end1 != MaxLen || !(num = ReadFile(data_file1, MaxLen, vec1))) {
                while (j < end2 && cur_out < MaxLen) // copy what is left in buffer 2
                    vec_out[cur_out++] = vec2[j++];
                if (cur_out == MaxLen) { // output full: flush it; the rest of buffer 2 cannot fill it again
                    WriteToOutputBuffer(out_file, vec_out, MaxLen);
                    cur_out = 0;
                    line_num_info.file_out_num += MaxLen;
                    while (j < end2)
                        vec_out[cur_out++] = vec2[j++];
                }
                WriteToOutputBuffer(out_file, vec_out, cur_out);
                line_num_info.file_out_num += cur_out;

                if (end2 == MaxLen) { // copy the rest of file 2
                    while ((num = ReadFile(data_file2, MaxLen, vec_out))) {
                        WriteToOutputBuffer(out_file, vec_out, num);
                        line_num_info.file2_line_num += num; // these lines were read from file 2
                        line_num_info.file_out_num += num;
                    }
                }
                return;

            } else { // file 1 still has numbers: buffer 1 was refilled, keep merging
                i = 0;
                end1 = num;
                line_num_info.file1_line_num += num;
            }
        }

        if (j == end2) { // same handling as i == end1 above
            if (end2 != MaxLen || !(num = ReadFile(data_file2, MaxLen, vec2))) {
                while (i < end1 && cur_out < MaxLen)
                    vec_out[cur_out++] = vec1[i++];
                if (cur_out == MaxLen) {
                    WriteToOutputBuffer(out_file, vec_out, MaxLen);
                    cur_out = 0;
                    line_num_info.file_out_num += MaxLen;
                    while (i < end1)
                        vec_out[cur_out++] = vec1[i++];
                }
                WriteToOutputBuffer(out_file, vec_out, cur_out);
                line_num_info.file_out_num += cur_out;

                if (end1 == MaxLen) {
                    while ((num = ReadFile(data_file1, MaxLen, vec_out))) {
                        WriteToOutputBuffer(out_file, vec_out, num);
                        line_num_info.file1_line_num += num; // these lines were read from file 1
                        line_num_info.file_out_num += num;
                    }
                }
                return;
            } else { // file 2 still has data
                j = 0;
                end2 = num;
                line_num_info.file2_line_num += num;
            }
        }
    }
}

// Merges two input files; duplicate numbers are removed when unique == true.
// A single input file never contains duplicates.
// For easier debugging this was first written as a merge that keeps duplicates.
// The const is needed because callers pass a.c_str(), which is a const char *.
// While merging, the line counts of both input files and of the output file are recorded.
void MergeNumOf2Files(const char *file_name1, const char *file_name2, const char *file_name_out, bool unique = false)
{
    ifstream data_file1(file_name1);
    ifstream data_file2(file_name2);
    ofstream out_file(file_name_out);

    LineNumInfo line_num_info;


    //TODO: could many threads at once exhaust memory?
    vector<int> vec1(MaxLen);    // input buffer 1
    vector<int> vec2(MaxLen);    // input buffer 2
    vector<int> vec_out(MaxLen); // output buffer

    int num1 = ReadFile(data_file1, MaxLen, vec1);
    int num2 = ReadFile(data_file2, MaxLen, vec2);

    line_num_info.file1_line_num = num1;
    line_num_info.file2_line_num = num2;
    line_num_info.file_out_num = 0; //take care, not to forget


    Merge2Vec(vec1, vec2, vec_out,
              0, num1, 0, num2,
              0, out_file,
              data_file1, data_file2,
              line_num_info,
              unique);

    data_file1.close();
    data_file2.close();
    out_file.close();

    cout << "Merged file " << file_name1 << " " << line_num_info.file1_line_num
         << " lines and " << file_name2 << " " << line_num_info.file2_line_num
         << " lines into the file " << file_name_out << " with "
         << line_num_info.file_out_num << " lines!" << endl;
}

void Int2String(string &s, int input)
{
    std::stringstream ss;
    ss << input;
    ss >> s;
}
// pthread entry point: f points to the ArgSet for this merge.
void *MergeNumOf2Files(void * f)
{
    // g++ rejects the implicit conversion below:
    // error: invalid conversion from 'void*' to 'arg_set*'
    //arg_set *args = f;
    ArgSet *args = static_cast<ArgSet *>(f);

    string s;
    Int2String(s, args->merge_times);
    s = string("temp") + s;
    MergeNumOf2Files(args->file_name1, args->file_name2, s.c_str(), args->unique_merge);
    return NULL;
}

/* MAIN PROCEDURE SECTION */
int main(int argc, char **argv)
{
    if (argc == 1) {
        cout << "You should give at least one file" << endl;
        return -1;
    }
    // Step 1: start one thread per file to sort the numbers in it.
    // The result is stored in the corresponding .sorted file.
    vector<pthread_t> tid(argc - 1);
    queue<string> queue_files[2];
    int now = 0;
    int other = 1;

    string file_name;
    for (int i = 0; i < argc - 1; i++) {
        file_name = string(argv[i + 1]) + string(".sorted");
        queue_files[now].push(file_name);
        pthread_create(&tid[i], NULL, SortNumOfOneFile, (void *) argv[i + 1]);
    }

    for (int i = 0; i < argc - 1; i++)
        pthread_join(tid[i], NULL);


    // Next, start threads that merge two sorted files into a new temp file.
    // This repeats until a single sorted file holds the numbers of all files.
    // There may be several levels; all threads of one level must finish
    // before the next level starts.
    //TODO: the number of merges should not exceed argc - 1, the number of initial files. How to prove?
    int merge_times = 0;

    while (queue_files[now].size() != 1) {
        int l = 0;
        // One slot per thread. The names and arguments must stay alive and
        // unchanged until the threads of this level have been joined.
        vector<string> file_name1(argc - 1);
        vector<string> file_name2(argc - 1);
        vector<ArgSet> args_thread(argc - 1);
        while (queue_files[now].size() >= 2) {
            file_name1[l] = queue_files[now].front();
            queue_files[now].pop();
            file_name2[l] = queue_files[now].front();
            queue_files[now].pop();

            string s;
            Int2String(s, merge_times);
            queue_files[other].push(string("temp") + s);

            args_thread[l].file_name1 = file_name1[l].c_str();
            args_thread[l].file_name2 = file_name2[l].c_str();
            args_thread[l].merge_times = merge_times++;
            // true: a value found in both inputs is kept once, as the assignment
            // requires; false would keep duplicates.
            args_thread[l].unique_merge = true;
            // Each thread gets its own ArgSet. Sharing a single one is unsafe: the
            // next iteration could overwrite it before the previous thread has read it.
            pthread_create(&tid[l], NULL, MergeNumOf2Files, (void *)&args_thread[l]);
            l++;
        }
        if (!queue_files[now].empty()) {
            file_name = queue_files[now].front();
            queue_files[now].pop();
            queue_files[other].push(file_name);
        }

        for (int i = 0; i < l; i++)
            pthread_join(tid[i], NULL);

        swap(now, other);
    }

    // Delete all temp files and rename the last one to sorted.txt.
    // Alternatively, when a merge finishes and closes its handles, it could
    // check whether the file name contains "temp" and delete it then.
    for (int i = 0; i < merge_times; i++) {
        string s;
        Int2String(s, i);
        s = string("temp") + s;
        if (i != merge_times - 1)
            remove(s.c_str());
        else
            rename(s.c_str(), "sorted.txt");
    }



    return 0;
} /* end main() */