• Individual Assignment Project Report (3): Output Results and Test-Case Results (with Code)


    Code call graph

    Legend for the code call graph (thanks to 刘泽@kfk for the powerful features of VS2015 Enterprise):

     

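    Output results

    The TA's reference results were produced by running the code on Windows, so I mainly show the Windows results here. The Linux results are left for later analysis.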

     

    My results

    The number of character is:173669785
    The number of line is:2278666
    The number of word is:16639829
    The top 10 words is:
    HAVE    107383
    WITH    158745
    CLASS   192004
    THIS    152454
    THEY    145945
    SPAN    116118
    THAT    259186
    SAID    208861
    HARRY   184732
    SPAN CLASS  62861
    REFERENCE INTERNAL  26668
    SPAN SPAN   41286
    HREF LEAP   22569
    THAT GOOD   61427
    INTERNAL HREF   26668
    SAID HARRY  24981
    CLASS SPAN  23146
    SAID HERMIONE   19193
    CLASS REFERENCE 31289
    Use Time :36
    
     

    The TA's results

    char_number :173654417
    line_number :2278666
    word_number :16629955
    
    the top ten frequency of word :
    THAT    259186
    SAID    208861
    CLASS   192004
    HARRY   184732
    WITH    158745
    THIS    152454
    THEY    145945
    Span    116118
    HAVE    107383
    FROM    105494
    
    
    the top ten frequency of phrase :
    Span CLASS  62861
    THAT GOOD   61427
    Span Span   41286
    CLASS Reference 31289
    Reference INTERNAL  26668
    INTERNAL href   26668
    SAID HARRY  24981
    CLASS Span  23146
    href LEAP   22569
    SAID HERMIONE   19193
    
     

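    Comparison of results

    The line count matches exactly. The character count is off by 15,368 and the word count by 9,874, which is under 0.1% of the totals in both cases. The frequencies of the top-10 words and phrases that appear in both listings are also identical; my program just does not print them in sorted order.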
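
    The unsorted listing could be put in descending order before printing. The following is a minimal sketch, assuming the counts are already available in a `std::unordered_map<std::string, int>`; the helper name `top_k` and the alphabetical tie-break are assumptions for illustration, not the way the experiment code keeps its top-10 list.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: return the k most frequent entries, highest count first.
// Ties are broken alphabetically (an assumption, chosen for a deterministic order).
std::vector<std::pair<std::string, int>>
top_k(const std::unordered_map<std::string, int>& counts, std::size_t k) {
    std::vector<std::pair<std::string, int>> items(counts.begin(), counts.end());
    k = std::min(k, items.size());
    // Only the first k positions need to be in order.
    std::partial_sort(items.begin(), items.begin() + k, items.end(),
                      [](const auto& a, const auto& b) {
                          if (a.second != b.second) return a.second > b.second;
                          return a.first < b.first;
                      });
    items.resize(k);
    return items;
}
```

    Calling `top_k(counts, 10)` and printing the pairs in order would give a listing in the same order as the TA's.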

     

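    Test results

    I created 10 test cases to exercise some special situations.
    Of course, the fact that they can be run at all shows that on both platforms the command-line argument can already be either a single file or a folder.
    The results and analysis follow: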

     

    1.pptx

    Test results:
    Linux platform:

    The number of character is:184816
    The number of line is:1565
    The number of word is:1718
    The top 10 words is:
    SLIDELAYOUTS    48
    RELSPK  38
    RELS    114
    SLIDES  48
    SLIDE5  48
    NOTESSLIDE10    40
    NOTESSLIDES 40
    XMLPK   46
    SLIDELAYOUT3    48
    STEVT   36
    SLIDES SLIDE2   24
    NOTESSLIDES NOTESSLIDE5 20
    SLIDELAYOUTS RELS   24
    SLIDELAYOUTS SLIDELAYOUT12  24
    SLIDES RELS 24
    RELS SLIDE5 24
    RELSPK SLIDELAYOUTS 13
    RELS NOTESSLIDE10   20
    NOTESSLIDES RELS    20
    RELS SLIDELAYOUT3   24
    Use Time :0
    

    Windows platform:

    The number of character is:52
    The number of line is:1
    The number of word is:2
    The top 10 words is:
    CONTENT 1
    TYPES   1
    CONTENT TYPES   1
    Use Time :13
    

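    An awkward-looking result. My guess is that the bytes are handled differently on the two platforms, and that on Windows the program read some value it took for -1 (end of file) in the middle of the encoded Chinese characters or other binary data.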
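
    One plausible explanation, not verified against this exact file: a .pptx file is a ZIP archive full of arbitrary bytes, and a stream opened in text mode on Windows treats the byte 0x1A (Ctrl-Z) as the end of input. The experiment code opens files with `ios::in` only. The sketch below reads in binary mode so that every byte is seen on both platforms; `count_bytes` is a hypothetical helper, not part of the original program.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: count every byte in a file.
// std::ios::binary disables newline translation and, on Windows,
// the text-mode handling of 0x1A (Ctrl-Z) as end of input.
std::size_t count_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::size_t n = 0;
    char c;
    while (in.get(c)) ++n;   // stops only when the stream really runs out
    return n;
}
```

    If this explanation holds, opening the file with `ios::in | ios::binary` in the main loop should bring the Windows counts in line with the Linux ones.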

     

    1.txt

    File contents:

    Good123 AREE Good456 AREE214 yesh
    

    Characteristics: an ordinary file of English words

    Windows platform:

    The number of character is:33
    The number of line is:1
    The number of word is:5
    The top 10 words is:
    GOOD123 2
    AREE    2
    YESH    1
    GOOD123 AREE    2
    AREE GOOD456    1
    AREE214 YESH    1
    Use Time :4
    

    Linux platform:

    The number of character is:33
    The number of line is:1
    The number of word is:5
    The top 10 words is:
    GOOD123 2
    AREE    2
    YESH    1
    GOOD123 AREE    2
    AREE GOOD456    1
    AREE214 YESH    1
    Use Time :0
    

    PS: a txt file with no Chinese in it is a real blessing.
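
    The counts above (GOOD123 2, AREE 2) suggest that trailing digits are ignored when deciding whether two words are the same. Below is a minimal sketch of that rule as read from the output and from the `judge` function in the experiment code: a word needs at least four leading letters, may continue with letters and digits, and is identified by its upper-cased form without trailing digits. `word_keys` is a hypothetical helper, and the rule itself is an inference rather than a stated specification.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split text into word keys.
// Assumed rule: at least 4 leading ASCII letters, then letters or digits;
// the key is the upper-cased word with trailing digits removed,
// so "Good123" and "Good456" share the key "GOOD".
std::vector<std::string> word_keys(const std::string& text) {
    std::vector<std::string> keys;
    std::string token;
    auto flush = [&]() {
        if (token.size() >= 4) {
            std::string key = token;
            while (!key.empty() && key.back() >= '0' && key.back() <= '9') key.pop_back();
            keys.push_back(key);
        }
        token.clear();
    };
    for (char c : text) {
        if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
        if (c >= 'A' && c <= 'Z') {
            token.push_back(c);
        } else if (c >= '0' && c <= '9') {
            if (token.size() >= 4) token.push_back(c);  // digit after 4+ characters: keep it
            else token.clear();                          // digit too early: start over
        } else {
            flush();                                     // any other byte ends the word
        }
    }
    flush();  // the text may end in the middle of a word
    return keys;
}
```

    Applied to the contents of 1.txt, this yields five keys (GOOD, AREE, GOOD, AREE, YESH), which agrees with the reported word count of 5.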

     

    2.txt

    File characteristics: an empty file, to see what happens

    Windows platform:

    The number of character is:0
    The number of line is:0
    The number of word is:0
    The top 10 words is:
    Use Time :0
    

    Linux platform:

    The number of character is:0
    The number of line is:0
    The number of word is:0
    The top 10 words is:
    Use Time :0
    

    Analysis: both platforms handle an empty file correctly.
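
    The zeros follow from the counting rule in the experiment code: only printable ASCII (codes 32 to 126) counts as a character, and a non-empty file has one line plus one more per newline. A minimal sketch of that rule on an in-memory string follows; the `Counts` struct and `count_chars_and_lines` are hypothetical names introduced for illustration.

```cpp
#include <string>

struct Counts {
    long characters;
    long lines;
};

// Hypothetical helper mirroring the counting rule in the experiment code:
// printable ASCII (32..126) counts as a character; a non-empty input has
// one line plus one more for each '\n'.
Counts count_chars_and_lines(const std::string& data) {
    Counts result{0, 0};
    if (data.empty()) return result;   // empty file: 0 characters, 0 lines
    result.lines = 1;
    for (char c : data) {
        if (c >= 32 && c <= 126) ++result.characters;
        if (c == '\n') ++result.lines;
    }
    return result;
}
```

    For the single line in 1.txt this gives 33 characters and 1 line, the same as the reported output.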

     

    8.txt

    File contents:

    翘课购房款签到表来开会多花费了可分类尽快不考虑呢情况你嚄很浪费和库布里克冰冷的空气不卡的我翻了翻和
    
    发
    asf
    efwf
     2过3
    跟4gfdgdsvsdvv去污粉3
    跟65和K(……》《……》&。……*(。?&
    ¥JH/qQs>kFAHRH3%$t^%u5jREB zx.
    V 
    GF
    
    F
     G3H4hg%wg?
    5.J58
    J4.5Fz
    S
    u
    $ 
    

    Characteristics: 19 lines in total, with many Chinese characters.

    Windows platform:

    The number of character is:90
    The number of line is:19
    The number of word is:4
    The top 10 words is:
    EFWF    1
    GFDGDSVSDVV 1
    KFAHRH3 1
    JREB    1
    EFWF GFDGDSVSDVV    1
    GFDGDSVSDVV KFAHRH3 1
    KFAHRH3 JREB    1
    Use Time :0
    

    Linux platform:

    The number of character is:90
    The number of line is:19
    The number of word is:4
    The top 10 words is:
    EFWF    1
    GFDGDSVSDVV 1
    KFAHRH3 1
    JREB    1
    EFWF GFDGDSVSDVV    1
    GFDGDSVSDVV KFAHRH3 1
    KFAHRH3 JREB    1
    Use Time :0
    

    Surprisingly, Chinese text is not much of a problem either, although the extracted "words" are pretty much junk...

     

    1.pdf

    Linux platform:

    The number of character is:1174015
    The number of line is:16897
    The number of word is:15033
    The top 10 words is:
    TYPE    566
    RECT    232
    ENDOBJ  993
    LENGTH  340
    STREAM  234
    LINK    232
    ANNOT   232
    FILTER  234
    SUBTYPE 431
    GOTO    233
    GOTO BORDER 232
    XOBJECT SUBTYPE 139
    RECT ENDOBJ 232
    ENDSTREAM ENDOBJ    198
    SUBTYPE LINK    232
    BORDER RECT 232
    ENDOBJ ENDOBJ   215
    TYPE ANNOT  232
    ENDOBJ TYPE 524
    FILTER FLATEDECODE  226
    Use Time :1
    

    Windows platform:

    The number of character is:859
    The number of line is:20
    The number of word is:28
    The top 10 words is:
    TYPE    1
    XOBJECT 2
    TEXT    1
    STREAM  1
    PROCSET 1
    IMAGEI  1
    FLATEDECODE 1
    FORMTYPE    1
    LENGTH  1
    PTEX    3
    TYPE XOBJECT    1
    XOBJECT SUBTYPE 1
    XOBJECT STREAM  1
    SHADING XOBJECT 1
    PROCSET TEXT    1
    IMAGEI SHADING  1
    FLATEDECODE FORMTYPE    1
    FORMTYPE LENGTH 1
    LENGTH PTEX 1
    PTEX FILENAME   1
    Use Time :0
    

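    Analysis: results like this are painful to look at... The way Linux handles the file's bytes seems to suit it better. Its answer at least looks more reasonable, and the scan did not stop early at something that was inexplicably taken for an end-of-file marker.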

     

    1.php

    File characteristics: a PHP file I wrote myself, picked at random, containing both Chinese and English

    Windows platform:

    The number of character is:1539
    The number of line is:45
    The number of word is:130
    The top 10 words is:
    INLINE  6
    HTML    3
    HREF    8
    GLASS   3
    ECHO    10
    BOOK    4
    INFO    4
    CLASS   6
    DISPLAY 6
    STYLE   11
    STYLE DISPLAY   6
    HEIGHT AUTO 2
    INFO HREF   3
    ECHO TABLE  2
    DISPLAY INLINE  6
    ADMIN STYLE 2
    STYLE HEIGHT    3
    GLASS STYLE 3
    SYSTEM EXCELSIOR    2
    CLASS GLASS 3
    Use Time :0
    

    Linux platform:

    The number of character is:1539
    The number of line is:45
    The number of word is:130
    The top 10 words is:
    INLINE  6
    HTML    3
    HREF    8
    GLASS   3
    ECHO    10
    BOOK    4
    INFO    4
    CLASS   6
    DISPLAY 6
    STYLE   11
    STYLE DISPLAY   6
    HEIGHT AUTO 2
    INFO HREF   3
    ECHO TABLE  2
    DISPLAY INLINE  6
    ADMIN STYLE 2
    STYLE HEIGHT    3
    GLASS STYLE 3
    SYSTEM EXCELSIOR    2
    CLASS GLASS 3
    Use Time :0
    

    Analysis: for any file that can be opened in a text editor on both platforms without garbled characters, the results are refreshingly consistent.

     

    1.css

    File characteristics: similar to 1.php, picked at random from a template library.

    Windows platform:

    The number of character is:6590
    The number of line is:320
    The number of word is:534
    The top 10 words is:
    COLOR   26
    TEXT    23
    BACKGROUND  22
    HSLA    33
    BORDER  27
    WIDTH   21
    SHADOW  22
    FONT    26
    GLASS   16
    MARGIN  20
    SHADOW HSLA 22
    RGBA TEXT   11
    FONT SIZE   13
    MARGIN AUTO 14
    SIZE FONT   13
    OVERFLOW HIDDEN 11
    FAMILY AVENIR   13
    TEXT SHADOW 11
    FONT FAMILY 13
    BORDER RADIUS   12
    Use Time :0
    

    Linux platform:

    The number of character is:6590
    The number of line is:320
    The number of word is:534
    The top 10 words is:
    COLOR   26
    TEXT    23
    BACKGROUND  22
    HSLA    33
    BORDER  27
    WIDTH   21
    SHADOW  22
    FONT    26
    GLASS   16
    MARGIN  20
    SHADOW HSLA 22
    RGBA TEXT   11
    FONT SIZE   13
    MARGIN AUTO 14
    SIZE FONT   13
    OVERFLOW HIDDEN 11
    FAMILY AVENIR   13
    TEXT SHADOW 11
    FONT FAMILY 13
    BORDER RADIUS   12
    Use Time :0
    

    Analysis: works well.

     

    jieshi.docx

    File characteristics: like the pptx, this is a file that shows up as garbled characters in a text editor.

    Linux platform:

    The number of character is:4791
    The number of line is:63
    The number of word is:68
    The top 10 words is:
    CONTENT 2
    TYPES   2
    RELS    6
    WORD    14
    DOCUMENT    4
    DOCPROPS    4
    CORE    2
    THEME   4
    STYLES  2
    XMLPK   9
    XMLPK WORD  5
    XMLPK DOCPROPS  2
    WORD STYLES 2
    RELS WORD   2
    WORD RELS   2
    WORD WEBSETTINGS    2
    WORD SETTINGS   2
    WORD DOCUMENT   2
    WORD FONTTABLE  2
    THEME THEME1    2
    Use Time :0
    

    Windows platform:

    The number of character is:96
    The number of line is:3
    The number of word is:2
    The top 10 words is:
    CONTENT 1
    TYPES   1
    CONTENT TYPES   1
    Use Time :0
    

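    Analysis: I had already given up hope for this kind of file. Without a way to decode it properly, it is no surprise that the two platforms give wildly different results.
    That said, the Linux result still looks somewhat better.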

     

    toefl folder

    File characteristics: a folder of TOEFL materials, containing one rar and four pdf files

    Linux platform:

    The number of character is:2671446
    The number of line is:37968
    The number of word is:22789
    The top 10 words is:
    TYPE    784
    ENDSTREAM   440
    LEFT    405
    STREAM  440
    GROUP   360
    RIGHT   405
    FONT    379
    LENGTH  453
    ENDOBJ  1011
    PAGE    342
    FILTER FLATEDECODE  273
    TYPE GROUP  180
    ENDOBJ TYPE 203
    TYPE PAGE   180
    ENDOBJ FILTER   217
    LENGTH STREAM   327
    PROCSET TEXT    180
    PARENT RESOURCES    180
    FLATEDECODE LENGTH  244
    ENDSTREAM ENDOBJ    440
    Use Time :0
    

    Windows platform:

    The number of character is:3121
    The number of line is:54
    The number of word is:141
    The top 10 words is:
    TYPE    13
    TRUE    3
    ENDOBJ  10
    DEVICERGB   3
    STREAM  4
    FLATEDECODE 4
    IMAGE23 4
    GROUP   6
    FILTER  4
    PAGES   6
    ENDOBJ TYPE 6
    LANG STRUCTTREEROOT 3
    TRUE ENDOBJ 3
    LENGTH STREAM   3
    MARKINFO MARKED 3
    PAGES LANG  3
    STRUCTTREEROOT MARKINFO 3
    TYPE CATALOG    3
    KIDS ENDOBJ 3
    FILTER FLATEDECODE  4
    Use Time :1
    

    Analysis: the Linux result clearly matches the folder's actual contents and size much better.

     

    11.txt

    File contents:

    File characteristics: written specifically to test how words and phrases are stored

    Linux platform:

    The number of character is:62
    The number of line is:1
    The number of word is:7
    The top 10 words is:
    TEST123 5
    TESTAFS 1
    TEST123TEST324  1
    TEST123 TEST3456    2
    TEST3456 TESTAFS    1
    TESTAFS TEST13  1
    TEST13 TEST123TEST324   1
    TEST123TEST324 TEST123  1
    Use Time :0
    

    Windows platform:

    The number of character is:62
    The number of line is:1
    The number of word is:7
    The top 10 words is:
    TEST123 5
    TESTAFS 1
    TEST123TEST324  1
    TEST123 TEST3456    2
    TEST3456 TESTAFS    1
    TESTAFS TEST13  1
    TEST13 TEST123TEST324   1
    TEST123TEST324 TEST123  1
    Use Time :0
    

    Analysis: no problems here.
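
    Phrase counting, as exercised by this test, treats every pair of consecutive words as one phrase. A minimal sketch follows, assuming the word keys have already been extracted into a vector; `count_phrases` is a hypothetical helper. It joins the two words with a space, whereas the experiment code concatenates them directly to build its map key, which could let two different pairs share one key.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: count each pair of consecutive words as one phrase.
// The key joins the two words with a space (an assumption), so that
// ("TEST", "ABCDE") and ("TESTA", "BCDE") cannot collide.
std::unordered_map<std::string, int>
count_phrases(const std::vector<std::string>& words) {
    std::unordered_map<std::string, int> phrases;
    for (std::size_t i = 0; i + 1 < words.size(); ++i) {
        ++phrases[words[i] + " " + words[i + 1]];
    }
    return phrases;
}
```

    For the five word keys of 1.txt (GOOD, AREE, GOOD, AREE, YESH) this produces three phrases with counts 2, 1 and 1, which matches the phrase lines in that test's output.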

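    Summary: I designed 10 test cases and found that on Windows the encoding of many files causes serious problems when reading them. For that reason I think the TA's choice of Linux (Ubuntu) as the final test platform is the better one.
    For files that can be encoded and decoded as ASCII and opened in a text editor or gedit, the two platforms give identical results, and the word and phrase test cases confirm this.

    Experiment code: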

     
        #include <fstream>
        #include <string>
        #include <vector>
        #include <sstream>
        #include <iostream>
        #include <stdio.h>
        #include <functional>
        #include <time.h>
        #include <unordered_map>
        
        
        
        
        using namespace std;
        
        
        
        class word_count;
        
        void getAllFiles(string path, vector<string>& files);
        int fin_to_s(string &str, vector<string> &files, int i);
        
        class word_classifier {
        public:
            string* str;
            //string* temp;//
            string* num_rear;
            int num;
            word_classifier();
            ~word_classifier();
            int judge(char c, word_count* word);//
                                                //void classify(char c);//
            void clear();
            void set(word_count* word);
        
        };
        
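        class word_count
        {
        public:
            string* str;      // display form: the word plus its trailing digits
            string* num;      // the trailing digits only
            int num_rear;     // length of the word without its trailing digits
            int str_count;    // number of occurrences
            int flag;         // 1 if the entry is currently in the top-10 list
            int size;         // total length of the word
            string* word;     // lookup key: the word without its trailing digits
        
            word_count* next_ptr;
            word_count();
            ~word_count();
        };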
        
        
        class phrase_count {
        public:
            word_count* phrase1;
            word_count* phrase2;
            int phr_count;//
            int flag;//
            phrase_count* next_ptr;
            phrase_count();
            ~phrase_count();
        
        
        };
        
        
        void word_to_word(word_count* word, word_count* word1);
        int freq_count(word_count* &arr1, word_count* temp, int flag);
        int compare(word_count* word, word_count* word1, int& flag);
        void phrase_to_phrase(phrase_count* phrase, phrase_count* phrase1);//
        int freq_countP(phrase_count* &arr1, phrase_count* temp, int flag);//
        int compareP(phrase_count* phrase, phrase_count* phrase1, int& flag1, int &flag2);//
        
        
        
        #ifdef _WIN32
        #include <io.h>
        #include <string.h>   // strcmp
        void getAllFiles(string path, vector<string>& files)
        {
        
            intptr_t hFile = 0;   // _findfirst returns intptr_t; long truncates it on 64-bit Windows
            struct _finddata_t fileinfo;
            string p;
            if ((hFile = _findfirst(p.assign(path).append("\\*").c_str(), &fileinfo)) != -1)
            {
                do
                {
                    if ((fileinfo.attrib &  _A_SUBDIR))
                    {
                        if (strcmp(fileinfo.name, ".") != 0 && strcmp(fileinfo.name, "..") != 0)
                        {
                            files.push_back(p.assign(path).append("\\").append(fileinfo.name));
                            getAllFiles(p.assign(path).append("\\").append(fileinfo.name), files);
                        }
                    }
                    else
                    {
                        files.push_back(p.assign(path).append("\\").append(fileinfo.name));
                    }
        
                } while (_findnext(hFile, &fileinfo) == 0);
        
                _findclose(hFile);
            }
        
        }
        #endif
        #ifdef __linux__
        #include <dirent.h>
        void getAllFiles(string path, vector<string>& files)
        {
            string name;
            DIR* dir = opendir(path.c_str());
            if (dir == NULL) return;   // not a directory, or not readable
            dirent* p = NULL;
            while ((p = readdir(dir)) != NULL)
            {
                if (p->d_name[0] != '.')
                {
                    string name = path + "/" + string(p->d_name);
                    files.push_back(name);
                    //cout << name << endl;
                    if (p->d_type == DT_DIR) {   // DT_DIR == 4 on Linux
                        getAllFiles(name, files);
                    }
                }
        
            }
            closedir(dir);
        
        }
        #endif
        
        
        int fin_to_s(string &str, vector<string> &files, int i) {
            ifstream infile;
            infile.open(files[i]);
            infile >> str;
            infile.close();
            return 0;
        }
        
        
        
        void word_to_word(word_count* word, word_count* word1) {
            //
            *(word1->str) = *(word->str);
        
            *(word1->num) = *(word->num);
        
            word1->num_rear = word->num_rear;
            //
            word1->str_count = word->str_count;
            //
            word1->flag = word->flag;
            word1->size = word->size;
            *(word1->word) = *(word->word);
            word1->next_ptr = word->next_ptr;
        
        }
        
        
        word_classifier::word_classifier() {
            //
            num = 0;
            str = new string();
            num_rear = new string();
            //temp = new string();
        
        }
        
        word_classifier::~word_classifier() {
            //
            delete str;
            delete num_rear;
        
        
        }
        
        void word_classifier::set(word_count* word) {
            //
            string stri;
            //word->temp = temp;
            stri = *str + *num_rear;
            *(word->str) = stri;
            *(word->word) = *str;
            word->str_count = 1;
            word->size = num;
            *(word->num) = *num_rear;// copy the digits; sharing the pointer would leave them empty after clear()
            word->num_rear = num - num_rear->size();//
            return;
        
        }
        void word_classifier::clear() {
            //
            num = 0;
        
            str->clear();
            num_rear->clear();
        
        }
        
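        int word_classifier::judge(char c, word_count* word) {
            // Feed one character to the state machine.
            // Returns 1 when a complete word has just been stored in *word,
            // 2 when c was appended to the word in progress, 0 otherwise.
        
            if (c >= 'a'&&c <= 'z') c = c - 32;// fold lower case to upper case
            if ((c >= '0'&&c <= '9') || (c <= 'Z'&&c >= 'A')) {
                // c is a letter or a digit
                if (c >= '0'&&c <= '9') {
                    // c is a digit
                    if (num < 4) {
                        // a digit among the first four characters: not a word, start over
                        clear();
                        return 0;
                    }
                    else {
                        // a digit after at least four characters: candidate trailing digit
                        num_rear->append(1, c);
        
                        num++;
                        return 2;
                    }
                }
                else {
                    // c is a letter
                    if (num_rear->empty()) {
        
                        str->append(1, c);
        
                        num++;
                        return 2;
                    }
                    else {
                        // a letter after digits: the digits were not trailing, so fold them into the word
                        str->append(*num_rear);
                        num++;// the digits were counted when they arrived; count only the new letter
                        str->append(1, c);
                        num_rear->clear();
                        return 2;
                    }
                }
            }
            else {
                // c is a separator
                if (num < 4) {
                    // fewer than four characters collected: discard
                    clear();
                    return 0;
                }
                else {
                    // a complete word: store it and reset
                    set(word);
                    clear();
                    return 1;
                }
            }
        
        }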
        
        word_count::word_count()
        {
            str = new string();
            num = new string();
            word = new string();
            next_ptr = NULL;
            size = 0;
            str_count = 0;
            num_rear = 0;
            flag = 0;
        }
        
        
        word_count::~word_count()
        {
            delete str;
        }
        
        bool operator==(const word_count& word1, const word_count& word2)
        {
        
            return (*(word1.word) == *(word2.word)) && (word1.num_rear == word2.num_rear);
        
        }
        
        int freq_count(word_count* &arr1, word_count* temp, int flag) {
        
            word_count* arr = arr1;
            if (arr == NULL) {
                //
                arr = new word_count();
                arr->str_count = -1;
            }
        
            if (arr->next_ptr == NULL)
            {
                //
                arr->next_ptr = temp;
                flag = 1;
                temp->flag = 1;
            }
            else
            {
                //
                if (temp->flag == 1) return 0;//
                word_count* parent = arr;
                word_count* change = NULL;
                arr = arr->next_ptr;
                int i = 0;//
                int t = temp->str_count;//
                int result = -100;//
                int flag_equal = -1;//
                                    //
        
                while (i < 10 && arr->next_ptr != NULL) {
                    //
                    if (t > arr->str_count) {
                        //)    *
                        flag_equal = -1;
                        if (change == NULL)
                            change = parent;//  **
                        else {
                            //**
                            int j = change->str_count - arr->str_count;
                            if (j>0) {
                                //change  **
                                change = parent;
        
                            }
                            else if (j == 0) {
                                //j
                                int k = compare(change, arr, flag);
                                if (k == -1) change = parent;
                            }
                            //else 
                        }
        
                    }//end if
                    else if (t == arr->str_count) {
                        //
                        if (change == NULL) {
                            //
                            result = compare(arr, temp, flag);
                            if (result == -1) {
                                //
                                change = parent;
                                flag_equal = 1;//
                            }//end if    
                             //result=0,
                        }
                        else {
                            //
                            if (flag_equal == 1) {
                                //
                                result = compare(change, temp, flag);
                                if (result == -1) {
                                    //
                                    change = parent;
                                }//end if    
                            }//end if
                             //flag_equal!=1,
        
                        }//end else
        
        
                    }//end else if
        
        
                    i++;
                    parent = arr;
                    arr = arr->next_ptr;
        
                }//end while
        
                if (i<10) {
                    //
                    arr->next_ptr = temp;
                    temp->flag = 1;
                }//end if
                else if (i == 10) {
                    //10
                    if (change != NULL) {
                        //change
                        temp->next_ptr = change->next_ptr->next_ptr;
                        change->next_ptr->flag = 0;
                        temp->flag = 1;
                        change->next_ptr = temp;
                    }//end if
                }//end else if
            }//end else
        
            return 0;// every path of a non-void function must return a value
        }//end freq_count
        
        int freq_countP(phrase_count* &arr1, phrase_count* temp, int flag) {
        
            phrase_count* arr = arr1;
            if (arr == NULL) {
        
                arr = new phrase_count();
                arr->phr_count = -1;
            }
        
            if (arr->next_ptr == NULL)
            {
        
                arr->next_ptr = temp;
                flag = 1;
                temp->flag = 1;
            }
            else
            {
        
                if (temp->flag == 1) return 0;
                phrase_count* parent = arr;
                phrase_count* change = NULL;
                arr = arr->next_ptr;
                int i = 0;//
                int t = temp->phr_count;//
                int para1, para2 = 0;//
                int result = -100;//
                int flag_equal = -1;
        
                while (i < 10 && arr->next_ptr != NULL) {
                    //
                    if (t > arr->phr_count) {
                        //    *
                        flag_equal = -1;
                        if (change == NULL)
                            change = parent;//  **
                        else {
                            //    **
                            int j = change->phr_count - arr->phr_count;
                            if (j>0) {
                                //  **
                                change = parent;
        
                            }
                            else if (j == 0) {
                                //
                                int k = compareP(change, arr, para1, para2);
                                if (k == -1) change = parent;
                            }
                            //else 
                        }
        
                    }//end if
                    else if (t == arr->phr_count) {
                        //
                        if (change == NULL) {
                            //
                            result = compareP(arr, temp, para1, para2);
                            if (result == -1) {
                                //
                                change = parent;
                                flag_equal = 1;//
                            }//end if    
                             //result=0,
                        }
                        else {
                            //
                            if (flag_equal == 1) {
                                //
                                result = compareP(change, temp, para1, para2);
                                if (result == -1) {
                                    //
                                    change = parent;
                                }//end if    
                            }//end if
        
                        }//end else
        
        
                    }//end else if
        
                     //
                    i++;
                    parent = arr;
                    arr = arr->next_ptr;
        
                }//end while
        
                if (i<10) {
                    //
                    arr->next_ptr = temp;
                    temp->flag = 1;
                }//end if
                else if (i == 10) {
        
                    if (change != NULL) {
                        //
                        temp->next_ptr = change->next_ptr->next_ptr;
                        change->next_ptr->flag = 0;
                        temp->flag = 1;
                        change->next_ptr = temp;
                    }//end if
                }//end else if
            }//endelse
        
            return 0;// every path of a non-void function must return a value
        }//end freq_countP
        
        phrase_count::phrase_count() {
            phrase1 = new word_count();
            phrase2 = new word_count();
            next_ptr = NULL;
            phr_count = 0;
        
        }
        
        phrase_count::~phrase_count() {
            delete phrase1;
            delete phrase2;
        
        }
        
        void phrase_to_phrase(phrase_count* phrase, phrase_count* phrase1) {
        
            phrase1->flag = phrase->flag;
            phrase1->next_ptr = phrase->next_ptr;
            phrase1->phr_count = phrase->phr_count;
            word_to_word((phrase->phrase1), (phrase1->phrase1));
            word_to_word((phrase->phrase2), (phrase1->phrase2));
        
        }
        
        int compare(word_count* word, word_count* word1, int& flag) {
            //
            int w0 = word->num_rear, w1 = word1->num_rear;

            flag = 0;
            string s0(*(word->str), 0, w0), s1(*(word1->str), 0, w1);
            if (s0 < s1)  return -1;//
            else if (s0 > s1) return 1;//
            else {
        
        
                // compare the trailing digits themselves, not the pointers to them
                if (*(word->num) < *(word1->num)) flag = -1;
                else if (*(word->num) > *(word1->num)) flag = 1;
        
                return 0;
            }
        }//end compare
        int compareP(phrase_count* phrase, phrase_count* phrase1, int& flag1, int &flag2) {
            //
            int s1, s2 = 0;
            s1 = compare(phrase->phrase1, phrase1->phrase1, flag1);
            if (s1<0) {
                return -1;
            }
            else if (s1 > 0) {
                return 1;
            }
            else {
                s2 = compare(phrase->phrase2, phrase1->phrase2, flag2);
                if (s2 < 0) {
                    return -1;
                }
                else if (s2 > 0) {
                    return 1;
                }
                else {
                    return 0;
                }
            }
        
        }//end compareP
        
        
        int main(int argc, char* argv[])
        {
            unordered_map<string, word_count> wordmap;//Hash table for word
            unordered_map<string, phrase_count> phrasemap;//hash table for phrase
            vector<string> files;
            string path;
            if (argv[1] == NULL) {
        
                path.append("C:/test/11.txt");
            }
            else {
                path.append(argv[1]);
        
            }
        
            time_t start, stop;
            start = time(NULL);
            if (path.find(".") != string::npos) {// if the path is a file path
                files.push_back(path);
        
            }
            else {
                getAllFiles(path, files);   //get all file paths  
        
            }
        
        
        
            int size = files.size();//number of files
            int con = 1;//parameter;
            int line_count = 0;//count of line
            int char_count = 0;//count of character
            int word_all_count = 0;//count of word
            word_classifier classifier_word;//char-word analyzer
            word_count* word_temp = NULL;
            word_count* word_temp1 = NULL;
            word_count* arr = new word_count();
            word_count* word_test = NULL;//using for pointing to word_count in H_table;
            string str_test;//used to store *(word_temp->word)
            int flag = 0; //flag of judging word
            int phrase_all_count = 0;//count of phrase
            phrase_count* phrase_temp = NULL;
            phrase_count* arrp = new phrase_count();
            phrase_count* phrase_test = NULL;//using for pointing to phrase_count in H_table;
            word_temp = new word_count();
            word_temp1 = new word_count();
            phrase_temp = new phrase_count();
        
        
        
            int phr_flag = 0;//phrase flag: 1 once a previous word is available to pair with
        
        
            char c = 0, optr = 0;//optr is a copy of c
        
            ifstream infile;//ptr of file
            for (int i = 0; i < size; i++) {
                //going to all files
                infile.open(files[i],ios::in);
                
                //judge if the path is a folder or document
                if (infile.fail()) {
                    continue;}// fail,meaning a folder
                else {
                    //get the length of file,and store in FileSize
                    int begin = infile.tellg();
                    int end = begin;
                    int FileSize = 0;
                    infile.seekg(0, ios_base::end);
                    end = infile.tellg();
                    infile.seekg(0, ios_base::beg);
                    FileSize = end - begin;
                    //end of getting file
                    if (FileSize != 0) {
                        line_count += 1;
                        for (int j = 0; j <= FileSize; j++) {
                            //operation in each File
                            //get a char and count
                            infile.get(c);
                            if (32 <= c&&c <= 126) char_count = char_count + 1;
                            if (c == '\n') line_count = line_count + 1;
                            optr = c;
                            c = 0;// clear c,avoiding mistakes                                        
                                  //end of counting char and line
                            flag = classifier_word.judge(optr, word_temp);
                            //judging if the word is ok
                            if (flag == 1) {
                                //get a word,
                                flag = 0;
                                //cout << *(word_temp->str) << endl;
                                word_all_count += 1;
                                str_test = *(word_temp->word);
                                if (wordmap.find(str_test) == wordmap.end()) {
                                    //word not seen before: create its entry
                                    word_to_word(word_temp, &wordmap[str_test]);
                                    word_test = &wordmap[str_test];
                                    freq_count(arr, word_test, 1);
                                }
                                else {
                                    //if exist
                                    wordmap[str_test].str_count++;
                                    word_test = &wordmap[str_test];
                                    //keep the smaller trailing suffix (num) for display
                                    con = (*(word_test->num) > *(word_temp->num));
                                    if (con == 1) {
                                        //word_temp has the smaller suffix: adopt it and rebuild str
                                        *(word_test->num) = *(word_temp->num);
                                        *(word_test->str) = *(word_test->word) + *(word_test->num);
        
                                    }
                                    freq_count(arr, word_test, 1);
        
                                }
                                if (phr_flag == 0) {
                                    phr_flag = 1;
                                    word_to_word(word_test, word_temp1);
                                }
                                else if (phr_flag == 1) {
        
                                    phrase_temp->phrase2 = word_temp;
                                    phrase_temp->phrase1 = word_temp1;
                                    str_test = *(word_temp1->word) + *(word_temp->word);
                                    if (phrasemap.find(str_test) == phrasemap.end()) {
                                        //phrase not seen before: create its entry
                                        phrase_to_phrase(phrase_temp, &phrasemap[str_test]);
                                        phrasemap[str_test].phr_count = 1;
                                        phrase_test = &phrasemap[str_test];
                                        freq_countP(arrp, phrase_test, 1);
                                        word_to_word(word_temp, word_temp1);
        
                                    }
                                    else {
                                        //phrase already counted: increment it
                                        phrasemap[str_test].phr_count++;
                                        phrase_test = &phrasemap[str_test];
                                        freq_countP(arrp, phrase_test, 1);
                                        word_to_word(word_temp, word_temp1);
        
                                    }
                                    //end-of-file marker: do not pair this word with the next file's first word
                                    if (optr == 0) phr_flag = 0;
                                }
        
                            }//end if (flag == 1)
        
        
                        }//end for 2
        
                        
                    }//end if
                    //always close and reset the stream, even for an empty file;
                    //open() on a stream that is still open would fail
                    infile.close();
                    infile.clear();
                }//end else
        
            }//end for1
            string dist = "Result.txt";
            ofstream ofn(dist);
            ofn << "The number of character is:" << char_count << endl;
            ofn << "The number of line is:" << line_count << endl;
            ofn << "The number of word is:" << word_all_count << endl;
            word_count* ptr_temp = arr;
            phrase_count* ptr_tempp = arrp;

            int i = 0;//number of entries printed so far
        
            ofn << "The top 10 words is:" << endl;
        
            while (ptr_temp->next_ptr != NULL && i < 10) {
                //print the top 10 words
                ptr_temp = ptr_temp->next_ptr;
                ofn << *(ptr_temp->str) + '\t';
                ofn << ptr_temp->str_count << endl;
                i++;
            }
            i = 0;
        
            while (ptr_tempp->next_ptr != NULL&&i<10) {
                //print the top 10 phrase
        
                ptr_tempp = ptr_tempp->next_ptr;
                ofn << *(ptr_tempp->phrase1->str) + ' ' + *(ptr_tempp->phrase2->str) + '\t';
                ofn << ptr_tempp->phr_count << endl;
                i++;
            }
            stop = time(NULL);
            ofn << "Use Time :" << stop - start << endl;
        
            infile.close();
            ofn.close();
        }//end main
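
The output loops above walk the frequency lists in whatever order they are stored, which is why my top 10 does not come out sorted by count. As a rough sketch of one way to get a descending top N, assuming the counts are available in a plain `std::unordered_map<std::string, long>` (the original program uses its own `word_count` list instead), something like the following could work. The helper name `top_n` is hypothetical and is not part of the original code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the original program: return the n most
// frequent entries of `counts`, highest count first. Ties are broken
// alphabetically so the output is deterministic.
std::vector<std::pair<std::string, long>> top_n(
        const std::unordered_map<std::string, long>& counts, std::size_t n) {
    // Copy the map entries into a vector so that they can be reordered.
    std::vector<std::pair<std::string, long>> items(counts.begin(), counts.end());
    n = std::min(n, items.size());

    // Only the first n positions need to end up sorted.
    std::partial_sort(
        items.begin(), items.begin() + static_cast<std::ptrdiff_t>(n), items.end(),
        [](const std::pair<std::string, long>& a,
           const std::pair<std::string, long>& b) {
            if (a.second != b.second) return a.second > b.second;
            return a.first < b.first;
        });

    items.resize(n);
    return items;
}
```

Calling `top_n(counts, 10)` and writing each `first` and `second` separated by `'\t'` would give the same layout as `Result.txt`, but in descending order. `std::partial_sort` is used here because only the first ten positions matter, so a full sort of every distinct word is not needed.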
  • Original article: https://www.cnblogs.com/ZucksLiu/p/8678520.html