• Pair Assignment 2: Hot-Word Statistics for Paper Abstracts and Advanced Requirements




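    Format description

    • Course this assignment belongs to: Software Engineering Practice
    • Where the requirements are posted: Assignment requirements
    • Pair student IDs: 欧福源 221600431, 朱伟榜 221600441
    • Goal of this assignment: implement a console program that counts word frequencies in a text file and, building on it, implement a hot-word counter for top-conference papers.
    • Partner's blog: 朱伟榜 221600441
    • GitHub project: GitHub repository
    • GitHub commit history:

    - Division of work:
      - 221600431 欧福源
        - Requirements analysis
        - Writing the crawler
        - Code testing
        - Writing the blog post and getting familiar with GitHub operations
      - 221600441 朱伟榜
        - Requirements analysis
        - Main code implementation
        - Assisting with the blog post
        - Code testing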



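    PSP

    | PSP2.1 | Personal Software Process Stages | Estimated time (min) | Actual time (min) |
    | --- | --- | --- | --- |
    | Planning | Planning | | |
    | • Estimate | • Estimate how long this task will take | 610 | 630 |
    | Development | Development | | |
    | • Analysis | • Requirements analysis (including learning new technologies) | 70 | 90 |
    | • Design Spec | • Write the design document | 60 | 50 |
    | • Design Review | • Design review | 30 | 40 |
    | • Coding Standard | • Coding standard (define suitable standards for the current development) | 40 | 30 |
    | • Design | • Detailed design | 70 | 90 |
    | • Coding | • Coding | 310 | 300 |
    | • Code Review | • Code review | 30 | 30 |
    | • Test | • Testing (self-testing, fixing code, committing changes) | 40 | 70 |
    | Reporting | Reporting | | |
    | • Test Report | • Test report | 60 | 80 |
    | • Size Measurement | • Measure the workload | 30 | 40 |
    | • Postmortem & Process Improvement Plan | • Postmortem and process improvement plan | 40 | 55 |
    | Total | | 740 | 820 |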



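    Approach

    When we first received the assignment we were somewhat lost, since the requirements looked tedious, but after thinking them through carefully we found the task was not that hard. We found the material we needed through Baidu, the CSDN forums, and similar channels, and it helped us a great deal.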



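    Design and implementation

    Basic requirements

    Class diagram: [figure]

    The key function is getWords. The basic idea is to read the file line by line, match each line against a regular expression, and increment the counter for every matching word.
    Flowchart: [figure]
    The crux of this code is getting the details right.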
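
    The regex-based counting described above can be sketched as follows. This is a minimal illustration, not necessarily the exact code we submitted: the class name WordRegexSketch and the method countWords are hypothetical, and the sketch assumes a word is at least four letters followed by any letters or digits, with a lookbehind so that a token starting with digits (such as 123file) is not counted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class WordRegexSketch {
    // At least four letters, then any letters or digits; the lookbehind
    // rejects matches glued to a preceding letter or digit.
    private static final Pattern WORD =
            Pattern.compile("(?<![A-Za-z0-9])[A-Za-z]{4}[A-Za-z0-9]*");

    // Counts the words on a single line of text.
    static int countWords(String line) {
        Matcher m = WORD.matcher(line);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "file123", "good" and "morning" count; "123file" and "abc" do not.
        System.out.println(countWords("file123 123file abc good-morning"));
    }
}
```

    Compiling the pattern once, outside the per-line loop, avoids rebuilding it for every line.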

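    Crawler

    We used the Jsoup library to crawl the site. By inspecting the home page of the CVPR 2018 site, we found that every paper title belongs to one class, ptitle. So we first fetch the whole page with Jsoup.connect(url).get(), get the titles with getElementsByClass("ptitle"), then get each paper's link with attr("href") and use it to fetch the paper's page, then get the abstract with getElementById("abstract"), and finally write everything to result.txt.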

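    Advanced requirements

    On top of the basic requirements, the advanced requirements add custom input and output files, weighted word-frequency counting (phrases not implemented), a configurable number of output entries (phrases not implemented), and the ability to combine several parameters.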
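
    Weighted counting and a configurable number of output entries can be sketched as below. This is a simplified illustration under stated assumptions rather than our submitted program: the class WeightedCountSketch and its methods are hypothetical names, the title weight of 10 mirrors the value used when -w 1 is given, and the text is passed in directly instead of being read from a file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class WeightedCountSketch {
    static final int TITLE_WEIGHT = 10; // weight of a title word when weighting is on

    // A word must start with at least four lower-case letters.
    static boolean isWord(String s) {
        if (s.length() < 4) return false;
        for (int i = 0; i < 4; i++) {
            char ch = s.charAt(i);
            if (ch < 'a' || ch > 'z') return false;
        }
        return true;
    }

    // Adds every word in text to freq, each occurrence counting as weight.
    static void addWords(String text, int weight, Map<String, Integer> freq) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (isWord(token)) freq.merge(token, weight, Integer::sum);
        }
    }

    // Returns the n most frequent entries: count descending, then word ascending.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> freq, int n) {
        List<Map.Entry<String, Integer>> list = new ArrayList<>(freq.entrySet());
        list.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return list.subList(0, Math.min(n, list.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        addWords("Deep Learning for Vision", TITLE_WEIGHT, freq);   // title line
        addWords("Deep networks learn visual features", 1, freq);   // abstract line
        for (Map.Entry<String, Integer> e : top(freq, 3)) {
            System.out.println("<" + e.getKey() + ">: " + e.getValue());
        }
    }
}
```

    Adding the weight directly to the map avoids inserting the same word into a list ten times and counting it afterwards.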



    Improving program performance

    Time spent: 55 minutes
    Approach: optimized the algorithm.



    Code

    Basic requirements code

    import java.io.*;
    import java.util.*;
    import java.util.Map.Entry;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
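    // Orders entries by count (descending), then by word (ascending).
    class EntryComparator implements Comparator<Entry<String, Integer>> {
    @Override
    public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
        // Compare by value: == on Integer objects compares references and is
        // only reliable for small cached values.
        int byCount = Integer.compare(o2.getValue(), o1.getValue());
        return byCount != 0 ? byCount : o1.getKey().compareTo(o2.getKey());
    }
    }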
    
    public class Main {
    
    private static ArrayList<String> wordStrings =new ArrayList<String>();
    private static int count = 0;
    private static int lines = 0;
    private static int words = 0;
    
    
    
    // Count characters and non-blank lines
    private static void getCharacter(String filename)
    {
        int ch, ef = 0; // ef: non-whitespace characters seen on the current line
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), "utf-8"), 20 * 1024 * 1024)) {
            while ((ch = in.read()) != -1) {
                count++;
                if (ch == '\r') count--; // do not count the CR of a CRLF line ending
                if (ch != ' ' && ch != '\t' && ch != '\r' && ch != '\n') ef++;
                if (ch == '\n' && ef > 0) { // end of a line that has visible content
                    lines++;
                    ef = 0;
                }
            }
            if (ef > 0) { // last line has no trailing line break
                lines++;
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
    
    // Count words
    private static void getWords(String filename)throws IOException {
        // A word starts with at least four letters and may continue with letters or digits.
        // The lookbehind rejects matches glued to a preceding letter or digit,
        // so "file" in "123file" is not counted.
        Pattern pattern = Pattern.compile("(?<![A-Za-z0-9])[A-Za-z]{4}[A-Za-z0-9]*");
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher ma = pattern.matcher(line);
                while (ma.find()) {
                    words++;
                }
            }
        }
    }
    
    // Collect the 10 most frequent words and their counts
    private static void getMostWord(String filename)throws IOException {
        // A word starts with at least four letters and may continue with letters or digits.
        // The lookbehind rejects matches glued to a preceding letter or digit.
        Pattern pattern = Pattern.compile("(?<![a-z0-9])[a-z]{4}[a-z0-9]*");
        Map<String, Integer> map = new HashMap<String, Integer>();
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher ma = pattern.matcher(line.toLowerCase());
                while (ma.find()) {
                    map.merge(ma.group(), 1, Integer::sum);
                }
            }
        }

        List<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String,Integer>>(map.entrySet());
        Collections.sort(list,new EntryComparator());
        int i = 0;
        for (Entry<String, Integer> obj : list) {
            if (i > 9) break;
            wordStrings.add("<" + obj.getKey() + ">: " + obj.getValue() + "\n");
            ++i;
        }
    }
    
    
    // Write the character, word, and line counts, followed by the top words, to path
    private static void writers(String c,String w,String l,ArrayList<String>ws,String path) {
        try (Writer out = new FileWriter(new File(path))) {
            out.write(c);
            out.write(w);
            out.write(l);
            for (int i = 0; i < ws.size(); i++) out.write(ws.get(i));
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
    
    public static void main(String[] args) throws IOException {
        String path = "input3.txt";
        //String path = args[0];
        //long start = System.currentTimeMillis(); // start timing the code under test
        getCharacter(path);
        getWords(path);
        getMostWord(path);
        String c,w,l;
        c = "characters: "+count+"\n";
        w = "words: "+words+"\n";
        l = "lines: "+lines+"\n";
        writers(c, w, l, wordStrings, "result.txt");
        //long end = System.currentTimeMillis();
        //System.out.println("Elapsed time: "+(end-start)+"ms");
    }
    
    }
    

    Crawler code

    package 爬虫;
    
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;
    
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    public class 爬虫 {
     public static void main(String []args)
     {
        String url1="http://openaccess.thecvf.com/CVPR2018.py";
        // try-with-resources closes result.txt even if a request fails
        try (Writer out = new FileWriter(new File("result.txt")))
        {
            Connection connection = Jsoup.connect(url1);
            connection.maxBodySize(0); // 0 removes the limit on the response body size
            Document document1 = connection.get();
            Elements x = document1.getElementsByClass("ptitle"); // paper titles
            Elements links = document1.select("dt a");           // links to the paper pages
            for(int i=0;i<x.size();i++)
            {
                String n = i+"\n";
                String t="Title: "+x.get(i).text()+"\n";
                String url2="http://openaccess.thecvf.com/"+links.get(i).attr("href");
                Document document2 = Jsoup.connect(url2).get();
                Element y= document2.getElementById("abstract");
                // Guard against a page without an abstract element
                String a="Abstract: "+(y == null ? "" : y.text())+"\n\n\n";

                out.write(n);
                out.write(t);
                out.write(a);
            }
        }
        catch (IOException e)
        {
            System.out.println("Crawl failed: " + e.getMessage());
        }
     }
    }
    

    Advanced requirements code

    import java.io.*;
    import java.util.Map.Entry;
    import java.util.*;

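    // Sorting: by count (descending), then by word (ascending)
    class EntryComparator implements Comparator<Entry<String, Integer>> {
    @Override
    public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
        // Compare by value: == on Integer objects compares references and is
        // only reliable for small cached values.
        int byCount = Integer.compare(o2.getValue(), o1.getValue());
        return byCount != 0 ? byCount : o1.getKey().compareTo(o2.getKey());
    }
    }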
    
    public class WordCount {
    
    private static ArrayList<String> wordStrings =new ArrayList<String>();
    private static int count = 0;
    private static int lines = 0;
    private static int words = 0;
    
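    // Count characters and lines, considering only the "Title:" and "Abstract:" lines
    private static void getCharacter(String filename)
    {
        int ls = 0; // total number of lines read
        String chhString;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), "utf-8"), 20 * 1024 * 1024)) {
            while ((chhString = in.readLine()) != null) {
                ls++;
                if (chhString.startsWith("Abstract:") || chhString.startsWith("Title:")) {
                    // Drop non-ASCII characters before counting
                    chhString = chhString.replaceAll("[^\u0000-\u007f]", "");
                    count += chhString.length();
                    lines++;
                }
            }
            // Assumes the crawler's layout of 5 lines per paper (index, title, abstract,
            // two blank lines) and adds 2 line-break characters per paper.
            ls = ls / 5 * 2;
            count += ls;
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        // Subtract the labels "Title: " (7 characters) and "Abstract: " (10 characters),
        // 17 characters per paper; lines / 2 is the number of papers.
        count -= lines / 2 * 17;
    }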
    
    // A word must start with at least four lower-case letters
    private static boolean isWord(String s) {
        if (s.length() < 4) return false;
        for (int i = 0; i < 4; i++) {
            char ch = s.charAt(i);
            if (ch < 'a' || ch > 'z') return false;
        }
        return true;
    }
    
    // Count words
    private static void getWords(String filename )throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while((line = br.readLine()) != null) {
                // Non-ASCII characters need no separate removal: the split below
                // already treats every non-alphanumeric character as a separator.
                line = line.toLowerCase();
                String[] strings = line.split("[^a-z0-9]");
                // Start at index 1 to skip the leading token (the Title/Abstract label or the paper index)
                for(int i=1;i<strings.length;i++) {
                    if(isWord(strings[i])) words++;
                }
            }
        }catch (Exception e) {
            e.printStackTrace();
        }
    }
    
    // Collect the top n words and their counts; a title word counts 10 times when w is true
    private static void getMostWord(String filename,boolean w,int times)throws IOException {
        int titleWeight = w ? 10 : 1;
        Map<String, Integer> map = new HashMap<String, Integer>();
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while((line = br.readLine()) != null) {
                line = line.toLowerCase();
                int weight;
                if(line.startsWith("title:")) weight = titleWeight;
                else if(line.startsWith("abstract:")) weight = 1;
                else continue;
                // Every non-alphanumeric character, ASCII or not, acts as a separator
                String[] strings = line.split("[^a-z0-9]");
                // Start at index 1 to skip the title/abstract label
                for(int nu=1;nu<strings.length;nu++) {
                    if(isWord(strings[nu])) map.merge(strings[nu], weight, Integer::sum);
                }
            }
        }catch (Exception e) {
            e.printStackTrace();
        }

        List<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String,Integer>>(map.entrySet());
        Collections.sort(list,new EntryComparator());
        int ii = 0;
        for(Entry<String, Integer> obj : list) {
            if(ii>(times-1)) break;
            wordStrings.add("<"+obj.getKey()+">: " + obj.getValue()+"\n");
            ++ii;
        }
    }
    
    // Write the character, word, and line counts, followed by the top words, to path
    private static void writers(String c,String w,String l,ArrayList<String>ws,String path) {
        try (Writer out = new FileWriter(new File(path))) {
            out.write(c);
            out.write(w);
            out.write(l);
            for(int i=0;i<ws.size();i++)out.write(ws.get(i));
        }catch (Exception e) {
            e.printStackTrace();
        }
    }
    
    public static void main(String[] args) throws IOException  {
        //long start = System.currentTimeMillis(); // start timing the code under test
        String ifile = "";
        String ofile = "";
        String w;
        boolean b = false;
        int times = 10;

        // Options come in "-flag value" pairs; stop before a flag that has no value
        for(int ar=0;ar+1<args.length;ar=ar+2) {
            if("-i".equals(args[ar])) ifile = args[ar+1];
            if("-o".equals(args[ar])) ofile = args[ar+1];
            if("-w".equals(args[ar])) {
                w = args[ar+1];
                if(w.equals("1")) {
                    b = true;
                }
                else if(w.equals("0")) {
                    b = false;
                }
            }
            if("-n".equals(args[ar])) {
                times = Integer.parseInt(args[ar+1]);
            }
            /*if("-m".equals(args[ar])) {
                nl = Integer.valueOf(args[ar+1]).intValue();
            }*/
        }

        getCharacter(ifile);
        getWords(ifile);
        getMostWord(ifile, b, times);
        String c,ws,l;
        c = "characters: "+count+"\n";
        ws = "words: "+words+"\n";
        l = "lines: "+lines+"\n";
        writers(c, ws, l, wordStrings, ofile);
        //long end = System.currentTimeMillis();
        //System.out.println("Elapsed time: "+(end-start)+"ms");
    }
    
    }
    



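    Difficulties and how we solved them

    Difficulties

    The requirements were fairly vague and took a long time to understand, and some misreadings sent us down the wrong path. We were also unfamiliar with the Jsoup crawler library, and we made mistakes in the details.

    Solutions

    We discussed the problems with classmates, looked up tutorials online, and handled the details more carefully.



    Evaluation of teammates

    • 221600431 欧福源

      • Good grasp of details, but writes code a little slowly
    • 221600441 朱伟榜

      • Understands the requirements thoroughly, has a clear approach to the problem, and codes well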

  • Original post: https://www.cnblogs.com/ofy666/p/10519176.html