单词统计--续

本周对上周所做的单词统计进行加深练习，在上周基础上加出以下内容：

　　第1步：输出单个文件中的前 N 个最常出现的英语单词。

　　功能1：输出文件中所有不重复的单词，按照出现次数由多到少排列，出现次数同样多的，以字典序排列。

　　功能2：指定文件目录，对目录下每一个文件执行统计的操作。

　　功能3：指定文件目录，是会递归遍历目录下的所有子目录的文件进行统计单词的功能。

　　功能4：输出出现次数最多的前 n 个单词，

一、问题分析

和上周一样，我们要在上次代码的基础上加上以下内容：1）排序 2）统计前n次（n可以使随机输入的）3）指定文件目录

好，现在针对以下问题进行代码的修改

二、代码实现

首先，仍然是在读取到文件内容之后将文件内容存入到map中，在存储过程中仍然是要将单词按照规则进行分割，然后将分割后的单词存入到map中。

 1 HashMap<String, Integer> wordMap = new HashMap<String, Integer>();
 2 
 3         scan = new Scanner(System.in);
 4         System.out.println("请指定文件路径:");
 5         String f = scan.nextLine();
 6         File file = new File(f);
 7                 if (!file.exists()) {
 8             System.out.println("文件不存在");
 9             return;
10         }
11         BufferedReader br = new BufferedReader(new FileReader(file));
12 
13         StringBuilder sb = new StringBuilder();
14         String line = null;
15         while ((line = br.readLine()) != null) {
16             sb.append(line);
17         }
18         br.close();
19 
20         String words = sb.toString();// 全部的单词字符串
21         String target = words.replaceAll("\pP|\pS", "");// 将标点替换为空
22         String[] single = target.split(" ");

这里，我们使用另外一种解决非单词的方法， 1 String target = words.replaceAll("\pP|\pS", ""); ，其中，

小写 p 表示 Unicode 属性，用于 Unicode 正表达式的前缀
大写 P 表示 Unicode 字符集七个字符属性之一：标点字符
大写S表示符号（比如数学符号、货币符号等）；

然后，我们将单词与其出现的次数关联起来

1 for (int i = 0; i < single.length; i++) {
2             if (wordMap.get(single[i]) == null) {
3                 wordMap.put(single[i], 1);
4             } else {
5                 wordMap.put(single[i], wordMap.get(single[i]) + 1);
6             }
7         }

随后，将得到的内容在比较器中进行比较，并按值排序（比较器这里我也不是太懂的，感兴趣的同学可以学习一下，这里推荐一篇博客java如何对map进行排序详解(map集合的使用)）

1 List<Entry<String, Integer>> list = new ArrayList<Entry<String, Integer>>(wordMap.entrySet());
2         Collections.sort(list, new Comparator<Entry<String, Integer>>() {
3 
4             @Override
5             public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
6                 return o2.getValue() - o1.getValue();
7             }
8 
9         });

设置静态值n，将排序的结果进行输出

 1 for (Map.Entry<String, Integer> entry : list) {
 2             if (entry.getKey().equals("#")) {
 3                 continue;
 4             }
 5             System.out.println(entry.getKey() + ":" + entry.getValue());
 6             count++;
 7             if (count == n + 1) {
 8                 break;
 9             }
10         }

三、实验截图

但是，在一篇英语文章中，频率出现最高的单词一般都是 "a", "it", "the", "and", "this", 这些词，可以做一个停词表，在统计词汇的时候，跳过这些词。

这里，我们可以将这些单词储存在一个key的关键值数组中，当读取到他们的要将他们储存在map中的时候，将它们替换成固定的样式，如#等，然后在接下来读到他们的时候自动跳过。

例如以下操作：

建立停词表

1 String[] keys = { "you", "i", "he", "she", "me", "him", "her", "it", "they", "them", "we", "us", "your",
2                 "yours", "our", "his", "her", "its", "my", "in", "into", "on", "for", "out", "up", "down", "at", "to",
3                 "too", "with", "over", "from", "be", "been", "am", "is", "are",
4                 "was", "were",  "the", "of", "and", "a", "an", "that", "this", "be", "or", "as", "will",
5                 "would", "can", "could", "may", "must", "has", "have", "had", "than" };

读到替换

1 for (int i = 0; i < single.length; i++) {
2             for (String str : key) {
3                 if (str.equals(str)) {
4                     single[i] = "#";
5                 }
6             }
7         }

实验截图

总结：关于停词表在存储的时候会有问题，首先，在读取的时候有时候会出现错误，造成读取到的所有的单词都会被替换，因此会出现没有读取数据的情况。

相关阅读:
⛅剑指 Offer 11. 旋转数组的最小数字
 ✨Shell脚本实现Base64 加密解密
 Linux配置Nginx
378. Kth Smallest Element in a Sorted Matrix
875. Koko Eating Bananas
278. First Bad Version
704. Binary Search
69. Sqrt(x)
LeetCode 110 判断平衡二叉树
 LeetCode 43 字符串相乘
原文地址：https://www.cnblogs.com/yandashan666/p/11062289.html