Counting with awk

There are too many commands to remember them all, so let's combine a few in one exercise.
Sample file:

[root@lovedan test]# cat a.txt
hello
good
world
hello
hello
good
dandan
good
hello
world

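Scenario/analysis: find the 3 most frequent words in a.txt

Count the occurrences with awk
Rank them with the sort command
Take the first N lines of the result with the head command
The awk command
awk processes a file line by line: it reads each line in turn and runs the given commands on it.
For more on awk, see the documentation.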

[root@lovedan test]# awk '{sum[$1]+=1} END {for(k in sum) print k ":" sum[k]}' a.txt
hello:4
dandan:1
good:3
world:2

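Note: as the result shows, awk reads one line at a time and takes the word on it, which is $1. If the fields use another delimiter, set it with -F and pick the specific $n.
The command stores the counts in the array sum, keyed by the word on each line, and adds 1 for every line read. The END block runs once after all input has been read and prints the array contents in a loop.
Each word is printed with its count, so all that remains is to sort in descending order by the number after the colon. The order in which for (k in sum) visits the keys is unspecified, which is why the output above is not sorted.
OK, the words and counts are ready and only the sorting is left, so bring in the sort command.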
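
For readers more at home in Python, the counting logic of the awk one-liner can be sketched roughly as follows. This is only an illustrative equivalent, not part of the shell pipeline; the function name count_first_fields and the inline sample list (a copy of a.txt) are assumptions made for the example.

```python
def count_first_fields(lines):
    """Mimic awk '{sum[$1]+=1}': count the first whitespace-separated field of each line."""
    counts = {}
    for line in lines:
        fields = line.split()      # similar to awk's default field splitting
        if not fields:             # assumption: skip blank lines (awk would count an empty key)
            continue
        counts[fields[0]] = counts.get(fields[0], 0) + 1
    return counts


# Same content as a.txt, inlined so the sketch is self-contained
sample = ["hello", "good", "world", "hello", "hello",
          "good", "dandan", "good", "hello", "world"]

counts = count_first_fields(sample)
for word, n in counts.items():     # plays the role of END {for (k in sum) print k ":" sum[k]}
    print(f"{word}:{n}")
```

The dictionary plays the role of awk's associative array sum: the key is the word and the value is the running count.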

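The sort command

Format: sort [options] [file]

Option -n: sort numerically

Option -r: reverse the order

Option -t: field separator (the character used to split each line into columns)

Option -k: sort by the given field (the Nth column after splitting)

e.g. sort -n -r -k 2 -t ':' xx.txt sorts numerically (-n) in reverse order (-r), splits each line on colons (-t ':'), and uses the 2nd column after splitting as the sort key (-k 2)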
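
To make the roles of the four options concrete, here is a small Python sketch of roughly what that sort invocation does to word:count lines. The helper name sort_by_count_desc is hypothetical, and lines with equal counts may come out in a different order than with the real sort.

```python
def sort_by_count_desc(lines, sep=":"):
    """Approximate sort -n -r -t ':' -k 2 for lines of the form 'word:count'."""
    # split(sep)[1] picks the 2nd field (-t ':' -k 2), int() compares it
    # numerically (-n), and reverse=True gives descending order (-r)
    return sorted(lines, key=lambda line: int(line.split(sep)[1]), reverse=True)


unsorted_lines = ["hello:4", "dandan:1", "good:3", "world:2"]
sorted_lines = sort_by_count_desc(unsorted_lines)
print("\n".join(sorted_lines))
```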

Example result

[root@lovedan test]# awk '{sum[$1]+=1} END {for(k in sum) print k ":" sum[k]}' a.txt | sort -n -r -k 2 -t ':'
hello:4
good:3
world:2
dandan:1

OK, with the lines sorted, all that is left is to take the first few of them, so bring in the head command.

The head command

Format: head [options] [file]

Option -n <lines>: number of lines to show

Show the first 10 lines: head -n 10 xx.txt
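
As a rough Python analogue of head -n, itertools.islice stops after the first n items without consuming the rest of the input. The wrapper name head below is only for illustration.

```python
from itertools import islice


def head(lines, n=10):
    """Return the first n items of any iterable, like head -n."""
    return list(islice(lines, n))


first_three = head(["hello", "good", "world", "hello", "hello"], 3)
print(first_three)
```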

Example result

[root@lovedan test]# head -n 3 a.txt
hello
good
world

Final result

[root@lovedan test]# awk '{sum[$1]+=1} END {for(k in sum) print k ":" sum[k]}' a.txt | sort -n -r -k 2 -t ':' | head -n 3
hello:4
good:3
world:2

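The pipeline above may look more complicated than necessary, but awk is a powerful tool. The uniq command can do the same job, although it is sometimes too limited (real log data is rarely this simple).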

[root@lovedan test]# sort a.txt | uniq -c | sort -nr | head -n 3
      4 hello
      3 good
      2 world

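Apologies for drifting from the title. Everything above is small-scale, so consider the following.

Suppose an interview question goes: a file has 10 million lines with one word per line, and you need to find the 10 most frequent words.
What approaches do you have?

Count with the shell, as above

Read the file in a program and rank the top 10 (Python below)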

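# encoding=utf-8
from collections import defaultdict

# Map each word to the number of lines it appears on
words = defaultdict(int)
with open('/usr/local/test/a.txt') as f:
    for line in f:
        words[line.strip()] += 1

# Sort (word, count) pairs by count, highest first, and keep the top 10
ranked = sorted(words.items(), key=lambda item: item[1], reverse=True)
print(ranked[0:10])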
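
Sorting every distinct word just to keep ten of them does more work than strictly necessary. One possible variation, sketched here with the standard library's Counter and heapq.nlargest, selects only the top k entries. The function name top_k_words and the inline sample are assumptions made for the example, and blank lines are skipped, which the script above does not do.

```python
import heapq
from collections import Counter


def top_k_words(lines, k=10):
    """Count stripped, non-empty lines and return the k most frequent (word, count) pairs."""
    stripped = (line.strip() for line in lines)
    counts = Counter(word for word in stripped if word)
    # nlargest keeps only k candidates at a time instead of sorting everything
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


# Same content as a.txt, inlined so the sketch is self-contained
sample = ["hello", "good", "world", "hello", "hello",
          "good", "dandan", "good", "hello", "world"]
top3 = top_k_words(sample, 3)
print(top3)
```

For a real file, passing the open file object as lines works too, since the function only iterates over it once.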

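Now suppose the file grows to several GB with more than 100 million records, and the goal must be reached as fast and as efficiently as possible.
What do you answer when the interviewer asks? Answer: MapReduce, see 【传送门*大世界^_^】

What matters is mindset and perspective: divide and conquer, and cooperate wisely.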
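
The divide-and-conquer idea behind MapReduce can be illustrated in a single process. The sketch below is not a real MapReduce job: the function names, the chunk size, and the in-memory list are all assumptions for the example. In a real cluster the map phase would run on many machines in parallel.

```python
import heapq
from collections import Counter


def map_phase(chunk):
    """Map with local combine: count the words in one chunk of lines."""
    stripped = (line.strip() for line in chunk)
    return Counter(word for word in stripped if word)


def reduce_phase(partial_counts):
    """Reduce: merge the per-chunk counters into one global counter."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)      # Counter.update adds counts together
    return total


def top_k(lines, k=10, chunk_size=4):
    # Divide: split the input into fixed-size chunks (hypothetical size)
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Conquer: each chunk is counted independently, so this step could be parallel
    partials = [map_phase(chunk) for chunk in chunks]
    # Combine: merge the partial results and select the k largest counts
    total = reduce_phase(partials)
    return heapq.nlargest(k, total.items(), key=lambda item: item[1])


sample = ["hello", "good", "world", "hello", "hello",
          "good", "dandan", "good", "hello", "world"]
result = top_k(sample, 3)
print(result)
```

Because each chunk is counted without looking at any other chunk, the work can be spread across processes or machines, and only the small per-chunk counters need to be brought together at the end.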
