awk, python, perl text-processing performance comparison (repost)

Reposted. The results are not necessarily correct, and the benchmark design is not necessarily rigorous.

The following three files are scripts written in python, awk and perl, in that order. All three do the same thing:

diff.sh f1 f2

The first field (space-separated) of each line in f1 and f2 is the key. If the key of a line in f2 does not exist in f1, that line of f2 is output.

For example:

The contents of a.dat are

1 a
2 a

The contents of b.dat are

1 b
3 b

Then diff.sh a.dat b.dat outputs

3 b
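
The rule above is an "anti-join" on the first field, and the a.dat / b.dat example can be reproduced with a small sketch. This is only an illustration in Python 3, not one of the benchmarked scripts: the function names `keys_of` and `missing_lines` are made up here, a set is used in place of a dict, blank lines are skipped, and `split()` splits on any whitespace rather than only spaces.

```python
def keys_of(lines):
    """Collect the first whitespace-separated field of each non-empty line."""
    seen = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines instead of raising IndexError
            seen.add(fields[0])
    return seen


def missing_lines(lines1, lines2):
    """Yield the lines of lines2 whose key does not appear in lines1."""
    known = keys_of(lines1)
    for line in lines2:
        fields = line.split()
        if fields and fields[0] not in known:
            yield line


if __name__ == "__main__":
    # In-memory stand-ins for a.dat and b.dat from the example.
    a_dat = ["1 a\n", "2 a\n"]
    b_dat = ["1 b\n", "3 b\n"]
    for line in missing_lines(a_dat, b_dat):
        print(line, end="")
```

Both arguments can be any iterable of lines, so open file objects can be passed in place of the lists.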

Code:

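#!/usr/bin/python
import sys

# Expect exactly two arguments: file1 and file2.
if len(sys.argv) != 3:
    print("Usage: " + sys.argv[0] + " file1 file2")
    sys.exit(-1)

file1 = sys.argv[1]
file2 = sys.argv[2]

# Collect the first field of every line in file1 as a dict key.
list1 = {}
for line in open(file1):
    list1[line.split()[0]] = 1

# Print each line of file2 whose key is not in file1.
for line in open(file2):
    key = line.split()[0]
    if key not in list1:
        sys.stdout.write(line)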

#!/bin/bash
# Uses bash syntax ([[ ]], function) and the gawk variable ARGIND, so it needs bash and gawk.

if [[ $# -lt 2 ]]; then
    echo "Usage: $0 file1 file2"
    exit 1
fi

function do_diff()
{
    if [[ $# -lt 2 ]]; then
        echo "Usage: $0 file1 file2"
        return 1
    fi
    if [[ ! -f $1 ]]; then
        echo "$1 is not file"
        return 2
    fi
    if [[ ! -f $2 ]]; then
        echo "$2 is not file"
        return 3
    fi

    # ARGIND is the index of the file currently being read (gawk extension).
    awk '
        BEGIN { FS = OFS = " " }
        ARGIND == 1 { arr[$1] = 1 }
        ARGIND == 2 { if (!($1 in arr)) { print $0 } }
    ' "$1" "$2"
}

do_diff "$1" "$2"

#!/usr/bin/perl -w

# Need at least two arguments ($#ARGV is the index of the last one).
exit if (1 > $#ARGV);

my %map_orig;

# Record the first field of every line in the first file.
my $file_orig = shift @ARGV;
open FH, "<$file_orig" or die "can't open file: $file_orig";
while (<FH>) {
    chomp;
    #$map_orig{$_} = 1;
    my ($field) = split /\s+/;
    $map_orig{$field} = 1;
}
close (FH);

# Print each line of the second file whose key was not seen.
my $file_diff = shift @ARGV;
open FH, "<$file_diff" or die "can't open file: $file_diff";
while (<FH>) {
    chomp;
    my ($field) = split /\s+/;
    print "$_\n" if (!defined $map_orig{$field});
}
close (FH);

Test method: time diff.xx f1 f2 > out
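
For repeated runs, the same measurement can be scripted. The following is a rough sketch in Python 3, not part of the original test: the helper name `time_command` is hypothetical, and it records only wall-clock time (comparable to the "real" figure printed by `time`), not user or sys time.

```python
import subprocess
import time


def time_command(argv, out_path):
    """Run argv with stdout redirected to out_path; return wall-clock seconds."""
    with open(out_path, "wb") as out:
        start = time.perf_counter()
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(argv, stdout=out, check=True)
        return time.perf_counter() - start


# Example call (assumes ./diff.py, f1 and f2 exist in the current directory):
# elapsed = time_command(["./diff.py", "f1", "f2"], "out")
```

Running each script several times and comparing the spread would make it easier to judge whether a gap of a few percent is meaningful.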

The test file f1 has 107375330 lines, each in the format:

key value (two fields)

The file size is 2.2G.

f2 has 473951 lines, each also in the format:

key value (two fields)

The file size is 5.9M.
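
The original data files are not available, so anyone repeating the test needs stand-in data. Below is a minimal Python 3 sketch that writes files in the same "key value" layout. Everything about the content is an assumption: the source does not say what the keys look like, how they are distributed, or whether they repeat, so the integer keys, the `key_space` parameter and the function name `write_key_value_file` are all invented for illustration.

```python
import random


def write_key_value_file(path, n_lines, key_space, seed=0):
    """Write n_lines lines of the form 'key value' to path.

    Keys are random integers in [0, key_space); this distribution is an
    assumption, not something taken from the original benchmark.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    with open(path, "w") as f:
        for i in range(n_lines):
            key = rng.randrange(key_space)
            f.write("%d v%d\n" % (key, i))


# Example (scaled down from the sizes described above):
# write_key_value_file("f1", 1000000, key_space=2000000, seed=1)
# write_key_value_file("f2", 5000, key_space=2000000, seed=2)
```

Because the hash-table size depends on the number of distinct keys, results on synthetic data may differ from the figures reported here.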

Test results:

diff.py took 3m24.687s ≈ 205s

diff.sh took 3m39.762s ≈ 220s

diff.pl took 5m49.478s ≈ 349s
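
The conversion to seconds and the relative gaps can be checked with a few lines of Python 3. This is only a convenience sketch: `to_seconds` is a hypothetical helper, and it handles just the `NmS.SSSs` form that appears in these results.

```python
import re


def to_seconds(text):
    """Convert a time-style duration such as '3m24.687s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError("unrecognized duration: %r" % text)
    return int(match.group(1)) * 60 + float(match.group(2))


# Timings copied from the results listed above.
timings = {
    "diff.py": "3m24.687s",
    "diff.sh": "3m39.762s",
    "diff.pl": "5m49.478s",
}
seconds = {name: to_seconds(value) for name, value in timings.items()}
fastest = min(seconds.values())
for name, value in seconds.items():
    print("%s: %.0fs (%.2fx the fastest)" % (name, value, value / fastest))
```

The ratios come from a single run per script, so they are a rough guide only.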

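The results show that awk and python perform about the same, while perl is clearly slower, taking about 1.7 times as long as python. It seems python's dict is very well optimized. That it can match awk's performance came as a real surprise to me.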


Original URL: https://www.cnblogs.com/zeushuang/p/2738987.html