The Three Musketeers of Text Processing and Shell Regular Expressions

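The three musketeers of text processing

When it comes to working with text, a powerful editor such as vim is not the only option: you can also process text with commands, without opening the file and editing it by hand.
The advantage is that editing and processing can be written as shell commands and put into scripts, which is efficient and convenient.
On Linux, the best-known command-line text tools are the "three musketeers" of text processing: grep/egrep, sed, and awk.
This raises a question, though: how do we find and locate the content we want to process, now that we no longer open the file in vim and jump to the right position?
This is where regular expressions come in. Almost every language has them, and their job is to match the strings that meet your requirements.
So to use the three tools well, you need to understand regular expressions first.
Shell regular expressions:
Shell regular expressions fall into two classes: basic regular expressions and extended regular expressions (the extended set adds +, ?, |, and ()).
A regular expression is a set of rules and methods defined for processing large amounts of text and strings.
With these rules an administrator can quickly filter, replace, or print the strings they need. The Linux regex tools generally process input one line at a time.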

Filtering out the strings you want: grep/egrep
[jerry@centos ~]$ grep -i "hist" /etc/profile
HISTSIZE=1000
if [ "$HISTCONTROL" = "ignorespace" ] ; then
    export HISTCONTROL=ignoreboth
    export HISTCONTROL=ignoredups
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL
export HISTTIMEFORMAT="%F %T `whoami` "

Basic regular expressions:

^ - matches the start of a line
[jerry@centos ~]$ grep -i "^hist" /etc/profile
HISTSIZE=1000
grep -i "^hist" /etc/profile #prints the lines that start with hist; -i ignores case

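Telling regular expressions and wildcards (globs) apart, as a rule of thumb:
1. Patterns handed to grep/egrep, sed, and awk are regular expressions; everything else is a wildcard
2. File and directory names: wildcards; text content: regular expressions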
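
To make the distinction concrete, here is a minimal sketch in which the same `*` character means two different things. The scratch directory and file names are made up for the illustration.

```shell
# Scratch directory with three hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/a.txt" "$workdir/ab.txt" "$workdir/b.log"

# Wildcard (glob): the shell expands a*.txt to matching file NAMES
# before any command runs. Here * means "any string".
glob_count=$(cd "$workdir" && set -- a*.txt && echo "$#")

# Regular expression: grep applies the pattern to text CONTENT.
# Here * means "zero or more of the preceding character".
regex_count=$(printf 'b\nab\naab\nac\n' | grep -c '^a*b$')

echo "files matching the glob a*.txt: $glob_count"
echo "lines matching the regex ^a*b\$: $regex_count"
rm -rf "$workdir"
```

The glob selects a.txt and ab.txt; the regex selects the lines b, ab, and aab, because `a*` also accepts zero occurrences of a.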

. - matches exactly one character, and that character must be present
[jerry@centos ~]$ grep "v.r" /etc/profile
# System wide environment and startup programs, for login setup
# /etc/profile.d/ to make custom changes to your environment, as this
    MAIL="/var/spool/mail/$USER"
Here the pattern matched var and vir, which the -o (only-matching) option makes visible:
[jerry@centos ~]$ grep -o "v.r" /etc/profile
vir
vir
var

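* - matches zero or more of the preceding character.
.* - matches any content of any length.
$ - matches the end of a line
[] - matches any single character listed inside the brackets.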
[jerry@centos ~]$ egrep "^$|^#" /etc/init.d/functions | wc -l
123
This matches comment lines that start with # and empty lines (^$); wc counts 123 such lines.

[jerry@centos ~]$ echo "hello world" | grep "[eol]"
hello world
[jerry@centos ~]$ echo "hello world" | grep "[eol]" -o
e
l
l
o
o
l
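The matching is done one character at a time: each match is any one of the characters inside [].
Note that a comma inside [] is just another character to match, not a separator. The characters in [] need no separator.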
[jerry@centos ~]$ echo "hello" | grep "^[^s]"
hello
[jerry@centos ~]$ echo "shello" | grep "^[^s]"

^[^s] - a ^ inside the brackets means negation (not). The pattern matches lines whose first character is not s.

[jerry@centos ~]$ echo "shh" | grep "^[sa]"
shh
[jerry@centos ~]$ echo "ahh" | grep "^[sa]"
ahh
This matches lines that start with a or s.

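\{n,m\} - matches the preceding character n to m times: at least n, at most m. In a basic regular expression the braces are written with backslashes.
[jerry@centos ~]$ echo "kk kkk kkkk" | grep "k\{2,3\}" -o
kk
kkk
kkk
Matches runs of 2 to 3 k characters.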
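
The interval syntax differs between basic and extended regular expressions. The following sketch runs the same sample through both forms; the variable names are illustrative only.

```shell
sample='kk kkk kkkk'

# Basic regular expression: the braces need backslashes.
bre=$(printf '%s\n' "$sample" | grep -o 'k\{2,3\}')

# Extended regular expression (-E): the braces are written bare.
ere=$(printf '%s\n' "$sample" | grep -Eo 'k{2,3}')

# Without backslashes and without -E the braces are literal characters,
# so nothing in the sample matches and grep -c reports 0 lines.
literal=$(printf '%s\n' "$sample" | grep -c 'k{2,3}' || true)

printf '%s\n' "$bre"
```

Both forms print kk, kkk, kkk; the last k of kkkk is left over because a single k is shorter than the minimum of 2.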

Combining several regex elements:
[jerry@centos ~]$ ip a | grep "^ .*inet.*em1$"
    inet 172.16.254.9/16 brd 172.16.255.255 scope global noprefixroute em1

\<\> - word anchors: \< matches at the start of a word and \> at the end
[jerry@centos ~]$ grep "\<root\>" /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin

\(\) - grouping
[jerry@centos ~]$ grep "\(root\).*\1" /etc/passwd
root:x:0:0:root:/root:/bin/bash
#The trailing \1 refers back to whatever the first group matched.
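
A self-contained sketch of the same back-reference, run against made-up passwd-style lines rather than the real /etc/passwd. The -E variant relies on GNU grep accepting back-references in extended syntax, which is an extension rather than something POSIX guarantees.

```shell
# Hypothetical sample lines for illustration.
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin'

# \(root\) captures "root"; \1 requires the same text to appear again later
# on the line, so only lines containing root at least twice are printed.
bre_hits=$(printf '%s\n' "$sample" | grep '\(root\).*\1')

# The same pattern in extended syntax: bare parentheses, same \1.
ere_hits=$(printf '%s\n' "$sample" | grep -E '(root).*\1')

printf '%s\n' "$bre_hits"
```

The operator line contains root only once, so it is not selected.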

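Extended regular expressions
grep -E - the -E option enables extended regular expressions; egrep does the same
Extended regular expressions support everything in the basic set
? - matches the preceding character at most once (zero or one time).
+ - matches the preceding character at least once (one or more times)
| - alternation (or)
() - grouping, written without backslashes
{} - repetition counts, written without backslashes
\<\> - word anchors, still written with backslashes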

[jerry@centos ~]$ echo "k kk kkk kkkk" | egrep -o "k?"
k
k
k
k
k
k
k
k
k
k
[jerry@centos ~]$ ip a | egrep ".+em1$"
    inet 172.16.254.9/16 brd 172.16.255.255 scope global noprefixroute em1

Quantifiers (how many): * {} ? +
Anchors (where): ^ $ \< \>
Character matching and grouping: . [] ()
Logic: |

If a pattern uses extended regular expression syntax, grep and sed must be told to support it.
For grep, add the -E option; for sed, add -r.

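[jerry@centos ~]$ egrep "[A-Z]+" /etc/init.d/functions | wc -l
160
#Counts the lines that contain at least one uppercase letter
[jerry@centos ~]$ grep -E -v "[1-9]+" /etc/init.d/functions | wc -l
557
#-v inverts the match: counts the lines that contain none of the digits 1-9 (use [0-9] to also cover 0)
[jerry@centos ~]$ grep -E "[1-9]+" /etc/init.d/functions | wc -l
155
#Counts the lines that contain at least one of the digits 1-9
[jerry@centos ~]$ grep "a.*b" /etc/init.d/functions | wc -l
50
#Counts the lines in which an a is followed, anywhere later on the line, by a b
[jerry@centos ~]$ grep "^# .*" /etc/init.d/functions | wc -l
30
#Counts the lines that start with # followed by a space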

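grep in detail

Commonly used grep options: -i -n -o -c -v -E -A -B -C
-i: ignore case
-n: print line numbers
-o: print only the matched text
-c: print the number of matching lines
-v: invert the match, printing the lines that do not match
-E: use extended regular expressions
-A: after, also print the given number of lines after each match
-B: before, also print the given number of lines before each match
-C: context, also print the given number of lines before and after each match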
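
The context and counting options are easiest to compare on a small, fixed input. This sketch uses a made-up five-line sample so that the results do not depend on any system file.

```shell
# Hypothetical five-line sample.
sample='alpha
beta
gamma
delta
epsilon'

# -B 1: print each matching line plus the one line before it.
before=$(printf '%s\n' "$sample" | grep -B 1 'gamma')

# -A 1: print each matching line plus the one line after it.
after=$(printf '%s\n' "$sample" | grep -A 1 'gamma')

# -c counts matching lines; -n prefixes each match with its line number.
count=$(printf '%s\n' "$sample" | grep -c 'a$')
numbered=$(printf '%s\n' "$sample" | grep -n 'delta')

printf '%s\n' "$before"
```

Here -B 1 yields beta and gamma, -A 1 yields gamma and delta, -c reports 4 lines ending in a, and -n prints 4:delta.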

[jerry@centos ~]$ cat > a.txt << EOF
> eE
> aA
> xX
> cC
> EOF
#Writes the lines above into a.txt
[jerry@centos ~]$ grep -i "e" a.txt
eE
#-i ignores case, so the pattern e also matches E.

[jerry@centos ~]$ grep -in "[ac]" a.txt
2:aA
4:cC
#-n adds the line number of each matching line

[jerry@centos ~]$ grep -o "[ac]" a.txt
a
c
#Prints only the matched text

[jerry@centos ~]$ grep -c "[ac]" a.txt
2
#Prints how many lines match

[jerry@centos ~]$ grep -vn "[ac]" a.txt
1:eE
3:xX
#-v prints the lines that do not match

Monitoring the background program crond (any other program works the same way)
[jerry@centos ~]$ ps aux | grep crond | grep -v grep
root      9015 0.0 0.0 126288 1644 ?        Ss   10月09   0:01 /usr/sbin/crond -n
[jerry@centos ~]$ ps aux | grep crond | grep -v grep | wc -l
1
#Assign this count to a variable and check it from a script, or use a Zabbix custom check.
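
One possible shape for such a check is sketched below. The function names count_procs and check_service are invented for this example, and the messages are placeholders; a real monitor would likely send an alert or restart the service instead of printing.

```shell
# count_procs NAME: count lines of ps-style text on stdin that mention NAME.
# grep -v grep drops the grep process itself, which would otherwise be counted.
count_procs() {
    grep -- "$1" | grep -v grep | wc -l | tr -d ' '
}

# check_service NAME: report whether NAME appears in the current process list.
check_service() {
    if [ "$(ps aux | count_procs "$1")" -ge 1 ]; then
        echo "$1 is running"
    else
        echo "$1 is NOT running"
    fi
}

check_service crond
```

Reading the process list from standard input keeps count_procs easy to try out with canned text instead of a live system.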

[jerry@centos ~]$ grep -A 3 "\<root\>" /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
--
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
#Prints each matching line plus the three lines after it.
[jerry@centos ~]$ grep -C 2 "\<root\>" /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
--
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
#-A, -B, and -C are all used the same way.
[jerry@centos ~]$ grep -E "^#|^$" /etc/init.d/functions | wc -l
123
#Counts the comment lines and empty lines.

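grep: extended regular expressions are off by default; add -E to enable them. Without -E, {} must be escaped with backslashes, as \{ \}.
egrep: supports basic and extended regular expressions; equivalent to grep -E.
awk: uses extended regular expressions by default.
sed: extended regular expressions are off by default; add -r to enable them. Without -r, {} must be escaped with backslashes, as \{ \}.
All three tools process input line by line.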

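sed in detail

On Linux everything is a file: configuration files, log files, startup files, and so on. When we need to edit or query a file, the first tools that come to mind are probably vi, vim, more, and cat,
but they are inefficient for repetitive work. Linux has three tools for this: awk (the top swordsman), sed (the second), and grep (the third). When they can do the job,
they save a great deal of repeated manual work and raise efficiency.

Input can come from files, keyboard input, or a pipe.
Once you have learned sed, you will find it very useful for making a series of changes to a file.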

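sed usage:
[jerry@centos ~]$ sed --help
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...

The script that follows the options consists of sed-commands, the editing commands built into sed. They are called sed-commands here to tell them apart from the command-line options; a script can combine several of them.

How sed works:
sed reads one line, places it in a buffer, processes it there, and when processing is finished sends the contents of the buffer to the terminal.
The buffer that holds the line sed has just read is called the pattern space.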
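
The automatic print of the pattern space can be observed directly. The following is a small sketch with a two-line made-up input; the variable names are illustrative.

```shell
input=$(printf 'one\ntwo\n')

# Default: after the script runs, sed prints the pattern space once per line.
default_out=$(printf '%s\n' "$input" | sed '')

# p prints the pattern space explicitly, and the automatic print still happens,
# so every line appears twice.
doubled=$(printf '%s\n' "$input" | sed 'p')

# -n turns the automatic print off, so only the explicit p remains.
quiet=$(printf '%s\n' "$input" | sed -n 'p')

printf '%s\n' "$doubled"
```

This is why p is normally paired with -n: without -n each selected line would show up twice.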

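Option reference:
options:
-n    (no) suppresses sed's default output; usually combined with the sed-command p.
-e    multiple expressions: run several sed operations in one command.
-r    use extended regular expressions
-i    write the changes directly into the file instead of printing them to the terminal. Without -i only the in-memory copy is changed and nothing is written to disk.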

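sed-commands:
a - append: add one or more lines of text after the addressed line.
c - change: replace the addressed line or lines
d - delete: delete the addressed line or lines
i - insert: add one or more lines of text before the addressed line
p - print: print the contents of the pattern space; usually combined with the -n option.

! - negation: apply the command to every line except the addressed ones.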

Prepare a test file a.txt and copy the contents of /etc/passwd into it.
[jerry@centos ~]$ sed '1a hello world' a.txt
root:x:0:0:root:/root:/bin/bash
hello world
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
#Appends hello world after the first line
[jerry@centos ~]$ sed '1i challenge accepted' a.txt
challenge accepted
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
#Inserts text before the first line
The line number can of course be anything, and $ stands for the last line.
[jerry@centos ~]$ sed '$i challenge accepted' a.txt

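#All of this output goes to the terminal; nothing is written to the file on disk. Add -i to write the changes back, but back up the file before modifying it.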
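
One way to combine the in-place edit with the backup is to give -i a suffix, as in this sketch. It assumes GNU sed (for the one-line form of the a command) and works on a scratch file created just for the example.

```shell
# Work on a scratch copy so no real file is touched.
tmpfile=$(mktemp)
printf 'root\nbin\n' > "$tmpfile"

# Preview first: without -i the result only goes to the terminal.
sed '1a hello world' "$tmpfile"

# -i.bak edits the file in place and keeps the original as <file>.bak.
sed -i.bak '1a hello world' "$tmpfile"

edited=$(cat "$tmpfile")
backup=$(cat "$tmpfile.bak")
rm -f "$tmpfile" "$tmpfile.bak"
```

After the second command the file contains the added line, while the .bak copy still holds the original two lines.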

[jerry@centos ~]$ sed "1d" a.txt
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
#Deletes the first line

[jerry@centos ~]$ sed "1,33d" a.txt
tom:x:1005:1005::/home/tom:/bin/bash
jaoo:x:1007:1007::/home/jaoo:/bin/bash
ken:x:1008:1008::/home/ken:/bin/bash
aaa:x:1009:1009::/home/aaa:/bin/bash
#Deletes lines 1-33

[jerry@centos ~]$ sed '1c hello' a.txt
hello
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
#Changes the first line to hello

[jerry@centos ~]$ sed '1c 1\n2\n3' a.txt
1
2
3
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
#One line can also be replaced by several lines; separate them with the newline escape \n.
[jerry@centos ~]$ sed '1i 1\n2\n3' a.txt
1
2
3
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
#The same works for insert and append.

[jerry@centos ~]$ sed -n '1p' a.txt
root:x:0:0:root:/root:/bin/bash
#Prints only the first line.

[jerry@centos ~]$ sed '1,33c ken' a.txt
ken
tom:x:1005:1005::/home/tom:/bin/bash
jaoo:x:1007:1007::/home/jaoo:/bin/bash
#change also accepts a range of lines; the whole range is replaced by the one new line

[jerry@centos ~]$ sed -n '1,4p' a.txt
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
#Prints lines 1-4

[jerry@centos ~]$ sed -n '$p' a.txt
rljwg:x:1034:1034::/home/rljwg:/bin/bash

[jerry@centos ~]$ sed -n '/root/p' a.txt
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
#Prints the lines that match

[jerry@centos ~]$ sed -n '/^root/p' a.txt
root:x:0:0:root:/root:/bin/bash
#Prints the lines that start with root

-e allows several operations in one command:
sed -e '1d' -e '5,7d' a.txt
sed -i '1,33d' a.txt   #writes to disk; nothing is printed to the terminal.
[jerry@centos ~]$ sed -i '1,33d' a.txt
[jerry@centos ~]$ cat a.txt
tom:x:1005:1005::/home/tom:/bin/bash
jaoo:x:1007:1007::/home/jaoo:/bin/bash
ken:x:1008:1008::/home/ken:/bin/bash
aaa:x:1009:1009::/home/aaa:/bin/bash
jek:x:1010:1010::/home/jek:/bin/bash
huwan:x:1011:1011::/home/huwan:/bin/bash
[jerry@centos ~]$ sed -i -n '1,10p' a.txt
[jerry@centos ~]$ cat a.txt
tom:x:1005:1005::/home/tom:/bin/bash
jaoo:x:1007:1007::/home/jaoo:/bin/bash
ken:x:1008:1008::/home/ken:/bin/bash
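#Only 10 lines are left, because with -i the output of -n '1,10p' becomes the new file content. Before writing to a file, run the command without -i first and check the output on the terminal.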
[jerry@centos ~]$ head /etc/passwd > a.txt
[jerry@centos ~]$ sed -n '2,5p' a.txt
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
[jerry@centos ~]$ sed -n '2,5!p' a.txt
root:x:0:0:root:/root:/bin/bash
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
#! negates the address

[jerry@centos ~]$ sed 's/root/kkkkkkkk/g' a.txt
kkkkkkkk:x:0:0:kkkkkkkk:/kkkkkkkk:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/kkkkkkkk:/sbin/nologin
#Substitution: with g every match on each line is replaced; without g only the first match on each line.
[jerry@centos ~]$ sed 's/root/kkkkkkkk/' a.txt
kkkkkkkk:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/kkkkkkkk:/sbin/nologin

[jerry@centos ~]$ sed '/^root/{s/root/ken/g}' a.txt
ken:x:0:0:ken:/ken:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
#Selects the lines that start with root, then replaces root with ken on them. Without g only the first match on the line would be replaced.
[jerry@centos ~]$ cat /etc/sysconfig/selinux > a.txt
[jerry@centos ~]$ sed -r -i 's/(SELINUX=)disabled/\1enforcing/' a.txt
[jerry@centos ~]$ cat a.txt
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=enforcing
# SELINUXTYPE= can take one of three values:
#     targeted - Targeted processes are protected,
#     minimum - Modification of targeted policy. Only selected processes are protected.
#     mls - Multi Level Security protection.

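#-r enables extended regular expressions; without it the bare () is not a group and sed reports an error. () groups, and \1 refers back to the group.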
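
The same substitution can be written in either syntax. This sketch applies both to a made-up config line passed through a pipe, so the real /etc/sysconfig/selinux is never touched.

```shell
# Hypothetical config line for illustration.
line='SELINUX=disabled'

# Extended syntax (-r): bare parentheses form the group.
ere=$(printf '%s\n' "$line" | sed -r 's/(SELINUX=)disabled/\1enforcing/')

# Basic syntax (no -r): the parentheses must be escaped to form a group.
bre=$(printf '%s\n' "$line" | sed 's/\(SELINUX=\)disabled/\1enforcing/')

echo "$ere"
```

In both cases \1 re-inserts the captured prefix SELINUX=, so only the value after it changes.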

[jerry@centos ~]$ sed -e '/^#/d' -e '/^$/d' a.txt
SELINUX=enforcing
SELINUXTYPE=targeted

grep supports several patterns with -e as well.
[jerry@centos ~]$ grep -v -e '^#' -e "^$" a.txt
SELINUX=enforcing
SELINUXTYPE=targeted

[jerry@centos ~]$ sed -r '/(^#)|(^$)/d' a.txt
SELINUX=enforcing
SELINUXTYPE=targeted

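awk in detail

awk is not just a command on Linux systems but also a programming language, used to process data and generate reports.
The data can come from one or more files, from standard input, or from a pipe. awk programs can be typed directly on the command line or saved
as awk program files for more complex tasks.

awk syntax:
An awk statement consists of a pattern, an action, or a pattern combined with an action.
A pattern is similar to address matching in sed and can be built from expressions; NR==1, for example, is a pattern. Think of it as a filter condition.
An action is one or more statements enclosed in braces, with the statements separated by semicolons.

awk options 'pattern{action}' filename
pattern selects the records to act on
action is what to do with each selected record
Option -F - sets the field separator #a regular expression is allowed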

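A few awk concepts:
Record: one line is one record
Field separator: the character or characters used to split a record
Field: each piece that a record is split into
FILENAME: the name of the file currently being processed
FS - field separator; by default fields are split on whitespace
NR - number of records, the record number; it grows by 1 for every line awk reads
NF - number of fields in the current record
ORS - output record separator; the default is \n
OFS - output field separator; the default is a single space
RS - input record separator; the default is \n
$1 $2 $3 ... each refer to one field
$NF - the last field
$0 - the whole record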
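
A short sketch showing several of these variables at once. The two records are made-up passwd-style lines, and the " | " output separator is chosen only to make OFS visible.

```shell
# Hypothetical passwd-style records.
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin'

# -F sets FS (the input field separator); -v OFS=... sets the output separator.
# Each comma in print is replaced by OFS; NR is the record number and
# NF the number of fields in that record.
report=$(printf '%s\n' "$sample" | awk -F: -v OFS=' | ' '{print NR, NF, $1, $NF}')

printf '%s\n' "$report"
```

Each output line shows the record number, the field count (7 for both records), the first field, and the last field.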

[jerry@centos ~]$ awk 'NR==2{print $0}' a.txt
# This file controls the state of SELinux on the system.
Always put the awk program in single quotes, so that the shell does not expand $0 itself.

[jerry@centos ~]$ awk 'NR>=2&&NR<=5{print $0}' a.txt
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
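Explanation: the condition NR>=2&&NR<=5 holds when the line number is between 2 and 5, and {print $0} runs for those lines. awk processes the file line by line; this command contains both a pattern and an action,
and awk applies the action only to the lines the pattern selects.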

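How awk executes:
1. awk reads the first line
2. It checks whether the line satisfies the condition (here NR>=2&&NR<=5)
a. If it does, awk runs the action
b. If it does not, awk skips the action
3. awk reads the next line
4. Steps 2-3 repeat until the last line has been processed.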
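
The loop that awk runs implicitly can be spelled out by hand. The sketch below emulates a pattern of NR>=2&&NR<=3 with a plain shell loop and compares it with awk itself; emulate_awk is a name invented for this example, and the loop is only a model of the behaviour, not how awk is implemented.

```shell
input='line one
line two
line three
line four'

# A hand-written version of: awk 'NR>=2&&NR<=3{print $0}'
emulate_awk() {
    nr=0
    while IFS= read -r record; do        # read the next line
        nr=$((nr + 1))                   # NR grows by one per record
        if [ "$nr" -ge 2 ] && [ "$nr" -le 3 ]; then   # test the pattern
            printf '%s\n' "$record"      # pattern holds: run the action
        fi                               # otherwise: nothing to do
    done                                 # repeat until the input ends
}

emulated=$(printf '%s\n' "$input" | emulate_awk)

# The same selection done by awk.
with_awk=$(printf '%s\n' "$input" | awk 'NR>=2&&NR<=3{print $0}')

printf '%s\n' "$with_awk"
```

Both versions print line two and line three.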

[jerry@centos ~]$ awk '{print NR,$0}' a.txt
1
2 # This file controls the state of SELinux on the system.
3 # SELINUX= can take one of these three values:
4 #     enforcing - SELinux security policy is enforced.
5 #     permissive - SELinux prints warnings instead of enforcing.
6 #     disabled - No SELinux policy is loaded.
7 SELINUX=enforcing
8 # SELINUXTYPE= can take one of three values:
9 #     targeted - Targeted processes are protected,
10 #     minimum - Modification of targeted policy. Only selected processes are protected.
11 #     mls - Multi Level Security protection.
12 SELINUXTYPE=targeted
13
14
[jerry@centos ~]$ head /etc/passwd > a.txt
[jerry@centos ~]$ awk -F : '{print $1}' a.txt
root
bin
daemon
adm

[jerry@centos ~]$ awk -F ":" '{print $NF}' a.txt
/bin/bash
/sbin/nologin
/sbin/nologin
/sbin/nologin
/sbin/nologin
/bin/sync

[jerry@centos ~]$ awk '/\<root\>/{print NR" "$0}' a.txt
1 root:x:0:0:root:/root:/bin/bash
10 operator:x:11:0:operator:/root:/sbin/nologin

[jerry@centos ~]$ awk -F ":" 'NR>=6{print $NF}' a.txt
/bin/sync
/sbin/shutdown
/sbin/halt
/sbin/nologin
/sbin/nologin

Going further with awk
$0 - one line, the whole record
~ - the regular expression match operator
// - encloses the regular expression to match

[jerry@centos ~]$ awk '$0~/root/' a.txt
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
#Matches against the whole record
[jerry@centos ~]$ awk -F ":" '$NF~/sbin/{print $0}' a.txt
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
#Matches against the last field

[jerry@centos ~]$ ip a | grep global
    inet 172.16.254.9/16 brd 172.16.255.255 scope global noprefixroute em1
[jerry@centos ~]$ ip a | grep global | awk -F '[ /]+' '{print $3}'
172.16.254.9

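awk special patterns: BEGIN and END
The BEGIN pattern runs before awk reads any input. It is typically used to set built-in variables such as FS and RS.
Note that BEGIN must be followed by an action block in braces; awk runs the BEGIN action before it processes any input.
A BEGIN block can therefore be tested without any input file, because awk handles BEGIN first.
BEGIN is commonly used to change the built-in variables ORS, RS, FS, OFS, and so on.
ORS - output record separator; the default is \n
OFS - output field separator; the default is a single space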
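
Both points can be tried in a few lines: a BEGIN block that needs no input, and a BEGIN block that changes FS and ORS before the first record is read. The input lines are made up for the illustration.

```shell
# BEGIN runs before any input is read, so no input file is needed here.
no_input=$(awk 'BEGIN{print 1 + 2}')

# Setting FS and ORS in BEGIN: split on ":" and end each output record
# with "," instead of a newline.
joined=$(printf 'root:x:0\nbin:x:1\n' | awk 'BEGIN{FS=":"; ORS=","} {print $1}')

echo "$no_input"
echo "$joined"
```

The first command prints 3; the second prints root,bin, on a single line because every record now ends with a comma.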

[jerry@centos ~]$ awk -F : 'BEGIN{print "USERNAME"} {print $1}' a.txt
USERNAME
root
bin
daemon
adm

[jerry@centos ~]$ awk -F : 'BEGIN{print "USERNAME"} {print $1} END{print "END OF FILE"}' a.txt
USERNAME
root
bin
daemon
adm
shutdown
halt
mail
operator
END OF FILE

[jerry@centos ~]$ awk 'BEGIN{num=0}/nologin/{num++}END{print num}' a.txt
6
#Counts the lines in the file that contain nologin

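Basic structure of an awk program:
awk 'BEGIN{CMDS} /pattern/{CMDS} END{CMDS}' filename

awk arrays:
Form: arrayname[string]=value
arrayname: the array name, comparable to the name of a hotel
string: the element key, comparable to a room number
value: the value, comparable to the guest in the room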

[jerry@centos ~]$ awk -F : '{ken[$NF]++}END{for (i in ken) print ken[i] i}' a.txt
1/bin/sync
1/bin/bash
6/sbin/nologin
1/sbin/halt
1/sbin/shutdown

#ken - the array name     i, $NF - the element key    ken[i] - the value
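
The order in which for (key in array) visits the keys is not specified, which is why the output above is not sorted. A minimal sketch that makes the result deterministic by piping it through sort follows; the sample records and the names count and key are chosen for this example.

```shell
# Hypothetical passwd-style records.
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync'

# count[$NF]++ uses the last field (the login shell) as the key and adds 1
# to its value for every record. The END block prints value and key, and
# sort orders the lines by count (descending), then by key.
shell_counts=$(printf '%s\n' "$sample" |
    awk -F: '{count[$NF]++} END{for (key in count) print count[key], key}' |
    sort -k1,1nr -k2,2)

printf '%s\n' "$shell_counts"
```

Printing the value and the key with a comma between them also separates the two with a space, which keeps the output readable and sortable.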

Counting the IP addresses that access a website:
[jerry@centos logs]$ cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -3
27242 167.172.78.90
20025 124.200.101.50
17432 118.25.185.46

[jerry@centos logs]$ awk '{ken[$1]++}END{for (i in ken) print ken[i],i}' access.log | sort -rn | head -3
27242 167.172.78.90
20025 124.200.101.50
17432 118.25.185.46
Original article: https://www.cnblogs.com/getbird/p/11178405.html