Apache log generator + Apache log analyzer

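Problem description:

Apache logs record a great deal of information about visitors.

Many professional Apache log analysis tools are already available online.

This article first presents a simple Apache log generator,

then uses Python to run a few common analyses on the generated log.

Goal 1: get more fluent with Python and its re module.

Goal 2: learn a little about Apache log analysis.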

=======================================================================

Each log line generated in this article consists of the following parts:
remote-host-address - - [time] "METHOD RESOURCE PROTOCOL" status-code total-bytes-sent-to-client

For example: 97.83.32.128 - - [22/Jun/2012:04:47:01 +0800] "GET /abc/acb/abc/?bac=0 HTTP/1.1" 300 1676
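Since one goal here is to practice the re module, a line in this format can be taken apart with a single regular expression. The following is only a minimal sketch for Python 3: the names LOG_PATTERN and parse_line are made up for illustration, and the pattern assumes exactly the simplified format above (no referrer or user-agent fields, and a numeric byte count).

```python
import re

# Hypothetical pattern for the simplified format described above:
# host - - [time] "METHOD RESOURCE PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)


def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared and summed.
    fields['status'] = int(fields['status'])
    fields['size'] = int(fields['size'])
    return fields


sample = ('97.83.32.128 - - [22/Jun/2012:04:47:01 +0800] '
          '"GET /abc/acb/abc/?bac=0 HTTP/1.1" 300 1676')
print(parse_line(sample))
```

Named groups keep the later analysis readable (fields['host'], fields['resource']) instead of relying on positional indexes.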

Without further ado, here is the code for the Apache log generator.

Apache log generator

import random

MONTH_LIST = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def generate_mrp():
    """Return the quoted request field: "METHOD RESOURCE PROTOCOL"."""
    method_list = ['GET', 'POST', 'HEAD']
    re_li = ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
    method = random.choice(method_list)
    resource = '/%s/%s/%s/?%s=%s' % (random.choice(re_li), random.choice(re_li),
                                     random.choice(re_li), random.choice(re_li),
                                     random.randint(0, 3))
    return '"%s %s HTTP/1.1"' % (method, resource)


def generate_ip():
    """Return a fresh random IP about 80% of the time, else repeat the last one."""
    global global_ip
    if random.random() > 0.2:
        global_ip = '.'.join(str(random.randint(1, 128)) for _ in range(4))
    return global_ip


def generate_time():
    """Return a fresh random timestamp about 80% of the time, else repeat the last one."""
    global global_time
    if random.random() < 0.2:
        return global_time
    rmonth = random.randint(4, 5)  # index 4 is May, index 5 is Jun
    month = MONTH_LIST[rmonth]
    # May has 31 days, June only 30, so avoid generating 31/Jun
    day = random.randint(1, 31 if rmonth == 4 else 30)
    year = 2012
    hour = random.randint(0, 23)
    minute = random.randint(0, 59)
    second = random.randint(0, 59)
    # %02d zero-pads day, hour, minute and second to two digits
    global_time = '[%02d/%s/%d:%02d:%02d:%02d +0800]' % (day, month, year, hour, minute, second)
    return global_time


# start here: initial values, reused whenever a generator repeats its last result
global_ip = '127.0.0.1'
global_time = '[05/Nov/2012:12:01:07 +0800]'

print('please enter the number of log lines to generate')
while True:
    try:
        number = int(input())
    except ValueError:
        print('please enter a number')
        continue
    if number <= 0:
        print('please enter a positive number')
        continue
    break

print('please enter a filename to output the apache log data')
filename = input()
with open(filename, 'w') as f:
    for _ in range(number):
        temp = [generate_ip(), '- -', generate_time(), generate_mrp(),
                str(100 * random.randint(1, 3)),   # status code: 100, 200 or 300
                str(random.randint(0, 2000))]      # bytes sent to the client
        f.write(' '.join(temp) + '\n')

=======================================================================

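As for the analysis itself, this article performs the following two analyses; more can be added.

3. For the Apache log, find crawler-like queries (the same IP making more than N requests in one day) and write them to a file named after the date of the scan, e.g. 20090704.

This uses a dict of the form {a: [b, c]}, where a is a hash value generated from the IP address and the date, b is the IP address (the code stores it together with the date), and c is the counter.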
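The counting idea can be sketched in a few lines. This is a hedged illustration for Python 3, not the analyzer itself: the function name find_heavy_visitors and the sample lines are invented, and it uses the (ip, date) tuple directly as the dictionary key instead of a precomputed hash value, since a dict hashes its keys anyway.

```python
from collections import Counter


def find_heavy_visitors(lines, threshold):
    """Return {(ip, date): hits} for every pair seen more than `threshold` times."""
    hits = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        ip = parts[0]
        # parts[3] looks like '[22/Jun/2012:04:47:01'; keep only '22/Jun/2012'
        date = parts[3].lstrip('[').split(':', 1)[0]
        hits[(ip, date)] += 1
    return {key: n for key, n in hits.items() if n > threshold}


# Made-up sample data in the format produced by the generator.
sample_lines = [
    '1.2.3.4 - - [22/Jun/2012:04:47:01 +0800] "GET /abc/ HTTP/1.1" 200 10',
    '1.2.3.4 - - [22/Jun/2012:05:00:00 +0800] "GET /acb/ HTTP/1.1" 200 10',
    '1.2.3.4 - - [23/Jun/2012:05:00:00 +0800] "GET /acb/ HTTP/1.1" 200 10',
    '5.6.7.8 - - [22/Jun/2012:06:00:00 +0800] "GET /abc/ HTTP/1.1" 200 10',
]
print(find_heavy_visitors(sample_lines, 1))
```

With a threshold of 1, only 1.2.3.4 on 22/Jun/2012 is reported, because its third request falls on a different day.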

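4. For the Apache log, compute the PV (page views) and the average PV of all queries.

Likewise, this uses a dict of the form {a: [b, c, d]}, where a is a hash value generated from the date, b is the date, c is the PV for that day, and d is the number of distinct IPs for that day.

To eliminate duplicates, d is stored as a set of IPs; its size is the number of distinct IPs, and the average PV is the day's PV divided by that size.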
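A compact way to see the PV bookkeeping is the sketch below. It is written for Python 3 under the same assumptions as the log format above; daily_pv and the sample lines are hypothetical names and data, and the dictionaries are keyed by the date string itself rather than by a hash of it.

```python
from collections import defaultdict


def daily_pv(lines):
    """Return {date: (pv, distinct_ips, average_pv)} for the given log lines."""
    pv = defaultdict(int)    # date -> number of requests that day
    ips = defaultdict(set)   # date -> set of IPs seen that day (duplicates dropped)
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        date = parts[3].lstrip('[').split(':', 1)[0]
        pv[date] += 1
        ips[date].add(parts[0])
    return {d: (pv[d], len(ips[d]), pv[d] / len(ips[d])) for d in pv}


# Made-up sample data in the format produced by the generator.
pv_sample = [
    '1.2.3.4 - - [22/Jun/2012:04:47:01 +0800] "GET /abc/ HTTP/1.1" 200 10',
    '1.2.3.4 - - [22/Jun/2012:05:00:00 +0800] "GET /acb/ HTTP/1.1" 200 10',
    '5.6.7.8 - - [22/Jun/2012:06:00:00 +0800] "GET /abc/ HTTP/1.1" 200 10',
    '1.2.3.4 - - [23/Jun/2012:05:00:00 +0800] "GET /acb/ HTTP/1.1" 200 10',
]
for date, stats in daily_pv(pv_sample).items():
    print(date, stats)
```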

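One caveat: the PV here does not account for several users on a local network sharing one IP address; it assumes each IP is used by a single user.

Taking that into account would also require checking cookies.

I have not thought of a better approach so far; if you have one, please let me know.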

Finally, here is the code for the Apache log analyzer.

Apache log analyzer

import sys
import time


class Apache(object):
    filename = 'apache.log'
    # valid menu choices are the integers in [0, max_choice]
    max_choice = 2

    def __init__(self):
        # one list of whitespace-separated fields per log line
        self.linedata = []

    def show(self):
        print('>' * 77)
        print('>>please enter a number to start the apache log analysis')
        print(">>1: find IPs with more than N requests in one day and write them to a file named after today's date")
        print('>>2: compute the daily PV and average PV of all queries and write them to pv.txt')
        print('>>0: exit system')
        print('>' * 77)

    def input_number(self, low, up):
        """Read an integer in [low, up]; return None if the input is invalid."""
        try:
            value = int(input())
        except ValueError:
            print('\n!!please enter a number')
            return None
        if low <= value <= up:
            return value
        print('\n!!please enter a number in [%d:%d]' % (low, up))
        return None

    @staticmethod
    def day_of(fields):
        """Turn the '[22/Jun/2012:04:47:01' field into '22/Jun/2012'."""
        return fields[3].lstrip('[').split(':', 1)[0]

    def analysis(self):
        try:
            with open(self.filename, 'r') as f:
                for line in f:
                    fields = line.split()
                    if len(fields) >= 4:  # ignore blank or truncated lines
                        self.linedata.append(fields)
        except IOError:
            print('\n!!file %s does not exist' % self.filename)
            sys.exit(1)
        while True:
            self.show()
            choice = self.input_number(0, self.max_choice)
            if choice is None:
                continue
            if choice == 0:
                print('\n!!system exiting, byebye')
                break
            if self.method(choice):
                print('\n!!finished')

    def method(self, choice):
        if choice == 1:
            print('\n!!please enter the number N')
            threshold = self.input_number(1, 1000000)
            if threshold is None:
                return False
            # the output file is named after the day of the scan, e.g. 20090704
            strtime = time.strftime('%Y%m%d', time.localtime())
            count = {}  # hash(ip + date) -> [label, hits]
            for fields in self.linedata:
                label = '%s %s' % (fields[0], self.day_of(fields))
                key = hash(label)
                if key in count:
                    count[key][1] += 1
                else:
                    count[key] = [label, 1]
            with open(strtime, 'w') as f:
                for label, hits in count.values():
                    if hits > threshold:
                        f.write('%s %s\n' % (label, hits))
            return True

        if choice == 2:
            count = {}  # hash(date) -> [date, pv, set of distinct IPs]
            for fields in self.linedata:
                day = self.day_of(fields)
                key = hash(day)
                # keep the dotted IP string so that e.g. 1.11.1.1 and 11.1.1.1 stay distinct
                if key in count:
                    count[key][1] += 1
                    count[key][2].add(fields[0])
                else:
                    count[key] = [day, 1, {fields[0]}]
            with open('pv.txt', 'w') as f:
                for day, pv, ips in count.values():
                    # average PV = requests that day / distinct IPs that day
                    f.write('%s %s %s\n' % (day, pv, float(pv) / len(ips)))
            return True
        return False


if __name__ == '__main__':
    apache = Apache()
    apache.analysis()

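Finally, much of the analysis can also be done with shell commands, for example:

Problem 1: find the 10 IPs with the most requests in apache.log.

awk '{print $1}' apache.log |sort |uniq -c|sort -nr|head

Problem 2: find the minutes (HH:MM) with the most requests in the Apache log.

awk '{print $4}' apache.log |awk -F"/" '{print $3}'|cut -c 6-10|sort|uniq -c|sort -nr|head

Problem 3: find the most visited pages in the Apache log.

awk '{print $7}' apache.log|sort|uniq -c|sort -nr|head

Problem 4: count the IP connections on a given day. Every log line counts as one connection; pipe through sort -u before wc -l to count distinct IPs instead.

grep '22/Jun/2012' apache.log| awk '{print $1}' |wc -l

Problem 5: see which URLs a given IP visited on a given day.

grep '^97\.83\.32\.128 .*22/Jun/2012' apache.log| awk '{print $7}'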

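Once a log has been produced with the Apache log generator (and saved as apache.log), the commands above can be used on it directly.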
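The sort | uniq -c | sort -nr | head pattern used in several of the commands above has a close Python counterpart in collections.Counter. The sketch below is for Python 3; top_values and the sample lines are hypothetical, and field indexes are 0-based, so awk's $1 becomes index 0 and $7 becomes index 6.

```python
from collections import Counter


def top_values(lines, field_index, n=10):
    """Count one whitespace-separated field and return the n most common values."""
    counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) > field_index:  # skip lines that are too short
            counter[parts[field_index]] += 1
    return counter.most_common(n)


# Made-up sample data in the format produced by the generator.
top_sample = [
    '1.1.1.1 - - [22/Jun/2012:04:47:01 +0800] "GET /abc/ HTTP/1.1" 200 10',
    '1.1.1.1 - - [22/Jun/2012:04:48:01 +0800] "GET /abc/ HTTP/1.1" 200 10',
    '2.2.2.2 - - [22/Jun/2012:04:49:01 +0800] "GET /acb/ HTTP/1.1" 200 10',
]
print(top_values(top_sample, 0))  # busiest IPs, like Problem 1
print(top_values(top_sample, 6))  # most visited pages, like Problem 3
```

For a real file, the lines could come from open('apache.log'), assuming the log was saved under that name.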

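Analyzing Apache logs is genuinely useful. Unfortunately my data is randomly generated, so many real-world patterns cannot be reproduced.

Simulating traffic more realistically would require some changes to the generator code.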
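One possible direction, offered only as a rough sketch and not as part of the generator above, is to draw pages and status codes from weighted distributions so that a few values dominate, as they tend to in real traffic. Everything here is an assumption for illustration: the page list, the status codes, the weights, and the helper name pick_request. It relies on random.choices, which requires Python 3.6 or later.

```python
import random

# Hypothetical weights: a few pages receive most of the traffic,
# and most responses are successful.
PAGES = ['/abc/', '/acb/', '/bac/', '/bca/']
PAGE_WEIGHTS = [50, 30, 15, 5]
STATUSES = [200, 304, 404, 500]
STATUS_WEIGHTS = [85, 8, 6, 1]


def pick_request(rng):
    """Draw one (page, status) pair according to the weights above."""
    page = rng.choices(PAGES, weights=PAGE_WEIGHTS, k=1)[0]
    status = rng.choices(STATUSES, weights=STATUS_WEIGHTS, k=1)[0]
    return page, status


rng = random.Random(2012)  # fixed seed so that runs are repeatable
requests = [pick_request(rng) for _ in range(1000)]
print(requests[:5])
```

The same idea could be applied to IP addresses or to the hour of day if the goal is to make the top-N analyses above return something more interesting than uniform noise.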

Original article: https://www.cnblogs.com/2010Freeze/p/2558527.html