• Parsing Multi-line Logs with Python Regular Expressions: An Example (Configurable)


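For the basics of regular expressions, see "Regular Expression Basics" (《正则表达式基础知识》). This article uses regular expressions to match multi-line log records and parse the relevant fields out of them.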

Suppose we have the following SQL log:

    SELECT * FROM open_app WHERE 1 and `client_id` = 'a08f5e32909cc9418f' and `is_valid` = '1' order by id desc limit 32700,100;
    # Time: 160616 10:05:10
    # User@Host: shuqin[qqqq] @  [1.1.1.1]  Id: 46765069
    # Schema: db_xxx  Last_errno: 0  Killed: 0
    # Query_time: 0.561383  Lock_time: 0.000048  Rows_sent: 100  Rows_examined: 191166  Rows_affected: 0
    # Bytes_sent: 14653
    SET timestamp=1466042710;
    SELECT * FROM open_app WHERE 1 and `client_id` = 'a08f5e32909cc9418f' and `is_valid` = '1' order by id desc limit 36700,100;
    # User@Host: shuqin[ssss] @  [2.2.2.2]  Id: 46765069
    # Schema: db_yyy  Last_errno: 0  Killed: 0
    # Query_time: 0.501094  Lock_time: 0.000042  Rows_sent: 100  Rows_examined: 192966  Rows_affected: 0
    # Bytes_sent: 14966
    SET timestamp=1466042727;

       

The task is to parse the relevant fields out of this log. The key points are:

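(1) By default, ^ and $ match only at the start and end of the whole string. To anchor a record at line boundaries, enable multi-line mode: MULTILINE. The dot does not match a newline by default, so to let .* span several lines, DOTALL mode must be enabled as well.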

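(2) To match each multi-line record separately, use non-greedy matching, that is, add ? after .* . Otherwise the first match runs on to the end of the text.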

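(3) Divide and conquer. Writing one correct regular expression for a long string is hard, so split the string into substrings and match each substring separately. Here every substring is one line; once a line is matched, its fields can be matched in more detail within the line.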

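(4) Whitespace turns up everywhere, so use \s* or \s+ for robustness. Fixed literal strings in the regular expression mark the boundaries between substrings and make them easier to match.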

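(5) Two common entry points of the Python re module are re.findall and re.match. When the pattern has several capture groups, re.findall returns a list in which each element is a tuple for one matched multi-line record, and each tuple element is the string captured by the corresponding group. re.match returns a Match object (or None), and group(n) gives the string captured by group n. The program below deliberately uses both: re.findall for the multi-line match and re.match within a line. Beginners often ask how the two differ; trying both is the quickest way to find out.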

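(6) Use maps for presentation. Parsed results have to be displayed or turned into a report, and a composite structure of maps and lists is usually a good fit. In this example, to show the details of every SQL log record, the structure can be

    {"tablename1": [{sqlobj11}, {sqlobj12}], ...,  "tablenameN": [{sqlobjN1}, {sqlobjN2}] }

where each sqlobj has the form:

    {"sql": "select xxx", "QueryTime": 0.5600, ...}

For a brief report, such as per-table SQL statistics, it can be

    {"tablename1": {"sql11": 98, "sql12": 16}, ..., "tablenameN": {"sqlN1": 75, "sqlN2": 23} }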
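
Points (1), (2) and (5) can be checked on a small scale. The following is a minimal Python 3 sketch, not the article's program: the two shortened records in LOG and the variable names are made up for illustration, and the patterns are simplified versions of the ones used later.

```python
import re

# Two shortened records in the same shape as the sample log (illustrative data).
LOG = (
    "SELECT 1;\n"
    "# Schema: db_xxx  Last_errno: 0  Killed: 0\n"
    "SET timestamp=1466042710;\n"
    "SELECT 2;\n"
    "# Schema: db_yyy  Last_errno: 0  Killed: 0\n"
    "SET timestamp=1466042727;\n"
)

LAZY = r'^\s*(.*?)# Schema:(.*?)SET timestamp=(\d+);\s*$'
GREEDY = r'^\s*(.*)# Schema:(.*)SET timestamp=(\d+);\s*$'
FLAGS = re.DOTALL | re.MULTILINE

# Non-greedy groups stop as early as possible: one tuple per record.
lazy = re.findall(LAZY, LOG, flags=FLAGS)

# Greedy groups run to the last "# Schema:", so both records merge into one match.
greedy = re.findall(GREEDY, LOG, flags=FLAGS)

# Without DOTALL the dot cannot cross a newline, so nothing matches.
no_dotall = re.findall(LAZY, LOG, flags=re.MULTILINE)

# re.match works on one string and returns a Match object with group(n).
m = re.match(r'\s*(\S+)\s+Last_errno:\s*(\d+)\s+Killed:\s*(\d+)', lazy[0][1])

print(len(lazy), len(greedy), len(no_dotall))
print(m.group(1), m.group(2), m.group(3))
```

Here re.findall yields tuples because the pattern has more than one capture group, while the inner re.match is applied to a single captured line.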

     

Python implementation:

    import re
    
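    # One record: the SQL text, then the "# ..." metadata lines, ending at "SET timestamp=...;".
    globalRegex = r'^\s*(.*?)# (User@Host:.*?)# (Schema:.*?)# (Query_time:.*?)# Bytes_sent:(.*?)SET timestamp=(\d+);\s*$'
    # In-line patterns for the Query_time line and the Schema line.
    costRegex = r'Query_time:\s*(.*)\s*Lock_time:\s*(.*)\s*Rows_sent:\s*(\d+)\s*Rows_examined:\s*(\d+)\s*Rows_affected:\s*(\d+)\s*'
    schemaRegex = r'Schema:\s*(.*)\s*Last_errno:(.*)\s*Killed:\s*(.*)\s*'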
    
    def readSlowSqlFile(slowSqlFilename):
        # Read the whole file at once; the multi-line regex needs the full text.
        with open(slowSqlFilename) as f:
            return f.read()
    
    def findInText(regex, text):
        return re.findall(regex, text, flags=re.DOTALL+re.MULTILINE)
    
    def parseSql(sqlobj, sqlText):
        try:
            if sqlText.find('#') != -1:
                sqlobj['sql'] = sqlText.split('#')[0].strip()
                sqlobj['time'] = sqlText.split('#')[1].strip()
            else:
                sqlobj['sql'] = sqlText.strip()
                sqlobj['time'] = ''
        except:
            sqlobj['sql'] = sqlText.strip()
    
    def parseCost(sqlobj, costText):
        matched = re.match(costRegex, costText)
        sqlobj['Cost'] = costText
        if matched:
            sqlobj['QueryTime'] = matched.group(1).strip()
            sqlobj['LockTime'] = matched.group(2).strip()
            sqlobj['RowsSent'] = int(matched.group(3))
            sqlobj['RowsExamined'] = int(matched.group(4))
            sqlobj['RowsAffected'] = int(matched.group(5))
    
    def parseSchema(sqlobj, schemaText):
        matched = re.match(schemaRegex, schemaText)
        sqlobj['Schema'] = schemaText
        if matched:
            sqlobj['Schema'] = matched.group(1).strip()
            sqlobj['LastErrno'] = int(matched.group(2))
            sqlobj['Killed'] = int(matched.group(3))
    
    def parseSQLObj(matched):
        sqlobj = {}
        # Always return a dict so the caller can test len(sqlobj).
        if not matched:
            return sqlobj
        try:
            parseSql(sqlobj, matched[0].strip())
            sqlobj['UserHost'] = matched[1].strip()
            sqlobj['ByteSent'] = int(matched[4])
            sqlobj['timestamp'] = int(matched[5])
            parseCost(sqlobj, matched[3].strip())
            parseSchema(sqlobj, matched[2].strip())
        except (ValueError, IndexError):
            # A malformed field: keep whatever was parsed before the failure.
            pass
        return sqlobj
    
    
    if __name__ == '__main__':
    
        files = ['slow_sqls.txt']
    
        alltext = ''
        for f in files:
            text = readSlowSqlFile(f)
            alltext += text
        allmatched = findInText(globalRegex, alltext)
    
        tablenames = ['open_app']
    
        if not allmatched:
            print('No matched. exit.')
            exit(1)
    
        sqlobjMap = {}
        for matched in allmatched:
            sqlobj = parseSQLObj(matched)
            if len(sqlobj) == 0:
                continue
            for tablename in tablenames:
                if sqlobj['sql'].find(tablename) != -1:
                     if not sqlobjMap.get(tablename):
                         sqlobjMap[tablename] = []
                     sqlobjMap[tablename].append(sqlobj)
                     break
    
        resultMap = {}
        for (tablename, sqlobjlist) in sqlobjMap.items():
            sqlstat = {}
            for sqlobj in sqlobjlist:
                if sqlobj['sql'] not in sqlstat:
                    sqlstat[sqlobj['sql']] = 0
                sqlstat[sqlobj['sql']] += 1
            resultMap[tablename] = sqlstat
    
        f_res = open('/tmp/res.txt', 'w')
        f_res.write('-------------------------------------: \n')
        f_res.write('Brief results: \n')
        for (tablename, sqlstat) in resultMap.items():
            f_res.write('tablename: ' + tablename + '\n')
            # Most frequent statements first.
            sortedsqlstat = sorted(sqlstat.items(), key=lambda d: d[1], reverse=True)
            for sortedsql in sortedsqlstat:
                f_res.write('sql = %s\ncounts: %d\n\n' % (sortedsql[0], sortedsql[1]))
        f_res.write('-------------------------------------: \n\n')
    
        f_res.write('-------------------------------------: \n')
        f_res.write('Detail results: \n')
        for (tablename, sqlobjlist) in sqlobjMap.items():
            f_res.write('tablename: ' + tablename + '\n')
            f_res.write('sqlinfo: \n')
            for sqlobj in sqlobjlist:
                f_res.write('sql: ' + sqlobj['sql'] + ' QueryTime: ' + str(sqlobj.get('QueryTime')) + ' LockTime: ' + str(sqlobj.get('LockTime')) + '\n')
                f_res.write(str(sqlobj) + '\n\n')
        f_res.write('-------------------------------------: \n')
        f_res.close()
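
The two report structures from point (6), the detail map and the per-table count map, can also be built with the standard collections module. This is a minimal Python 3 sketch with made-up records (the shortened SQL strings and the third QueryTime value are illustrative only); it swaps in defaultdict and Counter for the manual dictionary bookkeeping and is not the article's code.

```python
from collections import Counter, defaultdict

# Hypothetical parsed records, shaped like the sqlobj dicts in the program.
sqlobjs = [
    {"sql": "SELECT * FROM open_app limit 32700,100;", "QueryTime": "0.561383"},
    {"sql": "SELECT * FROM open_app limit 36700,100;", "QueryTime": "0.501094"},
    {"sql": "SELECT * FROM open_app limit 32700,100;", "QueryTime": "0.612000"},
]
tablenames = ["open_app"]

details = defaultdict(list)    # table name -> list of sqlobj dicts
counts = defaultdict(Counter)  # table name -> {sql text: number of occurrences}

for obj in sqlobjs:
    for name in tablenames:
        if name in obj["sql"]:
            details[name].append(obj)
            counts[name][obj["sql"]] += 1
            break  # attribute each statement to the first matching table only

for name, stat in counts.items():
    # most_common() returns (sql, count) pairs sorted by descending count.
    for sql, n in stat.most_common():
        print("%s: %d x %s" % (name, n, sql))
```

Counter.most_common() replaces the explicit sorted(..., reverse=True) call, and defaultdict removes the need to test whether a table name is already present.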

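Making it configurable

In fact, the parser can be made configurable. Given the sets of keywords that separate the lines of a record and the fields within each line, both the lines and the in-line fields can be split apart and their contents extracted.

The basic building block is the function matchOneLine: given a list of keywords that split one line's content in order, it matches the line and returns the content that follows each keyword. It is used for matching within a line.

Configuration format: a list of lists. Each inner list holds the keywords that split and match a single line, and each keyword delimits one region of that line. To improve parsing performance, the regular expression for each keyword list is compiled in advance, so no work is repeated while parsing.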
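
The step from a keyword list to a precompiled pattern can be sketched in isolation. This is a Python 3 illustration under stated assumptions: the helper name build_line_pattern and the use of re.escape are not from the article, whose own function joins the keywords unescaped so that a keyword may carry regex syntax.

```python
import re

def build_line_pattern(keywords):
    """Compile a pattern that captures the text following each keyword, in order."""
    # re.escape treats every keyword as literal text (an assumption of this sketch).
    body = "(.*?)".join(re.escape(k) for k in keywords)
    return re.compile(r"^\s*" + body + r"(.*?)\s*$")

# Compile once per configured line, then reuse the pattern for every record.
conf = [["# Schema:", "Last_errno:", "Killed:"], ["# Bytes_sent:"]]
patterns = {"_".join(ks): build_line_pattern(ks) for ks in conf}

line = "# Schema: db_xxx  Last_errno: 0  Killed: 0"
m = patterns["# Schema:_Last_errno:_Killed:"].match(line)
fields = [g.strip() for g in m.groups()]
print(fields)
```

Keying the compiled patterns by the joined keyword list mirrors the lookup the full program performs for each line.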

The full code is as follows:

    #!/usr/bin/python
    #_*_encoding:utf-8_*_
    
    import re
    
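    # Keywords for each line; the first keyword of every list also separates the lines of a record.
    # r'\S' stands for the first non-space character of the SQL statement.
    ksconf = [[r'\S'], ['# User@Host:','Id:'] , ['# Schema:', 'Last_errno:', 'Killed:'], ['# Query_time:','Lock_time:', 'Rows_sent:', 'Rows_examined:', 'Rows_affected:'], ['# Bytes_sent:'], ['SET timestamp=']]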
    files = ['slow_sqls.txt']
    
    #ksconf = [['id:'], ['name:'], ['able:']]
    #files = ['stu.txt']
    
    globalConf = {'ksconf': ksconf, 'files': files}
    
    def produceRegex(keywordlistInOneLine):
        ''' build the regex to match keywords in the list of keywordlistInOneLine '''
        oneLineRegex = r"^\s*"
        oneLineRegex += "(.*?)".join(keywordlistInOneLine)
        oneLineRegex += r"(.*?)\s*$"
        return oneLineRegex
    
    def readFile(filename):
        # Read the whole file at once; the multi-line regex needs the full text.
        with open(filename) as f:
            return f.read()
    
    def readAllFiles(files):
        return ''.join(map(readFile, files))
    
    def findInText(regex, text, linesConf):
        '''
           return a list of maps, each map is a match to multilines,
                  in a map, key is the line keyword
                             and value is the content corresponding to the key
        '''
        matched = regex.findall(text)
        if empty(matched):
            return []
    
        allMatched = []
        linePatternMap = buildLinePatternMap(linesConf)
        for onematch in matched:
            oneMatchedMap = buildOneMatchMap(linesConf, onematch, linePatternMap)
            allMatched.append(oneMatchedMap)
        return allMatched
    
    def buildOneMatchMap(linesConf, onematch, linePatternMap):
        # First keyword of each configured line, kept as a list so it can be indexed.
        sepLines = [ks[0] for ks in linesConf]
        lineMatchedMap = {}
        for i in range(len(sepLines)):
            # Put back the separator keyword that the whole-record regex consumed.
            lineContent = sepLines[i] + onematch[i].strip()
            lineMatchedMap.update(matchOneLine(linesConf[i], lineContent, linePatternMap))
    
        return lineMatchedMap
    
    def matchOneLine(keywordlistOneLine, lineContent, patternMap):
        '''
           match lineContent with a list of keywords , and return a map 
           in which key is the keyword and value is the content matched the key.
           eg. 
           keywordlistOneLine = ["host:", "ip:"] , lineContent = "host: qinhost ip: 1.1.1.1"
           return {"host": "qinhost", "ip": "1.1.1.1"}   (keys are passed through cleankey)
        '''
        
        ksmatchedResult = {}
        if len(keywordlistOneLine) == 0 or lineContent.strip() == "":
            return {}
        linekey = getLineKey(keywordlistOneLine)
        
        if empty(patternMap):
            linePattern = getLinePattern(keywordlistOneLine)
        else:
            linePattern = patternMap.get(linekey)
        
        lineMatched = linePattern.findall(lineContent)
        if empty(lineMatched):
            return {}
        kslen = len(keywordlistOneLine)
        if kslen == 1:
            ksmatchedResult[cleankey(keywordlistOneLine[0])] = lineMatched[0].strip()
        else:
            for i in range(kslen):                            
                ksmatchedResult[cleankey(keywordlistOneLine[i])] = lineMatched[0][i].strip()
        
        return ksmatchedResult
    
    def empty(obj):
        return obj is None or len(obj) == 0
    
    def cleankey(dirtykey):
        ''' clean unused characters in key '''
        return re.sub(r"[# :]", "", dirtykey)
    
    def printMatched(allMatched, linesConf):
        allks = []
        for kslist in linesConf:
            allks.extend(kslist)
        for matched in allMatched:
            for k in allks:
                print('%s => %s' % (cleankey(k), matched.get(cleankey(k))))
            print('\n')
    
    def buildLinePatternMap(linesConf):
        linePatternMap = {}
        for keywordlistOneLine in linesConf:
            linekey = getLineKey(keywordlistOneLine)
            linePatternMap[linekey] = getLinePattern(keywordlistOneLine)
        return linePatternMap    
    
    def getLineKey(keywordlistForOneLine):
        return "_".join(keywordlistForOneLine)
    
    def getLinePattern(keywordlistForOneLine):
        return re.compile(produceRegex(keywordlistForOneLine))
    
    def testMatchOneLine():
        assert len(matchOneLine([], "haha", {})) == 0
        assert len(matchOneLine(["host"], "", {})) == 0
        assert len(matchOneLine("", "haha", {})) == 0 
        assert len(matchOneLine(["host", "ip"], "host:qqq addr: 1.1.1.1", {})) == 0
    
        lineMatchMap1 = matchOneLine(["id:"], "id: 123456", {"id:": re.compile(produceRegex(["id:"]))})
        assert lineMatchMap1.get("id") == "123456"
    
        lineMatchMap2 = matchOneLine(["host:", "ip:"], "host: qinhost  ip: 1.1.1.1  ", {"host:_ip:": re.compile(produceRegex(["host:", "ip:"]))})
        assert lineMatchMap2.get("host") == "qinhost"
        assert lineMatchMap2.get("ip") == "1.1.1.1"
        print('testMatchOneLine passed.')
    
    
    if __name__ == '__main__':
    
        testMatchOneLine()
    
        files = globalConf['files']
        linesConf = globalConf['ksconf']
        sepLines = [ks[0] for ks in linesConf]
    
        text = readAllFiles(files)
        wholeRegex = produceRegex(sepLines)
        print('wholeRegex: ' + wholeRegex)
    
        compiledPattern = re.compile(wholeRegex, flags=re.DOTALL+re.MULTILINE)
        allMatched = findInText(compiledPattern, text, linesConf)
        printMatched(allMatched, linesConf)

To parse the multi-line text file below instead, only the configuration needs to change: ksconf = [['id:'], ['name:'], ['able:']] and files = ['stu.txt'].

    id:1
    name:shu
    able:swim,study
    
    id:2
    name:qin
    able:sleep,run
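
As a quick check of how the id/name/able configuration behaves on this sample, here is a minimal standalone Python 3 sketch. It inlines the regex construction instead of calling the program's functions, and stripping the trailing colon from each key is a stand-in for cleankey; the variable names are chosen for this illustration.

```python
import re

# The sample file content shown above, inlined as a string.
text = "id:1\nname:shu\nable:swim,study\n\nid:2\nname:qin\nable:sleep,run\n"
keywords = ["id:", "name:", "able:"]

# Same shape as the whole-record regex: keywords joined by lazy capture groups.
whole = r"^\s*" + "(.*?)".join(keywords) + r"(.*?)\s*$"
pattern = re.compile(whole, flags=re.DOTALL | re.MULTILINE)

# findall returns one tuple of captured groups per record.
records = [
    {k.rstrip(":"): v.strip() for k, v in zip(keywords, groups)}
    for groups in pattern.findall(text)
]
print(records)
```

Each record becomes one map from field name to value, which is the same shape the configurable program prints.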

       

  • Original article: https://www.cnblogs.com/lovesqcc/p/5661313.html