• Python Basics: the re Module


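    At its core, a regular expression (RE) is a small, highly specialized programming language; in Python it is embedded in the language and exposed through the re module. A regular expression pattern is compiled into a sequence of bytecodes, which are then executed by a matching engine written in C.

    Regular expressions exist to operate on strings.
    Web crawlers deal with strings constantly, so processing scraped data means processing strings.

    Regular expressions perform fuzzy matching: a pattern describes a family of strings rather than one exact string, and that is what makes them powerful.

    Matching proceeds piece by piece: every part of the string that fits the pattern is put into the result list.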

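    Character matching (ordinary characters and metacharacters):

    1 Ordinary characters: most letters and digits match themselves
                  >>> import re
                  >>> re.findall('alex','yuanaleSxalexwupeiqi')
                          ['alex']

    2 Metacharacters: . ^ $ * + ? { } [ ] | ( ) \  # 11 metacharacters in total, counting each bracket pair as one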

    def findall(pattern, string, flags=0):
        """Return a list of all non-overlapping matches in the string.
    
        If one or more capturing groups are present in the pattern, return
        a list of groups; this will be a list of tuples if the pattern
        has more than one group.
    
        Empty matches are included in the result."""
        return _compile(pattern, flags).findall(string)
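    (the source of re.findall, shown above for reference)

    re.findall(pattern,string) # finds every match and returns them as a list

    (1) . : matches any character except \n (newline)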

    print(re.findall("a.+d","abcd"))
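As a small sketch of the rule for the dot, the snippet below shows that it skips a newline unless the re.DOTALL flag is passed. The sample string is made up for illustration.

```python
import re

# By default "." matches any character except a newline.
no_newline = re.findall("a.c", "abc a\nc")
# With re.DOTALL, "." also matches the newline character.
with_newline = re.findall("a.c", "abc a\nc", flags=re.DOTALL)

print(no_newline)    # ['abc']
print(with_newline)  # ['abc', 'a\nc']
```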

    (2) ^ : matches at the start of the string

    print(re.findall("^luchuan","luchuan123asd"))
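A related detail, sketched here with an invented two-line string: by default ^ anchors only at the very start of the string, while the re.MULTILINE flag lets it anchor at the start of every line.

```python
import re

text = "luchuan123\nluchuan456"
# Without flags, "^" anchors only at the very start of the string.
start_only = re.findall("^luchuan", text)
# With re.MULTILINE, "^" also anchors right after each newline.
every_line = re.findall("^luchuan", text, flags=re.MULTILINE)

print(start_only)  # ['luchuan']
print(every_line)  # ['luchuan', 'luchuan']
```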

    (3) * + ? {} : repetition

    print(re.findall("[0-9]{4}","asd1231asd123"))
    print(re.findall("[0-9]{1,}","asd1231asd123"))
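The four repetition operators can be compared side by side on one string. This is only an illustrative sketch; the sample text and variable names are assumptions, not part of the original post.

```python
import re

text = "a ab abb abbb"
star = re.findall("ab*", text)        # "b" zero or more times
plus = re.findall("ab+", text)        # "b" one or more times
optional = re.findall("ab?", text)    # "b" zero or one time
ranged = re.findall("ab{2,3}", text)  # "b" two to three times

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']
print(ranged)    # ['abb', 'abbb']
```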

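    Greedy matching: # the more commonly used form

    print(re.findall(r"\d+","af5324jh523hgj34gkhg53453")) # each match takes as many digits as possible

    Non-greedy matching:

    print(re.findall(r"\d+?","af5324jh523hgj34gkhg53453")) # each match takes as few digits as possible: one
    print(re.findall(r"\d","af5324jh523hgj34gkhg53453")) # same result as the line above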
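The difference between the two modes is easier to see when the repeated part can run past a delimiter. A minimal sketch, using an invented HTML-like string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"
# Greedy: ".+" runs to the last ">" in the string.
greedy = re.findall("<.+>", html)
# Non-greedy: ".+?" stops at the first ">" it can.
lazy = re.findall("<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```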

    (4) Character set []: matches any one of the characters listed

    print(re.findall("a[bc]d","hasdabdjhacd"))

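    Note: inside [], metacharacters such as * + . are ordinary characters; only - ^ \ keep a special meaning:

    print(re.findall("[0-9]+","dash234sdfj223"))
    print(re.findall(r"\d+","dash234sdfj223")) # \d is equivalent to [0-9] here

    print(re.findall("[a-z]+","dash234sdfj223"))

    print(re.findall("[^2]","d2a2")) # ^ at the start of a set negates it
    print(re.findall(r"[^\d]","d2a2"))
    print(re.findall(r"[^\d]+","d2a24sdf2ff23df21sfsf32d2d21d"))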
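The rules for what stays special inside a character set can be checked directly. The following is a sketch with made-up inputs:

```python
import re

# Inside [], "." "*" "+" are literal characters.
literals = re.findall("[.*+]", "1+2*3.0")
# "-" between two characters forms a range.
hex_runs = re.findall("[0-9a-f]+", "ff00zz7e")
# An escaped "-" is a literal dash.
dash_literal = re.findall(r"[a\-z]", "a-z b")
# "^" negates only when it is the first character of the set.
caret_literal = re.findall("[2^]", "2^3")

print(literals)       # ['+', '*', '.']
print(hex_runs)       # ['ff00', '7e']
print(dash_literal)   # ['a', '-', 'z']
print(caret_literal)  # ['2', '^']
```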

    (5) (): grouping

    print(re.findall("(ad)+","addd"))
    print(re.findall("(ad)+luchuan","adddluchuangfsdui"))
    print(re.findall("(ad)+luchuan","adadluchuangfsdui")) # the whole of adadluchuan matched, but only the group ad goes into the list
    print(re.findall("(?:ad)+luchuan","adadluchuangfsdui")) # (?:...) makes the group non-capturing, so the whole match is returned
    print(re.findall(r"(\d)+luchuan","ad12343luchuangfs234dui"))
    print(re.findall(r"(?:\d)+luchuan","ad12343luchuangfs234dui"))
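How findall reports groups depends on how many capturing groups the pattern has. A short sketch of the three cases follows; the second sample string is invented.

```python
import re

# One group: findall returns the group text, not the whole match.
one_group = re.findall(r"(\d)+luchuan", "ad12343luchuangfs234dui")
# Two groups: findall returns a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\d+)", "a=1 b=22")
# A match object still exposes the whole match as group(0);
# a repeated group keeps only its last repetition.
m = re.search(r"(\d)+luchuan", "ad12343luchuangfs234dui")
whole, last_digit = m.group(0), m.group(1)

print(one_group)   # ['3']
print(two_groups)  # [('a', '1'), ('b', '22')]
print(whole)       # 12343luchuan
print(last_digit)  # 3
```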

    Named groups:

    ret=re.findall(r"\w+\.aticles\.\d{2}","lu.aticles.1234")
    print(ret)
    ret=re.findall(r"(\w+)\.aticles\.(\d{2})","lu.aticles.1234")
    print(ret)
    ret=re.search(r"(?P<author>\w+)\.aticles\.(?P<id>\d{2})","lu.aticles.1234") # named groups: values can be fetched by name
    print(ret.group("id"))
    print(ret.group("author"))
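Beyond group("name"), a match object offers a few other ways to read named groups. A minimal sketch using the same pattern and string as above:

```python
import re

m = re.search(r"(?P<author>\w+)\.aticles\.(?P<id>\d{2})", "lu.aticles.1234")

# groupdict() returns every named group as a dictionary.
fields = m.groupdict()
# Named groups are still numbered, so group(1) is the same as group("author").
same = m.group(1) == m.group("author")
# span("id") gives the start and end index of the group in the string.
id_span = m.span("id")

print(fields)   # {'author': 'lu', 'id': '12'}
print(same)     # True
print(id_span)  # (11, 13)
```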

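    (6) | : or

    print(re.findall(r"www\.(oldboy|baidu)\.com","www.oldboy.com")) # unnamed capturing group: only the group text is returned
    print(re.findall(r"www\.(?:oldboy|baidu)\.com","www.oldboy.com")) # non-capturing group: the whole match is returned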
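One thing worth checking with | is its scope: it splits the entire pattern unless a group limits it. A sketch with invented sample words:

```python
import re

# Inside a group, "|" chooses between "a" and "e" only.
scoped = re.findall("gr(?:a|e)y", "gray grey griy")
# Without a group, "|" splits the whole pattern into "gra" or "ey".
unscoped = re.findall("gra|ey", "gray grey")

print(scoped)    # ['gray', 'grey']
print(unscoped)  # ['gra', 'ey']
```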

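    (7) \ : escape

    1 Placed before a metacharacter, it turns it into an ordinary character: \. \*
    2 Placed before certain ordinary characters, it gives them a special meaning, e.g. \d \w

    print(re.findall(r"-?\d+\.?\d*\*\d+\.?\d*","-2*6+7*45+1.4*3-8/4"))
    print(re.findall(r"\w","$da@s4 234"))
    print(re.findall(r"a\sb","a badf"))
    print(re.findall(r"\bI","hello I am LIA")) # in a normal string \b is the ASCII backspace character, so a raw string is needed
    print(re.findall("\\bI","hello I am LIA")) # without a raw string, the backslash must be doubled
    print(re.findall(r"\bI","hello$I am LIA"))
    print(re.findall("c\\\\l","abc\\l")) # the Python parser turns \\ into \, and re then turns \\ into \, so four backslashes are needed
    print(re.findall(r"c\\l",r"abc\l")) # a raw string passes the backslashes to re unchanged, so two are enough
    print(re.findall(r"\d+\.?\d*\*\d+\.?\d*","3.5*22+3*2+4.5*33-8+2"))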
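The two layers of escaping (Python string literal first, then the re module) can be verified with a few assertions. The sketch below also uses re.escape, a standard-library helper not mentioned in the original post, to build a literal pattern without escaping by hand.

```python
import re

# In a normal literal "\b" is one character (backspace);
# in a raw literal it stays as two characters, backslash and "b".
plain_len, raw_len = len("\b"), len(r"\b")

# Four backslashes in a normal literal equal two in a raw literal.
same_pattern = "c\\\\l" == r"c\\l"

# re.escape() backslash-escapes the metacharacters in a literal string.
literal = re.escape("1.4*3")
found = re.findall(literal, "1.4*3 and 144*3")

print(plain_len, raw_len)  # 1 2
print(same_pattern)        # True
print(found)               # ['1.4*3']
```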

    Methods of re:

    s=re.finditer(r"\d+","ad324das32") # returns an iterator of match objects
    print(s)

    print(next(s).group()) # next() yields a match object; call group() to get the matched text
    print(next(s).group())
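In practice the iterator is usually consumed with a loop or a comprehension rather than repeated next() calls. A minimal sketch that also records where each match sits in the string:

```python
import re

# Collect the matched text and its (start, end) position for every match.
matches = [(m.group(), m.span()) for m in re.finditer(r"\d+", "ad324das32")]

print(matches)  # [('324', (2, 5)), ('32', (8, 10))]
```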

    search: returns only the first match

    ret=re.search(r"\d","jksf34asd3") # search is handy when building a calculator
    print(ret)
    print(ret.group()) # read the value with group(); if ret is None, nothing matched
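Because search returns None when nothing matches, calling group() directly raises AttributeError in that case. One way to guard against it is sketched below; the helper name first_number is hypothetical.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    # Check the match object before calling group() on it.
    return m.group() if m else None

print(first_number("jksf34asd3"))  # 34
print(first_number("no digits"))   # None
```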

    match: matches only at the start of the string

    ret=re.match(r"\d+","432jksf34asd3")
    print(ret)
    print(ret.group())
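To make the contrast concrete, the sketch below runs match, search, and fullmatch (a standard re function not covered above) against invented inputs.

```python
import re

text = "abc432"
at_start = re.match(r"\d+", text)          # digits are not at the start
anywhere = re.search(r"\d+", text)         # scans forward and finds them
whole_ok = re.fullmatch(r"\d+", "432")     # the entire string is digits
whole_bad = re.fullmatch(r"\d+", "432jk")  # trailing letters, so no match

print(at_start)          # None
print(anywhere.group())  # 432
print(whole_ok.group())  # 432
print(whole_bad)         # None
```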
    

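    split: splits a string wherever the pattern matches

    s2=re.split(r"\d+","fh233jfd324sfsa213190sdf",maxsplit=2) # split at most twice
    print(s2)

    ret3=re.split("l","hello luchuan") # adjacent delimiters produce an empty string
    print(ret3)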
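Two variations on split that often come up, sketched with made-up strings: a capturing group keeps the delimiters in the result, and a character set splits on several delimiters at once.

```python
import re

# A capturing group in the pattern keeps the delimiters in the output.
with_delims = re.split(r"(\d+)", "fh233jfd324sfsa")
# A character set splits on commas, semicolons, and whitespace together.
fields = re.split(r"[,;\s]+", "a, b;c  d")

print(with_delims)  # ['fh', '233', 'jfd', '324', 'sfsa']
print(fields)       # ['a', 'b', 'c', 'd']
```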

    re.sub: replaces matches

    ret4=re.sub(r"\d+","A","hello 234jkhh23") # replaces every match
    print(ret4)
    ret4=re.sub(r"\d+","A","hello 234jkhh23",count=1) # replaces only the first match
    print(ret4)
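The replacement does not have to be a fixed string. As a sketch with an invented first input: it can refer back to groups with \1 and \2, or it can be a function that receives each match object.

```python
import re

# Backreferences in the replacement swap the two captured parts.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")
# A function replacement computes a new string for every match.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "hello 234jkhh23")

print(swapped)  # host@user
print(doubled)  # hello 468jkhh46
```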

    re.subn: like sub, but returns a tuple of the new string and the number of replacements

    ret4=re.subn(r"\d+","A","hello 234jkhh23")
    print(ret4)

    compile: compiles a pattern into a reusable object; for a single use it gains little, but it pays off when matching many strings

    c=re.compile(r"\d+")
    ret5=c.findall("hello32world53")
    print(ret5)
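The benefit shows up when one pattern is applied to many inputs. A minimal sketch, assuming a small list of sample lines:

```python
import re

# Compile once, then reuse the pattern object for every line.
number = re.compile(r"\d+")
lines = ["hello32world53", "no digits", "7 up"]
results = [number.findall(line) for line in lines]

print(results)  # [['32', '53'], [], ['7']]
```

The compiled object exposes the same methods as the module-level functions (findall, search, match, split, sub, and so on), without the pattern argument.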
    

    Link: http://www.cnblogs.com/yuanchenqi/articles/5732581.html

  • Original article: https://www.cnblogs.com/luchuangao/p/6776384.html