import re
1.findall
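import re
ret = re.findall(r"\d+", "123456sdfasdgfd789")
print(ret)
Result: ['123456', '789']
Return type: list. Number of items returned here: 2. Contents: every match found in the string.
import re
ret = re.findall(r"\s", "123456sdfasdgfd789")
print(ret)
Result: [] (the string contains no whitespace, so the list is empty)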
2.search
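import re
ret = re.search(r"\d+", "123456sdfasdgfd789")
print(ret)
Result: <re.Match object; span=(0, 6), match='123456'>
Return type: a match object. Number of results: 1. When the pattern matches, search returns a match object for the first match only.
import re
ret = re.search(r"\d+", "123456sdfasdgfd789")
print(ret.group())
Result: 123456
Call group() on the returned object to get the text of that first match.
import re
ret = re.search(r"\s+", "123456sdfasdgfd789")
print(ret)
Result: None
Return value: None. When nothing matches, search returns None instead of a match object.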
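Because search can return None, calling group() on the result without checking raises AttributeError. A minimal sketch of a guard follows; the helper name first_number is made up for this example.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    # search returns None on failure, so check before calling group()
    return m.group() if m else None

print(first_number("123456sdfasdgfd789"))  # 123456
print(first_number("no digits here"))      # None
```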
3.match
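import re
ret = re.match(r"\d+", "123456sdfasdgfd789")
print(ret)
Result: <re.Match object; span=(0, 6), match='123456'>
import re
ret = re.match(r"\d+", "123456sdfasdgfd789")
print(ret.group())
Result: 123456
Compared side by side, search and match give the same result here, but they do not always agree: match only succeeds when the pattern matches at the very start of the string, while search scans the whole string.
import re
ret = re.search(r"\d+", "#$%^&*123456sdfasdgfd789")
print(ret)
ret = re.match(r"\d+", "#$%^&*123456sdfasdgfd789")
print(ret)
Result:
<re.Match object; span=(6, 12), match='123456'>
None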
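One way to picture the difference is that match behaves like search with the pattern pinned to position 0. The sketch below uses \A (start of string) to show this on the same input; the variable names are only for illustration.

```python
import re

text = "#$%^&*123456sdfasdgfd789"

anchored = re.search(r"\A\d+", text)  # search, but forced to start at position 0
plain_match = re.match(r"\d+", text)  # match always starts at position 0
found = re.search(r"\d+", text)       # search may start anywhere

print(anchored, plain_match)  # None None
print(found.span())           # (6, 12)
```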
4.sub (replacement)
str.replace, for comparison
print("replace789,2afseeeefgeesfs".replace("e", "HH"))
Result: rHHplacHH789,2afsHHHHHHHHfgHHHHsfs
print("replace789,2afseeeefgeesfs".replace("e", "HH", 3))
Result: rHHplacHH789,2afsHHeeefgeesfs
re.sub replacement
import re
ret = re.sub(r"\d+", "HH", "replace789,2afseeeefgeesfs")
print(ret)
Result: replaceHH,HHafseeeefgeesfs
import re
ret = re.sub(r"\d+", "HH", "replace789,2afseeeefgeesfs", count=1)
print(ret)
Result: replaceHH,2afseeeefgeesfs
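Unlike str.replace, re.sub also accepts a function as the replacement: it is called with each match object and its return value is substituted. A small sketch on the same string; the helper name double is invented for this example.

```python
import re

def double(m):
    # m is the match object for one run of digits
    return str(int(m.group()) * 2)

ret = re.sub(r"\d+", double, "replace789,2afseeeefgeesfs")
print(ret)  # replace1578,4afseeeefgeesfs
```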
5.subn
import re
ret = re.subn(r"\d+", "HH", "replace789,2afseeeefgeesfs")
print(ret)
Result: ('replaceHH,HHafseeeefgeesfs', 2)
subn returns a tuple: the new string and the number of replacements made.
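The count returned by subn is handy for telling whether anything was replaced at all. A minimal sketch, with mask_digits as a hypothetical helper name:

```python
import re

def mask_digits(text):
    """Replace every run of digits with HH and report whether text changed."""
    new_text, n = re.subn(r"\d+", "HH", text)
    return new_text, n > 0

print(mask_digits("replace789,2afseeeefgeesfs"))  # ('replaceHH,HHafseeeefgeesfs', True)
print(mask_digits("abc"))                         # ('abc', False)
```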
6.split
import re
ret = re.split(r"\d+", '小明18小红25小李32')
print(ret)
Result: ['小明', '小红', '小李', '']
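The trailing '' appears because the string ends with a separator (32), so split reports the empty piece after it. If the empty pieces are not wanted, one simple option is to filter them out:

```python
import re

parts = re.split(r"\d+", "小明18小红25小李32")
names = [p for p in parts if p]  # drop empty strings left by leading/trailing separators

print(parts)  # ['小明', '小红', '小李', '']
print(names)  # ['小明', '小红', '小李']
```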
7.compile # time efficiency
import re
ret = re.compile(r"-?\d+(\.\d*)")
res = ret.search("54.2fvdsfas5633656-5213sf5")
print(res)
print(res.group())
Result:
<re.Match object; span=(0, 4), match='54.2'>
54.2
Saves time: compile only improves efficiency when the same regular expression is used many times.
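A sketch of the "used many times" case: the pattern is compiled once and reused on every line. The input lines are made up, and the fractional part is made optional here (an assumption that differs from the pattern above) so that plain integers also match.

```python
import re

number = re.compile(r"-?\d+(\.\d*)?")  # compiled once, outside the loop

lines = ["54.2fvdsfas", "abc-5213sf5", "no number"]
firsts = []
for line in lines:
    m = number.search(line)  # the same compiled object is reused on every line
    firsts.append(m.group() if m else None)

print(firsts)  # ['54.2', '-5213', None]
```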
8.finditer # space efficiency
import re
ret = re.finditer(r"\d", "asfsjgs552dfs df663adaf8sf")
print(ret)  # prints the iterator object itself, not the matches
for i in ret:
    print(i)
Result (from the loop):
<re.Match object; span=(7, 8), match='5'>
<re.Match object; span=(8, 9), match='5'>
<re.Match object; span=(9, 10), match='2'>
<re.Match object; span=(16, 17), match='6'>
<re.Match object; span=(17, 18), match='6'>
<re.Match object; span=(18, 19), match='3'>
<re.Match object; span=(23, 24), match='8'>
Using print(i.group()) in the loop instead gives:
5
5
2
6
6
3
8
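The space saving comes from finditer producing matches one at a time instead of building the whole list. A small sketch that pulls only the first three matches with itertools.islice and then resumes the same iterator:

```python
import re
from itertools import islice

it = re.finditer(r"\d", "asfsjgs552dfs df663adaf8sf")

# Only three match objects are created at this point
first_three = [m.group() for m in islice(it, 3)]
print(first_three)  # ['5', '5', '2']

# The iterator continues from where it stopped
rest = [m.start() for m in it]
print(rest)  # [16, 17, 18, 23]
```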
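Regular expressions in Python
1 findall gives priority to group contents: when the pattern has a group, only the group's text is returned. To cancel group priority, write the group as (?:regex)
2 split: when the pattern contains a group, the separator text captured by the group is kept in the result
3 search: when the pattern contains groups, group(n) returns the text matched by the n-th group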
Groups----findall
import re
ret = re.findall("www.baidu.com|www.xinlang.com", "www.xinlang.com")
print(ret)
Result: ['www.xinlang.com']
import re
ret = re.findall("www.(baidu|xinlang).com", "www.xinlang.com")
print(ret)
Result: ['xinlang']
import re
ret = re.findall("www.(?:baidu|xinlang).com", "www.xinlang.com")
print(ret)
Result: ['www.xinlang.com']
import re
ret = re.findall(r'-0\.\d+|-[1-9]\d*(\.\d+)?', '-100aass-0.23ddxx-200')
print(ret)
Result: ['', '', '']
import re
ret = re.findall(r'-0\.\d+|-[1-9]\d*(?:\.\d+)?', '-100aass-0.23ddxx-200')
print(ret)
Result: ['-100', '-0.23', '-200']
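When a pattern has more than one capturing group, findall returns a list of tuples, one tuple per match. A short sketch; the ".cn" address is invented for this example.

```python
import re

ret = re.findall(r"www\.(baidu|xinlang)\.(com|cn)", "www.baidu.com www.xinlang.cn")
print(ret)  # [('baidu', 'com'), ('xinlang', 'cn')]
```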
Groups-----split
import re
ret = re.split(r"\d+", "小明12小红13小李14")
print(ret)
Result: ['小明', '小红', '小李', '']
import re
ret = re.split(r"(\d+)", "小明12小红13小李14")
print(ret)
Result: ['小明', '12', '小红', '13', '小李', '14', '']
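Since the grouped split alternates between text and separator, the two can be paired back up by slicing. A minimal sketch, assuming the input always alternates name and number as it does here:

```python
import re

parts = re.split(r"(\d+)", "小明12小红13小李14")

# Even positions hold the names, odd positions hold the captured numbers;
# zip stops at the shorter list, which drops the trailing ''
pairs = list(zip(parts[0::2], parts[1::2]))
print(pairs)  # [('小明', '12'), ('小红', '13'), ('小李', '14')]
```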
Groups-----search
import re
ret = re.search(r"\d+(\.\d+)(\.\d+)(\.\d+)?", "1.2.3.4.5.6.7.899998885")
print(ret.group())
print(ret.group(0))
print(ret.group(1))
print(ret.group(2))
print(ret.group(3))
Result:
1.2.3.4
1.2.3.4
.2
.3
.4
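The third group in that pattern is optional. When an optional group takes no part in the match, group(n) returns None rather than an empty string, as this sketch on a shorter input shows:

```python
import re

m = re.search(r"\d+(\.\d+)(\.\d+)(\.\d+)?", "1.2.3")

print(m.group())             # 1.2.3
print(m.group(3))            # None, the optional group did not participate
print(m.groups())            # ('.2', '.3', None)
print(m.groups(default=""))  # ('.2', '.3', '')
```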
Group exercises
1 Match all the integers
import re
ret = re.findall(r"\d+", "1-2*(60+(-40.35/5)-(-4*3))")
print(ret)
Result: ['1', '2', '60', '40', '35', '5', '4', '3']
The result shows that 40.35 was split into 40 and 35.
import re
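ret = re.findall(r"\d+(?:\.\d+)|(\d+)", "1-2*(60+(-40.35/5)-(-4*3))")
ret.remove("")
print(ret)
Result: ['1', '2', '60', '5', '4', '3']
The decimal 40.35 is consumed by the first alternative, which has no capturing group, so findall reports it as '' and remove("") drops it.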
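Note that list.remove deletes only the first '' and raises ValueError when there is none, so the approach above works only for an expression with exactly one decimal. A sketch of a more general variant that filters with a list comprehension; integers_only is a hypothetical helper name.

```python
import re

def integers_only(expr):
    """Return the integers in expr, skipping any number that has a decimal part."""
    found = re.findall(r"\d+\.\d+|(\d+)", expr)
    # Decimals match the first alternative and show up as '', so filter them all out
    return [x for x in found if x]

print(integers_only("1-2*(60+(-40.35/5)-(-4*3))"))  # ['1', '2', '60', '5', '4', '3']
print(integers_only("1.5+2.5+3"))                   # ['3']
```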
2 From a string like <a>wahaha<a>, match wahaha, then match a
import re
ret = re.findall(r">(\w+)<", r"<a>wahaha<a>")
print(ret)
Result: ['wahaha']
import re
ret = re.findall(r"<(\w+)>", r'<a>wahaha<a>')
print(ret)
Result: ['a', 'a'] (the string contains <a> twice, so findall returns both)
import re
ret = re.search(r"<(\w+)>(\w+)</(\w+)>", r"<a>wahaha</b>")
print(ret.group())
print(ret.group(1))
print(ret.group(2))
print(ret.group(3))
Result:
<a>wahaha</b>
a
wahaha
b
Advanced regular expressions
1 Named groups
(?P<name>regex) gives the group a name
(?P=name) refers back to that group; the text matched here must be exactly the same as the text the group matched
2 Using groups by index
\1 refers to the first group; the text matched here must be exactly the same as the text the first group matched
import re
ret = re.search(r"<(?P<name>\w+)>\w+</(?P=name)>", "<a>wahaha</a>")
print(ret.group())
print(ret.group("name"))
Result:
<a>wahaha</a>
a
import re
ret = re.search(r'<(\w+)>\w+</\1>', r"<a>wahaha</a>")
print(ret.group(1))
print(ret.group())
Result:
a
<a>wahaha</a>
import re
ret = re.search(r"<(?P<asd>\w+)>(?P<zxc>\w+)</\w+>", r"<a>wahaha</a>")
print(ret.group("asd"))
print(ret.group("zxc"))
Result:
a
wahaha
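Two things follow from the named-group syntax: groupdict() returns all named groups at once, and a (?P=name) backreference makes the whole match fail when the closing tag differs. A short sketch; the group names tag and body are chosen for this example.

```python
import re

pattern = r"<(?P<tag>\w+)>(?P<body>\w+)</(?P=tag)>"

m = re.search(pattern, "<a>wahaha</a>")
print(m.groupdict())  # {'tag': 'a', 'body': 'wahaha'}

# The closing tag is b, not a, so the backreference cannot match
mismatch = re.search(pattern, "<a>wahaha</b>")
print(mismatch)  # None
```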
Web crawler
import re
from urllib.request import urlopen
# urllib is a built-in package; urlopen fetches a page's source code as a string
# res = urlopen('http://www.cnblogs.com/Eva-J/articles/7228075.html')
# print(res.read().decode('utf-8'))
def getPage(url):
    response = urlopen(url)
    return response.read().decode('utf-8')
def parsePage(s):  # s: the page source
    ret = com.finditer(s)
    for i in ret:
        ret = {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num")
        }
        yield ret
def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num  # num is the start offset, 0 for the first page
    response_html = getPage(url)  # response_html is the page source (str)
    ret = parsePage(response_html)  # a generator
    print(ret)
    f = open("move_info7", "a", encoding="utf8")
    for obj in ret:
        print(obj)
        data = str(obj)
        f.write(data + "\n")
    f.close()
com = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
count = 0
for i in range(10):
    main(count)  # count = 0 on the first pass
    count += 25
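The pattern can be tried without any network access by running it on a small HTML string. The sketch below assumes a made-up fragment shaped the way the pattern expects; sample_html and its values are invented here, and the real page layout may differ.

```python
import re

# Same pattern as the crawler's, with the \d escape written out
com = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

# Hypothetical fragment written only for this example
sample_html = """
<div class="item">
  <div class="pic">
    <em class="">1</em>
  </div>
  <span class="title">Example Movie</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>12345人评价</span>
</div>
"""

# re.S lets .*? cross the line breaks between the tags
rows = [m.groupdict() for m in com.finditer(sample_html)]
print(rows)
```

Each entry in rows is a dict with the same four keys that parsePage yields.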
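# flags has many possible values:
#
# re.I (IGNORECASE): ignore case; the name in parentheses is the full spelling
# re.M (MULTILINE): multi-line mode, changes the behavior of ^ and $
# re.S (DOTALL): the dot matches any character, including the newline
# re.L (LOCALE): locale-aware matching; makes \w, \W, \b, \B, \s, \S depend on the current environment; not recommended
# re.U (UNICODE): \w \W \s \S \d \D follow the Unicode character properties; this flag is the default in Python 3
# re.X (VERBOSE): verbose mode; the pattern string can span multiple lines, whitespace is ignored, and comments can be added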
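A few of these flags in action, as a small sketch. The test strings are made up, and the verbose pattern is a variation on the negative-number pattern used earlier.

```python
import re

ignore_case = re.findall(r"abc", "ABC abc", re.I)
print(ignore_case)  # ['ABC', 'abc']

# With re.M, ^ matches at the start of every line, not just the start of the string
multi = re.findall(r"^\d+", "1\n22\n333", re.M)
single = re.findall(r"^\d+", "1\n22\n333")
print(multi, single)  # ['1', '22', '333'] ['1']

# Without re.S the dot does not match a newline
no_dotall = re.search(r"a.b", "a\nb")
dotall = re.search(r"a.b", "a\nb", re.S)
print(no_dotall, dotall is not None)  # None True

# re.X allows whitespace and comments inside the pattern
number = re.compile(r"""
    -?          # optional minus sign
    \d+         # integer part
    (?:\.\d+)?  # optional fractional part
""", re.X)
verbose = number.findall("-100aass-0.23ddxx-200")
print(verbose)  # ['-100', '-0.23', '-200']
```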