Python Regular Expressions
1. Basic Concepts of Regular Expressions

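Background

When we need to match strings that start with xxx, end with xxx, and so on, each kind of match needs its own function or statement. A regular expression abstracts the matching method into a rule, and that rule is then used to match text or data.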

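Concept

A regular expression is a single string that describes and matches a whole set of strings conforming to a given syntax rule.

It is a logical formula for operating on strings.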

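Use cases

Processing text or data

How matching works

The engine takes the expression and the target data and compares them character by character. If every character matches, the match succeeds; otherwise it fails.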

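2. Python Regular Expressions: the re Module

Built-in string search methods

str1.find(str2)

str1.startswith(str2)

str1.endswith(str2)

For details, see section 1.4 of "python基础--02 Python内置基本类型" (Python Basics 02: Python Built-in Basic Types).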
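As a quick reminder of what these three built-in methods return, here is a minimal sketch. The sample string `text` and its contents are made up for illustration.

```python
# Built-in string lookups; the sample text is an assumption for illustration.
text = "study python regex"

# find() returns the index of the first occurrence, or -1 if it is absent.
index_found = text.find("python")
index_missing = text.find("java")

# startswith() and endswith() return booleans.
starts = text.startswith("study")
ends = text.endswith("regex")

print(index_found, index_missing, starts, ends)
```

These methods only handle fixed substrings, which is why the re module is needed for rule-based matching.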

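Using the re module

Import the re module

import re

Create a Pattern instance

pa = re.compile(pattern, flags)

Parameters

         pattern (the regular expression)

                  Preferably a raw string.

                  If the whole regular expression is wrapped in parentheses, e.g. r'(study)', the groups() method of the resulting Match instance returns the matched string in a tuple. groups() holds one element per pair of capturing parentheses, so with a single group the tuple always has exactly one element.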
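The following sketch shows why a raw string is convenient and how groups() relates to the parentheses in the pattern. The sample sentences and variable names are assumptions chosen for illustration.

```python
import re

# A raw string keeps backslashes literal, so \s and \d reach the regex
# engine unchanged instead of being treated as string escapes.
single = re.compile(r'(study)')
double = re.compile(r'(study)\s(\d+)', re.IGNORECASE)

# One capture group gives a one-element tuple.
single_groups = single.search('study hard').groups()

# Two capture groups give a two-element tuple, in left-to-right order.
double_groups = double.search('I Study 24 hours').groups()

print(single_groups)
print(double_groups)
```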

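         flags

                  re.A | re.ASCII

                           Affects \w, \W, \b, \B, \d, \D, \s and \S: with this flag the compiled pattern matches only ASCII characters for these classes, instead of the full Unicode range.

                  re.I | re.IGNORECASE

                           Ignore case when matching.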
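A small sketch of these two flags follows. The sample strings are assumptions; the full-width digits are used only because they are Unicode decimal digits that are not ASCII.

```python
import re

# re.I: the same pattern matches every capitalisation.
ignore_case = re.findall(r'python', 'Python PYTHON python', re.I)

# In Python 3, \d matches any Unicode decimal digit by default.
# With re.A it is restricted to the ASCII digits 0-9.
fullwidth_digits = '１２３'  # full-width digits, not ASCII
unicode_hit = re.findall(r'\d+', fullwidth_digits)
ascii_hit = re.findall(r'\d+', fullwidth_digits, re.A)

print(ignore_case, unicode_hit, ascii_hit)
```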

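                  re.M | re.MULTILINE

By default, the metacharacter ^ matches only at the start of the string, and $ matches only at the end of the string and just before a newline that ends the string (if there is one).

With this flag, ^ matches at the start of the string and at the start of every line, i.e. immediately after each newline.

Likewise, $ matches at the end of the string and at the end of every line, i.e. immediately before each newline.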
```
>>> p = re.compile(r'(^hello$)\s(^hello$)\s(^hello$)\s')
>>> m = p.search('hello\nhello\nhello\n')
>>> print(m)
None
>>> p = re.compile(r'(^hello$)\s(^hello$)\s(^hello$)\s', re.M)
>>> m = p.search('\nhello\nhello\nhello\n')
>>> m.groups()
('hello', 'hello', 'hello')
```
re.S | re.DOTALL

Makes the . metacharacter match any character at all, including a newline.
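A minimal sketch of the difference, using a made-up two-line string:

```python
import re

text = 'first\nsecond'  # sample input, an assumption for illustration

# Without re.S the dot refuses to match the newline, so there is no match.
default_match = re.search(r'first.second', text)

# With re.S the dot also matches '\n'.
dotall_match = re.search(r'first.second', text, re.S)

print(default_match)
print(dotall_match.group() == text)
```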

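re.X | re.VERBOSE

This flag lets you write more readable regular expressions and format them freely.

When it is set, whitespace inside the regular expression is ignored, unless the whitespace is inside a character class [ ] or is preceded by a backslash.

This lets you indent the regular expression so that it is better organized and clearer. You can also put comments inside the regular expression; the regex engine ignores them.

A comment starts with the '#' character. So to use a literal '#' in the regular expression, either put a backslash in front of it, '\#', or place it inside a character class, [#].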
```
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)
```
Without the re.VERBOSE flag, this is equivalent to:
```
charref = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")
```
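To check that the verbose and compact forms really behave the same, the sketch below runs both against a few sample entity references. The sample list and the variable names are assumptions for illustration.

```python
import re

verbose_charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

compact_charref = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

# Octal, decimal and hexadecimal references, plus a named entity
# that neither pattern should accept.
samples = ['&#065;', '&#65;', '&#x41;', '&amp;']
verbose_results = [bool(verbose_charref.fullmatch(s)) for s in samples]
compact_results = [bool(compact_charref.fullmatch(s)) for s in samples]

print(verbose_results)
print(compact_results)
```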
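Matching with a Pattern instance

match()    Tries to match exactly at the given position of the string (the beginning by default), stops at the first match, and returns a Match object.

    match(string[, pos[, endpos]]) --> match object or None.

    Matches zero or more characters at the beginning of the string

search()   Scans forward from the given position and tries to match at any position, stops at the first match, and returns a Match object.

    search(string[, pos[, endpos]]) --> match object or None.

    Scan through string looking for a match, and return a corresponding match object instance. Return None if no position in the string matches.

findall()    Scans forward from the given position, keeps going after each match, and returns a list of all matched strings in the string.

Note: if the regular expression contains () groups, findall returns a list of the strings captured by the groups rather than the whole matches (a list of tuples when there is more than one group).

    findall(string[, pos[, endpos]]) --> list.

    Return a list of all non-overlapping matches of pattern in string.

finditer() Scans forward from the given position, keeps going after each match, and returns an iterator over all the Match objects.

    finditer(string[, pos[, endpos]]) --> iterator.

    Return an iterator over all non-overlapping matches for the RE pattern in string. For each match, the iterator returns a match object.

sub()         Replaces the parts of the string matched by the regular expression with repl, up to the given number of times (all occurrences by default). repl can be a string or a function.

When repl is a function, it is called with each Match object, and its return value is used as the replacement text for that match.

    sub(repl, string[, count = 0]) --> newstring

    Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

split()        Splits the string at the parts matched by the regular expression, up to the given number of times (all by default), and returns the resulting list.

    split(string[, maxsplit = 0]) --> list.

    Split string by the occurrences of pattern.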
```
# Import the re module
>>> import re
# Create a Pattern object
>>> pa = re.compile(r'(ddd)')
# Match with the Pattern object's match method to get a Match object.
# match() must succeed exactly at position 5, which is 's', so it returns None.
>>> ma = pa.match('dddsssdddsssddd\ndddsssdddsssddd', 5)
>>> ma.groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
```
```
# Match with the Pattern object's search method to get a Match object
>>> ma = pa.search('dddsssdddsssddd\ndddsssdddsssddd', 5)
>>> ma.groups()
('ddd',)
```
```
# Match with the Pattern object's findall method to get a list of the matched strings
>>> ma = pa.findall('dddsssdddsssddd\ndddsssdddsssddd', 5)
>>> ma
['ddd', 'ddd', 'ddd', 'ddd', 'ddd']
```
```
# Match with the Pattern object's finditer method to get an iterator of Match objects
>>> for i in pa.finditer('dddsssdddsssddd\ndddsssdddsssddd', 5):
...     print(i)
...
<re.Match object; span=(6, 9), match='ddd'>
<re.Match object; span=(12, 15), match='ddd'>
<re.Match object; span=(16, 19), match='ddd'>
<re.Match object; span=(22, 25), match='ddd'>
<re.Match object; span=(28, 31), match='ddd'>
```
```
# Replace with the Pattern object's sub method to get a new string
>>> ma = pa.sub('aaa', 'dddsssdddsssddddddsssdddsssddd')
>>> print(type(ma), ma)
<class 'str'> aaasssaaasssaaaaaasssaaasssaaa
>>> ma = pa.sub('aaa', 'dddsssdddsssddddddsssdddsssddd', 2)
>>> print(type(ma), ma)
<class 'str'> aaasssaaasssddddddsssdddsssddd
# repl can also be a function that receives each Match object
>>> def upper_str(match):
...     return match.group().upper()
...
>>> ma = pa.sub(upper_str, 'dddsssdddsssddddddsssdddsssddd', 2)
>>> print(type(ma), ma)
<class 'str'> DDDsssDDDsssddddddsssdddsssddd
```
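```
# Split with the Pattern object's split method to get a list of the pieces.
# A pattern without a capture group is used here: with r'(ddd)' the captured
# 'ddd' text would also appear in the result list.
>>> pa = re.compile(r'ddd')
>>> ma = pa.split('dddsssdddsssddddddsssdddsssddd', 2)
>>> print(type(ma), ma)
<class 'list'> ['', 'sss', 'sssddddddsssdddsssddd']
>>> ma = pa.split('dddsssdddsssddddddsssdddsssddd')
>>> print(type(ma), ma)
<class 'list'> ['', 'sss', 'sss', '', 'sss', 'sss', '']
```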
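The note about findall() and () groups is easy to overlook, so here is a small sketch of the three cases. The e-mail-like sample text and the variable names are assumptions for illustration.

```python
import re

text = 'tom@163.com, amy@qq.com'  # sample input, made up for this example

# No group: findall returns the whole matches.
no_group = re.findall(r'\w+@\w+\.com', text)

# One group: findall returns only what the group captured.
one_group = re.findall(r'\w+@(\w+)\.com', text)

# Several groups: findall returns a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)@(\w+)\.com', text)

print(no_group)
print(one_group)
print(two_groups)
```

If the whole match is needed while still grouping, a non-capturing group `(?:...)` keeps findall returning full matches.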
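Match object attributes

group() Returns the string matched by the regular expression.

groups() Returns a tuple of the strings captured by the () groups in the regular expression, one element per group. The tuple is empty if the pattern contains no groups.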
```
>>> ma = re.match(r'[\w]{6,11}@(163|qq|huawei)(163|qq|huawei)\.com\1\2', 'yc123456@163qq.com163qq')
>>> ma.group()
'yc123456@163qq.com163qq'
>>> ma.groups()
('163', 'qq')
```
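start()       Returns the start position of the match

end()        Returns the end position of the match

span()      Returns a tuple (start, end) with the start and end positions of the match

string       The source string that was matched against

re       The regular expression (Pattern object) used for the match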
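A minimal sketch that reads each of these attributes from one Match object. The sample sentence is an assumption for illustration.

```python
import re

match = re.search(r'\d+', 'order 12345 shipped')

matched_text = match.group()    # the text that matched
start = match.start()           # index of the first matched character
end = match.end()               # index just past the last matched character
span = match.span()             # (start, end) as a tuple
source = match.string           # the string passed to search()
pattern_text = match.re.pattern # the pattern string of the compiled regex

print(matched_text, start, end, span)
```

Slicing the source with the span gives back the matched text: `source[start:end] == matched_text`.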

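3. Basic Regular Expression Syntax

Matching a single character

.    Matches any character except \n

[...]    Matches a character set, e.g. [a-z]

\d|\D    Matches a digit | a non-digit

\s|\S    Matches whitespace | non-whitespace

\w|\W    Matches a word character | a non-word character. For ASCII text, \w is [a-zA-Z0-9_]

\[\]    Matches a literal [ ] in the string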
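The single-character classes above can be tried out with findall(). This is only a sketch; the sample strings are made up so that each class has something to hit.

```python
import re

text = 'a1 _b\t[x]'  # sample input, an assumption for illustration

dot_chars = re.findall(r'.', 'a\nb')      # the dot skips the newline
vowels = re.findall(r'[aeiou]', text)     # character set
digits = re.findall(r'\d', text)          # digits
non_space = re.findall(r'\S+', text)      # runs of non-whitespace
word_runs = re.findall(r'\w+', text)      # runs of word characters
brackets = re.findall(r'\[\w\]', text)    # literal [ and ] need escaping

print(dot_chars, vowels, digits)
print(non_space, word_runs, brackets)
```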

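Matching multiple characters

*    Matches the preceding character 0 or more times

+    Matches the preceding character 1 or more times. E.g. matching a valid identifier: r'[_a-zA-Z]+\w*'

?    Matches the preceding character 0 or 1 times. E.g. matching a number of up to two digits: r'[1-9]?[0-9]'. Note: matching against 09 gives 0

{m}|{m,n}    Matches the preceding character exactly m times, or from m to n times. E.g. matching a qq mailbox: r'\w{6,10}@qq\.com'

*? |+? |??    Switch *, +, ? to non-greedy mode, so the match consumes as few characters as possible.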
```
>>> re.findall(r'[0-9]k*','1kkkk')
['1kkkk']
>>> re.findall(r'[0-9]k*?','1kkkk')
['1']
>>> re.findall(r'[0-9]k?','1kkkk')
['1k']
>>> re.findall(r'[0-9]k??','1kkkk')
['1']
>>> re.findall(r'[0-9]k+','1kkkk')
['1kkkk']
>>> re.findall(r'[0-9]k+?','1kkkk')
['1k']
```
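The remaining quantifiers, `?` on the 09 case and `{m}` / `{m,n}`, plus the classic greedy versus non-greedy comparison, can be sketched like this. The sample inputs and the use of \b word boundaries are assumptions added for illustration.

```python
import re

# '?' makes [1-9] optional, so on '09' only the single digit '0' matches.
two_digit = re.match(r'[1-9]?[0-9]', '09').group()

numbers = '12 123 1234'
# {3}: exactly three digits between word boundaries.
exactly_three = re.findall(r'\b\d{3}\b', numbers)
# {2,3}: two to three digits between word boundaries.
two_to_three = re.findall(r'\b\d{2,3}\b', numbers)

tags = '<a><b>'
greedy = re.findall(r'<.+>', tags)   # + grabs as much as it can
lazy = re.findall(r'<.+?>', tags)    # +? stops at the first possible '>'

print(two_digit, exactly_three, two_to_three)
print(greedy, lazy)
```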
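Boundary matching

^    Matches the start of the string

$    Matches the end of the string

\A|\Z    The specified string must be at the very start | very end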
```
>>> re.findall(r'\A[0-9].*k\Z','1kkkk')
['1kkkk']
>>> re.findall(r'\A[0-9].*k\Z','1kkkz')
[]
```
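The practical difference between ^ / $ and \A / \Z shows up under re.M: the first pair becomes line-aware, the second pair stays tied to the whole string. A minimal sketch with a made-up two-line input:

```python
import re

text = 'one\ntwo'  # sample input, an assumption for illustration

# With re.M, ^ and $ match at every line start and line end.
caret_lines = re.findall(r'^\w+', text, re.M)
dollar_lines = re.findall(r'\w+$', text, re.M)

# \A and \Z ignore re.M and only match at the string's start and end.
string_start = re.findall(r'\A\w+', text, re.M)
string_end = re.findall(r'\w+\Z', text, re.M)

print(caret_lines, dollar_lines)
print(string_start, string_end)
```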
Group matching

| Matches either the expression on its left or the one on its right. E.g. matching 0~100: r'^[0-9]$|^[1-9][0-9]$|^100$'
```
>>> re.findall(r'^[0-9]$|^[1-9][0-9]$|^100$','100')
['100']
>>> re.findall(r'^[0-9]$|^[1-9][0-9]$|^100$','9')
['9']
>>> re.findall(r'^[0-9]$|^[1-9][0-9]$|^100$','99')
['99']
>>> re.findall(r'^[0-9]$|^[1-9][0-9]$|^100$','09')
[]
```
[] A set of single characters.

(ab) The expression inside the parentheses forms a group.

Groups are numbered 1, 2, 3 from left to right. They are often used to choose among a few alternative words, e.g. matching 163, qq and huawei mailboxes: r'\w{6,11}@(163|qq|huawei)\.com'
```
>>> re.match(r'[\w]{6,11}@(163|qq|huawei)\.com','yc123456@163.com').group()
'yc123456@163.com'
>>> re.match(r'[\w]{6,11}@(163|qq|huawei)\.com','yc123456@qq.com').group()
'yc123456@qq.com'
>>> re.match(r'[\w]{6,11}@(163|qq|huawei)\.com','yc123456@huawei.com').group()
'yc123456@huawei.com'
```
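\<number> References the string matched by the group numbered <number>. Similar to xargs -i in a shell pipeline.

Note: \1 refers to the string matched by the first (). If there is only one group () but \2 is used, an error is raised. This can be used, for example, to match tags in an XML file.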
```
>>> re.match(r'<(\w+>).*</\1','<book>test</book>').group()
'<book>test</book>'
>>> re.match(r'<(\w+>).*</\1','<book>test</ebook>').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> re.match(r'<(\w+>).*</\1','<book>test</book1>').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
```
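Backreferences are also useful outside XML. As a sketch, the hypothetical pattern `doubled` below finds a word that is immediately repeated and collapses it with sub(); the sentence is made up for illustration.

```python
import re

# \1 must repeat exactly what group 1 captured; \b keeps whole words only.
doubled = re.compile(r'\b(\w+)\s+\1\b')

sentence = 'this is is a test'
hit = doubled.search(sentence)
repeated_text = hit.group()    # the whole repeated span
repeated_word = hit.group(1)   # the captured word

# In the replacement string, \1 inserts the captured word once.
cleaned = doubled.sub(r'\1', sentence)

print(repeated_text, repeated_word)
print(cleaned)
```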
(?P<name>...) Gives a group a name

(?P=name) References a group by the name it was given
```
>>> re.match(r'[\w]{6,11}@(?P<type1>163|qq|huawei)(?P<type2>163|qq|huawei)\.com(?P=type1)(?P=type2)','yc123456@163qq.com163qq').group()
'yc123456@163qq.com163qq'
>>> re.match(r'[\w]{6,11}@(?P<type1>163|qq|huawei)(?P<type2>163|qq|huawei)\.com(?P=type1)(?P=type2)','yc123456@163qq.com163163').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
```
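Named groups can also be read back by name from the Match object. In the sketch below, the group names `user` and `provider` and the sample address are assumptions chosen for illustration.

```python
import re

email = re.compile(r'(?P<user>\w{6,11})@(?P<provider>163|qq|huawei)\.com')
match = email.match('yc123456@qq.com')

user = match.group('user')          # look a group up by name
provider = match.group('provider')
by_number = match.group(2)          # named groups still have numbers
as_dict = match.groupdict()         # all named groups as a dict

print(user, provider, by_number)
print(as_dict)
```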
Original post: https://www.cnblogs.com/yc913344706/p/7821676.html