Python re: matching Chinese text

# coding=utf-8
import re
import chardet  # module for detecting the encoding of web pages

# \d+ matches each run of digits
p = re.compile(r'\d+')
print p.findall('one1two2three3four4')

a = "rewfd231321ewq21weqeqw"
# two runs of digits separated by at least one non-digit (\D+)
p = re.compile(r"(\d+)\D+(\d+)", re.S)
b = p.findall(a)
print b

a = u"我爱@糗百，你呢"
print a
# both the pattern and the subject are unicode strings
b = re.findall(u"(.+?)@糗百(.+)", a, re.S)
print b
for i in b:
    for j in i:
        print j

Output:

['1', '2', '3', '4']
[('231321', '21')] # with several groups, findall returns a list of tuples, [(), ()]; with a single group it returns a list of strings, ["", ""]
我爱@糗百，你呢
[(u'\u6211\u7231', u'\uff0c\u4f60\u5462')]
我爱
，你呢
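
The shape of the findall result depends on how many capture groups the pattern has. Here is a minimal sketch of the three cases, written for Python 3 (an assumption; the listing above is Python 2), where every str is already Unicode and no u prefix is needed:

```python
import re

text = "one1two2three3four4"

# No capture group: a list of whole matches.
no_group = re.findall(r"\d+", text)

# One capture group: a list of that group's strings.
one_group = re.findall(r"[a-z]+(\d)", text)

# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"([a-z]+)(\d)", text)

# The Chinese example from above; re.S lets "." match newlines too.
chinese = re.findall(r"(.+?)@糗百(.+)", "我爱@糗百，你呢", re.S)

print(no_group)
print(one_group)
print(two_groups)
print(chinese)
```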

——————————————————————————————————————————

If the encoding of the Chinese text is unknown, which is the usual case for text scraped from the web:

# coding=utf-8
import re
import chardet  # module for detecting the encoding of web pages

a = "我爱@糗百，你呢"
# decode only if a is still a byte string
if not isinstance(a, unicode):
    codesty = chardet.detect(a)  # dict that includes the guessed 'encoding'
    a = a.decode(codesty['encoding'])
print a
b = re.findall(u"(.+?)@糗百(.+)", a, re.S)
print b
for i in b:
    for j in i:
        print j

Here the chardet module detects the encoding of the byte string, which is then decoded to unicode.

Output:

我爱@糗百，你呢
[(u'\u6211\u7231', u'\uff0c\u4f60\u5462')]
我爱
，你呢
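
When chardet is not installed, a rough substitute is to try a short list of likely encodings in order. This is a different technique from the detection above, not a replacement for it. The sketch below is Python 3 (an assumption); the helper name to_text and the candidate list are hypothetical, chosen for Chinese web pages:

```python
import re

def to_text(data, candidates=("utf-8", "gb18030")):
    """Return a str, trying each candidate encoding in order (hypothetical helper)."""
    if isinstance(data, str):
        return data  # already decoded
    for encoding in candidates:
        try:
            return data.decode(encoding)  # strict: raises on invalid bytes
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# Simulate scraped bytes whose encoding we pretend not to know.
raw = "我爱@糗百，你呢".encode("gbk")
text = to_text(raw)
pairs = re.findall(r"(.+?)@糗百(.+)", text, re.S)
print(pairs)
```

The order of the candidates matters: strict UTF-8 goes first because it rejects most byte strings that are not UTF-8, while the GB encodings accept many byte sequences and so work better as a later fallback.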

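Of course, if you want to run the .py file by double-clicking it on Windows, each resulting string should additionally be encoded for the console before printing, with j.encode("GBK").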
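
In Python 3, print usually handles the console encoding by itself, so explicit encoding is rarely needed. If you do need the bytes, here is a hedged sketch; the helper name console_bytes is hypothetical, and using errors="replace" is my assumption, so that characters missing from the target encoding become "?" instead of raising:

```python
import sys

def console_bytes(text, encoding=None):
    """Encode text for a console; unencodable characters become b'?'."""
    # Fall back to the interpreter's stdout encoding, then to UTF-8.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace")

gbk_bytes = console_bytes("我爱", "gbk")
print(gbk_bytes.decode("gbk"))
```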

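Note: convert Chinese text to unicode before processing it; do not run the regular expression directly on the undecoded byte string. As for how to convert an ASCII string to Unicode, I'll look into that when it comes up.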
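
To see why decoding first matters, compare matching on bytes with matching on decoded text. This is a small Python 3 sketch (an assumption, since the listings above are Python 2); it also shows that an ASCII byte string is decoded like any other:

```python
import re

raw = "我爱".encode("utf-8")  # 6 bytes: each of these characters takes 3 bytes
text = raw.decode("utf-8")   # 2 characters

# On bytes, "." matches a single byte, so characters are split apart.
byte_hits = re.findall(rb".", raw, re.S)

# On decoded text, "." matches a whole character.
char_hits = re.findall(r".", text, re.S)

# ASCII bytes become text the same way.
ascii_text = b"abc123".decode("ascii")

print(len(byte_hits), char_hits, ascii_text)
```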


Original article: https://www.cnblogs.com/fkissx/p/3935875.html