References:
http://www.cnblogs.com/kaituorensheng/p/3595879.html
https://github.com/fxsjy/jieba
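Checking whether a string contains Chinese characters

    import re

    def contain_zh(word):
        """Return a match object if word contains Chinese characters, else None."""
        # U+4E00 to U+9FA5 covers the common CJK Unified Ideographs.
        zh_pattern = re.compile(u'[\u4e00-\u9fa5]+')
        if isinstance(word, bytes):
            word = word.decode('utf-8')  # assumes UTF-8 encoded input
        return zh_pattern.search(word)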
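If only a yes/no answer is needed, the same check can be written without a regular expression. The following is a minimal sketch, assuming Python 3 and text that is already a str; the name has_zh is hypothetical and not part of the original code.

```python
def has_zh(text):
    """Return True if any character falls in the U+4E00 to U+9FA5 range."""
    return any('\u4e00' <= ch <= '\u9fa5' for ch in text)


print(has_zh("abc中文"))  # True
print(has_zh("abc"))      # False
```

Unlike contain_zh, this returns a plain bool rather than a match object, which can be more convenient when filtering a list.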
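Extracting the Chinese characters

    import re

    def remain_zh(word):
        """Return word with every non-Chinese character removed."""
        zh_pattern = re.compile(u'[^\u4e00-\u9fa5]+')
        if isinstance(word, bytes):
            word = word.decode('utf-8')  # assumes UTF-8 encoded input
        return zh_pattern.sub(u'', word)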
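remain_zh joins everything that survives into one string, so the boundaries between separate Chinese phrases are lost. When those boundaries matter, one possible variant is to collect each consecutive run instead. This is a sketch assuming Python 3; zh_runs and the sample string are made up for illustration.

```python
import re

# Same U+4E00 to U+9FA5 range as the functions above.
ZH_RUN = re.compile('[\u4e00-\u9fa5]+')


def zh_runs(text):
    """Return each consecutive run of Chinese characters as a separate string."""
    return ZH_RUN.findall(text)


sample = "Python 3 处理中文 text, 很方便!"
print(zh_runs(sample))           # ['处理中文', '很方便']
print("".join(zh_runs(sample)))  # 处理中文很方便
```

Joining the runs gives the same result remain_zh would produce for this input.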
Chinese word segmentation
Use the jieba module. Install it with: pip install jieba
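    import jieba

    # Full mode: list every dictionary word found in the sentence.
    seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
    print("Full Mode:", ' '.join(seg_list))

    # Default mode: a single segmentation of the sentence.
    seg_list = jieba.cut("我来到北京清华大学")
    print("Default Mode:", ' '.join(seg_list))

Output: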
Full Mode: 我 来到 北京 清华 清华大学 华大 大学
Default Mode: 我 来到 北京 清华大学
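To get a feel for why the two modes differ, here is a toy sketch in plain Python. It is not how jieba works internally: the dictionary TOY_DICT is a hypothetical seven-word set chosen for this sentence, full_mode simply lists every dictionary word it can find, and forward_max_match is a greedy longest-match heuristic standing in for the default mode.

```python
# Hypothetical toy dictionary, chosen only for this example sentence.
TOY_DICT = {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}


def full_mode(sentence, dictionary=TOY_DICT):
    """List every dictionary word in the sentence, ordered by start position."""
    max_len = max(len(w) for w in dictionary)
    words = []
    for start in range(len(sentence)):
        stop = min(start + max_len, len(sentence))
        for end in range(start + 1, stop + 1):
            if sentence[start:end] in dictionary:
                words.append(sentence[start:end])
    return words


def forward_max_match(sentence, dictionary=TOY_DICT):
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max(len(w) for w in dictionary)
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


print("Full:", ' '.join(full_mode("我来到北京清华大学")))
# Full: 我 来到 北京 清华 清华大学 华大 大学
print("Greedy:", ' '.join(forward_max_match("我来到北京清华大学")))
# Greedy: 我 来到 北京 清华大学
```

With this toy dictionary the two functions happen to reproduce the outputs shown above. Full mode returns overlapping words, while the greedy pass commits to one non-overlapping split.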