Tokenization is the task of cutting a string into identifiable linguistic units that make up a piece of language data.
A simple approach to tokenization
import re

raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

# The simplest approach is to split the text on whitespace
re.split(r'\s+', raw)
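Splitting on whitespace leaves punctuation attached to the neighbouring words. The following is a minimal sketch, using only the standard `re` module, that compares whitespace splitting with two common alternatives. The string `sample` is a shortened stand-in chosen for illustration, not the full passage.

```python
import re

# Shortened sample sentence (an assumption for illustration)
sample = "'I won't have any pepper in my kitchen AT ALL.'"

# Split on runs of whitespace: punctuation stays glued to words
by_space = re.split(r'\s+', sample)

# Split on runs of non-word characters: punctuation is dropped, but
# empty strings appear where the text starts or ends with punctuation
by_nonword = re.split(r'\W+', sample)

# Match tokens instead of splitting: either a run of word characters,
# or one non-space character followed by any word characters
by_findall = re.findall(r'\w+|\S\w*', sample)

print(by_space)
print(by_nonword)
print(by_findall)
```

Matching tokens with `re.findall` keeps punctuation as separate items instead of discarding it, which is usually closer to what a tokenizer should produce.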
To tokenize more effectively with regular expressions, you need a deeper understanding of regular expression syntax.
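Symbol    Function
\b        Word boundary (zero width)
\d        Any decimal digit (equivalent to [0-9])
\D        Any non-digit character (equivalent to [^0-9])
\s        Any whitespace character (equivalent to [ \t\n\r\f\v])
\S        Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w        Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W        Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t        The tab character
\n        The newline character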
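The symbols in the table can be tried out directly with Python's `re` module. This is a small sketch, and the test string `s` is made up for illustration.

```python
import re

# Made-up test string containing a tab and a newline (an assumption)
s = "Pepper costs $12.40\tper jar\n"

digits = re.findall(r'\d+', s)           # runs of decimal digits
non_space = re.findall(r'\S+', s)        # runs of non-whitespace characters
words = re.findall(r'\w+', s)            # runs of word characters
any_per = re.findall(r'per', s)          # 'per' anywhere, including inside 'Pepper'
whole_per = re.findall(r'\bper\b', s)    # 'per' only as a whole word

print(digits)
print(non_space)
print(words)
print(len(any_per), len(whole_per))
```

In Python 3 these classes also match non-ASCII Unicode characters unless the `re.ASCII` flag is passed, so the bracketed equivalents in the table hold for ASCII text.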
NLTK's regular expression tokenizer
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> # Groups are non-capturing, (?:...), so that whole tokens are returned
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*            # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                  # ellipsis
...   | [][.,;"'?():-_`]        # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
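To see how the alternatives in this pattern interact without installing NLTK, the same pattern can be passed to the standard library's `re.findall`. This is only a sketch for experimentation, not a description of how NLTK is implemented. It also shows why the groups should be non-capturing: when a pattern contains capturing groups, `re.findall` returns the group contents rather than the whole match.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Same alternatives as the pattern above, written with non-capturing groups
pattern = r'''(?x)          # verbose mode: whitespace and comments are ignored
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():-_`]        # these are separate tokens
'''

tokens = re.findall(pattern, text)
print(tokens)

# With capturing groups, findall returns tuples of groups, not whole tokens
grouped = re.findall(r'(\w+)(-\w+)*', 'poster-print')
print(grouped)
```

The order of the alternatives matters: abbreviations are tried before plain words, so `U.S.A.` is kept as one token instead of being split at each period.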