MIT Natural Language Processing, Lecture 2: Word Counting (Part 4)
Natural Language Processing: (Simple) Word Counting
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: 52nlp (www.52nlp.cn, January 11, 2009)
4. Tokenization and Word Segmentation
a) Tokenization
i. Goal: divide text into a sequence of words.
ii. A word is a string of contiguous alphanumeric characters with a space on either side; it may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis).
iii. Is tokenization easy?
b) What's a word?
i. English:
1. “Wash.” vs. “wash” (abbreviation vs. common word)
2. “won’t”, “John’s”
3. “pro-Arab”, “the idea of a child-as-required-yuppie-possession must be motivating them”, “85-year-old grandmother”
ii. East Asian languages:
1. Words are not separated by white space.
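Under the Kucera and Francis definition above, an English tokenizer can be approximated by a single regular expression. The Python sketch below is illustrative only (the names TOKEN and tokenize are assumptions, not from the lecture); note that it cannot resolve cases like “Wash.” vs. “wash”, which require abbreviation handling:

import re

# One token = a run of alphanumerics, optionally extended by internal
# apostrophes or hyphens (the Kucera and Francis definition).
TOKEN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("John's 85-year-old grandmother won't wash."))
# ["John's", '85-year-old', 'grandmother', "won't", 'wash']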
c) Word Segmentation
i. Rule-based approach: morphological analysis based on lexical and grammatical knowledge.
ii. Corpus-based approach: learn from corpora (Ando & Lee, 2000).
iii. Issues to consider: coverage, ambiguity, accuracy.
d) Motivation for Statistical Segmentation
i. Unknown-word problem: presence of domain terms and proper names.
ii. Grammatical constraints may not be sufficient. Example: alternative segmentations of noun phrases (the two readings below segment the same character sequence differently).
iii. Example 1:
1. Segmentation: sha-choh/ken/gyoh-mu/bu-choh
2. Translation: “president/and/business/general manager”
iv. Example 2:
1. Segmentation: sha-choh/ken-gyoh/mu/bu-choh
2. Translation: “president/subsidiary business/Tsutomu [a name]/general manager”
e) A Segmentation Algorithm (Ando & Lee, 2000):
i. Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to the proposed boundary with the frequency of the n-grams that straddle it.
ii. Voting scheme: for each n-gram order n, let sL and sR be the n-grams immediately to the left and right of candidate boundary k, and t1, ..., t(n-1) the n-grams that straddle k. The order-n vote is
v_n(k) = (1 / (2(n-1))) * Σj [ I(#(sL) > #(tj)) + I(#(sR) > #(tj)) ],
where #(x) is the corpus frequency of x and I(·) equals 1 if its argument holds and 0 otherwise. The votes are averaged over all orders, and a boundary is placed wherever the averaged score is a local maximum or exceeds a threshold t (see lec02.pdf for the full statement).
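The voting criterion translates almost directly into code. Below is a minimal Python sketch of the idea, not the authors' implementation; the function names, the choice of orders (2, 3, 4), and the default threshold are illustrative assumptions:

from collections import Counter

def ngram_counts(corpus, orders=(2, 3, 4)):
    """Count character n-grams of the given orders in a large unsegmented corpus."""
    counts = Counter()
    for n in orders:
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return counts

def boundary_score(text, k, counts, orders=(2, 3, 4)):
    """Average over orders the fraction of (adjacent, straddling) n-gram pairs
    in which the adjacent (non-straddling) n-gram is the more frequent one."""
    votes = []
    for n in orders:
        if k - n < 0 or k + n > len(text):
            continue  # not enough context on one side for this order
        s_left, s_right = text[k - n:k], text[k:k + n]  # adjacent n-grams
        straddling = [text[k - n + j:k + j] for j in range(1, n)]
        wins = sum(counts[s] > counts[t]
                   for s in (s_left, s_right) for t in straddling)
        votes.append(wins / (2 * (n - 1)))
    return sum(votes) / len(votes) if votes else 0.0

def segment(text, counts, orders=(2, 3, 4), threshold=0.5):
    """Insert a boundary wherever the score is a local maximum or >= threshold."""
    scores = [boundary_score(text, k, counts, orders) for k in range(len(text) + 1)]
    pieces, start = [], 0
    for k in range(1, len(text)):
        local_max = scores[k] > scores[k - 1] and scores[k] > scores[k + 1]
        if local_max or scores[k] >= threshold:
            pieces.append(text[start:k])
            start = k
    pieces.append(text[start:])
    return pieces

Counts gathered once from the raw corpus are all the supervision the method needs, which is what makes it attractive for the unknown-word problem described above.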
f) Experimental Framework
i. Corpus: 150 megabytes of 1993 Nikkei newswire.
ii. Manual annotations: 50 sequences for the development set (parameter tuning) and 50 sequences for the test set.
iii. Baseline algorithms: the Chasen and Juman morphological analyzers (115,000 and 231,000 words).
g) Evaluation Measures
i. tp (true positive): a positive instance that the model predicts as positive;
ii. fp (false positive): a negative instance that the model predicts as positive;
iii. tn (true negative): a negative instance that the model predicts as negative;
iv. fn (false negative): a positive instance that the model predicts as negative;
v. Precision: the proportion of selected items that the system got right:
P = tp / (tp + fp)
vi. Recall: the proportion of target items that the system selected:
R = tp / (tp + fn)
vii. F-measure:
F = 2 * P * R / (P + R)
viii. Word precision (P) is the percentage of proposed brackets that match word-level brackets in the annotation;
ix. Word recall (R) is the percentage of word-level brackets that are proposed by the algorithm.
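Given the proposed and annotated word brackets, these quantities are mechanical to compute. A minimal Python sketch follows; representing each word bracket as a (start, end) character-offset pair is an assumption made for illustration:

def prf(proposed, gold):
    """Precision, recall, and F-measure over two sets of word brackets."""
    tp = len(proposed & gold)   # proposed brackets that match the annotation
    fp = len(proposed - gold)   # proposed brackets that do not
    fn = len(gold - proposed)   # annotated brackets the system missed
    p = tp / (tp + fp) if proposed else 0.0
    r = tp / (tp + fn) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# "AB|CDE|F" proposed vs. "AB|CDEF" annotated:
proposed = {(0, 2), (2, 5), (5, 6)}
gold = {(0, 2), (2, 6)}
print(prf(proposed, gold))  # tp=1, fp=2, fn=1 -> P=0.33, R=0.50, F=0.40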
5. Conclusions
a) Corpora are widely used in text processing.
b) Corpora may be used either annotated or raw.
c) Zipf's law and its connection to natural language.
d) Data sparsity is a major problem for corpus-processing methods.
End of Lecture 2!