11.4 Working with XML
Using XML for Linguistic Structures
(2) <entry>
      <headword>whale</headword>
      <pos>noun</pos>
      <gloss>any of the larger cetacean mammals having a streamlined
        body and breathing through a blowhole on the head</gloss>
    </entry>
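As a minimal sketch of how an entry like example (2) can be read programmatically, the snippet below parses it with `xml.etree.ElementTree` from the Python 3 standard library. This is an assumption made for illustration: the sessions later in this section use the older `nltk.etree` package under Python 2. The names `ENTRY_XML` and `fields` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Example (2) as a string. The gloss spans two lines in the source.
ENTRY_XML = """
<entry>
  <headword>whale</headword>
  <pos>noun</pos>
  <gloss>any of the larger cetacean mammals having a streamlined
  body and breathing through a blowhole on the head</gloss>
</entry>
"""

entry = ET.fromstring(ENTRY_XML.strip())

# Map each child tag to its text, collapsing internal whitespace
# so that the line break inside <gloss> disappears.
fields = {child.tag: " ".join(child.text.split()) for child in entry}

print(fields["headword"], fields["pos"])
print(fields["gloss"])
```

Each child element carries its label in `.tag` and its content in `.text`, which is all the later examples rely on.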
The Role of XML
(For more background on XML, please consult other references.)
The ElementTree Interface
>>> import nltk
>>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
>>> from nltk.etree.ElementTree import ElementTree
>>> merchant = ElementTree().parse(merchant_file)
>>> merchant
<Element PLAY at 22fa800>
>>> merchant[0]
<Element TITLE at 22fa828>
>>> merchant[0].text
'The Merchant of Venice'
>>> merchant.getchildren()
[<Element TITLE at 22fa828>, <Element PERSONAE at 22fa7b0>, <Element SCNDESCR at 2300170>,
<Element PLAYSUBT at 2300198>, <Element ACT at 23001e8>, <Element ACT at 234ec88>,
<Element ACT at 23c87d8>, <Element ACT at 2439198>, <Element ACT at 24923c8>]
We can use further methods to work with the XML:
>>> for i, act in enumerate(merchant.findall('ACT')):
...     for j, scene in enumerate(act.findall('SCENE')):
...         for k, speech in enumerate(scene.findall('SPEECH')):
...             for line in speech.findall('LINE'):
...                 if 'music' in str(line.text):
...                     print "Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text)
Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
Act 5 Scene 1 Speech 23: And draw her home with music.
Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
Act 5 Scene 1 Speech 25: The man that hath no music in himself,
Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
Act 5 Scene 1 Speech 32: No better a musician than the wren.
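The nested search above can be tried without the Shakespeare corpus. The following is a sketch only, assuming Python 3 and the standard library's `xml.etree.ElementTree`; the toy play in `PLAY_XML` and the helper `find_lines` are hypothetical and merely mimic the ACT/SCENE/SPEECH/LINE nesting.

```python
import xml.etree.ElementTree as ET

# A tiny made-up play with the same nesting as the real file.
PLAY_XML = """
<PLAY>
  <TITLE>Toy Play</TITLE>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>A</SPEAKER><LINE>No sound here.</LINE></SPEECH>
      <SPEECH><SPEAKER>B</SPEAKER><LINE>Let music play.</LINE><LINE/></SPEECH>
    </SCENE>
  </ACT>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>A</SPEAKER><LINE>Mark the music.</LINE></SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""

play = ET.fromstring(PLAY_XML.strip())


def find_lines(root, word):
    """Return (act, scene, speech, text) for every LINE containing word."""
    hits = []
    for i, act in enumerate(root.findall("ACT"), start=1):
        for j, scene in enumerate(act.findall("SCENE"), start=1):
            for k, speech in enumerate(scene.findall("SPEECH"), start=1):
                for line in speech.findall("LINE"):
                    # line.text is None for an empty <LINE/>, so guard it.
                    if word in (line.text or ""):
                        hits.append((i, j, k, line.text))
    return hits


for i, j, k, text in find_lines(play, "music"):
    print("Act %d Scene %d Speech %d: %s" % (i, j, k, text))
```

The `str(line.text)` call in the session above plays the same role as the `or ""` guard here: a LINE with no leading text has `text` set to `None`, and testing `'music' in None` would raise an error.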
We can also examine the sequence of speakers. A frequency distribution shows who speaks most often:
>>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
>>> speaker_freq = nltk.FreqDist(speaker_seq)
>>> top5 = speaker_freq.keys()[:5]
>>> top5
['PORTIA', 'SHYLOCK', 'BASSANIO', 'GRATIANO', 'ANTONIO']
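The session above relies on `keys()` returning speakers sorted by frequency, which holds for the older NLTK release used here. In NLTK 3, `FreqDist` is built on `collections.Counter` and `most_common()` is the usual way to get the top items. The sketch below shows the same idea with `Counter` alone, assuming Python 3; the short speaker list is made up for illustration.

```python
from collections import Counter

# Hypothetical speaker sequence standing in for speaker_seq.
speaker_seq = ["PORTIA", "SHYLOCK", "PORTIA", "BASSANIO", "PORTIA", "SHYLOCK"]

speaker_freq = Counter(speaker_seq)

# most_common(n) returns (item, count) pairs, highest count first.
top2 = [name for name, _count in speaker_freq.most_common(2)]
print(top2)
```

Note that this counts speeches, not words, so it measures how often a character speaks rather than how much they say.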
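We can also look for patterns in who follows whom in the dialogue. To keep the table readable, the code below abbreviates the five most frequent speakers to four letters and maps everyone else to 'OTH'.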
>>> mapping = nltk.defaultdict(lambda: 'OTH')
>>> for s in top5:
...     mapping[s] = s[:4]
...
>>> speaker_seq2 = [mapping[s] for s in speaker_seq]
>>> cfd = nltk.ConditionalFreqDist(nltk.ibigrams(speaker_seq2))
>>> cfd.tabulate()
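To make the who-follows-whom counting concrete, here is a small sketch that does the same thing with the standard library only, assuming Python 3. The speaker sequence and the name `follows` are hypothetical; `zip(seq, seq[1:])` stands in for `nltk.ibigrams`, and a dictionary of `Counter` objects stands in for `ConditionalFreqDist`.

```python
from collections import Counter, defaultdict

# Made-up data: a speaker sequence and the speakers we want to keep.
speaker_seq = ["PORTIA", "NERISSA", "PORTIA", "SHYLOCK", "PORTIA", "NERISSA"]
top = ["PORTIA", "SHYLOCK"]

# Speakers outside `top` fall back to 'OTH'; the rest are abbreviated.
mapping = defaultdict(lambda: "OTH")
for s in top:
    mapping[s] = s[:4]

speaker_seq2 = [mapping[s] for s in speaker_seq]

# follows[a][b] counts how often speaker b comes directly after speaker a.
follows = defaultdict(Counter)
for prev, nxt in zip(speaker_seq2, speaker_seq2[1:]):
    follows[prev][nxt] += 1

for prev in sorted(follows):
    print(prev, dict(follows[prev]))
```

Each row of the table that `cfd.tabulate()` prints corresponds to one `follows[prev]` counter in this sketch.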
Using ElementTree for Accessing Toolbox Data
We can use toolbox.xml() to access a Toolbox file:
>>> from nltk.corpus import toolbox
>>> lexicon = toolbox.xml('rotokas.dic')
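Fields can then be accessed by index. For example, lexicon[3][0] is the first field of entry number 3 (the fourth entry, counting from zero):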
>>> lexicon[3][0]
<Element lx at 77bd28>
>>> lexicon[3][0].tag
'lx'
>>> lexicon[3][0].text
'kaa'
We can also access XML content using paths:
>>> [lexeme.text.lower() for lexeme in lexicon.findall('record/lx')]
['kaa', 'kaa', 'kaa', 'kaakaaro', 'kaakaaviko', 'kaakaavo', 'kaakaoko',
'kaakasi', 'kaakau', 'kaakauko', 'kaakito', 'kaakuupato', ..., 'kuvuto']
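Index access and path access can be compared side by side on a small stand-in lexicon. This is a sketch under stated assumptions: it uses Python 3 and `xml.etree.ElementTree` instead of `toolbox.xml()`, and the two records in `LEXICON_XML` are a hand-built subset, not the real rotokas.dic file.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for the tree returned by toolbox.xml().
LEXICON_XML = """
<toolbox_data>
  <record><lx>kaa</lx><ps>N</ps><ge>cooking banana</ge></record>
  <record><lx>Kakarapaia</lx><ps>N</ps><ge>village name</ge></record>
</toolbox_data>
"""

lexicon = ET.fromstring(LEXICON_XML.strip())

# Index access: first field of the first record.
first = lexicon[0][0]
print(first.tag, first.text)

# Path access: every <lx> that is a direct child of a <record>.
lexemes = [lx.text.lower() for lx in lexicon.findall("record/lx")]
print(lexemes)
```

Index access depends on field order, while the path `record/lx` finds the headword wherever it sits inside a record, which matters because Toolbox entries do not always order their fields the same way.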
We can also display an entry as XML by wrapping it in an ElementTree and writing it to sys.stdout:
>>> import sys
>>> from nltk.etree.ElementTree import ElementTree
>>> tree = ElementTree(lexicon[3])
>>> tree.write(sys.stdout)
<record>
  <lx>kaa</lx>
  <ps>N</ps>
  <pt>MASC</pt>
  <cl>isi</cl>
  <ge>cooking banana</ge>
  <tkp>banana bilong kukim</tkp>
  <pt>itoo</pt>
  <sf>FLORA</sf>
  <dt>12/Aug/2005</dt>
  <ex>Taeavi iria kaa isi kovopaueva kaparapasia.</ex>
  <xp>Taeavi i bin planim gaden banana bilong kukim tasol long paia.</xp>
  <xe>Taeavi planted banana in order to cook it.</xe>
</record>
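For readers on Python 3, the following sketch shows one way to serialize a record with the standard library's `xml.etree.ElementTree`. It is an illustration under assumptions, not the method used in the session above: the record is built by hand from three of the fields shown, and `encoding="unicode"` is passed so that `write()` produces text rather than bytes.

```python
import io
import xml.etree.ElementTree as ET

# Build a small record by hand (a subset of the fields shown above).
record = ET.Element("record")
for tag, text in [("lx", "kaa"), ("ps", "N"), ("ge", "cooking banana")]:
    ET.SubElement(record, tag).text = text

# Write to an in-memory text buffer; sys.stdout would work the same way.
buffer = io.StringIO()
ET.ElementTree(record).write(buffer, encoding="unicode")
xml_text = buffer.getvalue()
print(xml_text)
```

The output is a single line because the hand-built record contains no whitespace between elements.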
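Formatting Entries
We can generate output in whatever format we need. For example, we can build an HTML table from selected fields: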
>>> html = "<table>\n"
>>> for entry in lexicon[70:80]:
...     lx = entry.findtext('lx')
...     ps = entry.findtext('ps')
...     ge = entry.findtext('ge')
...     html += "  <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)
>>> html += "</table>"
>>> print html
<table>
  <tr><td>kakae</td><td>???</td><td>small</td></tr>
  <tr><td>kakae</td><td>CLASS</td><td>child</td></tr>
  <tr><td>kakaevira</td><td>ADV</td><td>small-like</td></tr>
  <tr><td>kakapikoa</td><td>???</td><td>small</td></tr>
  <tr><td>kakapikoto</td><td>N</td><td>newborn baby</td></tr>
  <tr><td>kakapu</td><td>V</td><td>place in sling for purpose of carrying</td></tr>
  <tr><td>kakapua</td><td>N</td><td>sling for lifting</td></tr>
  <tr><td>kakara</td><td>N</td><td>armband</td></tr>
  <tr><td>Kakarapaia</td><td>N</td><td>village name</td></tr>
  <tr><td>kakarau</td><td>N</td><td>frog</td></tr>
</table>
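The loop above inserts field text into HTML as-is and assumes every field is present. The sketch below is one possible hardening, assuming Python 3 and the standard library: `findtext()` is given a default for missing fields and `html.escape()` protects characters such as `<`. The helper `to_html_table` and the third record are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Toy lexicon: the first record lacks <ps>, the third (invented) one
# contains characters that need escaping in HTML.
LEXICON_XML = """
<toolbox_data>
  <record><lx>kakae</lx><ge>small</ge></record>
  <record><lx>kakapu</lx><ps>V</ps><ge>place in sling</ge></record>
  <record><lx>toy</lx><ps>N</ps><ge>a &lt;b&gt; test</ge></record>
</toolbox_data>
"""

lexicon = ET.fromstring(LEXICON_XML.strip())


def to_html_table(entries, tags=("lx", "ps", "ge")):
    """Render the chosen fields of each entry as one HTML table row."""
    rows = []
    for entry in entries:
        # A missing field becomes an empty cell instead of None.
        cells = [html.escape(entry.findtext(tag, default="")) for tag in tags]
        rows.append("  <tr>" + "".join("<td>%s</td>" % c for c in cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


table = to_html_table(lexicon)
print(table)
```

Passing the list of tags as a parameter keeps the function reusable for other field selections, such as headword and example sentence.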
11.5 Working with Toolbox Data
Adding a Field to Each Entry
Example 11-2. Adding a new cv field to a lexical entry

import re
from nltk.etree.ElementTree import SubElement

def cv(s):
    # Reduce a word to its consonant-vowel skeleton, e.g. 'kaeviro' -> 'CVVCVCV'.
    s = s.lower()
    s = re.sub(r'[^a-z]', r'_', s)    # anything that is not a letter
    s = re.sub(r'[aeiou]', r'V', s)   # vowels
    s = re.sub(r'[^V_]', r'C', s)     # everything left over is a consonant
    return s

def add_cv_field(entry):
    # Append a cv field derived from the entry's lx (headword) field.
    for field in entry:
        if field.tag == 'lx':
            cv_field = SubElement(entry, 'cv')
            cv_field.text = cv(field.text)

>>> lexicon = toolbox.xml('rotokas.dic')
>>> add_cv_field(lexicon[53])
>>> print nltk.to_sfm_string(lexicon[53])
\lx kaeviro
\ps V
\pt A
\ge lift off
\ge take off
\tkp go antap
\sc MOTION
\vx 1
\nt used to describe action of plane
\dt 03/Jun/2005
\ex Pita kaeviroroe kepa kekesia oa vuripierevo kiuvu.
\xp Pita i go antap na lukim haus win i bagarapim.
\xe Peter went to look at the house that the wind destroyed.
\cv CVVCVCV
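The cv() function from Example 11-2 can be checked in isolation. The sketch below assumes Python 3 and `xml.etree.ElementTree` in place of `nltk.etree`, and it builds a two-field entry by hand instead of loading rotokas.dic. As a small variation, this `add_cv_field` collects the `lx` fields with `findall()` before appending, so the entry is not modified while it is being iterated.

```python
import re
import xml.etree.ElementTree as ET


def cv(s):
    """Reduce a word to its consonant-vowel skeleton."""
    s = s.lower()
    s = re.sub(r"[^a-z]", "_", s)    # non-letters become '_'
    s = re.sub(r"[aeiou]", "V", s)   # vowels become 'V'
    s = re.sub(r"[^V_]", "C", s)     # remaining letters are consonants
    return s


def add_cv_field(entry):
    """Append one <cv> field for each <lx> field of the entry."""
    # findall() returns a list, so appending below is safe.
    for lx in entry.findall("lx"):
        ET.SubElement(entry, "cv").text = cv(lx.text)


# Hand-built entry using the headword from Example 11-2.
entry = ET.fromstring("<record><lx>kaeviro</lx><ps>V</ps></record>")
add_cv_field(entry)

print(entry.findtext("cv"))
print(cv("kaa isi"))
```

The order of the three substitutions matters: vowels are rewritten to uppercase `V` first, so the final pattern `[^V_]` can treat every remaining lowercase letter, including `v`, as a consonant.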
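Validating a Toolbox Lexicon
Many lexicons in Toolbox format do not conform to any particular schema. Some entries may include extra fields, or may order existing fields in a new way.
For example, with the help of a FreqDist we can easily find field sequences whose frequency is anomalous: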
>>> fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon)
>>> fd.items()
[('lx:ps:pt:ge:tkp:dt:ex:xp:xe', 41), ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 37),
('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 27), ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe', 20),
..., ('lx:alt:rt:ps:pt:ge:eng:eng:eng:tkp:tkp:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe', 1)]
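The same check can be sketched on a toy lexicon with `collections.Counter`, assuming Python 3. The three records and the rule "a sequence seen only once is suspicious" are illustrative assumptions; a real lexicon would need a threshold chosen by inspecting the distribution.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Toy lexicon: the third record puts <ge> before <ps>.
LEXICON_XML = """<toolbox_data>
  <record><lx>a</lx><ps>N</ps><ge>x</ge></record>
  <record><lx>b</lx><ps>V</ps><ge>y</ge></record>
  <record><lx>c</lx><ge>z</ge><ps>N</ps></record>
</toolbox_data>"""

lexicon = ET.fromstring(LEXICON_XML)

# One string per entry, e.g. 'lx:ps:ge', counted across the lexicon.
fd = Counter(":".join(field.tag for field in entry) for entry in lexicon)

# Field sequences that occur exactly once are candidates for review.
rare = [seq for seq, count in fd.items() if count == 1]

# Headwords of the entries that use a rare sequence.
suspects = [entry.findtext("lx") for entry in lexicon
            if ":".join(field.tag for field in entry) in rare]

print(fd.most_common())
print(suspects)
```

Listing the affected headwords alongside the counts makes it easier to go back to the source file and decide whether each unusual entry is an error or a legitimate exception.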