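Processing the Wiki Corpus

I have recently been working on knowledge graphs, with source data mainly from Baidu Baike, Hudong Baike, and Chinese Wikipedia. Chinese Wikipedia offers database dumps for download, so this post covers how to process the Wiki data.

1. Downloading the Chinese Wikipedia data

Download the dump from https://dumps.wikimedia.org/zhwiki/latest/; the Wikipedia data mainly consists of the following files: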

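zhwiki-latest-pages-articles.xml.bz2	article text
zhwiki-latest-redirect.sql	redirects between entries (synonyms)
zhwiki-latest-pagelinks.sql	links between pages inside the wiki
zhwiki-latest-page.sql	page metadata (ID, namespace, title)
zhwiki-latest-categorylinks.sql	category membership links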

This post processes only zhwiki-latest-pages-articles.xml.bz2.
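
The download can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `dump_url` and `download` are illustrative and not part of any tool mentioned in this post, and the file names are the ones listed above.

```python
import shutil
import urllib.request

# Directory listed in this post; every dump file lives directly under it.
BASE_URL = "https://dumps.wikimedia.org/zhwiki/latest/"


def dump_url(filename, base_url=BASE_URL):
    """Return the full URL of one dump file (hypothetical helper)."""
    return base_url.rstrip("/") + "/" + filename


def download(filename, dest=None):
    """Stream one dump file to disk without holding it in memory."""
    dest = dest or filename
    with urllib.request.urlopen(dump_url(filename)) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest

# Example call (downloads a large file, so it is left commented out):
# download("zhwiki-latest-pages-articles.xml.bz2")
```

Streaming with `shutil.copyfileobj` matters here because the articles dump is large; reading the whole response into memory first would be wasteful.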

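2. Extracting the data

Gensim is a mature Python toolkit for topic modeling. It provides WikiCorpus, a class for extracting Wikipedia data, which processes the downloaded dump (*articles.xml.bz2) and yields clean text.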

class WikiCorpus(TextCorpus):
    """
    Treat a wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus.
    The documents are extracted on-the-fly, so that the whole (massive) dump
    can stay compressed on disk.
    >>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
    >>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki) # another 8h, creates a file in MatrixMarket format plus file with id->word
    """
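
The idea of extracting documents on the fly while the dump stays compressed, as the docstring above describes, can be illustrated with the standard library alone. This is only a sketch of the general technique, not gensim's implementation; `iter_titles` is a hypothetical helper, and it assumes the dump uses `page` and `title` elements.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_titles(path):
    """Yield page titles from a bz2-compressed XML dump, one page at a time."""
    with bz2.open(path, "rb") as f:  # decompresses incrementally while reading
        for _, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
            if tag == "title":
                yield elem.text
            elif tag == "page":
                elem.clear()  # release the content of the finished page
```

Because the file is read through `bz2.open` and parsed incrementally, the compressed dump never needs to be unpacked on disk.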

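The WikiCorpus source code is worth reading in detail if you are interested. Below is the processing script process_wiki_1.py, which turns the wiki dump into a text corpus, wiki.zh.txt (860M).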

# -*- coding: utf-8 -*-
import logging
import sys
from gensim.corpora import WikiCorpus
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
'''
    extract data from wiki dumps(*articles.xml.bz2) by gensim.
    @chenbingjin 2016-05-11
'''
def help():
    print("Usage: python process_wiki_1.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt")

if __name__ == '__main__':
    if len(sys.argv) < 3:
        help()
        sys.exit(1)
    logging.info("running %s" % ' '.join(sys.argv))
    inp, outp = sys.argv[1:3]
    i = 0

    # dictionary={} skips the vocabulary-building pass; lemmatize=False skips lemmatization
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    with open(outp, 'w') as output:
        for text in wiki.get_texts():
            # one article per line, tokens separated by spaces
            output.write(" ".join(text) + "\n")
            i = i + 1
            if i % 10000 == 0:
                logging.info("Saved %d articles", i)
    logging.info("Finished saving %d articles", i)


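3. Data preprocessing

Chinese Wikipedia mixes Traditional Chinese characters and irregular characters, so the text needs to be converted from Traditional to Simplified Chinese and its character encoding normalized. The corpus also needs word segmentation for later work.

(1) Traditional to Simplified conversion: this uses OpenCC, an open-source Traditional/Simplified conversion tool. See the OpenCC installation instructions; on Linux it can be installed as follows.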

sudo apt-get install opencc
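
Once installed, the opencc command can also be driven from Python. The sketch below only wraps the command line that the script in this section uses (`-i`, `-o`, `-c t2s.json`); the function names are hypothetical, and it assumes the `opencc` executable is on the PATH.

```python
import subprocess


def opencc_command(src, dst, config="t2s.json"):
    """Build the opencc command line used by the script in this section."""
    return ["opencc", "-i", src, "-o", dst, "-c", config]


def to_simplified(src, dst, config="t2s.json"):
    """Run opencc; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(opencc_command(src, dst, config), check=True)

# Example call (requires opencc to be installed):
# to_simplified("wiki.zh.txt", "wiki.zh.chs.txt")
```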

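(2) Character encoding conversion: use the iconv command to convert the file to UTF-8 (the -c flag drops characters that cannot be converted).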

iconv -c -t UTF-8 < input_file > output_file
#iconv -c -t UTF-8 input_file -o output_file
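
A rough Python analogue of this cleanup is to decode the bytes as UTF-8 and ignore anything invalid. This is a sketch under one assumption: the input is nominally UTF-8, whereas iconv without -f takes the source encoding from the current locale. The names `drop_invalid_utf8` and `clean_file` are illustrative.

```python
import codecs


def drop_invalid_utf8(raw):
    """Decode bytes as UTF-8, silently dropping invalid byte sequences."""
    return raw.decode("utf-8", errors="ignore")


def clean_file(src, dst, chunk_size=1 << 20):
    """Stream src to dst, keeping only valid UTF-8 text."""
    # The incremental decoder copes with characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    with open(src, "rb") as fin, open(dst, "w", encoding="utf-8", newline="") as fout:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            fout.write(decoder.decode(chunk))
        fout.write(decoder.decode(b"", final=True))
```

Reading in chunks keeps memory use flat, which matters for a corpus file of roughly 1G.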

(3) Word segmentation: use the jieba segmentation package from the command line.

python -m jieba input_file > cut_file
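
The same step can be done line by line from Python. In the sketch below the tokenizer is passed in as a function, so the helper itself (`segment_lines`, a hypothetical name) has no dependency; with jieba installed, `jieba.cut` would be the natural argument.

```python
def segment_lines(lines, cut, delimiter=" "):
    """Yield each input line as delimiter-joined tokens produced by cut()."""
    for line in lines:
        # Drop whitespace-only tokens so the output has single delimiters.
        tokens = (tok for tok in cut(line.strip()) if tok.strip())
        yield delimiter.join(tokens)

# Example with jieba (assumes the jieba package is installed):
# import jieba
# with open("wiki.zh.chs.txt") as fin, open("wiki.zh.seg.txt", "w") as fout:
#     for segmented in segment_lines(fin, jieba.cut):
#         fout.write(segmented + "\n")
```

Processing one line at a time avoids loading the whole corpus into memory.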

Below is the processing script process_wiki_2.sh.

#!/bin/bash

# preprocess data
# @chenbingjin 2016-05-11

# Traditional Chinese to Simplified Chinese
echo "opencc: Traditional Chinese to Simplified Chinese..."
#time opencc -i wiki.zh.txt -o wiki.zh.chs.txt -c zht2zhs.ini
time opencc -i wiki.zh.txt -o wiki.zh.chs.txt -c t2s.json

# Cut words
echo "jieba: Cut words..."
time python -m jieba -d ' ' wiki.zh.chs.txt > wiki.zh.seg.txt

# Change encode
echo "iconv: ascii to utf-8..."
time iconv -c -t UTF-8 < wiki.zh.seg.txt > wiki.zh.seg.utf.txt


4. Experimental results

Processor: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Processing run: word segmentation dominates, taking 48m4s.

opencc: Traditional Chinese to Simplified Chinese...

real    0m57.765s
user    0m45.494s
sys    0m6.910s
-----------------------------
jieba: Cut words...
Building prefix dict from /usr/local/lib/python2.7/dist-packages/jieba/dict.txt ...
Loading model from cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Loading model cost 2.141 seconds.
Prefix dict has been built succesfully.

real    48m4.259s
user    47m36.987s
sys    0m22.746s
-----------------------------
iconv: ascii to utf-8...

real    0m22.039s
user    0m9.304s
sys    0m3.464s
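
To put the three timings side by side, the "real" values from the log above can be converted to seconds and expressed as shares of the total. This is a small illustrative calculation; `to_seconds` is a hypothetical helper.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` value such as '48m4.259s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


# "real" times copied from the log above
real = {"opencc": "0m57.765s", "jieba": "48m4.259s", "iconv": "0m22.039s"}
seconds = {step: to_seconds(t) for step, t in real.items()}
total = sum(seconds.values())

for step, s in seconds.items():
    print("%-6s %8.1fs  %5.1f%%" % (step, s, 100 * s / total))
```

By this calculation, segmentation accounts for about 97% of the wall-clock time of the three steps.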

Result: 1.1G of segmented Chinese corpus.

-rw-r--r-- 1 chenbingjin data 860M  7月  2 14:33 wiki.zh.txt
-rw-r--r-- 1 chenbingjin data 860M  7月  2 17:46 wiki.zh.chs.txt
-rw-r--r-- 1 chenbingjin data 1.1G  7月  2 18:34 wiki.zh.seg.txt
-rw-r--r-- 1 chenbingjin data 1.1G  7月  2 18:34 wiki.zh.seg.utf.txt

Addendum: the unsegmented wiki corpus is available for download for anyone who needs it.


Original post: https://www.cnblogs.com/chenbjin/p/5635853.html