• [python] Keyword extraction with word clouds: using wordcloud, source-code analysis, Chinese word clouds, and a rewrite


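    1. Introduction to word clouds

    A word cloud (also called a text cloud or tag cloud) gives visual prominence to the "keywords" that occur most frequently in a body of text. The keywords are rendered as a colorful, cloud-like image, so the main message of the text can be grasped at a glance. Word clouds are common in blogs, microblogs, and article analysis.

    Besides ready-made online tools such as Wordle, Tagxedo, Tagul, and Tagcrowd, word clouds can also be produced fairly easily in Python with the wordcloud package (see the official site and the GitHub project in the references):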

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    
    # Read the whole text.
    text = open('constitution.txt').read()
    
    # Generate a word cloud image
    wordcloud = WordCloud().generate(text)
    
    # Display the generated image:
    # the matplotlib way:
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")

    The generated word cloud looks like this:

    An image can also be used as a mask:

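    # Assumes: import numpy as np; from os import path; from PIL import Image
    # d is the directory containing alice_mask.png; stopwords is a set of words to skip
    alice_mask = np.array(Image.open(path.join(d, "alice_mask.png")))
    wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask, stopwords=stopwords, contour_width=3, contour_color='steelblue')
    wc.generate(text)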

    2. Installation

    pip install wordcloud

    Word cloud: resolving the "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1" error during pip install wordcloud

      

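    3. How wordcloud works: a source-code analysis

    Overall, wordcloud does three things:

    (1) Text preprocessing

    (2) Word-frequency counting

    (3) Rendering the high-frequency words in color as an image

    As the code above shows, a single call to wordcloud.generate(text) performs all three steps.

    Source code: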

    def generate(self, text):
        """Generate wordcloud from text.
    
        The input "text" is expected to be a natural text. If you pass a sorted
        list of words, words will appear in your output twice. To remove this
        duplication, set ``collocations=False``.
    
        Alias to generate_from_text.
    
        Calls process_text and generate_from_frequencies.
    
        Returns
        -------
        self
        """
        return self.generate_from_text(text)
    
    def generate_from_text(self, text):
        """Generate wordcloud from text.
    
        The input "text" is expected to be a natural text. If you pass a sorted
        list of words, words will appear in your output twice. To remove this
        duplication, set ``collocations=False``.
    
        Calls process_text and generate_from_frequencies.
    
        ..versionchanged:: 1.2.2
            Argument of generate_from_frequencies() is not return of
            process_text() any more.
    
        Returns
        -------
        self
        """
        words = self.process_text(text)
        self.generate_from_frequencies(words)
        return self

    The call order is:

    generate(self, text)
    =>
    self.generate_from_text(text)
    =>
    words = self.process_text(text)
    self.generate_from_frequencies(words)

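    Here process_text(text) covers text preprocessing and word counting, while generate_from_frequencies(words) builds the word cloud from the counts.

    (1) process_text(text) mainly tokenizes the text and removes noise.

    Specifically, it performs the following operations:

    • Chooses the regex flags (re.UNICODE for unicode text on Python 2)
    • Tokenizes with a regular expression: keeps word characters (A-Za-z0-9_) and apostrophes ('), and drops single-character tokens
    • Removes stopwords
    • Removes the 's suffix (English only)
    • Removes tokens that consist only of digits
    • Counts unigrams and bigrams (unigrams_and_bigrams); optional, controlled by collocations

    The result is a dictionary, dict(string, int), mapping each token to its number of occurrences.

    A few points here need attention; they come up again later in the article.

    The source code is as follows: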

    def process_text(self, text):
        """Splits a long text into words, eliminates the stopwords.
    
        Parameters
        ----------
        text : string
            The text to be processed.
    
        Returns
        -------
        words : dict (string, int)
            Word tokens with associated frequency.
    
        ..versionchanged:: 1.2.2
            Changed return type from list of tuples to dict.
    
        Notes
        -----
        There are better ways to do word tokenization, but I don't want to
        include all those things.
        """
    
        stopwords = set([i.lower() for i in self.stopwords])
    
        flags = (re.UNICODE if sys.version < '3' and type(text) is unicode
                 else 0)
        regexp = self.regexp if self.regexp is not None else r"\w[\w']+"
    
        words = re.findall(regexp, text, flags)
        # remove stopwords
        words = [word for word in words if word.lower() not in stopwords]
        # remove 's
        words = [word[:-2] if word.lower().endswith("'s") else word
                 for word in words]
        # remove numbers
        words = [word for word in words if not word.isdigit()]
    
        if self.collocations:
            word_counts = unigrams_and_bigrams(words, self.normalize_plurals)
        else:
            word_counts, _ = process_tokens(words, self.normalize_plurals)
    
        return word_counts
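    The preprocessing steps listed above can be tried out in isolation with the standard library alone. The following is a minimal sketch, not the library's code: simple_process_text is a hypothetical name, and it leaves out the case merging, plural normalization, and bigram counting that wordcloud performs in process_tokens and unigrams_and_bigrams.

```python
import re
from collections import Counter


def simple_process_text(text, stopwords=(), regexp=r"\w[\w']+"):
    """Simplified stand-in for the steps described above (hypothetical helper)."""
    stop = {w.lower() for w in stopwords}
    # tokenize: a word character followed by one or more word characters or apostrophes
    words = re.findall(regexp, text)
    # remove stopwords (case-insensitive)
    words = [w for w in words if w.lower() not in stop]
    # strip a trailing 's
    words = [w[:-2] if w.lower().endswith("'s") else w for w in words]
    # remove tokens made only of digits
    words = [w for w in words if not w.isdigit()]
    return dict(Counter(words))


counts = simple_process_text("The cat's toy and the cat in 2018",
                             stopwords={"the", "and", "in"})
print(counts)  # {'cat': 2, 'toy': 1}
```

    Note that the stopword check runs before the 's suffix is stripped, which mirrors the order in the source shown above.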

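    (2) generate_from_frequencies(words) lays out the word cloud from the counts produced in the previous step.

    Specifically, it performs the following operations:

    • Sorts the word counts and normalizes them to the range 0 to 1 (dividing by the largest count) to get the frequencies
    • Creates the image and determines the initial font_size
    • Sets self.words_, which holds the max_words most frequent words and their normalized frequencies, i.e. dict(token, normalized_frequency)
    • Draws a greyscale image: the higher the frequency, the larger the font_size; a random number decides whether each word is horizontal or vertical
      • If the random number is less than self.prefer_horizontal the word is horizontal, otherwise vertical;
      • If there is not enough space, it first tries rotating the word, then shrinks the font
    • Sets self.layout_, which records each word and its frequency, font size, position, orientation, and color, i.e. list(zip(frequencies, font_sizes, positions, orientations, colors))

    As you can see, the main purpose of this function is to compute self.layout_, which holds the information needed to render the word cloud.

    Later, wordcloud.to_file(filename) or plt.imshow(wordcloud) presents the result as an image. to_file() first checks whether self.layout_ has been set and raises an error if it has not.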
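    The normalization and font-size steps can be sketched on their own. This is a simplified illustration, assuming that every word finds a free spot (so the font never has to shrink for lack of space); normalize and font_sizes are hypothetical names, and the starting size of 100 and the relative_scaling value of 0.5 are chosen only for the example.

```python
def normalize(counts, max_words=200):
    """Sort counts in descending order, keep the top max_words, scale so the largest is 1.0."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:max_words]
    if not ranked:
        raise ValueError("need at least one word")
    top = float(ranked[0][1])
    return [(word, count / top) for word, count in ranked]


def font_sizes(frequencies, start_size, relative_scaling=0.5):
    """Apply the relative-scaling update to the font size word by word."""
    sizes = []
    font_size = start_size
    last_freq = 1.0
    for _, freq in frequencies:
        if relative_scaling != 0:
            ratio = freq / last_freq
            font_size = int(round((relative_scaling * ratio
                                   + (1 - relative_scaling)) * font_size))
        sizes.append(font_size)
        last_freq = freq
    return sizes


freqs = normalize({"text": 2, "cloud": 8, "word": 4})
print(freqs)                   # [('cloud', 1.0), ('word', 0.5), ('text', 0.25)]
print(font_sizes(freqs, 100))  # [100, 75, 56]
```

    With relative_scaling set to 0 the font size no longer depends on the frequency ratio, so in this sketch every word keeps the starting size.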

    The source code is as follows:

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        """Create a word_cloud from words and frequencies.
    
        Parameters
        ----------
        frequencies : dict from string to float
            A contains words and associated frequency.
    
        max_font_size : int
            Use this font-size instead of self.max_font_size
    
        Returns
        -------
        self
    
        """
        # make sure frequencies are sorted and normalized
        frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
        if len(frequencies) <= 0:
            raise ValueError("We need at least 1 word to plot a word cloud, "
                             "got %d." % len(frequencies))
        frequencies = frequencies[:self.max_words]
    
        # largest entry will be 1
        max_frequency = float(frequencies[0][1])
    
        frequencies = [(word, freq / max_frequency)
                       for word, freq in frequencies]
    
        if self.random_state is not None:
            random_state = self.random_state
        else:
            random_state = Random()
    
        if self.mask is not None:
            mask = self.mask
            width = mask.shape[1]
            height = mask.shape[0]
            if mask.dtype.kind == 'f':
                warnings.warn("mask image should be unsigned byte between 0"
                              " and 255. Got a float array")
            if mask.ndim == 2:
                boolean_mask = mask == 255
            elif mask.ndim == 3:
                # if all channels are white, mask out
                boolean_mask = np.all(mask[:, :, :3] == 255, axis=-1)
            else:
                raise ValueError("Got mask of invalid shape: %s"
                                 % str(mask.shape))
        else:
            boolean_mask = None
            height, width = self.height, self.width
        occupancy = IntegralOccupancyMap(height, width, boolean_mask)
    
        # create image
        img_grey = Image.new("L", (width, height))
        draw = ImageDraw.Draw(img_grey)
        img_array = np.asarray(img_grey)
        font_sizes, positions, orientations, colors = [], [], [], []
    
        last_freq = 1.
    
        if max_font_size is None:
            # if not provided use default font_size
            max_font_size = self.max_font_size
    
        if max_font_size is None:
            # figure out a good font size by trying to draw with
            # just the first two words
            if len(frequencies) == 1:
                # we only have one word. We make it big!
                font_size = self.height
            else:
                self.generate_from_frequencies(dict(frequencies[:2]),
                                               max_font_size=self.height)
                # find font sizes
                sizes = [x[1] for x in self.layout_]
                try:
                    font_size = int(2 * sizes[0] * sizes[1] 
                                    / (sizes[0] + sizes[1]))
                # quick fix for if self.layout_ contains less than 2 values
                # on very small images it can be empty
                except IndexError:
                    try:
                        font_size = sizes[0]
                    except IndexError:
                        raise ValueError('canvas size is too small')
        else:
            font_size = max_font_size
    
        # we set self.words_ here because we called generate_from_frequencies
        # above... hurray for good design?
        self.words_ = dict(frequencies)
    
        # start drawing grey image
        for word, freq in frequencies:
            # select the font size
            rs = self.relative_scaling
            if rs != 0:
                font_size = int(round((rs * (freq / float(last_freq))
                                       + (1 - rs)) * font_size))
            if random_state.random() < self.prefer_horizontal:
                orientation = None
            else:
                orientation = Image.ROTATE_90
            tried_other_orientation = False
            while True:
                # try to find a position
                font = ImageFont.truetype(self.font_path, font_size)
                # transpose font optionally
                transposed_font = ImageFont.TransposedFont(
                    font, orientation=orientation)
                # get size of resulting text
                box_size = draw.textsize(word, font=transposed_font)
                # find possible places using integral image:
                result = occupancy.sample_position(box_size[1] + self.margin,
                                                   box_size[0] + self.margin,
                                                   random_state)
                if result is not None or font_size < self.min_font_size:
                    # either we found a place or font-size went too small
                    break
                # if we didn't find a place, make font smaller
                # but first try to rotate!
                if not tried_other_orientation and self.prefer_horizontal < 1:
                    orientation = (Image.ROTATE_90 if orientation is None else
                                   Image.ROTATE_90)
                    tried_other_orientation = True
                else:
                    font_size -= self.font_step
                    orientation = None
    
            if font_size < self.min_font_size:
                # we were unable to draw any more
                break
    
            x, y = np.array(result) + self.margin // 2
            # actually draw the text
            draw.text((y, x), word, fill="white", font=transposed_font)
            positions.append((x, y))
            orientations.append(orientation)
            font_sizes.append(font_size)
            colors.append(self.color_func(word, font_size=font_size,
                                          position=(x, y),
                                          orientation=orientation,
                                          random_state=random_state,
                                          font_path=self.font_path))
            # recompute integral image
            if self.mask is None:
                img_array = np.asarray(img_grey)
            else:
                img_array = np.asarray(img_grey) + boolean_mask
            # recompute bottom right
            # the order of the cumsum's is important for speed ?!
            occupancy.update(img_array, x, y)
            last_freq = freq
    
        self.layout_ = list(zip(frequencies, font_sizes, positions,
                                orientations, colors))
        return self       
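    The source above finds free spots "using integral image". The idea can be illustrated in pure Python: with a summed-area table, the number of occupied pixels inside any rectangle is obtained from four lookups, so every candidate position can be checked cheaply. This is only a sketch of the technique, not the library's IntegralOccupancyMap; the function names and the tiny grid are made up for illustration.

```python
import random


def integral_image(grid):
    """Summed-area table with a zero border: sat[i][j] = sum of grid[:i][:j]."""
    h, w = len(grid), len(grid[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        row_sum = 0
        for j in range(w):
            row_sum += grid[i][j]
            sat[i + 1][j + 1] = sat[i][j + 1] + row_sum
    return sat


def free_positions(grid, box_h, box_w):
    """Top-left corners (row, col) where a box_h x box_w box covers only zeros."""
    sat = integral_image(grid)
    h, w = len(grid), len(grid[0])
    spots = []
    for i in range(h - box_h + 1):
        for j in range(w - box_w + 1):
            occupied = (sat[i + box_h][j + box_w] - sat[i][j + box_w]
                        - sat[i + box_h][j] + sat[i][j])
            if occupied == 0:
                spots.append((i, j))
    return spots


def sample_position(grid, box_h, box_w, rng):
    """Pick one free position at random, or None if the box fits nowhere."""
    spots = free_positions(grid, box_h, box_w)
    return rng.choice(spots) if spots else None


# 1 marks an occupied pixel, 0 a free one
grid = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
print(free_positions(grid, 2, 2))  # [(0, 0), (1, 1), (1, 2)]
```

    A None result corresponds to the case in the drawing loop where no place is found, which is what triggers the rotate-then-shrink fallback.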

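    4. Points to note when applying wordcloud to a Chinese corpus

    The wordcloud package was released by Andreas Mueller as version 1.0.0 on 2015-03-20; at the time of writing, the latest version is 1.4.1, released on 2018-03-13.

    English text can be fed to wordcloud directly, but for Chinese text wordcloud alone cannot produce a proper Chinese word cloud.

    The reason:

    English words are separated by spaces, and as the process_text(text) source above shows, the text is tokenized directly with a regular expression (r"\w[\w']+" by default):

    In  : re.findall(r"\w[\w']+", "It's Monday today.")
    Out: ["It's", 'Monday', 'today']

    In Chinese, however, words are generally not separated by any character:

    In : re.findall(r"\w[\w']+", "今天天气不错,蓝天白云,还有温暖的阳光 哈 哈哈")
    Out: ['今天天气不错', '蓝天白云', '还有温暖的阳光', '哈哈']

    So the stock wordcloud is designed for English: it strips punctuation (except the apostrophe ') and splits the text into tokens.

    For Chinese text, segment the text into words first, join the words with spaces into a single string, and only then pass it to wordcloud.

    Also note that, for both English and Chinese, single-character tokens are dropped by default (because regexp = self.regexp if self.regexp is not None else r"\w[\w']+" ). To keep single characters, set the regexp parameter to r"\w[\w']*" .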
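    The effect of pre-segmenting the text and of the two regular expressions can be checked with the re module alone. In this small sketch the segmentation is written by hand for illustration (in practice it would come from a segmenter such as jieba), and the constant names are hypothetical.

```python
import re

DEFAULT = r"\w[\w']+"      # default pattern: tokens of at least two characters
KEEP_SINGLE = r"\w[\w']*"  # variant that also keeps single-character tokens

raw = "今天天气不错 哈 哈哈"
# hand-segmented tokens joined with spaces, as wordcloud expects
segmented = " ".join(["今天", "天气", "不错", "哈", "哈哈"])

raw_tokens = re.findall(DEFAULT, raw)
default_tokens = re.findall(DEFAULT, segmented)
all_tokens = re.findall(KEEP_SINGLE, segmented)

print(raw_tokens)      # ['今天天气不错', '哈哈']
print(default_tokens)  # ['今天', '天气', '不错', '哈哈']
print(all_tokens)      # ['今天', '天气', '不错', '哈', '哈哈']
```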

    import logging

    import matplotlib.pyplot as plt
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    def generate_wordcloud(text, max_words=200, pic_path=None):
        """
        Generate a word cloud.
        :param text: a string of tokens separated by spaces
        :param max_words: maximum number of words to draw
        :param pic_path: output image path; if None, the image is shown instead
        :return: dict mapping each word to its normalized frequency
        """
        # scipy.misc.imread was removed in SciPy 1.2, so load the mask with Pillow and NumPy
        mk = np.array(Image.open("tuoyuan.jpg"))
        wc = WordCloud(font_path="/usr/share/fonts/myfonts/msyh.ttf", background_color="white", max_words=max_words,
                       mask=mk, width=1000, height=500, max_font_size=100, prefer_horizontal=0.95, collocations=False)
        wc.generate(text=text)
        if pic_path:
            wc.to_file(pic_path)
        else:
            plt.imshow(wc)
            plt.axis("off")
            plt.show()
        return wc.words_

    def run_wordcloud(corpus, max_words, pic_path=None):
        # corpus is a list of token lists; join the segmented tokens with spaces
        text = " ".join([" ".join(line) for line in corpus])
        word2weight = generate_wordcloud(text=text, max_words=max_words, pic_path=pic_path)
        word2weight_sorted = sorted(word2weight.items(), key=lambda x: x[1], reverse=True)
        logging.info([(k, float("%.5f" % v)) for k, v in word2weight_sorted])

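    For more, see word_cloud/examples/wordcloud_cn.py

    5. Rewriting the code

    A word cloud is meant to show the key information in a corpus at a glance. In my own work the main goal is to extract that key information; how it is presented visually matters much less.

    So, having understood how the wordcloud source works, I decided to implement it myself.

    On the one hand, this makes the implementation more transparent: where efficiency is comparable it avoids a third-party library, keeps the results controllable, and may even run faster.

    On the other hand, it allows problems to be handled more flexibly according to the actual situation.

    Preprocessing for Chinese can be done together with word segmentation. The steps here are: word segmentation with part-of-speech tagging, lowercasing, stopword removal, digit removal, single-character removal, and keeping only tokens with specified parts of speech.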

    import jieba
    import jieba.posseg as pseg

    class Utils(object):
        def __init__(self, utils_data=None):
            # utils_data: dict with optional "user_dict" and "stopwords" lists
            self.stopwords = self._init_utils(utils_data or {})
            self.pos_save = {
                "n", "an", "Ng", "nr", "ns", "nt", "nz", "vn", "un",  # noun-like tags
                "v", "vg", "vd",  # verb-like tags
                "a", "ag", "ad",  # adjective-like tags
                "j", "l", "i", "z", "b", "g", "s", "h",  # j abbreviation, l fixed expression, i idiom, z state word, b distinguishing word, g morpheme, s place word, h prefix
                "zg", "eng",
                "x"}  # unknown (user-defined words)

        def _init_utils(self, utils_data):
            # register user-defined words with jieba and return the stopword set
            for wd in utils_data.get("user_dict", []):
                jieba.add_word(wd)
            return set(utils_data.get("stopwords", []))

        def _token_filter(self, token):  # drop stopwords, digits, and single characters
            return token not in self.stopwords and not token.isdigit() and len(token) >= 2

        def _token_filter_with_flag(self, pair_word_flag):  # keep only the specified parts of speech
            return self._token_filter(pair_word_flag.word) and pair_word_flag.flag in self.pos_save

        def cut(self, text):
            return list(filter(self._token_filter, list(jieba.cut(text.lower()))))  # segment; lowercase

        def cut_with_flag(self, text):
            pairs = list(filter(self._token_filter_with_flag,  list(pseg.cut(text.lower()))))  # segment with POS tags; lowercase
            return [p.word for p in pairs]
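    The filtering rules can be exercised without jieba by feeding in hand-written (word, flag) pairs. The sketch below is only an illustration: keep_token, the small stopword set, the reduced tag whitelist, and the tagged pairs are all made up for the example and are not output from jieba.

```python
STOPWORDS = {"的", "了"}    # illustrative stopword set
POS_KEEP = {"n", "v", "a"}  # illustrative subset of the tags to keep


def keep_token(word, flag, stopwords=STOPWORDS, pos_keep=POS_KEEP):
    """Apply the same four rules: stopword, digit, single character, POS whitelist."""
    return (word not in stopwords
            and not word.isdigit()
            and len(word) >= 2
            and flag in pos_keep)


# hand-written (word, flag) pairs standing in for a tagger's output
pairs = [("今天", "t"), ("天气", "n"), ("的", "uj"),
         ("2018", "m"), ("好", "a"), ("喜欢", "v")]
kept = [word for word, flag in pairs if keep_token(word, flag)]
print(kept)  # ['天气', '喜欢']
```

    Here "今天" is dropped by the tag whitelist, "的" by the stopword set, "2018" by the digit rule, and "好" by the length rule.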

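    After segmentation and the other preprocessing steps, simply count each word and its number of occurrences. To be more intuitive, the output here is the raw count rather than the normalized frequency. The resulting ranking is equivalent to wordcloud's.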

    import logging
    from collections import Counter

    def word_count(corpus, n_gram=1, n=None):
        """Count unigrams (n_gram=1) or ordered bigrams (n_gram=2) over tokenized lines; return the n most common."""
        counter = Counter()
        if n_gram == 1:
            for line in corpus:
                counter.update(line)
        elif n_gram == 2:
            for line in corpus:
                size = len(line)
                counter.update(["%s_%s" % (line[idx], line[idx + 1]) for idx in range(size) if idx + 1 < size])  # ordered pairs
        else:
            logging.error("Invalid value of param n_gram: %s (only 1 or 2 accepted)", n_gram)
        return counter.most_common(n=n)
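    As a quick check of the counting logic, here is a compact, self-contained variant run on a toy corpus. It is a sketch rather than a replacement: count_ngrams is a hypothetical name, the corpus is invented, and the ordered bigrams are formed by zipping each line with itself shifted by one token, which yields the same adjacent pairs as the index-based loop.

```python
from collections import Counter


def count_ngrams(corpus, n_gram=1):
    """Count unigrams, or ordered adjacent bigrams joined with an underscore."""
    counter = Counter()
    for line in corpus:
        if n_gram == 1:
            counter.update(line)
        else:
            counter.update("%s_%s" % pair for pair in zip(line, line[1:]))
    return counter


# toy corpus: each line is an already segmented list of tokens
corpus = [["词云", "关键词", "提取"], ["词云", "关键词"], ["词云"]]
unigrams = count_ngrams(corpus, 1).most_common()
bigrams = count_ngrams(corpus, 2).most_common()
print(unigrams)  # [('词云', 3), ('关键词', 2), ('提取', 1)]
print(bigrams)   # [('词云_关键词', 2), ('关键词_提取', 1)]
```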

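    In addition, one can count co-occurrences of high-frequency words and map high-frequency words or word pairs back to the sentences that contain them, which helps generalize from high-frequency words to high-frequency sentence types.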
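    One possible way to do both is sketched below with the standard library: count how often pairs of high-frequency words share a line, and build an index from each word to the lines containing it. The function names and the toy corpus are assumptions made for illustration, not part of the original implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(corpus, top_words):
    """Count unordered pairs of top words that appear in the same line."""
    top = set(top_words)
    pairs = Counter()
    for line in corpus:
        present = sorted(set(line) & top)  # sort so each pair has one canonical order
        pairs.update(combinations(present, 2))
    return pairs


def sentences_by_word(corpus, top_words):
    """Map each top word to the indices of the lines that contain it."""
    top = set(top_words)
    index = defaultdict(list)
    for i, line in enumerate(corpus):
        for word in sorted(set(line) & top):
            index[word].append(i)
    return dict(index)


# toy corpus of segmented lines
corpus = [["word", "cloud", "python"], ["word", "cloud"], ["python", "jieba"]]
top_words = ["word", "cloud", "python"]
pair_counts = cooccurrence(corpus, top_words)
word_index = sentences_by_word(corpus, top_words)
print(pair_counts[("cloud", "word")])  # 2
print(word_index["python"])            # [0, 2]
```

    The line indices returned by sentences_by_word can then be used to pull up the original sentences for manual review.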

    References:

    https://pypi.org/project/wordcloud/

    https://github.com/amueller/word_cloud

    http://python.jobbole.com/87496/

    https://www.jianshu.com/p/ead991a08563

    https://blog.csdn.net/qq_34739497/article/details/78285972

    https://www.cnblogs.com/sunnyeveryday/p/7043399.html

    https://www.cnblogs.com/naraka/p/8992058.html

    https://www.cnblogs.com/franklv/p/6995150.html

    https://blog.csdn.net/Tang_Chuanlin/article/details/79862505

    https://www.cnblogs.com/zjutlitao/archive/2016/08/04/5734876.html

  • Original article: https://www.cnblogs.com/bymo/p/9334981.html