• Extracting data with goose


    1. Introduction

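    Python-goose is a Python rewrite of Goose, an article extraction tool originally written in Java. Given any news article or article-style web page, it aims to extract not only the main body of the article but also all of its metadata and images. Chinese web pages are supported.
    Python-goose can extract the following: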

    • Main text of the article
    • Main image of the article
    • Any YouTube/Vimeo videos embedded in the article
    • Meta description
    • Meta tags

    2. Installation

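    # Create an isolated environment named "goose".
    # Newer virtualenv releases isolate from site-packages by default and
    # no longer accept --no-site-packages; drop the flag there.
    virtualenv --no-site-packages goose
    cd goose
    # On Windows
    Scripts\activate
    # On Linux, use this instead: source bin/activate
    git clone https://github.com/grangier/python-goose.git
    cd python-goose
    pip install -r requirements.txt
    python setup.py install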

    3. Usage

    >>> from goose import Goose
    >>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
    >>> g = Goose()
    >>> article = g.extract(url=url)
    >>> article.title
    u'Occupy London loses eviction fight'
    >>> article.meta_description
    "Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
    >>> article.cleaned_text[:150]
    (CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
    >>> article.top_image.src
    http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
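
The fields read one by one in the session above can also be collected in a single step. The following is only a minimal sketch: `summarize_article` and `ArticleStub` are hypothetical names introduced here for illustration, not part of goose. The helper relies solely on the attribute names already used above (`title`, `meta_description`, `cleaned_text`, `top_image.src`) and assumes that `top_image` may be missing when a page has no main image.

```python
class ArticleStub(object):
    """Hypothetical stand-in for an extracted article (not part of goose)."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


def summarize_article(article, preview_chars=150):
    """Collect the commonly used fields of an article-like object into a dict."""
    top_image = getattr(article, "top_image", None)
    text = getattr(article, "cleaned_text", "") or ""
    return {
        "title": getattr(article, "title", None),
        "description": getattr(article, "meta_description", None),
        "preview": text[:preview_chars],
        # Assumption: top_image can be None, so read .src defensively.
        "image": getattr(top_image, "src", None),
    }


# Demo data taken from the session above; the stub has no image on purpose.
demo = ArticleStub(
    title=u"Occupy London loses eviction fight",
    meta_description=None,
    cleaned_text=u"(CNN) -- Occupy London protesters who have been camped outside",
    top_image=None,
)
summary = summarize_article(demo)
print(summary["title"])
print(summary["image"])
```

With python-goose installed, the object returned by `g.extract(url=url)` could be passed to `summarize_article` in place of the stub.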
    

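      For Chinese articles, pass the Chinese stop-words class when creating the Goose instance:

    from goose.text import StopWordsChinese
    g = Goose({'browser_user_agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36', 'stopwords_class': StopWordsChinese})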
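
When both English and Chinese pages are processed, it can help to build the configuration dict in one place. The sketch below is only an illustration under stated assumptions: `build_goose_config`, `DEFAULT_USER_AGENT`, and `FakeStopWords` are hypothetical names, and the only configuration keys used are the two that appear above (`browser_user_agent` and `stopwords_class`). The placeholder class stands in for a real stop-words class so that the example runs without goose installed.

```python
# User agent string copied from the configuration shown above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36"
)


def build_goose_config(stopwords_class=None, user_agent=DEFAULT_USER_AGENT):
    """Return a config dict; add a stop-words class only when one is given."""
    config = {"browser_user_agent": user_agent}
    if stopwords_class is not None:
        config["stopwords_class"] = stopwords_class
    return config


class FakeStopWords(object):
    """Hypothetical placeholder for a real stop-words class."""


config = build_goose_config(stopwords_class=FakeStopWords)
print(sorted(config))
```

With python-goose installed, the placeholder would be replaced by `StopWordsChinese` imported from `goose.text`, and the resulting dict passed to `Goose(...)`.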

    Reference:

    https://pypi.python.org/pypi/goose-extractor/

  • Original article: https://www.cnblogs.com/hupeng1234/p/6685395.html