• Querying Wikipedia content through the Wikipedia API


    Wikipedia provides an API that lets us work with its content. The API documentation is at:

    http://en.wikipedia.org/w/api.php

    Some common uses:

    1. Full-text search

    http://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=fluoxetine

    srsearch is the text to search for.

    Result:

    <?xml version="1.0"?>
    <api>
      <query>
        <searchinfo totalhits="224" />
        <search>
          <p ns="0" title="Fluoxetine" snippet="&lt;span class=&#039;searchmatch&#039;&gt;Fluoxetine&lt;/span&gt; (also known by the tradenames Prozac, Sarafem) is an antidepressant  of the selective serotonin reuptake inhibitor  (SSRI) class &lt;b&gt;...&lt;/b&gt; " size="53978" wordcount="7052" timestamp="2010-10-31T23:22:00Z" />
          <p ns="0" title="Olanzapine/fluoxetine" snippet="The drug  combination olanzapine/&lt;span class=&#039;searchmatch&#039;&gt;fluoxetine&lt;/span&gt; (trade name Symbyax, created by Eli Lilly and Company ) is a single capsule containing the  &lt;b&gt;...&lt;/b&gt; " size="5703" wordcount="629" timestamp="2010-09-21T09:10:34Z" />
          <p ns="0" title="Sertraline" snippet="Evidence suggests that sertraline may work better than &lt;span class=&#039;searchmatch&#039;&gt;fluoxetine&lt;/span&gt;  (Prozac) for some subtypes of depression.  Sertraline is highly  &lt;b&gt;...&lt;/b&gt; " size="104510" wordcount="13933" timestamp="2010-10-28T22:13:04Z" />
          <p ns="0" title="Antidepressant" snippet="The first such compound to be patented was zimelidine  in 1971, while the first released clinically was indalpine . &lt;span class=&#039;searchmatch&#039;&gt;Fluoxetine&lt;/span&gt;  was  &lt;b&gt;...&lt;/b&gt; " size="128712" wordcount="17532" timestamp="2010-10-30T08:05:06Z" />
          <p ns="0" title="Selective serotonin reuptake inhibitor" snippet="four newer antidepressants (including the SSRIs paroxetine  and &lt;span class=&#039;searchmatch&#039;&gt;fluoxetine&lt;/span&gt; , and two non-SSRI antidepressants nefazodone  and venlafaxine ).  &lt;b&gt;...&lt;/b&gt; " size="78327" wordcount="10398" timestamp="2010-11-01T00:11:30Z" />
          <p ns="0" title="Paroxetine" snippet="Unlike two other popular SSRI antidepressants, &lt;span class=&#039;searchmatch&#039;&gt;fluoxetine&lt;/span&gt;  and sertraline , paroxetine is associated with clinically significant weight  &lt;b&gt;...&lt;/b&gt; " size="48886" wordcount="6491" timestamp="2010-10-31T23:11:12Z" />
          <p ns="0" title="Venlafaxine" snippet="Its efficacy is similar to or better than sertraline  (Zoloft) and &lt;span class=&#039;searchmatch&#039;&gt;fluoxetine&lt;/span&gt;  (Prozac), depending on the criteria and rating scales used &lt;b&gt;...&lt;/b&gt; " size="49655" wordcount="6574" timestamp="2010-11-01T00:38:00Z" />
          <p ns="0" title="Olanzapine" snippet="Olanzapine (trade names Zyprexa, Zalasta, Zolafren, Olzapin, Oferta, Zypadhera or in combination with &lt;span class=&#039;searchmatch&#039;&gt;fluoxetine&lt;/span&gt;  Symbyax ) is an atypical  &lt;b&gt;...&lt;/b&gt; " size="34028" wordcount="4540" timestamp="2010-10-30T17:45:42Z" />
          <p ns="0" title="Prozac (disambiguation)" snippet="Prozac  is a proprietary name for the antidepressant drug &lt;span class=&#039;searchmatch&#039;&gt;fluoxetine&lt;/span&gt;. Prozac may also refer to:  Prozac+ , an Italian punk band &lt;b&gt;...&lt;/b&gt; " size="581" wordcount="78" timestamp="2010-04-23T20:24:31Z" />
          <p ns="0" title="SSRI discontinuation syndrome" snippet="paroxetine  having the highest number of withdrawal syndrome reports and &lt;span class=&#039;searchmatch&#039;&gt;fluoxetine&lt;/span&gt;  the highest number of drug dependence reports; the note &lt;b&gt;...&lt;/b&gt; " size="41099" wordcount="5444" timestamp="2010-09-23T06:19:55Z" />
        </search>
      </query>
      <query-continue>
        <search sroffset="10" />
      </query-continue>
    </api>
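
As a rough illustration of how the search call and its response could be handled from code, here is a minimal Python sketch. It assumes each result row is a `<p>` element under `query/search`, and that paging uses the `sroffset` value reported under `query-continue`, as in the response above. The function names and the abbreviated `SAMPLE` string are made up for this example, `format=xml` is added explicitly as an assumption, and no live request is made.

```python
import urllib.parse
import xml.etree.ElementTree as ET

API_URL = "http://en.wikipedia.org/w/api.php"  # endpoint quoted in the article


def build_search_url(term, offset=0):
    """Build a full-text search URL (hypothetical helper, not part of any library)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,     # the text to search for
        "sroffset": offset,   # where to resume when paging through results
        "format": "xml",      # assumption: ask for XML explicitly
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_search_results(xml_text):
    """Return (totalhits, list of titles, next offset or None)."""
    root = ET.fromstring(xml_text)
    info = root.find("query/searchinfo")
    total = int(info.get("totalhits")) if info is not None else 0
    titles = [p.get("title") for p in root.findall("query/search/p")]
    cont = root.find("query-continue/search")
    next_offset = int(cont.get("sroffset")) if cont is not None else None
    return total, titles, next_offset


# Abbreviated copy of the response above (snippets omitted), used offline.
SAMPLE = """<?xml version="1.0"?>
<api>
  <query>
    <searchinfo totalhits="224" />
    <search>
      <p ns="0" title="Fluoxetine" size="53978" wordcount="7052" />
      <p ns="0" title="Sertraline" size="104510" wordcount="13933" />
    </search>
  </query>
  <query-continue>
    <search sroffset="10" />
  </query-continue>
</api>"""

total, titles, next_offset = parse_search_results(SAMPLE)
print(total, titles, next_offset)
```

To fetch the next page, the returned offset would be passed back in, for example `build_search_url("fluoxetine", offset=next_offset)`. Newer MediaWiki versions may report continuation differently, so the `query-continue` handling here only matches the response format shown in this article.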

    2. List Wikipedia categories:

    http://en.wikipedia.org/w/api.php?action=query&list=allcategories&acprefix=drug&aclimit=10

    Returns 10 categories whose names start with "drug".

    Result:

    <?xml version="1.0"?>
    <api>
      <query>
        <allcategories>
          <c xml:space="preserve">Drug-induced Suicide</c>
          <c xml:space="preserve">Drug-realted suicides</c>
          <c xml:space="preserve">Drug-related Films</c>
          <c xml:space="preserve">Drug-related Suicides</c>
          <c xml:space="preserve">Drug-related death in California</c>
          <c xml:space="preserve">Drug-related deaths</c>
          <c xml:space="preserve">Drug-related deaths by country</c>
          <c xml:space="preserve">Drug-related deaths in Alabama</c>
          <c xml:space="preserve">Drug-related deaths in Alaska</c>
          <c xml:space="preserve">Drug-related deaths in Arizona</c>
        </allcategories>
      </query>
      <query-continue>
        <allcategories acfrom="Drug-related deaths in Arkansas" />
      </query-continue>
    </api>
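
The `query-continue` element in this response tells the client where the next batch starts. The sketch below shows one way that could be used, assuming each category is a `<c>` element and that the reported `acfrom` value is sent back as the `acfrom` parameter of the next request. The helper names and the shortened `SAMPLE` are invented for illustration, and nothing is fetched over the network.

```python
import xml.etree.ElementTree as ET


def parse_categories(xml_text):
    """Return (category names, acfrom value for the next request or None)."""
    root = ET.fromstring(xml_text)
    names = [c.text for c in root.findall("query/allcategories/c")]
    cont = root.find("query-continue/allcategories")
    next_from = cont.get("acfrom") if cont is not None else None
    return names, next_from


def next_request_params(prefix, limit, next_from=None):
    """Build the query parameters for the first or a follow-up request."""
    params = {
        "action": "query",
        "list": "allcategories",
        "acprefix": prefix,
        "aclimit": limit,
        "format": "xml",  # assumption: ask for XML explicitly
    }
    if next_from is not None:
        params["acfrom"] = next_from  # resume from the reported category
    return params


# Shortened copy of the response above, used offline.
SAMPLE = """<?xml version="1.0"?>
<api>
  <query>
    <allcategories>
      <c xml:space="preserve">Drug-related deaths in Alaska</c>
      <c xml:space="preserve">Drug-related deaths in Arizona</c>
    </allcategories>
  </query>
  <query-continue>
    <allcategories acfrom="Drug-related deaths in Arkansas" />
  </query-continue>
</api>"""

names, next_from = parse_categories(SAMPLE)
print(names, next_from)
print(next_request_params("drug", 10, next_from))
```

A caller would repeat the request with the updated parameters until no continuation value comes back. Newer MediaWiki versions may use a different continuation format than `query-continue`.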

    3. Return the timestamp|user|comment|content information for the page with the given title:

    http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=api&rvprop=timestamp|user|comment|content

    Result:

    <?xml version="1.0"?>
    <api>
      <query>
        <pages>
          <page pageid="27697087" ns="0" title="API">
            <revisions>
              <rev user="Graham87" timestamp="2010-06-13T08:41:17Z" comment="Protected API: restore protection ([edit=sysop] (indefinite) [move=sysop] (indefinite))" xml:space="preserve">#REDIRECT [[Application programming interface]]{{R from abbreviation}}</rev>
            </revisions>
          </page>
        </pages>
      </query>
    </api>
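
A response like this one can be flattened into plain records with a few lines of Python. The following is a sketch only: `parse_revisions` is a made-up helper name, and `SAMPLE` is the response above pasted in as a string so that the example runs without a network connection.

```python
import xml.etree.ElementTree as ET


def parse_revisions(xml_text):
    """Return one dict per <rev> element, tagged with its page title."""
    root = ET.fromstring(xml_text)
    records = []
    for page in root.findall("query/pages/page"):
        for rev in page.findall("revisions/rev"):
            records.append({
                "title": page.get("title"),
                "user": rev.get("user"),
                "timestamp": rev.get("timestamp"),
                "comment": rev.get("comment"),
                "content": rev.text or "",  # wikitext of the revision
            })
    return records


# The response shown above, used offline.
SAMPLE = """<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page pageid="27697087" ns="0" title="API">
        <revisions>
          <rev user="Graham87" timestamp="2010-06-13T08:41:17Z" comment="Protected API: restore protection ([edit=sysop] (indefinite) [move=sysop] (indefinite))" xml:space="preserve">#REDIRECT [[Application programming interface]]{{R from abbreviation}}</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>"""

for record in parse_revisions(SAMPLE):
    print(record["title"], record["user"], record["timestamp"])
    print(record["content"])
```

Note that the `content` field holds wikitext (here a redirect), not rendered HTML.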

    4. Parse a page:

    http://en.wikipedia.org/w/api.php?action=parse&format=xml&page=fluoxetine

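    The [content] returned by the query above is in Wikipedia's markup format (wikitext), whereas this API returns the text as HTML.

    The HTML content can be extracted with the XPath api/parse/text.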
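
The XPath lookup can be sketched in Python with the standard library's ElementTree. Because ElementTree paths are relative to the root element (`api`), the path becomes `parse/text`. The `SAMPLE` string is a made-up, minimal stand-in for a real `action=parse` response, and `extract_html` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def extract_html(xml_text):
    """Return the HTML held in api/parse/text, or None if it is missing."""
    root = ET.fromstring(xml_text)   # root is the <api> element
    node = root.find("parse/text")   # equivalent to XPath api/parse/text
    return node.text if node is not None else None


# Hypothetical minimal response; a real one carries the full page HTML,
# escaped as character data inside <text>.
SAMPLE = """<?xml version="1.0"?>
<api>
  <parse>
    <text xml:space="preserve">&lt;p&gt;Example &lt;b&gt;HTML&lt;/b&gt;&lt;/p&gt;</text>
  </parse>
</api>"""

print(extract_html(SAMPLE))
```

The XML parser unescapes the entities, so the returned string is ordinary HTML markup that can be handed to an HTML parser or written to a file.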

    * action=parse *
      This module parses wikitext and returns parser output

    This module requires read rights.
    Parameters:
      title          - Title of page the text belongs to
                       Default: API
      text           - Wikitext to parse
      summary        - Summary to parse
      page           - Parse the content of this page. Cannot be used together with text and title
      redirects      - If the page parameter is set to a redirect, resolve it
      oldid          - Parse the content of this revision. Overrides page
      prop           - Which pieces of information to get.
                       NOTE: Section tree is only generated if there are more than 4 sections, or if the __TOC__ keyword is present
                       Values (separate with '|'): text, langlinks, categories, links, templates, images, externallinks, sections, revid, displaytitle, headitems, headhtml
                       Default: text|langlinks|categories|links|templates|images|externallinks|sections|revid|displaytitle
      pst            - Do a pre-save transform on the input before parsing it.
                       Ignored if page or oldid is used.
      onlypst        - Do a PST on the input, but don't parse it.
                       Returns PSTed wikitext. Ignored if page or oldid is used.
    Example:
      api.php?action=parse&text={{Project:Sandbox}}
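
Wikitext such as `{{Project:Sandbox}}` contains characters that should be URL-encoded before being placed in a query string. Below is a small sketch of building an `action=parse` URL from the parameters listed in the help text above. `build_parse_url` is a hypothetical helper, `format=xml` is an added assumption, and the check that `page` and `text` are not combined follows the help text's note on `page`.

```python
import urllib.parse

API_URL = "http://en.wikipedia.org/w/api.php"  # endpoint quoted in the article


def build_parse_url(text=None, page=None, prop=("text",)):
    """Build an action=parse URL for either raw wikitext or an existing page."""
    if (text is None) == (page is None):
        # The help text says page cannot be used together with text.
        raise ValueError("pass exactly one of text or page")
    params = {
        "action": "parse",
        "format": "xml",          # assumption: ask for XML explicitly
        "prop": "|".join(prop),   # values are separated with '|'
    }
    if text is not None:
        params["text"] = text
    else:
        params["page"] = page
    return API_URL + "?" + urllib.parse.urlencode(params)


print(build_parse_url(text="{{Project:Sandbox}}"))
print(build_parse_url(page="fluoxetine", prop=("text", "categories")))
```

Restricting `prop` to the pieces actually needed keeps the response smaller than the default, which returns most of the listed values.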

    Source: http://john2007.iteye.com/blog/800446

  • Original article: https://www.cnblogs.com/DIMON/p/5219995.html