• Scrapy Selectors



    1. References

    *Web Scraping with Python* — 2.2 Three approaches to scraping a page: re / lxml / BeautifulSoup

    Note that, internally, lxml actually converts CSS selectors into equivalent XPath selectors.

    The results show that, on our example page, Beautiful Soup was more than six times slower than the other two approaches. This is expected, because lxml and the regular-expression module are written in C, while BeautifulSoup is pure Python. An interesting point is that lxml performed almost as well as the regular expressions. Since lxml must parse the input into its internal format before it can search for elements, it incurs extra overhead; when multiple fields are scraped from the same page, this initial parsing cost is amortized and lxml becomes more competitive. It really is an impressive module!
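    To make the comparison concrete, here is a minimal benchmark sketch in the spirit of the book's test. The page snippet, selector strings, and iteration count below are invented for illustration, and the lxml CSS call assumes the cssselect package is installed.

    import re
    import timeit
    import lxml.html
    from bs4 import BeautifulSoup

    # a tiny stand-in page; the book benchmarks a real country page instead
    html = u'<table><tr id="places_area__row"><td class="w2p_fw">244,820 square kilometres</td></tr></table>'

    def with_re():
        return re.search(r'<tr id="places_area__row">.*?<td class="w2p_fw">(.*?)</td>', html).group(1)

    def with_lxml():
        tree = lxml.html.fromstring(html)
        return tree.cssselect('tr#places_area__row td.w2p_fw')[0].text_content()

    def with_bs():
        soup = BeautifulSoup(html, 'html.parser')
        return soup.find('tr', id='places_area__row').find('td', class_='w2p_fw').text

    for fn in (with_re, with_lxml, with_bs):
        print(fn.__name__, timeit.timeit(fn, number=1000))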

    2. Scrapy Selectors

    https://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors

    • BeautifulSoup drawback: slow
    • lxml: built on ElementTree
    • Scrapy selectors: the parsel library, built on top of lxml, which means they are very similar in speed and parsing accuracy.

    .css() and .xpath() return a SelectorList, i.e. a list of new selectors.
    .extract() and .re() extract (and, in the case of .re(), filter) the tag data; a short sketch follows.
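    A minimal sketch of the chaining (the markup and regex are invented for illustration):

    from parsel import Selector

    sel = Selector(text=u'<p class="msg">hello <b>world</b></p>')

    lst = sel.css('p.msg')           # SelectorList: a list of new Selector objects
    nested = lst.xpath('./b')        # .css()/.xpath() can be chained on the result
    print(nested.extract())          # ['<b>world</b>']
    print(lst.re(r'<b>(\w+)</b>'))   # ['world'] - the regex runs over each selector's data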

    import scrapy

    C:\Program Files\Anaconda2\Lib\site-packages\scrapy\__init__.py

    from scrapy.selector import Selector

    C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\__init__.py

    from scrapy.selector.unified import *

    C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\unified.py

    from parsel import Selector as _ParselSelector

    class Selector(_ParselSelector, object_ref):

    >>> from scrapy.selector import Selector
    >>> from scrapy.http import HtmlResponse
    When Selector is imported this way, the first positional argument for instantiating it is an HtmlResponse instance; to build a Selector from a plain str, pass it as a keyword: sel = Selector(text=doc).
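    A minimal sketch of the two ways to instantiate (the URL and markup are placeholders):

    from scrapy.http import HtmlResponse
    from scrapy.selector import Selector

    doc = u'<html><body><span class="text">hello</span></body></html>'

    # from a response object (first positional argument)
    response = HtmlResponse(url='http://example.com', body=doc, encoding='utf-8')
    sel_from_response = Selector(response)

    # from a plain string: must use the text= keyword
    sel_from_text = Selector(text=doc)

    print(sel_from_response.css('span.text::text').extract_first())  # hello
    print(sel_from_text.css('span.text::text').extract_first())      # hello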


    In [926]: from parsel import Selector
    
    In [927]: Selector?
    Init signature: Selector(self, text=None, type=None, namespaces=None, root=None, base_url=None, _expr=None)
    Docstring:
    :class:`Selector` allows you to select parts of an XML or HTML text using CSS
    or XPath expressions and extract data from it.
    
    ``text`` is a ``unicode`` object in Python 2 or a ``str`` object in Python 3
    
    ``type`` defines the selector type, it can be ``"html"``, ``"xml"`` or ``None`` (default).
    If ``type`` is ``None``, the selector defaults to ``"html"``.
    File:           c:\program files\anaconda2\lib\site-packages\parsel\selector.py
    Type:           type 


    doc=u"""
    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
            <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
            <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
            <a href="/author/Thomas-A-Edison">(about)</a>
    """
    
    sel = Selector(doc)
    
    sel.css('div.quote')
    [<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>]
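    Continuing with the same sel, the individual pieces of the quote can be pulled out like this (output abbreviated in the comments):

    quote = sel.css('div.quote')
    print(quote.css('span.text::text').extract_first())     # the quote text, including the curly quotes
    print(quote.css('small.author::text').extract_first())  # Thomas A. Edison
    print(quote.css('a::attr(href)').extract_first())       # /author/Thomas-A-Edison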

    3. Debugging with scrapy shell

    https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data

    G:\pydata\pycode\scrapysplash_cnblogs>scrapy shell "http://quotes.toscrape.com/page/1/"

    3.1 XPath vs CSS

    Comparison

    |  | CSS | XPath | Notes |
    | --- | --- | --- | --- |
    | has an attribute | response.css('div[class]') | response.xpath('//div[@class]') | CSS can be shortened to div.class or even .class; div#abc or #abc correspond to id="abc" |
    | match attribute value | response.css('div[class="quote"]') | response.xpath('//div[@class="quote"]') | response.xpath('//small[text()="Albert Einstein"]') |
    | match part of attribute value | response.css('div[class*="quo"]') | response.xpath('//div[contains(@class,"quo")]') | response.xpath('//small[contains(text(),"Einstein")]') |
    | extract attribute value | response.css('small::attr(class)') | response.xpath('//small/@class') | in CSS, text is not an attribute, so the two text()-based filters above have no CSS equivalent |
    | extract text | response.css('small::text') | response.xpath('//small/text()') |  |

    Usage

    In [135]: response.xpath('//small[@class="author"]').extract_first()
    In [122]: response.css('small.author').extract_first()
    Out[122]: u'<small class="author" itemprop="author">Albert Einstein</small>'
    
    In [136]: response.xpath('//small[@class="author"]/text()').extract_first()
    In [123]: response.css('small.author::text').extract_first()
    Out[123]: u'Albert Einstein'
    
    In [137]: response.xpath('//small[@class="author"]/@class').extract_first()  # class is also just an attribute
    In [124]: response.css('small.author::attr(class)').extract_first()
    Out[124]: u'author'
    
    In [138]: response.xpath('//small[@class="author"]/@itemprop').extract_first()
    In [125]: response.css('small.author::attr(itemprop)').extract_first()
    Out[125]: u'author' 

    class is a special attribute that can hold multiple values, e.g. class="row header-box"

    # match one of several class values
    In [228]: response.css('div.row')
    Out[228]:
    [<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row header-box"> '>,
     <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row"> <div class="col-md'>]

    In [232]: response.css('div.ro')
    Out[232]: []

    # whole class attribute value, match the full string
    In [226]: response.css('div[class="row"]')
    Out[226]: [<Selector xpath=u"descendant-or-self::div[@class = 'row']" data=u'<div class="row"> <div class="col-md'>]

     In [240]: response.xpath('//div[@class="row header-box"]')
     Out[240]: [<Selector xpath='//div[@class="row header-box"]' data=u'<div class="row header-box"> '>]

    
    # whole class attribute value, match a substring
    In [229]: response.css('div[class*="row"]')
    Out[229]:
    [<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row header-box">
               '>,
     <Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row">
        <div class="col-md'>]
    
    In [230]: response.xpath('//div[contains(@class,"row")]')
    Out[230]:
    [<Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row header-box">
               '>,
     <Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row">
        <div class="col-md'>]
    
    In [234]: response.css('div[class*="w h"]')
    Out[234]: [<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'w h')]" data=u'<div class="row header-box">
               '>]
    
    In [235]: response.xpath('//div[contains(@class,"w h")]')
    Out[235]: [<Selector xpath='//div[contains(@class,"w h")]' data=u'<div class="row header-box">
               '>]

    3.2 Extracting data

    • Extracting data — see the sketch after this list

      selectorList / selector: .extract(), .extract_first()

        selector.extract() returns a str; a plain selector has no extract_first(), so selector.extract_first() raises an error.

        selectorList.extract() calls selector.extract() on each selector and returns a list of str; selectorList.extract_first() returns the first element of that list.

    • Extracting data while filtering

      selectorList / selector: .re(r'xxx'), .re_first(r'xxx')

        selector.re() returns a list; selector.re_first() returns the first matched str.

        selectorList.re() calls selector.re() on each selector (note that not every selector necessarily matches) and merges the per-selector lists into one list; selectorList.re_first() returns the first str of that merged list.
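    A small sketch of these differences, using invented markup:

    from parsel import Selector

    sel = Selector(text=u'<ul><li>one 1</li><li>two</li><li>three 3</li></ul>')
    items = sel.css('li')                          # SelectorList

    print(items[0].extract())                      # '<li>one 1</li>' - a single str
    print(items.extract())                         # list of three strings
    print(items.extract_first())                   # first element of that list
    print(sel.css('li.missing').extract_first())   # None - empty SelectorList, no exception

    print(items.re(r'(\d+)'))                      # ['1', '3'] - per-selector matches merged
    print(items.re_first(r'(\d+)'))                # '1'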

    Using extract

    In [21]: response.css('.author')  # converted to XPath internally, returns a SelectorList instance
    Out[21]:
    [<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
     <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>]
    
    In [22]: response.css('.author').extract()  # extract the data part above by calling extract() on each Selector in the SelectorList
    Out[22]:
    [u'<small class="author" itemprop="author">Albert Einstein</small>',
     u'<small class="author" itemprop="author">J.K. Rowling</small>',
     u'<small class="author" itemprop="author">Albert Einstein</small>',
     u'<small class="author" itemprop="author">Jane Austen</small>',
     u'<small class="author" itemprop="author">Marilyn Monroe</small>',
     u'<small class="author" itemprop="author">Albert Einstein</small>',
     u'<small class="author" itemprop="author">Andr\xe9 Gide</small>',
     u'<small class="author" itemprop="author">Thomas A. Edison</small>',
     u'<small class="author" itemprop="author">Eleanor Roosevelt</small>',
     u'<small class="author" itemprop="author">Steve Martin</small>']
    
    In [23]: response.css('.author').extract_first()  # take only the first one; may return None, whereas response.css('.author')[0].extract() may raise an error
    Out[23]: u'<small class="author" itemprop="author">Albert Einstein</small>'
    
    In [24]: response.css('.author::text').extract_first()  # target the text inside the tag
    Out[24]: u'Albert Einstein'

    Using re

    In [46]: response.css('.author::text')[0]
    Out[46]: <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]/text()" data=u'Albert Einstein'>
    
    In [47]: response.css('.author::text')[0].re(r'\w+')
    Out[47]: [u'Albert', u'Einstein']
    
    In [48]: response.css('.author::text')[0].re_first(r'\w+')
    Out[48]: u'Albert'
    
    In [49]: response.css('.author::text')[0].re(r'((\w+)\s(\w+))')  # groups are returned in order of opening parentheses
    Out[49]: [u'Albert Einstein', u'Albert', u'Einstein']
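    Putting the pieces together, a sketch of how these selectors combine inside scrapy shell on quotes.toscrape.com (structure as shown in the snippet earlier):

    for quote in response.css('div.quote'):
        text = quote.css('span.text::text').extract_first()      # relative to each div.quote
        author = quote.css('small.author::text').extract_first()
        print(author, text)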

