• Main usage of Scrapy selectors


    # On the command line, run `scrapy shell <URL>`. Scrapy requests the URL, stores the result in the variable `response`, and opens an interactive shell.
    scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
    
    In [1]: response  # the response object created by the shell
    Out[1]: <200 https://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
    
    In [2]: response.text  # HTML source of the response
    # The source, laid out for readability (the assignment below is illustrative only):
    response.text = '''
    <html>
     <head>
      <base href='http://example.com/' />
      <title>Example website</title>
     </head>
     <body>
      <div id='images'>
       <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
       <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
       <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
       <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
       <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
      </div>
     </body>
    </html>
    '''
    # 1: Query through the selector: response.selector.xpath() / response.selector.css()
    In [5]: response.selector.xpath('//title/text()').extract_first()
    Out[5]: 'Example website'
    
    In [6]: response.selector.css('title::text').extract_first()
    Out[6]: 'Example website'
    # 2: response.xpath() / response.css() are shortcuts for the calls above
    
    In [9]: response.css('title::text')
    Out[9]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]
    
    In [10]: response.xpath('//title/text()')
    Out[10]: [<Selector xpath='//title/text()' data='Example website'>]
    
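    # 3: As shown above, .xpath() and .css() return a list of selectors, not the data itself.
    #    Use extract() to get every match as a list of strings, or extract_first() to get only the first match.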
    In [7]: response.xpath('//title/text()').extract_first()
    Out[7]: 'Example website'
    
    In [8]: response.css('title::text').extract_first()
    Out[8]: 'Example website'
    
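    # 4: Selections can be chained
    # Select the div whose id is 'images', then look up the src attribute of the img tags inside it and extract the values.
    # In XPath, a predicate in square brackets [] filters nodes by a condition, and a slash / steps down to child nodes.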
    In [14]: response.xpath("//div[@id='images']").css('img::attr(src)').extract()
    Out[14]:
    ['image1_thumb.jpg',
     'image2_thumb.jpg',
     'image3_thumb.jpg',
     'image4_thumb.jpg',
     'image5_thumb.jpg']
    
    # extract_first() also accepts a default argument, which is returned when no element matches
    In [16]: response.xpath("//div[@id='images']").css('img::attr(src)').extract_first(default='')
    Out[16]: 'image1_thumb.jpg'
    
    # Extract the href attribute of every a tag
    In [18]: response.xpath('//a/@href').extract()
    Out[18]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    In [19]: response.css('a::attr(href)').extract()
    Out[19]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    # 5: Extract the text of a tag
    In [20]: response.xpath('//a/text()').extract()
    Out[20]:
    ['Name: My image 1 ',
     'Name: My image 2 ',
     'Name: My image 3 ',
     'Name: My image 4 ',
     'Name: My image 5 ']
    
    In [21]: response.css('a::text').extract()
    Out[21]:
    ['Name: My image 1 ',
     'Name: My image 2 ',
     'Name: My image 3 ',
     'Name: My image 4 ',
     'Name: My image 5 ']
    
    # 6: Select tag attributes
    In [34]: response.css('a::attr(href)').extract()
    Out[34]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    In [39]: response.xpath('//a/@href').extract()
    Out[39]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    # Extract the href attribute of a tags whose href contains 'image'
    
    In [24]: response.xpath('//a[contains(@href,"image")]/@href').extract()
    Out[24]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    In [25]: response.css('a[href*=image]::attr(href)').extract()
    Out[25]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
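    # For a tags whose href contains 'image', extract the src attribute of the img inside them
    # (XPath /img matches direct children, the CSS space matches any descendant; the results are the same on this page)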
    In [27]: response.xpath('//a[contains(@href,"image")]/img/@src').extract()
    Out[27]:
    ['image1_thumb.jpg',
     'image2_thumb.jpg',
     'image3_thumb.jpg',
     'image4_thumb.jpg',
     'image5_thumb.jpg']
    
    In [28]: response.css('a[href*=image] img::attr(src)').extract()
    Out[28]:
    ['image1_thumb.jpg',
     'image2_thumb.jpg',
     'image3_thumb.jpg',
     'image4_thumb.jpg',
     'image5_thumb.jpg']
    
    
    # 7: Selectors can be combined with regular expressions: re() returns every match, re_first() returns only the first
    In [30]: response.css('a::text').re('Name:(.*)')
    Out[30]:
    [' My image 1 ',
     ' My image 2 ',
     ' My image 3 ',
     ' My image 4 ',
     ' My image 5 ']
    
    In [31]: response.css('a::text').re_first('Name:(.*)')
    Out[31]: ' My image 1 '
    
    In [32]: response.css('a::text').re_first('Name:(.*)').strip()#去除空格
    Out[32]: 'My image 1'
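
To make the behaviour of `re()` and `re_first()` concrete, here is a rough standard-library approximation applied to the strings from `Out[21]`. It is a sketch, not Scrapy's implementation: `regex_all` and `regex_first` are hypothetical helper names, and details such as HTML entity handling are ignored.

```python
import re

# Text nodes as returned by response.css('a::text').extract() (first two only).
texts = ['Name: My image 1 ', 'Name: My image 2 ']
pattern = r'Name:(.*)'


def regex_all(strings, regex):
    """Apply the regex to each string and flatten the captured groups."""
    return [match for s in strings for match in re.findall(regex, s)]


def regex_first(strings, regex, default=None):
    """Return the first captured match, or `default` if there is none."""
    return next(iter(regex_all(strings, regex)), default)


all_names = regex_all(texts, pattern)
first_name = regex_first(texts, pattern)

# The capture group keeps the surrounding spaces, hence the strip().
clean_name = first_name.strip()

# A pattern that matches nothing falls back to the default value.
fallback = regex_first(texts, r'Price:(.*)', default='')

print(all_names, first_name, clean_name, repr(fallback))
```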
  • Original post: https://www.cnblogs.com/themost/p/7009585.html