• [Python crawler] Scrapy basics 5: chaining a regex after xpath() and similar selectors


    Suppose we want to debug a particular page: https://g.widora.cn/

    The Scrapy shell does not depend on a project environment, so it can be started from anywhere:

    scrapy shell https://g.widora.cn/

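    Much like pressing F12 in a browser, the shell lists every object available for inspection; response is the one you will use most often.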

    (earlier output omitted)
    
    2020-05-08 21:07:18 [asyncio] DEBUG: Using selector: KqueueSelector
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1118626d0>
    [s]   item       {}
    [s]   request    <GET https://g.widora.cn/>
    [s]   response   <200 https://g.widora.cn/>
    [s]   settings   <scrapy.settings.Settings object at 0x111bd7890>
    [s]   spider     <DefaultSpider 'default' at 0x112103250>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    2020-05-08 21:07:18 [asyncio] DEBUG: Using selector: KqueueSelector

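    Finding a group number: the selectors returned by xpath() (and similar methods such as css()) support .re(), but the results of extract(), get(), etc. do not.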
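
    The reason is that extract() and get() hand back plain Python strings (a list of str, or a single str), and those have no .re() method. If only the extracted strings are at hand, the standard re module can do the same job. A minimal sketch, assuming the text has already been extracted; the sample string is copied from the shell session on this page, and the variable names are illustrative only:

```python
import re

# What .extract() returns for the text() node: a plain list of str.
extracted = ['G.widora.cn 群(1031687050)']

# A plain list has no .re() method, so chaining .re() after extract() fails.
print(hasattr(extracted, 're'))  # False

# Fall back to the re module and collect every run of digits.
group_numbers = [match for text in extracted for match in re.findall(r'\d+', text)]
print(group_numbers)  # ['1031687050']
```
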

    In [1]: response.xpath("/html/body/div/div[5]/p/a").extract()                   
    
    Out[1]: ['<a target="_blank" href="//shang.qq.com/wpa/qunwpa?idkey=f65cb90612db81ef9bee771440adb40c004933a18b7c0466a279486936aedc79" src="title=" style="color:#00a1d6">G.widora.cn 群(1031687050)</a>']
    
    In [2]: response.xpath("/html/body/div/div[5]/p/a/text()").extract()            
    
    Out[2]: ['G.widora.cn 群(1031687050)']
    
    In [3]: response.xpath("/html/body/div/div[5]/p/a/text()")                      
    
    Out[3]: [<Selector xpath='/html/body/div/div[5]/p/a/text()' data='G.widora.cn 群(1031687050)'>]
    
    In [4]: response.xpath("/html/body/div/div[5]/p/a/text()").re(r'\d+')
    
    Out[4]: ['1031687050']

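    Typing these expressions in the terminal is tedious; it is easier to get the XPath working in the browser first and then write the code.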

     

  • Original article: https://www.cnblogs.com/hightech/p/12853158.html