• Scraping dynamic websites with Scrapy and Headless Chrome


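    Headless Chrome is the Chrome browser running without a visible window, so pages are still rendered and their JavaScript is still executed by Chrome's fast V8 engine.
    It can replace PhantomJS, which is no longer recommended for use with Scrapy (Selenium has also deprecated its PhantomJS driver).
    To drive headless Chrome you must use the WebDriver (ChromeDriver) release that corresponds to your Chrome version.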
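
    The version requirement above can be checked before starting a crawl. The following is a minimal sketch, not part of the original setup: it assumes that the installed ChromeDriver follows the newer numbering scheme in which its major version equals Chrome's (the 2.x ChromeDriver releases used a separate compatibility table instead). The function names and the sample version strings are made up for illustration.

```python
import re


def major_version(version_output):
    """Return the major number of the first dotted version found in the text."""
    match = re.search(r"(\d+)(?:\.\d+)+", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Illustrative strings in the shape printed by `chrome --version` and
# `chromedriver --version`; the numbers are invented for this example.
chrome_line = "Google Chrome 114.0.5735.90"
driver_line = "ChromeDriver 114.0.5735.90 (abc123)"
print(versions_match(chrome_line, driver_line))
```

    In practice the two strings would come from running both executables with `--version`; a mismatch is a hint to download a different ChromeDriver build.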

    Prerequisites
    Windows 10
    Anaconda3
    Python 3.6.2
    Selenium
    WebDriver
    Scrapy

    Download WebDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads

    Implementation
    First, create a Scrapy project:

    scrapy startproject xintong

    Create a Spider:

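    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class WeiboSpider(scrapy.Spider):
        name = 'weibo'
        # Must cover the host in start_urls (m.weibo.cn); otherwise any
        # follow-up requests would be filtered as off-site.
        allowed_domains = ['weibo.cn']
        start_urls = ['https://m.weibo.cn/u/2937210565']
    
        def parse(self, response):
            # The response body is the page source already rendered by headless Chrome.
            print("返回渲染过的页面内容")  # "rendered page content returned"
            # Each post sits in a "card9" container under the #app root element.
            for sel in response.xpath('//div[@id="app"]//div[contains(@class, "card9")]'):
                # First text node of the post body; None if the card has no text.
                title = sel.xpath('.//div[@class="weibo-text"]/text()').extract_first()
                print('标题:', title)  # "标题" means "title"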
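
    To see what the two XPath expressions in parse() select, the same logic can be tried on a small static snippet. This is only a sketch: it uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors (ElementTree does not support contains(), so that test is done in Python), and the sample markup is a hypothetical simplification, not Weibo's real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the rendered page.
SAMPLE = """
<html><body>
  <div id="app">
    <div class="card m-panel card9">
      <div class="weibo-text">first post</div>
    </div>
    <div class="card m-panel card11">
      <div class="weibo-text">not a post card</div>
    </div>
    <div class="card m-panel card9">
      <div class="weibo-text">second post</div>
    </div>
  </div>
</body></html>
"""


def extract_titles(markup):
    """Collect the text of .weibo-text inside every card9 container under #app."""
    root = ET.fromstring(markup.strip())
    app = root.find(".//div[@id='app']")
    titles = []
    if app is None:
        return titles
    for div in app.iter("div"):
        # Substring test, mirroring contains(@class, "card9") in the spider.
        if "card9" not in div.get("class", ""):
            continue
        text_div = div.find(".//div[@class='weibo-text']")
        if text_div is not None:
            # .text is the text before any child element, which corresponds to
            # the first text() node when the element starts with text.
            titles.append(text_div.text)
    return titles


print(extract_titles(SAMPLE))
```

    Only the two card9 containers contribute a title; the card11 block is skipped even though it also contains a weibo-text element.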

    Configure the middleware:

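    # -*- coding: utf-8 -*-
    import time
    
    from scrapy import signals
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    
    class XintongSpiderMiddleware(object):
        """Downloader middleware that lets headless Chrome fetch and render each page."""
    
        def __init__(self):
            option = Options()
            option.add_argument('--headless')
            # Selenium 3 style arguments; adjust the path to your own chromedriver.exe.
            self.driver = webdriver.Chrome(executable_path="D:/Python/test3/chromedriver.exe",
                                           chrome_options=option)
    
        @classmethod
        def from_crawler(cls, crawler):
            # Connect spider_closed to the signal; otherwise Scrapy never calls it.
            middleware = cls()
            crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
            return middleware
    
        def process_request(self, request, spider):
            self.driver.get(request.url)
            print("页面开始渲染。。。")  # "page rendering started"
            # Scroll down so lazily loaded content is requested, then give it time to load.
            self.driver.execute_script("scroll(0, 1000);")
            time.sleep(1)
            rendered_body = self.driver.page_source
            print("页面完成渲染。。。")  # "page rendering finished"
            # Returning a response here makes Scrapy skip its own download of the URL.
            return HtmlResponse(request.url, body=rendered_body, encoding="utf-8")
    
        def spider_closed(self, spider, reason):
            print('驱动关闭')  # "driver closed"
            # quit() ends the whole browser session, not just the current window.
            self.driver.quit()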
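
    The fixed one-second sleep in process_request can be too short on a slow connection and wasteful on a fast one. A common alternative is to poll until the content of interest appears. The sketch below shows the idea in plain Python so it runs without a browser: wait_until and FakeDriver are hypothetical names introduced here, and FakeDriver merely imitates a driver whose page_source fills in after a few polls. Selenium ships its own WebDriverWait class for this purpose, which would be the natural choice in the real middleware.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.2):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakeDriver(object):
    """Hypothetical stand-in for a WebDriver: the page is empty for a few polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    @property
    def page_source(self):
        if self._remaining > 0:
            self._remaining -= 1
            return '<div id="app"></div>'
        return '<div id="app"><div class="weibo-text">hello</div></div>'


driver = FakeDriver(polls_until_ready=3)


def rendered_source():
    # Return the page source once the post text is present, otherwise None.
    source = driver.page_source
    return source if "weibo-text" in source else None


html = wait_until(rendered_source, timeout=2.0, interval=0.01)
print("weibo-text" in html)
```

    The call returns as soon as the marker shows up and raises TimeoutError if it never does, so a page that fails to render is reported instead of being parsed as an empty result.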

    Edit settings.py and change the following settings:

    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    ROBOTSTXT_OBEY = False
    DOWNLOADER_MIDDLEWARES = {
        # Disable the built-in User-Agent middleware (module path used since Scrapy 1.0)
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'xintong.middlewares.XintongSpiderMiddleware': 543,
    }
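
    The values in DOWNLOADER_MIDDLEWARES are merged with Scrapy's built-in defaults: None switches a middleware off, and the numbers decide the order in which process_request is called (lower numbers first). The following sketch imitates that merge in plain Python to show where 543 places the custom middleware. enabled_middlewares is a hypothetical helper, not a Scrapy function, and BASE is only an illustrative subset of the defaults with order values assumed for this example.

```python
def enabled_middlewares(base, custom):
    """Merge two {path: order} dicts, drop disabled entries, sort by order."""
    merged = dict(base)
    merged.update(custom)
    active = [(order, path) for path, order in merged.items() if order is not None]
    return [path for order, path in sorted(active)]


# Illustrative subset of built-in downloader middlewares (assumed order values).
BASE = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
}

# The project's setting: disable the User-Agent middleware, add the custom one.
CUSTOM = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'xintong.middlewares.XintongSpiderMiddleware': 543,
}

for path in enabled_middlewares(BASE, CUSTOM):
    print(path)
```

    With these numbers the custom middleware runs ahead of the retry and compression middlewares, and the User-Agent middleware no longer appears in the list.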

    Output:

    scrapy crawl weibo
    2018-05-31 22:43:06 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2018-05-31 22:43:06 [scrapy.core.engine] INFO: Spider opened
    2018-05-31 22:43:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-05-31 22:43:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-05-31 22:43:06 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/url {"url": "https://m.weibo.cn/u/2937210565", "sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
    2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
    页面开始渲染。。。
    2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/execute {"script": "scroll(0, 1000);", "args": [], "sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
    2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
    [0531/224307.637:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://tva2.sinaimg.cn will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://m.weibo.cn/u/2937210565 (0)
    [0531/224307.653:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://tva1.sinaimg.cn will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://m.weibo.cn/u/2937210565 (0)
    2018-05-31 22:43:08 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/source {"sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
    2018-05-31 22:43:08 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
    页面完成渲染。。。
    2018-05-31 22:43:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://m.weibo.cn/u/2937210565> (referer: None)
    返回渲染过的页面内容
    标题:  把您的答案留下来吧 ​​​~ ​​​
    标题: 突然看到了很早之前吃过的零食,是不是暴露年龄了
    标题: 【二哈撞伤骑车大爷逃走,主人赔三万多】5月21日,江苏淮安。一条哈士奇从主人电动车掉下后,狂奔追赶,撞上骑电动车的马大爷。大爷摔倒骨折,哈士奇“肇事逃逸”。交警追踪找到狗主人,大爷获赔33200元。
    标题:  所有收费站恢复通行
    标题: 各位老师er吃完晚饭了吗?要不要出来遛弯跟蜀黍偶遇一波
    标题: 【网红化妆品真能“火”!靠近火焰“蹭蹭”往上蹿火?】注意啦!“网红防晒喷雾、杀虫喷雾、非常易燃助燃,一瓶花露水的酒精含量就达70%。”5月30日,扬州一个消防实验 让我们了解:一些化妆品如果使用不当,很有可能起不到保护皮肤的作用,反而会引发火灾,造成烧伤烫伤。
    标题:  机场收费站恢复通行!
    标题: 据说出去玩的时候,每个位置职责都不一样!你是哪一个? ​​​​
    标题: 做好笔记了!汽车常见易损零部件的常识![并不简单]
    标题: 这是什么黑科技?不过我更关注下雨咋办?
    2018-05-31 22:43:08 [scrapy.core.engine] INFO: Closing spider (finished)

    References:
    https://developers.google.com/web/updates/2017/04/headless-chrome
    https://docs.scrapy.org/en/latest/
    https://intoli.com/blog/installing-google-chrome-on-centos/
    ————————————————
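    Copyright notice: this is an original article by CSDN blogger 「小龙在山东」, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
    Original link: https://blog.csdn.net/lilongsy/article/details/80531378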

  • Original URL: https://www.cnblogs.com/Im-Victor/p/14777816.html