Scraping Dynamic Websites with Scrapy and Headless Chrome

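Headless Chrome is the Chrome browser running without a visible window, so pages are rendered by the real Chrome engine and their JavaScript is executed by V8.
It can replace PhantomJS, which is no longer maintained; Selenium has also deprecated its PhantomJS driver.
To drive headless Chrome from Selenium, you need the ChromeDriver (WebDriver) release that matches your installed Chrome version.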
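
As a rough illustration of the version-matching requirement, the sketch below compares the major version numbers found in two version strings, for example the text printed by `chromedriver --version` and the version shown on Chrome's About page. It assumes ChromeDriver's newer numbering scheme (from Chrome 73 on, where the driver's major version follows the browser's); the older 2.x drivers have to be matched through the compatibility notes on the download page instead. The helper names `major_version`, `versions_match`, and `driver_version_text` are hypothetical.

```python
import re
import subprocess


def major_version(version_text):
    """Return the first component of the first dotted version number in the text."""
    match = re.search(r"(\d+)(?:\.\d+)+", version_text)
    if match is None:
        raise ValueError("no version number found in %r" % version_text)
    return int(match.group(1))


def versions_match(chrome_version, driver_version):
    """True when both version strings share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


def driver_version_text(chromedriver_path):
    """Ask a local chromedriver binary for its version string."""
    return subprocess.check_output([chromedriver_path, "--version"]).decode()


if __name__ == "__main__":
    # Example strings only; substitute the values from your own machine.
    print(versions_match("Google Chrome 114.0.5735.199",
                         "ChromeDriver 114.0.5735.90 (386bc09)"))
```

Running a check like this before starting the crawl gives a clearer error than the "session not created" failure Selenium reports when the two do not match.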

Prerequisites
Windows 10
Anaconda3
Python 3.6.2
Selenium
ChromeDriver (WebDriver)
Scrapy

Download ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads

Implementation
First, create the Scrapy project:

scrapy startproject xintong

Create a spider:

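# -*- coding: utf-8 -*-
import scrapy


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    # Must cover the domain used in start_urls (m.weibo.cn); otherwise any
    # follow-up requests would be dropped by the offsite filter.
    allowed_domains = ['weibo.cn']
    start_urls = ['https://m.weibo.cn/u/2937210565']

    def parse(self, response):
        # The response body is the HTML already rendered by headless Chrome.
        print("返回渲染过的页面内容")  # "Got the rendered page content"
        # Each post is a "card9" block inside the #app container.
        for sel in response.xpath('//div[@id="app"]//div[contains(@class, "card9")]'):
            # First text node of the post body; None if there is no plain text.
            title = sel.xpath('.//div[@class="weibo-text"]/text()').extract_first()
            print('标题：', title)  # "Title:"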

Write the downloader middleware (middlewares.py):

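# -*- coding: utf-8 -*-
import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class XintongSpiderMiddleware(object):
    """Downloader middleware (despite its name) that renders pages in headless Chrome."""

    def __init__(self):
        option = Options()
        option.add_argument('--headless')
        # Selenium 3 API, as used in this article. `options=` replaces the
        # deprecated `chrome_options=`; Selenium 4 passes the driver path
        # through a Service object instead of executable_path.
        self.driver = webdriver.Chrome(executable_path="D:/Python/test3/chromedriver.exe",
                                       options=option)

    @classmethod
    def from_crawler(cls, crawler):
        # Connect spider_closed; without this the handler below never runs
        # and the Chrome process is left behind.
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        print("页面开始渲染。。。")  # "Page rendering started"
        # Scroll down to trigger lazy loading, then give the page time to render.
        self.driver.execute_script("scroll(0, 1000);")
        time.sleep(1)
        rendered_body = self.driver.page_source
        print("页面完成渲染。。。")  # "Page rendering finished"
        # Returning a response here skips Scrapy's own downloader.
        return HtmlResponse(request.url, body=rendered_body, encoding="utf-8", request=request)

    def spider_closed(self, spider, reason):
        print('驱动关闭')  # "Driver closed"
        # quit() ends the whole browser session and the chromedriver process.
        self.driver.quit()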
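
The fixed `time.sleep(1)` in the middleware either wastes time or returns before the posts are in the DOM. One alternative is to poll until the content shows up. The following is a minimal, library-independent sketch of that idea (Selenium ships its own `WebDriverWait` for the same purpose). The names `wait_for`, `count_posts`, and `wait_for_posts` are illustrative assumptions, and the `.weibo-text` selector is borrowed from the spider's XPath.

```python
import time


def wait_for(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


def count_posts(driver):
    """Number of elements with class 'weibo-text' currently in the page."""
    return driver.execute_script(
        "return document.querySelectorAll('.weibo-text').length;")


def wait_for_posts(driver, timeout=10.0):
    """Block until at least one post body is present, then return the count."""
    return wait_for(lambda: count_posts(driver), timeout=timeout)
```

In `process_request`, a call such as `wait_for_posts(self.driver)` could stand in for `time.sleep(1)`. A page that never shows any posts raises `TimeoutError`, which the middleware would then have to catch and handle.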

Edit settings.py and change the following settings:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in User-Agent middleware
    'xintong.middlewares.XintongSpiderMiddleware': 543,
}

Run the spider and check the output:

scrapy crawl weibo

2018-05-31 22:43:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-31 22:43:06 [scrapy.core.engine] INFO: Spider opened
2018-05-31 22:43:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-31 22:43:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-31 22:43:06 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/url {"url": "https://m.weibo.cn/u/2937210565", "sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
页面开始渲染。。。
2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/execute {"script": "scroll(0, 1000);", "args": [], "sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
[0531/224307.637:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://tva2.sinaimg.cn will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://m.weibo.cn/u/2937210565 (0)
[0531/224307.653:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://tva1.sinaimg.cn will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://m.weibo.cn/u/2937210565 (0)
2018-05-31 22:43:08 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/source {"sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
2018-05-31 22:43:08 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
页面完成渲染。。。
2018-05-31 22:43:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://m.weibo.cn/u/2937210565> (referer: None)
返回渲染过的页面内容
标题：  把您的答案留下来吧 ~ 
标题： 突然看到了很早之前吃过的零食，是不是暴露年龄了
标题： 【二哈撞伤骑车大爷逃走，主人赔三万多】5月21日，江苏淮安。一条哈士奇从主人电动车掉下后，狂奔追赶，撞上骑电动车的马大爷。大爷摔倒骨折，哈士奇“肇事逃逸”。交警追踪找到狗主人，大爷获赔33200元。
标题：  所有收费站恢复通行
标题： 各位老师er吃完晚饭了吗？要不要出来遛弯跟蜀黍偶遇一波
标题： 【网红化妆品真能“火”！靠近火焰“蹭蹭”往上蹿火?】注意啦！“网红防晒喷雾、杀虫喷雾、非常易燃助燃，一瓶花露水的酒精含量就达70%。”5月30日，扬州一个消防实验 让我们了解：一些化妆品如果使用不当，很有可能起不到保护皮肤的作用，反而会引发火灾，造成烧伤烫伤。
标题：  机场收费站恢复通行！
标题： 据说出去玩的时候，每个位置职责都不一样！你是哪一个？ 
标题： 做好笔记了！汽车常见易损零部件的常识！[并不简单]
标题： 这是什么黑科技？不过我更关注下雨咋办？
2018-05-31 22:43:08 [scrapy.core.engine] INFO: Closing spider (finished)
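
Several titles in the output above carry leading or trailing spaces, and `extract_first()` returns `None` when a post has no plain text node. A small normalization step before printing or storing the value might look like the sketch below; `clean_title` is a hypothetical helper, not part of the original spider.

```python
def clean_title(raw):
    """Collapse runs of whitespace and trim the ends; return None for empty input."""
    if raw is None:
        return None
    cleaned = " ".join(raw.split())
    return cleaned or None


if __name__ == "__main__":
    # Sample values taken from the crawl output, plus the missing-text case.
    for raw in ["  把您的答案留下来吧 ~ ", " 所有收费站恢复通行", None]:
        print(repr(clean_title(raw)))
```

In `parse`, this would wrap the extracted value, for example `title = clean_title(sel.xpath(...).extract_first())`.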

References:
https://developers.google.com/web/updates/2017/04/headless-chrome
https://docs.scrapy.org/en/latest/
https://intoli.com/blog/installing-google-chrome-on-centos/
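
Copyright notice: This is an original article by CSDN blogger 「小龙在山东」, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/lilongsy/article/details/80531378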

Source address: https://www.cnblogs.com/Im-Victor/p/14777816.html