• Scrapy hands-on practice


    Date: 2019-07-07

    Author: Sun

    1. Debugging Scrapy code in PyCharm

    Since PyCharm does not ship with built-in Scrapy support, Scrapy code is normally hard to debug directly. So when we want to study Scrapy and step through its code, how do we handle that?

    This section shows one way to do it.

    This section uses building a crawler for http://books.toscrape.com/ as the example.

    (1) Create the Scrapy project

    scrapy startproject books_toscrape

    (2) Create the spider

    cd books_toscrape

    scrapy genspider toscrape books.toscrape.com

    This generates a spider file toscrape.py under the spiders directory.
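    After these two commands, the project layout (the typical output of scrapy startproject plus the generated spider) looks roughly like this:

```text
books_toscrape/
    scrapy.cfg            # deploy configuration
    books_toscrape/       # project package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            toscrape.py   # the spider generated by genspider
```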

    (3) Create a debug entry file main.py in the project directory

    books_toscrape/main.py

    Content:

    # -*- coding: utf-8 -*-
    __author__ = 'sun'
    __date__ = '2019/07/07 9:04 PM'

    import os
    import sys

    from scrapy.cmdline import execute


    sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # directory containing this main.py


    SPIDER_NAME = "toscrape"   # the spider_name we passed to scrapy genspider

    execute(["scrapy", "crawl", SPIDER_NAME])
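    The sys.path line simply adds main.py's own directory to the import path, so the project package can be imported when main.py is run directly. A minimal stdlib illustration of what that expression computes (the path below is made up for the example):

```python
import os

# Illustrative only: for a main.py located at /project/books_toscrape/main.py,
# os.path.dirname(os.path.abspath(__file__)) resolves to the folder that
# contains the file.
sample_file = "/project/books_toscrape/main.py"
print(os.path.dirname(os.path.abspath(sample_file)))  # /project/books_toscrape
```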
    
    
    

    (4) Changes to the configuration file settings.py

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    

    (5) Start debugging

    Open main.py, right-click it, and choose Debug to start a debug session.

    Set a breakpoint in the parse function in spiders/toscrape.py and try using XPath to extract some of the book data from the page.

    Once the debug session starts, you can step into Scrapy itself.

    2. Case study

    Use Scrapy to analyze and crawl the book information on http://books.toscrape.com/

    (1) Create the project

    scrapy startproject BookToscrape

    (2) Create the spider

    Create a spider based on the basic template:

    scrapy genspider toscrape books.toscrape.com

    This generates a spider file toscrape.py under the spiders directory.

    (3) Edit the configuration file settings.py

    Change two options, USER_AGENT and ROBOTSTXT_OBEY; see day02 for a full description of the configuration options.

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    

    (4) Write the spider logic

    spiders/toscrape.py

    Content:

    import re

    import scrapy


    # NOTE: these three patterns are used below but were not defined in the
    # original excerpt; the definitions here are plausible assumptions.
    p_book_detail = re.compile(r"^catalogue/")  # detail link already prefixed with catalogue/
    p_img_pre = re.compile(r"^\.\./")           # image path that climbs out of the page directory
    p_price = re.compile(r"\d+\.\d+")           # numeric part of a price such as £51.77


    class ToscrapeSpider(scrapy.Spider):
        name = 'toscrape'   # spider name
        allowed_domains = ['books.toscrape.com']   # crawl scope of the spider
        start_urls = ['http://books.toscrape.com/']  # initial URLs to crawl

        def parse(self, response):
            '''
            The base class scrapy.Spider iterates over start_urls, wraps each URL in a
            Request(url, callback=parse), and sends it to the scheduler ---> downloader ---> parse.
            :param response:
            :return:
            '''
            article_list = response.xpath('//article[@class="product_pod"]')
            for article in article_list:
                book_title = article.xpath("./h3/a/text()").extract_first()
                book_detail_url = article.xpath("./h3/a/@href").extract_first()
                if p_book_detail.match(book_detail_url) is None:
                    book_detail = 'http://books.toscrape.com/catalogue/' + book_detail_url
                else:
                    book_detail = 'http://books.toscrape.com/' + book_detail_url

                book_image = article.xpath("./div[@class='image_container']/a/img/@src").extract_first()
                if p_img_pre.match(book_image) is None:
                    book_image = self.start_urls[0] + book_image
                else:
                    book_image = self.start_urls[0] + book_image.split("../")[-1]
                book_price = article.xpath("./div[@class='product_price']/p/text()").extract_first()
                book_price = p_price.findall(book_price)[0]
                print(f"book_title:{book_title}, book_detail:{book_detail}, book_image:{book_image},"
                      f" book_price:{book_price}")
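    The branching above only normalizes the relative links by hand; the standard library's urllib.parse.urljoin handles both cases in one call (the URLs below are illustrative), and Scrapy's response.urljoin works the same way:

```python
from urllib.parse import urljoin

page = "http://books.toscrape.com/"

# A detail link as it appears on the front page (relative, with catalogue/ prefix):
print(urljoin(page, "catalogue/a-light-in-the-attic_1000/index.html"))
# -> http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

# An image path that climbs directories from a deeper category page:
print(urljoin("http://books.toscrape.com/catalogue/category/books/travel_2/index.html",
              "../../../../media/cache/2c/da/example.jpg"))
# -> http://books.toscrape.com/media/cache/2c/da/example.jpg
```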
                
    
    
    

    (5) Add the debug file described above, books_toscrape/main.py

    Set breakpoints and debug-run the crawler.

  • Original source: https://www.cnblogs.com/sunBinary/p/11148341.html