• [Scrapy] Basic Usage of the Scrapy Crawler Framework


    Basic Usage of the Scrapy Crawler Framework

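    Scrapy is a handy crawler framework that makes scraping a website quick and simple. It is especially well suited to sites that do not separate the front end from the back end, where the data is rendered directly into the HTML. This post uses the problem IDs and titles of the HDU Online Judge (http://acm.hdu.edu.cn) as an example and records the basic usage.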

    See also https://www.jianshu.com/p/7dee0837b3d2

    Installing Scrapy

    • Install with pip

      pip install scrapy
      

    Writing the code

    • Create the project myspider

      scrapy startproject myspider
      
    • Create a spider named hdu for the site acm.hdu.edu.cn

      scrapy genspider hdu acm.hdu.edu.cn
      
    • The command above creates hdu.py under the spiders folder. Change its code to:

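      import scrapy


      class HduSpider(scrapy.Spider):
          # Spider name, used by "scrapy crawl"
          name = 'hdu'
          # Domains the spider is allowed to crawl
          allowed_domains = ['acm.hdu.edu.cn']
          # Page where the crawl starts
          start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']

          # Parsing logic
          def parse(self, response):
              # The problem list lives in the page's second <script>;
              # first collect the text of every <script> into problem_list
              problem_list = response.xpath('//script/text()').extract()
              # The problem list is the second entry (index 1); split it on semicolons
              problems = problem_list[1].split(";")
              # Print each entry to the console; nothing is handed to a pipeline yet
              for item in problems:
                  print(item)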
      
    • In items.py, define the item class for a problem

      class ProblemItem(scrapy.Item):
          id = scrapy.Field()
          title = scrapy.Field()
      
    • In pipelines.py, create a pipeline that saves the data to hdu.json

      # -*- coding: utf-8 -*-
      
      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
      import json
      
      
      class ItcastPipeline(object):
          def __init__(self):
              self.filename = open("teacher.json", "wb+")
      
          def process_item(self, item, spider):
              jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
              self.filename.write(jsontext.encode("utf-8"))
              return item
      
          def close_spider(self, spider):
              self.filename.close()
      
      
      class HduPipeline(object):
          full_json = ''
      
          def __init__(self):
              self.filename = open("hdu.json", "wb+")
              self.filename.write("[".encode("utf-8"))
      
          def process_item(self, item, spider):
              json_text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
              self.full_json += json_text
              return item
      
          def close_spider(self, spider):
              self.filename.write(self.full_json.encode("utf-8"))
              self.filename.write("]".encode("utf-8"))
              self.filename.close()
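
Note that HduPipeline appends ",\n" after every item, so the last item is followed by a comma and the file is not strictly valid JSON. The following is a minimal sketch of one way around that, not the author's code: the class name JsonArrayPipeline and its path parameter are hypothetical, while open_spider, process_item and close_spider are the standard pipeline hooks that Scrapy calls.

```python
import json


class JsonArrayPipeline:
    """Hypothetical pipeline that writes all items as one valid JSON array."""

    def __init__(self, path="hdu.json"):
        # "path" is an assumption for illustration; Scrapy creates the
        # pipeline without arguments, so the default is used in a crawl.
        self.path = path
        self.file = None
        self.first_item = True

    def open_spider(self, spider):
        # Open in text mode with an explicit encoding instead of encoding by hand.
        self.file = open(self.path, "w", encoding="utf-8")
        self.file.write("[\n")
        self.first_item = True

    def process_item(self, item, spider):
        # Write the separator before every item except the first,
        # so the array never ends with a trailing comma.
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        self.file.write(json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider):
        self.file.write("\n]\n")
        self.file.close()
```

To try it, the class would be registered in ITEM_PIPELINES in the same way as HduPipeline. For simple cases, Scrapy's built-in feed export (for example `scrapy crawl hdu -o hdu.json`) may be enough without a custom pipeline.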
      
      
    • Register the pipeline in settings.py

      ITEM_PIPELINES = {
         'myspider.pipelines.HduPipeline': 300
      }
      # Do not obey the site's robots.txt rules
      ROBOTSTXT_OBEY = False
      
    • Modify hdu.py so that the items are handed to the pipeline

      # -*- coding: utf-8 -*-
      import scrapy
      import re
      from myspider.items import ProblemItem
      
      
      class HduSpider(scrapy.Spider):
          name = 'hdu'
          allowed_domains = ['acm.hdu.edu.cn']
          start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']
      
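          def parse(self, response):
              problem_list = response.xpath('//script/text()').extract()
              problems = problem_list[1].split(";")
              # Capture everything between the first "(" and the last ")" of an entry
              pattern = re.compile(r'[(](.*)[)]', re.S)
              for item in problems:
                  matches = pattern.findall(item)
                  # Segments without parentheses (such as the text after the
                  # last ";") would raise IndexError below, so skip them
                  if not matches:
                      continue
                  detail = matches[0].split(",")
                  # Build a fresh item per problem instead of reusing one object
                  hdu = ProblemItem()
                  hdu['id'] = detail[1]
                  hdu['title'] = detail[3]
                  yield hdu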
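
The extraction step can be tried outside Scrapy. The sketch below applies the same split-and-regex approach to a plain string. SAMPLE_SCRIPT is a hypothetical stand-in for the second script's text (the real page content may differ), and parse_problems is an illustrative helper name, not part of the project.

```python
import re

# Hypothetical sample of the second <script> body; the real page may differ.
SAMPLE_SCRIPT = (
    'p(0,1000,-1,"A + B Problem",100,200);'
    'p(0,1001,-1,"Sum Problem",50,120);'
)

# Capture everything between the first "(" and the last ")" of one entry.
ENTRY_PATTERN = re.compile(r'[(](.*)[)]', re.S)


def parse_problems(script_text):
    """Return a list of {'id', 'title'} dicts extracted from script_text."""
    problems = []
    for entry in script_text.split(";"):
        matches = ENTRY_PATTERN.findall(entry)
        if not matches:
            # Blank segment, for example the text after the last ";"
            continue
        fields = matches[0].split(",")
        if len(fields) < 4:
            # Not enough fields to hold an ID and a title
            continue
        # Field 1 is the ID and field 3 the title, as in the spider;
        # unlike the spider, the surrounding quotes are stripped from the title.
        problems.append({"id": fields[1], "title": fields[3].strip('"')})
    return problems


if __name__ == "__main__":
    for problem in parse_problems(SAMPLE_SCRIPT):
        print(problem)
```

Because the fields are split on commas, a title that itself contains a comma would be cut short. Handling that case would need a more careful parser than this sketch.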
      
      
    • Run the crawl; here the log is written to all.log

      scrapy crawl hdu  -s  LOG_FILE=all.log
      
    • hdu.json now contains the problem titles scraped from the first page

      {"id": "1000", "title": "\"A + B Problem\""}
      {"id": "1001", "title": "\"Sum Problem\""}
      {"id": "1002", "title": "\"A + B Problem II\""}
      {"id": "1003", "title": "\"Max Sum\""}
      {"id": "1004", "title": "\"Let the Balloon Rise\""}
      {"id": "1005", "title": "\"Number Sequence\""}
      
      ...
      
      {"id": "1099", "title": "\"Lottery \""}
      
      
    • Modify hdu.py again so that it crawls every valid page

      # -*- coding: utf-8 -*-
      import scrapy
      import re
      from myspider.items import ProblemItem
      
      
      class HduSpider(scrapy.Spider):
          name = 'hdu'
          allowed_domains = ['acm.hdu.edu.cn']
          # download_delay = 1
          base_url = 'http://acm.hdu.edu.cn/listproblem.php?vol=%s'
          start_urls = ['http://acm.hdu.edu.cn/listproblem.php']
      
          # Entry point of the spider
          def parse(self, response):
              # First collect all valid page numbers
              real_pages = response.xpath('//p[@class="footer_link"]/font/a/text()').extract()
              for page in real_pages:
                  url = self.base_url % page
                  yield scrapy.Request(url, callback=self.parse_problem)
      
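          def parse_problem(self, response):
              # Extract the useful fields from the script text
              problem_list = response.xpath('//script/text()').extract()
              problems = problem_list[1].split(";")
              pattern = re.compile(r'[(](.*)[)]', re.S)
              for item in problems:
                  # HDU pages contain empty entries; skip them with "continue"
                  # rather than "return", so later entries are still processed
                  if not item or item.isspace():
                      continue
                  matches = pattern.findall(item)
                  if not matches:
                      continue
                  detail = matches[0].split(",")
                  # Build a fresh item per problem instead of reusing one object
                  hdu = ProblemItem()
                  hdu['id'] = detail[1]
                  hdu['title'] = detail[3]
                  yield hdu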
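
The step that turns the footer link texts into page URLs can be sketched on its own. This is an illustrative helper, not part of the original spider: build_page_urls is a hypothetical name, and the assumption that some link texts may be non-numeric or repeated is mine.

```python
BASE_URL = 'http://acm.hdu.edu.cn/listproblem.php?vol=%s'


def build_page_urls(page_labels, base_url=BASE_URL):
    """Turn link texts such as ['1', '2', ' 3 '] into de-duplicated page URLs."""
    urls = []
    seen = set()
    for label in page_labels:
        page = label.strip()
        # Keep only purely numeric labels; skip anything else and duplicates.
        if not page.isdigit() or page in seen:
            continue
        seen.add(page)
        urls.append(base_url % page)
    return urls


if __name__ == "__main__":
    for url in build_page_urls(['1', '2', ' 2 ', 'Next']):
        print(url)
```

In the spider, each resulting URL would be passed to scrapy.Request with callback=self.parse_problem, as in parse.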
      
      
    • Run the command again, with the log written to all.log as before

      scrapy crawl hdu  -s  LOG_FILE=all.log
      
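    • The spider now collects the problem titles from every page. Note that the results are not in order: Scrapy sends requests concurrently, so the order in which items are written depends on several factors, such as when each response arrives.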

      [{"id": "4400", "title": "\"Mines\""},
      {"id": "4401", "title": "\"Battery\""},
      {"id": "4402", "title": "\"Magic Board\""},
      {"id": "4403", "title": "\"A very hard Aoshu problem\""},
      {"id": "4404", "title": "\"Worms\""},
      {"id": "4405", "title": "\"Aeroplane chess\""},
      {"id": "4406", "title": "\"GPA\""},
      {"id": "4407", "title": "\"Sum\""},
      
      ...
      
      {"id": "1099", "title": "\"Lottery \""},
      ]
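
If ordered output is needed, one simple option is to sort the items after the crawl rather than trying to control the request order. A minimal sketch, assuming the items have already been loaded into a list of dicts shaped like the ones above (sort_problems is a hypothetical helper name):

```python
def sort_problems(problems):
    """Return the problems ordered by numeric ID."""
    # IDs are stored as strings, so convert to int to sort numerically.
    return sorted(problems, key=lambda problem: int(problem["id"]))


if __name__ == "__main__":
    scraped = [
        {"id": "4400", "title": "Mines"},
        {"id": "1099", "title": "Lottery"},
        {"id": "1000", "title": "A + B Problem"},
    ]
    for problem in sort_problems(scraped):
        print(problem["id"], problem["title"])
```

Sorting by int(id) instead of the raw string keeps the order correct even if the IDs ever differ in length.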
      
      
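    • The steps above only save the results to a text file. Storing them in a database is left for later and is not covered in this tutorial.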

  • Original post: https://www.cnblogs.com/axiangcoding/p/12096894.html