Basic Usage of the Scrapy Crawler Framework
The Scrapy crawler framework is a great tool: it makes crawling websites simple and fast, and it is especially well suited to sites that don't separate front end and back end, where the data is rendered directly into the HTML. This post records the basic usage by crawling the problem IDs and titles of HDU OJ, http://acm.hdu.edu.cn.
Installing Scrapy
- Install with pip:
```bash
pip install scrapy
```
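If the install succeeded, the scrapy command-line tool is available and reports its version:

```bash
scrapy version
```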
Writing the Code
- Create the project myspider:
```bash
scrapy startproject myspider
```
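For orientation, the generated project skeleton looks roughly like this (the exact file set varies a little across Scrapy versions):

```
myspider/
├── scrapy.cfg            # deploy/config entry point
└── myspider/
    ├── __init__.py
    ├── items.py          # item definitions go here
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines go here
    ├── settings.py       # project settings
    └── spiders/
        └── __init__.py   # generated spiders land in this package
```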
- Create the spider hdu for the site acm.hdu.edu.cn:
```bash
scrapy genspider hdu acm.hdu.edu.cn
```
- Running the command above creates hdu.py under the spiders folder. Change its code to:
```python
import scrapy


class HduSpider(scrapy.Spider):
    # spider name
    name = 'hdu'
    # domains the spider is allowed to crawl
    allowed_domains = ['acm.hdu.edu.cn']
    # page the crawl starts from
    start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']

    # crawl logic
    def parse(self, response):
        # the problem list is written inside the page's second <script> block,
        # so first extract the text of every script into problem_list
        problem_list = response.xpath('//script/text()').extract()
        # take the second entry (index 1) and split it on semicolons
        problems = problem_list[1].split(";")
        # print each entry to the console; nothing is handed to a pipeline yet
        for item in problems:
            print(item)
```
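The spider can already be run at this stage to see the raw entries printed to the console; from the project root:

```bash
scrapy crawl hdu
```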
- Define the corresponding item class for a problem in items.py:
```python
import scrapy


class ProblemItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
```
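Scrapy items behave like dicts, which is what the pipeline below relies on when it calls dict(item). A quick illustration with made-up values:

```python
from myspider.items import ProblemItem

item = ProblemItem()
item['id'] = '1000'
item['title'] = '"A + B Problem"'
# items convert cleanly to plain dicts, e.g. for json.dumps
print(dict(item))  # {'id': '1000', 'title': '"A + B Problem"'}
```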
- In pipelines.py, build a data pipeline that saves the data to the file hdu.json:
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class HduPipeline(object):
    full_json = ''

    def __init__(self):
        self.filename = open("hdu.json", "wb+")
        self.filename.write("[".encode("utf-8"))

    def process_item(self, item, spider):
        # serialize each item and buffer it; everything is written on close
        json_text = json.dumps(dict(item), ensure_ascii=False) + ", "
        self.full_json += json_text
        return item

    def close_spider(self, spider):
        self.filename.write(self.full_json.encode("utf-8"))
        self.filename.write("]".encode("utf-8"))
        self.filename.close()
```
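Be aware that HduPipeline leaves a trailing ", " before the closing bracket, so the file it writes is not strictly valid JSON (you can see this in the final output further down). A minimal alternative sketch, under a hypothetical name, that buffers the items and serializes once on close to get a well-formed array:

```python
import json


class HduJsonListPipeline(object):
    """Hypothetical variant of HduPipeline that writes strictly valid JSON."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # collect plain dicts in memory; fine for a few thousand problems
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # a single json.dump produces a well-formed JSON array
        with open("hdu.json", "w", encoding="utf-8") as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)
```

Whichever class you use, remember to point ITEM_PIPELINES at it in the next step.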
- Configure the pipeline in settings.py:
```python
ITEM_PIPELINES = {
    'myspider.pipelines.HduPipeline': 300
}
# don't honor the site's robots.txt gentleman's agreement
ROBOTSTXT_OBEY = False
```
- Modify hdu.py so that the parsed items are handed off to the pipeline:
```python
# -*- coding: utf-8 -*-
import scrapy
import re

from myspider.items import ProblemItem


class HduSpider(scrapy.Spider):
    name = 'hdu'
    allowed_domains = ['acm.hdu.edu.cn']
    start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']

    def parse(self, response):
        problem_list = response.xpath('//script/text()').extract()
        problems = problem_list[1].split(";")
        for item in problems:
            hdu = ProblemItem()
            # capture everything between the first '(' and the last ')'
            p = re.compile(r'[(](.*)[)]', re.S)
            str1 = re.findall(p, item)[0]
            # the arguments are comma separated: index 1 is the problem id,
            # index 3 is the title (still wrapped in its own quotes)
            detail = str1.split(",")
            hdu['id'] = detail[1]
            hdu['title'] = detail[3]
            yield hdu
```
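To see what the regex and the splits are doing, here is a standalone sketch run against a hypothetical entry; only the positions of the id (argument index 1) and the title (index 3) are assumed, the other fields are placeholders:

```python
import re

entry = 'p(1,1000,0,"A + B Problem",7000,12000)'  # hypothetical sample

p = re.compile(r'[(](.*)[)]', re.S)
args = re.findall(p, entry)[0]  # '1,1000,0,"A + B Problem",7000,12000'
detail = args.split(",")
print(detail[1])  # 1000
print(detail[3])  # "A + B Problem"  (the quotes are part of the captured text)
```

Note that the naive comma split would misbehave on a title that itself contains a comma; that corner case is ignored here, as in the spider above.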
- Run the crawl, sending the log output to all.log:
```bash
scrapy crawl hdu -s LOG_FILE=all.log
```
- hdu.json now holds the problem titles from the first page:
{"id": "1000", "title": ""A + B Problem""} {"id": "1001", "title": ""Sum Problem""} {"id": "1002", "title": ""A + B Problem II""} {"id": "1003", "title": ""Max Sum""} {"id": "1004", "title": ""Let the Balloon Rise""} {"id": "1005", "title": ""Number Sequence""} ... {"id": "1099", "title": ""Lottery ""}
- Modify hdu.py once more so it can crawl the content of every valid page number:
```python
# -*- coding: utf-8 -*-
import scrapy
import re

from myspider.items import ProblemItem


class HduSpider(scrapy.Spider):
    name = 'hdu'
    allowed_domains = ['acm.hdu.edu.cn']
    # download_delay = 1
    base_url = 'http://acm.hdu.edu.cn/listproblem.php?vol=%s'
    start_urls = ['http://acm.hdu.edu.cn/listproblem.php']

    # crawl entry point
    def parse(self, response):
        # first collect every valid page number from the pager links
        real_pages = response.xpath('//p[@class="footer_link"]/font/a/text()').extract()
        for page in real_pages:
            url = self.base_url % page
            yield scrapy.Request(url, callback=self.parse_problem)

    def parse_problem(self, response):
        # pull the useful fields out of the script text
        problem_list = response.xpath('//script/text()').extract()
        problems = problem_list[1].split(";")
        for item in problems:
            # HDU has invalid empty entries; skip them
            if len(item) == 0 or str.isspace(item):
                continue
            hdu = ProblemItem()
            p = re.compile(r'[(](.*)[)]', re.S)
            str1 = re.findall(p, item)
            detail = str1[0].split(",")
            hdu['id'] = detail[1]
            hdu['title'] = detail[3]
            yield hdu
```
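As a side note, on Scrapy 1.4+ the pagination step can also be written with response.follow, which resolves relative URLs for you. A sketch of just the parse method, assuming the pager anchors carry ordinary href attributes (not verified against the live page):

```python
def parse(self, response):
    # follow each pager link directly instead of rebuilding the URL by hand
    for href in response.xpath('//p[@class="footer_link"]/font/a/@href').extract():
        yield response.follow(href, callback=self.parse_problem)
```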
- Run the command again, with the log going to all.log:
```bash
scrapy crawl hdu -s LOG_FILE=all.log
```
- Now every problem title on every page is crawled. Be aware, though, that the crawled items are not stored in order: Scrapy issues requests concurrently, so the order in which responses arrive (and items reach the pipeline) depends on scheduling and network timing:
[{"id": "4400", "title": ""Mines""}, {"id": "4401", "title": ""Battery""}, {"id": "4402", "title": ""Magic Board""}, {"id": "4403", "title": ""A very hard Aoshu problem""}, {"id": "4404", "title": ""Worms""}, {"id": "4405", "title": ""Aeroplane chess""}, {"id": "4406", "title": ""GPA""}, {"id": "4407", "title": ""Sum""}, ... {"id": "1099", "title": ""Lottery ""}, ]
- So far the results only go to a plain text file; loading them into a database will come in a later post and is skipped here.