Scrapy installation and basic usage
Scrapy is a comprehensive crawling framework. It depends on Twisted and achieves concurrent crawling internally through an event-loop mechanism.
Download and install:
- Windows:
    # download a Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    pip3 install wheel
    pip install Twisted-18.4.0-cp36-cp36m-win_amd64.whl   # if the 64-bit wheel won't install, try the 32-bit one
    pip3 install pywin32
    pip3 install scrapy
- Linux:
    pip3 install scrapy
What is Twisted, and how does it differ from requests?
- requests is a Python module that forges browser HTTP requests.
    - wraps a socket to send the request
- Twisted is an event-loop-based, asynchronous, non-blocking network framework.
    - wraps a socket to send the request
    - completes concurrent requests in a single thread
PS: three related terms
- non-blocking: don't wait
- asynchronous: callbacks
- event loop: keep looping to check the state of pending tasks
Components and execution flow:
- The engine finds the spider to run, calls its start_requests method, and obtains an iterator.
- Iterating over it produces Request objects; each Request wraps the URL to visit and a callback function.
- All Request objects (tasks) are put into the scheduler, to be downloaded later.
- The downloader pulls pending tasks (Request objects) from the scheduler and, when a download finishes, invokes the callback.
- Control returns to the spider's callback, which can then do either of the following (a sketch follows this list):
yield Request()
yield Item()
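A minimal sketch of that callback shape (the spider name, URL, and item fields here are placeholder assumptions, not part of a real project):

import scrapy
from scrapy.http import Request

class DemoSpider(scrapy.Spider):
    name = 'demo'                          # hypothetical spider name
    start_urls = ['http://example.com/']   # placeholder URL

    def parse(self, response):
        # yield an item (a plain dict works too): it is handed to the item pipelines
        yield {'url': response.url}
        # yield a Request: it goes back to the scheduler and is downloaded later
        for href in response.xpath('//a/@href').extract():
            yield Request(response.urljoin(href), callback=self.parse)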
Basic commands
# create a project
scrapy startproject xdb
cd xdb
# create spiders
scrapy genspider chouti chouti.com
scrapy genspider cnblogs cnblogs.com
# run a spider
scrapy crawl chouti
scrapy crawl chouti --nolog
HTML parsing: XPath
- response.text
- response.encoding
- response.body
- response.request
# response.xpath('//div[@href="x1"]/a').extract_first()
# response.xpath('//div[@href="x1"]/a').extract()
# response.xpath('//div[@href="x1"]/a/text()').extract()
# response.xpath('//div[@href="x1"]/a/@href').extract()
Issuing further requests: yield a Request object
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # print(response, type(response))  # a Response object
        # print(response.text)
        """
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        content_list = soup.find('div', attrs={'id': 'content-list'})
        """
        # find the descendant div with id=content-list
        f = open('news.log', mode='a+')
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            print(href, text.strip())
            f.write(href + '\n')
        f.close()

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # e.g. https://dig.chouti.com/all/hot/recent/2
Note: if you hit encoding errors while crawling, try adding the lines below.
# import sys, os, io
# sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
If the spider never reaches parse, change this setting in the settings file:
ROBOTSTXT_OBEY = False
The approach above has two drawbacks: 1. every request repeats an open/close cycle (the file is opened and closed for each response); 2. responsibilities are muddled, with parsing and storage in the same function.
To address both problems, Scrapy provides a persistence mechanism.
Persistence: pipelines and items
1. Define a pipeline class; this is where the storage logic goes.
class XXXPipeline(object):
    def process_item(self, item, spider):
        return item
2. Define an Item class; this declares the fields you expect to receive.
class XdbItem(scrapy.Item):
    href = scrapy.Field()
    title = scrapy.Field()
3. Register it in settings:
ITEM_PIPELINES = {
    'xdb.pipelines.XdbPipeline': 300,
}
Each time the spider yields an Item object, process_item is called once (see the sketch below).
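A minimal sketch of the spider side, assuming the XdbItem above lives in xdb/items.py and reusing the chouti selectors from earlier:

import scrapy
from xdb.items import XdbItem

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        for item in response.xpath('//div[@id="content-list"]/div[@class="item"]'):
            # every yielded item triggers one process_item call in each enabled pipeline
            yield XdbItem(
                href=item.xpath('.//a/@href').extract_first(),
                title=item.xpath('.//a/text()').extract_first(),
            )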
Writing a pipeline:
'''
How Scrapy uses it (from the source):
1. If the XdbPipeline class defines from_crawler:
       obj = XdbPipeline.from_crawler(...)
   otherwise:
       obj = XdbPipeline()
2. obj.open_spider()
3. obj.process_item() / obj.process_item() / obj.process_item() ...
4. obj.close_spider()
'''
from scrapy.exceptions import DropItem

class FilePipeline(object):
    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at initialization time to create the pipeline object.
        """
        print('File.from_crawler')
        path = crawler.settings.get('HREF_FILE_PATH')
        return cls(path)

    def open_spider(self, spider):
        """
        Called when the spider starts.
        """
        print('File.open_spider')
        self.f = open(self.path, 'a+')

    def process_item(self, item, spider):
        # f = open('xx.log', 'a+')
        # f.write(item['href'] + '\n')
        # f.close()
        print('File', item['href'])
        self.f.write(item['href'] + '\n')
        # return item       # hand the item to the next pipeline's process_item
        raise DropItem()    # stop here: later pipelines' process_item will not run

    def close_spider(self, spider):
        """
        Called when the spider closes.
        """
        print('File.close_spider')
        self.f.close()
Note: pipelines are shared by all spiders. If you want behavior specific to one spider, branch on the spider argument yourself.
Pipeline persistence in short: from_crawler supplies the output path, open_spider opens the handle, close_spider closes it, and process_item does the actual persistence. Returning the item passes it on to the next pipeline's process_item; raising DropItem() stops later pipelines' process_item from running.
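The from_crawler above reads HREF_FILE_PATH from the settings; that key is project-specific, so the settings.py entry below is only an assumed example:

HREF_FILE_PATH = 'news.log'   # assumed value: whatever file you want the hrefs written to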
Deduplication rules
Write the filter class:
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class XdbDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return a deferred
        print('start')

    def close(self, reason):  # can return a deferred
        print('stop')

    # def log(self, request, spider):  # log that a request has been filtered
    #     print('log')
Configuration
# override the default deduplication filter
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_CLASS = 'xdb.dupefilters.XdbDupeFilter'
Using it in a spider
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        print(response.request.url)
        # item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        # for item in item_list:
        #     text = item.xpath('.//a/text()').extract_first()
        #     href = item.xpath('.//a/@href').extract_first()

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            # yield Request(url=page, callback=self.parse, dont_filter=False)  # e.g. https://dig.chouti.com/all/hot/recent/2
            yield Request(url=page, callback=self.parse, dont_filter=True)     # dont_filter=True bypasses the dupefilter
Note:
- implement the correct logic in request_seen
- keep dont_filter=False
To enable deduplication, define a custom dupefilter class and do the check in its request_seen method. Also make sure dont_filter is False when yielding requests; False is already the default.
Depth and priority
- Depth
    - starts at 0
    - each time a new request is yielded from a response, its depth is the parent request's depth + 1
    Setting: DEPTH_LIMIT caps how deep the crawl may go
- Priority
    - the request's download priority is adjusted: priority -= depth * DEPTH_PRIORITY
    Setting: DEPTH_PRIORITY
Get the current depth with response.meta.get("depth", 0)
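A short sketch of how these settings and the depth value are typically used together (the numeric values are arbitrary examples):

# settings.py
DEPTH_LIMIT = 3       # example: drop requests more than 3 levels deep
DEPTH_PRIORITY = 1    # positive value: shallower requests are downloaded first

# inside a spider callback
def parse(self, response):
    depth = response.meta.get('depth', 0)   # depth of the request that produced this response
    print(depth, response.url)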
Setting cookies
Method 1: extract and carry the cookies yourself
cookie_dict = {}

def parse(self, response):
    # extract-and-carry approach:
    # pull the cookies out of the response headers into a CookieJar object
    from scrapy.http.cookies import CookieJar
    from urllib.parse import urlencode
    cookie_jar = CookieJar()
    cookie_jar.extract_cookies(response, response.request)

    # copy the cookies from the jar into a plain dict
    for k, v in cookie_jar._cookies.items():
        for i, j in v.items():
            for m, n in j.items():
                self.cookie_dict[m] = n.value

    yield Request(
        url="https://dig.chouti.com/login",
        method="POST",
        # the body can be concatenated by hand or built with urlencode
        body="phone=8613121758648&password=woshiniba&oneMonth=1",
        cookies=self.cookie_dict,
        headers={
            "Content-Type": 'application/x-www-form-urlencoded; charset=UTF-8'
        },
        callback=self.check_login
    )

def check_login(self, response):
    print(response.text)
    yield Request(
        url="https://dig.chouti.com/all/hot/recent/1",
        cookies=self.cookie_dict,
        callback=self.index
    )

def index(self, response):
    news_list = response.xpath("//div[@id='content-list']/div[@class='item']")
    for new in news_list:
        link_id = new.xpath(".//div[@class='part2']/@share-linkid").extract_first()
        yield Request(
            url="http://dig.chouti.com/link/vote?linksId=%s" % (link_id, ),
            method="POST",
            cookies=self.cookie_dict,
            callback=self.check_result
        )
    page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
    for page in page_list:
        page = "https://dig.chouti.com" + page
        yield Request(url=page, callback=self.index)

def check_result(self, response):
    print(response.text)
Method 2: meta
meta={'cookiejar': True}
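A minimal sketch of this approach: Scrapy's built-in CookiesMiddleware keeps a separate cookie session per 'cookiejar' key, so each follow-up request just re-sends the same key (the URLs and form body reuse the chouti example above):

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, meta={'cookiejar': True}, callback=self.parse)

def parse(self, response):
    # pass the same 'cookiejar' key along; the middleware attaches and updates the cookies
    yield Request(
        url="https://dig.chouti.com/login",
        method="POST",
        body="phone=8613121758648&password=woshiniba&oneMonth=1",
        headers={"Content-Type": 'application/x-www-form-urlencoded; charset=UTF-8'},
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.check_login,
    )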
start_urls
The Scrapy engine takes what start_requests returns (a list of Requests), wraps it in an iterator, and puts the requests into the scheduler; the downloader then pulls Request objects out via __next__.
- Customization: start_requests can fetch start URLs from Redis, or set a proxy via os.environ.
- Internals:
    """
    The Scrapy engine fetches the start URLs from the spider:
    1. call start_requests and take its return value
    2. v = iter(return value)
    3. req1 = v.__next__()
       req2 = v.__next__()
       req3 = v.__next__()
       ...
    4. put all the requests into the scheduler
    """
- Example:
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        # option 1: yield requests one by one (generator)
        for url in self.start_urls:
            yield Request(url=url)

        # option 2: return a list of requests
        # req_list = []
        # for url in self.start_urls:
        #     req_list.append(Request(url=url))
        # return req_list
Proxies
Question: how do you add a proxy in Scrapy?
- environment variables: set the proxy in os.environ inside start_requests, when the spider starts
- meta: set the meta attribute when yielding the Request
- custom downloader middleware: add the proxy in process_request; this approach supports random proxy rotation
Built-in proxy support: just set the proxy in os.environ before requests go out, when the spider starts.
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        import os
        os.environ['HTTPS_PROXY'] = "http://root:woshiniba@192.168.11.11:9999/"
        os.environ['HTTP_PROXY'] = '19.11.2.32'
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
Proxy via meta: set the meta attribute when yielding the Request
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse,
                          meta={'proxy': 'http://root:woshiniba@192.168.11.11:9999/'})
Custom downloader middleware: add the proxy in process_request; here you can implement random proxy selection
import base64
import random
from six.moves.urllib.parse import unquote
try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes

class XdbProxyMiddleware(object):
    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:woshiniba@192.168.11.11:9999/",
            "http://root:woshiniba@192.168.11.12:9999/",
            "http://root:woshiniba@192.168.11.13:9999/",
            "http://root:woshiniba@192.168.11.14:9999/",
            "http://root:woshiniba@192.168.11.15:9999/",
            "http://root:woshiniba@192.168.11.16:9999/",
        ]
        url = random.choice(PROXIES)
        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
Selectors and parsing
html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# hxs = Selector(response)
# hxs.xpath()
response.xpath('')  # fill in your XPath here, e.g. '//a[@id="i1"]/@href'
Downloader middleware
In process_request you can do any of the following:
- return an HtmlResponse object: the download is skipped, but process_response still runs
- return a Request object: issue a different request instead
- raise IgnoreRequest: discard the current request; process_exception will run
- modify the request in place, e.g. set the User-Agent header
Writing the middleware:
from scrapy.http import HtmlResponse
from scrapy.http import Request

class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('md1.process_request', request)

        # 1. return a Response
        # import requests
        # result = requests.get(request.url)
        # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)

        # 2. return a Request
        # return Request('https://dig.chouti.com/r/tec/hot/1')

        # 3. raise an exception
        # from scrapy.exceptions import IgnoreRequest
        # raise IgnoreRequest

        # 4. modify the request (*)
        # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        pass

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
Configuration
DOWNLOADER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbDownloaderMiddleware': 543,
    # 'xdb.proxy.XdbProxyMiddleware': 751,
    'xdb.md.Md1': 666,
    'xdb.md.Md2': 667,
}
Typical uses:
- setting a User-Agent (a sketch follows this list)
- proxies
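For the User-Agent case, a minimal sketch of such a downloader middleware (the class name and UA strings are placeholders; register it in DOWNLOADER_MIDDLEWARES like the ones above):

import random

class RandomUserAgentMiddleware(object):
    # a couple of example UA strings; in practice keep a longer list or load it from settings
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15",
    ]

    def process_request(self, request, spider):
        # overwrite the header on every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None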
Spider middleware
- process_start_requests: runs only once, when the spider starts, before the downloader middleware
- process_spider_input: runs after the downloader middleware has finished, right before the callback is invoked
- process_spider_output: runs after the callback has finished
Writing one:
class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # runs only once, when the spider starts
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r
Configuration
SPIDER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbSpiderMiddleware': 543,
    'xdb.sd.Sd1': 666,
    'xdb.sd.Sd2': 667,
}
Typical uses:
- depth tracking
- priority adjustment (a sketch follows this list)
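As an illustration only (Scrapy's built-in DepthMiddleware already does this for real), a simplified spider middleware that tracks depth and pushes deeper requests later in the queue could look like this:

from scrapy.http import Request

class DepthPriorityMiddleware(object):
    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, Request):
                # the new request is one level deeper than the response that produced it
                depth = response.meta.get('depth', 0) + 1
                obj.meta['depth'] = depth
                obj.priority -= depth   # lower priority for deeper requests
            yield obj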
Custom commands
Running a single spider from a script:
import sys
from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy", "crawl", "chouti", "--nolog"])
- Running all spiders:
    - create a directory (any name, e.g. commands) at the same level as spiders
    - inside it create crawlall.py; the file name becomes the command name (a sketch follows this list)
    - add COMMANDS_MODULE = 'project_name.directory_name' to settings.py
    - run scrapy crawlall from the project directory
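A minimal sketch of what commands/crawlall.py could contain, built on scrapy.commands.ScrapyCommand (the description string is just an example):

from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Run every spider in the project'

    def run(self, args, opts):
        # all spider names registered in the project
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            self.crawler_process.crawl(name)
        # start the reactor; this call blocks until all spiders finish
        self.crawler_process.start()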
Signals
Hook into the extension points the framework exposes to add your own behavior.
from scrapy import signals

class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()
        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)
        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')
Configuration
EXTENSIONS = {
    'xdb.ext.MyExtend': 666,
}