cookie
cookie: fetch the result that Baidu Translate returns for a given term.
Because the default start_requests only issues GET requests for start_urls and this endpoint expects POST, start_requests must be overridden.
Two ways to do it:
1. Assign 'POST' to the method parameter of Request() (see the sketch after this list)
2. Send the POST request with FormRequest()
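For reference, approach 1 looks roughly like the sketch below. It is not part of the project code: it would replace start_requests inside the spider class shown further down, reuses the same 'kw' parameter and sug URL, and builds the form-encoded body by hand.

from urllib.parse import urlencode

import scrapy

# Sketch of approach 1 (not recommended): build the POST request manually with scrapy.Request.
def start_requests(self):
    data = {'kw': 'dog'}
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            method='POST',
            body=urlencode(data),  # form-encoded request body
            headers={'Content-Type': 'application/x-www-form-urlencoded'},
            callback=self.parse,
        )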
Spider code
# -*- coding: utf-8 -*-
import scrapy


# Requirement: fetch the translation result for a given term from Baidu Translate
class PostdemoSpider(scrapy.Spider):
    name = 'postDemo'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    # start_requests (which sends GET requests by default) is a method of the parent
    # class: it issues a GET request for every element in start_urls.
    # To send POST requests instead:
    # 1. assign 'POST' to the method parameter of Request (not recommended)
    # 2. use FormRequest(), which issues a POST request (recommended)
    def start_requests(self):
        print('start_requests()')
        # POST request parameters
        data = {
            'kw': 'dog',
        }
        for url in self.start_urls:
            # formdata: dict holding the request parameters
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parse(self, response):
        print(response.text)
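The sug endpoint returns JSON, so instead of printing the raw text the callback can decode it. A minimal variant of parse (no assumptions are made here about the structure of Baidu's payload):

import json

def parse(self, response):
    # Decode the JSON body returned by the sug endpoint
    result = json.loads(response.text)
    print(result)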
Settings
BOT_NAME = 'postPro'

SPIDER_MODULES = ['postPro.spiders']
NEWSPIDER_MODULE = 'postPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
Proxy
What the downloader middleware does: it intercepts requests, which makes it possible to swap the IP a request goes out from by routing it through a proxy.
Workflow:
1. Define a custom downloader middleware class (inheriting from object) and override its process_request(self, request, spider) method.
2. Enable the downloader middleware in the settings file.
Code
Spider code
# -*- coding: utf-8 -*-
import scrapy


class ProxySpider(scrapy.Spider):
    name = 'proxy'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/s?wd=ip']

    def parse(self, response):
        # Save the page so the IP Baidu saw can be inspected
        with open('proxy.html', 'w', encoding='utf-8') as fp:
            fp.write(response.text)
Middleware
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


# Custom downloader middleware class; process_request handles every request the
# middleware intercepts.
class Myproxy(object):
    def process_request(self, request, spider):
        # Swap the request's IP by routing it through a proxy
        request.meta['proxy'] = 'http://60.217.137.218:8060'

# The middleware classes generated by the default template are unused and can be deleted.
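A single hard-coded proxy is brittle; a common variation is to pick a proxy from a small pool on each request. The class and pool below are illustrative only, not part of the original project, and the addresses are placeholders; it would be enabled in DOWNLOADER_MIDDLEWARES the same way as Myproxy.

import random

# Hypothetical proxy pool; replace the entries with working proxies.
PROXY_POOL = [
    'http://60.217.137.218:8060',
]

class RandomProxy(object):
    def process_request(self, request, spider):
        # Route every intercepted request through a randomly chosen proxy
        request.meta['proxy'] = random.choice(PROXY_POOL)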
Settings (enable the middleware)
BOT_NAME = 'proxyDemo'

SPIDER_MODULES = ['proxyDemo.spiders']
NEWSPIDER_MODULE = 'proxyDemo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'proxyDemo (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    'proxyDemo.middlewares.Myproxy': 543,
}