• Scrapy: Downloader Middleware


    I. Overview

    A Downloader Middleware has three core methods:

    process_request(request, spider)

    process_response(request, response, spider)

    process_exception(request, exception, spider)  
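    As a rough sketch of what each of these hooks is expected to return, the class below implements all three as no-ops. The class name SkeletonDownloaderMiddleware is hypothetical and is not part of the project built later; the comments summarize the return contracts as I understand them from the Scrapy documentation.

```python
class SkeletonDownloaderMiddleware:
    """Hypothetical no-op middleware showing the three hooks and what they return."""

    def process_request(self, request, spider):
        # Called for each request before it is downloaded.
        # Return None to continue normally; returning a Response or a Request
        # stops the normal download of this request, and raising
        # scrapy.exceptions.IgnoreRequest drops it.
        return None

    def process_response(self, request, response, spider):
        # Called with the downloaded response on its way back to the spider.
        # Must return a Response (possibly modified) or a Request,
        # or raise IgnoreRequest.
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download or a process_request() hook raises.
        # Return None to let other middlewares handle the exception,
        # or a Response / Request to recover from it.
        return None
```

    A middleware only needs to define the hooks it actually uses; the one written in section III defines just process_request() and process_response().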

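    II. This experiment implements two features

    1. Change the User-Agent sent with each request

    Method 1: set the USER_AGENT variable in settings.py by adding a line USER_AGENT = '....'

    Method 2: edit middlewares.py to pick a random User-Agent for each request: define a RandomUserAgentMiddleware class there and give it a process_request() method

    2. Change the status code of the response

    Define a process_response() method in middlewares.py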

    III. Implementation

    scrapy startproject httpbintest

    cd httpbintest && scrapy genspider httpbin httpbin.org

    Edit httpbin.py as follows:

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class HttpbinSpider(scrapy.Spider):
        name = 'httpbin'
        allowed_domains = ['httpbin.org']
        start_urls = ['http://httpbin.org/get']
    
        def parse(self, response):
            # print(response.text)
            self.logger.debug(response.text)
            self.logger.debug('status code: ' + str(response.status))

    Add the following code to middlewares.py:

    process_request() picks a random User-Agent for the request; process_response() changes the response status code to 201.

    import random
    
    
    class RandomUserAgentMiddleware:
        def __init__(self):
            self.user_agents = [
                'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
                'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
                'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1'
            ]
    
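        def process_request(self, request, spider):
            # Pick a random User-Agent for every outgoing request.
            # Returning None (implicitly) lets Scrapy continue with the download.
            request.headers['User-Agent'] = random.choice(self.user_agents)
    
        def process_response(self, request, response, spider):
            # Overwrite the status code, then hand the response back to Scrapy.
            response.status = 201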
            return response
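    The process_response() above changes the response object in place. A possible alternative, sketched below, is to return a modified copy via the response's replace() method, which leaves the original response untouched. The class name StatusOverrideMiddleware and its status parameter are hypothetical and are not part of the original project.

```python
class StatusOverrideMiddleware:
    """Hypothetical variant: return a copy of the response with a new status code."""

    def __init__(self, status=201):
        # Status code to report to the spider (assumed default: 201, as in the text).
        self.status = status

    def process_response(self, request, response, spider):
        # replace() builds a new response with the given attributes changed;
        # the response object passed in is not modified.
        return response.replace(status=self.status)
```

    Either version should make the spider log "status code: 201"; the copy-based one merely avoids side effects on the object other components may still hold.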

    Add the following to settings.py so that the middleware above takes effect:

    DOWNLOADER_MIDDLEWARES = {
        'httpbintest.middlewares.RandomUserAgentMiddleware': 543,
    }
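    The number 543 is the middleware's order. As far as I know, Scrapy calls process_request() in increasing order and process_response() in decreasing order, and the built-in UserAgentMiddleware sits at 500 by default, so the custom middleware gets the last word on the User-Agent header. The snippet below is a simplified pure-Python illustration of that ordering, not Scrapy's actual implementation; Tagger and run_chain are hypothetical names.

```python
class Tagger:
    """Hypothetical stand-in middleware that records when its hooks run."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request, spider):
        self.log.append(f"request:{self.name}")

    def process_response(self, request, response, spider):
        self.log.append(f"response:{self.name}")
        return response


def run_chain(middlewares, request, response, spider=None):
    # middlewares maps each middleware object to its order number,
    # mirroring the shape of the DOWNLOADER_MIDDLEWARES setting.
    ordered = sorted(middlewares, key=middlewares.get)
    for mw in ordered:  # requests pass through in increasing order
        mw.process_request(request, spider)
    for mw in reversed(ordered):  # responses pass through in decreasing order
        response = mw.process_response(request, response, spider)
    return response


log = []
chain = {Tagger("builtin_500", log): 500, Tagger("custom_543", log): 543}
run_chain(chain, request=object(), response=object())
print(log)
# ['request:builtin_500', 'request:custom_543', 'response:custom_543', 'response:builtin_500']
```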
  • Original article: https://www.cnblogs.com/regit/p/9406279.html