• Scrapy login simulation


    Simulating login by carrying cookies

    • Override the start_requests() method in the spider
      • Its body looks like this:
    def start_requests(self):
        # Scrapy expects cookies as a dict (or a list of dicts), not a raw
        # string, so split the string copied from the browser into pairs
        cookie_str = 'a real, valid cookie string'
        cookies = {pair.split('=', 1)[0]: pair.split('=', 1)[1]
                   for pair in cookie_str.split('; ')}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )
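
    To confirm the cookies are actually attached, Scrapy's built-in COOKIES_DEBUG setting logs the outgoing and incoming Cookie headers for every request:

    COOKIES_DEBUG = True   # add to settings.py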
    

    Downloader middleware

    • Just define your own middleware at the bottom of the project's middlewares.py file

    A downloader middleware can do a lot: carry login information, set the user-agent, add a proxy, and so on

    • Enable it in settings before use
      • The number is the priority (lower numbers sit closer to the engine)
      • The key format is projectname.middlewares.MiddlewareClassName
    DOWNLOADER_MIDDLEWARES = {
       'superspider.middlewares.SuperspiderDownloaderMiddleware': 543,
    }
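
    For reference, a downloader middleware is just a class that implements some of Scrapy's hook methods; this skeleton (MyDownloaderMiddleware is a placeholder name) summarizes what each hook may return:

    class MyDownloaderMiddleware:
        def process_request(self, request, spider):
            # return None to continue processing, a Response to short-circuit
            # the download, a Request to reschedule, or raise IgnoreRequest
            return None

        def process_response(self, request, response, spider):
            # must return a Response (or a Request to retry the download)
            return response

        def process_exception(self, request, exception, spider):
            # return None to fall through, or a Response/Request to recover
            return None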
    
    1. Setting the user-agent with process_request
      Define a downloader middleware named RandomUserAgentMiddleware
    from fake_useragent import UserAgent

    class RandomUserAgentMiddleware:
        def process_request(self, request, spider):
            # you can also apply different middleware logic to different spiders
            if spider.name == 'spider1':
                ua = UserAgent()
                request.headers["User-Agent"] = ua.random
    

    Enable it in settings

    DOWNLOADER_MIDDLEWARES = {
       'superspider.middlewares.RandomUserAgentMiddleware': 543
    }
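
    fake_useragent may need network access to build its user-agent cache on first use; if that is a problem, a dependency-free variant (a sketch with a hand-picked list; extend it as needed) does the same job:

    import random

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    class StaticUserAgentMiddleware:
        def process_request(self, request, spider):
            # choose a random UA from the fixed list for every request
            request.headers["User-Agent"] = random.choice(USER_AGENTS)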
    
    • Inspecting the user-agent with process_response
      • It must return the response
      • Enable it in settings
    class UserAgentCheck:
        def process_response(self, request, response, spider):
            print(request.headers['User-Agent'])
            return response
    
    DOWNLOADER_MIDDLEWARES = {
       'superspider.middlewares.RandomUserAgentMiddleware': 543,
       'superspider.middlewares.UserAgentCheck': 544
    }
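
    Note the priorities: process_request hooks run in ascending order and process_response hooks in descending order, so the User-Agent is set at 543 on the way out and UserAgentCheck at 544 prints it on the way back.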
    
    2. Setting a proxy
    • Add a proxy field to the request's meta
    • Proxy format: protocol + IP + port
    • Enable it in settings
    class ProxyMiddleware:
        def process_request(self, request, spider):
            # route this spider's requests through a proxy
            if spider.name == 'spider0':
                request.meta["proxy"] = "http://ip:port"
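
    Enabling it follows the same pattern as before (assuming the class sits in superspider/middlewares.py):

    DOWNLOADER_MIDDLEWARES = {
       'superspider.middlewares.ProxyMiddleware': 543,
    }

    If the proxy requires credentials, w3lib (installed alongside Scrapy) provides basic_auth_header; a sketch with hypothetical user/password values:

    from w3lib.http import basic_auth_header

    class AuthProxyMiddleware:
        def process_request(self, request, spider):
            request.meta["proxy"] = "http://ip:port"
            # hypothetical credentials, replace with your proxy account
            request.headers["Proxy-Authorization"] = basic_auth_header("user", "password")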
    

    Simulating a GitHub login

    Build the form yourself and log in with FormRequest
    • Be explicit about what to yield, and hand the result to the next callback
    # -*- coding: utf-8 -*-
    import scrapy

    class Spider0Spider(scrapy.Spider):
        name = 'spider0'
        allowed_domains = ['github.com']
        start_urls = ['https://github.com/session']

        def parse(self, response):
            yield scrapy.FormRequest(
                "https://github.com/session",
                formdata={},  # the form dict is constructed in the next step
                callback=self.after_login
            )

        def after_login(self, response):
            pass
    
    • Constructing the formdata
    # -*- coding: utf-8 -*-
    import scrapy

    class Spider0Spider(scrapy.Spider):
        name = 'spider0'
        allowed_domains = ['github.com']
        start_urls = ['https://github.com/session']

        def parse(self, response):
            form = {
                'utf8': "✓",
                # hidden CSRF token: read it from the input's value attribute
                'authenticity_token': response.xpath(
                    "//*[@id='unsupported-browser']/div/div/div[2]/form/input[2]/@value").extract_first(),
                'ga_id': response.xpath('//*[@id="login"]/form/input[3]/@value').extract_first(),
                'login': "",      # your GitHub username
                'password': "",   # your GitHub password
                'webauthn-support': 'supported',
                'webauthn-iuvpaa-support': 'supported',
                # this field's name is generated per page load, so read it dynamically
                response.xpath('//*[@id="login"]/form/div[3]/input[5]/@name').extract_first(): "",
                'timestamp': response.xpath("//*[@id='login']/form/div[3]/input[6]/@value").extract_first(),
                "timestamp_secret": response.xpath("//*[@id='login']/form/div[3]/input[7]/@value").extract_first(),
                "commit": "Sign in"
            }
            print(form)

            yield scrapy.FormRequest(
                "https://github.com/session",
                formdata=form,
                callback=self.after_login
            )

        def after_login(self, response):
            pass
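
    after_login is left empty above; a simple way to verify the login (a sketch that checks for a string which only appears when signed in; adjust the marker for the actual page) is:

    def after_login(self, response):
        # a logged-in GitHub page contains a "Sign out" link
        if "Sign out" in response.text:
            self.logger.info("login succeeded")
        else:
            self.logger.warning("login failed")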
             
    
    Automatically finding the form's fields with FormRequest.from_response
    # -*- coding: utf-8 -*-
    import scrapy

    class Spider0Spider(scrapy.Spider):
        name = 'spider0'
        allowed_domains = ['github.com']
        start_urls = ['https://github.com/session']

        def parse(self, response):
            # from_response locates the form and pre-fills its hidden fields;
            # only the visible fields need to be supplied here
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"login": "", "password": ""},
                callback=self.after_login
            )

        def after_login(self, response):
            print(response.text)
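
    By default from_response uses the first form in the page and carries over all of its hidden inputs (authenticity_token, timestamp, and so on), which is why only login and password need to appear in formdata; when a page holds several forms, the formname, formnumber, formxpath, or formcss arguments pick the right one.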
    