• User-Agent settings for a Python crawler


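    Most websites now use some form of anti-crawler protection. The simplest countermeasure is to send a browser-like User-Agent header, so I wrote a small Scrapy middleware that rotates User-Agent strings.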

    # -*- coding: utf-8 -*-
    #__author__ = "雨轩恋i"
    #__date__ = "2018-10-30"
    
    # Standard-library module used to pick a random list entry
    import random
    # Scrapy's built-in downloader middleware that sets the User-Agent header
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    
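    # RotateUserAgentMiddleware subclasses UserAgentMiddleware.
    # Purpose: keep a list of User-Agent strings, pick one at random for each
    #          outgoing request, and attach it as that request's User-Agent header.
    
    # Anti-crawler background: many sites refuse requests that do not look like
    #             they come from a browser, so the crawler presents a
    #             browser-like User-Agent on every request.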
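    class RotateUserAgentMiddleware(UserAgentMiddleware):
        def __init__(self, user_agent=''):
            self.user_agent = user_agent
    
        def process_request(self, request, spider):
            # Pick a random User-Agent string for this request
            ua = random.choice(self.user_agent_list)
            if ua:
                # Log the User-Agent chosen for this request
                spider.logger.debug("User-Agent: %s", ua)
                # setdefault leaves a User-Agent already set on the request untouched
                request.headers.setdefault('User-Agent', ua)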
    
        # user_agent_list is intended to mix Chrome, IE, Firefox, Mozilla, Opera, and
        # Netscape strings (the entries below are all Chrome).
        # More User-Agent strings are listed at http://www.useragentstring.com/pages/useragentstring.php
        # Pool of User-Agent header values
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
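
The same rotation idea can be tried outside Scrapy. The following is a minimal sketch using only the standard library; `USER_AGENTS` and `build_request` are hypothetical names introduced here, and the two-entry pool is a shortened stand-in for the list above.

```python
import random
from urllib.request import Request

# Hypothetical short pool; in practice reuse the full user_agent_list.
# Note the comma after every entry: without it, Python silently joins
# adjacent string literals into one long, invalid User-Agent.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def build_request(url):
    """Return a urllib Request carrying a randomly chosen User-Agent."""
    ua = random.choice(USER_AGENTS)
    return Request(url, headers={"User-Agent": ua})


# Building the request does not touch the network.
req = build_request("http://example.com/")
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

Each call to `build_request` may pick a different entry, which is the same behavior the middleware applies to every Scrapy request.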

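    You can add this middleware to your own crawler. Each request then goes out with a randomly chosen User-Agent, so the crawler looks like a browser and is less likely to be blocked by sites that only check this header.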
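
To take effect, the middleware has to be registered in the project's `settings.py`. A sketch, assuming the class is saved in a module named `myproject.middlewares` (a hypothetical path) and that priority 400 suits the project:

```python
# settings.py (sketch): "myproject.middlewares" is a hypothetical module path.
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware so it cannot set the
    # header before the rotating middleware runs.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Enable the rotating middleware; 400 is an assumed priority value.
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}
```

Disabling the built-in entry matters because `process_request` uses `setdefault`, which leaves the header alone if another middleware has already filled it in.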

  • Original article: https://www.cnblogs.com/yuxuanlian/p/9877550.html