• Python crawler: retrying failed downloads with a second request


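    When writing a crawler you will inevitably run into errors such as 4XX or 5XX responses. Some are caused by network problems, others by something else. In those cases we want the program to attempt the download again.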

    One crude solution is to nest try/except blocks, one level per retry:

      

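    try:
        session.get(url)  # attempt 1
    except Exception:
        try:
            session.get(url)  # attempt 2
        except Exception:
            try:
                session.get(url)  # attempt 3
            except Exception:
                try:
                    session.get(url)  # attempt 4
                except Exception:
                    try:
                        session.get(url)  # attempt 5
                    except Exception:
                        try:
                            session.get(url)  # attempt 6
                        except Exception:
                            session.get(url)  # attempt 7, the last one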

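    Is there a cleaner way to write this?

    We can implement the retry with recursion: the function calls itself with a decremented retry counter until the request succeeds or the counter reaches zero.

    The code is as follows: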

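    # -*- coding: utf-8 -*-
    import requests

    # One shared session so connections are reused across requests.
    session = requests.Session()

    request_urls = [
        "https://www.baidu.com/",
        "https://www.baidu.com/",
        "https://www.baidu.com/",
        "https://www.ba111111idu.com/",  # deliberately invalid host, to trigger a failure
        "https://www.baidu.com/",
        "https://www.baidu.com/",
    ]

    def down_load(url, request_max=3):
        """Download url, retrying up to request_max more times on failure."""
        # The message below reads "Requesting URL:".
        print("正在请求的URL是: %s" % url)
        result_html = ""
        result_status_code = None
        try:
            result = session.get(url=url)
            result_html = result.content
            result_status_code = result.status_code
            print(result_status_code)
        except Exception as e:
            # Network-level failure (DNS lookup, refused connection, ...).
            print(e)
        # Retry after an exception or any non-200 status while retries remain.
        # (The check must sit outside the except block, otherwise 4XX/5XX
        # responses would never be retried.)
        if result_status_code != 200 and request_max > 0:
            return down_load(url, request_max - 1)
        return result_html

    for url in request_urls:
        down_load(url=url, request_max=13)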
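The same retry logic can also be written as a plain loop, which avoids growing the call stack by one frame per retry. The following is only a minimal sketch: `download_with_retries` and the `fetch` callable are hypothetical names that do not appear in the original article, and `fetch` is assumed to return a `(status_code, body)` pair and to raise on network errors.

```python
def download_with_retries(fetch, url, request_max=3):
    """Call fetch(url) until it returns status 200 or the retries run out.

    fetch is a hypothetical callable returning (status_code, body); it may
    raise on network errors. At most request_max + 1 attempts are made,
    matching the recursive version.
    """
    body = ""
    for attempt in range(request_max + 1):
        try:
            status_code, body = fetch(url)
        except Exception as e:
            # Network-level failure: report it and move on to the next attempt.
            print("attempt %d failed: %s" % (attempt + 1, e))
            continue
        if status_code == 200:
            return body
        print("attempt %d got status %s" % (attempt + 1, status_code))
    # Every attempt failed: return the last body seen (empty if there was none).
    return body


# Demonstration with a fake fetch that fails twice and then succeeds,
# so the sketch can be run without network access.
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError("simulated network failure")
    return 200, "<html>ok</html>"

print(download_with_retries(flaky_fetch, "https://www.baidu.com/", request_max=5))
```

With requests, `fetch` could be a small wrapper that calls `session.get(url)` and returns `(result.status_code, result.content)`.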

    Output of the recursive script (the Chinese prefix on each line reads "Requesting URL:"):

    C:\Python27\python.exe C:/Users/xuchunlin/PycharmProjects/A9_25/auction/test.py
    正在请求的URL是: https://www.baidu.com/
    200
    正在请求的URL是: https://www.baidu.com/
    200
    正在请求的URL是: https://www.baidu.com/
    200
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6208>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6438>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA65F8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6828>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6A90>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA62E8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6D30>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6DD8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B682B0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68080>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B685C0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B687F0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68A20>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68C50>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.baidu.com/
    200
    正在请求的URL是: https://www.baidu.com/
    200
    
    Process finished with exit code 0
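The log above shows 14 attempts against the unreachable host (request_max=13 plus the initial attempt), and the script never pauses between them. A common refinement is to wait between attempts and double the wait each time. The sketch below illustrates that idea only; `backoff_delays`, `call_with_backoff`, and the `base`/`cap` values are assumptions for illustration and are not part of the original article.

```python
import time


def backoff_delays(request_max, base=0.5, cap=30.0):
    """Return the wait in seconds before each retry: base * 2**n, capped at cap."""
    return [min(cap, base * (2 ** n)) for n in range(request_max)]


def call_with_backoff(func, request_max=3, base=0.5, cap=30.0, sleep=time.sleep):
    """Call func() up to request_max + 1 times, sleeping between failed attempts.

    sleep is injectable so the waiting can be replaced in tests.
    """
    delays = backoff_delays(request_max, base, cap)
    for attempt in range(request_max + 1):
        try:
            return func()
        except Exception as e:
            if attempt == request_max:
                raise  # out of retries: let the caller see the last error
            print("attempt %d failed (%s); waiting %.1fs"
                  % (attempt + 1, e, delays[attempt]))
            sleep(delays[attempt])


print(backoff_delays(4))  # [0.5, 1.0, 2.0, 4.0]
```

To combine this with the downloader, `func` could be a zero-argument wrapper around the request that raises when the status code is not 200.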
  • Original article: https://www.cnblogs.com/xuchunlin/p/8565952.html