• Python multiprocessing: what downloading two hundred images a minute looks like


    The target is a site in China that bans scrapers by IP, so the only option was to go through proxies. I built my own proxy pool, maintained by 20 processes.

    The connection is 20 Mbit, which works out to roughly 2.5 MB/s in practice, though for various reasons real throughput can fall short of that. Crawling is fairly demanding on bandwidth.
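    The pool itself isn't shown in this post. Below is a minimal sketch of the Redis side of such a pool, assuming a local Redis instance and a hypothetical fetch_proxies() standing in for whatever proxy source feeds it; the keys "1".."10" match the random lookup in save_img() further down.

    # coding:utf-8
    # Sketch: keep "ip:port" strings fresh under Redis keys "1".."10".
    # Assumptions: a local Redis instance; fetch_proxies() is a
    # hypothetical stand-in for the real proxy source.
    import time
    import redis

    r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)

    def fetch_proxies():
        # hypothetical placeholder: pull fresh proxies from your provider
        return ["1.2.3.4:8080", "5.6.7.8:3128"]

    if __name__ == "__main__":
        while True:
            proxies = fetch_proxies()
            for i in range(1, 11):
                # rotate the available proxies across the ten slots
                r.set(str(i), proxies[(i - 1) % len(proxies)])
            time.sleep(60)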

    The workflow has two phases: first scrape the image URLs and save them to the database, then download the images with a multiprocessing pool.
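    The collection phase isn't shown either; here is a minimal sketch of writing the scraped links into the 2017_xia_erci_pic table queried below (the MySQL connection details and collect_image_urls() are assumptions, not the author's code):

    # coding:utf-8
    # Sketch: persist (item_id, item_imgurl) pairs before the download phase.
    import MySQLdb

    def collect_image_urls():
        # hypothetical stand-in for the listing-page spider
        return [("5113610001",
                 "http://img5.artron.net/auction/2017/art5113610001/1/2.jpg")]

    conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="",
                           db="spider", charset="utf8")
    cur = conn.cursor()
    for item_id, item_imgurl in collect_image_urls():
        cur.execute("INSERT INTO 2017_xia_erci_pic (item_id, item_imgurl) "
                    "VALUES (%s, %s)", (item_id, item_imgurl))
    conn.commit()
    cur.close()
    conn.close()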

    In testing, 40 processes downloaded about 200 images per minute, but raising the count to 60 dropped throughput to about 120 per minute, so more workers is not automatically faster.
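    That sweet spot is easy to confirm empirically. Below is a minimal measurement sketch, assuming the listing further down is saved as download_images.py (a hypothetical module name) so getUpdataImage can be imported; the 500-row sample size is an arbitrary choice.

    # coding:utf-8
    # Sketch: time a fixed batch of downloads at several pool sizes.
    from common.contest import *                  # time, multiprocessing, select_data
    from download_images import getUpdataImage    # hypothetical module name

    def benchmark(pool_size, items):
        start = time.time()
        pool = multiprocessing.Pool(pool_size)
        for item in items:
            pool.apply_async(getUpdataImage, (item,))
        pool.close()
        pool.join()
        rate = len(items) * 60.0 / (time.time() - start)
        print pool_size, "processes:", rate, "images/minute"

    if __name__ == "__main__":
        sample = select_data("""SELECT item_id, item_imgurl FROM 2017_xia_erci_pic LIMIT 500""")
        for size in (20, 40, 60):
            benchmark(size, sample)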

    Note: when downloading images or video, always send request headers; in particular, Accept-Encoding: gzip, deflate lets the server compress the transfer.

    The full script:

    # coding:utf-8
    # common.contest is the author's helper module; it is assumed to supply
    # requests, random, os, time, multiprocessing, select_data() and the
    # Redis client `r` that holds the proxy pool.
    from common.contest import *

    def save_img(source_url, dir_path, file_name, maxRequests=11):
    
        headers = {
            "Host": "img5.artron.net",
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
            "Accept": "image/webp,image/apng,image/*,*/*;q=0.8",
            "Referer": "http://auction.artron.net/paimai-art5113610001/",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.8",
        }
        # pick one of the ten proxies kept under Redis keys "1".."10"
        proxies = r.get(str(random.randint(1, 10)))
        proxies = {"http": "http://" + str(proxies)}
        print "using proxy:", proxies
        try:
            response = requests.get(url=source_url, headers=headers, verify=False,
                                    proxies=proxies, timeout=15)
            if response.status_code == 200:
                if not os.path.exists(dir_path):
                    os.makedirs(dir_path)
                total_path = dir_path + '/' + file_name

                # stream the body to disk in 1 KB chunks
                with open(total_path, 'wb') as f:
                    for chunk in response.iter_content(1024):
                        f.write(chunk)
                print "image saved to disk"
                return "1"
            else:
                print "image was not saved"
                return "0"
        except Exception as e:
            # the request may have died before `response` existed, so only
            # the retry budget is checked here
            print e
            if maxRequests > 0:
                return save_img(source_url, dir_path, file_name, maxRequests - 1)
            return "0"
    
    
    
    
    def getUpdataImage(item):
        url = item['item_imgurl']
        print "fetching url:", url
    
        # build a flat file name from the last four path segments of the URL
        filenamelist = url.split('/')
        filename = "_".join(filenamelist[-4:])

        # strip any image extension to get the directory name
        filenamestr = filename
        for ext in ('.jpg', '.JPG', '.JPEG', '.jpeg', '.png', '.bmp', '.tif', '.gif'):
            filenamestr = filenamestr.replace(ext, '')
    
        localpath = 'G:/helloworld/' + filenamestr
        save_localpath = localpath + "/" + filename
        print "saving image to:", save_localpath
    
    
        try:
            result = save_img(url, localpath, filename)
            if result == "1":
                print "image downloaded"
            else:
                print "image download failed"
            return result
        except IOError:
            # ignore disk errors for individual images
            pass
    
    
    
    if __name__ == "__main__":
        time1 = time.time()
        sql = """SELECT item_id, item_imgurl FROM 2017_xia_erci_pic"""
        resultList = select_data(sql)
        print len(resultList)
        # 40 worker processes measured fastest (~200 images/minute);
        # 60 dropped throughput to ~120/minute
        pool = multiprocessing.Pool(40)
        for item in resultList:
            pool.apply_async(getUpdataImage, (item,))
        pool.close()
        pool.join()
        print "elapsed:", time.time() - time1, "seconds"
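    As written, apply_async discards each worker's return value. Since getUpdataImage returns the "1"/"0" string from save_img(), keeping the AsyncResult handles gives a success count for free; a small variant of the __main__ block:

    if __name__ == "__main__":
        resultList = select_data("""SELECT item_id, item_imgurl FROM 2017_xia_erci_pic""")
        pool = multiprocessing.Pool(40)
        # keep the handles so results can be collected after join()
        handles = [pool.apply_async(getUpdataImage, (item,)) for item in resultList]
        pool.close()
        pool.join()
        ok = sum(1 for h in handles if h.get() == "1")
        print "downloaded", ok, "of", len(resultList), "images"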
• Original post: https://www.cnblogs.com/xuchunlin/p/7615590.html