• Scraping images from the MEIZITU website



    A small crawler written for practice.


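    The real image links are found directly with two regular-expression matches: the first isolates the <ul id="pins"> list, and the second extracts each data-original attribute value from it.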

    # -*- coding: utf-8 -*-
    # @Time    : 2020/6/15 16:18
    # @Author  : banshaohuan
    # @Site    :
    # @File    : jiandan.py
    # @Software: PyCharm
    import os
    import re
    import time
    import urllib.error
    import urllib.request
    from http import cookiejar

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        # Accept-Encoding lists content codings, not charsets; urllib does not
        # decompress responses, so ask for an unencoded body.
        "Accept-Encoding": "identity",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
        "Connection": "keep-alive",
        "referer": "qq.com",
    }
    # Send the headers above with every request and keep cookies between requests.
    cjar = cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
    opener.addheaders = list(headers.items())
    urllib.request.install_opener(opener)


    def craw(url, page):
        # Decode the bytes (assuming a UTF-8 page). Calling str() on bytes would
        # give a "b'...'" repr with escaped quotes instead of the real markup.
        html1 = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

        # Step 1: isolate the image list; re.S lets "." match line breaks.
        pat1 = '<ul id="pins">.+?</ul>'
        result1 = re.compile(pat1, re.S).findall(html1)
        if not result1:
            print("no image list found on", url)
            return
        # Step 2: pull each data-original URL out of the list (either quote style).
        pat2 = r"data-original=['\"](.*?)['\"]"
        image_list = re.compile(pat2, re.S).findall(result1[0])

        path_name = "D:/Images/jiandan/xinggan/"
        os.makedirs(path_name, exist_ok=True)  # also creates missing parent folders
        for index, image_url in enumerate(image_list, start=1):
            file_name = path_name + str(page) + "_" + str(index) + ".jpg"
            try:
                urllib.request.urlretrieve(image_url, filename=file_name)
            except urllib.error.URLError as e:
                # Report the failure and carry on with the next image.
                print("failed:", image_url, getattr(e, "code", ""), getattr(e, "reason", ""))


    for i in range(1, 10):
        url = "http://www.mzitu.com/xinggan/page/" + str(i)
        time.sleep(6)  # pause between pages to keep the load on the server low
        print(url)
        craw(url, i)
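
    The two-step regex matching used in the script can be tried offline before touching the network. The sketch below is only an illustration: the sample markup, the example.com URLs, and the helper name extract_image_urls are made up for the demo, and the real page layout may differ.

```python
import re

# Hypothetical sample markup; the real page layout may differ.
SAMPLE_HTML = """
<div class="postlist">
  <ul id="pins">
    <li><a href="/1"><img class="lazy" data-original='https://example.com/a.jpg' alt="a"></a></li>
    <li><a href="/2"><img class="lazy" data-original='https://example.com/b.jpg' alt="b"></a></li>
  </ul>
  <ul id="other"><li><img data-original='https://example.com/skip.jpg'></li></ul>
</div>
"""


def extract_image_urls(html):
    """Return the data-original URLs found inside the <ul id="pins"> block."""
    # Step 1: isolate the list; re.S lets "." span line breaks.
    blocks = re.findall(r'<ul id="pins">.+?</ul>', html, re.S)
    if not blocks:
        return []
    # Step 2: capture the attribute value, accepting either quote style.
    return re.findall(r"data-original=['\"](.*?)['\"]", blocks[0], re.S)


print(extract_image_urls(SAMPLE_HTML))
# → ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

    Narrowing the search to the pins list first keeps images from other parts of the page (the "other" list in the sample) out of the result.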
    
    

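    This is a small site and the crawler is mainly a learning exercise, so to avoid putting much load on the server it is best to set a delay between requests when crawling.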
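
    One possible way to keep that delay in a single place is a small generator that pauses between pages. This is a sketch under assumptions, not part of the original script: polite_pages and its parameters are hypothetical names, and the sleep function is injectable only so the pacing can be checked without actually waiting.

```python
import time


def polite_pages(first, last, delay_seconds=6.0, sleep=time.sleep):
    """Yield page numbers first..last, pausing before every page after the first."""
    for page in range(first, last + 1):
        if page != first:
            sleep(delay_seconds)  # wait between consecutive requests
        yield page


if __name__ == "__main__":
    # A short delay is used here only so the demo finishes quickly.
    for page in polite_pages(1, 3, delay_seconds=0.1):
        print("http://www.mzitu.com/xinggan/page/" + str(page))
```

    In the crawler the loop body would call craw(url, page) with the default 6-second delay instead of only printing the URL.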

  • Original article: https://www.cnblogs.com/banshaohuan/p/13144412.html