• py3+requests+re+urllib: scraping and downloading videos from Budejie (不得姐)


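    For the underlying principles and approach, see my other crawler practice posts:

    py3+urllib+bs4+anti-scraping workarounds, scraping Douban girl pictures in 20-odd lines of code: http://www.cnblogs.com/UncleYong/p/6892688.html
    py3+requests+json+xlwt, scraping Lagou job listings: http://www.cnblogs.com/UncleYong/p/6960044.html
    py3+urllib+re, easily scraping the winning numbers of the last 100 Double Color Ball (双色球) draws: http://www.cnblogs.com/UncleYong/p/6958242.html

    The code is as follows: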

    import os
    import re
    import urllib.request

    import requests

    def get():
        # Send a browser-like User-Agent so the request looks like a normal visit
        hd = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        url = 'http://www.budejie.com/video/'
        html = requests.get(url, headers=hd).text
        # Each post sits in its own <div class="j-r-list-c"> container
        url_content = re.compile(r'(<div class="j-r-list-c">.*?</div>.*?</div>)', re.S)
        url_contents = re.findall(url_content, html)
        url_name = []  # collected [title, video URL] pairs
        for block in url_contents:  # HTML inside each container
            url_reg = r'data-mp4="(.*?)"'
            url_item = re.findall(url_reg, block)  # findall returns a list
            if url_item:
                # \d{8} matches the 8-digit post id in the detail link
                name_reg = re.compile(r'<a href="/detail-\d{8}\.html">(.*?)</a>', re.S)
                name_item = re.findall(name_reg, block)
                for name, video_url in zip(name_item, url_item):
                    # Append a list to the list; a tuple (name, video_url) would work as well
                    url_name.append([name, video_url])
        # urlretrieve does not create the target directory, so create it first
        os.makedirs('video', exist_ok=True)
        for name, video_url in url_name:
            print('Downloading >>>>>  ' + name + ': ' + video_url)
            urllib.request.urlretrieve(video_url, 'video/%s.mp4' % name)

    if __name__ == '__main__':
        get()
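
The extraction step can be checked offline, without hitting the site. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up fragment that imitates the structure the regexes above assume (it is not copied from the real page), and `extract_videos` and `safe_filename` are hypothetical helper names that are not part of the script above. The sketch also shows one possible way to strip characters that are not allowed in file names on common file systems, since post titles are used directly as file names.

```python
import re

# Hypothetical fragment imitating the page structure assumed by the regexes;
# it is NOT copied from the real site.
SAMPLE_HTML = '''
<div class="j-r-list-c">
  <div class="j-r-list-c-desc">
    <a href="/detail-12345678.html">Funny cat: part 1/2?</a>
  </div>
  <div class="j-video" data-mp4="http://example.com/v/cat.mp4"></div>
</div>
<div class="j-r-list-c">
  <div class="j-r-list-c-desc">
    <a href="/detail-87654321.html">Text-only post</a>
  </div>
  <div class="j-r-list-c-img"></div>
</div>
'''

BLOCK_REG = re.compile(r'(<div class="j-r-list-c">.*?</div>.*?</div>)', re.S)
URL_REG = re.compile(r'data-mp4="(.*?)"')
NAME_REG = re.compile(r'<a href="/detail-\d{8}\.html">(.*?)</a>', re.S)


def extract_videos(html):
    """Return a list of (title, video URL) tuples found in the given HTML."""
    pairs = []
    for block in BLOCK_REG.findall(html):
        urls = URL_REG.findall(block)
        if not urls:
            continue  # a post without a data-mp4 attribute has no video
        names = NAME_REG.findall(block)
        pairs.extend(zip(names, urls))
    return pairs


def safe_filename(title):
    """Replace characters that are invalid in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned or 'untitled'  # fall back if nothing usable is left


if __name__ == '__main__':
    for title, video_url in extract_videos(SAMPLE_HTML):
        print(safe_filename(title) + '.mp4', '<-', video_url)
```

With a helper like this, the download loop could build its target path from `safe_filename(name)` instead of the raw title, so that a title containing `/` or `?` does not make the save step fail.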
    

  • Original article: https://www.cnblogs.com/uncleyong/p/6973861.html