• python标准库Beautiful Soup与MongoDb爬喜马拉雅电台的总结


    Beautiful Soup标准库是一个可以从HTML/XML文件中提取数据的Python库,它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式,Beautiful Soup将会节省数小时的工作时间。pymongo标准库是MongoDb NoSql数据库与python语言之间的桥梁,通过pymongo将数据保存到MongoDb中。结合使用这两者来爬去喜马拉雅电台的数据...

    Beautiful Soup支持Python标准库中的HTML解析器,还支持一些第三方的解析器,其中一个是 lxml。本文使用的就是lxml,对于这个的安装,请看 python 3.6 lxml标准库lxml的安装及etree的使用注意
    同时,本文使用了XPath来解析我们想要的部分,对于XPath与Beautiful Soup的介绍与使用请看 Beautiful Soup 4.4.0 文档 XPath 简介
    本文涉及到的Beautiful Soup与XPath的知识不是很深,看看官方文档就能理解,而且我还加上了注释...
    对于pymongo标准库,我就不多扯淡了,详情请看 python标准库之pymongo模块次体验

    有时候,我们需要判断当前向服务器发出请求的客户端的类型,也就是通常所说的User-Agent,简称UA,我们在浏览网页时所使用的浏览器就是UA的一种,换言之,UA就是浏览器,在HTTP协议中,通过User-Agent请求头说明用户浏览器的类型,操作系统,浏览器内核等信息的标识。通过这个标识,用过所访问的网站可以显示不同的版本,从而为用户提供更好的体验或者进行信息统计。而有些网站正式利用UA来防止黑客或是像我们这种无聊的人来爬去网站的数据信息。
    因此,本文代码首先就把所有的UA都给列取出来,以方便后续的爬取工作。

    好了,下面来明确下我们要爬取得数据是什么:

    我们需要的是图片的链接,alt等

    随后我们点击图片链接之后,获取里面的详情,如果有些电台是多页的,那么我们用过xpath来依次访问。同时我们获取页面中专辑里的声音模块的sound_id...

    程序如下:

    import random
    import requests
    from bs4 import BeautifulSoup
    import json
    from lxml import etree
    import pymongo
    
    
    clients = pymongo.MongoClient("localhost", 27017)
    db = clients["XiMaLaYa"]
    collection_1 = db["album"]
    collection_2 = db["detail"]
    
    UA_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    headers1 = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
        'Cache-Control': 'max-age=0',
        'Proxy-Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': random.choice(UA_LIST)  # User_agence表示用户代理
    }
    headers2 = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
        'Cache-Control': 'max-age=0',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.ximalaya.com/dq/all/2',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': random.choice(UA_LIST)
    }
    
    
    # Beautiful库用来处理XML和HTML...
    # 主要就是利用BeautifulSoup模块来处理requests模块获取的Html源码
    # 利用lxml模块将html源码解析成树结构,xpath来处理树节点.
    def get_url():
        start_urls = ["http://www.ximalaya.com/dq/all/{}".format(num) for num in range(1,85)]
        # start_urls = ["http://www.ximalaya.com/dq/all/1"]
        for start_url in start_urls:
            html = requests.get(start_url, headers=headers1).text
            soup = BeautifulSoup(html, "lxml")  # 使用lxml来处理
            for item in soup.find_all(class_="albumfaceOutter"):  # 解析并查找xml节点
                content = {
                    'href': item.a["href"],
                    'title': item.img['alt'],
                    'img_url': item.img['src']
                }
                collection_1.insert(content)
                # another(item.a["href"])
        print('写入完成...')
    
    
    # 进入电台具体页面 http://www.ximalaya.com/15836959/album/303085,并处理分页录音...
    def another(url):
        html = requests.get(url, headers=headers1).text
        # / :表示从根节点选取....
        # // :表示匹配选择的当前节点选择文档中的节点,而不考虑他们的位置...
        ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')  # 页面链接地址  ifanother是list类型...
        if len(ifanother):  # 判断一个video的录音是否分割成了多页....
            num = ifanother[0]  # 获取页面数...
            print('本频道保存在' + num + '个页面')
            for n in range(1, int(num)):
                url2 = url + '?page={}'.format(n)
                get_m4a(url2)
            get_m4a(url)
    
    
    # 获取分页录音页面的详细数据...
    def get_m4a(url):
        html = requests.get(url, headers=headers2).text
        numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
        for i in numlist:
            murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
            html = requests.get(murl, headers=headers1).text
            dic = json.loads(html)
            collection_2.insert(dic)
    
    
    if __name__ == "__main__":
        get_url()
  • 相关阅读:
    设计模式:组合模式
    对技术的认识及思考
    设计模式:策略模式
    java集合:常用集合的数据结构
    设计模式:代理模式
    java反射
    Spring事务管理
    在Spring使用junit注解进行单元测试
    tomcat限制ip访问
    获取openid回调两次
  • 原文地址:https://www.cnblogs.com/zhiyong-ITNote/p/7172433.html
Copyright © 2020-2023  润新知