• scrapy--json (Ximalaya FM)


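    I have been listening to Ximalaya FM for two months now. Listening to the stories there, I feel like I can hear myself, especially on Ruixi's station (蕊希电台): it starts with the voice, pulls you into the story, and stays true to the takeaway. Thanks to Ximalaya FM for keeping me company these two months. I suppose I love it too much, so I decided to point a scraper at it. QAQ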

    This post is extracted and adapted from the following posts.

    https://www.jianshu.com/p/8ff95111b18a
    https://www.imooc.com/article/48315

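    Problems to solve

    1. The .m4a file URLs are stored in JSON text             -- inspect the page with F12, then read the data with json.loads
    2. Download all audio files of other hosts as well
    3. Sort the downloaded files into categories             -- extract the host id (the album id in the code) and pass it along with meta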
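Problem 3 comes down to turning an id plus an audio URL into a folder-per-id file path. Here is a minimal sketch of that idea using only the standard library. The helper name `relative_audio_path` and the example URL are hypothetical and are not part of the original project.

```python
from posixpath import basename
from urllib.parse import urlparse


def relative_audio_path(album_id, audio_url):
    """Return '<album_id>/<file name>' for a downloaded audio file."""
    # Use only the path component so a query string never ends up in the name.
    file_name = basename(urlparse(audio_url).path)
    return "{}/{}".format(album_id, file_name)


# Hypothetical URL, shaped like an .m4a link found in the JSON.
print(relative_audio_path("321787", "https://audio.example.com/group1/M00/abc123.m4a"))
# prints: 321787/abc123.m4a
```

Forward slashes are used on purpose: a path like this is meant to be relative to a download root, and `/` works on both Windows and Linux.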

    III. First, a look at the results

    I. Extracting the page source

    1.1_Extract the trackId: "https://www.ximalaya.com/qinggan/321787/130991924"

    1.2_Extract the ids of other hosts

    1.3_trackIds of all of a host's works: "http://www.ximalaya.com/revision/album/getTracksList?albumId=321787&pageNum=13"

    1.4_Extract the .m4a file: https://www.ximalaya.com/revision/play/tracks?trackIds=35217881
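Steps 1.3 and 1.4 chain two URLs together: a track-list page gives the trackIds, and each trackId gives a play URL. A rough sketch of that chain is below. The helper names are hypothetical, `https` is used for both endpoints, and the sample JSON is hand-written to match the keys the spider reads (`data`, `tracks`, `trackId`), so treat the exact response shape as an assumption.

```python
import json
from urllib.parse import urlencode

BASE = "https://www.ximalaya.com/revision"


def track_list_url(album_id, page_num):
    """URL of one page of an album's track list (step 1.3)."""
    query = urlencode({"albumId": album_id, "pageNum": page_num})
    return "{}/album/getTracksList?{}".format(BASE, query)


def play_url(track_id):
    """URL whose JSON response contains the .m4a link (step 1.4)."""
    return "{}/play/tracks?{}".format(BASE, urlencode({"trackIds": track_id}))


def track_ids(track_list_json):
    """Pull every trackId out of a track-list response body."""
    payload = json.loads(track_list_json)
    return [track["trackId"] for track in payload["data"]["tracks"]]


# Hand-written sample, not a captured response.
sample = '{"data": {"tracks": [{"trackId": 35217881}, {"trackId": 130991924}]}}'

print(track_list_url(321787, 13))
for track_id in track_ids(sample):
    print(play_url(track_id))
```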

    II. Code setup: I won't go into detail on middlewares.py, settings.py, and items.py; see my earlier posts.
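For readers without the earlier posts, here is a guess at the few settings.py entries the code in this post relies on. The values are placeholders, not the author's actual configuration. One assumption deserves a warning: the spider calls `random.choice(USER_AGENT)`, so `USER_AGENT` is written as a list here, although Scrapy's own `USER_AGENT` setting is normally a single string.

```python
# Xima/settings.py (sketch; values are placeholders)
BOT_NAME = "Xima"

# Enable the custom pipeline from pipelines.py. The number sets the run order
# when several pipelines are enabled (lower runs first).
ITEM_PIPELINES = {
    "Xima.pipelines.XimaPipeline": 1,
}

# Root directory for downloaded files. FilesPipeline stays disabled
# unless this setting is present.
FILES_STORE = "downloads"

# Assumption: a list, because the spider picks one entry with random.choice().
# Scrapy's built-in user-agent handling expects a plain string instead.
USER_AGENT = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0",
]
```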

    2.1_pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    from os.path import basename

    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.files import FilesPipeline


    class XimaPipeline(FilesPipeline):
        def get_media_requests(self, item, info):
            # One download request per item; meta carries what file_path needs.
            yield scrapy.Request(item['m4_urls'], meta={'file_name': item['file_name'], 'm4_urls': item['m4_urls']})

        def file_path(self, request, response=None, info=None, *, item=None):
            # The request received here is the one yielded by get_media_requests.
            # The returned path is relative to FILES_STORE, so FILES_STORE is not joined in.
            meta = request.meta
            return '%s/%s' % (meta['file_name'], basename(meta['m4_urls']))

        def item_completed(self, results, item, info):
            # Keep the item only if at least one file was downloaded.
            file_paths = [x['path'] for ok, x in results if ok]
            if not file_paths:
                raise DropItem("Item contains no files")

            return item
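The `results` argument of `item_completed` is a list of `(success, info)` pairs, one per requested file. The sketch below runs the same filter on hand-made data. The sample values are invented, and a plain `Exception` stands in for the Twisted `Failure` object that Scrapy passes for a failed download.

```python
def downloaded_paths(results):
    """Keep the stored path of every successful download."""
    return [file_info["path"] for ok, file_info in results if ok]


# Shaped like what FilesPipeline hands to item_completed; values are made up.
results = [
    (True, {"url": "https://audio.example.com/a.m4a", "path": "321787/a.m4a", "checksum": "0" * 32}),
    (False, Exception("download failed")),  # stand-in for a twisted Failure
]

print(downloaded_paths(results))
# prints: ['321787/a.m4a']
```

If the list comes back empty, the pipeline raises `DropItem`, so an item whose audio failed to download never reaches later pipelines.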

    2.2_Spider code

    # -*- coding: utf-8 -*-
    import json
    import random

    import scrapy

    from Xima.items import XimaItem
    from Xima.settings import USER_AGENT
    
    
    class XimaSpider(scrapy.Spider):
        name = 'xima'
        allowed_domains = ['www.ximalaya.com']
        start_urls = ['https://www.ximalaya.com/revision/seo/hotWordAlbums?id=321787&queryType=1']
    
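        # Browser-like request headers.
        # Note: no request below passes this dict, so Scrapy's default headers are what get sent.
        headers = {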
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Content-Length': '11',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Host': 'www.ximalaya.com',
            'Origin': 'www.ximalaya.com',
            'Referer': 'https://www.ximalaya.com/revision/seo/hotWordAlbums?id=321787&queryType=1',
            'User-Agent': random.choice(USER_AGENT),
            'X-Requested-With': 'XMLHttpRequest',
        }
    
        def start_requests(self):
            yield scrapy.Request(self.start_urls[0], callback=self.parse_1)

        def parse_1(self, response):
            # Request 20 track-list pages (pageNum 0-19) for every hot-word album.
            for album in json.loads(response.body)['data']['hotWordAlbums']:
                album_id = str(album['id'])
                for page in range(20):
                    new_url = 'http://www.ximalaya.com/revision/album/getTracksList?albumId=' + album_id + '&pageNum=' + str(page)
                    # The key is named 'trackid' but holds the album id, which later becomes the folder name.
                    yield scrapy.Request(new_url, callback=self.parse, meta={'trackid': album_id})

        def parse(self, response):
            # Skip responses whose track list is empty.
            tracks = json.loads(response.body)['data']['tracks']
            for track in tracks or []:
                yield scrapy.Request('https://www.ximalaya.com/revision/play/tracks?trackIds=%s' % track['trackId'], callback=self.m4a, meta={'trackid': response.meta['trackid']})

        def m4a(self, response):
            # Check that the list is non-empty before indexing its first entry.
            audio = json.loads(response.body)['data']['tracksForAudioPlay']
            if audio and audio[0]['src']:
                xima = XimaItem()
                xima['file_name'] = response.meta['trackid']
                xima['m4_urls'] = audio[0]['src']
                yield xima
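The last callback indexes straight into the JSON, so a response with a missing key or an empty list would raise an exception. A more defensive version of that lookup might look like the sketch below. The function name `first_audio_src` is hypothetical, and the key names are simply the ones the spider reads.

```python
import json


def first_audio_src(body):
    """Return the first track's `src` URL, or None if it is absent."""
    payload = json.loads(body)
    # `or` fallbacks cover both missing keys and explicit null values.
    tracks = (payload.get("data") or {}).get("tracksForAudioPlay") or []
    if not tracks:
        return None
    return tracks[0].get("src") or None


# Hand-written samples, not captured responses.
print(first_audio_src('{"data": {"tracksForAudioPlay": [{"src": "https://audio.example.com/a.m4a"}]}}'))
print(first_audio_src('{"data": {"tracksForAudioPlay": []}}'))
```

The first call prints the URL and the second prints `None`, which a callback can use as the signal to yield nothing.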
  • Original post: https://www.cnblogs.com/eilinge/p/9843979.html