• Crawler Primer 6: Summary, Resources, and a Scrapy Example - Bilibili Anime Info



    title: Crawler Primer 6: Summary, Resources, and a Scrapy Example - Bilibili Anime Info
    date: 2020-03-16 20:00:00
    categories: python
    tags: crawler

    Supplementary study materials,
    plus a Scrapy example: crawling bilibili anime (bangumi) information.

    1 Summary and Resources

    1.1 Basics

    1. Learn the basics of Python crawling; install PyCharm.
    2. Learn the Scrapy framework.
    Related official links:

    Scrapy official tutorial: https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

    Requests official docs (Chinese): http://cn.python-requests.org/zh_CN/latest/

    lxml official tutorial: http://lxml.de/tutorial.html

    The pickle and cPickle core modules: http://www.pythonclub.org/modules/pickle

    PhantomJS processes do not exit on their own; you must call driver.quit() explicitly.
    See: http://www.jianshu.com/p/9d408e21dc3a

    Runoob tutorial (Python): https://www.runoob.com/python3/python3-tutorial.html

    1.2 XPath and re

    1. Learn Scrapy + XPath.
    2. Learn regular expressions.
    3. Write part 1 of lab report 1.

    Related official links:
    XPath syntax: http://www.w3school.com.cn/xpath/xpath_syntax.asp

    Selectors:https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html
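    The XPath syntax from the links above can be practiced without Scrapy: Python's standard-library ElementTree understands a small subset of XPath, including attribute predicates. A minimal sketch on a made-up fragment:

    ```python
    import xml.etree.ElementTree as ET

    # A made-up, well-formed fragment to practice on.
    html = """
    <div>
      <ul>
        <li class="item"><a href="a.html">first</a></li>
        <li class="item"><a href="b.html">second</a></li>
      </ul>
    </div>
    """

    root = ET.fromstring(html)
    # ElementTree supports a limited XPath subset, e.g. .//tag[@attr='value'].
    links = [a.get("href") for a in root.findall(".//li[@class='item']/a")]
    texts = [a.text for a in root.findall(".//li[@class='item']/a")]
    ```

    Real pages need a forgiving HTML parser (lxml, or Scrapy's own selectors), but the predicate syntax carries over.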

    2 Example: Crawling Bilibili Anime Info

    Reference: https://www.jianshu.com/p/530acc7b50d1?utm_source=oschina-app

    2.1 Creating the Project

    Create a new project:
    scrapy startproject bilibilianime

    2.2 The Anime Index

    The anime index page: https://www.bilibili.com/anime/index/
    Inspect it with the F12 developer console.
    The relevant fields (follower count, membership badge, title, episode count) are all visible there:

  • bilibili番剧索引页F12控制台查看.png

    In the PyCharm terminal, test with the Scrapy shell whether XPath can extract that information:

    scrapy shell "https://www.bilibili.com/anime/index" -s USER_AGENT='Mozilla/5.0'
    response.xpath('//*[@class="bangumi-item"]//*[@class="bangumi-title"]').extract()

    out: []

    The result is empty.
    view(response) opens the page in a browser and everything looks fine there,
    but response.text shows that the anime information is not in the HTML at all.

    In the browser, open F12, go to the Network tab, select XHR, right-click and save the captured traffic as a HAR file, then search it locally.
    Searching the HAR file for 鬼灭之刃 finds:

    "text": "{"code":0,"data":{"has_next":1,"list":[{"badge":"会员专享","badge_type":0,"cover":"http://i0.hdslb.com/bfs/bangumi/9d9cd5a6a48428fe2e4b6ed17025707696eab47b.png","index_show":"全26话","is_finish":1,"link":"https://www.bilibili.com/bangumi/play/ss26801","media_id":22718131,"order":"758万追番","order_type":"fav_count","season_id":26801,"title":"鬼灭之刃","title_icon":
    

    Searching upward for the nearest request:

    "request": {
              "method": "GET",
              "url": "https://api.bilibili.com/pgc/season/index/result?season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1&season_type=1&pagesize=20&type=1",
              "httpVersion": "HTTP/1.1",
    

    Opening
    https://api.bilibili.com/pgc/season/index/result?season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1&season_type=1&pagesize=20&type=1
    in a browser returns the following:

    {"code":0,"data":{"has_next":1,"list":[{"badge":"会员专享","badge_type":0,"cover":"http://i0.hdslb.com/bfs/bangumi/9d9cd5a6a48428fe2e4b6ed17025707696eab47b.png","index_show":"全26话","is_finish":1,"link":"https://www.bilibili.com/bangumi/play/ss26801","media_id":22718131,"order":"758.1万追番","order_type":"fav_count","season_id":26801,"title":"鬼灭之刃","title_icon":""},{"badge":"会员专享","badge_type":0,"cover":"http://i0.hdslb.com/bfs/bangumi/f5d5f51b941c01f8b90b361b412dc75ecc2608d3.png","index_show":"全14话","is_finish":1,"link":"https://www.bilibili.com/bangumi/play/ss24588","media_id":102392,"order":"660.2万追番","order_type":"fav_count","season_id":24588,"title":"工作细胞","title_icon":""},{"badge":"会员专享",
    ... (truncated)
    

    So the bilibili anime index page fetches its data through an API, with sort, pagesize, and the other query parameters controlling the result format:

    sort: 0 = descending, 1 = ascending
    order: 3 = by follower count, 0 = by update time, 4 = by rating, 2 = by play count, 5 = by air date
    page: index of the page to return
    pagesize: defaults to 20, matching the web page; 25 is the maximum
    The remaining parameters match the filter bar on the right side of the page, so their meanings are easy to guess.
    

    bilibili番剧索引页筛选.png

    In short, the index can be fetched entirely through this API.
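    As a sketch, the paginated API URL can be assembled with the standard library. build_index_url is a name made up here; the parameter values mirror the captured request above:

    ```python
    from urllib.parse import urlencode

    API_BASE = "https://api.bilibili.com/pgc/season/index/result"

    def build_index_url(page, pagesize=20, order=3, sort=0):
        """Build one page of the anime-index API URL.

        order=3 sorts by follower count and sort=0 means descending,
        matching the defaults seen in the captured request.
        """
        params = {
            "season_version": -1, "area": -1, "is_finish": -1,
            "copyright": -1, "season_status": -1, "season_month": -1,
            "year": -1, "style_id": -1, "order": order, "st": 1,
            "sort": sort, "page": page, "season_type": 1,
            "pagesize": pagesize, "type": 1,
        }
        return API_BASE + "?" + urlencode(params)
    ```

    Incrementing page until the API reports no further data (has_next / an empty list) walks the whole index.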

    2.3 Anime Detail Pages

    The earlier search for 鬼灭之刃 turned up this fragment:
    https://www.bilibili.com/bangumi/play/ss26801","media_id":22718131

    media_id is the anime's ID, and the detail page lives at

    https://www.bilibili.com/bangumi/media/md22718131

    so substituting each ID into the end of that URL yields every anime's detail page.
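    That substitution is a one-liner (detail_url is a hypothetical helper name):

    ```python
    def detail_url(media_id):
        # Detail pages follow the pattern .../bangumi/media/md<media_id>.
        return "https://www.bilibili.com/bangumi/media/md" + str(media_id)
    ```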

    On the 鬼灭之刃 detail page, use F12 to locate the nodes that hold each field.
    The tags, for example, sit inside the node with class=media-tags:

    鬼灭之刃详情页tags与节点.png

    Then check in the Scrapy shell whether XPath can extract them:

    >scrapy shell "https://www.bilibili.com/bangumi/media/md22718131" -s USER_AGENT='Mozilla/5.0'
    
    response.xpath('//*[@class="media-tag"]/text()').extract()
    

    They can be extracted directly:

    In [1]: response.xpath('//*[@class="media-tag"]/text()').extract()
    Out[1]: ['漫画改', '战斗', '热血', '声控']
    

    However, testing shows that the staff and voice-actor lists cannot be fetched directly with XPath,

    so inspect response.text:

    ,"staff":"原作:吾峠呼世晴(集英社《周刊少年JUMP》连载)\n监督:外崎春雄\n角色设计:松岛晃\n副角色设计:佐藤美幸、
    梶山庸子、菊池美花\n脚本制作:ufotable\n概念美术:卫藤功二、矢中胜、竹内香纯、桦泽侑里\n摄影监督:
    
    "actors":"灶门炭治郎:花江夏树\n灶门祢豆子:鬼头明里\n我妻善逸:下野纮\n嘴平伊之助:松冈祯丞\n富冈义勇:樱井孝宏\n鳞泷左近次:大冢芳忠\n锖兔:
    梶裕贵\n真菰:加隈亚衣\n不死川玄弥:冈本信彦\n产屋敷耀哉:森川智之\n产屋敷辉利哉:悠木碧\n产屋敷雏衣:井泽诗织\n钢铁冢萤:浪川大辅\n鎹鸦:山崎巧\n佛堂鬼:绿川光\n手鬼:子安武人",
    

    These fields sit inside a JSON string embedded in the page, so extract them with re.
    For the voice actors, for example:

    import re
    actors = re.compile('actors":(.*?),')   # non-greedy match, up to the first comma
    text = response.text
    re.findall(actors, text)
    
    In [17]: actors=re.compile('actors":(.*?),')
    In [18]: re.findall(actors,text)
    Out[18]: ['"灶门炭治郎:花江夏树\n灶门祢豆子:鬼头明里\n我妻善逸:下野纮\n嘴平伊之助:松冈祯丞\n富冈义勇:樱井孝宏\n鳞泷左近次:大冢芳忠\n锖兔:梶裕贵\n真菰:加隈亚衣\n不死川玄弥:冈本信彦\n产屋敷耀哉:森川智之\n产屋敷辉利哉
    :悠木碧\n产屋敷雏衣:井泽诗织\n钢铁冢萤:浪川大辅\n鎹鸦:山崎巧\n佛堂鬼:绿川光\n手鬼:子安武人"']
    

    The reviews, per-episode titles, and so on can all be extracted with re in the same way.
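    One caveat on the 'actors":(.*?),' pattern: the non-greedy match stops at the first ASCII comma, which only works because these fields separate entries with \n rather than commas. A slightly sturdier sketch matches the whole quoted value instead; the fragment below is synthetic, modeled on the response text above:

    ```python
    import re

    # Synthetic stand-in for response.text; in the raw JSON, \n is a
    # two-character escape (backslash + n), so the Python literal uses \\n.
    text = '"staff":"原作:吾峠呼世晴\\n监督:外崎春雄","actors":"灶门炭治郎:花江夏树\\n灶门祢豆子:鬼头明里",'

    # Capture everything inside the quotes; this only assumes the value
    # contains no embedded double quote.
    actors_p = re.compile('"actors":"([^"]*)"')
    cv = actors_p.search(text).group(1)
    roles = cv.split("\\n")  # split on the literal \n escape sequences
    ```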

    2.4 Linking the Index Pages to the Detail Pages

    Each API page covers 20 detail pages and already carries those 20 anime's summary information, and the API response is also what tells us whether every anime has been crawled.

    https://blog.csdn.net/u012150179/article/details/34486677
    https://www.zhihu.com/question/30201428
    Following the approach described in the links above:

    How do you extract the URLs inside http://a.com while also extracting the data on the http://a.com page itself?
    You can yield both the Requests and the item from the same parse method; there is no need for a separate parse_item. For example:

    def parse(self, response):
        #do something
        yield scrapy.Request(url, callback=self.parse)
    
        #item[key] = value
        yield item
    
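    The mixed-yield pattern is plain Python generator behavior and can be demonstrated without Scrapy. FakeRequest below is a stand-in for scrapy.Request, and dispatching each yielded object by type is roughly what the framework does internally:

    ```python
    class FakeRequest:
        """Stand-in for scrapy.Request, used only to illustrate the pattern."""
        def __init__(self, url):
            self.url = url

    def parse(page):
        # One method yields both a follow-up "request" and the page's own item.
        if page["next"]:
            yield FakeRequest(page["next"])
        yield {"title": page["title"]}

    out = list(parse({"title": "鬼灭之刃", "next": "page2"}))
    last = list(parse({"title": "工作细胞", "next": None}))  # final page: item only
    ```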

    2.5 Miscellaneous

    When exporting with -o, watch the output encoding: utf-8 or gb18030.
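    For context: gb18030 is the Chinese national encoding that covers all of Unicode (a superset of GBK), and CSVs in it tend to open cleanly in Excel on Chinese-locale Windows, while utf-8 is the safer default elsewhere. A quick round-trip check:

    ```python
    title = "鬼灭之刃"
    # gb18030 can represent every Unicode character, so the round-trip
    # is lossless; the byte sequences differ from utf-8, though.
    encoded = title.encode("gb18030")
    roundtrip = encoded.decode("gb18030")
    ```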

    2.6 code

    settings.py

    BOT_NAME = 'bilibilianime'
    SPIDER_MODULES = ['bilibilianime.spiders']
    NEWSPIDER_MODULE = 'bilibilianime.spiders'
    FEED_EXPORT_ENCODING = "gb18030"
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'bilibilianime (+http://www.yourdomain.com)'
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    

    items.py

    import scrapy
    
    
    class BilibilianimeItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        badge = scrapy.Field()
        badge_type = scrapy.Field()
        is_finish = scrapy.Field()
        media_id = scrapy.Field()
        index_show = scrapy.Field()
        follow = scrapy.Field()
        play = scrapy.Field()
        pub_date = scrapy.Field()
        pub_real_time = scrapy.Field()
        renewal_time = scrapy.Field()
        score = scrapy.Field()
        season_id = scrapy.Field()
        title = scrapy.Field()
        tags = scrapy.Field()
        brief = scrapy.Field()
        cv = scrapy.Field()
        staff = scrapy.Field()
        count = scrapy.Field()
    

    bilibili.py (spider)

    import scrapy
    import logging
    from scrapy import Request
    from bilibilianime.items import BilibilianimeItem
    import re
    import json
    class MySpider(scrapy.Spider):
        name = 'bilibili'
        allowed_domains = ['bilibili.com']
        url_head = 'https://bangumi.bilibili.com/media/web_api/search/result?season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&pub_date=-1&style_id=-1&order=3&st=1&sort=0&season_type=1'
        start_urls = [url_head+"&page=1"]
    
        # first handle the anime list returned by the API
        def parse(self, response):
            self.log('Main page %s' % response.url,level=logging.INFO)
            data=json.loads(response.text)
            # the page number is whatever follows the last '=' in the URL
            next_index = int(response.url.rsplit("=", 1)[1]) + 1
            if(len(data['result']['data'])>0):
                # issue a Request for the next index page
                next_url = self.url_head+"&page="+str(next_index)
                yield Request(next_url, callback=self.parse)
                medias=data['result']['data']
                for m in medias:
                    media_id=m['media_id']
                    detail_url='https://www.bilibili.com/bangumi/media/md'+str(media_id)
                    yield Request(detail_url,callback=self.parse_detail,meta=m)
        # then handle each anime's detail page
        def parse_detail(self, response):
            item = BilibilianimeItem()
            item_brief_list=['badge','badge_type','is_finish','media_id','index_show','season_id','title']
            item_order_list=['follow','play','pub_date','pub_real_time','renewal_time','score']
            m=response.meta
            for key in item_brief_list:
                if (key in m):
                    item[key]=m[key]
                else:
                    item[key]=""
            for key in item_order_list:
                if (key in m['order']):
                    item[key]=m['order'][key]
                else:
                    item[key]=""
            tags=response.xpath('//*[@class="media-tag"]/text()').extract()
            tags_string=''
            for t in tags:
                tags_string=tags_string+" "+t
            item['tags']=tags_string
            item['brief'] = response.xpath('//*[@name="description"]/attribute::content').extract()
            #detail_text = response.xpath('//script')[4].extract()  # the original reference code narrowed the search to one <script> node for speed, but the index was wrong, so search the full response instead
            detail_text = response.text
            actor_p = re.compile('actors":(.*?),')
            ratings_count_p = re.compile('count":(.*?),')
            staff_p = re.compile('staff":(.*?),')
            item['cv'] = re.findall(actor_p,detail_text)[0]
            item['staff'] = re.findall(staff_p,detail_text)[0]
            count_list=re.findall(ratings_count_p,detail_text)
            if(len(count_list)>0):
                item['count'] = count_list[0]
            else:
                item['count']=0
    #        self.log(item)
            return item
    

    2.7 Output

    scrapy crawl bilibili -o bilibilianime.csv

    The API is called with its default parameters, so the results come back sorted by follower count.

    Result:

    bilibilianime csv.png

    3 Other Examples

    https://www.cnblogs.com/xinyangsdut/p/7628770.html (Tencent job postings)

    https://blog.csdn.net/u013830811/article/details/45793477 (amazon.cn)

    https://github.com/Mrrrrr10/Bilibili_Spider (bilibili)
    https://blog.csdn.net/weixin_42471384/article/details/83049336

  • Original article: https://www.cnblogs.com/lqerio/p/13484336.html