Scrapy Crawler: Douban Top 250


    I recently learned Scrapy, a superb Python library, and am writing down some notes.

    At first I read the official documentation, but it is a bit obscure in places, complete intermediate-to-advanced examples are scarce, and some APIs have changed between versions. So, combining a cnblogs post with the official docs, I put together my own Scrapy project targeting the Douban movie Top 250. The source code follows.

    First, create a new project in the directory of your choice:

    scrapy startproject douban

    Enter the douban folder and take a look at the directory structure; result.txt is my output file.
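
    For reference, this is the standard layout that scrapy startproject generates (result.txt only appears once the crawl has run and the pipeline has written it):

```
douban/
├── scrapy.cfg            # deploy configuration
└── douban/               # the project's Python package
    ├── __init__.py
    ├── items.py          # item definitions
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spider modules go here
        └── __init__.py
```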

    Next, edit items.py under the douban folder to gather the scraped fields into a single item:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class DoubanItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        movie_name = scrapy.Field()
        movie_director = scrapy.Field()
        movie_editor = scrapy.Field()
        movie_roles = scrapy.Field()
        movie_style = scrapy.Field()
        movie_date = scrapy.Field()
        movie_long = scrapy.Field()
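
    A scrapy.Item is used like a dict whose allowed keys are the declared fields. A minimal stand-in using a plain dict (hypothetical values, no Scrapy dependency) illustrates the access pattern the spider below relies on:

```python
# Stand-in for DoubanItem: the spider fills each declared field with
# the list of strings that XPath .extract() returns.
item = {}
item['movie_name'] = ['The Shawshank Redemption']   # hypothetical value
item['movie_director'] = ['Frank Darabont']         # hypothetical value

print(item['movie_name'][0])   # first (often only) extracted string
```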

    Then write the spider itself: create douban_spider.py under the spiders folder:

    # -*- coding: utf-8 -*-
    from scrapy.spiders import BaseSpider   # newer versions import from scrapy.spiders
    from scrapy.selector import HtmlXPathSelector
    from douban.items import DoubanItem
    import scrapy
    import re
    import sys
    reload(sys)
    sys.setdefaultencoding("utf-8")  # Python 2: force utf-8 as the default encoding


    class DoubanSpider(BaseSpider):
        """Spider for the Douban movie Top 250."""
        name = "douban"    # the name used by "scrapy crawl"
        allowed_domains = ["movie.douban.com"]   # domains the spider may crawl (note: "allowed_domains")
        # the URLs the crawl starts from: ten list pages, 25 films each
        start_urls = ["http://movie.douban.com/top250" + "?start=" + str(page * 25) + "&filter=&type=" for page in range(0, 10)]

        # default callback, called for every list page
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            movie_link = hxs.xpath('//div[@class="hd"]/a/@href').extract()
            # alternative: follow the "next" link instead of precomputing start_urls
            # movie_next = hxs.xpath('//span[@class="next"]/a/@href').extract()
            for link in movie_link:
                # request each film's detail page, handled by parse_item below
                yield scrapy.Request(link, callback=self.parse_item)

        # custom callback for the second-level (detail) pages
        def parse_item(self, response):
            item_has = HtmlXPathSelector(response)
            movie_name = item_has.xpath('//h1/span/text()').extract()
            movie_director = item_has.xpath('//a[@rel="v:directedBy"]/text()').extract()
            movie_editor = item_has.xpath('//div[@id="info"]/span[2]/span[@class="attrs"]/a/text()').extract()
            movie_roles = item_has.xpath('//a[@rel="v:starring"]/text()').extract()
            movie_style = item_has.xpath('//span[@property="v:genre"]/text()').extract()
            movie_date = item_has.xpath('//span[@property="v:initialReleaseDate"]/text()').extract()
            movie_long = item_has.xpath('//span[@property="v:runtime"]/text()').extract()
            item = DoubanItem()
            item['movie_name'] = movie_name
            item['movie_director'] = movie_director
            item['movie_editor'] = movie_editor
            item['movie_roles'] = movie_roles
            item['movie_style'] = movie_style
            item['movie_date'] = movie_date
            item['movie_long'] = movie_long
            yield item
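
    The start_urls list comprehension above precomputes all ten list pages, 25 films per page. Expanding it in plain Python (stdlib only) shows the URLs the crawl will begin from:

```python
# Reproduce the spider's start_urls list comprehension.
base = "http://movie.douban.com/top250"
start_urls = [base + "?start=" + str(page * 25) + "&filter=&type="
              for page in range(0, 10)]

print(len(start_urls))    # 10 pages
print(start_urls[0])      # http://movie.douban.com/top250?start=0&filter=&type=
print(start_urls[-1])     # http://movie.douban.com/top250?start=225&filter=&type=
```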

    Finally, edit pipelines.py under douban to write the scraped data to a file:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    NUM = 1   # running index of the record being written


    class DoubanPipeline(object):

        def process_item(self, item, spider):
            movie_name = item['movie_name']
            movie_director = item['movie_director']
            # a separator could be appended here between multiple values
            movie_editor = [line + '' for line in item['movie_editor']]
            movie_roles = [line + '' for line in item['movie_roles']]
            movie_style = [line + '' for line in item['movie_style']]
            movie_date = [line + '' for line in item['movie_date']]
            movie_long = item['movie_long']
            f = open("result.txt", "a")
            global NUM
            f.write(str(NUM))
            f.write("\nTitle: ")
            NUM += 1
            print "NAME:", movie_name
            f.writelines(movie_name)
            f.write("\nDirector: ")
            f.writelines(movie_director)
            f.write("\nScreenwriter: ")
            f.writelines(movie_editor)
            f.write("\nCast: ")
            f.writelines(movie_roles)
            f.write("\nGenre: ")
            f.writelines(movie_style)
            f.write("\nRelease date: ")
            f.writelines(movie_date)
            f.write("\nRuntime: ")
            f.writelines(movie_long)
            f.write("\n")
            f.close()
            return item
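
    Because XPath .extract() returns lists, writelines() concatenates multi-valued fields (cast, genres) with no separator between the entries. A stand-alone sketch of one record's formatting — the sample data and the " / " separator are my own additions, not from the original pipeline — shows one way to keep the values readable:

```python
# Hypothetical values in the shape .extract() produces: lists of strings.
item = {
    "movie_name": ["The Shawshank Redemption"],
    "movie_director": ["Frank Darabont"],
    "movie_roles": ["Tim Robbins", "Morgan Freeman"],
}

def format_record(num, item):
    # Join multi-valued fields with " / " instead of writelines()'s bare
    # concatenation, so adjacent names stay distinguishable.
    lines = [str(num),
             "Title: " + " / ".join(item["movie_name"]),
             "Director: " + " / ".join(item["movie_director"]),
             "Cast: " + " / ".join(item["movie_roles"])]
    return "\n".join(lines) + "\n"

print(format_record(1, item))
```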

    Last of all, don't forget to edit ITEM_PIPELINES in settings.py under douban, pointing it at the pipeline we just wrote; it is commented out by default:

    ITEM_PIPELINES = {
       'douban.pipelines.DoubanPipeline': 300,
    }

    The program is now complete. Run it from the project root, i.e. the outer douban folder:

    scrapy crawl douban

    (Scrapy's feed exports can also write items to a file directly, e.g. scrapy crawl douban -o items.json, without a custom pipeline.)

Original post: https://www.cnblogs.com/phil-chow/p/5347498.html