• Scrapy spider example (1)


    Spider example

    1. Define the items in advance
    import scrapy
    class SuperspiderItem(scrapy.Item):
        title = scrapy.Field()
        date = scrapy.Field()
        content = scrapy.Field()
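    
    A quick illustration (my addition, not part of the project code): a scrapy.Item behaves like a dict whose keys are limited to the declared fields.
    
    item = SuperspiderItem()
    item['title'] = 'some title'   # fine: 'title' is a declared Field
    # item['author'] = 'x'         # would raise KeyError: not a declared field
    print(dict(item))              # {'title': 'some title'}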
    
    2. Crawl scope (allowed_domains) and start_urls
    class Spider1Spider(scrapy.Spider):
        name = 'spider1'
        allowed_domains = ['wz.sun0769.com']  # bare domain, no scheme or path
        start_urls = ['http://wz.sun0769.com/html/top/report.shtml']
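    
    Note: allowed_domains must contain bare domain names. Writing a full URL here (the mistake mentioned in the "problems" list at the end) makes the offsite middleware silently filter every request.
    
    # Wrong: a URL, not a domain -- every request is filtered as "offsite"
    # allowed_domains = ['http://wz.sun0769.com/']
    # Right: the bare domain
    allowed_domains = ['wz.sun0769.com']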
    
    3. parse does three things: extracts each item's detail-page URL and the next-page URL, and pulls out the title and date
        def parse(self, response):
            tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]//tr")
            for tr in tr_list:
                items = SuperspiderItem()
                items['title'] = tr.xpath("./td[3]/a[1]/@title").extract_first()  # extract the title with XPath
                items['date'] = tr.xpath("./td[6]//text()").extract_first()       # extract the date the same way
                content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()     # extract the detail-page link
                # Hand the detail link to the next callback; date and title travel
                # with it so all the data is assembled in one place.
                # About yield: content_href is the URL to request, callback names
                # the function that will process the response.
                yield scrapy.Request(
                    content_href,
                    callback=self.get_content,
                    # meta carries data across to the callback;
                    # it is a dict-like object
                    meta={
                        'date': items['date'],
                        'title': items['title']
                          }
                )
            new_url = response.xpath("//div[contains(@align,'center')]//@href").extract()
            print(new_url[-2])
            if "page="+str(page_num*30) not in new_url[-2]:
       ####---指明爬取的页数---####
                yield scrapy.Request(
                    new_url[-2],
                    callback=self.parse
                )
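    
    An aside (my addition, not in the original post): Scrapy 1.7+ also offers cb_kwargs as a cleaner way to hand values to a callback. A sketch of the same request:
    
    yield scrapy.Request(
        content_href,
        callback=self.get_content,
        cb_kwargs={'date': items['date'], 'title': items['title']},
    )
    # get_content would then receive them as named parameters:
    # def get_content(self, response, date, title): ...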
    
    4. The second callback
      - gathers all the fields and passes them to the pipeline
        def get_content(self, response):
            items = SuperspiderItem()
            items['date'] = response.meta['date']
            items['title'] = response.meta['title']
            items['content'] = response.xpath("//td[@class='txt16_3']/text()").extract_first()
            yield items
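    
    A hedged variant (my addition): if the article body spans several text nodes, extract_first() returns only the first one; joining all matches may be safer, assuming the same td[@class='txt16_3'] structure:
    
    items['content'] = ''.join(
        response.xpath("//td[@class='txt16_3']//text()").extract()
    ).strip()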
    
    5. The pipeline doesn't do much here: no real processing is needed, so it simply prints the data
    class SuperspiderPipeline(object):
        def process_item(self, item, spider):
            print('*'*100)
            print(item['date'])
            print(item['title'])
            print(item['content'])
            return item  # return the item so any later pipelines can process it
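    
    For the pipeline to run at all, it must be enabled in settings.py. A minimal sketch, assuming the project is named superspider (as the spider's import suggests):
    
    # settings.py: the number is the pipeline's order (lower runs earlier)
    ITEM_PIPELINES = {
        'superspider.pipelines.SuperspiderPipeline': 300,
    }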
    

    Complete code

    • The items part
    
    import scrapy
    
    class SuperspiderItem(scrapy.Item):
        title = scrapy.Field()
        date = scrapy.Field()
        content = scrapy.Field()
    
    • Spider code
    # -*- coding: utf-8 -*-
    import scrapy
    from superspider.items import SuperspiderItem
    page_num = 3
    class Spider1Spider(scrapy.Spider):
        name = 'spider1'
        allowed_domains = ['wz.sun0769.com']
        start_urls = ['http://wz.sun0769.com/html/top/report.shtml']
    
        def parse(self, response):
            tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]//tr")
            for tr in tr_list:
                items = SuperspiderItem()
                items['title'] = tr.xpath("./td[3]/a[1]/@title").extract_first()
                items['date'] = tr.xpath("./td[6]//text()").extract_first()
                content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()
                yield scrapy.Request(
                    content_href,
                    callback=self.get_content,
                    meta={
                        'date': items['date'],
                        'title': items['title']
                          }
                )
            new_url = response.xpath("//div[contains(@align,'center')]//@href").extract()
            print(new_url[-2])
            if "page="+str(page_num*30) not in new_url[-2]:
                yield scrapy.Request(
                    new_url[-2],
                    callback=self.parse
                )
    
        def get_content(self, response):
            items = SuperspiderItem()
            items['date'] = response.meta['date']
            items['title'] = response.meta['title']
            items['content'] = response.xpath("//td[@class='txt16_3']/text()").extract_first()
            yield items
    
    • Pipelines code
    class SuperspiderPipeline(object):
        def process_item(self, item, spider):
            print('*'*100)
            print(item['date'])
            print(item['title'])
            print(item['content'])
            return item  # return the item so any later pipelines can process it
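    
    With everything in place, the spider is launched from the project root with the standard Scrapy CLI:
    
    scrapy crawl spider1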
    

    Problems encountered along the way

    • The crawl scope (allowed_domains) was written wrong, and with the log level set to WARNING the problem was invisible and hard to track down (see the settings sketch after this list)
    • The details of how yield works were unclear at first
    • A SuperspiderItem() must be imported and instantiated first (note the parentheses)
    • SuperspiderItem does not need to be imported in pipelines.py
    • Forgot to write extract()
    • XPath: mind the syntax of //div[contains(@align,'center')]
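    
    A debugging tip for the first point (my addition, standard Scrapy settings): at DEBUG level Scrapy logs every filtered request, which immediately exposes a bad allowed_domains entry.
    
    # settings.py
    LOG_LEVEL = 'DEBUG'   # instead of 'WARNING'; look for log lines like
                          # "Filtered offsite request to ..."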
    • Original article: https://www.cnblogs.com/l0nmar/p/12553851.html