• Python Crawler Study Notes (22): Scrapy Framework, Worked Example


    Crawling a novel

    spider

    import scrapy
    from xiaoshuo.items import XiaoshuoItem
    
    
    class XiaoshuoSpiderSpider(scrapy.Spider):
        name = 'xiaoshuo_spider'
        allowed_domains = ['zy200.com']
        url = 'http://www.zy200.com/5/5943/'
        start_urls = [url + '11667352.html']
    
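        def parse(self, response):
            # Text nodes of the chapter body; extract() returns a list of strings.
            info = response.xpath("/html/body/div[@id='content']/text()").extract()
            # Third link in the footer bar, presumably the "next chapter" link.
            href = response.xpath("//div[@class='zfootbar']/a[3]/@href").extract_first()
            xs_item = XiaoshuoItem()
            xs_item['content'] = info
            yield xs_item
    
            # Stop when the link is missing or points back to the index page.
            if href and href != 'index.html':
                new_url = response.urljoin(href)
                yield scrapy.Request(new_url, callback=self.parse)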
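
The spider above keeps following the footer link until it points to 'index.html', which presumably happens on the last chapter. The stopping rule can be sketched outside Scrapy as a small pure function. This is only an illustration: `next_chapter_url` is a hypothetical helper name, and the chapter file name in the usage lines is made up. It also covers the case where no link is found at all, since `extract_first()` returns None then.

```python
from urllib.parse import urljoin

# Assumption: the directory URL and the 'index.html' sentinel are copied from
# the spider; next_chapter_url is a hypothetical helper, not part of Scrapy.
BASE_URL = "http://www.zy200.com/5/5943/"


def next_chapter_url(href, base_url=BASE_URL):
    """Return the absolute URL of the next chapter, or None to stop crawling."""
    # extract_first() yields None when the XPath matches nothing.
    if not href or href == "index.html":
        return None
    # urljoin resolves the relative file name against the chapter directory.
    return urljoin(base_url, href)


# "11667353.html" is an invented file name used only for illustration.
print(next_chapter_url("11667353.html"))
print(next_chapter_url("index.html"))
print(next_chapter_url(None))
```

Inside a spider, `response.urljoin(href)` does the same resolution relative to the current page, so the base URL does not have to be concatenated by hand.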
    

    items

    import scrapy
    
    
    class XiaoshuoItem(scrapy.Item):
        # define the fields for your item here like:
        content = scrapy.Field()
        href = scrapy.Field()
    
    

    pipeline

    class XiaoshuoPipeline(object):
        def __init__(self):
            self.filename = open("dp1.txt", "w", encoding="utf-8")
    
        def process_item(self, item, spider):
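            # "content" is a list of text nodes, so join it before writing;
            # the spider never sets "title", so fall back to an empty string.
            content = item.get("title", "") + "".join(item["content"]) + '\n'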
            self.filename.write(content)
            self.filename.flush()
            return item
    
        def close_spider(self, spider):
            self.filename.close()
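
Scrapy only calls a pipeline that is registered in the project settings. A minimal sketch of that registration follows, assuming the project package is named xiaoshuo (as the spider's import suggests) and that the pipeline class sits in the default xiaoshuo/pipelines.py module; adjust the dotted path if the layout differs.

```python
# settings.py (fragment)
# Assumption: package "xiaoshuo" with the pipeline in xiaoshuo/pipelines.py.
ITEM_PIPELINES = {
    # The integer (conventionally 0-1000) sets the order; lower values run first.
    "xiaoshuo.pipelines.XiaoshuoPipeline": 300,
}
```

With this in place, running `scrapy crawl xiaoshuo_spider` from the project root should write each chapter's text to dp1.txt.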
    
  • Original article: https://www.cnblogs.com/thresh/p/13349394.html