• Outputting and saving Chinese with Scrapy


    1. Decoding Chinese from a JSON file:

    #!/usr/bin/python
    #coding=utf-8
    #author=dahu
    import json
    # json.load decodes the file into Python objects
    with open('huxiu.json','r') as f:
        data=json.load(f)
    print data[0]['title']
    # print every field of the first item as "key":"value",
    for key in data[0]:
        print '"%s":"%s",'%(key,data[0][key])
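
    One detail worth noting (my addition, not in the original): in Python 2, json.load decodes every string to a unicode object, so encode explicitly when you need UTF-8 bytes, e.g. for byte-oriented output:

    #coding=utf-8
    import json
    with open('huxiu.json','r') as f:
        data=json.load(f)
    print type(data[0]['title'])            # <type 'unicode'>: json.load decodes to unicode
    print data[0]['title'].encode('utf-8')  # explicit UTF-8 bytes, e.g. for writing to a file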

    Writing Chinese to JSON:

    #!/usr/bin/python
    #coding=utf-8
    #author=dahu
    import json
    data={
    "desc":"女友不是你想租想租就能租",
    "link":"/article/214877.html",
    "title":"押金8000元,共享女友门槛不低啊"
    }
    with open('tmp.json','w') as f:
        json.dump(data,f,ensure_ascii=False)        # ensure_ascii=False keeps the Chinese unescaped
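
    A pitfall worth a hedged example (my addition): the snippet above works because the dict values are UTF-8 byte strings. If they are unicode objects, as scraped text usually is, json.dump emits unicode chunks when ensure_ascii=False, and writing them to a plain file raises UnicodeEncodeError in Python 2. codecs.open avoids that:

    #!/usr/bin/python
    #coding=utf-8
    import codecs
    import json
    data={"title":u"押金8000元,共享女友门槛不低啊"}   # unicode value this time
    # codecs.open encodes the unicode chunks to UTF-8 on write
    with codecs.open('tmp_u.json','w',encoding='utf-8') as f:
        json.dump(data,f,ensure_ascii=False)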

    2. When Scrapy saves a JSON file, the Chinese easily comes out garbled (as \uXXXX escapes).

    For example:

    $ scrapy crawl huxiu --nolog -o huxiu.json
    $ head huxiu.json
    [
    {"title": "u62bcu91d18000u5143uff0cu5171u4eabu5973u53cbu95e8u69dbu4e0du4f4eu554a", "link": "/article/214877.html", "desc": "u5973u53cbu4e0du662fu4f60u60f3u79dfu60f3u79dfu5c31u80fdu79df"},
    {"title": "u5f20u5634uff0cu817eu8bafu8981u5582u4f60u5403u836fu4e86", "link": "/article/214879.html", "desc": "u201cu8033u65c1u56deu8361u7740Ponyu9a6cu7684u6559u8bf2uff1au597du597du7528u8111u5b50u60f3u60f3uff0cu4e0du5145u94b1uff0cu4f60u4eecu4f1au53d8u5f3au5417uff1fu201d"},

    Combining this with the trick above for writing Chinese to JSON:

    Change settings.py:

    ITEM_PIPELINES = {
       'coolscrapy.pipelines.CoolscrapyPipeline': 300,
    }

    i.e. just uncomment this block.

    Change pipelines.py to the following:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import json
    # import codecs

    class CoolscrapyPipeline(object):
        # def __init__(self):
            # self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

        def process_item(self, item, spider):
            # line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            # self.file.write(line)

            # append each item as one JSON object per line, Chinese unescaped
            with open('data_cn1.json', 'a') as f:
                json.dump(dict(item), f, ensure_ascii=False)
                f.write(',\n')
            return item

    The commented-out lines are an alternative way to write it. The key point is that once the pipeline is enabled in settings, Scrapy calls process_item for every item automatically, so we can save in any format we like.
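
    Since appending ',\n' leaves a file that is not strictly valid JSON (trailing comma, no enclosing brackets), here is a sketch of a cleaner variant (my own, not from the original post): open the file once per crawl via the open_spider/close_spider hooks and emit a proper JSON array. The class and file names are made up for illustration.

    # -*- coding: utf-8 -*-
    import codecs
    import json

    class JsonArrayPipeline(object):        # hypothetical pipeline
        def open_spider(self, spider):
            # codecs.open sidesteps Python 2 unicode/str encode errors
            self.file = codecs.open('data_cn.json', 'w', encoding='utf-8')
            self.first = True

        def process_item(self, item, spider):
            # open the array on the first item, separate the rest with commas
            self.file.write('[\n' if self.first else ',\n')
            self.first = False
            self.file.write(json.dumps(dict(item), ensure_ascii=False))
            return item

        def close_spider(self, spider):
            self.file.write('\n]\n')        # close the array so the file parses as JSON
            self.file.close()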

    Now run this command in the terminal:

    scrapy crawl huxiu --nolog

    If you still add -o file.json, both file.json and the file defined in the pipeline are generated, but the JSON in file.json is still escaped.
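
    (A side note of mine, not from the original post: Scrapy 1.2 and later expose a feed-export setting that fixes the -o output directly, so no custom pipeline is needed for this particular problem:)

    # settings.py -- assuming Scrapy >= 1.2
    FEED_EXPORT_ENCODING = 'utf-8'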

    3. Going further

    From the analysis above we can draw another conclusion: ITEM_PIPELINES in settings controls which pipelines run. So what happens if we enable a few more?

    ITEM_PIPELINES = {
       'coolscrapy.pipelines.CoolscrapyPipeline': 300,
       'coolscrapy.pipelines.CoolscrapyPipeline1': 300,
    }
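
    One detail worth flagging (my note, not the original's): the integers are priorities that set the order the pipelines run in, lower values first, conventionally in the 0-1000 range. Two identical values leave the order ambiguous, so distinct values make it explicit:

    ITEM_PIPELINES = {
       'coolscrapy.pipelines.CoolscrapyPipeline': 300,
       'coolscrapy.pipelines.CoolscrapyPipeline1': 301,
    }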

    pipelines.py becomes:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import json
    # import codecs

    class CoolscrapyPipeline(object):
        # def __init__(self):
            # self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

        def process_item(self, item, spider):
            # line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            # self.file.write(line)

            with open('data_cn1.json', 'a') as f:
                json.dump(dict(item), f, ensure_ascii=False)
                f.write(',\n')
            return item

    class CoolscrapyPipeline1(object):

        def process_item(self, item, spider):
            # same idea, but a second output file with a ",hehe" marker
            with open('data_cn2.json', 'a') as f:
                json.dump(dict(item), f, ensure_ascii=False)
                f.write(',hehe\n')
            return item

    Run it:

    $ scrapy crawl huxiu --nolog
    $ head -n 2 data_cn*
    ==> data_cn1.json <==
    {"title": "押金8000元,共享女友门槛不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},
    {"title": "张嘴,腾讯要喂你吃药了", "link": "/article/214879.html", "desc": "“耳旁回荡着Pony马的教诲:好好用脑子想想,不充钱,你们会变强吗?”"},
    
    ==> data_cn2.json <==
    {"title": "押金8000元,共享女友门槛不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},hehe
    {"title": "张嘴,腾讯要喂你吃药了", "link": "/article/214879.html", "desc": "“耳旁回荡着Pony马的教诲:好好用脑子想想,不充钱,你们会变强吗?”"},hehe

    As you can see, both files are generated, and in exactly the format we wanted!
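
    Since the comma-terminated lines are not valid JSON on their own, reading the file back takes a small wrapper. A sketch (my own, assuming the data_cn1.json produced above), reusing the decoding trick from section 1:

    #coding=utf-8
    import json
    with open('data_cn1.json') as f:
        raw = f.read().rstrip().rstrip(',')    # drop the trailing newline and comma
    data = json.loads('[' + raw + ']')         # wrap the lines into a JSON array
    print data[0]['title']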

  • Original post: https://www.cnblogs.com/dahu-daqing/p/7528642.html