• Inserting data into Elasticsearch from Python


    While writing a crawler with Scrapy, I needed to store the scraped data in Elasticsearch (ES). I found two approaches online, followed the examples, and got both working; notes below:

    First, install ES; I used the fairly old version 5.6.1.

    Then use pip to install the ES client packages that match that ES version:

    pip install elasticsearch-dsl==5.1.0
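
    As a quick sanity check that the client can reach the server and that the versions line up, something like this can be run (assuming ES is listening on 192.168.52.138:9200, the host used in the pipelines below):

    from elasticsearch import Elasticsearch

    # connect to the ES node used throughout this post
    es = Elasticsearch(['192.168.52.138:9200'])
    # es.info() returns cluster metadata; the reported version should be 5.6.1 here
    print(es.info()['version']['number'])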

    Method 1:

    Below is the complete code of the pipelines.py module.

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import chardet
    
    class SinafinancespiderPipeline(object):
        def process_item(self, item, spider):
            return item
    
    
    # Writes items to ES; the ExchangeratespiderESPipeline class below must be enabled in settings
    # Requires: pip install elasticsearch-dsl==5.1.0 (note that it must match the ES server version)
    from elasticsearch_dsl import Date, Nested, Boolean, analyzer, Completion, Keyword, Text, Integer, DocType
    from elasticsearch_dsl.connections import connections
    connections.create_connection(hosts=['192.168.52.138'])
    from elasticsearch import Elasticsearch
    es = Elasticsearch()
    
    class AticleType(DocType):
        page_from = Keyword()
        # the domain field raised an error at one point
        domain = Keyword()
        cra_url = Keyword()
        spider = Keyword()
        cra_time = Keyword()
        page_release_time = Keyword()
        page_title = Text(analyzer="ik_max_word")
        page_content = Text(analyzer="ik_max_word")

        class Meta:
            index = "scrapy"
            doc_type = "sinafinance"
            # Neither the settings nor the mappings below took effect; noted here for the record
            settings = {
                "number_of_shards": 3,
            }
            mappings = {
                '_id': {'path': 'cra_url'}
            }
    
    
    class ExchangeratespiderESPipeline(DocType):
        from elasticsearch5 import  Elasticsearch
        ES = ['192.168.52.138:9200']
        es = Elasticsearch(ES,sniff_on_start=True)
    
        def process_item(self, item, spider):
    
            spider.logger.info("-----enter into insert ES")
            article = AticleType()
    
            article.page_from=item['page_from']
            article.domain=item['domain']
            article.cra_url =item['cra_url']
            article.spider =item['spider']
            article.cra_time =item['cra_time']
            article.page_release_time =item['page_release_time']
            article.page_title =item['page_title']
            article.page_content =item['page_content']
    
            article.save()
            return item
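
    One possible reason the Meta settings and mappings above had no effect: elasticsearch-dsl only pushes the field mappings (including the ik_max_word analyzers) to the server when the index is created through the DocType, typically by calling init() once before the first save(); the extra settings/mappings entries in Meta may simply be ignored by this version. A minimal sketch under that assumption, not part of the original pipeline:

    from elasticsearch_dsl.connections import connections

    connections.create_connection(hosts=['192.168.52.138'])
    # creates the "scrapy" index with the field mappings declared on AticleType;
    # run once (e.g. at spider start-up) before the first article.save()
    AticleType.init()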

    This approach does write data into ES, but re-crawling inserts duplicate documents, because the primary key "_id" is generated by ES itself and I could not find a way to supply a custom _id through this interface. So I abandoned it.
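
    For what it's worth, elasticsearch-dsl does appear to allow setting a document's _id explicitly through its meta attribute before calling save(), which would make Method 1 idempotent as well; a hedged sketch inside process_item (untested against this setup):

    article = AticleType()
    # ... assign the fields from item as above ...
    # a document saved with an explicit meta.id is indexed under that _id,
    # so re-crawling the same URL overwrites instead of duplicating
    article.meta.id = item['cra_url']
    article.save()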

    Method 2: write with a custom primary key, so that repeated inserts overwrite the existing document

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    from elasticsearch5 import Elasticsearch
    
    class SinafinancespiderPipeline(object):
        def process_item(self, item, spider):
            return item
    
    
    # Writes items to ES; the SinafinancespiderESPipeline class must be enabled in settings
    # Requires: pip install elasticsearch-dsl==5.1.0 (the version must match the ES server version)
    class SinafinancespiderESPipeline():
        def __init__(self):
            self.ES = ['192.168.52.138:9200']
            # create the ES client
            self.es = Elasticsearch(
                self.ES,
                # sniff the cluster nodes before starting
                sniff_on_start=True,
                # refresh the node list when a connection to a node fails
                sniff_on_connection_fail=True,
                # refresh node info every 60 seconds
                sniffer_timeout=60
            )
    
    
        def process_item(self, item, spider):
            spider.logger.info("-----enter into insert ES")
            doc = {
                'page_from': item['page_from'],
                'domain': item['domain'],
                'spider': item['spider'],
                'page_release_time': item['page_release_time'],
                'page_title': item['page_title'],
                'page_content': item['page_content'],
                'cra_url': item['cra_url'],
                'cra_time': item['cra_time']
            }
            self.es.index(index='scrapy', doc_type='sinafinance', body=doc, id=item['cra_url'])
    
            return item
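
    Because the document _id is now the crawled URL, indexing the same page again simply replaces the existing document instead of adding a new one (its _version is bumped on every overwrite). A quick check, using a hypothetical URL:

    from elasticsearch5 import Elasticsearch

    es = Elasticsearch(['192.168.52.138:9200'])
    # fetch a document back by its _id; the URL here is only an example,
    # and _version increases by one every time the same URL is re-indexed
    doc = es.get(index='scrapy', doc_type='sinafinance', id='http://finance.sina.com.cn/example-page')
    print(doc['_version'], doc['_source']['page_title'])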

    How to search the data:

    # build the query body as a dict
    query = {
        'query': {
            'bool': {
                'must': [{'match': {'_all': 'python web'}}],
                'filter': [{'term': {'status': 2}}]
            }
        }
    }

    # run the search
    data = es.search(index='articles', doc_type='article', body=query)
    print(data)
    # insert
    es.index(...)
    # update
    es.update(...)
    # delete
    es.delete()
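
    Each hit in the response carries its _id, its relevance _score, and the stored document under _source; a small sketch of walking through the results returned above:

    # iterate over the matched documents in the search response
    for hit in data['hits']['hits']:
        print(hit['_id'], hit['_score'], hit['_source'])

    # total number of matching documents (a plain int in ES 5.x)
    print('total hits:', data['hits']['total'])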

    Finally, register the custom pipeline class in the settings.py module:

    ITEM_PIPELINES = {
       # 'sinafinancespider.pipelines.SinafinancespiderPipeline': 300,
       'sinafinancespider.pipelines.SinafinancespiderESPipeline': 300,
    }