• Inserting data into Elasticsearch from Python


    While writing a crawler with Scrapy, I needed to store the scraped data in Elasticsearch (ES). I found two approaches online, followed the examples, and got both working; notes below:

    First, install ES; I used the fairly old version 5.6.1.

    Then use pip to install the ES client packages that match that ES version:

    pip install elasticsearch-dsl==5.1.0
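
    As a quick sanity check that the client can reach the server and that the versions line up, something like this can be run (assuming ES is listening on 192.168.52.138:9200, the host used in the pipelines below):

    from elasticsearch import Elasticsearch

    # connect to the ES node used throughout this post
    es = Elasticsearch(['192.168.52.138:9200'])
    # es.info() returns cluster metadata; the reported version should be 5.6.1 here
    print(es.info()['version']['number'])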

    Method 1:

    Below is the complete code of the pipelines.py module.

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import chardet
    
    class SinafinancespiderPipeline(object):
        def process_item(self, item, spider):
            return item
    
    
    # Writes items to ES; the ExchangeratespiderESPipeline class below must be enabled in settings
    # Requires: pip install elasticsearch-dsl==5.1.0 (note that it must match the ES server version)
    from elasticsearch_dsl import Date, Nested, Boolean, analyzer, Completion, Keyword, Text, Integer, DocType
    from elasticsearch_dsl.connections import connections
    connections.create_connection(hosts=['192.168.52.138'])
    from elasticsearch import Elasticsearch
    es = Elasticsearch()
    
    class AticleType(DocType):
        page_from = Keyword()
        # the domain field raised an error at one point
        domain = Keyword()
        cra_url = Keyword()
        spider = Keyword()
        cra_time = Keyword()
        page_release_time = Keyword()
        page_title = Text(analyzer="ik_max_word")
        page_content = Text(analyzer="ik_max_word")

        class Meta:
            index = "scrapy"
            doc_type = "sinafinance"
            # Neither the settings nor the mappings below took effect; noted here for the record
            settings = {
                "number_of_shards": 3,
            }
            mappings = {
                '_id': {'path': 'cra_url'}
            }
    
    
    class ExchangeratespiderESPipeline(DocType):
        from elasticsearch5 import  Elasticsearch
        ES = ['192.168.52.138:9200']
        es = Elasticsearch(ES,sniff_on_start=True)
    
        def process_item(self, item, spider):
    
            spider.logger.info("-----enter into insert ES")
            article = AticleType()
    
            article.page_from=item['page_from']
            article.domain=item['domain']
            article.cra_url =item['cra_url']
            article.spider =item['spider']
            article.cra_time =item['cra_time']
            article.page_release_time =item['page_release_time']
            article.page_title =item['page_title']
            article.page_content =item['page_content']
    
            article.save()
            return item
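
    One possible reason the Meta settings and mappings above had no effect: elasticsearch-dsl only pushes the field mappings (including the ik_max_word analyzers) to the server when the index is created through the DocType, typically by calling init() once before the first save(); the extra settings/mappings entries in Meta may simply be ignored by this version. A minimal sketch under that assumption, not part of the original pipeline:

    from elasticsearch_dsl.connections import connections

    connections.create_connection(hosts=['192.168.52.138'])
    # creates the "scrapy" index with the field mappings declared on AticleType;
    # run once (e.g. at spider start-up) before the first article.save()
    AticleType.init()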

    This approach does write data into ES, but re-crawling inserts duplicate documents, because the primary key "_id" is generated by ES itself and I could not find a way to supply a custom _id through this interface. So I abandoned it.
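
    For what it's worth, elasticsearch-dsl does appear to allow setting a document's _id explicitly through its meta attribute before calling save(), which would make Method 1 idempotent as well; a hedged sketch inside process_item (untested against this setup):

    article = AticleType()
    # ... assign the fields from item as above ...
    # a document saved with an explicit meta.id is indexed under that _id,
    # so re-crawling the same URL overwrites instead of duplicating
    article.meta.id = item['cra_url']
    article.save()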

    Method 2: write with a custom primary key, so that repeated inserts overwrite the existing document

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    from elasticsearch5 import Elasticsearch
    
    class SinafinancespiderPipeline(object):
        def process_item(self, item, spider):
            return item
    
    
    # Writes items to ES; the SinafinancespiderESPipeline class must be enabled in settings
    # Requires: pip install elasticsearch-dsl==5.1.0 (the version must match the ES server version)
    class SinafinancespiderESPipeline():
        def __init__(self):
            self.ES = ['192.168.52.138:9200']
            # create the ES client
            self.es = Elasticsearch(
                self.ES,
                # sniff the cluster nodes before starting
                sniff_on_start=True,
                # refresh the node list when a connection to a node fails
                sniff_on_connection_fail=True,
                # refresh node info every 60 seconds
                sniffer_timeout=60
            )
    
    
        def process_item(self, item, spider):
            spider.logger.info("-----enter into insert ES")
            doc = {
                'page_from': item['page_from'],
                'domain': item['domain'],
                'spider': item['spider'],
                'page_release_time': item['page_release_time'],
                'page_title': item['page_title'],
                'page_content': item['page_content'],
                'cra_url': item['cra_url'],
                'cra_time': item['cra_time']
            }
            self.es.index(index='scrapy', doc_type='sinafinance', body=doc, id=item['cra_url'])
    
            return item
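
    Because the document _id is now the crawled URL, indexing the same page again simply replaces the existing document instead of adding a new one (its _version is bumped on every overwrite). A quick check, using a hypothetical URL:

    from elasticsearch5 import Elasticsearch

    es = Elasticsearch(['192.168.52.138:9200'])
    # fetch a document back by its _id; the URL here is only an example,
    # and _version increases by one every time the same URL is re-indexed
    doc = es.get(index='scrapy', doc_type='sinafinance', id='http://finance.sina.com.cn/example-page')
    print(doc['_version'], doc['_source']['page_title'])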

    How to search the data:

    # build the query body as a dict
    query = {
        'query': {
            'bool': {
                'must': [{'match': {'_all': 'python web'}}],
                'filter': [{'term': {'status': 2}}]
            }
        }
    }

    # run the search
    data = es.search(index='articles', doc_type='article', body=query)
    print(data)
    # insert
    es.index(...)
    # update
    es.update(...)
    # delete
    es.delete()
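
    Each hit in the response carries its _id, its relevance _score, and the stored document under _source; a small sketch of walking through the results returned above:

    # iterate over the matched documents in the search response
    for hit in data['hits']['hits']:
        print(hit['_id'], hit['_score'], hit['_source'])

    # total number of matching documents (a plain int in ES 5.x)
    print('total hits:', data['hits']['total'])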

    Finally, register the custom pipeline class in the settings.py module:

    ITEM_PIPELINES = {
       # 'sinafinancespider.pipelines.SinafinancespiderPipeline': 300,
       'sinafinancespider.pipelines.SinafinancespiderESPipeline': 300,
    }