• Fetching Elasticsearch content with Python


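    Using the documents added to Elasticsearch in the previous post as an example, this post walks through fetching that content and extracting the useful information from it.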

    First, let's look at what is stored in Elasticsearch:

    {
      "took": 88,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "2",
            "_score": 1,
            "_source": {
              "first_name": "Jane",
              "last_name": "Smith",
              "age": 32,
              "about": "I like to collect rock albums",
              "interests": [
                "music"
              ]
            }
          },
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "1",
            "_score": 1,
            "_source": {
              "first_name": "John",
              "last_name": "Smith",
              "age": 25,
              "about": "I love to go rock climbing",
              "interests": [
                "sports",
                "music"
              ]
            }
          },
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "3",
            "_score": 1,
            "_source": {
              "first_name": "Douglas",
              "last_name": "Fir",
              "age": 35,
              "about": "I like to build cabinets",
              "interests": [
                "forestry"
              ]
            }
          }
        ]
      }
    }

    1. In Python, we first need the urllib package to send the request, and the json module because the response is JSON.

    import urllib.request as request
    import json

    2. Next, build a request for the search URL and open it with urlopen:

    if __name__ == '__main__':
        req = request.Request("http://localhost:9200/megacorp/employee/_search")
        resp = request.urlopen(req)

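    3. Read the resulting resp line by line and parse it as JSON. (Note that json.loads is given a string here: each line arrives as bytes, so it is decoded first.)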

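        # Continues inside the if __name__ == '__main__': block from step 2
        jsonstr = ""
        for line in resp:
            # Each line is bytes; decode it to str before appending
            jsonstr += line.decode()
        data = json.loads(jsonstr)
        print(data)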
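Steps 2 and 3 can also be folded into one small helper. The following is only a sketch: the names `read_json` and `fetch_json` and the `timeout` value are assumptions of this example, not part of the original code. It reads the whole body at once instead of line by line, and closes the response with a `with` block.

```python
import json
import urllib.request as request


def read_json(resp):
    """Read the whole body of a file-like response (bytes) and parse it as JSON."""
    return json.loads(resp.read().decode("utf-8"))


def fetch_json(url, timeout=10):
    """Open the URL, parse the JSON body, and close the connection afterwards."""
    with request.urlopen(url, timeout=timeout) as resp:
        return read_json(resp)
```

With a local Elasticsearch running, `fetch_json("http://localhost:9200/megacorp/employee/_search")` should return the same dictionary as `data` in step 3.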

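    4. The result contains metadata as well as the documents themselves. To get only the document contents, we drill down through the structure level by level: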

        # The matching documents are under hits -> hits
        employees = data['hits']['hits']

        for e in employees:
            # _source holds the original document fields
            _source = e['_source']
            full_name = _source['first_name'] + "." + _source['last_name']
            age = _source["age"]
            about = _source["about"]
            interests = _source["interests"]
            print(full_name, 'is', age, ",")
            print(full_name, "info is", about)
            print(full_name, 'likes', interests)

    The output is:

    Jane.Smith is 32 ,
    Jane.Smith info is I like to collect rock albums
    Jane.Smith likes ['music']
    
    John.Smith is 25 ,
    John.Smith info is I love to go rock climbing
    John.Smith likes ['sports', 'music']
    
    Douglas.Fir is 35 ,
    Douglas.Fir info is I like to build cabinets
    Douglas.Fir likes ['forestry']
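The loop in step 4 raises a KeyError if a document lacks one of the fields. A more forgiving variant could look like the sketch below. The function name `extract_employees` and the shortened `sample` response are made up for illustration; the second sample document is deliberately incomplete.

```python
def extract_employees(data):
    """Return one plain dict per hit, using defaults for missing fields."""
    rows = []
    for hit in data.get("hits", {}).get("hits", []):
        src = hit.get("_source", {})
        rows.append({
            "full_name": src.get("first_name", "") + "." + src.get("last_name", ""),
            "age": src.get("age"),
            "about": src.get("about", ""),
            "interests": src.get("interests", []),
        })
    return rows


# Hypothetical, shortened response in the same shape as the one shown earlier
sample = {
    "hits": {
        "hits": [
            {"_source": {"first_name": "Jane", "last_name": "Smith", "age": 32,
                         "about": "I like to collect rock albums",
                         "interests": ["music"]}},
            {"_source": {"first_name": "Douglas"}},  # incomplete on purpose
        ]
    }
}

for row in extract_employees(sample):
    print(row["full_name"], row["age"], row["interests"])
```

Returning dictionaries instead of printing inside the loop keeps the parsing separate from the output, so the rows can be reused elsewhere.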

    For content that needs to be aggregated, we can fetch it as follows:

    1. Set the URL

    url="http://localhost:9200/megacorp/employee/_search"

    2. Write the aggregation query

    data = '''
    {
        "aggs": {
            "all_interests": {
                "terms": { "field": "interests" },
                "aggs": {
                    "avg_age": {
                        "avg": { "field": "age" }
                    }
                }
            }
        }
    }
    '''
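A hand-written JSON string is easy to break with a stray comma or quote. One alternative, sketched here with the assumed variable names `query` and `body`, is to build the same aggregation as a Python dict and let `json.dumps` serialize it:

```python
import json

# Same aggregation as above: group by "interests", then average "age" per group
query = {
    "aggs": {
        "all_interests": {
            "terms": {"field": "interests"},
            "aggs": {
                "avg_age": {"avg": {"field": "age"}}
            },
        }
    }
}

# urllib expects the request body as bytes
body = json.dumps(query).encode("utf-8")
print(body.decode("utf-8"))
```

The resulting `body` can be passed as the `data` argument of `request.Request` in place of `data.encode()`.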

    3. Set the request headers

    headers={"Content-Type":"application/json"}

    4. As before, send the request, read the response, and parse it as JSON

    # With method="GET" set explicitly, urllib sends the body with a GET request
    req = request.Request(url=url, data=data.encode(), headers=headers, method="GET")
    resp = request.urlopen(req)
    jsonstr = ""
    for line in resp:
        jsonstr += line.decode()
    rsdata = json.loads(jsonstr)
    5. The useful aggregation results are again held in an array, so we iterate over them to print each one:

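    agg = rsdata['aggregations']
    buckets = agg['all_interests']['buckets']

    for b in buckets:
        key = b['key']
        doc_count = b['doc_count']
        avg_age = b['avg_age']['value']
        # The labels are pinyin, roughly: "interest <key>: <n> people in total, their average age is <avg>"
        # print must be inside the loop so that every bucket is printed
        print('aihao', key, 'gongyou', doc_count, 'ren,tamenpingjuageshi', avg_age)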

    The final output:

    aihao music gongyou 2 ren,tamenpingjuageshi 28.5
    
    aihao forestry gongyou 1 ren,tamenpingjuageshi 35.0
    
    aihao sports gongyou 1 ren,tamenpingjuageshi 25.0
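The bucket loop can be tried without a running cluster by feeding it a response of the same shape. In this sketch the function name `summarize_buckets` and the `sample` dictionary are assumptions; the sample is reconstructed from the output above and is not a captured Elasticsearch response.

```python
def summarize_buckets(rsdata, agg_name="all_interests"):
    """Return (key, doc_count, avg_age) tuples for each bucket of the aggregation."""
    buckets = rsdata.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [(b["key"], b["doc_count"], b["avg_age"]["value"]) for b in buckets]


# Hypothetical aggregation response, reduced to the fields the loop reads
sample = {
    "aggregations": {
        "all_interests": {
            "buckets": [
                {"key": "music", "doc_count": 2, "avg_age": {"value": 28.5}},
                {"key": "forestry", "doc_count": 1, "avg_age": {"value": 35.0}},
                {"key": "sports", "doc_count": 1, "avg_age": {"value": 25.0}},
            ]
        }
    }
}

for key, doc_count, avg_age in summarize_buckets(sample):
    print(f"{key}: {doc_count} employee(s), average age {avg_age}")
```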
  • Original article: https://www.cnblogs.com/qianshuixianyu/p/9287556.html