• Elasticsearch Hello World (Part 2)


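        First, let's test tokenization, Chinese word segmentation in particular, which has long been a pain point for traditional databases such as MySQL and SQL Server.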

        Open a browser, go to Kibana at http://localhost:5601, click Dev Tools, and enter the following in the Console pane:

    POST _analyze
    {
      "analyzer": "standard",
      "text":"Hello World ElasticSearch"
    }

        The response appears in the pane on the right:

    {
      "tokens": [
        {
          "token": "hello",
          "start_offset": 0,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "world",
          "start_offset": 6,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "elasticsearch",
          "start_offset": 12,
          "end_offset": 25,
          "type": "<ALPHANUM>",
          "position": 2
        }
      ]
    }
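
        Outside of Kibana, the same request can be sent straight to Elasticsearch over HTTP. The following is only a minimal sketch: it assumes a node listening on http://localhost:9200 without authentication (the article itself only mentions Kibana on port 5601), and the function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on its default HTTP port without authentication.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build the POST request that mirrors the Console call to _analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response):
    """Reduce a parsed _analyze response to the list of token strings."""
    return [entry["token"] for entry in response["tokens"]]


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return the token strings (needs a running node)."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as reply:
        return token_strings(json.load(reply))
```

        Against a running node, analyze("standard", "Hello World ElasticSearch") should come back with the same three tokens as the Console response above.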

        So far everything looks fine. Now let's add some Chinese.

    POST _analyze
    {
      "analyzer": "standard",
      "text":"ElasticSearch是一个很不错的全文检索软件。"
    }

        The result is:

    {
      "tokens": [
        {
          "token": "elasticsearch",
          "start_offset": 0,
          "end_offset": 13,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "是",
          "start_offset": 13,
          "end_offset": 14,
          "type": "<IDEOGRAPHIC>",
          "position": 1
        },
        {
          "token": "一",
          "start_offset": 14,
          "end_offset": 15,
          "type": "<IDEOGRAPHIC>",
          "position": 2
        },
        {
          "token": "个",
          "start_offset": 15,
          "end_offset": 16,
          "type": "<IDEOGRAPHIC>",
          "position": 3
        },
        {
          "token": "很",
          "start_offset": 16,
          "end_offset": 17,
          "type": "<IDEOGRAPHIC>",
          "position": 4
        },
        {
          "token": "不",
          "start_offset": 17,
          "end_offset": 18,
          "type": "<IDEOGRAPHIC>",
          "position": 5
        },
        {
          "token": "错",
          "start_offset": 18,
          "end_offset": 19,
          "type": "<IDEOGRAPHIC>",
          "position": 6
        },
        {
          "token": "的",
          "start_offset": 19,
          "end_offset": 20,
          "type": "<IDEOGRAPHIC>",
          "position": 7
        },
        {
          "token": "全",
          "start_offset": 20,
          "end_offset": 21,
          "type": "<IDEOGRAPHIC>",
          "position": 8
        },
        {
          "token": "文",
          "start_offset": 21,
          "end_offset": 22,
          "type": "<IDEOGRAPHIC>",
          "position": 9
        },
        {
          "token": "检",
          "start_offset": 22,
          "end_offset": 23,
          "type": "<IDEOGRAPHIC>",
          "position": 10
        },
        {
          "token": "索",
          "start_offset": 23,
          "end_offset": 24,
          "type": "<IDEOGRAPHIC>",
          "position": 11
        },
        {
          "token": "软",
          "start_offset": 24,
          "end_offset": 25,
          "type": "<IDEOGRAPHIC>",
          "position": 12
        },
        {
          "token": "件",
          "start_offset": 25,
          "end_offset": 26,
          "type": "<IDEOGRAPHIC>",
          "position": 13
        }
      ]
    }
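
        To make the one-token-per-character behavior concrete, here is a rough imitation in Python. It is not how the standard analyzer is implemented (the real one follows Unicode text segmentation rules); the regular expression is a simplifying assumption that only handles ASCII words and the basic CJK ideograph block, which happens to be enough for this sentence.

```python
import re

# Simplified imitation only: runs of ASCII letters/digits become one token,
# while every CJK ideograph becomes a token of its own. Punctuation such as
# the full-width period matches neither branch and is dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def naive_standard_tokens(text):
    """Return (token, start_offset, end_offset) triples, lowercased."""
    return [
        (match.group().lower(), match.start(), match.end())
        for match in TOKEN_PATTERN.finditer(text)
    ]


for token in naive_standard_tokens("ElasticSearch是一个很不错的全文检索软件。"):
    print(token)
```

        The sketch yields one token for "elasticsearch" and thirteen single-character tokens, with the same offsets as the response above, so a query for a word like 检索 can only be matched as two separate characters.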

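        This is clearly unacceptable: every Chinese character is split into its own token, which makes the result practically unusable for search. A quick Google search suggested there is also an analyzer option called chinese, but in my test it produced the same result. Further searching showed that people generally use either the smartcn plugin or the IK Analyzer plugin here. Some articles and books recommend IK Analyzer, but those materials are all based on old versions of ES, and when I looked at IK Analyzer's GitHub repository it appeared to be abandoned. So I went with smartcn, the officially recommended option. Downloading and installing it works the same way as for any other plugin, and I again recommend installing from the offline package. After installation, the ES service needs to be restarted for the plugin to take effect. Now let's try again: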
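
        The installation step might be scripted roughly as follows. This is a sketch under assumptions: it targets the elasticsearch-plugin tool shipped with ES 5.x and later on a Unix-like system, the plugin's official name is analysis-smartcn, and es_home and offline_zip are hypothetical local paths that do not come from the article.

```python
import subprocess
from pathlib import Path


def plugin_install_command(es_home, offline_zip=None):
    """Build the elasticsearch-plugin command for analysis-smartcn.

    es_home and offline_zip are hypothetical local paths. offline_zip, when
    given, must be an absolute path to a plugin zip built for the node's
    exact Elasticsearch version; it is passed to the tool as a file:// URL.
    """
    tool = str(Path(es_home) / "bin" / "elasticsearch-plugin")
    source = "analysis-smartcn" if offline_zip is None else Path(offline_zip).as_uri()
    return [tool, "install", source]


def install_plugin(es_home, offline_zip=None):
    """Run the installer; the node still has to be restarted afterwards."""
    subprocess.run(plugin_install_command(es_home, offline_zip), check=True)
```

        Without offline_zip the tool downloads the plugin itself; with it, the offline package is used instead. Once the node has been restarted, the smartcn request below can be run.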

    POST _analyze
    {
      "analyzer": "smartcn",
      "text":"ElasticSearch是一个很不错的全文检索软件。"
    }

        The result is:

    {
      "tokens": [
        {
          "token": "elasticsearch",
          "start_offset": 0,
          "end_offset": 13,
          "type": "word",
          "position": 0
        },
        {
          "token": "是",
          "start_offset": 13,
          "end_offset": 14,
          "type": "word",
          "position": 1
        },
        {
          "token": "一个",
          "start_offset": 14,
          "end_offset": 16,
          "type": "word",
          "position": 2
        },
        {
          "token": "很",
          "start_offset": 16,
          "end_offset": 17,
          "type": "word",
          "position": 3
        },
        {
          "token": "不错",
          "start_offset": 17,
          "end_offset": 19,
          "type": "word",
          "position": 4
        },
        {
          "token": "的",
          "start_offset": 19,
          "end_offset": 20,
          "type": "word",
          "position": 5
        },
        {
          "token": "全文",
          "start_offset": 20,
          "end_offset": 22,
          "type": "word",
          "position": 6
        },
        {
          "token": "检索",
          "start_offset": 22,
          "end_offset": 24,
          "type": "word",
          "position": 7
        },
        {
          "token": "软件",
          "start_offset": 24,
          "end_offset": 26,
          "type": "word",
          "position": 8
        }
      ]
    }

        This looks much more harmonious. :)

  • Original article: https://www.cnblogs.com/elasticsearch/p/6478799.html