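Elasticsearch Basics: Retrieving Multiple Documents with _mget and Batch Operations with _bulk

Retrieving Multiple Documents

Elasticsearch is already fast, but it can be faster still. Combining multiple requests into one avoids the network latency and overhead of handling each request individually. If you need to retrieve many documents from Elasticsearch, using the multi-get (mget) API to place all of the retrievals in a single request is faster than requesting the documents one by one.

The mget API expects a docs array, each element of which contains the metadata of a document to retrieve: its _index, _type, and _id. To retrieve only one or more specific fields, name them in the _source parameter: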

GET /_mget
{
   "docs" : [
      {
         "_index" : "website",
         "_type" :  "blog",
         "_id" :    2
      },
      {
         "_index" : "website",
         "_type" :  "pageviews",
         "_id" :    1,
         "_source": "views"
      }
   ]
}

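The response body also contains a docs array with one response per requested document, in the same order as in the request. Each of these responses is the same as the response body that a single get request would return: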

   {
       "docs" : [
          {
             "_index" :   "website",
             "_id" :      "2",
             "_type" :    "blog",
             "found" :    true,
             "_source" : {
                "text" : "This is a piece of cake...",
                "title" : "My first external blog entry"
             },
             "_version" : 10
          },
          {
             "_index" :   "website",
             "_id" :      "1",
             "_type" :    "pageviews",
             "found" :    true,
             "_version" : 2,
             "_source" : {
                "views" : 2
             }
          }
       ]
   }
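
Because the response preserves the order of the request, each result can be paired with the entry that asked for it by position. The following is a minimal Python sketch of that idea, assuming the request and response bodies are already available as dicts. The helper names build_mget_body and pair_mget_results are hypothetical, and no HTTP client is involved:

```python
import json

def build_mget_body(specs):
    """Build an mget request body from (index, type, id) tuples."""
    return {"docs": [{"_index": i, "_type": t, "_id": d} for i, t, d in specs]}

def pair_mget_results(request_body, response_body):
    """Pair each requested entry with its result, relying on matching order."""
    requested = request_body["docs"]
    returned = response_body["docs"]
    if len(requested) != len(returned):
        raise ValueError("mget response length does not match the request")
    return list(zip(requested, returned))

request = build_mget_body([("website", "blog", 2), ("website", "pageviews", 1)])
response = {  # trimmed copy of the response shown above
    "docs": [
        {"_index": "website", "_type": "blog", "_id": "2", "found": True},
        {"_index": "website", "_type": "pageviews", "_id": "1", "found": True},
    ]
}
pairs = pair_mget_results(request, response)
print(json.dumps(request))
```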

Elasticsearch reindex error: the final mapping would have more than 1 type

Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type. Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x. Mapping types will be completely removed in Elasticsearch 7.0.0. The examples in this article that use two types (blog and pageviews) in the website index therefore assume an index created before 6.0.

https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html#_index_per_document_type

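If the documents you want to retrieve are all in the same _index (or even the same _type), you can specify a default /_index or /_index/_type in the URL.

You can still override these values in individual document entries: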

GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}

In fact, if all the documents have the same _index and _type, you can pass just an ids array instead of the full docs array:

GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

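Note that the second document we requested does not exist. We specified the type blog, but the document with ID 1 has the type pageviews. This is reported in the response body: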

   {
      "docs" : [
        {
          "_index" :   "website",
          "_type" :    "blog",
          "_id" :      "2",
          "_version" : 10,
          "found" :    true,
          "_source" : {
            "title":   "My first external blog entry",
            "text":    "This is a piece of cake..."
          }
        },
        {
          "_index" :   "website",
          "_type" :    "blog",
          "_id" :      "1",
          "found" :    false
        }
      ]
   }

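The second document was not found.

The fact that the second document was not found does not prevent the first from being retrieved. Each document is retrieved and reported separately.

Even though one document was not found, the HTTP status code of the request above is still 200. In fact, it would still be 200 if none of the requested documents had been found, because the mget request itself completed successfully. To determine whether an individual document lookup succeeded, check its found flag.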
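
Since the HTTP status code says nothing about individual documents, client code has to inspect each entry. A small Python sketch of that check, assuming the response has already been parsed into a dict (the function name split_by_found is hypothetical):

```python
def split_by_found(mget_response):
    """Return (found_docs, missing_ids) from a parsed mget response."""
    found, missing = [], []
    for doc in mget_response.get("docs", []):
        # Only the per-document "found" flag tells us whether the lookup hit.
        if doc.get("found"):
            found.append(doc)
        else:
            missing.append(doc.get("_id"))
    return found, missing

sample = {  # same shape as the response shown above
    "docs": [
        {"_index": "website", "_type": "blog", "_id": "2", "_version": 10,
         "found": True,
         "_source": {"title": "My first external blog entry",
                     "text": "This is a piece of cake..."}},
        {"_index": "website", "_type": "blog", "_id": "1", "found": False},
    ]
}
found_docs, missing_ids = split_by_found(sample)
print(missing_ids)  # ['1']
```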

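_source Filtering

By default the _source field returns the full document content, but you can filter it with the _source, _source_include, and _source_exclude parameters (later Elasticsearch versions rename the last two to _source_includes and _source_excludes).
For example: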

POST _bulk
{ "create":  { "_index": "website", "_type": "blog", "_id": "1" }}
{ "text" :  "This is a piece of cake1", "title" : "My first external blog entry1","username.lastname":"lastname1","username.firstname":"firstname1"}
{ "create":  { "_index": "website", "_type": "blog", "_id": "2" }}
{ "text" :  "This is a piece of cake2", "title" : "My first external blog entry2","username.lastname":"lastname2","username.firstname":"firstname2"}
{ "create":  { "_index": "website", "_type": "blog", "_id": "3" }}
{ "text" :  "This is a piece of cake3", "title" : "My first external blog entry3","username.lastname":"lastname3","username.firstname":"firstname3"}

GET /website/blog/_mget 
{
    "docs" : [
        {
            "_id" : "1",
            "_source" : false
        },
        {
            "_id" : "2",
            "_source" : ["title", "text"]
        },
        {
            "_id" : "3",
            "_source" : {
                "include": ["username"],
                "exclude": ["username.lastname"]
            }
        }
    ]
}

   {
        "docs": [
            {
                "_index": "website",
                "_type": "blog",
                "_id": "1",
                "_version": 1,
                "found": true
            },
            {
                "_index": "website",
                "_type": "blog",
                "_id": "2",
                "_version": 1,
                "found": true,
                "_source": {
                    "text": "This is a piece of cake2",
                    "title": "My first external blog entry2"
                }
            },
            {
                "_index": "website",
                "_type": "blog",
                "_id": "3",
                "_version": 1,
                "found": true,
                "_source": {
                    "username.firstname": "firstname3"
                }
            }
        ]
   }

Routing

Routing also applies to mget requests. You can set a default routing value in the URL and override it for individual documents in the body:

GET /website/blog/_mget?routing=key1 
{
    "docs" : [
        {
            "_id" : "1",
            "_routing" : "key2"
        },
        {
            "_id" : "2"
        }
    ]
}

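In the example above, document website/blog/1 is fetched from the shard selected by the routing key key2, while website/blog/2 is fetched from the shard selected by key1, the default set in the URL.

Cheaper in Bulk

In the same way that mget lets us retrieve multiple documents at once, the bulk API lets us perform multiple create, index, update, or delete requests in a single step. If you need to index a data stream such as log events, they can be queued up and indexed in batches of hundreds or thousands.

The bulk request body has a slightly different format from other requests: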

   { action: { metadata }}
   { request body }
   { action: { metadata }}
   { request body }
   ...

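This format is like a stream of valid one-line JSON documents joined together by newline characters (\n). Two points to note:

Every line must end with a newline character (\n), including the last line. These newlines are used as markers that separate the lines efficiently.
The lines cannot contain unescaped newline characters, as they would interfere with parsing. This means the JSON cannot be pretty-printed.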
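
Both rules are easy to satisfy when the body is generated rather than written by hand. A minimal Python sketch, using only the standard json module (the function name to_bulk_body is hypothetical):

```python
import json

def to_bulk_body(lines):
    """Serialize dicts to newline-delimited JSON, one object per line."""
    # json.dumps without an indent argument emits a single line, and newline
    # characters inside string values are escaped, so no raw newline can
    # appear in the middle of a line. Every line, including the last one,
    # gets a trailing newline.
    return "".join(json.dumps(line) + "\n" for line in lines)

body = to_bulk_body([
    {"delete": {"_index": "website", "_type": "blog", "_id": "123"}},
    {"create": {"_index": "website", "_type": "blog", "_id": "123"}},
    {"title": "My first\nblog post"},  # embedded newline gets escaped
])
print(body, end="")
```

The resulting string can be sent as the request body as-is. Building it by concatenating pretty-printed JSON would break the second rule.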

The section "Why the Funny Format?" of Elasticsearch: The Definitive Guide explains why the bulk API uses this format.

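The action/metadata line specifies which action to perform on which document.

The action must be one of the following:

create

Create the document only if it does not already exist. See "Creating a New Document" for details.

index

Create a new document or replace an existing one. See "Indexing a Document" and "Updating a Whole Document" for details.

update

Perform a partial update on a document. See "Partial Updates to Documents" for details.

delete

Delete a document. See "Deleting a Document" for details.

The metadata should specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted.

For example, a delete request looks like this: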

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

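The request body line consists of the document _source itself: the fields and values the document contains. It is required for index and create operations, which makes sense: you must supply the document to index.

It is also required for update operations, and should contain the same request body you would pass to the update API: doc, upsert, script, and so on. Delete operations do not need a request body line.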

{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }

If no _id is specified, an ID is generated automatically:

{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }

Putting it all together, a complete bulk request looks like this:

   POST /_bulk
   { "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
   { "create": { "_index": "website", "_type": "blog", "_id": "123" }}
   { "title": "My first blog post" }
   { "index": { "_index": "website", "_type": "blog" }}
   { "title": "My second blog post" }
   { "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
   { "doc" : {"title" : "My updated blog post"} }

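Note that the delete action has no request body; it is followed immediately by another action.

Remember the final newline character.

The Elasticsearch response contains an items array, which lists the result of each request in the same order as the requests were made: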

   {
       "took": 4,
       "errors": false,
       "items": [
          { "delete": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "_version": 2,
                "status":   200,
                "found":    true
          }},
          { "create": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "_version": 3,
                "status":   201
          }},
          { "create": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "EiwfApScQiiy7TIKFxRCTw",
                "_version": 1,
                "status":   201
          }},
          { "update": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "_version": 4,
                "status":   200
          }}
       ]
   }

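All sub-requests completed successfully.

Each sub-request is executed independently, so the failure of one does not affect the success of the others. If any sub-request fails, the top-level errors flag is set to true and the error details are reported under the corresponding request: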

POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }

In the response, we can see that creating document 123 failed because it already exists, but the subsequent index request, also on document 123, succeeded:

   {
       "took": 3,
       "errors": true,
       "items": [
          { "create": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "status":   409,
                "error":    "DocumentAlreadyExistsException[[website][4] [blog][123]: document already exists]"
          }},
          { "index": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "_version": 5,
                "status":   200
          }}
       ]
   }

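errors is true: one or more requests failed.
status is 409: the HTTP status code reported for this request is 409 CONFLICT.
error: the message explaining why the request failed.
The second request succeeded, with the HTTP status code 200 OK.

This also means that bulk requests are not atomic: they cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request does not affect the others.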
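
Because a bulk request can partly succeed, the caller has to walk the items array to learn which sub-requests failed. A Python sketch under the assumption that the response is already parsed into a dict and that a failed item carries an error field or a status of 400 or above (the function name failed_items is hypothetical):

```python
def failed_items(bulk_response):
    """Return (position, action, result) for each failed sub-request."""
    if not bulk_response.get("errors"):
        return []  # top-level flag says every sub-request succeeded
    failures = []
    for position, item in enumerate(bulk_response["items"]):
        # Each item is a single-key dict such as {"create": {...}}.
        action, result = next(iter(item.items()))
        if "error" in result or result.get("status", 200) >= 400:
            failures.append((position, action, result))
    return failures

response = {  # same shape as the failing response shown above
    "took": 3,
    "errors": True,
    "items": [
        {"create": {"_index": "website", "_type": "blog", "_id": "123",
                    "status": 409,
                    "error": "DocumentAlreadyExistsException[[website][4] "
                             "[blog][123]: document already exists]"}},
        {"index": {"_index": "website", "_type": "blog", "_id": "123",
                   "_version": 5, "status": 200}},
    ],
}
for position, action, result in failed_items(response):
    print(position, action, result["status"])  # 0 create 409
```

The position is the index into the items array, which matches the order of the actions in the request, so it can be used to decide which documents to retry.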

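Don't Repeat the Index and Type

Perhaps you are bulk-indexing log data into the same index and type. Specifying the same metadata for every document is wasteful. Instead, just as with the mget API, the bulk request accepts a default /_index or /_index/_type in the URL: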

POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

You can still override the _index and _type in the metadata line; the values in the URL are used as defaults wherever they are not specified:

POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }

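How Big Is Too Big?

The entire bulk request has to be loaded into memory by the node that receives it, so the bigger the request, the less memory is available for other requests. There is an optimal bulk request size: above it, performance no longer improves and may even drop. The optimum is not a fixed number, however. It depends entirely on your hardware, the size and complexity of your documents, and your overall indexing and search load.

Fortunately, this sweet spot is easy to find: bulk-index typical documents in batches of increasing size. When performance starts to drop off, the batch size is too big. A good place to start is batches of 1,000 to 5,000 documents or, if your documents are very large, even smaller batches.

It is often useful to keep an eye on the physical size of your bulk requests: one thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start with is around 5-15 MB.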
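
One way to respect both limits is to cut a batch whenever either the document count or the serialized size would be exceeded. A minimal Python sketch of that idea: the function name chunk_docs and its default limits (1,000 documents, 5 MB) are assumptions taken from the starting points above, and the size estimate counts only the document lines, not the action/metadata lines:

```python
import json

def chunk_docs(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents limited by count and by serialized size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # Size of the document line in the bulk body, +1 for the newline.
        size = len(json.dumps(doc).encode("utf-8")) + 1
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch

docs = [{"event": "User logged in", "n": n} for n in range(2500)]
batches = list(chunk_docs(docs, max_docs=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

A single document larger than max_bytes still goes out in a batch of its own rather than being dropped. The limits themselves should be tuned by measurement, as described above.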

Original article: https://www.cnblogs.com/gmhappy/p/11864066.html