• [翻译]——MySQL 8.0 Histograms


    前言: 本文是对这篇博客MySQL 8.0 Histograms的翻译,翻译如有不当的地方,敬请谅解,请尊重原创和翻译劳动成果,转载的时候请注明出处。谢谢!

     

    英文原文地址:https://lefred.be/content/mysql-8-0-histograms/

     

    翻译原文地址:https://www.cnblogs.com/kerrycode/p/11817026.html

     

     

    在MySQL 8.0之前,MySQL缺失了其它关系数据库中一个众所周知的功能:优化器的直方图

     

    优化器团队(Optimizer Team)在越来越多的MySQL DBA的呼声中实现了这个功能。

     

     

    直方图定义

     

    但什么是直方图呢?我们来看维基百科的定义吧,直方图是数值数据分布的准确表示。 对于RDBMS来说,直方图是特定列内数据分布的近似值。因此在MySQL中,直方图能够帮助优化器找到最有效的执行计划。

     

    直方图例子

     

    为了说明直方图是如何影响优化器工作的,我会用dbt3生成的数据来演示。

     

    我们准备了一个简单查询:

     

    SELECT * FROM orders  
      JOIN customer ON o_custkey = c_custkey 
    WHERE o_orderdate < '1993-01-01' 
      AND c_mktsegment = "AUTOMOBILE"G

     

    让我们看一下传统的执行计划的EXPLAIN输出,以及可视化方式(VISUAL one):

     

    mysql> EXPLAIN SELECT * FROM orders  
           JOIN customer ON o_custkey = c_custkey 
           WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: customer
       partitions: NULL
             type: ALL
    possible_keys: PRIMARY
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 149050
         filtered: 10.00
            Extra: Using where
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: orders
       partitions: NULL
             type: ref
    possible_keys: i_o_custkey,i_o_orderdate
              key: i_o_custkey
          key_len: 5
              ref: dbt3.customer.c_custkey
             rows: 14
         filtered: 30.62
            Extra: Using where
    2 rows in set, 1 warning (0.28 sec)

     

    我们看到MySQL首先对customer表做了一个全表扫描,并且它的选择估计记录(过滤)是10%;

     

     

    image

     

    接下来让我们运行这个查询(我使用了COUNT(*)),然后我们来看看有多少行记录

     

    mysql> SELECT count(*) FROM orders  
           JOIN customer ON o_custkey = c_custkey 
           WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"G
    *************************** 1. row ***************************
    count(*): 45127
    1 row in set (49.98 sec)

     

     

    创建直方图

     

    现在,我将在表customer上的字段c_mktsegment上创建一个直方图

     

    mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 1024 BUCKETS;
    +---------------+-----------+----------+---------------------------------------------------------+
    | Table         | Op        | Msg_type | Msg_text                                                |
    +---------------+-----------+----------+---------------------------------------------------------+
    | dbt3.customer | histogram | status   | Histogram statistics created for column 'c_mktsegment'. |
    +---------------+-----------+----------+---------------------------------------------------------+

     

    接下来,我们来验证查询的执行计划:

     

    mysql> EXPLAIN SELECT * FROM orders  
                   JOIN customer ON o_custkey = c_custkey 
                   WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: orders
       partitions: NULL
             type: ALL
    possible_keys: i_o_custkey,i_o_orderdate
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 1494230
         filtered: 30.62
            Extra: Using where
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: customer
       partitions: NULL
             type: eq_ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 4
              ref: dbt3.orders.o_custkey
             rows: 1
         filtered: 19.84
            Extra: Using where
    2 rows in set, 1 warning (1.06 sec)

     

     

    现在,使用直方图后,我们可以看到customer表的吸引力降低了,因为order表按条件过滤的行的百分比(30.62)几乎是customer表按条件过滤行的百分比的两倍(19.84%),这将导致低order表进行查找。

     

    注意:这段感觉没有翻译恰当,英文原文如下,如果感觉翻译比较生硬,参考原文

     

    Now with the histogram we can see that it becomes less attractive to start with customer table since almost twice as many rows (19.84%) will cause look-ups into the order table.

     

     

    image

     

    优化器选择对order表进行全表扫描(full sacn),此时执行计划的代价看起来似乎还高一些,,让我们看一下SQL的执行时间:

     

     

    mysql> SELECT count(*) FROM orders  
           JOIN customer ON o_custkey = c_custkey 
           WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"G
    *************************** 1. row ***************************
    count(*): 45127
    1 row in set (6.35 sec)

     

    SQL语句的执行时间更短,明显比之前要快了

     

     

     

    查看数据的分布

     

     

    直方图数据存贮在Information_Schema.column_statistics表中,这个表的定义如下

     

     

    +-------------+-------------+------+-----+---------+-------+
    | Field       | Type        | Null | Key | Default | Extra |
    +-------------+-------------+------+-----+---------+-------+
    | SCHEMA_NAME | varchar(64) | NO   |     | NULL    |       |
    | TABLE_NAME  | varchar(64) | NO   |     | NULL    |       |
    | COLUMN_NAME | varchar(64) | NO   |     | NULL    |       |
    | HISTOGRAM   | json        | NO   |     | NULL    |       |
    +-------------+-------------+------+-----+---------+-------+

     

     

    它的一条记录类似下面这样:

     

    SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM) 
    FROM information_schema.column_statistics 
    WHERE COLUMN_NAME = 'c_mktsegment'G
    *************************** 1. row ***************************
               SCHEMA_NAME: dbt3
                TABLE_NAME: customer
               COLUMN_NAME: c_mktsegment
    JSON_PRETTY(HISTOGRAM): {
      "buckets": [
        [
          "base64:type254:QVVUT01PQklMRQ==",
          0.19837010534684954
        ],
        [
          "base64:type254:QlVJTERJTkc=",
          0.3983104750546611
        ],
        [
          "base64:type254:RlVSTklUVVJF",
          0.5978433710991851
        ],
        [
          "base64:type254:SE9VU0VIT0xE",
          0.799801232359372
        ],
        [
          "base64:type254:TUFDSElORVJZ",
          1.0
        ]
      ],
      "data-type": "string",
      "null-values": 0.0,
      "collation-id": 255,
      "last-updated": "2018-03-02 20:21:48.271523",
      "sampling-rate": 0.6709158000670916,
      "histogram-type": "singleton",
      "number-of-buckets-specified": 1024
    }

     

    而且可以查看分布

     

    SELECT FROM_BASE64(SUBSTRING_INDEX(v, ':', -1)) value, concat(round(c*100,1),'%') cumulfreq, 
           CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq  
    FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets', 
         '$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist  
    WHERE schema_name  = 'dbt3' and table_name = 'customer' and column_name = 'c_mktsegment';
    +------------+-----------+-------+
    | value      | cumulfreq | freq  |
    +------------+-----------+-------+
    | AUTOMOBILE | 19.8%     | 19.8% |
    | BUILDING   | 39.9%     | 20.1% |
    | FURNITURE  | 59.9%     | 19.9% |
    | HOUSEHOLD  | 79.9%     | 20.1% |
    | MACHINERY  | 100.0%    | 20.1% |
    +------------+-----------+-------+

     

    你也可以用下面语法删除直方图信息。

     

     

    mysql> ANALYZE TABLE customer DROP HISTOGRAM on c_mktsegment;
    +---------------+-----------+----------+---------------------------------------------------------+
    | Table         | Op        | Msg_type | Msg_text                                                |
    +---------------+-----------+----------+---------------------------------------------------------+
    | dbt3.customer | histogram | status   | Histogram statistics removed for column 'c_mktsegment'. |
    +---------------+-----------+----------+---------------------------------------------------------+
    1 row in set (0.00 sec)

     

     

    Buckets

     

    你会注意到,当我们创建一个直方图时,我们需要指定buckets的数量,事实上,数据被分成包含特定值以及他们基数(cardinality)的一组Buckets,如果在上一个例子中检查直方图的类型,你会发现它是等宽直方图(singleton)

     

     

    "histogram-type": "singleton",

     

     

    这种类型的直方图最好的,因为基数是针对单个特定值。 如果这次我仅使用2个存储桶(buckets)来重新创建直方图(请记住,在c_mktsegment列中有4个不同的值):

     

     

    mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 2 BUCKETS;
    +---------------+-----------+----------+---------------------------------------------------------+
    | Table         | Op        | Msg_type | Msg_text                                                |
    +---------------+-----------+----------+---------------------------------------------------------+
    | dbt3.customer | histogram | status   | Histogram statistics created for column 'c_mktsegment'. |
    +---------------+-----------+----------+---------------------------------------------------------+

     

    如果我检查直方图的类型:

     

     

    mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, 
                  JSON_PRETTY(HISTOGRAM) 
           FROM information_schema.column_statistics 
          WHERE COLUMN_NAME = 'c_mktsegment'G
    *************************** 1. row ***************************
               SCHEMA_NAME: dbt3
                TABLE_NAME: customer
               COLUMN_NAME: c_mktsegment
    JSON_PRETTY(HISTOGRAM): {
      "buckets": [
        [
          "base64:type254:QVVUT01PQklMRQ==",
          "base64:type254:RlVSTklUVVJF",
          0.5996992690844636,
          3
        ],
        [
          "base64:type254:SE9VU0VIT0xE",
          "base64:type254:TUFDSElORVJZ",
          1.0,
          2
        ]
      ],
      "data-type": "string",
      "null-values": 0.0,
      "collation-id": 255,
      "last-updated": "2018-03-02 20:42:26.165898",
      "sampling-rate": 0.6709158000670916,
      "histogram-type": "equi-height",
      "number-of-buckets-specified": 2
    }

    现在的直方图类型是等高直方图,这意味着将连续范围的值分组到存储桶中,以使落入每个存储桶的数据项的数量相同。

     

     

    结论:

     

    直方图对那些不是索引中第一列的列非常有用,这些列用于JOIN、IN子查询(IN-subqueries)或ORDER BY…LIMIT的查询的WHERE条件下使用。  

     

    另外, 可以考虑尝试使用足够的存储通来获取等宽直方图。

  • 相关阅读:
    iOS:Core Data 中的简单ORM
    Win8:Snap 实现
    js: 删除node的所有child
    WinJS:Listview item 设置背景透明
    iOS: 消息通信中的Notification&KVO
    win8: 清除iframe的缓存
    What's New in iOS7,iOS7新特性介绍
    "Entity Framework数据插入性能追踪"读后总结
    夜,思考——我想要的到底是什么?
    【查询】—Entity Framework实例详解
  • 原文地址:https://www.cnblogs.com/kerrycode/p/11817026.html
Copyright © 2020-2023  润新知