• 多维数据查询效率分析(1)


    有时需求需要我们把系统做成灵活的。最常见的形式是,属性不能是固定的,要用户可以自定义。这样的需求往往会在数据库中建模成一个一对多的关系。

    create table person {

    ...  

    }
    create table person_attribute {

    person_id ...

    attribute_name ...

    attribute_value ...

    ...

    }

    这样的建模在没有查询需要的时候,还是蛮不错的。但是一旦需要对扩充的属性值进行查询,速度往往惨不忍睹。曾经在新加坡做过一个电信的遗留系统的前端,其数据库的建模就是这样的。对于中间的属性表,一个简单的查询都需要join好几次,速度非常慢。好在那次只是做ETL,并不是直接把这样的数据库做后端,要不然肯定死的很惨。当时的做法是把所有的数据读入到内存中,针对属性的查询用内存集合遍历来实现。这样做的前提是集合的元素数量非常少(几百而已),总数据量也非常少。但是如果我们需要处理的数据量非常大,那么我们就必须在数据库中能够对多维数据进行高效查询。

    为了搞清楚这个问题,我们需要做一系列实验。先来介绍一下我们实验的对象。

    假定我们有一张contacts表,然后对每个联系人有一个contact_categories的表,简称cc

    create table cc(
    contact_id integer,
    cad_id integer,
    value integer);

    cad_id代表字段的id,value是cateogry的值。数据量是500万。实验用的数据库分别是PostgreSQL(原始数据大小250M),和使用MYISAM引擎的MySQL(原始数据大小73M)。使用的磁盘是普通的笔记本硬盘,没有raid,普通的ext4分区,峰值传输率大概是70M/s。假设没有资源的争抢,而且数据库总是以最快的顺序读的方式从磁盘中加载数据,那么PostgreSQL得用4s,而MySQL也需要1s才能把所有的磁盘内容读到内存中。

    所以如果我们使用基于磁盘的解决方案的话,无论如何也无法把查询压缩到1s以内。因为我们需要提供一个Reponsive的界面前端,所以数据必须能够在内存中被查询,可能不是所有的数据都能放入内存,但是最起码被查询到的数据得一直在内存中。最简单的办法把数据库移到内存中的方式不是改数据库的设置,而是直接把内存映射成文件夹:

    sudo mount -t ramfs -o size=200000m ramfs /mnt/memory

    然后把数据库的数据目录移动到/mnt/memory之中。但是数据库的缓存设置也是必须修改的,比如PostgreSQL的work_mem如果设置过小的话,在做对一个很大的表做count(distinct xxx)时就会导致中间结果被写入到临时表之中。所以我们还是把所有的缓存搞大一些吧。

    work_mem = 1000MB
    shared_buffer = 1000MB
    temp_buffer = 1000MB
    effective_cache_size = 1000MB
    wal_buffers = 1000MB
    auto_vacuum = off

    让我们来看看最基本的一个查询能有多快吧

    taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc;'
    count
    ---------
    5904385
    (1 row)


    real 0m0.448s
    user 0m0.024s
    sys 0m0.000s

    count(*)和count(contact_id)是一样的么,让我们来试验一下

    taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc;'
    count
    ---------
    5904385
    (1 row)


    real 0m0.537s
    user 0m0.016s
    sys 0m0.008s

    有意思!居然比count(*)还要慢。可能是因为我们没有给contact_id字段加索引的原因。好吧,加上索引。

    taowen@dmright-perf:~$ time psql postgres -c 'create index contact_id_idx on cc(contact_id);'
    CREATE INDEX

    real 0m4.848s
    user 0m0.016s
    sys 0m0.008s
    taowen@dmright-perf:~$ time psql postgres -c 'analyze cc;'
    ANALYZE

    real 0m0.197s
    user 0m0.024s
    sys 0m0.004s

    再试试看

    taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc;'
    count
    ---------
    5904385
    (1 row)


    real 0m0.534s
    user 0m0.020s
    sys 0m0.008s

    taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc;'
    count
    ---------
    5904385
    (1 row)


    real 0m0.447s
    user 0m0.028s
    sys 0m0.008s

    基本上没有变化……count(*)貌似就是比count(contact_id)要快。不管啦,既然小于500ms,也算是够快了。让我们给查询加上个条件吧。

    taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc where cad_id = 101 and value = 5;'
    count
    --------
    998839
    (1 row)


    real 0m0.686s
    user 0m0.024s
    sys 0m0.000s

    taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc where cad_id = 101 and value = 5;'
    count
    --------
    998839
    (1 row)


    real 0m0.660s
    user 0m0.024s
    sys 0m0.000s

    我们可以看到,在count(*)和count(contact_id)之间没有特别大的区别。但是我们还没有给cad_id和value加索引,所以让我们加上看看如何

    taowen@dmright-perf:~$ time psql postgres -c 'create index cad_id_value_idx on cc(cad_id, value);'
    CREATE INDEX

    real 0m10.069s
    user 0m0.020s
    sys 0m0.008s

    taowen@dmright-perf:~$ time psql postgres -c 'analyze cc;'
    ANALYZE

    real 0m0.199s
    user 0m0.016s
    sys 0m0.012s

    and try again.

    taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc where cad_id = 101 and value = 5;'
    count
    --------
    998839
    (1 row)


    real 0m0.283s
    user 0m0.020s
    sys 0m0.012s

    真是很快很快耶!让我们来回顾一下,我们现在对于contact_id和(cad_id, value)都建立了索引。
    接下来我们把查询弄得更复杂一些,加上AND条件。有三种可能的方式:
    1、INTERSECT
    2、INNER JOIN
    3、IN + SUB QUERY
    我们每种做法都试试

    taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp'
    count
    --------
    164788
    (1 row)


    real 0m1.159s
    user 0m0.028s
    sys 0m0.008s

    taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp' count
    --------
    164788
    (1 row)


    real 0m1.148s
    user 0m0.032s
    sys 0m0.000s

    显而易见,这么做很慢。那么INNER JOIN是不是更快一些呢?

    taowen@dmright-perf:~$ time psql postgres -c 'select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;'
    count
    --------
    164788
    (1 row)


    real 0m1.162s
    user 0m0.036s
    sys 0m0.000s

    这大概要慢上个200ms了。那么IN + SUB QUERY呢?

    taowen@dmright-perf:~$ time psql postgres -c 'select count(a1.contact_id) from cc as a1 where a1.cad_id = 101 and a1.value = 5 and a1.contact_id in (select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7)'
    count
    --------
    164788
    (1 row)


    real 0m2.645s
    user 0m0.024s
    sys 0m0.004s

    靠,居然更慢了。总结就是,在有两个条件的情况下,INTERSECT似乎是最快的。但是即便如此,它也超过了一秒钟了。为什么会这样呢?时间都花哪里去了?

    taowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp'
    QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=107853.91..107853.92 rows=1 width=4) (actual time=1471.907..1471.907 rows=1 loops=1)
    -> Subquery Scan on temp (cost=7467.63..107375.65 rows=191301 width=4) (actual time=1366.131..1459.419 rows=164788 loops=1)
    -> HashSetOp Intersect (cost=7467.63..105462.64 rows=191301 width=4) (actual time=1366.129..1439.781 rows=164788 loops=1)
    -> Append (cost=7467.63..103326.69 rows=854380 width=4) (actual time=69.436..797.478 rows=1829408 loops=1)
    -> Subquery Scan on "*SELECT* 2" (cost=7467.63..48180.53 rows=351876 width=4) (actual time=69.435..350.031 rows=830569 loops=1)
    -> Bitmap Heap Scan on cc (cost=7467.63..44661.77 rows=351876 width=4) (actual time=69.434..264.538 rows=830569 loops=1)
    Recheck Cond: ((cad_id = 102) AND (value = 7))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..7379.66 rows=351876 width=0) (actual time=64.162..64.162 rows=830569 loops=1)
    Index Cond: ((cad_id = 102) AND (value = 7))
    -> Subquery Scan on "*SELECT* 1" (cost=10667.56..55146.16 rows=502504 width=4) (actual time=69.846..331.544 rows=998839 loops=1)
    -> Bitmap Heap Scan on cc (cost=10667.56..50121.12 rows=502504 width=4) (actual time=69.845..233.794 rows=998839 loops=1)
    Recheck Cond: ((cad_id = 101) AND (value = 5))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..10541.94 rows=502504 width=0) (actual time=64.501..64.501 rows=998839 loops=1)
    Index Cond: ((cad_id = 101) AND (value = 5))
    Total runtime: 1477.484 ms
    (15 rows)

    首先,INTERSECT不是并行执行的。两个子查询分别花费了300ms以上的时间,加起来有800ms是用在搜集contact_id上了。其次,集合之间的并集操作花费了600多ms。其余的时间都花在了数集合的成员个数上了。有一点值得注意的是,根据 http://postgresql.1045698.n5.nabble.com/ANTI-JOIN-needs-table-index-scan-not-possible-td3425340.html index scan和heap scan实际上都用上了索引。然而,SELECT还是在其之上额外耗费了100ms,我猜测它可能是回到原始的表结构中把行取出来,以获得contact_id的值。这在磁盘上的话速度会更慢,因为会是random seek操作。
    看完了INTERSECT,让我们再来分析分析Join:

    taowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;'
    QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=180366.31..180366.32 rows=1 width=4) (actual time=1280.964..1280.964 rows=1 loops=1)
    -> Merge Join (cost=174783.00..180104.02 rows=104912 width=4) (actual time=1043.879..1270.197 rows=164788 loops=1)
    Merge Cond: (a1.contact_id = a2.contact_id)
    -> Sort (cost=97705.18..98961.44 rows=502504 width=4) (actual time=586.735..626.292 rows=998839 loops=1)
    Sort Key: a1.contact_id
    Sort Method: quicksort Memory: 71397kB
    -> Bitmap Heap Scan on cc a1 (cost=10667.56..50121.12 rows=502504 width=4) (actual time=75.377..257.403 rows=998839 loops=1)
    Recheck Cond: ((cad_id = 101) AND (value = 5))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..10541.94 rows=502504 width=0) (actual time=69.565..69.565 rows=998839 loops=1)
    Index Cond: ((cad_id = 101) AND (value = 5))
    -> Sort (cost=77077.83..77957.52 rows=351876 width=4) (actual time=457.131..492.428 rows=830569 loops=1)
    Sort Key: a2.contact_id
    Sort Method: quicksort Memory: 63509kB
    -> Bitmap Heap Scan on cc a2 (cost=7467.63..44661.77 rows=351876 width=4) (actual time=60.039..186.422 rows=830569 loops=1)
    Recheck Cond: ((cad_id = 102) AND (value = 7))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..7379.66 rows=351876 width=0) (actual time=54.929..54.929 rows=830569 loops=1)
    Index Cond: ((cad_id = 102) AND (value = 7))
    Total runtime: 1286.846 ms
    (18 rows)

    貌似大部分时间都花在了排序上。而且它也不是并行执行的。如果后台同时执行top命令的话,就会发现只有一个核是被实际占用着的。
    要是再创建更多的索引呢?会不会有帮助?

    taowen@dmright-perf:~$ time psql postgres -c 'create index cad_id_value_contact_id on cc(cad_id, value, contact_id);'CREATE INDEX

    real 0m10.683s
    user 0m0.020s
    sys 0m0.004s
    taowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;'
    QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=180366.31..180366.32 rows=1 width=4) (actual time=1338.478..1338.478 rows=1 loops=1)
    -> Merge Join (cost=174783.00..180104.02 rows=104912 width=4) (actual time=1090.136..1327.312 rows=164788 loops=1)
    Merge Cond: (a1.contact_id = a2.contact_id)
    -> Sort (cost=97705.18..98961.44 rows=502504 width=4) (actual time=615.843..657.438 rows=998839 loops=1)
    Sort Key: a1.contact_id
    Sort Method: quicksort Memory: 71397kB
    -> Bitmap Heap Scan on cc a1 (cost=10667.56..50121.12 rows=502504 width=4) (actual time=80.926..275.431 rows=998839 loops=1)
    Recheck Cond: ((cad_id = 101) AND (value = 5))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..10541.94 rows=502504 width=0) (actual time=75.816..75.816 rows=998839 loops=1)
    Index Cond: ((cad_id = 101) AND (value = 5))
    -> Sort (cost=77077.83..77957.52 rows=351876 width=4) (actual time=474.279..510.866 rows=830569 loops=1)
    Sort Key: a2.contact_id
    Sort Method: quicksort Memory: 63509kB
    -> Bitmap Heap Scan on cc a2 (cost=7467.63..44661.77 rows=351876 width=4) (actual time=65.335..198.655 rows=830569 loops=1)
    Recheck Cond: ((cad_id = 102) AND (value = 7))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..7379.66 rows=351876 width=0) (actual time=60.314..60.314 rows=830569 loops=1)
    Index Cond: ((cad_id = 102) AND (value = 7))
    Total runtime: 1346.587 ms
    (18 rows)

    还是不行!Merge Join慢的话,升级到9.1然后强制使用hash join会不会好一点?

    postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5';
    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=902356.85..902356.86 rows=1 width=4) (actual time=1693.137..1693.137 rows=1 loops=1)
    -> Hash Join (cost=59599.78..902135.88 rows=88389 width=4) (actual time=461.788..1682.718 rows=164788 loops=1)
    Hash Cond: (a1.contact_id = a2.contact_id)
    -> Bitmap Heap Scan on cc a1 (cost=10833.24..50406.32 rows=510472 width=4) (actual time=76.337..224.571 rows=998839 loops=1)
    Recheck Cond: ((cad_id = 101) AND (value = 5))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..10705.62 rows=510472 width=0) (actual time=71.028..71.028 rows=998839 loops=1)
    Index Cond: ((cad_id = 101) AND (value = 5))
    -> Hash (cost=44444.15..44444.15 rows=345792 width=4) (actual time=385.256..385.256 rows=830569 loops=1)
    Buckets: 65536 Batches: 1 Memory Usage: 29200kB
    -> Bitmap Heap Scan on cc a2 (cost=7341.27..44444.15 rows=345792 width=4) (actual time=64.778..258.059 rows=830569 loops=1)
    Recheck Cond: ((cad_id = 102) AND (value = 7))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..7254.82 rows=345792 width=0) (actual time=59.675..59.675 rows=830569 loops=1)
    Index Cond: ((cad_id = 102) AND (value = 7))
    Total runtime: 1698.207 ms
    (14 rows)

    有两个原因造成这样的情况:
    1、两个条件意味着扫描两次,而且不是并行扫描
    2、join自身很耗费时间,无论是sort merge join还是hash join

    两个条件都这熊样了,三个条件呢?看看吧,先上INTERSECT:

    postgres@dmright-perf:/mnt/memory/pg$ time psql postgres -c 'select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7 intersect select contact_id from cc where cad_id = 6 and value = 1) as temp'
    count
    -------
    6748
    (1 row)


    real 0m1.350s
    user 0m0.020s
    sys 0m0.008s

    然后是INNER JOIN

    postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id inner join cc as a3 on a1.contact_id = a3.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1';
    count
    -------
    6748
    (1 row)


    real 0m0.756s
    user 0m0.028s
    sys 0m0.000s

    然后是IN + SUBQUERY:

    postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'select count(a1.contact_id) from cc as a1 where a1.cad_id = 101 and a1.value = 5 and a1.contact_id in (select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7 and a2.contact_id in (select contact_id from cc as a3 where a3.cad_id = 6 and a3.value = 1))';
    count
    -------
    6748
    (1 row)


    real 0m7.320s
    user 0m0.024s
    sys 0m0.004s

    我们可以看到这回INNER JOIN是最快的了。为什么?

    postgres@dmright-perf:/mnt/memory/pg$ time psql postgres -c 'explain analyze select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7 intersect select contact_id from cc where cad_id = 6 and value = 1) as temp'
    QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=132357.21..132357.22 rows=1 width=4) (actual time=1659.666..1659.666 rows=1 loops=1)
    -> Subquery Scan on temp (cost=306.85..132335.71 rows=8602 width=4) (actual time=1648.818..1659.243 rows=6748 loops=1)
    -> HashSetOp Intersect (cost=306.85..132249.69 rows=8602 width=4) (actual time=1648.817..1658.543 rows=6748 loops=1)
    -> Append (cost=306.85..131691.89 rows=223118 width=4) (actual time=33.797..1547.811 rows=372016 loops=1)
    -> Subquery Scan on "*SELECT* 3" (cost=306.85..26138.13 rows=14239 width=4) (actual time=33.797..153.230 rows=207228 loops=1)
    -> Bitmap Heap Scan on cc (cost=306.85..25995.74 rows=14239 width=4) (actual time=33.795..134.292 rows=207228 loops=1)
    Recheck Cond: ((cad_id = 6) AND (value = 1))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..303.29 rows=14239 width=0) (actual time=28.558..28.558 rows=207228 loops=1)
    Index Cond: ((cad_id = 6) AND (value = 1))
    -> Result (cost=7341.27..105553.76 rows=208879 width=4) (actual time=1282.426..1371.255 rows=164788 loops=1)
    -> HashSetOp Intersect (cost=7341.27..105553.76 rows=208879 width=4) (actual time=1282.418..1353.422 rows=164788 loops=1)
    -> Append (cost=7341.27..103413.10 rows=856264 width=4) (actual time=57.550..734.282 rows=1829408 loops=1)
    -> Subquery Scan on "*SELECT* 2" (cost=7341.27..47902.07 rows=345792 width=4) (actual time=57.550..292.595 rows=830569 loops=1)
    -> Bitmap Heap Scan on cc (cost=7341.27..44444.15 rows=345792 width=4) (actual time=57.549..218.162 rows=830569 loops=1)
    Recheck Cond: ((cad_id = 102) AND (value = 7))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..7254.82 rows=345792 width=0) (actual time=52.606..52.606 rows=830569 loops=1)
    Index Cond: ((cad_id = 102) AND (value = 7))
    -> Subquery Scan on "*SELECT* 1" (cost=10833.24..55511.04 rows=510472 width=4) (actual time=69.129..330.737 rows=998839 loops=1)
    -> Bitmap Heap Scan on cc (cost=10833.24..50406.32 rows=510472 width=4) (actual time=69.128..242.416 rows=998839 loops=1)
    Recheck Cond: ((cad_id = 101) AND (value = 5))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..10705.62 rows=510472 width=0) (actual time=64.161..64.161 rows=998839 loops=1)
    Index Cond: ((cad_id = 101) AND (value = 5))
    Total runtime: 1665.691 ms
    (23 rows)

    对于INTERSECT来说,过程和两个条件是差不多的,只是集合更大一些罢了。

    postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id inner join cc as a3 on a1.contact_id = a3.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1';
    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=118979.47..118979.48 rows=1 width=4) (actual time=771.393..771.393 rows=1 loops=1)
    -> Nested Loop (cost=49073.39..118977.89 rows=630 width=4) (actual time=399.835..770.587 rows=6748 loops=1)
    -> Hash Join (cost=49073.39..97159.97 rows=2466 width=8) (actual time=399.814..691.519 rows=34219 loops=1)
    Hash Cond: (a3.contact_id = a2.contact_id)
    -> Bitmap Heap Scan on cc a3 (cost=306.85..25995.74 rows=14239 width=4) (actual time=25.655..83.526 rows=207228 loops=1)
    Recheck Cond: ((cad_id = 6) AND (value = 1))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..303.29 rows=14239 width=0) (actual time=20.570..20.570 rows=207228 loops=1)
    Index Cond: ((cad_id = 6) AND (value = 1))
    -> Hash (cost=44444.15..44444.15 rows=345792 width=4) (actual time=373.969..373.969 rows=830569 loops=1)
    Buckets: 65536 Batches: 1 Memory Usage: 29200kB
    -> Bitmap Heap Scan on cc a2 (cost=7341.27..44444.15 rows=345792 width=4) (actual time=59.271..250.932 rows=830569 loops=1)
    Recheck Cond: ((cad_id = 102) AND (value = 7))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..7254.82 rows=345792 width=0) (actual time=54.030..54.030 rows=830569 loops=1)
    Index Cond: ((cad_id = 102) AND (value = 7))
    -> Index Scan using cad_id_value_contact_id_idx on cc a1 (cost=0.00..8.83 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=34219)
    Index Cond: ((cad_id = 101) AND (value = 5) AND (contact_id = a2.contact_id))
    Total runtime: 774.588 ms
    (17 rows)

    对于INNER JOIN,索引都被利用上了。它不需要取得contact_id然后再来做集合操作。结论是对rowid做hash操作比集合操作更快。
    不显示用INNER JOIN,让Planner决定Join顺序也是一样的:

    postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1, cc as a2, cc as a3 where a1.contact_id = a2.contact_id and a2.contact_id = a3.contact_id and a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1';
    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=118979.47..118979.48 rows=1 width=4) (actual time=762.969..762.970 rows=1 loops=1)
    -> Nested Loop (cost=49073.39..118977.89 rows=630 width=4) (actual time=398.554..762.206 rows=6748 loops=1)
    -> Hash Join (cost=49073.39..97159.97 rows=2466 width=8) (actual time=398.531..684.425 rows=34219 loops=1)
    Hash Cond: (a3.contact_id = a2.contact_id)
    -> Bitmap Heap Scan on cc a3 (cost=306.85..25995.74 rows=14239 width=4) (actual time=34.802..91.672 rows=207228 loops=1)
    Recheck Cond: ((cad_id = 6) AND (value = 1))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..303.29 rows=14239 width=0) (actual time=29.524..29.524 rows=207228 loops=1)
    Index Cond: ((cad_id = 6) AND (value = 1))
    -> Hash (cost=44444.15..44444.15 rows=345792 width=4) (actual time=363.537..363.537 rows=830569 loops=1)
    Buckets: 65536 Batches: 1 Memory Usage: 29200kB
    -> Bitmap Heap Scan on cc a2 (cost=7341.27..44444.15 rows=345792 width=4) (actual time=57.799..245.467 rows=830569 loops=1)
    Recheck Cond: ((cad_id = 102) AND (value = 7))
    -> Bitmap Index Scan on cad_id_value_idx (cost=0.00..7254.82 rows=345792 width=0) (actual time=52.349..52.349 rows=830569 loops=1)
    Index Cond: ((cad_id = 102) AND (value = 7))
    -> Index Scan using cad_id_value_contact_id_idx on cc a1 (cost=0.00..8.83 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=34219)
    Index Cond: ((cad_id = 101) AND (value = 5) AND (contact_id = a2.contact_id))
    Total runtime: 766.107 ms
    (17 rows)

    结果是一样的。

    基本上对PostgreSQL的实验就到这里了,差不多也就这样了,提高空间不大。Google之后发现,PostgreSQL使用的MVCC机制导致其甚至在做SELECT COUNT这样的操作的时候也会去更新hint bit。也许这就是其慢的重要原因。MySQL的MYISAM引擎以完全不负责事务和著称,应该会比PostgreSQL有更大的提升,下一篇中,我们将对MySQL重复同样的实验。




























  • 相关阅读:
    Js 30 BOM
    js面向对象
    js模态窗口
    js默认行为(也称默认事件)
    框架的控件隐藏
    20150706 js之定时器
    sublime快捷方式和node.js
    js回调函数2
    Hibernate 多对一
    Hibernate入门之配置文件
  • 原文地址:https://www.cnblogs.com/taowen/p/2367724.html
Copyright © 2020-2023  润新知