哪种方法是找出在一张表中,而不在另一张表中的记录的最佳方法呢?
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL;
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r );
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value );
我们先把环境准备一下:
postgres 11.9
CREATE TABLE t_left ( id INT NOT NULL PRIMARY KEY, value INT NOT NULL, stuffing VARCHAR(200) NOT NULL ); CREATE TABLE t_right ( id INT NOT NULL PRIMARY KEY, value INT NOT NULL, stuffing VARCHAR(200) NOT NULL );
CREATE INDEX ix_left_value ON t_left (value); CREATE INDEX ix_right_value ON t_right (value);
INSERT INTO t_left SELECT s, s % 10000, RPAD('Value '|| s || ' ', 200, '*') FROM generate_series(1, 100000) s;
INSERT INTO t_right SELECT s, s % 10000 + 1, RPAD('Value '|| s || ' ', 200, '*') FROM generate_series(1, 1000000) s;
ANALYZE t_left; ANALYZE t_right;
表t_left包含100000行记录,其中有1000个不同的记录;
只是测试
表t_right包含100000行记录,其中有1000个不同的记录。
t_left表有10条记录不存在与t_right表中。
NOT IN
先看看not in
postgres=# SELECT l.id, l.value postgres-# FROM t_left l postgres-# WHERE value NOT IN postgres-# ( postgres(# SELECT value postgres(# FROM t_right r postgres(# ); id | value --------+------- 20000 | 0 10000 | 0 60000 | 0 30000 | 0 40000 | 0 50000 | 0 70000 | 0 80000 | 0 90000 | 0 100000 | 0 (10 rows) postgres=# explain analyze SELECT l.id, l.value postgres-# FROM t_left l postgres-# WHERE value NOT IN postgres-# ( postgres(# SELECT value postgres(# FROM t_right r postgres(# ); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Gather (cost=1000.00..1520921531.00 rows=50000 width=8) (actual time=7329.726..40554.033 rows=10 loops=1) Workers Planned: 1 Workers Launched: 1 -> Parallel Seq Scan on t_left l (cost=0.00..1520915531.00 rows=29412 width=8) (actual time=5506.368..40405.539 rows=5 loops=2) Filter: (NOT (SubPlan 1)) Rows Removed by Filter: 49995 SubPlan 1 -> Materialize (cost=0.00..49211.00 rows=1000000 width=4) (actual time=0.001..0.534 rows=4084 loops=100000) -> Seq Scan on t_right r (cost=0.00..40304.00 rows=1000000 width=4) (actual time=0.002..0.575 rows=3739 loops=37433) Planning Time: 0.155 ms Execution Time: 40557.112 ms (11 rows) postgres=#
这里之所以先讨论not in,是因为其对待null的方式特殊。
不幸的是,postgresql优化器不能利用到t_right.value被定义为not null的事实。因此,不可以返回null值。(即not in不能返回null值)
这里可以做一个小测试:
postgres=# create table aa(id int,age int); postgres=# insert into aa values(1,1); postgres=# insert into aa values(2,2); postgres=# insert into aa(id) values(3); postgres=# select * from aa where age not in (1,2); id | age ----+----- (0 rows) postgres=# select * from aa where age not in (1); id | age ----+----- 2 | 2 (1 row) postgres=#
这就是为什么postgresql对not in查询使用一个特殊的访问方法。
NOT EXISTS
postgres=# SELECT l.id, l.value postgres-# FROM t_left l postgres-# WHERE NOT EXISTS postgres-# ( postgres(# SELECT select postgres(# FROM t_right r postgres(# WHERE r.value = l.value postgres(# ); id | value --------+------- 10000 | 0 30000 | 0 20000 | 0 40000 | 0 60000 | 0 50000 | 0 80000 | 0 70000 | 0 90000 | 0 100000 | 0 (10 rows) postgres=# explain analyze SELECT l.id, l.value postgres-# FROM t_left l postgres-# WHERE NOT EXISTS postgres-# ( postgres(# SELECT value postgres(# FROM t_right r postgres(# WHERE r.value = l.value postgres(# ); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------- Gather (cost=1000.42..34410.22 rows=10 width=8) (actual time=31.555..262.307 rows=10 loops=1) Workers Planned: 1 Workers Launched: 1 -> Nested Loop Anti Join (cost=0.42..33409.22 rows=6 width=8) (actual time=43.314..257.792 rows=5 loops=2) -> Parallel Seq Scan on t_left l (cost=0.00..3619.24 rows=58824 width=8) (actual time=0.007..10.066 rows=50000 loops=2) -> Index Only Scan using ix_right_value on t_right r (cost=0.42..3.53 rows=100 width=4) (actual time=0.005..0.005 rows=1 loops=100000) Index Cond: (value = l.value) Heap Fetches: 99990 Planning Time: 0.328 ms Execution Time: 262.348 ms (10 rows) postgres=#
LEFT JOIN / IS NULL
postgres=# SELECT l.id, l.value postgres-# FROM t_left l postgres-# WHERE NOT EXISTS postgres-# ( postgres(# SELECT value postgres(# FROM t_right r postgres(# WHERE r.value = l.value postgres(# ); id | value --------+------- 80000 | 0 10000 | 0 20000 | 0 30000 | 0 40000 | 0 50000 | 0 60000 | 0 70000 | 0 90000 | 0 100000 | 0 (10 rows) postgres=# explain analyze SELECT l.id, l.value postgres-# FROM t_left l postgres-# WHERE NOT EXISTS postgres-# ( postgres(# SELECT value postgres(# FROM t_right r postgres(# WHERE r.value = l.value postgres(# ); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------- Gather (cost=1000.42..34410.22 rows=10 width=8) (actual time=32.109..259.553 rows=10 loops=1) Workers Planned: 1 Workers Launched: 1 -> Nested Loop Anti Join (cost=0.42..33409.22 rows=6 width=8) (actual time=43.744..255.254 rows=5 loops=2) -> Parallel Seq Scan on t_left l (cost=0.00..3619.24 rows=58824 width=8) (actual time=0.008..9.777 rows=50000 loops=2) -> Index Only Scan using ix_right_value on t_right r (cost=0.42..3.53 rows=100 width=4) (actual time=0.005..0.005 rows=1 loops=100000) Index Cond: (value = l.value) Heap Fetches: 99990 Planning Time: 0.396 ms Execution Time: 259.608 ms (10 rows) postgres=#
postgresql按照相同的方式对待left join和not exists,使用相同的执行计划(nested loop anti join)。
至于NOT IN,这在语义上是不同的, PostgreSQL试图考虑这一点,并限制自己对子计划使用过滤器。