刚开始工作的开发,很容易过度的使用in、not in。其实,在有些时候可以用exists、not exists来代替in和not in,实现查询性能的提升。
exists操作符时候会和in操作符产生混淆。因为他们都是用于检查某个属性是否包含在某个集合中。但是相同点也仅限于此。
exists的真正目的是检查子查询是否至少包含一条记录。例如,下面的查询会返回行1和2:
WITH numbers (nr) AS ( SELECT 1 AS nr UNION ALL SELECT 2 AS nr UNION ALL SELECT 3 AS nr ), letters (letter, nr) AS ( SELECT 'A' AS letter, 1 AS nr UNION ALL SELECT 'B' AS letter, 2 AS nr ) SELECT * FROM numbers n WHERE EXISTS (SELECT nr FROM letters WHERE nr= n.nr);
当然,你也可以改写成in:
WITH numbers (nr) AS ( SELECT 1 AS nr UNION ALL SELECT 2 AS nr UNION ALL SELECT 3 AS nr ), letters (letter, nr) AS ( SELECT 'A' AS letter, 1 AS nr UNION ALL SELECT 'B' AS letter, 2 AS nr ) SELECT * FROM numbers n WHERE n.nr IN (SELECT nr FROM letters);
这两种写法,都可以返回相同的记录。区别是exists会更快,因为在得到第一条满足条件的记录之后就会停止,而in会查询所有的记录(如果in返回很多行的话)。但是,要注意另一件事,exists使用的关联子查询,它可以使用left join加上一个NULL检查来替换。
执行计划
not in操作符的执行计划:
=# EXPLAIN ANALYZE SELECT * FROM products WHERE id NOT IN (SELECT product_id FROM bought_products); QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Seq Scan on products (cost=38.25..50.50 rows=90 width=422) (actual time=0.069..0.072 rows=5 loops=1) Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 2 SubPlan 1 -> Seq Scan on bought_products (cost=0.00..32.60 rows=2260 width=4) (actual time=0.006..0.009 rows=3 loops=1) Planning Time: 0.180 ms Execution Time: 0.130 ms (7 rows)
从执行计划可以看到,因为没有用到任何索引,对表bought_products先走的是全表扫描,然后在参与过滤。
not exists的执行计划:
=# EXPLAIN ANALYZE SELECT * FROM products WHERE NOT EXISTS (SELECT product_id FROM bought_products WHERE id = product_id); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Anti Join (cost=0.15..47.80 rows=90 width=422) (actual time=0.038..0.045 rows=5 loops=1) -> Seq Scan on products (cost=0.00..11.80 rows=180 width=422) (actual time=0.012..0.014 rows=7 loops=1) -> Index Only Scan using bought_products_pkey on bought_products (cost=0.15..0.61 rows=11 width=4) (actual time=0.003..0.003 rows=0 loops=7) Index Cond: (product_id = products.id) Heap Fetches: 2 Planning Time: 0.177 ms Execution Time: 0.080 ms (7 rows)
从这两个执行计划来看,最大的区别是执行时间的不同(0.130 ms vs 0.080 ms),计划时间差不多。
所以,当你有些in查询时,可以看看是否可以使用exists来替换。
再来看看exists的执行计划:
=# EXPLAIN ANALYZE SELECT * FROM products WHERE EXISTS (SELECT product_id FROM bought_products WHERE id = product_id); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Semi Join (cost=0.15..47.80 rows=90 width=422) (actual time=0.030..0.046 rows=2 loops=1) -> Seq Scan on products (cost=0.00..11.80 rows=180 width=422) (actual time=0.011..0.013 rows=7 loops=1) -> Index Only Scan using bought_products_pkey on bought_products (cost=0.15..0.61 rows=11 width=4) (actual time=0.003..0.003 rows=0 loops=7) Index Cond: (product_id = products.id) Heap Fetches: 2 Planning Time: 0.261 ms Execution Time: 0.084 ms (7 rows)
如果不是关联子查询呢?
=# EXPLAIN ANALYZE SELECT * FROM products WHERE EXISTS (SELECT product_id FROM bought_products); QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Result (cost=0.01..11.81 rows=180 width=422) (actual time=0.013..0.015 rows=7 loops=1) One-Time Filter: $0 InitPlan 1 (returns $0) -> Seq Scan on bought_products (cost=0.00..32.60 rows=2260 width=0) (actual time=0.007..0.007 rows=1 loops=1) -> Seq Scan on products (cost=0.01..11.81 rows=180 width=422) (actual time=0.002..0.004 rows=7 loops=1) Planning Time: 0.077 ms Execution Time: 0.036 ms (7 rows)
这次,子查询使用了顺序扫描,但是计划时间只用了0.077ms。它将只返回1行,因为它是EXISTS的语义。
尽管exists和in看起来类似,但是他们是不同的。exists在子查询返回第一条满足条件的记录后就会停止。如果你的查询很慢,将in换成exists可能会提升很多性能。
此外,exists可以用left join加上null检查来取代。