摘要:如果可以将外连接转换为内连接,那么就可以简化查询优化过程。
本文分享自华为云社区《GaussDB(DWS)外连接向内连接的转换》,作者:wangxiaojuan8。
在查询优化的过程中,内连接的表之间的连接顺序可以随意交换,where或on条件中只涉及单表的条件可以下推到表上作为表的过滤条件;而对于外连接来说,表的连接顺序不能随意交换,约束条件也不能随意的下推。如果可以将外连接转换为内连接,那么就可以简化查询优化过程。
外连接可转为内连接需满足的条件
为了描述方便,引入两个名词:
- 不空侧:外连接中所有数据都被输出的一侧。比如:左外连接的左表、右外连接的右表
- 可空侧:外连接中会被补空值的一侧。比如:左外连接的右表、右外连接的左表、全外连接的左表和右表
只要满足以下条件之一,就可以将外连接转换为内连接:
- Where条件中有“严格”的约束条件,且该约束条件中引用了可空侧的表中列。这样,该谓词便可以将可空侧产生的空值都过滤掉了,使得最终结果等同于内连接。
参考《PostgreSQL技术内幕-查询优化深度探索》一书中对“严格”的定义如下:
“严格”的精确定义是对于一个函数、操作符或者表达式,如果输入参数是NULL值,那么输出也一定是NULL值,就可以说这个函数、操作符或者表达式是严格的;但是宽泛的说,对于函数、操作符或者表达式,如果输入参数是NULL值,输出结果是NULL值或者FALSE,那么就认为这个函数或者操作符是严格的。如果在约束条件里有这种严格的操作符、函数或者表达式,由于输入是NULL值,输出是NULL或者FALSE,那么对于含有NULL值的元组就会被过滤掉。
本文沿用GuassDB(DWS)的连接类型介绍一文中的表和数据进行举例说明。
例1:查询数学成绩不为空的学生的数学成绩
postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null; id | name | score ----+-------+------- 1 | Tom | 80 2 | Lily | 75 4 | Perry | 95 (3 rows) postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null; QUERY PLAN ---------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+----------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) | 30 | 126 | 36.59 2 | -> Hash Join (3,4) | 30 | 126 | 28.59 3 | -> Seq Scan on student s | 30 | 122 | 14.14 4 | -> Hash | 29 | 8 | 14.14 5 | -> Seq Scan on math_score ms | 30 | 8 | 14.14 Predicate Information (identified by plan id) --------------------------------------------- 2 --Hash Join (3,4) Hash Cond: (s.id = ms.id) 5 --Seq Scan on math_score ms Filter: (score IS NOT NULL) (14 rows)
例2:查询数学成绩高于80的学生的数学成绩
postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80; id | name | score ----+-------+------- 4 | Perry | 95 (1 row) postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80; QUERY PLAN ---------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+----------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) | 10 | 126 | 36.44 2 | -> Hash Join (3,4) | 10 | 126 | 28.44 3 | -> Seq Scan on student s | 30 | 122 | 14.14 4 | -> Hash | 10 | 8 | 14.18 5 | -> Seq Scan on math_score ms | 10 | 8 | 14.18 Predicate Information (identified by plan id) --------------------------------------------- 2 --Hash Join (3,4) Hash Cond: (s.id = ms.id) 5 --Seq Scan on math_score ms Filter: (score > 80) (14 rows)
上面两个例子中,条件where ms.score is not null和where ms.score > 80,如果输入的score为NULL,则这个约束条件返回的是false,满足了宽泛的“严格”定义。所以可以将外连接消除,转换为内连接。从上面的查询计划也得到了验证。而且这种外连接消除是可以有数据库的查询优化器来自动处理的。
- On连接条件中,如果不空侧列中的值是可空侧列的子集,且可空侧的值都不为NULL。典型的,不空侧的列为外键,可空侧的列为主键,且两者之间是主外键参考关系。
CREATE TABLE student( id INTEGER primary key, name varchar(50) ); CREATE TABLE math_score( id INTEGER, -- 由于GaussDB(DWS)不支持外键,故此处省去了外键定义,但保证该列的值是student表中id列的子集 score INTEGER ); INSERT INTO student VALUES(1, 'Tom'); INSERT INTO student VALUES(2, 'Lily'); INSERT INTO student VALUES(3, 'Tina'); INSERT INTO student VALUES(4, 'Perry'); INSERT INTO math_score VALUES(1, 80); INSERT INTO math_score VALUES(2, 75); INSERT INTO math_score VALUES(4, 95);
那么下面这个外连接则和内连接的结果相同:
postgres=# select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id); id | name | score ----+-------+------- 1 | Tom | 80 2 | Lily | 75 4 | Perry | 95 (3 rows) postgres=# explain select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id); QUERY PLAN ------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+-------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) | 30 | 126 | 36.59 2 | -> Hash Left Join (3, 4) | 30 | 126 | 28.59 3 | -> Seq Scan on math_score ms | 30 | 8 | 14.14 4 | -> Hash | 29 | 122 | 14.14 5 | -> Seq Scan on student s | 30 | 122 | 14.14 Predicate Information (identified by plan id) --------------------------------------------- 2 --Hash Left Join (3, 4) Hash Cond: (ms.id = s.id) (12 rows) postgres=# select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id); id | name | score ----+-------+------- 1 | Tom | 80 2 | Lily | 75 4 | Perry | 95 (3 rows) postgres=# explain select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id); QUERY PLAN ---------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+----------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) | 30 | 126 | 36.59 2 | -> Hash Join (3,4) | 30 | 126 | 28.59 3 | -> Seq Scan on student s | 30 | 122 | 14.14 4 | -> Hash | 29 | 8 | 14.14 5 | -> Seq Scan on math_score ms | 30 | 8 | 14.14 Predicate Information (identified by plan id) --------------------------------------------- 2 --Hash Join (3,4) Hash Cond: (s.id = ms.id) (12 rows)
由于GaussDB(DWS)不支持外键,所以这种可将外连接消除转换为内连接的条件无法被优化器识别而进行自动转换,但是可帮助开发者在日常编写SQL时加以人工识别,手工消除外连接。
一个有趣的改写示例
有如下的一个用例:
Select count(1) from student s left join math_score ms on (s.id = ms.id) where s.id = 2 and ms.score > 70; postgres=# Select count(1) postgres-# from student s left join math_score ms on (s.id = ms.id) postgres-# where s.id = 2 postgres-# and ms.score > 70; count ------- 0 (1 row) postgres=# explain Select count(1) postgres-# from student s left join math_score ms on (s.id = ms.id) postgres-# where s.id = 2 postgres-# and ms.score > 70; QUERY PLAN ----------------------------------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+-----------------------------------------------------------------+--------+---------+--------- 1 | -> Aggregate | 1 | 8 | 26.51 2 | -> Streaming (type: GATHER) | 1 | 8 | 26.51 3 | -> Aggregate | 1 | 8 | 22.51 4 | -> Nested Loop (5,6) | 3 | 0 | 22.49 5 | -> Index Only Scan using student_pkey on student s | 1 | 4 | 8.27 6 | -> Seq Scan on math_score ms | 1 | 4 | 14.21 Predicate Information (identified by plan id) ----------------------------------------------------- 5 --Index Only Scan using student_pkey on student s Index Cond: (id = 2) 6 --Seq Scan on math_score ms Filter: ((score > 70) AND (id = 2)) (15 rows)
从上面的计划可见,sql中的左外连接已经被优化为交叉连接,因为在4号算子Nest Loop上没有join条件。
此查询最终是要计算两个表连接结果的总行数。对于交叉连接来说,两表连接的行数等于左表行数和右表行数的乘积。所以此查询可以修改为如下等价查询:
explain select lcount * rcount as count from (select count(1) lcount from student where id = 2) s, (select count(1) rcount from math_score where score > 70 and id = 2) ms; postgres=# select lcount * rcount as count postgres-# from (select count(1) lcount from student where id = 2) s, postgres-# (select count(1) rcount from math_score where score > 70 and id = 2) ms; count ------- 1 (1 row) postgres=# explain select lcount * rcount as count postgres-# from (select count(1) lcount from student where id = 2) s, postgres-# (select count(1) rcount from math_score where score > 70 and id = 2) ms; QUERY PLAN ------------------------------------------------------------------------------------------------------ id | operation | E-rows | E-width | E-costs ----+------------------------------------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) | 1 | 16 | 26.56 2 | -> Nested Loop (3,7) | 1 | 16 | 22.56 3 | -> Aggregate | 1 | 8 | 8.29 4 | -> Streaming(type: BROADCAST) | 1 | 8 | 8.29 5 | -> Aggregate | 1 | 8 | 8.28 6 | -> Index Only Scan using student_pkey on student | 1 | 0 | 8.27 7 | -> Materialize | 1 | 8 | 14.25 8 | -> Aggregate | 1 | 8 | 14.23 9 | -> Streaming(type: BROADCAST) | 1 | 8 | 14.23 10 | -> Aggregate | 1 | 8 | 14.22 11 | -> Seq Scan on math_score | 1 | 0 | 14.21 Predicate Information (identified by plan id) --------------------------------------------------- 6 --Index Only Scan using student_pkey on student Index Cond: (id = 2) 11 --Seq Scan on math_score Filter: ((score > 70) AND (id = 2)) (20 rows)
通过这种改写,可以将聚集操作推到Nested Loop的每个子树中执行,当Nested Loop的每个子树的数据量比较大时,聚集可以大大降低结果集,减少参与join的数据量,从而提高性能。
下面例子留作思考:
Select sum(score) From student s left join math_score ms on (s.id = ms.id) Where s.id = 2 And ms.score > 70;