An SQL query walks into a bar and sees two tables. He walks up to them and asks ‘Can I join you?’
工作中常用Postgresql的EXPLAIN
命令对慢查询进行分析,对QUERY PLAN中JOIN相关的节点一直都是一知半解,本文将对JOIN操作(主要指inner equi-join,$\bowtie$
)做一个梳理。
负责选择IO开销最低的JOIN算法,并给上层的节点的输出数据。
数据输出有两种方式:
JOIN算法大致分三种 1. Nested Loop Join 2. Sort Merge Join 3. Hash Join
为方便分析作出如下定义:
$R$
有$M$
页,$m$
条数据$S$
有$N$
页,$n$
条数据1for r in R: # outer
2 for s in S: # inner
3 if not match(r, s):
4 continue
5 yield r, s
开销: $cost = M + m\times N$
结论: 应该让条数少的表作为outer表。
1for block_r in R:
2 for block_s in S:
3 for r in block_r:
4 for s in block_s:
5 if not match(r, s):
6 continue
7 yield r, s
开销: $cost = M + M\times N$
结论: 应该让页数少的表作为outer表。
如果有B
Buffer可以同时读取多页
1for blocks_r in chunked(R, B - 2):
2 for block_s in S:
3 for block_r in blocks_r:
4 for r in block_r:
5 for s in block_s:
6 if not match(r, s):
7 continue
8 yield r, s
开销: $cost = \left\lceil\frac {M}{B-2}\right\rceil \times N$
结论:当数据可以完全放到内存里时开销$M + N$
对于inner的查询,我们还可以做一些优化,比如对Join的字段建索引,可以是提前建好,也可以是运行时建立临时的Hash Table或者B+Tree
1for r in R: # outer
2 for s in index(r, S): # inner
3 if not match(r, s):
4 continue
5 yield r, s
下面看一个具体的例子。
1select * from department join employee on department.departmentid = employee.departmentid;
outer表为department
,先Seq Scan
department
,对于每个departmentid
再对department
进行Index Scan
test=# explain analyze select * from department join employee on department.departmentid = employee.departmentid;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.15..217.68 rows=900 width=124) (actual time=0.025..0.036 rows=5 loops=1)
-> Seq Scan on employee (cost=0.00..19.00 rows=900 width=62) (actual time=0.010..0.011 rows=6 loops=1)
-> Index Scan using department_pkey on department (cost=0.15..0.22 rows=1 width=62) (actual time=0.003..0.003 rows=1 loops=6)
Index Cond: (departmentid = employee.departmentid)
Planning Time: 0.117 ms
Execution Time: 0.070 ms
(6 rows)