Join Operators

PUBLISHED ON MAR 29, 2020 — DATABASE

An SQL query walks into a bar and sees two tables. He walks up to them and asks ‘Can I join you?’

工作中常用Postgresql的EXPLAIN命令对慢查询进行分析,对QUERY PLAN中JOIN相关的节点一直都是一知半解,本文将对JOIN操作(主要指inner equi-join,$\bowtie$)做一个梳理。

Join Operator

负责选择IO开销最低的JOIN算法,并给上层的节点的输出数据。

数据输出有两种方式:

  1. 将outer和inner的tuple(元组)中取出来组成新的tuple,返回给上层节点。后续的操作不需要再到原表里取数据。
  2. 返回join字段和record id,上层节点通过record id取需要的字段。只取需要的数据,适用于列存储。

Join Algorithms

JOIN算法大致分三种 1. Nested Loop Join 2. Sort Merge Join 3. Hash Join

为方便分析作出如下定义:

  1. $R$$M$页,$m$条数据
  2. $S$$N$页,$n$条数据

Nested Loop

Stupid Nested Loop

1for r in R:                   # outer
2    for s in S:               # inner
3        if not match(r, s):
4            continue
5        yield r, s 

开销: $cost = M + m\times N$

结论: 应该让条数少的表作为outer表。

Block Nested Loop

1for block_r in R:                 
2    for block_s in S:  
3        for r in block_r:
4            for s in block_s:             
5                if not match(r, s):
6                    continue
7                yield r, s 

开销: $cost = M + M\times N$

结论: 应该让页数少的表作为outer表。

如果有B Buffer可以同时读取多页

1for blocks_r in chunked(R, B - 2):                 
2    for block_s in S:  
3        for block_r in blocks_r:
4            for r in block_r:
5                for s in block_s:             
6                    if not match(r, s):
7                        continue
8                    yield r, s 

开销: $cost = \left\lceil\frac {M}{B-2}\right\rceil \times N$

结论:当数据可以完全放到内存里时开销$M + N$

对于inner的查询,我们还可以做一些优化,比如对Join的字段建索引,可以是提前建好,也可以是运行时建立临时的Hash Table或者B+Tree

1for r in R:                   # outer
2    for s in index(r, S):     # inner
3        if not match(r, s):
4            continue
5        yield r, s

下面看一个具体的例子。

1select * from department join employee on department.departmentid = employee.departmentid;

outer表为department,先Seq Scan department,对于每个departmentid再对department进行Index Scan

test=# explain analyze select * from department join employee on department.departmentid = employee.departmentid;
                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=0.15..217.68 rows=900 width=124) (actual time=0.025..0.036 rows=5 loops=1)
  ->  Seq Scan on employee  (cost=0.00..19.00 rows=900 width=62) (actual time=0.010..0.011 rows=6 loops=1)
  ->  Index Scan using department_pkey on department  (cost=0.15..0.22 rows=1 width=62) (actual time=0.003..0.003 rows=1 loops=6)
        Index Cond: (departmentid = employee.departmentid)
Planning Time: 0.117 ms
Execution Time: 0.070 ms
(6 rows)

Sort Merge JOIN

Hash JOIN

参考

  1. The Join Operation
  2. CMU 15-445/645 (FALL 2019) DATABASE SYSTEMS
  3. Database System Concepts Seventh Edition
comments powered by Disqus