What Is the Difference Between HashAggregate and GroupAggregate in PostgreSQL

This article looks at the difference between HashAggregate and GroupAggregate in PostgreSQL through two test cases, followed by a short description of how each strategy works.

Case 1

Let's start with an example.
Test table:

drop table if exists t_agg;
create table t_agg(bh varchar(20),c1 int,c2 int,c3 int,c4 int,c5 int,c6 int);
insert into t_agg select 'GZ01',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ02',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ03',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ04',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ05',col,col,col,col,col,col from generate_series(1,100000) as col;

Run the query:

testdb=# -- disable parallel query
testdb=# set max_parallel_workers_per_gather=0;
testdb=# explain verbose select bh,min(c1),max(c1),min(c2),max(c2),min(c3),max(c3),min(c4),max(c4),min(c5),max(c5) from t_agg group by bh;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22427.00..22427.05 rows=5 width=45)
   Output: bh, min(c1), max(c1), min(c2), max(c2), min(c3), max(c3), min(c4), max(c4), min(c5), max(c5)
   Group Key: t_agg.bh
   ->  Seq Scan on public.t_agg  (cost=0.00..8677.00 rows=500000 width=25)
         Output: bh, c1, c2, c3, c4, c5, c6
(5 rows)

The PostgreSQL optimizer chose HashAggregate.
Next, disable HashAggregate so that the optimizer has to fall back to GroupAggregate. Comparing the estimated total costs of the two plans: 22427.05 vs 82968.97.

testdb=# set enable_hashagg = off;
testdb=# explain verbose select bh,min(c1),max(c1),min(c2),max(c2),min(c3),max(c3),min(c4),max(c4),min(c5),max(c5) from t_agg group by bh;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=67968.92..82968.97 rows=5 width=45)
   Output: bh, min(c1), max(c1), min(c2), max(c2), min(c3), max(c3), min(c4), max(c4), min(c5), max(c5)
   Group Key: t_agg.bh
   ->  Sort  (cost=67968.92..69218.92 rows=500000 width=25)
         Output: bh, c1, c2, c3, c4, c5
         Sort Key: t_agg.bh
         ->  Seq Scan on public.t_agg  (cost=0.00..8677.00 rows=500000 width=25)
               Output: bh, c1, c2, c3, c4, c5
(8 rows)
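
The cost figures in the two plans above can be reproduced by hand. The following Python sketch is a simplified reading of how the planner appears to cost an aggregate node. It assumes the default settings cpu_operator_cost = 0.0025 and cpu_tuple_cost = 0.01, and that each min/max transition costs one cpu_operator_cost. The function name agg_total_cost and its parameters are made up for illustration and are not part of PostgreSQL.

```python
# Simplified reading of the planner's aggregate costing, assuming default
# cpu_operator_cost and cpu_tuple_cost. Illustrative only, not PostgreSQL code.
CPU_OPERATOR_COST = 0.0025  # PostgreSQL default
CPU_TUPLE_COST = 0.01       # PostgreSQL default


def agg_total_cost(input_total_cost, input_rows, group_cols, aggregates, groups):
    """Estimated total cost of an aggregate node on top of its input node."""
    # Comparing or hashing the grouping columns of every input row.
    grouping = CPU_OPERATOR_COST * group_cols * input_rows
    # One transition-function call per aggregate per input row
    # (assumed to cost one cpu_operator_cost each, as for min/max on int).
    transitions = CPU_OPERATOR_COST * aggregates * input_rows
    # Emitting one result row per group.
    output = CPU_TUPLE_COST * groups
    return input_total_cost + grouping + transitions + output


# Case 1: 500000 rows, 1 grouping column, 10 aggregates, 5 estimated groups.
hash_cost = agg_total_cost(8677.00, 500000, 1, 10, 5)   # input node: Seq Scan
sort_cost = agg_total_cost(69218.92, 500000, 1, 10, 5)  # input node: Sort

print(f"HashAggregate  total = {hash_cost:.2f}")   # 22427.05
print(f"GroupAggregate total = {sort_cost:.2f}")   # 82968.97
print(f"Extra cost of the Sort = {69218.92 - 8677.00:.2f}")
```

Under these assumptions both strategies pay the same amount for grouping and aggregation; the entire gap between the two totals comes from the Sort node that GroupAggregate needs as input.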

Case 2
Now test with a wide table: few distinct grouping-key values, but many aggregated columns.

drop table if exists t_agg_width;
create table t_agg_width
(bh varchar(20)
,c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int
,c11 int,c12 int,c13 int,c14 int,c15 int,c16 int,c17 int,c18 int,c19 int
,c21 int,c22 int,c23 int,c24 int,c25 int,c26 int,c27 int,c28 int,c29 int
,c31 int,c32 int,c33 int,c34 int,c35 int,c36 int,c37 int,c38 int,c39 int);
insert into t_agg_width 
select 'GZ01'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ02'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ03'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ04'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
-- disable hashagg
set enable_hashagg = off;
-- disable parallel query
set max_parallel_workers_per_gather=0;
select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;

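In this case, too, the optimizer chooses HashAggregate as long as enable_hashagg is on (the first plan below); with it turned off, it falls back to GroupAggregate: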

testdb=# explain verbose select bh
testdb-# ,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
testdb-# ,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
testdb-# ,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
testdb-# ,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
testdb-# from t_agg_width group by bh;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=49889.00..49889.04 rows=4 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13), min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Seq Scan on public.t_agg_width  (cost=0.00..12889.00 rows=400000 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(5 rows)
testdb=# set enable_hashagg = off;
testdb=# explain verbose select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=110266.28..148266.32 rows=4 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13), min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Sort  (cost=110266.28..111266.28 rows=400000 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
         Sort Key: t_agg_width.bh
         ->  Seq Scan on public.t_agg_width  (cost=0.00..12889.00 rows=400000 width=149)
               Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(8 rows)
testdb=#

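Next, greatly increase the number of distinct grouping-key values, and at the same time change the value distribution of c1 and the other columns, then test again: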

testdb=# insert into t_agg_width 
testdb-# select 'GZ'||col
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# from generate_series(1,1000000) as col;
INSERT 0 1000000
testdb=# set enable_hashagg = on;
testdb=# explain verbose select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=440012.46..586553.52 rows=7414 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13), min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Sort  (cost=440012.46..443866.86 rows=1541757 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
         Sort Key: t_agg_width.bh
         ->  Seq Scan on public.t_agg_width  (cost=0.00..49681.57 rows=1541757 width=149)
               Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(8 rows)
testdb=#

This time GroupAggregate is chosen, even though enable_hashagg is on.

HashAggregate
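With HashAggregate, the database computes a hash value from the GROUP BY columns and maintains a hash table in memory, with one entry per distinct grouping key. If the SELECT list contains n aggregate functions, each entry holds n running aggregate states. This approach uses more memory than GroupAggregate: memory use grows with the number of distinct values of the GROUP BY column and with the number of aggregated columns.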
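
To make the memory behaviour concrete, here is a minimal Python sketch of hash-based aggregation. It is not PostgreSQL's implementation; the function name hash_aggregate and the (column_index, function) spec format are hypothetical. It only shows that every distinct key stays in memory, together with its running states, until the input is exhausted.

```python
def hash_aggregate(rows, key_index, agg_specs):
    """Group rows with an in-memory hash table (a dict).

    agg_specs is a list of (column_index, function) pairs, e.g. (1, min).
    Illustrative sketch only, not PostgreSQL's actual code.
    """
    table = {}  # group key -> list of running states, one per aggregate
    for row in rows:
        key = row[key_index]
        states = table.get(key)
        if states is None:
            # First row of a new group: initialise every aggregate state.
            table[key] = [row[col] for col, _ in agg_specs]
        else:
            # Existing group: fold this row into each running state.
            for i, (col, func) in enumerate(agg_specs):
                states[i] = func(states[i], row[col])
    return table


# Unsorted input is fine: rows of the same group do not need to be adjacent.
rows = [("GZ02", 7, 70), ("GZ01", 3, 30), ("GZ02", 1, 10), ("GZ01", 9, 90)]
hash_result = hash_aggregate(rows, 0, [(1, min), (1, max), (2, min)])
print(hash_result)
```

The dictionary ends up with one entry per distinct key and one state per aggregate in each entry, which is why memory use scales with both quantities.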

GroupAggregate
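With GroupAggregate, the database first sorts the rows by the GROUP BY columns and then scans the sorted data once to compute the aggregates. Because a sort has to run first, the computational complexity is higher than for HashAggregate. The advantage is that the aggregation step does not depend on the number of distinct grouping-key values or on the number of aggregated columns, since only the state of the current group has to be kept at any time. When there are many grouping-key values, many aggregated columns, and highly selective column data, it can outperform HashAggregate.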
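
The sort-then-scan approach can be sketched in Python as follows. Again this is only an illustration under simplifying assumptions (the whole input fits in a list, and the hypothetical group_aggregate function takes the same (column_index, function) pairs as a spec); it is not how PostgreSQL is implemented internally.

```python
def group_aggregate(rows, key_index, agg_specs):
    """Sort by the grouping key, then aggregate in one sequential pass.

    agg_specs is a list of (column_index, function) pairs, e.g. (1, min).
    Illustrative sketch only, not PostgreSQL's actual code.
    """
    results = []
    current_key = None
    states = None  # running states for the current group only
    for row in sorted(rows, key=lambda r: r[key_index]):
        key = row[key_index]
        if states is None or key != current_key:
            # Key changed: the previous group is complete and can be emitted.
            if states is not None:
                results.append((current_key, states))
            current_key = key
            states = [row[col] for col, _ in agg_specs]
        else:
            for i, (col, func) in enumerate(agg_specs):
                states[i] = func(states[i], row[col])
    if states is not None:
        results.append((current_key, states))  # emit the last group
    return results


rows = [("GZ02", 7, 70), ("GZ01", 3, 30), ("GZ02", 1, 10), ("GZ01", 9, 90)]
sorted_result = group_aggregate(rows, 0, [(1, min), (1, max), (2, min)])
print(sorted_result)
```

Apart from the sort itself, the scan keeps a single group's states at a time, so its working memory stays the same whether the input has five groups or a million.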
