What Is the GIN Index in PostgreSQL For?

This article explains what the GIN index in PostgreSQL is used for and walks through a small worked example.

The main use of a GIN index is to speed up full-text search.

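Full-text search
The goal of full-text search is to find, within a set of documents, those that match a search condition. A search engine that finds many matching documents has to pick out the best matches; a database query only needs to find the documents that satisfy the condition.

In PostgreSQL, a document is converted for searching into a special type, tsvector, which holds lexemes and their positions in the document. Lexemes are words normalized into a form suitable for searching (for example, reduced to their stems). For example: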

testdb=# select to_tsvector('There was a crooked man, and he walked a crooked mile');
               to_tsvector               
-----------------------------------------
 'crook':4,10 'man':5 'mile':11 'walk':8
(1 row)

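After normalization the example yields the lexemes crook, man, mile, and walk, at positions 4 and 10, 5, 11, and 8 respectively. Words such as "there" are dropped because they are stop words: from a search engine's point of view they are too common to be worth recording. The list of stop words is configurable.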
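To make the idea of lexemes and positions concrete, here is a minimal Python sketch of the conversion. It is a toy model, not PostgreSQL's parser or dictionaries: the stop-word set and the stem table are hypothetical and cover only this one sentence.

```python
import re

# Hypothetical stop words and stems, just enough for this one sentence.
# PostgreSQL takes both from the configured text search dictionaries.
STOP_WORDS = {"there", "was", "a", "and", "he"}
STEMS = {"crooked": "crook", "walked": "walk"}

def toy_tsvector(text):
    """Map each lexeme to the 1-based positions of the words that produced it."""
    vector = {}
    words = re.findall(r"[a-z]+", text.lower())
    for position, word in enumerate(words, start=1):
        if word in STOP_WORDS:
            continue  # a stop word is dropped but still occupies a position
        lexeme = STEMS.get(word, word)
        vector.setdefault(lexeme, []).append(position)
    return dict(sorted(vector.items()))

vec = toy_tsvector("There was a crooked man, and he walked a crooked mile")
print(vec)
```

Because stop words are skipped without renumbering, "crooked" lands at positions 4 and 10, the same positions that to_tsvector reports.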

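A full-text query is represented by the tsquery type. It consists of one or more lexemes joined by the operators and (&), or (|), and not (!), with parentheses to make the precedence explicit.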

testdb=# select to_tsquery('man & (walking | running)');
         to_tsquery         
----------------------------
 'man' & ( 'walk' | 'run' )
(1 row)

The @@ operator performs the full-text match:

testdb=# select to_tsvector('There was a crooked man, and he walked a crooked mile') @@ to_tsquery('man & (walking | running)');
 ?column? 
----------
 t
(1 row)
testdb=# select to_tsvector('There was a crooked man, and he walked a crooked mile') @@ to_tsquery('man & (going | running)');
 ?column? 
----------
 f
(1 row)
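The two results above can be reproduced with a small Python sketch that treats a document as a set of lexemes and a query as a nested tuple. This is only an illustration of the boolean logic; the tuple representation is an assumption of this sketch, and the lexeme "go" for "going" is assumed to be what the English configuration would produce.

```python
# Toy model of the @@ match: NOT PostgreSQL's implementation.
# A document is reduced to its set of lexemes; a query is a nested tuple.

def matches(lexemes, query):
    """Return True if the set of lexemes satisfies the boolean query."""
    if isinstance(query, str):  # a bare lexeme
        return query in lexemes
    op, *args = query
    if op == "and":
        return all(matches(lexemes, arg) for arg in args)
    if op == "or":
        return any(matches(lexemes, arg) for arg in args)
    if op == "not":
        return not matches(lexemes, args[0])
    raise ValueError(f"unknown operator: {op}")

# Lexemes taken from the to_tsvector output shown earlier.
doc = {"crook", "man", "mile", "walk"}

q1 = ("and", "man", ("or", "walk", "run"))  # man & (walking | running)
q2 = ("and", "man", ("or", "go", "run"))    # man & (going | running)

print(matches(doc, q1))
print(matches(doc, q2))
```

The first query matches because the document contains both man and walk; the second fails because it contains neither go nor run.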

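Introduction to GIN
GIN stands for Generalized Inverted Index. The concept will be familiar to anyone who knows how search engines work. GIN operates on data types whose values are not atomic but consist of elements; such types are called compound types. What gets indexed is not the values themselves but the individual elements they contain.
A good analogy is the index at the back of a book, which lists, for each term, the pages on which the term occurs. The access method (AM) must make lookups of indexed elements fast, so the elements are stored in a B-tree-like structure, and each element is linked to an ordered set of references to the table rows whose compound values contain that element. The ordering does not matter for data retrieval (the sort order of TIDs carries no meaning), but it matters for the internal structure of the index.
Elements are never deleted from a GIN index. The assumption is that values containing the elements may disappear, appear, or change, but the set of elements they are made of stays mostly stable. This design greatly simplifies the algorithms that let several processes work with the index concurrently.

If the list of TIDs is small, it is stored in the same page as the element (a posting list). If the list is large, a more efficient structure is needed, so it is stored as a B-tree on separate data pages (a posting tree).
A GIN index therefore consists of a B-tree of elements, with B-trees or plain lists of TIDs linked to the leaf rows of that B-tree.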

Like the GiST and SP-GiST indexes, GIN gives application developers an interface for supporting various operations on compound data types.

As an example, create a table ts and build a GIN index on it:

testdb=# drop table if exists ts;
psql: NOTICE:  table "ts" does not exist, skipping
DROP TABLE
testdb=# create table ts(doc text, doc_tsv tsvector);
CREATE TABLE
testdb=# truncate table ts;
TRUNCATE TABLE
testdb=# insert into ts(doc) values
testdb-# ('Can a sheet slitter slit sheets?'),
testdb-# ('How many sheets could a sheet slitter slit?'),
testdb-# ('I slit a sheet, a sheet I slit.'),
testdb-# ('Upon a slitted sheet I sit.'),
testdb-# ('Whoever slit the sheets is a good sheet slitter.'),
testdb-# ('I am a sheet slitter.'),
testdb-# ('I slit sheets.'),
testdb-# ('I am the sleekest sheet slitter that ever slit sheets.'),
testdb-# ('She slits the sheet she sits on.');
INSERT 0 9
testdb=# update ts set doc_tsv = to_tsvector(doc);
UPDATE 9
testdb=# create index on ts using gin(doc_tsv);
CREATE INDEX

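Here, references to TIDs are shown as values on a dark background (page number plus offset within the page) rather than as arrows.
Unlike an ordinary B-tree, the pages of a GIN index are connected by a singly linked list rather than a doubly linked one, because the tree is only ever traversed in one direction.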

testdb=# select ctid, left(doc,20), doc_tsv from ts;
  ctid  |         left         |                         doc_tsv                         
--------+----------------------+---------------------------------------------------------
 (0,10) | Can a sheet slitter  | 'sheet':3,6 'slit':5 'slitter':4
 (0,11) | How many sheets coul | 'could':4 'mani':2 'sheet':3,6 'slit':8 'slitter':7
 (0,12) | I slit a sheet, a sh | 'sheet':4,6 'slit':2,8
 (0,13) | Upon a slitted sheet | 'sheet':4 'sit':6 'slit':3 'upon':1
 (0,14) | Whoever slit the she | 'good':7 'sheet':4,8 'slit':2 'slitter':9 'whoever':1
 (0,15) | I am a sheet slitter | 'sheet':4 'slitter':5
 (0,16) | I slit sheets.       | 'sheet':3 'slit':2
 (0,17) | I am the sleekest sh | 'ever':8 'sheet':5,10 'sleekest':4 'slit':9 'slitter':6
 (0,18) | She slits the sheet  | 'sheet':4 'sit':6 'slit':2
(9 rows)

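In this example, the TID lists of sheet, slit, and slitter are stored as B-trees (posting trees), while the other elements use simple lists (posting lists).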
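The shape of the index can be sketched in Python as a mapping from each lexeme to its sorted TIDs. This is a toy model only: the rows are copied from the listing above, and TREE_THRESHOLD is a hypothetical cutoff chosen so that the split matches the example. PostgreSQL decides between a posting list and a posting tree by whether the list fits in the index page, not by a fixed count.

```python
from collections import defaultdict

# (ctid, lexemes) pairs copied from the table listing above.
ROWS = [
    ((0, 10), ["sheet", "slit", "slitter"]),
    ((0, 11), ["could", "mani", "sheet", "slit", "slitter"]),
    ((0, 12), ["sheet", "slit"]),
    ((0, 13), ["sheet", "sit", "slit", "upon"]),
    ((0, 14), ["good", "sheet", "slit", "slitter", "whoever"]),
    ((0, 15), ["sheet", "slitter"]),
    ((0, 16), ["sheet", "slit"]),
    ((0, 17), ["ever", "sheet", "sleekest", "slit", "slitter"]),
    ((0, 18), ["sheet", "sit", "slit"]),
]

# Hypothetical cutoff for this sketch; not a PostgreSQL setting.
TREE_THRESHOLD = 4

def build_inverted_index(rows):
    """Return {lexeme: sorted list of TIDs}, with lexemes in sorted order."""
    index = defaultdict(list)
    for tid, lexemes in rows:
        for lexeme in lexemes:
            index[lexeme].append(tid)
    return {lexeme: sorted(tids) for lexeme, tids in sorted(index.items())}

index = build_inverted_index(ROWS)
for lexeme, tids in index.items():
    kind = "posting tree" if len(tids) > TREE_THRESHOLD else "posting list"
    print(f"{lexeme:<9} {len(tids)} TIDs -> {kind}")
```

With this cutoff only sheet, slit, and slitter exceed the threshold, which mirrors the split described above.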

How can we find out how many documents contain each lexeme?

testdb=# select (unnest(doc_tsv)).lexeme, count(*) from ts
testdb-# group by 1 order by 2 desc;
  lexeme  | count 
----------+-------
 sheet    |     9
 slit     |     8
 slitter  |     5
 sit      |     2
 upon     |     1
 mani     |     1
 whoever  |     1
 sleekest |     1
 good     |     1
 could    |     1
 ever     |     1
(11 rows)

The following example shows how a scan through the GIN index works:

testdb=# explain(costs off)
testdb-# select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                        QUERY PLAN                         
-----------------------------------------------------------
 Seq Scan on ts
   Filter: (doc_tsv @@ to_tsquery('many & slitter'::text))
(2 rows)
testdb=# set enable_seqscan=off;
SET
testdb=# explain(costs off)
testdb-# select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                             QUERY PLAN                              
---------------------------------------------------------------------
 Bitmap Heap Scan on ts
   Recheck Cond: (doc_tsv @@ to_tsquery('many & slitter'::text))
   ->  Bitmap Index Scan on ts_doc_tsv_idx
         Index Cond: (doc_tsv @@ to_tsquery('many & slitter'::text))
(4 rows)

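To execute this query, the individual lexemes (the search keys) are first extracted from it: mani and slitter. A dedicated API function does this; it takes into account the data type and the strategy determined by the operator class. The operators that the tsvector_ops class supports for GIN can be listed from the system catalog: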

testdb=# select amop.amopopr::regoperator, amop.amopstrategy
testdb-# from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop
testdb-# where opc.opcname = 'tsvector_ops'
testdb-# and opf.oid = opc.opcfamily
testdb-# and am.oid = opf.opfmethod
testdb-# and amop.amopfamily = opc.opcfamily
testdb-# and am.amname = 'gin'
testdb-# and amop.amoplefttype = opc.opcintype;
        amopopr        | amopstrategy 
-----------------------+--------------
 @@(tsvector,tsquery)  |            1
 @@@(tsvector,tsquery) |            2
(2 rows)

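Back to the example: the next step looks up both keys in the B-tree of lexemes and follows them to their TID lists. With the ctid values listed above, this gives:
mani: (0,11)
slitter: (0,10), (0,11), (0,14), (0,15), (0,17)

For each TID found, the consistency function (another API function) is called to decide whether the row matches the query. Because the query is an AND, only (0,11) is returned.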

testdb=# select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                     doc                     
---------------------------------------------
 How many sheets could a sheet slitter slit?
(1 row)
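The scan just described can be sketched in Python as follows. It is a simplified illustration, not the real GIN scan: the posting lists are written out by hand from the ctid values in the table listing, and consistent_and is a hypothetical stand-in for the operator class's consistency function.

```python
# Posting lists for the two search keys, using the ctid values
# from the table listing earlier in the article.
POSTINGS = {
    "mani": [(0, 11)],
    "slitter": [(0, 10), (0, 11), (0, 14), (0, 15), (0, 17)],
}

def consistent_and(present):
    """Stand-in consistency check for an AND query: every key must be present."""
    return all(present)

def scan(postings, keys, consistent):
    """Visit every TID found under any key and keep those that pass the check."""
    candidates = sorted(set().union(*(postings.get(key, []) for key in keys)))
    result = []
    for tid in candidates:
        present = [tid in postings.get(key, []) for key in keys]
        if consistent(present):
            result.append(tid)
    return result

hits = scan(POSTINGS, ["mani", "slitter"], consistent_and)
print(hits)
```

Passing the built-in any instead of consistent_and would model an OR query and return all five TIDs.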

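Slow Update
Inserting into or updating a column covered by a GIN index is fairly slow. Each document usually contains many lexemes to be indexed, so adding or updating even a single document requires many updates to the index tree. On the other hand, when several documents are updated together, some of their lexemes are likely to coincide, so the total work is less than when the documents are updated one at a time.
PostgreSQL provides the fastupdate storage parameter for this. When it is enabled, updates accumulate in a separate unordered list, and the index itself is updated only once this list exceeds a threshold (the gin_pending_list_limit configuration parameter, or the index storage parameter of the same name). The technique has drawbacks: queries get slower because the pending list must be scanned in addition to the tree, and an update that happens to trigger the merge into the index takes noticeably longer than usual.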
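The trade-off can be illustrated with a toy Python model of a pending list. This is a sketch under simplifying assumptions, not how PostgreSQL implements it: the class and its names are hypothetical, and the limit here counts entries, whereas gin_pending_list_limit is a size in kilobytes.

```python
class ToyGinIndex:
    """Toy model of fastupdate: writes go to a pending list, merged in batches."""

    def __init__(self, pending_limit):
        self.main = {}        # lexeme -> sorted list of TIDs (the "real" index)
        self.pending = []     # unordered (lexeme, tid) entries awaiting merge
        self.pending_limit = pending_limit  # counted in entries (assumption)
        self.flushes = 0

    def insert(self, tid, lexemes):
        # Cheap path: just append to the pending list.
        self.pending.extend((lexeme, tid) for lexeme in lexemes)
        if len(self.pending) > self.pending_limit:
            self.flush()      # this particular insert pays for the whole merge

    def flush(self):
        for lexeme, tid in self.pending:
            self.main.setdefault(lexeme, []).append(tid)
        for tids in self.main.values():
            tids.sort()
        self.pending.clear()
        self.flushes += 1

    def lookup(self, lexeme):
        # Queries must scan the pending list in addition to the main structure.
        found = list(self.main.get(lexeme, []))
        found += [tid for lex, tid in self.pending if lex == lexeme]
        return sorted(found)


idx = ToyGinIndex(pending_limit=4)
idx.insert((0, 10), ["sheet", "slit", "slitter"])  # 3 pending entries, no merge
print(idx.lookup("sheet"), idx.flushes)
idx.insert((0, 12), ["sheet", "slit"])             # 5 entries exceed the limit
print(idx.lookup("sheet"), idx.flushes, len(idx.pending))
```

Both drawbacks mentioned above show up in the model: lookup has to walk the pending list on every call, and the insert that crosses the limit does the merging work for all earlier inserts.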

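Limiting the query result
One property of the GIN access method is that it normally returns a bitmap rather than TIDs one by one, so every plan shown here is a bitmap scan.
As a consequence, the LIMIT clause is not very effective: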

testdb=# explain verbose
testdb-# select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Bitmap Heap Scan on public.ts  (cost=12.25..16.51 rows=1 width=32)
   Output: doc
   Recheck Cond: (ts.doc_tsv @@ to_tsquery('many & slitter'::text))
   ->  Bitmap Index Scan on ts_doc_tsv_idx  (cost=0.00..12.25 rows=1 width=0)
         Index Cond: (ts.doc_tsv @@ to_tsquery('many & slitter'::text))
(5 rows)
testdb=# explain verbose
testdb-# select doc from ts where doc_tsv @@ to_tsquery('many & slitter') limit 1;
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Limit  (cost=12.25..16.51 rows=1 width=32)
   Output: doc
   ->  Bitmap Heap Scan on public.ts  (cost=12.25..16.51 rows=1 width=32)
         Output: doc
         Recheck Cond: (ts.doc_tsv @@ to_tsquery('many & slitter'::text))
         ->  Bitmap Index Scan on ts_doc_tsv_idx  (cost=0.00..12.25 rows=1 width=0)
               Index Cond: (ts.doc_tsv @@ to_tsquery('many & slitter'::text))
(7 rows)

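This is because the startup cost of the Bitmap Heap Scan (12.25) is essentially the same as the total cost of the Bitmap Index Scan: the whole bitmap has to be built before the first row can be returned, so LIMIT saves very little.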

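For this situation PostgreSQL provides the gin_fuzzy_search_limit parameter, a soft upper limit on the number of rows a GIN index scan returns (the default, 0, means no limit).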

testdb=# show gin_fuzzy_search_limit ;
 gin_fuzzy_search_limit 
------------------------
 0
(1 row)


Posted in: Databases

2023-07-24
