如何優化MySQL反連接

153次閱讀

共計 6229 個字符，預計需要花費 16 分鐘才能閱讀完成。

這篇文章給大家分享的是有關如何優化 MySQL 反連接的內容。丸趣 TV 小編覺得挺實用的，因此分享給大家做個參考，一起跟隨丸趣 TV 小編過來看看吧。

總體來看這個環境還是相對比較繁忙的，線程大概是 200 多個。
# mysqladmin pro|less|wc -l
235
帶著好奇查看慢日志，馬上定位到這個語句，已做了脫敏處理。
# Time: 161013 9:51:45
# User@Host: root[root] @ localhost []
# Thread_id: 24630498 Schema: test Last_errno: 1160 Killed: 0
# Query_time: 61213.561106 Lock_time: 0.000082 Rows_sent: 7551 Rows_examined: 201945890920 Rows_affected: 0 Rows_read: 7551
# Bytes_sent: 0 Tmp_tables: 1 Tmp_disk_tables: 0 Tmp_table_sizes: 0
# InnoDB_trx_id: 2F8E5A82
SET timestamp=1476323505;
select account from t_fund_info
where money =300 and account not in
(select distinct(login_account) from t_user_login_record where login_time = 2016-06-01 )
into outfile /tmp/data.txt
從慢日志來看，執行時間達 61213s, 這個是相當驚人了，也就意味著這個語句跑了一整天。
這引起了我的好奇和興趣，這個問題有得搞頭了。
表 t_fund_info 數據量近 200 萬，存在一個主鍵在 id 列，唯一性索引在 account 上。
CREATE TABLE `t_fund_info`
。。。
PRIMARY KEY (`id`),
UNIQUE KEY `account` (`account`)
) ENGINE=InnoDB AUTO_INCREMENT=1998416 DEFAULT CHARSET=utf8
表 t_user_login_record 數據量 2 千多萬，存在主鍵列 id
CREATE TABLE `t_user_login_record`
。。。
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=22676193 DEFAULT CHARSET=utf8
從語句可以看出，是在做一個批量的大查詢，是希望把查詢結果生成一個文本文件來，但是過濾條件很有限。目前根據查詢來看肯定是全表掃描。
先簡單看了下過濾條件，從 t_fund_info 這個表中，根據一個過濾條件能過濾掉絕大多數的數據，得到 1 萬多數據，還是比較理想的。
select count(*)from t_fund_info where money =300;
+———-+
| count(*) |
+———-+
| 13528 |
+———-+
1 row in set (0.99 sec)
那問題的瓶頸看來是在后面的子查詢了。
把下面的語句放入一個 SQL 腳本 query.sql
select distinct(login_account) from t_user_login_record where login_time = 2016-06-01
導出數據，大概耗時 1 分鐘。
time mysql test query.sql query_rt.log
real 0m59.149s
user 0m0.394s
sys 0m0.046s
過濾后的數據有 50 多萬，相對還是比較理想的過濾情況。
# less query_rt.log|wc -l
548652
我們來解析一下這個語句，看看里面的 Not in 的條件是怎么解析的。
explain extended select account from t_fund_info
where money =300 and account not in
(select distinct(login_account) from t_user_login_record where login_time = 2016-06-01 )；
show warnings;
結果如下：
Message: select `test`.`t_fund_info`.`account` AS `account` from `test`.`t_fund_info` where ((`test`.`t_fund_info`.`money` = 300) and (not((`test`.`t_fund_info`.`account`,(select distinct 1 from `test`.`t_user_login_record` where ((`test`.`t_user_login_record`.`login_time` = 2016-06-01) and (((`test`.`t_fund_info`.`account`) = `test`.`t_user_login_record`.`login_account`) or isnull(`test`.`t_user_login_record`.`login_account`))) having (`test`.`t_user_login_record`.`login_account`))))))
可以看到整個解析的過程非常復雜，原本簡單的一個語句，經過解析，竟然變得如此復雜。

因為 MySQL 里面的優化改進空間相比 Oracle 還是少很多，我決定循序漸進來嘗試優化。因為這個環境還是很重要的，所以我在從庫端使用 mysqldump 導出數據，導入到另外一個測試環境，放開手腳來測試了。
首先對于 not in 的部分，是否是因為生成臨時表的消耗代價太高導致，所以我決定建立一個臨時表來緩存子查詢的數據。
create table test_tab as select distinct(login_account) login_account from t_user_login_record where login_time = 2016-06-01
Query OK, 548650 rows affected (1 min 3.78 sec)
Records: 548650 Duplicates: 0 Warnings: 0
這樣查看這個臨時表就很輕松了，不到 1 秒就出結果。
select count(*)from test_tab;
+———-+
| count(*) |
+———-+
| 548650 |
+———-+
1 row in set (0.38 sec)
然后再次查看使用臨時表后的查詢是否解析會有改善。
explain extended select account from t_fund_info
where money =300 and account not in(select login_account from test_tab);
show warnings;
發現還是一樣，可見臨時表的改進效果不大。
| Note | 1003 | select `test`.`t_fund_info`.`account` AS `account` from `test`.`t_fund_info` where ((`test`.`t_fund_info`.`money` = 300) and (not((`test`.`t_fund_info`.`account`,(select 1 from `test`.`test_tab` where (((`test`.`t_fund_info`.`account`) = `test`.`test_tab`.`login_account`) or isnull(`test`.`test_tab`.`login_account`)) having (`test`.`test_tab`.`login_account`)))))) |
是否是因為子查詢中的數據量太大導致整個反連接的查詢過程中回表太慢，那我縮小一下子查詢的數據條數。
select account from t_fund_info
where money =300 and not exists (select login_account from test_tab where login_account=t_fund_info.account limit 1,10);
這種方式依舊很卡，持續了近半個小時還是沒有反應，所以果斷放棄。
是不是 t_fund_info 的過濾查詢導致了性能問題，我們也創建一個臨時表
create table test_tab1 as select account from t_fund_info
– where money =300;
Query OK, 13528 rows affected (1.38 sec)
Records: 13528 Duplicates: 0 Warnings: 0
再次查詢效果依舊很不理想。
select account from test_tab1
where not exists (select login_account from test_tab where login_account=test_tab1.account limit 1,10);
持續了 20 多分鐘還是沒有反應，所以還是果斷放棄。
這個時候能想到就是索引了，我們在臨時表 test_tab 上創建索引。
create index ind_tmp_login_account on test_tab(login_account);
Query OK, 0 rows affected (4.31 sec)
Records: 0 Duplicates: 0 Warnings: 0
在臨時表 test_tab1 上也創建索引。
create index ind_tmp_account on test_tab1(account);
Query OK, 0 rows affected (0.18 sec)
Records: 0 Duplicates: 0 Warnings: 0
再次查看性能就變得很好了，運行時間 0.15 秒，簡直不敢相信。
explain select account from test_tab1
where not exists (select login_account from test_tab where login_account=test_tab1.account);
11364 rows in set (0.15 sec)

執行計劃如下：

可見通過這種拆分，不斷的猜測和排除，已經找到了一些思路。
我們開始抓住問題的本質。
首先刪除 test_tab1 上的索引，看看執行效果如何。
alter table test_tab1 drop index ind_tmp_account;
Query OK, 0 rows affected (0.00 sec)
Records: 0 Duplicates: 0 Warnings: 0
語句如下，執行時間 0.15 秒
select account from test_tab1
where not exists (select login_account from test_tab where login_account=test_tab1.account);
+——————————–+
11364 rows in set (0.15 sec)
是否 not in 的方式會有很大的差別呢，持續 0.18 秒，有差別，但差別不大。
select account from test_tab1
where account not in (select login_account from test_tab );
+——————————–+
11364 rows in set (0.18 sec)
我們逐步恢復原來的查詢，去除臨時表 test_tab1, 整個查詢持續了 1.12 秒。
select account from t_fund_info
where money =300 and account not in(select login_account from test_tab);
+——————————–+
11364 rows in set (1.12 sec)
使用 explain extended 解析的內容如下：
Message: select `test`.`t_fund_info`.`account` AS `account` from `test`.`t_fund_info` where ((`test`.`t_fund_info`.`money` = 300) and (not((`test`.`t_fund_info`.`account`,(((`test`.`t_fund_info`.`account`) in test_tab on ind_tmp_login_account checking NULL having (`test`.`test_tab`.`login_account`)))))))
這個時候，問題已經基本定位了。在反連接的查詢中，在這個問題場景中，需要對子查詢的表添加一個索引基于 login_account，可以和外層的查詢字段映射，提高查詢效率。
當然在一個數據量龐大，業務相對繁忙的系統中，添加一個臨時需求的索引可能不是一個很好的方案。不過我們還是在測試環境體驗一下。
create index ind_tmp_account1 on t_user_login_record(login_account);
Query OK, 0 rows affected (4 min 45.48 sec)
Records: 0 Duplicates: 0 Warnings: 0
添加索引的過程持續了近 4 分鐘，在這個時候我們使用最開始的查詢語句，性能如何呢。
select account from t_fund_info where money =300 and account not in (select distinct(login_account) from t_user_login_record where);
+——————————–+
11364 rows in set (2.52 sec)
只要 2.52 秒就可以完成之前 20 多個小時查詢結果，性能簡直就是天壤之別。
不過話說回來，跑批查詢可以在從庫上執行，從庫上創建一個這樣的索引，用完再刪掉也是不錯的選擇，要么就是創建一個臨時表，在臨時表上創建索引，臨時表的意義就在于此，不是為了做查詢結果緩存而是創建索引來提高數據過濾效率。
在此有個問題就是臨時表只有一個字段，創建索引的意義在哪里呢。
我畫一個圖來解釋一下。

首先這個查詢的數據是以 t_fund_info 的過濾條件為準，從 200 萬數據中過濾得到 1 萬條數據，然后兩個字段通過 account=login_account 的條件關聯，而不是先關聯子查詢的過濾條件 login_time, 過濾完之后 account 的值之后再過濾 login_time，最后根據 not in 的邏輯來取舍數據，整個數據集就會大大減少。如此一來，子查詢的表千萬行，性能的差別就不會是指數級的。

感謝各位的閱讀！關于“如何優化 MySQL 反連接”這篇文章就分享到這里了，希望以上內容可以對大家有一定的幫助，讓大家可以學到更多知識，如果覺得文章不錯，可以把它分享出去讓更多的人看到吧！

正文完