共計(jì) 13570 個(gè)字符,預(yù)計(jì)需要花費(fèi) 34 分鐘才能閱讀完成。
This article covers what to do when a failed node in a Galera MySQL cluster runs into problems rejoining the cluster.
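Galera Cluster is a multi-master cluster for MySQL.
We currently have a 3-node test cluster set up.
During the first round of testing we hit a problem: a node failed, went offline, and then could not rejoin the cluster.
We then tried adding the node, with its full contents, as a brand-new node, and that failed too. After two days of struggling with it, we gave up.
The error output was as follows: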
170609 16:55:59 [Note] WSREP: Read nil XID from storage engines, skipping position init
170609 16:55:59 [Note] WSREP: wsrep_load(): loading provider library /usr/lib64/galera-3/libgalera_smm.so
170609 16:55:59 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy info@codership.com loaded successfully.
170609 16:55:59 [Note] WSREP: CRC-32C: using hardware acceleration.
170609 16:55:59 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, safe_to_bootsrap: 0
170609 16:55:59 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300M; gcache.recover = no; gcache.size = 300M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc
170609 16:55:59 [Note] WSREP: GCache history reset: old(51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new(51391c6d-4bff-11e7-a1c3-b797743e8629:824276)
170609 16:55:59 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1
170609 16:55:59 [Note] WSREP: wsrep_sst_grab()
170609 16:55:59 [Note] WSREP: Start replication
170609 16:55:59 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276
170609 16:55:59 [Note] WSREP: protonet asio version 0
170609 16:55:59 [Note] WSREP: Using CRC-32C for message checksums.
170609 16:55:59 [Note] WSREP: backend: asio
170609 16:55:59 [Note] WSREP: gcomm thread scheduling priority set to other:0
170609 16:55:59 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170609 16:55:59 [Note] WSREP: restore pc from disk failed
170609 16:55:59 [Note] WSREP: GMCast version 0
170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.11.98:4567
170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75 :4567
170609 16:55:59 [Note] WSREP: (753e6ee4, tcp://0.0.0.0:4567) listening at tcp://0.0.0.0:4567
170609 16:55:59 [Note] WSREP: (753e6ee4, tcp://0.0.0.0:4567) multicast: , ttl: 1
170609 16:55:59 [Note] WSREP: EVS version 0
170609 16:55:59 [Note] WSREP: gcomm: connecting to group mycluster , peer 192.168.11.152:, 192.168.11.98:, 192.168.12.75 :
170609 16:55:59 [Note] WSREP: (753e6ee4, tcp://0.0.0.0:4567) connection established to 753e6ee4 tcp://192.168.11.152:4567
170609 16:55:59 [Warning] WSREP: (753e6ee4, tcp://0.0.0.0:4567) address tcp://192.168.11.152:4567 points to own listening address, blacklisting
170609 16:56:02 [Warning] WSREP: no nodes coming from prim view, prim not possible
170609 16:56:02 [Note] WSREP: view(view_id(NON_PRIM,753e6ee4,1) memb {
753e6ee4,0
} joined {
} left {
} partitioned {
})
170609 16:56:02 [Note] WSREP: (753e6ee4, tcp://0.0.0.0:4567) connection to peer 753e6ee4 with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S
170609 16:56:03 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50193S), skipping check
170609 16:56:32 [Note] WSREP: view((empty))
170609 16:56:32 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():158
170609 16:56:32 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
170609 16:56:32 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel mycluster at gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ? gmcast.segment=0 evs.max_install_timeouts=1 : -110 (Connection timed out)
170609 16:56:32 [ERROR] WSREP: gcs connect failed: Connection timed out
170609 16:56:32 [ERROR] WSREP: wsrep::connect(gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ? gmcast.segment=0 evs.max_install_timeouts=1) failed: 7
170609 16:56:32 [ERROR] Aborting
170609 16:56:32 [Note] WSREP: Service disconnected.
170609 16:56:33 [Note] WSREP: Some threads may fail to exit.
170609 16:56:33 [Note] /usr/sbin/mysqld: Shutdown complete
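After that, the node could no longer join the cluster at all.
I was stunned, and at one point wondered how the largest clusters in China cope with this problem.
We deleted all the test VMs, reinstalled the OS, and started over.
Over the past two days we began testing this issue again.
We kept repeating the same test case.
After removing the node, the same problem reappeared.
Whether the node rejoined with all of its data wiped or with its original data kept, it failed, and the error output was the same as above.
Stuck again.
That was frustrating, because by rights this should not happen. I started analyzing the error messages. Judging from them, the node always seemed to read only the first node in the list, which is the node itself.
It reported that it could not connect, repeated this 7 times, and then timed out and exited.
Our cluster has 3 nodes, so this should not happen: if the first node cannot be reached, the joiner should try the remaining nodes round-robin.
But nothing in the log shows that happening.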
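One detail in the failing log above may be worth a closer look: the two `Failed to resolve` warnings print the peer address with a space right after `tcp://`, so the only peer the node actually connected to was its own address. The following is a minimal sketch, not part of the original troubleshooting, that pulls those warnings out of a log and flags addresses containing whitespace. The function name `unresolved_peers` is hypothetical, and whether the whitespace is the real cause is an assumption that would need to be verified separately.

```python
import re

# Matches Galera warnings such as:
#   [Warning] WSREP: Failed to resolve tcp:// 192.168.11.98:4567
RESOLVE_FAILURE = re.compile(r"WSREP: Failed to resolve tcp://(.*)$")


def unresolved_peers(log_lines):
    """Return (address, has_whitespace) for every peer that failed to resolve.

    Hypothetical helper for illustration; it only reads log text.
    """
    results = []
    for line in log_lines:
        match = RESOLVE_FAILURE.search(line.rstrip("\n"))
        if match:
            address = match.group(1)
            has_whitespace = any(ch.isspace() for ch in address)
            results.append((address, has_whitespace))
    return results


if __name__ == "__main__":
    sample = [
        "170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.11.98:4567",
        "170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75 :4567",
        "170609 16:55:59 [Note] WSREP: EVS version 0",
    ]
    for address, has_whitespace in unresolved_peers(sample):
        note = "contains whitespace" if has_whitespace else "no whitespace"
        print(f"{address!r}: {note}")
```

If every unresolved peer is flagged this way, the `wsrep_cluster_address` value itself is a reasonable next place to look.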
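I suddenly began to suspect a design problem in this part of the software.
There was no need to read the source code; we could simply change the configuration.
So I edited the wsrep_cluster_address setting and moved the first node's IP to the end of the list.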
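The reordering described above can be sketched as a small helper. This is only an illustration under stated assumptions: `move_local_last` is a hypothetical name, the value is assumed to have the form `gcomm://host1,host2,...` with optional `?options`, and hosts are assumed to be IPv4 addresses or hostnames, optionally followed by `:port`. Note that the helper also strips whitespace around each peer, which goes slightly beyond the manual edit.

```python
def move_local_last(cluster_address, local_ip):
    """Return a gcomm:// address with the local node moved to the end.

    Hypothetical helper: it rewrites a string only and never touches my.cnf.
    """
    scheme = "gcomm://"
    if not cluster_address.startswith(scheme):
        raise ValueError("expected an address starting with gcomm://")

    # Split off optional provider options that follow '?'.
    peers_part, _, options = cluster_address[len(scheme):].partition("?")

    # Drop empty entries and surrounding whitespace.
    peers = [p.strip() for p in peers_part.split(",") if p.strip()]

    def is_local(peer):
        # Compare only the host part so "ip:4567" still matches the local IP.
        return peer.split(":", 1)[0] == local_ip

    others = [p for p in peers if not is_local(p)]
    local = [p for p in peers if is_local(p)]

    result = scheme + ",".join(others + local)
    if options.strip():
        result += "?" + options.strip()
    return result


if __name__ == "__main__":
    before = "gcomm://192.168.11.152,192.168.11.98,192.168.12.75"
    print(move_local_last(before, "192.168.11.152"))
```

The returned string is what would go into the `wsrep_cluster_address` line of the node's configuration file.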
Then I restarted the database, and a miracle happened.
170609 16:57:09 [Note] WSREP: Read nil XID from storage engines, skipping position init
170609 16:57:09 [Note] WSREP: wsrep_load(): loading provider library /usr/lib64/galera-3/libgalera_smm.so
170609 16:57:09 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy info@codership.com loaded successfully.
170609 16:57:09 [Note] WSREP: CRC-32C: using hardware acceleration.
170609 16:57:09 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:-1, safe_to_bootsrap: 0
170609 16:57:09 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300M; gcache.recover = no; gcache.size = 300M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc
170609 16:57:09 [Note] WSREP: GCache history reset: old(51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new(51391c6d-4bff-11e7-a1c3-b797743e8629:824276)
170609 16:57:09 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1
170609 16:57:09 [Note] WSREP: wsrep_sst_grab()
170609 16:57:09 [Note] WSREP: Start replication
170609 16:57:09 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276
170609 16:57:09 [Note] WSREP: protonet asio version 0
170609 16:57:09 [Note] WSREP: Using CRC-32C for message checksums.
170609 16:57:09 [Note] WSREP: backend: asio
170609 16:57:09 [Note] WSREP: gcomm thread scheduling priority set to other:0
170609 16:57:09 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170609 16:57:09 [Note] WSREP: restore pc from disk failed
170609 16:57:09 [Note] WSREP: GMCast version 0
170609 16:57:09 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) listening at tcp://0.0.0.0:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) multicast: , ttl: 1
170609 16:57:09 [Note] WSREP: EVS version 0
170609 16:57:09 [Note] WSREP: gcomm: connecting to group mycluster , peer 192.168.11.98:, 192.168.12.75:,192.168.11.152 :
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) connection established to 9f2dfc7e tcp://192.168.11.152:4567
170609 16:57:09 [Warning] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) address tcp://192.168.11.152:4567 points to own listening address, blacklisting
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) connection established to 017c00ff tcp://192.168.11.98:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) turning message relay requesting on, nonlive peers: tcp://192.168.12.75:4567
170609 16:57:10 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) connection established to 325d47d6 tcp://192.168.12.75:4567
170609 16:57:10 [Note] WSREP: declaring 017c00ff at tcp://192.168.11.98:4567 stable
170609 16:57:10 [Note] WSREP: declaring 325d47d6 at tcp://192.168.12.75:4567 stable
170609 16:57:10 [Note] WSREP: Node 017c00ff state prim
170609 16:57:10 [Note] WSREP: view(view_id(PRIM,017c00ff,13) memb {
017c00ff,0
325d47d6,0
9f2dfc7e,0
} joined {
} left {
} partitioned {
})
170609 16:57:10 [Note] WSREP: save pc into disk
170609 16:57:10 [Note] WSREP: gcomm: connected
170609 16:57:10 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
170609 16:57:10 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
170609 16:57:10 [Note] WSREP: Opened channel mycluster
170609 16:57:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
170609 16:57:10 [Note] WSREP: Waiting for SST to complete.
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: sent state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 0 (11_98)
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 1 (12_75)
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 2 (11_152)
170609 16:57:10 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 12,
members = 3/3 (joined/total),
act_id = 824276,
last_appl. = -1,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629
170609 16:57:10 [Note] WSREP: Flow-control interval: [28, 28]
170609 16:57:10 [Note] WSREP: Restored state OPEN -> JOINED (824276)
170609 16:57:10 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, view# 13: Primary, number of nodes: 3, my index: 2, protocol version 3
170609 16:57:10 [Note] WSREP: SST complete, seqno: 824276
170609 16:57:10 [Note] WSREP: Member 2.0 (11_152) synced with group.
170609 16:57:10 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 824276)
170609 16:57:10 [Note] Plugin FEDERATED is disabled.
170609 16:57:10 InnoDB: The InnoDB memory heap is disabled
170609 16:57:10 InnoDB: Mutexes and rw_locks use InnoDB's own implementation
170609 16:57:10 InnoDB: Compressed tables use zlib 1.2.3
170609 16:57:10 InnoDB: Using Linux native AIO
170609 16:57:10 InnoDB: Initializing buffer pool, size = 122.0M
170609 16:57:10 InnoDB: Completed initialization of buffer pool
170609 16:57:10 InnoDB: highest supported file format is Barracuda.
170609 16:57:11 InnoDB: Waiting for the background threads to start
170609 16:57:12 InnoDB: 5.5.54 started; log sequence number 6024720364
170609 16:57:12 [Note] Server hostname (bind-address): 0.0.0.0 port: 3306
170609 16:57:12 [Note] – 0.0.0.0 resolves to 0.0.0.0
170609 16:57:12 [Note] Server socket created on IP: 0.0.0.0 .
170609 16:57:12 [Note] Event Scheduler: Loaded 0 events
170609 16:57:12 [Note] /usr/sbin/mysqld: ready for connections.
Version: 5.5.54 socket: /var/lib/mysql/mysql.sock port: 3306 MySQL Community Server (GPL), wsrep_25.19.20170106.aa7e07d
170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:57:12 [Note] WSREP: REPL Protocols: 7 (3, 2)
170609 16:57:12 [Note] WSREP: Assign initial position for certification: 824276, protocol version: 3
170609 16:57:12 [Note] WSREP: Service thread queue flushed.
170609 16:57:12 [Note] WSREP: Synchronized with group, ready for connections
170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:57:13 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) connection to peer 9f2dfc7e with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S
170609 16:57:13 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) turning message relay requesting off
The node connected and joined the cluster without trouble.
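A quick way to confirm a successful join from the log is to follow the `Shifting` lines until the node reaches SYNCED. The sketch below is an illustration only, with hypothetical function names. It accepts either `->` or an en dash between the two states, on the assumption that a publishing tool may have reflowed the arrow in a pasted log excerpt.

```python
import re

# Matches lines such as "WSREP: Shifting JOINED -> SYNCED (TO: 824276)".
# The separator may be "->" or an en dash (U+2013) in reflowed excerpts.
SHIFT = re.compile(r"WSREP: Shifting (\S+) (?:->|\u2013) (\S+)")


def state_transitions(log_lines):
    """Return the (old_state, new_state) pairs in the order they were logged."""
    transitions = []
    for line in log_lines:
        match = SHIFT.search(line)
        if match:
            transitions.append((match.group(1), match.group(2)))
    return transitions


def reached_synced(log_lines):
    """True if the last logged transition ends in SYNCED."""
    transitions = state_transitions(log_lines)
    return bool(transitions) and transitions[-1][1] == "SYNCED"


if __name__ == "__main__":
    sample = [
        "170609 16:57:10 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)",
        "170609 16:57:10 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 824276)",
    ]
    print(state_transitions(sample))
    print(reached_synced(sample))
```

A failing start like the first log in this article contains no `Shifting` lines at all, so `reached_synced` returns False for it.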
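I then tested the case where all data files are wiped: the node again joined the cluster without trouble and completed data synchronization automatically.
The synchronization can be seen in the log of one of the other nodes: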
170608 12:05:48 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
170608 12:05:48 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
170609 16:42:43 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 4,
members = 2/3 (joined/total),
act_id = 824275,
last_appl. = 824274,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629
170609 16:42:43 [Note] WSREP: Flow-control interval: [28, 28]
170609 16:42:43 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824275, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3
170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:42:43 [Note] WSREP: REPL Protocols: 7 (3, 2)
170609 16:42:43 [Note] WSREP: Assign initial position for certification: 824275, protocol version: 3
170609 16:42:43 [Note] WSREP: Service thread queue flushed.
170609 16:42:43 [Note] WSREP: Member 1.0 (11_152) requested state transfer from *any* . Selected 0.0 (11_98)(SYNCED) as donor.
170609 16:42:43 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 824275)
170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:42:43 [Note] WSREP: Running: wsrep_sst_rsync --role donor --address 192.168.11.152:4444/rsync_sst --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --gtid 51391c6d-4bff-11e7-a1c3-b797743e8629:824275
170609 16:42:43 [Note] WSREP: sst_donor_thread signaled with 0
170609 16:42:43 [Note] WSREP: Flushing tables for SST…
170609 16:42:43 [Note] WSREP: Provider paused at 51391c6d-4bff-11e7-a1c3-b797743e8629:824275 (831018)
170609 16:42:43 [Note] WSREP: Tables flushed.
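The `Quorum results` block in the donor log above shows `members = 2/3 (joined/total)`, meaning one of the three members had not yet joined and needed a state transfer. As a rough sketch, assuming the block keeps the `key = value,` layout seen in these excerpts, the fields can be read into a dictionary. The names `parse_quorum` and `members_joined` are hypothetical.

```python
import re

# "key = value," lines inside a Quorum results block.
FIELD = re.compile(r"^\s*([^=]+?)\s*=\s*(.*?),?\s*$")
# Regular log lines start with a six-digit date such as "170609 ".
TIMESTAMP = re.compile(r"^\d{6} ")


def parse_quorum(log_lines):
    """Return the fields of the last 'Quorum results' block as a dict of strings."""
    fields = {}
    in_block = False
    for line in log_lines:
        if "WSREP: Quorum results:" in line:
            fields = {}
            in_block = True
            continue
        if in_block:
            match = None if TIMESTAMP.match(line) else FIELD.match(line)
            if match is None:
                # First line that is not a field ends the block.
                in_block = False
                continue
            fields[match.group(1)] = match.group(2)
    return fields


def members_joined(fields):
    """Split 'members = 2/3 (joined/total)' into integers (joined, total)."""
    joined, total = fields["members"].split(" ")[0].split("/")
    return int(joined), int(total)


if __name__ == "__main__":
    sample = [
        "170609 16:42:43 [Note] WSREP: Quorum results:",
        "version = 4,",
        "component = PRIMARY,",
        "members = 2/3 (joined/total),",
        "group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629",
        "170609 16:42:43 [Note] WSREP: Flow-control interval: [28, 28]",
    ]
    print(members_joined(parse_quorum(sample)))
```

When joined is lower than total, a state-transfer request like the `requested state transfer` line in the excerpt is what one would expect to see next.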
This basically confirmed my guess.
When a node leaves the cluster and later tries to rejoin, if the failed node's own IP is the first IP in the wsrep_cluster_address option of its own configuration file,
then the node can never rejoin the cluster.
What is the fix? Move its IP to a different position in that option, and the problem goes away.
Further testing showed that the problem occurs when the node is the master, that is, the node that was started with --wsrep-new-cluster, and its IP is listed first.
Once the node has rejoined the cluster through the steps above, it should no longer hold the master role.
At that point the problem no longer occurs: the node can join the cluster even with its IP in first position.
This looks like a bug.
After further verification, a bug report can be filed.
The workaround is: in the wsrep-cluster-address option on each node's machine, do not put the local IP first.
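The workaround can be turned into a small pre-flight check. The sketch below is illustrative only: `check_cluster_address` is a hypothetical helper, not a Galera tool. Besides the "local IP first" rule from this article, it also warns about whitespace around peers, because the `Failed to resolve` warnings in the logs above show a space before the address. Treating that whitespace as a problem is an assumption, not something the tests in this article established.

```python
def check_cluster_address(cluster_address, local_ip):
    """Return a list of warnings for a wsrep_cluster_address value.

    Hypothetical lint helper; an empty list means nothing was flagged.
    """
    scheme = "gcomm://"
    if not cluster_address.startswith(scheme):
        return ["address does not start with gcomm://"]

    warnings = []
    # Ignore provider options after '?'; only the peer list is checked.
    peers_part = cluster_address[len(scheme):].partition("?")[0]
    peers = peers_part.split(",")

    for peer in peers:
        if peer != peer.strip():
            warnings.append(f"peer {peer!r} has surrounding whitespace")

    # Compare host parts only, so an explicit ":port" does not hide a match.
    hosts = [p.strip().split(":", 1)[0] for p in peers]
    if hosts and hosts[0] == local_ip:
        warnings.append(f"local IP {local_ip} is listed first")
    return warnings


if __name__ == "__main__":
    value = "gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 "
    for warning in check_cluster_address(value, "192.168.11.152"):
        print(warning)
```

Running a check like this on each node before a restart makes it easier to tell whether the ordering, the whitespace, or both are present in a given configuration.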