CloudStack High Availability: A Source Code Analysis



We start with PingTask, an inner class of DirectAgentAttache. Every host registered in CloudStack (CS) has a corresponding DirectAgentAttache, so every host has its own PingTask thread looping in the background. The interval is set by the global setting ping.interval, which defaults to 60 s.

Here is the PingTask code:

ServerResource resource = _resource;
if (resource != null) {
    PingCommand cmd = resource.getCurrentStatus(_id);
    int retried = 0;
    // Retry a limited number of times, sleeping between attempts.
    while (cmd == null && ++retried <= _HostPingRetryCount.value()) {
        Thread.sleep(1000 * _HostPingRetryTimer.value());
        cmd = resource.getCurrentStatus(_id);
    }
    if (cmd == null) {
        s_logger.warn("Unable to get current status on ...
        // (the excerpt is cut off here in the source)

_id is the host ID. If getCurrentStatus returns a non-null cmd, the host answered the ping, and the next step is to call _agentMgr.handleCommands:

public void handleCommands(final AgentAttache attache, final long sequence, final Command[] cmds) {
    for (final Pair<Integer, Listener> listener : _cmdMonitors) {
        final boolean processed = listener.second().processCommands(attache.getId(), sequence, cmds);
    }
}

Of the registered listeners, the one we care about is BehindOnPingListener. Here is its processCommands method:

@Override
public boolean processCommands(final long agentId, final long seq, final Command[] commands) {
    final boolean processed = false;
    for (final Command cmd : commands) {
        if (cmd instanceof PingCommand) {
            pingBy(agentId);
        }
    }
    return processed;
}

Next comes the pingBy method:

public void pingBy(final long agentId) {
    // Update PingMap with the latest time if agent entry exists in the PingMap
    if (_pingMap.replace(agentId, InaccurateClock.getTimeInSeconds()) == null) {
        s_logger.info("PingMap for agent: " + agentId + " will not be updated because agent is no longer in the PingMap");
    }
}

The key piece here is _pingMap. It is a ConcurrentHashMap whose key is the agentId (for example, a hostId) and whose value is a timestamp: each time a ping succeeds, the current time is written as the value for that agent. Recall that PingTask runs once every ping.interval, so while a host is running normally its _pingMap entry is refreshed roughly every ping.interval (allowing for some delay in getCurrentStatus). If the host suddenly fails and becomes unreachable over the network, the time in _pingMap stays frozen at the last successful ping.

To summarize the PingTask logic: every ping.interval (60 s by default) it pings the host. If the ping succeeds, it updates the value in _pingMap to the current timestamp; otherwise it does nothing.
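One detail of pingBy is worth isolating: ConcurrentHashMap.replace only updates a key that is already present, so a ping from an agent that has been removed does not recreate its entry. The following is a minimal sketch of that behaviour, not CloudStack code. The class PingMapSketch and its track, untrack, and lastPing helpers are hypothetical names introduced for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the agent manager's ping bookkeeping (not CloudStack code).
class PingMapSketch {
    // agentId -> time of the last successful ping, in seconds
    private final ConcurrentHashMap<Long, Long> pingMap = new ConcurrentHashMap<>();

    // Assumed hook: start tracking an agent when it connects.
    void track(long agentId, long nowSeconds) {
        pingMap.put(agentId, nowSeconds);
    }

    // Assumed hook: stop tracking an agent when it is removed.
    void untrack(long agentId) {
        pingMap.remove(agentId);
    }

    // Mirrors pingBy: replace() returns null when the key is absent,
    // so an unknown agent is reported (false) rather than re-added.
    boolean pingBy(long agentId, long nowSeconds) {
        return pingMap.replace(agentId, nowSeconds) != null;
    }

    // Last recorded ping time, or null if the agent is not tracked.
    Long lastPing(long agentId) {
        return pingMap.get(agentId);
    }
}
```

Using replace rather than put means a late ping cannot bring back an entry that another thread has already removed.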

The other background thread to look at is MonitorTask, which also runs once every ping.interval. It starts with the findAgentsBehindOnPing method:

protected List<Long> findAgentsBehindOnPing() {
    final List<Long> agentsBehind = new ArrayList<Long>();
    final long cutoffTime = InaccurateClock.getTimeInSeconds() - getTimeout();
    for (final Map.Entry<Long, Long> entry : _pingMap.entrySet()) {
        // Last successful ping is older than the cutoff: the agent is behind.
        if (entry.getValue() < cutoffTime) {
            agentsBehind.add(entry.getKey());
        }
    }
    return agentsBehind;
}

protected long getTimeout() {
    return (long) (PingTimeout.value() * PingInterval.value());
}

The global setting ping.timeout defaults to 2.5. This code finds the hosts whose last successful ping is more than 2.5 times ping.interval ago, in other words hosts that cannot be pinged or whose ping is delayed beyond what we consider reasonable. Normally the method returns an empty list and MonitorTask ends its current run. When there is network latency or a host failure, however, the following code runs:

final List<Long> behindAgents = findAgentsBehindOnPing();
for (final Long agentId : behindAgents) {
    final QueryBuilder<HostVO> sc = QueryBuilder.create(HostVO.class);
    sc.and(sc.entity().getId(), Op.EQ, agentId);
    final HostVO h = sc.find();
    if (h != null) {
        final ResourceState resourceState = h.getResourceState();
        if (resourceState == ResourceState.Disabled || resourceState == ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) {
            // Disabled or in maintenance: disconnect without investigating.
            disconnectWithoutInvestigation(agentId, Event.ShutdownRequested);
        } else {
            final HostVO host = _hostDao.findById(agentId);
            if (host != null && (host.getType() == Host.Type.ConsoleProxy || host.getType() == Host.Type.SecondaryStorageVM
                || host.getType() == Host.Type.SecondaryStorageCmdExecutor)) {
                disconnectWithoutInvestigation(agentId, Event.ShutdownRequested);
            } else {
                disconnectWithInvestigation(agentId, Event.PingTimeout);
            }
        }
    }
}
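The MonitorTask pass can be checked with a small standalone sketch. It assumes the defaults mentioned in the text (ping.interval of 60 s and ping.timeout of 2.5, giving a 150 s timeout) and reduces the host lookup to two simplified local enums. BehindOnPingSketch, findBehind, and needsInvestigation are hypothetical names, and the enums are cut-down stand-ins for the CloudStack types of the same name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the MonitorTask scan (not CloudStack code).
class BehindOnPingSketch {
    // Simplified stand-ins for the CloudStack enums used in the excerpt.
    enum ResourceState { Enabled, Disabled, Maintenance, ErrorInMaintenance }
    enum HostType { Routing, ConsoleProxy, SecondaryStorageVM, SecondaryStorageCmdExecutor }

    // Timeout in seconds: the ping.timeout multiplier times ping.interval.
    static long timeoutSeconds(double pingTimeout, int pingIntervalSeconds) {
        return (long) (pingTimeout * pingIntervalSeconds);
    }

    // Agents whose last successful ping is strictly older than (now - timeout).
    static List<Long> findBehind(Map<Long, Long> pingMap, long nowSeconds, long timeoutSeconds) {
        long cutoff = nowSeconds - timeoutSeconds;
        List<Long> behind = new ArrayList<>();
        for (Map.Entry<Long, Long> e : pingMap.entrySet()) {
            if (e.getValue() < cutoff) {
                behind.add(e.getKey());
            }
        }
        return behind;
    }

    // True when the excerpt would call disconnectWithInvestigation.
    static boolean needsInvestigation(ResourceState state, HostType type) {
        if (state == ResourceState.Disabled || state == ResourceState.Maintenance
                || state == ResourceState.ErrorInMaintenance) {
            return false;
        }
        return !(type == HostType.ConsoleProxy || type == HostType.SecondaryStorageVM
                || type == HostType.SecondaryStorageCmdExecutor);
    }
}
```

With the default values, a host must miss more than two consecutive pings before it shows up as behind, and only an enabled regular host goes through investigation.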

Assuming the failing host is a compute node, execution eventually reaches AgentManagerImpl's handleDisconnectWithInvestigation method:

protected boolean handleDisconnectWithInvestigation(final AgentAttache attache, Status.Event event) {
    final long hostId = attache.getId();
    HostVO host = _hostDao.findById(hostId);
    if (host != null) {
        Status nextStatus = null;
        nextStatus = host.getStatus().getNextStatus(event);
        if (nextStatus == Status.Alert) {
            Status determinedState = investigate(attache);
            if (determinedState == null) {
                if ((System.currentTimeMillis() >> 10) - host.getLastPinged() > AlertWait.value()) {
                    determinedState = Status.Alert;
                } else {
                    return false;
                }
            }
            final Status currentStatus = host.getStatus();
            if (determinedState == Status.Down) {
                event = Status.Event.HostDown;
            } else if (determinedState == Status.Up) {
                agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
                return false;
            } else if (determinedState == Status.Disconnected) {
                if (currentStatus == Status.Disconnected) {
                    if ((System.currentTimeMillis() >> 10) - host.getLastPinged() > AlertWait.value()) {
                        event = Status.Event.WaitedTooLong;
                    } else {
                        return false;
                    }
                } else if (currentStatus == Status.Up) {
                    event = Status.Event.AgentDisconnected;
                }
            }
        }
    }
    handleDisconnectWithoutInvestigation(attache, event, true, true);
    host = _hostDao.findById(hostId); // Maybe the host magically reappeared?
    if (host != null && host.getStatus() == Status.Down) {
        _haMgr.scheduleRestartForVmsOnHost(host, true);
    }
    return true;
}
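The branching in handleDisconnectWithInvestigation is easier to follow when written as a pure function. The sketch below is a simplification under stated assumptions: it covers only the case where nextStatus is Alert, it replaces the AlertWait comparison with a boolean parameter, and DisconnectDecisionSketch, decide, and coarseSeconds are hypothetical names with local stand-in enums. It also shows that shifting milliseconds right by 10 divides by 1024, so the "seconds" in that comparison are approximate.

```java
import java.util.Optional;

// Hypothetical restatement of the branch logic (not CloudStack code).
class DisconnectDecisionSketch {
    enum Status { Up, Down, Disconnected, Alert }
    enum Event { PingTimeout, HostDown, WaitedTooLong, AgentDisconnected }

    // millis >> 10 divides by 1024, a cheap approximation of seconds.
    static long coarseSeconds(long millis) {
        return millis >> 10;
    }

    // Returns the event to disconnect with, or empty where the excerpt returns false.
    // determined may be null when the investigation could not decide.
    static Optional<Event> decide(Status determined, Status current, Event incoming,
                                  boolean waitedLongerThanAlertWait) {
        if (determined == null) {
            if (!waitedLongerThanAlertWait) {
                return Optional.empty();          // too early to raise an alert
            }
            determined = Status.Alert;
        }
        if (determined == Status.Down) {
            return Optional.of(Event.HostDown);   // only this path leads to HA
        }
        if (determined == Status.Up) {
            return Optional.empty();              // host recovered, nothing to do
        }
        if (determined == Status.Disconnected) {
            if (current == Status.Disconnected) {
                return waitedLongerThanAlertWait
                        ? Optional.of(Event.WaitedTooLong)
                        : Optional.empty();
            }
            if (current == Status.Up) {
                return Optional.of(Event.AgentDisconnected);
            }
        }
        return Optional.of(incoming);             // event left unchanged
    }
}
```

Only a determined state of Down yields HostDown, which matches the final check in the method that schedules VM restarts only when the host ends up in Status.Down.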

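Look first at the final if in handleDisconnectWithInvestigation: under certain conditions the end goal is to restart all the virtual machines on that host, which is the real purpose of HA. Keep in mind, though, that the precondition for entering handleDisconnectWithInvestigation is very weak: it is enough that the last successful ping of the host is more than, say, two and a half minutes old. Actually performing HA should happen only when we are very sure that the host really is down. That is why most of the method repeatedly confirms the host's state, as the word Investigation in its name suggests. Suppose the host's currentStatus is Up. The event is PingTimeout, so nextStatus is Alert. The next step is the investigate method: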

protected Status investigate(final AgentAttache agent) {
    final Long hostId = agent.getId();
    final HostVO host = _hostDao.findById(hostId);
    if (host != null && host.getType() != null && !host.getType().isVirtual()) {
        final Answer answer = easySend(hostId, new CheckHealthCommand());
        if (answer != null && answer.getResult()) {
            final Status status = Status.Up;
            return status;
        }
        return _haMgr.investigate(hostId);
    }
    return Status.Alert;
}

investigate first sends a CheckHealthCommand to the host. There are two possible outcomes:

1. If an answer comes back, the host is healthy and investigate returns Up right away. Back in handleDisconnectWithInvestigation, the task is then essentially finished: whatever triggered it was only a temporary network problem or something similar, and the host has since recovered.

2. Otherwise CheckHealthCommand gets no answer. The management server asked the host directly and got no response, but that still does not prove the host is really dead. So what next? We ask a series of investigators to find out whether the host is alive:

@Override
public Status investigate(final long hostId) {
    final HostVO host = _hostDao.findById(hostId);
    if (host == null) {
        return Status.Alert;
    }
    Status hostState = null;
    for (Investigator investigator : investigators) {
        // The first investigator that reaches a verdict wins.
        hostState = investigator.isAgentAlive(host);
        if (hostState != null) {
            return hostState;
        }
    }
    return hostState;
}
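The investigator loop is a small chain of responsibility: each investigator either returns a verdict or returns null to pass the question on. A minimal sketch of the pattern follows. It identifies hosts by a plain string, and InvestigatorChainSketch and its nested Investigator interface are hypothetical stand-ins, not the CloudStack types.

```java
import java.util.List;

// Hypothetical sketch of the "first non-null verdict wins" loop (not CloudStack code).
class InvestigatorChainSketch {
    enum Status { Up, Down }

    // Stand-in for an investigator: returns a verdict, or null when it cannot tell.
    interface Investigator {
        Status isAgentAlive(String host);
    }

    // Ask each investigator in order and stop at the first definite answer.
    // Returns null when nobody could determine the host's state.
    static Status investigate(List<Investigator> investigators, String host) {
        for (Investigator investigator : investigators) {
            Status verdict = investigator.isAgentAlive(host);
            if (verdict != null) {
                return verdict;
            }
        }
        return null;
    }
}
```

Because the first definite answer ends the loop, the order of the investigators matters. A null result from the whole chain is what sends the caller into the AlertWait check in handleDisconnectWithInvestigation.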

If the host is a XenServer host, the most important investigator is of course XenServerInvestigator. Here is its isAgentAlive method:

public Status isAgentAlive(Host agent) {
    CheckOnHostCommand cmd = new CheckOnHostCommand(agent);
    List<HostVO> neighbors = _resourceMgr.listAllHostsInCluster(agent.getClusterId());
    for (HostVO neighbor : neighbors) {
        Answer answer = _agentMgr.easySend(neighbor.getId(), cmd);
        if (answer != null && answer.getResult()) {
            CheckOnHostAnswer ans = (CheckOnHostAnswer)answer;
            if (!ans.isDetermined()) {
                continue;
            }
            return ans.isAlive() ? null : Status.Down;
        }
    }
    return null;
}

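The logic is simple: if I cannot reach you directly, I go to your neighbors in the same cluster. I send a CheckOnHostCommand to each neighboring host to see whether any of them can tell what has happened to you. The official article describes in detail how CheckOnHostCommand is implemented: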

If the network ping investigation returns that it cannot detect the status of the host, CloudStack HA then relies on the hypervisor specific investigation. For VmWare, there is no such investigation as the hypervisor host handles its own HA. For XenServer and KVM, CloudStack HA deploys a monitoring script that writes the current timestamp on to a heartbeat file on shared storage. If the timestamp cannot be written, the hypervisor host self-fences by rebooting itself. For these two hypervisors, CloudStack HA sends a CheckOnHostCommand to a neighboring hypervisor host that shares the same storage. The neighbor then checks on the heartbeat file on shared storage and see if the heartbeat is no longer being written. If the heartbeat is still being written, the host reports that the host in question is still alive. If the heartbeat file’s timestamp is lagging behind, after an acceptable timeout value, the host reports that the host in question is down and HA is started on the VMs on that host.
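The neighbor's heartbeat check described in the quoted passage comes down to comparing a timestamp with a timeout. The sketch below illustrates that comparison only. In CloudStack the check is done by scripts on the hypervisor host, and HeartbeatCheckSketch, Verdict, and the timeout value are hypothetical names and parameters introduced here.

```java
// Hypothetical sketch of the heartbeat comparison (not CloudStack code).
class HeartbeatCheckSketch {
    enum Verdict { ALIVE, DOWN, UNDETERMINED }

    // lastHeartbeatSeconds: timestamp read from the shared heartbeat file,
    // or null if the file could not be read.
    static Verdict check(Long lastHeartbeatSeconds, long nowSeconds, long timeoutSeconds) {
        if (lastHeartbeatSeconds == null) {
            return Verdict.UNDETERMINED;   // the neighbor cannot tell either way
        }
        long age = nowSeconds - lastHeartbeatSeconds;
        return age <= timeoutSeconds ? Verdict.ALIVE : Verdict.DOWN;
    }

    // Mirrors the return convention of isAgentAlive in the excerpt:
    // only a determined "dead" result becomes a status; everything else is null.
    static String toInvestigatorResult(Verdict verdict) {
        return verdict == Verdict.DOWN ? "Down" : null;
    }
}
```

Note the asymmetry in the excerpt's isAgentAlive: a host that is reported alive produces null rather than Up, so this investigator can only confirm that a host is down.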

In short, CS runs a monitoring script on every XenServer and KVM host, and the script writes the current timestamp to a file on shared storage. A host that finds it cannot write to the file forcibly reboots itself. The isAgentAlive code therefore sends a CheckOnHostCommand to other hosts that share storage with the host under investigation. On receiving the command, a neighbor checks whether the investigated host is still updating its timestamp in the file. If so, it replies that the host is still alive; otherwise it replies that the host is down.

Only when the host really has failed and cannot be reached does determinedState in handleDisconnectWithInvestigation become Status.Down. The event then becomes Status.Event.HostDown, and HighAvailabilityManagerImpl's scheduleRestartForVmsOnHost is called to restart all the virtual machines on the host. It inserts a HaWorkVO into the database and wakes up the WorkerThread that was initialized when CS started, which leads to another important method of HighAvailabilityManagerImpl, restart:

protected Long restart(final HaWorkVO work) {
    boolean isHostRemoved = false;
    Boolean alive = null;
    if (work.getStep() == Step.Investigating) {
        if (!isHostRemoved) {
            Investigator investigator = null;
            for (Investigator it : investigators) {
                investigator = it;
                try {
                    // (1)
                    alive = investigator.isVmAlive(vm, host);
                    break;
                } catch (UnknownVM e) {
                    s_logger.info(investigator.getName() + " could not find " + vm);
                }
            }
            boolean fenced = false;
            if (alive == null) {
                for (FenceBuilder fb : fenceBuilders) {
                    // (2)
                    Boolean result = fb.fenceOff(vm, host);
                    if (result != null && result) {
                        fenced = true;
                        break;
                    }
                }
            }
            // (3)
            _itMgr.advanceStop(vm.getUuid(), true);
        }
    }
    vm = _itMgr.findById(vm.getId());
    // (4)
    if (!_forceHA && !vm.isHaEnabled()) {
        return null; // VM doesn't require HA
    }
    try {
        HashMap<VirtualMachineProfile.Param, Object> params = new HashMap<VirtualMachineProfile.Param, Object>();
        // (5)
        if (_haTag != null) {
            params.put(VirtualMachineProfile.Param.HaTag, _haTag);
        }
        WorkType wt = work.getWorkType();
        if (wt.equals(WorkType.HA)) {
            params.put(VirtualMachineProfile.Param.HaOperation, true);
        }
        // (6)
        try {
            _itMgr.advanceStart(vm.getUuid(), params, null);
        } catch (InsufficientCapacityException e) {
            s_logger.warn("Failed to deploy vm " + vmId + " with original planner, sending HAPlanner");
            _itMgr.advanceStart(vm.getUuid(), params, _haPlanners.get(0));
        }
    }
    return (System.currentTimeMillis() >> 10) + _restartRetryInterval;
}

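I have marked six key points in the code above. The rough flow is:

(1) Call each investigator's isVmAlive method. If the VM is alive, do nothing; otherwise continue.

(2) Call fb.fenceOff.

(3) Call _itMgr.advanceStop.

(4) As for the _forceHA variable, I could not find it in the global settings or in the database's configuration table, so it is initialized to FALSE. That means only VMs for which vm.isHaEnabled is true continue; for the rest the method returns immediately.

(5) The value of _haTag comes from the global setting ha.tag and is empty by default. If it is set, it matters a great deal later when a host is chosen for the VM. Remember this line: params.put(VirtualMachineProfile.Param.HaTag, _haTag);

(6) This should look familiar. Anyone who has read the CS code path for creating a VM instance knows that this is the method that allocates a VM. At this point most of the CS HA code has run, and what remains is restarting the VM. Whether the VM can actually restart depends on various conditions, such as whether the cluster has a suitable host, whether the host has enough physical resources, whether ha.tag is set, and whether the VM uses tags. I will not go into those here.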
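Steps (4) and (6) can be condensed into a short sketch: a gate on whether the VM needs HA at all, followed by a start attempt that falls back to an HA planner when the first attempt runs out of capacity. This illustrates the pattern under stated assumptions and is not the CloudStack implementation. RestartFallbackSketch, Starter, and the string planner names are hypothetical, and the exception is a local unchecked stand-in chosen for brevity.

```java
// Hypothetical sketch of the HA gate and planner fallback (not CloudStack code).
class RestartFallbackSketch {
    // Local stand-in for a capacity failure; unchecked here to keep the sketch short.
    static class InsufficientCapacityException extends RuntimeException { }

    // Stand-in for the component that starts a VM with a given planner
    // (a null planner means "use the default one").
    interface Starter {
        void start(String vmUuid, String planner);
    }

    // Step (4): a VM is restarted only if HA is forced or enabled for it.
    static boolean requiresHa(boolean forceHa, boolean vmHaEnabled) {
        return forceHa || vmHaEnabled;
    }

    // Step (6): try the default planner first, then retry with the HA planner.
    // Returns the name of the planner that succeeded.
    static String startWithFallback(Starter starter, String vmUuid, String haPlanner) {
        try {
            starter.start(vmUuid, null);
            return "default";
        } catch (InsufficientCapacityException e) {
            // A second failure propagates to the caller.
            starter.start(vmUuid, haPlanner);
            return haPlanner;
        }
    }
}
```

Trying the original planner first keeps normal placement rules in force, and the HA planner is consulted only when those rules cannot find capacity.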
