How to Use PodDisruptionBudget (PDB) in Kubernetes



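When to use a PDB

The PodDisruptionBudget object (PDB for short) was added around Kubernetes 1.4 and promoted to Beta in 1.5, yet it was still Beta when 1.9 was released. Let's set that aside and first ask what problem PDB is meant to solve. The feature has been around for more than a year, but I had never looked into it because I had no use case. Recently I have been working on an ElasticSearch as a Service (ESaaS) design on Kubernetes, where every ElasticSearch cluster should always have at least one healthy ES client pod, ES master pod, and ES data pod. Many readers will think: a Deployment lets you set maxUnavailable, so isn't that enough? And besides, isn't the RS controller already taking care of the replica count?

Hold on! When does maxUnavailable in a Deployment apply? It guarantees a minimum number of serving replicas during a rolling update of an application deployed with a Deployment. And the RS controller? It is just one of the replica controllers, and it does not guarantee that a given number of replicas exists at all times. Its job is to bring the actual replica count in line with the desired count as quickly as possible; it does not care what the actual count is at any moment in between. This is where Kubernetes PDB comes in: it keeps an application highly available by setting a budget for voluntary disruptions.

Voluntary disruptions were mentioned above, so let's sort out what a voluntary disruption is and what an involuntary disruption is.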

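Involuntary disruptions and how to mitigate them

Involuntary disruptions are caused by external factors that are out of our control (or at least hard to control for now), for example:

A hardware failure or kernel crash takes a node down.

The containers run in a VM and the VM is deleted by mistake, or the hypervisor fails.

The cluster suffers a network partition (split brain). Kubernetes handles partitions through the NodeController, but it still does not consider application availability when it evicts pods. For an in-depth look at the NodeController, see my posts below:

Kubernetes Node Controller source code analysis: execution

Kubernetes Node Controller source code analysis: creation

Kubernetes Node Controller source code analysis: configuration

Kubernetes Node Controller source code analysis: Taint Controller

A node runs short of compute resources because of unreasonable overcommitment, which triggers kubelet eviction. Kubelet eviction does not consider application availability either. For an in-depth look at kubelet eviction, see my posts below:

Kubernetes Eviction Manager source code analysis

How the Kubernetes Eviction Manager works

PDB does not address involuntary disruptions. So how can we reduce their impact on application availability when running on Kubernetes?

Deploy applications with a replica controller such as Deployment, RS, or StatefulSet whenever possible, with replicas greater than 1.

Set resource requests on the application's containers so that the application still has enough resources even when the node is under heavy resource pressure.

Also consider high availability at the physical level: spread the replicas of an application across servers, racks, and switches.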

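PDB protects application availability during voluntary disruptions

The opposite of an involuntary disruption is, naturally, a voluntary disruption: one that is triggered by a user or cluster administrator and that Kubernetes can control, for example:

Deleting the controllers that manage pods, such as Deployment, RS, RC, or StatefulSet.

Triggering a rolling update of an application.

Deleting pods directly in bulk.

Running kubectl drain on a node (node decommissioning, cluster scale-down).

PDB is designed for voluntary disruptions, which fall within what Kubernetes can control. It is not designed for involuntary disruptions.

Once the Kube-Node project goes live, it can integrate with cloud providers such as OpenStack, AWS, and GCE to manage nodes automatically, so HNA (Horizontal Node Autoscaler) events may occur frequently. Its workflow contains logic similar to draining a node, so PDBs are needed to protect application availability.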

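Using PDB: usage and caveats

Usage notes

Every app deployed on Kubernetes can have a corresponding PDB object. It limits either the maximum number of replicas that may be down or the minimum number that must stay available during voluntary disruptions, and in this way keeps the app highly available.

A PDB can protect applications managed by the built-in Kubernetes controllers listed below. In that case the PDB selector should be identical to the selector of the controller object: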

Deployment

ReplicationController

ReplicaSet

StatefulSet

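A PDB can also protect an arbitrary set of pods that is chosen only by the PDB's own selector, with two restrictions:

Only .spec.minAvailable can be configured; maxUnavailable cannot be used.

.spec.minAvailable must be an integer, not a percentage.

Either way, the set of pods a PDB affects is always chosen by its own selector. Make sure that different PDB objects in the same namespace do not use overlapping selectors.

Before using a PDB, be clear about your application type and the behavior you want:

Stateless application: for example, you want at least 60% of the replicas to stay available.

Solution: create a PDB object with minAvailable set to 60%, or with maxUnavailable set to 40%.

Single-instance stateful application: customers must be notified and give their consent before the instance is terminated.

Solution: create a PDB object with maxUnavailable set to 0, so that Kubernetes refuses to evict the instance. Then notify the users and obtain their consent, delete the PDB to lift the block, and recreate the instance. A rolling update of a single-instance StatefulSet always involves downtime, so avoid single-instance StatefulSets in production.

Multi-instance stateful application: the number of available instances must never fall below some number N (for example, because of the election mechanism of Raft-style applications).

Solution: set maxUnavailable=1 or minAvailable=N. The first allows one instance to be removed at a time; the second allows expected_replicas - minAvailable instances to be removed at a time.

Batch Job: the Job needs a pod to eventually complete the task successfully.

The Job Controller has its own mechanism to guarantee this, so no PDB is needed.

For a deep dive into the Job Controller, see my post: Kubernetes Job Controller source code analysis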

Defining a PDB object

Having thought this through and decided to create a PDB, let's look at how a PodDisruptionBudget is defined. Here is a sample:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper

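A PDB definition comes down to three key fields:

.spec.selector selects the set of pods the budget applies to. Best practice is to use the same selector as the application's Deployment or StatefulSet.

.spec.minAvailable is the minimum number or percentage of pods that must stay available during voluntary disruptions.

.spec.maxUnavailable is the maximum number or percentage of pods that may be unavailable during voluntary disruptions. It requires Kubernetes version >= 1.7 and can only be used for pods that belong to a Deployment, RS, RC, or StatefulSet. Prefer .spec.maxUnavailable where it is available.

Note:

A single PDB object cannot define both .spec.minAvailable and .spec.maxUnavailable.

As mentioned earlier, pods that are deleted or become unavailable during a rolling update also count as a voluntary disruption, but a rolling update is governed by its own strategy (maxSurge and maxUnavailable), so the PDB does not interfere with it.

A PDB only guarantees the replica count against voluntary disruptions. Suppose that evictions have brought the pod count down to exactly what .spec.minAvailable or .spec.maxUnavailable permits, and then a healthy pod suddenly dies because its node goes down (an involuntary disruption). The actual number of pods is now lower than the PDB requires. PDB is not a cure-all!

In practice, setting .spec.minAvailable to 100% or .spec.maxUnavailable to 0% blocks pod eviction completely (rolling updates of Deployments and StatefulSets excepted).

Creating a PDB object

Create the PDB object with kubectl apply -f zk-pdb.yaml: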

$ kubectl get poddisruptionbudgets
NAME      MIN-AVAILABLE   ALLOWED-DISRUPTIONS   AGE
zk-pdb    2               1                     7s

Inspect it with kubectl get pdb zk-pdb -o yaml:

$ kubectl get poddisruptionbudgets zk-pdb -o yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
 creationTimestamp: 2017-08-28T02:38:26Z
 generation: 1
 name: zk-pdb
status:
 currentHealthy: 3
 desiredHealthy: 2
 disruptedPods: null
 disruptionsAllowed: 1
 expectedPods: 3
 observedGeneration: 1

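How PDB works: source code analysis

A PDB object defines the state the user wants to hold during voluntary disruptions. What actually maintains that desired state is a controller managed by kube-controller-manager: the Disruption Controller.

The Disruption Controller mainly watches Pods and PDBs. When it sees an Add, Delete, or Update event for a pod or a PDB, it puts the corresponding PDB object into a rate-limited queue to wait for a worker. The worker's main logic is to compute the PodDisruptionBudgetStatus values currentHealthy, desiredHealthy, expectedCount, and disruptedPods, and then call the API to update the PDB status.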

pkg/controller/disruption/disruption.go:498
func (dc *DisruptionController) trySync(pdb *policy.PodDisruptionBudget) error {
    pods, err := dc.getPodsForPdb(pdb)
    if err != nil {
        dc.recorder.Eventf(pdb, v1.EventTypeWarning, "NoPods", "Failed to get pods: %v", err)
        return err
    }
    if len(pods) == 0 {
        dc.recorder.Eventf(pdb, v1.EventTypeNormal, "NoPods", "No matching pods found")
    }
    expectedCount, desiredHealthy, err := dc.getExpectedPodCount(pdb, pods)
    if err != nil {
        dc.recorder.Eventf(pdb, v1.EventTypeWarning, "CalculateExpectedPodCountFailed", "Failed to calculate the number of expected pods: %v", err)
        return err
    }
    currentTime := time.Now()
    disruptedPods, recheckTime := dc.buildDisruptedPodMap(pods, pdb, currentTime)
    currentHealthy := countHealthyPods(pods, disruptedPods, currentTime)
    err = dc.updatePdbStatus(pdb, currentHealthy, desiredHealthy, expectedCount, disruptedPods)
    if err == nil && recheckTime != nil {
        // There is always at most one PDB waiting with a particular name in the queue,
        // and each PDB in the queue is associated with the lowest timestamp
        // that was supplied when a PDB with that name was added.
        dc.enqueuePdbForRecheck(pdb, recheckTime.Sub(currentTime))
    }
    return err
}

Below is the definition of PodDisruptionBudgetStatus:

pkg/apis/policy/types.go:48
type PodDisruptionBudgetStatus struct {
    // Most recent generation observed when updating this PDB status. PodDisruptionsAllowed and other
    // status information is valid only if observedGeneration equals to PDB's object generation.
    // +optional
    ObservedGeneration int64 `json:"observedGeneration,omitempty" protobuf:"varint,1,opt,name=observedGeneration"`
    // DisruptedPods contains information about pods whose eviction was
    // processed by the API server eviction subresource handler but has not
    // yet been observed by the PodDisruptionBudget controller.
    // A pod will be in this map from the time when the API server processed the
    // eviction request to the time when the pod is seen by PDB controller
    // as having been marked for deletion (or after a timeout). The key in the map is the name of the pod
    // and the value is the time when the API server processed the eviction request. If
    // the deletion didn't occur and a pod is still there it will be removed from
    // the list automatically by PodDisruptionBudget controller after some time.
    // If everything goes smooth this map should be empty for the most of the time.
    // Large number of entries in the map may indicate problems with pod deletions.
    DisruptedPods map[string]metav1.Time `json:"disruptedPods" protobuf:"bytes,2,rep,name=disruptedPods"`
    // Number of pod disruptions that are currently allowed.
    PodDisruptionsAllowed int32 `json:"disruptionsAllowed" protobuf:"varint,3,opt,name=disruptionsAllowed"`
    // current number of healthy pods
    CurrentHealthy int32 `json:"currentHealthy" protobuf:"varint,4,opt,name=currentHealthy"`
    // minimum desired number of healthy pods
    DesiredHealthy int32 `json:"desiredHealthy" protobuf:"varint,5,opt,name=desiredHealthy"`
    // total number of pods counted by this disruption budget
    ExpectedPods int32 `json:"expectedPods" protobuf:"varint,6,opt,name=expectedPods"`
}

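The most important fields of PodDisruptionBudgetStatus are DisruptedPods and PodDisruptionsAllowed:

DisruptedPods: holds the pods whose eviction has already been processed by the apiserver's pod eviction subresource but has not yet been observed by the PDB controller. It is a map whose key is the pod name and whose value is the time at which the apiserver accepted the eviction request. An entry has a 2-minute timeout: if the pod still has not been deleted after 2 minutes, it is removed from the map.

PodDisruptionsAllowed: the number of pods that may currently be disrupted.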

The Disruption Controller's main job is to update PDB.Status. So the question is: who actually enforces maxUnavailable or minAvailable when pods are evicted during a voluntary disruption?

A reminder: a PDB only applies to pods that are removed through a pod eviction subresource request, so the answer has to be found in the pod's EvictionREST.

pkg/registry/core/pod/storage/eviction.go:81
// Create attempts to create a new eviction. That is, it tries to evict a pod.
func (r *EvictionREST) Create(ctx genericapirequest.Context, obj runtime.Object, createValidation rest.ValidateObjectFunc, includeUninitialized bool) (runtime.Object, error) {
    eviction := obj.(*policy.Eviction)
    obj, err := r.store.Get(ctx, eviction.Name, &metav1.GetOptions{})
    if err != nil {
        return nil, err
    }
    pod := obj.(*api.Pod)
    var rtStatus *metav1.Status
    var pdbName string
    err = retry.RetryOnConflict(EvictionsRetry, func() error {
        pdbs, err := r.getPodDisruptionBudgets(ctx, pod)
        if err != nil {
            return err
        }
        if len(pdbs) > 1 {
            rtStatus = &metav1.Status{
                Status:  metav1.StatusFailure,
                Message: "This pod has more than one PodDisruptionBudget, which the eviction subresource does not support.",
                Code:    500,
            }
            return nil
        } else if len(pdbs) == 1 {
            pdb := pdbs[0]
            pdbName = pdb.Name
            // Try to verify-and-decrement
            // If it was false already, or if it becomes false during the course of our retries,
            // raise an error marked as a 429.
            if err := r.checkAndDecrement(pod.Namespace, pod.Name, pdb); err != nil {
                return err
            }
        }
        return nil
    })
    if err == wait.ErrWaitTimeout {
        err = errors.NewTimeoutError(fmt.Sprintf("couldn't update PodDisruptionBudget %q due to conflicts", pdbName), 10)
    }
    if err != nil {
        return nil, err
    }
    if rtStatus != nil {
        return rtStatus, nil
    }
    // At this point there was either no PDB or we succeeded in decrementing
    // Try the delete
    _, _, err = r.store.Delete(ctx, eviction.Name, eviction.DeleteOptions)
    if err != nil {
        return nil, err
    }
    // Success!
    return &metav1.Status{Status: metav1.StatusSuccess}, nil
}

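When a pod eviction is requested through EvictionREST, it checks that the pod is matched by at most one PDB and returns an error otherwise. For details on using the Eviction API, see The Eviction API. Only a simple sample is given here: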

{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}

$ curl -v -H 'Content-type: application/json' http://127.0.0.1:8080/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json

checkAndDecrement then checks whether the PDB's maxUnavailable or minAvailable still leaves room for the eviction. If it does, pdb.Status.PodDisruptionsAllowed is decremented by 1.

If checkAndDecrement succeeds, the pod is actually deleted.

// checkAndDecrement checks if the provided PodDisruptionBudget allows any disruption.
func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) error {
    if pdb.Status.ObservedGeneration < pdb.Generation {
        // TODO(mml): Add a Retry-After header. Once there are time-based
        // budgets, we can sometimes compute a sensible suggested value. But
        // even without that, we can give a suggestion (10 minutes?) that
        // prevents well-behaved clients from hammering us.
        err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
        err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{Type: "DisruptionBudget", Message: fmt.Sprintf("The disruption budget %s is still being processed by the server.", pdb.Name)})
        return err
    }
    if pdb.Status.PodDisruptionsAllowed < 0 {
        return errors.NewForbidden(policy.Resource("poddisruptionbudget"), pdb.Name, fmt.Errorf("pdb disruptions allowed is negative"))
    }
    if len(pdb.Status.DisruptedPods) > MaxDisruptedPodSize {
        return errors.NewForbidden(policy.Resource("poddisruptionbudget"), pdb.Name, fmt.Errorf("DisruptedPods map too big - too many evictions not confirmed by PDB controller"))
    }
    if pdb.Status.PodDisruptionsAllowed == 0 {
        err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
        err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{Type: "DisruptionBudget", Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy)})
        return err
    }
    pdb.Status.PodDisruptionsAllowed--
    if pdb.Status.DisruptedPods == nil {
        pdb.Status.DisruptedPods = make(map[string]metav1.Time)
    }
    // Eviction handler needs to inform the PDB controller that it is about to delete a pod
    // so it should not consider it as available in calculations when updating PodDisruptions allowed.
    // If the pod is not deleted within a reasonable time limit PDB controller will assume that it won't
    // be deleted at all and remove it from DisruptedPod map.
    pdb.Status.DisruptedPods[podName] = metav1.Time{Time: time.Now()}
    if _, err := r.podDisruptionBudgetClient.PodDisruptionBudgets(namespace).UpdateStatus(&pdb); err != nil {
        return err
    }
    return nil
}

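checkAndDecrement mainly checks that pdb.Status.PodDisruptionsAllowed is greater than 0 and that DisruptedPods does not hold more than 2000 pods (the Disruption Controller may not perform well enough to handle that many).

If the checks pass, it decrements pdb.Status.PodDisruptionsAllowed by 1 and adds the pod to the DisruptedPods map, with the current time as the value (the time at which the apiserver accepted the eviction request).

It then updates the PDB. Because the PDB controller watches PDB update events, this triggers the controller's logic, which reconciles the PDB status once more.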

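Note: PDB is also used in the scheduler. During preemptive scheduling based on pod priority, when generic_scheduler preempts pods it checks all pods on a node against their PDBs and counts how many would violate a PDB. When selecting a node, it prefers the node with the fewest PDB-violating pods.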
