How Linux Detects Processes Deadlocked in the D State

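Linux processes can be in several states, such as TASK_RUNNING (running or runnable), EXIT_DEAD (exited) and TASK_INTERRUPTIBLE (sleeping, but wakeable by signals); see include/linux/sched.h. One of the sleeping states is TASK_UNINTERRUPTIBLE, known as the D state. A process in this state does not respond to signals and can only be woken by an explicit wake_up. There are many ways to end up in it: waiting on a contended mutex can put a process into the D state, and so can waiting for an I/O resource to become ready (the wait_event mechanism). Normally a process does not stay in this state for long, but if an I/O device fails or processes deadlock, a process may remain in the D state indefinitely and never return to TASK_RUNNING. To make such cases easier to spot, the kernel provides the hung task mechanism, which detects processes that have been in the D state for a long time and raises a warning. This article walks through the source of the hung task mechanism and then demonstrates it with an example.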

I. Analysis of the hung task mechanism

The hung task mechanism was introduced into the kernel quite early. This article analyzes the Linux 4.1.15 source. There is not much code, and all of it lives in kernel/hung_task.c.

The overall flow and the design idea come first:

Figure: flowchart of D-state deadlock detection

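The core idea is to create a kernel thread that periodically examines every process (task) in the D state and compares its context-switch count between two consecutive checks. If a task's count has not changed between two checks, the task has been in the D state the whole time and is very likely deadlocked. The kernel then logs a warning that contains the task's basic information, a stack backtrace and the saved register information, so that kernel developers can locate the problem.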

The implementation is analyzed in detail below:

    static int __init hung_task_init(void)
    {
        atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
        watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");

        return 0;
    }
    subsys_initcall(hung_task_init);

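If the mechanism is enabled in the kernel configuration, hung_task_init() is called during the subsys initcall stage to turn the feature on. It first registers a callback on the kernel's panic_notifier_list notifier chain: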

    static struct notifier_block panic_block = {
        .notifier_call = hung_task_panic,
    };

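When the kernel panics, hung_task_panic() is called; what it does is covered later. Initialization then continues by calling kthread_run() to create a kernel thread named khungtaskd that runs watchdog() and is woken up right away. This is the background kernel thread dedicated to detecting tasks deadlocked in the D state.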

    /*
     * kthread which checks for tasks stuck in D state
     */
    static int watchdog(void *dummy)
    {
        set_user_nice(current, 0);

        for ( ; ; ) {
            unsigned long timeout = sysctl_hung_task_timeout_secs;

            while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
                timeout = sysctl_hung_task_timeout_secs;

            if (atomic_xchg(&reset_hung_task, 0))
                continue;

            check_hung_uninterruptible_tasks(timeout);
        }

        return 0;
    }

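The thread first sets its nice value to 0, i.e. normal priority, so that it does not affect other processes. It then enters the main loop, which runs once per timeout period. Each iteration starts by putting the thread to sleep. The sleep length is taken from sysctl_hung_task_timeout_secs, which is initialized from CONFIG_DEFAULT_HUNG_TASK_TIMEOUT; it can be changed through that kernel configuration option and defaults to 120 s. If the thread is woken before the timeout expires, it re-reads the timeout and goes back to sleep. After a full sleep it checks the atomic flag reset_hung_task: if the flag is set, this round of detection is skipped and the flag is cleared in the same operation. The flag is set through reset_hung_task_detector(), an interface that very few places in the kernel call: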

    void reset_hung_task_detector(void)
    {
        atomic_set(&reset_hung_task, 1);
    }
    EXPORT_SYMBOL_GPL(reset_hung_task_detector);
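Two details of the watchdog loop can be tried out in user space. The following is a minimal sketch rather than the kernel code: every `sketch_` name, `SKETCH_HZ` (an assumed tick rate of 100) and `SKETCH_MAX_TIMEOUT` are hypothetical, and the rule that a timeout of 0 means sleeping indefinitely is inferred from the kernel's own hint that writing 0 to hung_task_timeout_secs disables the message. C11 `atomic_exchange` stands in for the kernel's atomic_xchg().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Assumed tick rate, for illustration only; the real HZ is a build option. */
#define SKETCH_HZ 100UL
/* Stand-in for "sleep with no timeout" (hypothetical name). */
#define SKETCH_MAX_TIMEOUT ((unsigned long)LONG_MAX)

/* Convert a timeout in seconds to ticks; 0 seconds means "never time out". */
static unsigned long sketch_timeout_jiffies(unsigned long secs)
{
    assert(secs <= SKETCH_MAX_TIMEOUT / SKETCH_HZ); /* guard against overflow */
    return secs ? secs * SKETCH_HZ : SKETCH_MAX_TIMEOUT;
}

/* User-space stand-in for the reset_hung_task flag. */
static atomic_int sketch_reset_flag;

/* Ask the detector to skip its next round. */
static void sketch_reset_detector(void)
{
    atomic_store(&sketch_reset_flag, 1);
}

/*
 * Read the flag and clear it in one atomic step.
 * Returns 1 if the current round should be skipped, 0 otherwise.
 */
static int sketch_should_skip_round(void)
{
    return atomic_exchange(&sketch_reset_flag, 0);
}
```

Because the read and the clear happen in a single exchange, a request made while the thread sleeps is honored exactly once: the next round is skipped and the one after that runs normally.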

The last step of the loop is the detection function check_hung_uninterruptible_tasks(), which takes the detection timeout as its argument.

    /*
     * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
     * a really long time (120 seconds). If that happens, print out
     * a warning.
     */
    static void check_hung_uninterruptible_tasks(unsigned long timeout)
    {
        int max_count = sysctl_hung_task_check_count;
        int batch_count = HUNG_TASK_BATCHING;
        struct task_struct *g, *t;

        /*
         * If the system crashed already then all bets are off,
         * do not report extra hung tasks:
         */
        if (test_taint(TAINT_DIE) || did_panic)
            return;

        rcu_read_lock();
        for_each_process_thread(g, t) {
            if (!max_count--)
                goto unlock;
            if (!--batch_count) {
                batch_count = HUNG_TASK_BATCHING;
                if (!rcu_lock_break(g, t))
                    goto unlock;
            }
            /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
            if (t->state == TASK_UNINTERRUPTIBLE)
                check_hung_task(t, timeout);
        }
     unlock:
        rcu_read_unlock();
    }

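The function first checks whether the kernel has already died (TAINT_DIE) or panicked. If so, the kernel has crashed, further detection is pointless, and the function simply returns. The did_panic flag used here is set in hung_task_panic(), the panic notifier callback mentioned earlier: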

    static int hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
    {
        did_panic = 1;

        return NOTIFY_DONE;
    }

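If the kernel has not crashed, the function starts the detection pass and walks every process (task) in the system. The walk runs under the RCU read lock, so to avoid holding the lock for too long when there are many tasks, a batch_count is used: at most HUNG_TASK_BATCHING tasks are examined per lock hold, after which rcu_lock_break() briefly releases the lock and gives the scheduler a chance to run. The user can also cap the total number of tasks examined per round through max_count = sysctl_hung_task_check_count, which defaults to the maximum number of PIDs, PID_MAX_LIMIT, and can be changed with the sysctl command.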
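How the two counters interact is easiest to see by running the loop on plain integers. This is a minimal user-space sketch, not the kernel code: `sketch_scan` and `struct scan_stats` are hypothetical names, `SKETCH_BATCHING` is set to 1024 on the assumption that this matches HUNG_TASK_BATCHING (the excerpt does not show the value), and the case where rcu_lock_break() fails and aborts the walk is left out.

```c
#include <assert.h>

/* Assumed to match HUNG_TASK_BATCHING; the excerpt does not show the value. */
#define SKETCH_BATCHING 1024

struct scan_stats {
    int visited;      /* tasks that were actually examined */
    int lock_breaks;  /* times the RCU read lock would have been dropped */
};

/*
 * Walk `ntasks` imaginary tasks with the same counter logic as the
 * kernel loop: stop after `max_count` tasks, and take a break every
 * SKETCH_BATCHING tasks.
 */
static struct scan_stats sketch_scan(int ntasks, int max_count)
{
    struct scan_stats s = { 0, 0 };
    int batch_count = SKETCH_BATCHING;

    assert(ntasks >= 0);
    for (int i = 0; i < ntasks; i++) {
        if (!max_count--)        /* overall cap reached: end this round */
            break;
        if (!--batch_count) {    /* batch used up: lock would be released */
            batch_count = SKETCH_BATCHING;
            s.lock_breaks++;
        }
        s.visited++;
    }
    return s;
}
```

With 3000 tasks and a cap of 2500, the sketch examines 2500 tasks and takes two breaks (at the 1024th and 2048th task), which shows that the cap limits the work per round while the batch size only limits how long the lock is held at a stretch.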

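The function uses the for_each_process_thread() macro to iterate over every process (task) in the system. Only tasks whose state is exactly TASK_UNINTERRUPTIBLE are checked for a timeout, by calling check_hung_task() with the task_struct and the timeout (120 s):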

    static void check_hung_task(struct task_struct *t, unsigned long timeout)
    {
        unsigned long switch_count = t->nvcsw + t->nivcsw;

        /*
         * Ensure the task is not frozen.
         * Also, skip vfork and any other user process that freezer should skip.
         */
        if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
            return;

        /*
         * When a freshly created task is scheduled once, changes its state to
         * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
         * musn't be checked.
         */
        if (unlikely(!switch_count))
            return;

        if (switch_count != t->last_switch_count) {
            t->last_switch_count = switch_count;
            return;
        }

        trace_sched_process_hang(t);

        if (!sysctl_hung_task_warnings)
            return;

        if (sysctl_hung_task_warnings > 0)
            sysctl_hung_task_warnings--;

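The sum of t->nvcsw and t->nivcsw is the total number of context switches the task has gone through since it was created: t->nvcsw counts the times the task gave up the CPU voluntarily, and t->nivcsw counts the times it was preempted. The function then applies two filters: (1) a task that has been frozen is skipped; (2) a task whose switch count is 0 is not checked.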
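The same two counters are visible from user space in /proc/<pid>/status as the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields, so the switch count can be reproduced outside the kernel. The helpers below are a minimal sketch: `status_field` and `status_switch_count` are hypothetical names, and the caller is assumed to have already read the whole status file into a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find a line that starts with "<key>:" in the text of a
 * /proc/<pid>/status file and return the number that follows it.
 * Returns -1 if the key is missing or has no numeric value.
 */
static long status_field(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *p = text;

    while (p != NULL && *p != '\0') {
        if (strncmp(p, key, klen) == 0 && p[klen] == ':') {
            long value;
            /* %ld skips the tab that separates key and value */
            if (sscanf(p + klen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        p = strchr(p, '\n');   /* move to the start of the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}

/* User-space counterpart of nvcsw + nivcsw; -1 if either field is missing. */
static long status_switch_count(const char *text)
{
    long voluntary = status_field(text, "voluntary_ctxt_switches");
    long involuntary = status_field(text, "nonvoluntary_ctxt_switches");

    if (voluntary < 0 || involuntary < 0)
        return -1;
    return voluntary + involuntary;
}
```

Sampling this sum twice, one timeout period apart, and comparing the two values gives a user-space approximation of the check the kernel thread performs.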

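Next, the function compares the switch count saved at the previous check with the current one. If they differ, the task was scheduled during this timeout period (120 s), so the saved value is updated and the function returns. If they are equal, the task has not been scheduled for a full timeout period and has been in the D state the whole time. trace_sched_process_hang() then fires the sched_process_hang tracepoint so that tracing tools can observe the event. After that comes the sysctl_hung_task_warnings check. This value is the number of warnings that may still be issued; it can be configured with the sysctl command and defaults to 10. So if the task being checked stays in the D state, by default a warning is issued every 2 minutes, 10 times in total, and none after that. The counter is shared by all tasks rather than kept per task, and because only positive values are decremented, a negative value never runs out. The warning code comes next: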

        /*
         * Ok, the task did not get scheduled for more than 2 minutes,
         * complain:
         */
        pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
            t->comm, t->pid, timeout);
        pr_err("      %s %s %.*s\n",
            print_tainted(), init_utsname()->release,
            (int)strcspn(init_utsname()->version, " "),
            init_utsname()->version);
        pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
            " disables this message.\n");
        sched_show_task(t);
        debug_show_held_locks(t);

        touch_nmi_watchdog();

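This prints to the console and the log the hung task's name, its PID, the timeout, the kernel taint flags, the kernel release and version, the kernel stack backtrace and register information. If lock debugging is enabled, the locks held by the task are printed as well. Finally, touch_nmi_watchdog() is called to keep the NMI watchdog from timing out (in my ARM environment the NMI watchdog is not a concern).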

        if (sysctl_hung_task_panic) {
            trigger_all_cpu_backtrace();
            panic("hung_task: blocked tasks");
        }
    }

Finally, if sysctl_hung_task_panic is set, the kernel panics immediately (the value can be set in the kernel configuration or through sysctl).

II. Demonstration

Test environment: Raspberry Pi B (Linux 4.1.15)

1. First, check the kernel configuration options to confirm that the hung task mechanism (CONFIG_DETECT_HUNG_TASK) is enabled

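    /* Test module (loaded below as dlock.ko) that deadlocks on purpose. */
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/mutex.h>

    DEFINE_MUTEX(dlock);

    static int __init dlock_init(void)
    {
        mutex_lock(&dlock);
        /* Locking the same mutex again can never succeed: deliberate deadlock. */
        mutex_lock(&dlock);

        return 0;
    }

    static void __exit dlock_exit(void)
    {
        return;
    }

    module_init(dlock_init);
    module_exit(dlock_exit);
    MODULE_LICENSE("GPL");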

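This example defines a mutex and then locks it twice in the module's init function, deliberately creating a deadlock (mutex_lock() calls __mutex_lock_slowpath(), which puts the process into the TASK_UNINTERRUPTIBLE state). Once the process has entered the D state it cannot get out again. It can be found with the ps command: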

    root@apple:~# busybox ps
    PID   USER     TIME   COMMAND
    ......
      521 root       0:00 insmod dlock.ko
    ......

Looking at the status of that process shows that it has entered the D state.

    root@apple:~# cat /proc/521/status
    Name:   insmod
    State:  D (disk sleep)
    Tgid:   521
    Ngid:   0
    Pid:    521
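The same check can be done programmatically by picking the State line out of the status text. The helper below is a minimal sketch: `status_state` is a hypothetical name, and the caller is assumed to have read /proc/<pid>/status into a string already.

```c
#include <assert.h>
#include <string.h>

/*
 * Return the state letter from the text of a /proc/<pid>/status file,
 * for example 'D' for "State:\tD (disk sleep)".
 * Returns 0 if there is no State line or it carries no letter.
 */
static char status_state(const char *text)
{
    assert(text != NULL);
    const char *p = text;

    while (p != NULL && *p != '\0') {
        if (strncmp(p, "State:", 6) == 0) {
            p += 6;
            while (*p == ' ' || *p == '\t')   /* skip the separator */
                p++;
            return (*p == '\n') ? 0 : *p;
        }
        p = strchr(p, '\n');   /* move to the start of the next line */
        if (p != NULL)
            p++;
    }
    return 0;
}
```

A monitoring script built on this would treat a return value of 'D' as "uninterruptible sleep", which is the state the hung task detector looks for.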

After waiting two minutes, the debug serial console prints the following, and the report is repeated every two minutes:

    [ 360.625466] INFO: task insmod:521 blocked for more than 120 seconds.
    [ 360.631878] Tainted: G O 4.1.15 #5
    [ 360.637042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 360.644986] [] (__schedule) from [] (schedule+0x40/0xa4)
    [ 360.652129] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
    [ 360.660570] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
    [ 360.670142] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
    [ 360.678432] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
    [ 360.686480] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
    [ 360.694976] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
    [ 360.703170] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
    [ 360.711284] [] (load_module) from [] (SyS_init_module+0xdc/0x130)
    [ 360.719239] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)
    [ 480.725351] INFO: task insmod:521 blocked for more than 120 seconds.
    [ 480.731759] Tainted: G O 4.1.15 #5
    [ 480.736917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 480.744842] [] (__schedule) from [] (schedule+0x40/0xa4)
    [ 480.752029] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
    [ 480.760479] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
    [ 480.770066] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
    [ 480.778363] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
    [ 480.786402] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
    [ 480.794897] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
    [ 480.803085] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
    [ 480.811188] [] (load_module) from [] (SyS_init_module+0xdc/0x130)
    [ 480.819113] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)
    [ 600.825353] INFO: task insmod:521 blocked for more than 120 seconds.
    [ 600.831759] Tainted: G O 4.1.15 #5
    [ 600.836916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 600.844865] [] (__schedule) from [] (schedule+0x40/0xa4)
    [ 600.852005] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
    [ 600.860445] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
    [ 600.870014] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
    [ 600.878303] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
    [ 600.886339] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
    [ 600.894835] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
    [ 600.903023] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
    [ 600.911133] [] (load_module) from [] (SyS_init_module+0xdc/0x130)
    [ 600.919059] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)
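When such reports need to be collected automatically, the first line of each report carries everything needed to identify the task. The parser below is a minimal sketch based only on the message format shown in this article: `struct hung_report` and `parse_hung_line` are hypothetical names, and the 32-byte name buffer is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct hung_report {
    char comm[32];   /* task name (buffer size chosen for this sketch) */
    int pid;
    long seconds;    /* the "blocked for more than N seconds" value */
};

/*
 * Parse a line such as
 *   "[ 360.625466] INFO: task insmod:521 blocked for more than 120 seconds."
 * Returns 0 on success, -1 if the line is not a hung task report.
 */
static int parse_hung_line(const char *line, struct hung_report *out)
{
    static const char head[] = "INFO: task ";
    static const char tail[] = " blocked for more than ";

    assert(line != NULL && out != NULL);

    const char *name = strstr(line, head);
    if (name == NULL)
        return -1;
    name += strlen(head);

    const char *rest = strstr(name, tail);
    if (rest == NULL)
        return -1;

    /* A task name may itself contain ':', so split at the last one. */
    const char *colon = NULL;
    for (const char *c = name; c < rest; c++)
        if (*c == ':')
            colon = c;
    if (colon == NULL || colon == name)
        return -1;

    size_t len = (size_t)(colon - name);
    if (len >= sizeof(out->comm))
        len = sizeof(out->comm) - 1;   /* truncate over-long names */
    memcpy(out->comm, name, len);
    out->comm[len] = '\0';

    out->pid = (int)strtol(colon + 1, NULL, 10);
    if (sscanf(rest + strlen(tail), "%ld seconds", &out->seconds) != 1)
        return -1;
    return 0;
}
```

Feeding the log above through this function line by line would yield three reports for insmod with PID 521, one per two-minute interval, while the taint, hint and backtrace lines are rejected with -1.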
