How to Use Trace-Event to Fix a System That Cannot Enter Deep Sleep



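I recently ran into a problem: the system could not reach the C7s C-state and only got as deep as C3. (For CPU C-states, C0 is the running state and every other state is an idle state; the larger the number, the deeper the sleep.)

My first guess was that the system was simply busy, so I used perf top to look at the CPU-hungry processes and hot functions: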

    perf top -E 100 --stdio > perf-top.txt
      19.85%  perf             [.] __symbols__insert
       7.68%  perf             [.] rb_next
       4.60%  libc-2.26.so     [.] __strcmp_sse2_unaligned
       4.20%  libelf-0.168.so  [.] gelf_getsym
       3.92%  perf             [.] dso__load_sym
       3.86%  libc-2.26.so     [.] _int_malloc
       3.60%  libc-2.26.so     [.] __libc_calloc
       3.30%  libc-2.26.so     [.] vfprintf
       2.95%  perf             [.] rb_insert_color
       2.61%  [kernel]         [k] prepare_exit_to_usermode
       2.51%  perf             [.] machine__map_x86_64_entry_trampolines
       2.31%  perf             [.] symbol__new
       2.22%  [kernel]         [k] do_syscall_64
       2.11%  libc-2.26.so     [.] __strlen_avx2

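The only thing using noticeable CPU turned out to be perf itself :(

The next suspicion was that some process was preventing the CPUs from reaching C7s. To check, I used trace events to watch the sched_switch (context switch) event, recording it on all CPUs for 30 seconds with trace-cmd: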

    # trace-cmd record -e sched:sched_switch -M -1 sleep 30
    CPU0 data recorded at offset=0x63e000
        102400 bytes in size
    CPU1 data recorded at offset=0x657000
        8192 bytes in size
    CPU2 data recorded at offset=0x659000
        20480 bytes in size
    CPU3 data recorded at offset=0x65e000
        20480 bytes in size

trace-cmd report shows the recorded events, but the raw data is hard to read because it has no per-process count of how often each task was switched in:

    # trace-cmd report
    cpus=4
    trace-cmd-19794   [001] 225127.464466: sched_switch: trace-cmd:19794 [120] S ==> swapper/1:0 [120]
    trace-cmd-19795   [003] 225127.464601: sched_switch: trace-cmd:19795 [120] S ==> swapper/3:0 [120]
    sleep-19796       [002] 225127.464792: sched_switch: sleep:19796 [120] S ==> swapper/2:0 [120]
    <idle>-0          [003] 225127.471948: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
    rcu_sched-11      [003] 225127.471950: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
    <idle>-0          [003] 225127.479959: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
    rcu_sched-11      [003] 225127.479960: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
    <idle>-0          [003] 225127.487959: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
    rcu_sched-11      [003] 225127.487961: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
    <idle>-0          [002] 225127.491959: sched_switch: swapper/2:0 [120] R ==> kworker/2:2:19735 [120]
    kworker/2:2-19735 [002] 225127.491972: sched_switch: kworker/2:2:19735 [120] W ==> swapper/2:0 [120]

Filter the trace-cmd report output with a regular expression to keep only the task being switched in, then sort and count:

    trace-cmd report | grep -o '==> [^ ]\+:\?' | sort | uniq -c
          3 ==> irqbalance:1034
          3 ==> khugepaged:43
         20 ==> ksoftirqd/0:10
          1 ==> ksoftirqd/1:18
         18 ==> ksoftirqd/3:30
          1 ==> kthreadd:19798
          1 ==> kthreadd:2
          4 ==> kworker/0:0:19785
          1 ==> kworker/0:1:19736
          5 ==> kworker/0:1:19798
          5 ==> kworker/0:1H:364
         53 ==> kworker/0:2:19614
         19 ==> kworker/1:1:7665
         30 ==> tuned:19498
    ...

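One thread stands out: tuned was switched in 30 times in 30 seconds, roughly once per second. All the other entries are ordinary kernel and system threads.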
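The filter, sort and count step can also be sketched as a single awk pass that ranks the switched-in tasks by count, so the busiest ones appear first. This is only an illustration: the function name count_switch_targets is made up here, and it assumes the `==>` text layout shown in the report above and task names without spaces.

```shell
# Hypothetical helper: count how often each task is switched in.
# Reads `trace-cmd report` text on stdin, prints "COUNT TASK:PID",
# most frequent first.
count_switch_targets() {
    awk '{
        # The field right after "==>" is the task being switched in.
        for (i = 1; i < NF; i++)
            if ($i == "==>") { count[$(i + 1)]++; break }
    }
    END {
        for (t in count) printf "%d %s\n", count[t], t
    }' | sort -k1,1nr -k2
}
```

A possible invocation is `trace-cmd report | count_switch_targets | head`. Idle tasks such as swapper/N will usually dominate the top of the list and can be ignored.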

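Next, check whether the tuned service is enabled on the system:

Sure enough, the tuned service is enabled, and it has started a thread named tuned.

Look at the configuration of the tuned service: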

    localhost:/home/jeff # tuned-adm active
    Current active profile: sap-hana
    localhost:/home/jeff # cat /usr/lib/tuned/sap-hana/tuned.conf
    [main]
    summary=Optimize for SAP NetWeaver, SAP HANA and HANA based products
    [cpu]
    force_latency = 70

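The [cpu] section sets force_latency = 70, a forced latency limit of 70 microseconds. The setting is there to optimize for the HANA database.

How does force_latency actually take effect? After some searching, it turns out that this value is written to /dev/cpu_dma_latency.

Running lsof /dev/cpu_dma_latency confirms that the tuned thread is holding this file open: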

    # lsof /dev/cpu_dma_latency
    COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
    tuned   18734 root    9w   CHR  10,60      0t0 11400 /dev/cpu_dma_latency

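The Linux kernel documentation also covers /dev/cpu_dma_latency: to set a value you must open the file, write the data, and then keep the file open. As soon as the file descriptor is released, the value reverts to the default. This matches the fact that lsof /dev/cpu_dma_latency shows an open descriptor.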

https://github.com/torvalds/linux/blob/v5.8/Documentation/trace/coresight/coresight-cpu-debug.rst

    As specified in the PM QoS documentation the requested parameter
    will stay in effect until the file descriptor is released. For example:

    # exec 3<> /dev/cpu_dma_latency; echo 0 >&3
    ...
    Do some work...
    ...
    # exec 3<&-
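The same hold-the-descriptor pattern can be wrapped in a small function that keeps a request in place only while a command runs. This is a sketch, not tuned's implementation: the name with_latency_limit is hypothetical, and it relies on my reading of the PM QoS documentation that a text write is parsed as a hexadecimal string in the 10-character form such as "0x12345678" (which is why a plain decimal 70 is not written).

```shell
# Hypothetical helper: hold a CPU latency request while a command runs.
# $1: limit in microseconds
# $2: PM QoS device node (normally /dev/cpu_dma_latency, needs root)
# remaining arguments: the command to run under the limit
with_latency_limit() {
    limit_us=$1
    dev=$2
    shift 2
    (
        # Open the node on fd 3; the request lives as long as fd 3 is open.
        exec 3>"$dev" || exit 1
        # Assumption: text input is read as hex, so format it as 0xXXXXXXXX.
        printf '0x%08x' "$limit_us" >&3
        "$@"
    )
    # Leaving the subshell closes fd 3, which releases the request.
}
```

For example, `with_latency_limit 70 /dev/cpu_dma_latency sleep 60` would, under these assumptions, keep the 70 microsecond limit in force for one minute and then let the value fall back to the default.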

Reading /dev/cpu_dma_latency shows that the value is indeed 70, matching force_latency = 70:

    localhost:/home/jeff # cat /dev/cpu_dma_latency | hexdump -Cv
    00000000  46 00 00 00                                       |F...|
    localhost:/home/jeff # echo $((0x46))
    70
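The hexdump shows the raw bytes of a 32-bit integer in host byte order, so the conversion to decimal can be done in one step with od instead of reading the bytes by hand. A minimal sketch, assuming an od that supports the -t option (as in GNU coreutils); the function name read_s32 is made up for this example.

```shell
# Hypothetical helper: print the first 4 bytes of a file as a signed
# 32-bit decimal integer in host byte order.
read_s32() {
    # -An: no offset column, -td4: signed 4-byte decimal, -N4: read 4 bytes
    od -An -td4 -N4 "$1" | tr -d ' '
}
```

On the system described here, `read_s32 /dev/cpu_dma_latency` would be expected to print 70 while tuned holds its request.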

Now look at the description and latency of each CPU idle state on the system:

    # cd /sys/devices/system/cpu/cpu0/cpuidle/
    # for state in * ; do
        echo -e "STATE: $state\tDESC: $(cat $state/desc)\tNAME: $(cat $state/name)\tLATENCY: $(cat $state/latency)\tRESIDENCY: $(cat $state/residency)"
      done

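The output below shows that the exit latency of C3 is 33 microseconds, while that of the next deeper state, C6, is 133 microseconds. With force_latency = 70, the system can therefore go no deeper than C3. (The latency is the time needed to wake from that idle state back to the running state.)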

    STATE: state0  DESC: CPUIDLE CORE POLL IDLE  NAME: POLL  LATENCY: 0    RESIDENCY: 0
    STATE: state1  DESC: MWAIT 0x00              NAME: C1    LATENCY: 2    RESIDENCY: 2
    STATE: state2  DESC: MWAIT 0x01              NAME: C1E   LATENCY: 10   RESIDENCY: 20
    STATE: state3  DESC: MWAIT 0x10              NAME: C3    LATENCY: 33   RESIDENCY: 100
    STATE: state4  DESC: MWAIT 0x20              NAME: C6    LATENCY: 133  RESIDENCY: 400
    STATE: state5  DESC: MWAIT 0x32              NAME: C7s   LATENCY: 166  RESIDENCY: 500
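The comparison between the latency limit and the table can be written down as a small filter. This is a simplified model, not the kernel's governor: it only applies the rule that a state whose exit latency exceeds the limit is excluded, and ignores the predicted idle duration and target residency that the real governor also weighs. The name deepest_allowed_state is hypothetical.

```shell
# Hypothetical helper: pick the deepest idle state allowed by a limit.
# $1: latency limit in microseconds
# stdin: one "NAME LATENCY" pair per line
deepest_allowed_state() {
    awk -v limit="$1" '
        # Keep the state with the largest latency that still fits the limit.
        $2 + 0 <= limit + 0 && $2 + 0 >= best { best = $2 + 0; name = $1 }
        END { if (name != "") print name }'
}

# Possible use against the sysfs files read above:
#   for s in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
#       echo "$(cat "$s/name") $(cat "$s/latency")"
#   done | deepest_allowed_state 70
```

Fed with the six states listed above, a limit of 70 selects C3, and the default of 2000000000 selects C7s.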

Now turn tuned off and read /dev/cpu_dma_latency again. The value is back to the default of 2000000000 microseconds (2000 seconds):

    localhost:/home/jeff # tuned-adm off
    localhost:/home/jeff # cat /dev/cpu_dma_latency | hexdump -Cv
    00000000  00 94 35 77                                       |..5w|
    localhost:/home/jeff # echo $((0x77359400))
    2000000000

A final check confirms that the system can now reach C7s, so the problem is solved :)
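One way to make that check concrete is to read the per-state entry counters that cpuidle exposes in sysfs. The sketch below assumes the usual layout, where each state directory holds a name file and a usage file; the function name idle_usage is made up for this example.

```shell
# Hypothetical helper: print "NAME USAGE" for every idle state of one CPU.
# $1: cpuidle directory, e.g. /sys/devices/system/cpu/cpu0/cpuidle
idle_usage() {
    for s in "$1"/state*; do
        # Skip anything that is not a readable state directory.
        [ -r "$s/name" ] || continue
        printf '%s %s\n' "$(cat "$s/name")" "$(cat "$s/usage")"
    done
}
```

Running it twice a few seconds apart and comparing the C7s line should show the counter increasing once the latency limit is lifted, and staying flat while the limit is in place.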

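The key tool in solving this problem was trace events, which the Linux kernel itself provides.

So no feature should be underestimated. The kernel is like that: a feature that looks dull at first can have a great deal of potential once engineers have polished it with real care.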
