What to do when the CJQ process in an Oracle 11.2.0.3 database causes row cache lock waits and a job cannot be stopped
1. Symptom Description
Application staff on a production system reported that a job they had submitted was not running, leaving the related tables without data. Their attempts to stop the job failed, and they could not re-submit it either.
2. Problem Analysis
Scheduler job sessions normally connect with the module 'DBMS_SCHEDULER', so log in with a PL/SQL tool and run the following statement to check them:
select S.STATUS,
S.INST_ID,
S.SID,
S.PADDR,
S.SQL_ID,
S.SQL_EXEC_START,
S.ACTION,
S.LAST_CALL_ET,
S.BLOCKING_INSTANCE,
S.BLOCKING_SESSION,
S.EVENT,
S.WAIT_CLASS
from v$session S
where module = 'DBMS_SCHEDULER';
The query showed one job session that had been killed: its status was KILLED, but it had not been released. After finding its spid with the statement below, we logged in to the server and killed the OS process with kill -9.
Select p.spid from v$session s, v$process p where p.addr = s.paddr and s.sid = 508;
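The OS-level kill was done roughly as follows (a minimal sketch; <spid> stands for the value returned by the query above):
$ ps -ef | grep <spid>    # confirm the process really is the killed session's server process
$ kill -9 <spid>          # force-terminate it; PMON then cleans up the dead session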
The application then submitted the job again, but it still would not run. On the server we saw that the job session was waiting on the row cache lock event, the blocking session was 1209, and the P1 value was 11. The statement below returned no rows, so we could not tell from v$rowcache where the row cache lock was occurring:
Select * from v$rowcache where cache#=11;
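For reference, the P1 value of a row cache lock wait is the dictionary cache id, and the id-to-name mapping can be listed with a query along these lines (a sketch; the systemstate dump taken later in this case confirms that cache id 11 is dc_objects here):
select distinct cache#, parameter
  from v$rowcache
 order by cache#;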
Querying session 1209 showed that it was the background process CJQ0. CJQn and Jnnn are both Oracle job queue processes, so why would a background process be locking an ordinary job?
Select * from v$session where sid=1209;
Looking at the job list in PL/SQL Developer, the application team noticed the following anomaly (the original screenshot, which still included the task CSH_BRANCH_D_1DAY, was captured only after the issue had been handled and did not survive): many jobs appeared to be running, yet none of them had a session_id, so they looked like zombie jobs.
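Since the screenshot is not available, roughly the same picture can be obtained from the standard DBA_SCHEDULER_RUNNING_JOBS view (a sketch; running jobs whose SESSION_ID is NULL are the suspicious ones):
select owner, job_name, session_id, running_instance, slave_process_id
  from dba_scheduler_running_jobs
 order by owner, job_name;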
Since the jobs could not be killed through their sessions, we tried to stop them with the dbms_scheduler package, but the stop call also hung: it too waited on row cache lock and was again blocked by background session 1209.
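The stop attempt was essentially the following (a sketch; whether force => true was used in the incident is an assumption):
SQL> exec dbms_scheduler.stop_job(job_name => 'CMBD.CSH_BRANCH_D_1DAY_JOB', force => true);
In this case the call simply sat on the row cache lock wait instead of returning.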
A row cache lock wait cannot be tied to a specific locked object from P1 alone, so why was a background process locking the job? The preliminary conclusion from the results above was that a number of zombie jobs had never been stopped cleanly, so re-submitting the same job, or any attempt to operate on it, immediately ran into a lock, and the holder was the CJQn background process.
SQL> show parameter processes
Checking the parameters showed job_queue_processes = 1000 and aq_tm_processes = 10000, so neither limit had been reached.
Further analysis
Log in to the server and collect diagnostic information with the oradebug commands.
Connect to the database as sysdba:
$sqlplus / as sysdba
or
$sqlplus -prelim / as sysdba    (use this when the database is already so slow, or so hung, that a normal connection fails)
The collection used hanganalyze at level 3 and a systemstate dump at level 258 (the screenshots of the actual session did not survive; a sketch of the commands follows).
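A typical command sequence for this kind of collection, assuming the dumps are taken cluster-wide from one RAC node (the exact options used in the incident are not recorded):
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
-- hanganalyze is often taken a second time a minute or two later to confirm the chains
SQL> oradebug -g all dump systemstate 258
-- with -g all the output goes to trace files under the diagnostic destination of each instance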
Notes:
Meaning of the hanganalyze levels:
1-2: hanganalyze output only; no process is dumped
3: level 2 plus dumps of the processes in IN_HANG state
4: level 3 plus dumps of the blockers in the wait chains (state LEAF/LEAF_NW/IGN_DMP)
5: level 4 plus dumps of all processes in the wait chains (state NLEAF)
Oracle recommends not going above level 3; level 3 is usually enough to diagnose the problem, and higher levels put extra load on the system.
systemstate dump also has several levels:
2: dump (excluding lock elements)
10: dump
11: dump plus global cache (RAC)
256: short stack (function call stack)
258: 256+2, i.e. short stack plus dump (excluding lock elements)
266: 256+10, i.e. short stack plus dump
267: 256+11, i.e. short stack plus dump plus global cache (RAC)
Levels 11 and 267 dump the global cache and produce very large trace files, so they are generally not recommended.
Normally, if there are not too many processes, level 266 is recommended, because it dumps each process's function stack, which can be used to work out what the process is executing.
Generating short stacks is time-consuming, however. With a very large number of processes, say 2000, it can take more than 30 minutes; in that case take a level 10 or level 258 dump. Level 258 collects short stacks, which level 10 does not, but collects less lock element data than level 10.
Below is an excerpt of the hanganalyze output. The main wait is row cache lock: session id 7 / process id 128 is blocked by session id 1209 / process id 49, which matches the analysis above.
Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'rdbms ipc message'<='row cache lock'
Chain 1 Signature Hash: 0xfe4c0898
Chain 1:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 2 (cmbd.cmbd2)
os id: 43516004
process id: 128, oracle@jkptdb2
session id: 7
session serial #: 47563
}
is waiting for row cache lock with wait info:
{
p1: cache id =0xb
p2: mode =0x0
p3: request =0x5
time in wait: 0.077179 sec
heur. time in wait: 16 min 18 sec
timeout after: 2.922821 sec
wait id: 13467
blocking: 0 sessions
wait history:
* time between current wait and wait #1: 0.000045 sec
1. event: row cache lock
time waited: 3.000020 sec
wait id: 13466 p1: cache id =0xb
p2: mode =0x0
p3: request =0x5
* time between wait #1 and #2: 0.000030 sec
2. event: row cache lock
time waited: 3.000016 sec
wait id: 13465 p1: cache id =0xb
p2: mode =0x0
p3: request =0x5
* time between wait #2 and #3: 0.000032 sec
3. event: row cache lock
time waited: 3.000016 sec
wait id: 13464 p1: cache id =0xb
p2: mode =0x0
p3: request =0x5
}
and is blocked by
=> Oracle session identified by:
{
instance: 2 (cmbd.cmbd2)
os id: 19792004
process id: 49, oracle@jkptdb2 (CJQ0)
session id: 1209
session serial #: 9
}
which is waiting for rdbms ipc message with wait info:
{
p1: timeout =0x14
time in wait: 0.168064 sec
heur. time in wait: 2.171697 sec
timeout after: 0.031936 sec
wait id: 72756966
blocking: 1 session
wait history:
* time between current wait and wait #1: 0.000350 sec
1. event: rdbms ipc message
time waited: 0.200014 sec
wait id: 72756965 p1: timeout =0x14
* time between wait #1 and #2: 0.000304 sec
2. event: rdbms ipc message
time waited: 0.200016 sec
wait id: 72756964 p1: timeout =0x14
* time between wait #2 and #3: 0.000307 sec
3. event: rdbms ipc message
time waited: 0.200041 sec
wait id: 72756963 p1: timeout =0x14
}
Chain 1 Signature: 'rdbms ipc message'<='row cache lock'
Chain 1 Signature Hash: 0xfe4c0898
Below are excerpts from the systemstate dump output.
Blocking process (session id 1209):
PROCESS 49: CJQ0
The section below says, roughly: OS process 49, i.e. session id 1209, is blocking one session, sid 7, whose wait event is row cache lock. This matches the analysis above.
There are 1 sessions blocked by this session.
Dumping one waiter:
inst: 2, sid: 7, ser: 47563
wait event: row cache lock
p1: cache id =0xb
p2: mode =0x0
p3: request =0x5
row_wait_obj#: 520, block#: 11728, row#: 0, file# 1
min_blocked_time: 0 secs, waiter_cache_ver: 21656
Below we can see that the CJQ0 process holds an S (shared) lock, and the object is CSH_BRANCH_D_1DAY_JOB, exactly the job we were trying to stop above.
kssdmh: so 0x7000001b029ced8 (type 75, row cache enqueues) may not have children
----------------------------------------
SO: 0x7000001b02dd868, type: 75, owner: 0x700000191e0d808, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x7000001aa6dc268, name=row cache enqueues, file=kqr.h LINE:2064 ID:, pg=0
row cache enqueue: count=2 session=7000001adb58460 object=7000001810e7238, mode=S
savepoint=0x351e5fa5
row cache parent object: address=7000001810e7238 cid=11(dc_objects)
hash=af8ef272 typ=21 transaction=0 flags=00000002
objectno=352585 ownerid=57 nsp=1
name=CSH_BRANCH_D_1DAY_JOB own=7000001810e7300[7000001b02dd910,7000001b02dd910] wat=7000001810e7310[7000001b2c1bf00,7000001b2c1bf00] mode=S
status=VALID/-/-/-/-/-/-/-/-
request=N release=FALSE flags=8
instance lock=QI af8ef272 41d157e6
set=0, complete=FALSE
data=
00000039 00154353 485f4252 414e4348 5f445f31 4441595f 4a4f4200 00000000
00000000 01000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Blocked process (session id 7):
PROCESS 128:
----------------------------------------
Short stack dump:
ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-47dc<-sskgpwwait()+32<-skgpwwait()+180
<-ksliwat()+11032<-kslwaitctx()+180<-kqrigt()+1424<-kqrLockAndPinPo()+532<-kqrpre1()+864<-kqrpre()+28
<-jskaObjGet()+1024<-jskaJobRun()+2788<-jsiRunJob()+776<-jsaRunJob()+780<-spefcmpa()+600
<-spefmccallstd()+1340<-pextproc()+168<-peftrusted()+164<-psdexsp()+276<-rpiswu2()+480
<-psdextp()+632<-pefccal()+588<-pefcal()+260<-pevm_FCAL()+144<-pfrinstr_FCAL()+96
<-pfrrun_no_tool()+96<-pfrrun()+1032<-plsql_run()+716<-peicnt()+320<-kkxexe()+508
<-opiexe()+17168<-kpoal8()+1856<-opiodr()+720<-ttcpip()+1028<-opitsk()+1508<-opiino()+940
<-opiodr()+720<-opidrv()+1132<-sou2o()+136<-opimai_real()+608<-ssthrdmain()+268
<-main()+204<-__start()+112
It is currently waiting on the row cache lock event:
Current Wait Stack:
0: waiting for row cache lock
cache id=0xb, mode=0x0, request=0x5
wait_id=13602 seq_num=13603 snap_id=1
wait times: snap=0.003162 sec, exc=0.003162 sec, total=0.003162 sec
wait times: max=3.000000 sec, heur=23 min 0 sec
wait counts: calls=1 os=1
in_wait=1 iflags=0x5a2
The output below shows this process requesting an X (exclusive) lock, and the requested object is again CSH_BRANCH_D_1DAY_JOB, which is consistent with the analysis above: the CJQ process holds an S lock on CSH_BRANCH_D_1DAY_JOB, while the current process needs an X lock on it in order to operate on the job.
SO: 0x7000001b2c1be58, type: 75, owner: 0x70000019c72b078, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x7000001ab6cb410, name=row cache enqueues, file=kqr.h LINE:2064 ID:, pg=0
row cache enqueue: count=1 session=7000001ab826ce0 object=7000001810e7238, request=X
savepoint=0x893a
row cache parent object: address=7000001810e7238 cid=11(dc_objects)
hash=af8ef272 typ=21 transaction=0 flags=00000002
objectno=352585 ownerid=57 nsp=1
name=CSH_BRANCH_D_1DAY_JOB own=7000001810e7300[7000001b02dd910,7000001b02dd910] wat=7000001810e7310[7000001b2c1bf00,7000001b2c1bf00] mode=S
status=VALID/-/-/-/-/-/-/-/-
request=N release=FALSE flags=8
instance lock=QI af8ef272 41d157e6
set=0, complete=FALSE
data=
00000039 00154353 485f4252 414e4348 5f445f31 4441595f 4a4f4200 00000000
00000000 01000000 00000000 00000000 00000000 00000000 00000000 00000000
011044c6 00000000 07000001 810e7300
The statement being executed is also recorded:
begin sys.dbms_scheduler.run_job('CMBD.CSH_BRANCH_D_1DAY_JOB', false); end;
----------------------------------------
SO: 0x70000015d5f35c8, type: 78, owner: 0x7000001ab826ce0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x7000001ab6cb410, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8548 ID:, pg=0
LibraryObjectLock: Address=70000015d5f35c8 Handle=7000000ba27f998 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=1
User=7000001ab826ce0 Session=7000001ab826ce0 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=5a52c6da
LibraryHandle: Address=7000000ba27f998 Hash=a53cfcdd LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName:Name=begin sys.dbms_scheduler.run_job('CMBD.CSH_BRANCH_D_1DAY_JOB', false); end;
3. Problem Identification
Searching My Oracle Support (MetaLink) for the keywords row cache lock cjq eventually led to Bug 16994952, "Unable to unschedule propagation due to CJQ self-deadlock/hang". The bug description matches the analysis above.
Bug 16994952 Unable to unschedule propagation due to CJQ self-deadlock / hang
This note gives a brief overview of bug 16994952.
The content was last updated on: 23-AUG-2015
Affects:
Product (Component)
Oracle Server (Rdbms)
Range of versions believed to be affected
(Not specified)
Versions confirmed as being affected
· 11.2.0.3
Platforms affected
Generic (all / most platforms affected)
Fixed:
The fix for 16994952 is first included in
· 12.1.0.1 (Base Release)
Interim patches may be available for earlier versions.
Symptoms:
Related To:
· Deadlock
· Hang (Process Hang)
· Waits for row cache lock
· Stack is likely to include jskaObjGet
· Advanced Queuing
· Job Queues
Description
Unable to unschedule propagation due to CJQ self-deadlock / hang.
Rediscovery Information:
This bug may be the cause if the CJQ0 process is stuck in a
self-deadlock waiting on row cache lock for a dc_objects
entry in X mode for an object that it already holds in S mode.
The stack is likely to include jskaObjGet
Workaround:
Kill the hung CJQ process
Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.
References
Bug:16994952 (This link will only work for PUBLISHED bugs)
Note:245840.1 Information on the sections in this article
4. Problem Resolution
Following the workaround, kill the hung CJQ process:
SQL> select spid from v$session s, v$process p where p.addr = s.paddr and s.sid = 1209;
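With the spid in hand, the CJQ0 OS process was killed from the server. A minimal sketch (<spid> stands for the value returned above; CJQ0 is a restartable background process, so the instance should start a new job queue coordinator on its own afterwards):
$ kill -9 <spid>    # terminate the hung CJQ0 coordinator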
After that, stopping the job with the dbms_scheduler package worked normally. Once the zombie jobs had been stopped one by one, the application could submit the job again and everything returned to normal.