一套AIX上的11.1.0.7系统,应用启动时出现大量row cache lock等待,具体的systemstate dump信息如下:
FILE VERSIONS
-----------------------
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/product/11.1.0/database
System name: AIX
Node name: HA5_4ADB01
Release: 3
Version: 5
Machine: 000687C2D900
Instance name: HN4A1
TRACE FILE
---------------------------
Filename=HN4A1_ora_2061038.trc
Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'row cache lock'<='row cache lock' (cycle)
Chain 1 Signature Hash: 0x75bdd0c
[b] Chain 2 Signature: 'row cache lock'<='row cache lock' (cycle)
Chain 2 Signature Hash: 0x75bdd0c
[c] Chain 3 Signature: 'row cache lock'<='row cache lock' (cycle)
Chain 3 Signature Hash: 0x75bdd0c
===============================================================================
Cycles:
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (hn4a.hn4a1)
os id: 2114372
process id: 211, oracle@HA5_4ADB01
session id: 610
session serial #: 5
}
is waiting for 'row cache lock' with wait info:
{
p1: 'cache id'=0x7
p2: 'mode'=0x0
p3: 'request'=0x5
time in wait: 0.491964 sec
heur. time in wait: 26.859741 sec
timeout after: 2.508036 sec
wait id: 311
blocking: 1 session
wait history:
1. event: 'row cache lock'
wait id: 310 p1: 'cache id'=0x7
time waited: 2.929713 sec p2: 'mode'=0x0
p3: 'request'=0x5
2. event: 'row cache lock'
wait id: 309 p1: 'cache id'=0x7
time waited: 2.929726 sec p2: 'mode'=0x0
p3: 'request'=0x5
3. event: 'row cache lock'
wait id: 308 p1: 'cache id'=0x7
time waited: 2.929720 sec p2: 'mode'=0x0
p3: 'request'=0x5
}
and is blocked by
=> Oracle session identified by:
{
instance: 1 (hn4a.hn4a1)
os id: 2118476
process id: 209, oracle@HA5_4ADB01
session id: 616
session serial #: 5
}
which is waiting for 'row cache lock' with wait info:
{
p1: 'cache id'=0x7
p2: 'mode'=0x0
p3: 'request'=0x5
time in wait: 0.491910 sec
heur. time in wait: 26.859705 sec
timeout after: 2.508090 sec
wait id: 308
blocking: 1 session
wait history:
1. event: 'row cache lock'
wait id: 307 p1: 'cache id'=0x7
time waited: 2.929713 sec p2: 'mode'=0x0
p3: 'request'=0x5
2. event: 'row cache lock'
wait id: 306 p1: 'cache id'=0x7
time waited: 2.929729 sec p2: 'mode'=0x0
p3: 'request'=0x5
3. event: 'row cache lock'
wait id: 305 p1: 'cache id'=0x7
time waited: 2.929715 sec p2: 'mode'=0x0
p3: 'request'=0x5
}
and is blocked by the session at the start of the chain.
Chain 1 Signature: 'row cache lock'<='row cache lock' (cycle)
Chain 1 Signature Hash: 0x75bdd0c
...==>many similar chains
PROCESS 211:
----------------------------------------
SO: 0x7000008a1370178, type: 2, owner: 0x0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x7000008a1370178, name=process, file=ksu.h LINE:10706 ID:, pg=0
(process) Oracle pid:211, ser:3, calls cur/top: 0x7000008ac61a0e8/0x7000008ac61a0e8
flags : (0x0) -
flags2: (0x0), flags3: (0x0)
int error: 0, call error: 0, sess error: 0, txn error 0
ksudlp FALSE at location: 0
(post info) last post received: 0 0 201
last post received-location: kqr.h LINE:2181 ID:kqrbgl: compatible mode
last process to post me: 70000089134b800 1 6
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 0x7000008a153a4c8
O/S info: user: oracle, term: UNKNOWN, ospid: 2114372
OSD pid info: Unix process pid: 2114372, image: oracle@HA5_4ADB01
Short stack dump:
<-ksedsts()+0254<-ksdxfstk()+0028<-ksdxcb()+05d8<-sspuser()+0074<-44c0<-skgpwwait()+00b8<-ksliwat()+0c48<-kslwaitctx()+0150<-kqrigt()+06ac<-kqrLockAndPinPo()+0250<-kqrpre1()+061c<-kqrpre()+001c<-kziavua()+0384<-kpolnb()+0ffc<-kpoauth()+07a0<-opiodr()+0b98<-ttcpip()+115c<-opitsk()+1700<-opiino()+09f0<-opiodr()+0b98<-opidrv()+0440<-sou2o()+0090<-opimai_real()+01b0<-main()+0090<-__start()+0098
service name: SYS$USERS
client details:
O/S info: user: aiuap, term: unknown, ospid: 1234
machine: HA5-4A26 program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
...
proc=0x7000008a1370178, name=row cache enqueues, file=kqr.h LINE:2004 ID:, pg=0
row cache enqueue: count=1 session=7000008916dee00 object=7000008dfd84428, request=X
savepoint=0x3e
row cache parent object: address=7000008dfd84428 cid=7(dc_users)
hash=481062cf typ=21 transaction=700000888da17a0 flags=00000002
own=7000008dfd844f0[7000008bf6e18f0,7000008bf6e18f0] wat=7000008dfd84500[7000008bf5d2750,7000008ef34e818] mode=X
status=VALID/-/-/-/-/-/-/-/-
request=N release=FALSE flags=8
instance lock=QK 481062cf bb6f8fe9
...
LIBRARY OBJECT LOCK: 7000008ef5069d8 handle=7000008dfd22338 mod=N
pnc=0 pns=0 cbb=1 rpr=1 exc=1 ilh=0 ctx=0
use=7000008916dee00 ses=7000008a15c44f0 cnt=1 flg=CNB/[0001] spn=0x4c3d09ea
LIBRARY HANDLE:7000008dfd22338 bid=105620 hid=a3c59c94 lmd=N pmd=0 sta=VALD
name=select privilege#,level from sysauth$ connect by grantee#=prior privilege# and privilege#>0 start with grantee#=:1 and privilege#>0
hash=8973f7c629f09e3f081962c2a3c59c94 idn=0
众多wait chain的源头均是row cache lock,systemstate转储中该等待事件相关的p1,也就是cache id均为7;cache#为7的字典缓存代表着dc_users,即字典中的用户信息(包括用户名,密码等信息):
SQL> select cache#,parameter from v$rowcache group by cache#,parameter order by cache#;
CACHE# PARAMETER
---------- --------------------------------
0 dc_tablespaces
1 dc_free_extents
2 dc_segments
3 dc_rollback_segments
4 dc_used_extents
5 dc_tablespace_quotas
6 dc_files
7 dc_users
8 dc_object_grants
8 dc_objects
9 dc_qmc_cache_entries
CACHE# PARAMETER
---------- --------------------------------
10 dc_usernames
11 dc_object_ids
12 dc_constraints
13 dc_sequences
14 dc_profiles
15 dc_database_links
16 dc_histogram_data
16 dc_histogram_defs
17 dc_global_oids
18 dc_outlines
19 dc_partition_scns
CACHE# PARAMETER
---------- --------------------------------
19 dc_table_scns
20 rule_info
21 rule_fast_operators
21 rule_or_piece
22 dc_awr_control
23 dc_qmc_ldap_cache_entries
24 outstanding_alerts
25 dc_hintsets
26 global database name
27 qmtmrcin_cache_entries
28 qmtmrctn_cache_entries
CACHE# PARAMETER
---------- --------------------------------
29 qmtmrcip_cache_entries
30 qmtmrctp_cache_entries
31 qmtmrciq_cache_entries
32 qmtmrctq_cache_entries
33 kqlsubheap_object
34 realm auth
34 realm cache
35 Command rule cache
36 Realm Object cache
36 Realm Subordinate Cache
37 event map
CACHE# PARAMETER
---------- --------------------------------
38 format
39 audit collector
40 Rule Set Cache
47 rows selected.
大量的dc_users字典缓存锁可能引起后续会话登录挂起,用户登录无法成功,乃至整个实例hang住。那么又是什么样的操作可能引起这种row cache lock并发锁等待事件呢?
应用维护人员反馈问题发生时部分应用程序配置文件中的数据库用户密码不正确,同时该应用会在短期内发起大量会话登录数据库,但应用人员观察到这一问题后并没有联想到可能引发进一步的实例上的dc_user row cache lock问题,仅尝试修改密码并重启该部分的应用程序。至此该问题的root cause已经较为明了:由于短期内出现大量连接请求但连接发起端给出的密码均不正确,引起了Server端严重的用户信息字典缓存锁(users dictionary cache lock)争用。
该问题诊断过程中提交了SR,Oracle GCS给出了进一步的结论,在11g release 1上Bug 7715339也可能造成类似的row cache lock问题,且该Bug一开始在AIX平台上没有one-off的patch,之后客户要求MOS开发出了该平台上的补丁,目前AIX上也有7715339的one-off patch了:
1. On 11.1.0.7, Many 'row cache lock' happened , application can not login to database.
2. The issue is caused by password has been changed but some application still used old password, so these application continuously trying connect to database and leads to dc_users contention.
3. This symptom matches Bug 7715339, but there is no patch for AIX available.
4. Customer requested one-off patch 7715339 on AIX Power System(64-bit), because they are afraid the issues again after changing password.
其实我一直认为Oracle的登录机制存在一定的问题,实际上不管用户提供的密码是否正确,Oracle都会为新的connection分配一个shadow process,该服务进程为了进一步验证用户信息都不得不获取少量资源;如果以此为契机,即便在无法正常登录的情况下依然能在短期内造成实例僵死。