Recently i find many session are waiting for Event <cursor pin s on x> in my customer's AIX 10.2.0.4 instance, it's a new event in the 10gr2,after mutex take place of latch .
When a session is going to parse or reparse one cursor, need to pin X this cursor first. Then the others who what to execute this child cursor have to wait until pin X released.
I looked for the session who was pining that cursor on X mode ,and saw it was waiting on event single-task messag:
the mutex idn is d322e445,and operated in EXCL mode; It correspond to the cursor whose hash is "ff2c2fd459ac5ee188586781d322e445" as after:
PROCESS 1753:
----------------------------------------
SO: 70000048e9d5e68, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=1753, calls cur/top: 70000043308cb08/70000043798a6e0, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
last post received-location: kslpsr
last process to post me: 70000048f8deaf0 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 70000048f8deaf0 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000048ca877c0
O/S info: user: orauser, term: UNKNOWN, ospid: 3318124
OSD pid info: Unix process pid: 3318124, image: oracle@p595crm1
.........................
----------------------------------------
SO: 70000048eed0d30, type: 4, owner: 70000048e9d5e68, flag: INIT/-/-/0x00
(session) sid: 3311 trans: 70000046be86948, creator: 70000048e9d5e68, flag: (100041) USR/- BSY/-/-/-/-/-
DID: 0001-06D9-008347C7, short-term DID: 0001-06D9-008347C8
txn branch: 70000046a261488
oct: 3, prv: 0, sql: 700000414e4f570, psql: 700000415f47a90, user: 50/SHUCRM1C
service name: CRMDB1
O/S info: user: pausr12, term: unknown, ospid: 1234, machine: p570web2
program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
waiting for 'single-task message' blocking sess=0x0 seq=22963 wait_time=0 seconds since wait started=6
=0, =0, =0
Dumping Session Wait History
for 'SQL*Net message from dblink' count=1 wait_time=325
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=383
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=265
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=677
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=0
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=237
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
temporary object counter: 0
SO: 70000043e695968, type: 53, owner: 70000048eed0d30, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000043e695968 handle=700000410dd46c8 mode=N --the child cursor
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000043e6959e8[70000041e9eb830,70000042411c178] htb=70000042411c178 ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=CBK[0020] savepoint=0x0
LIBRARY OBJECT HANDLE: handle=700000410dd46c8 mtx=700000410dd47f8(0) cdp=0
namespace=CRSR flags=RON/KGHP/PN0/EXP/[10010100]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=X latch#=23 hpc=fffc hlc=fffc --lock in NULL mode,pin on X mode
lwt=700000410dd4770[700000410dd4770,700000410dd4770] ltm=700000410dd4780[700000445169a08,700000453b69228]
pwt=700000410dd4738[700000410dd4738,700000410dd4738] ptm=700000410dd4748[700000410dd4748,700000410dd4748]
ref=700000410dd47a0[70000043d1b5e58,70000043d1b5e58] lnd=700000410dd47b8[700000410dd47b8,700000410dd47b8]
LIBRARY OBJECT: object=7000004354983b0
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
DEPENDENCIES: count=4 size=16
TRANSLATIONS: count=2 size=16
DATA BLOCKS:
data# heap pointer status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004359e4768 7000004354984c8 I/P/A/-/- 0 NONE 00
6 700000414b0e4d0 700000450c0bc28 I/P/A/-/E 0 NONE 00
----------------------------------------
KGX Atomic Operation Log 70000041853c068
Mutex 700000414b0e3d8(3311, 0) idn d322e445 oper EXCL
Cursor Pin uid 3311 efd 0 whr 1 slp 0 -- pin sid 3311
opr=3 pso=70000043e695968 flg=0 -- operate code is 3
pcs=700000414b0e3d8 nxt=0 flg=35 cld=0 hd=700000410dd46c8 par=70000041d78b1e0 -- child cursor handle address is 700000410dd46c8
ct=0 hsh=0 unp=0 unn=0 hvl=1d78b4b8 nhv=1 ses=70000048eed0d30 -- heap 0 pointer address 70000041d78b1e0
hep=700000414b0e458 flg=80 ld=1 ob=7000004354983b0 ptr=700000450c0bc28 fex=700000450c0af38
----------------------------------------
SO: 70000041e9eb7b0, type: 53, owner: 70000048eed0d30, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000041e9eb7b0 handle=700000414e4f570 mode=N
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000041e9eb830[70000044590d030,70000043e6959e8] htb=70000042411c178 ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=[0000] savepoint=0x4c08a856
LIBRARY OBJECT HANDLE: handle=700000414e4f570 mtx=700000414e4f6a0(0) cdp=1
name=
SELECT * FROM(SELECT A.*, rownum r FROM( select account_id,billing_nbr,calling_nbr,called_nbr,to_date,call_duration,charge_item_name,to_char(rate /100, '99999999999990.99') rate, charge,channel_id from ct_05
where 1=1 and account_id ='300187744' and key_source_type ='52001' order by to_date ) A WHERE rownum <= 15 ) B WHERE r > 0
hash=ff2c2fd459ac5ee188586781d322e445 timestamp=06-04-2010 15:14:33
namespace=CRSR flags=RON/KGHP/TIM/PN0/MED/DBN/[50010040]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=23 hpc=0018 hlc=0018
lwt=700000414e4f618[700000414e4f618,700000414e4f618] ltm=700000414e4f628[700000414e4f628,700000414e4f628]
pwt=700000414e4f5e0[700000414e4f5e0,700000414e4f5e0] ptm=700000414e4f5f0[700000414e4f5f0,700000414e4f5f0]
ref=700000414e4f648[700000414e4f648,700000414e4f648] lnd=700000414e4f660[700000414e4f660,700000414e4f660]
LIBRARY OBJECT: object=70000041d78b0c8
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
CHILDREN: size=16
child# table reference handle
------ -------- --------- --------
0 70000043d1b61e8 70000043d1b5e58 700000410dd46c8 -- the child cursor
DATA BLOCKS:
data# heap pointer status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004467f6078 70000041d78b1e0 I/P/A/-/- 0 NONE 00 -- heap 0
The session is active,and its wait history indicates there are some dblink operation inside the running SQL, When this session waited to reparse the only child cursor for that SQL , it pin this child cursor in X mode first,and then it need more information from remote DB , but no response until SQL*NET MESSAGE from DBLINK timeout,it was waiting single-task message for more than 6 seconds when tracing.
In this scenario other sessions who wait to execute this child cursor were all waiting , so sadly.
The oracle document describes <single-task message> as When running single task, this event indicates that the session waits for the client side of the executable.
A little odd,the blocked also did some single task when it lost one dblink connection.
I searched the metalink with keyword: single-task message, there were few documents found , but fortunately below note:
Hdr: 7757687 10.2.0.3.0 RDBMS 10.2.0.3.0 PRG INT/DISTR PRODID-5 PORTID-212
Abstract: CURSOR: PIN S WAIT ON X BLOCKER "SINGLE-TASK MESSAGE"
PROBLEM:
--------
The databse has multiple processes waiting on "CURSOR: PIN S WAIT ON X"
The holder of the mutes is waiitng on "single-task message" form almost
745687
seconds.
The query run by the holder does conatin a dblink.
DIAGNOSTIC ANALYSIS:
--------------------
All the process waiting on the same mutex:-
waiting for 'cursor: pin S wait on X' blocking sess=0x0
idn=5d779f21, value=7af00000000, where|sleeps=57a0dfdb6
Holder
PROCESS 26:
KGX Atomic Operation Log 700000061490300
Mutex 70000006226e0f0(1967, 0) idn 5d779f21 oper EXCL
Cursor Pin uid 1967 efd 0 whr 1 slp 0
opr=3 pso=7000000670f93a0 flg=0
pcs=70000006226e0f0 nxt=0 flg=35 cld=0 hd=7000000606838f0
par=7000000629006a8
ct=0 hsh=0 unp=0 unn=0 hvl=62900980 nhv=1 ses=700000079543fe0
hep=70000006226e170 flg=80 ld=1 ob=700000062fe69a8
ptr=700000065f124f8 fex=700000065f11808
Session object for this process:-
SO: 700000079543fe0, type: 4, owner: 70000007a6ad538, flag: INIT/-/-/0x00
(session) sid: 1967 trans: 0, creator: 70000007a6ad538, flag: (41)
USR/- BSY/-/-/-/-/-
DID: 0001-001A-00000A38, short-term DID: 0001-001A-00000A39
txn branch: 0
oct: 3, prv: 0, sql: 700000062900e18, psql:
70000005d797bc8, user: 21/V500
O/S info: user: d_c11, term: , ospid: 6271140, machine: aruput5
program: cpm_srvscript@aruput5 (TNS V1-V3)
application name: cpm_srvscript@aruput5 (TNS V1-V3), hash value=0
waiting for 'single-task message' blocking sess=0x0 seq=7071
wait_time=0 seconds since wait
started=745687
This bug which occurs on Version 10.2.0.3 is so similar to my case on AIX version 10.2.0.4, so as the stack trace:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttrd+0120<-nsprecv+07a0<-
nscon+0218<-nsdo+157c<-nscall3+012c<-nscall+0778<-niotns+0888<-nigcall+0028<-osnco
n+0540<-kpkiadef+006c<-upiini+03fc<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+011
4<-kpnconn+02a4<-npicon0+042c<-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn
+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree
+0318<-qcsprfro_tree+01c8<-qcspafq+0068<-qcspqb+0470<-kkmdrv+003c<-opiSem+13b4<-opiDe
ferredSem+0234<-opitca+01e8<-kksFullTypeCheck+001c<-rpiswu2+034c<-kksLoadChild+30b0<-
kxsGetRuntimeLock+0810<-kksfbc+2930<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0b30<
-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1020<-opitsk+10b0<-opiino+0990<-opiod
r+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098
My case stack trace:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttcni+01b8<-nttcon+04f4<-ntconn+0160<
-nsopen+0960<-nscall1+01b0<-nscall+049c<-niotns+0880<-nigcall+0028<-osncon+0540<-kpkiadef+006c<
-upiini+0408<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+0114<-kpnconn+02a4<-npicon0+042c<
-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<
-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<-qcspafq+0068<
-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-kkmevw+0a58<-kkmfcbvw+0178<
-kkmfcblo+0e38<-kkmpfcbk+0250<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-kkmdrv+003c<-opiSem+13c0<
-opiprs+01ac<-kksParseChildCursor+05e0<-rpiswu2+034c<-kksLoadChild+1d88<
-kxsGetRuntimeLock+0810<-kksfbc+28b0<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0ae0<
-kpooprx+0168<-kpoal8+0400<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<
-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098
Both of them have called OCIKDBLinkConn->OCIKDBLinkConn2->kpndbcon->npicon0->kpnconn->OCIServerAttach->kpuatch and follow.
Till now this bug has no patch to apply or workaround way what's awful.
As "Bug 5111335: STREAMS PROPAGATION STUCK ON "SINGLE-TASK MESSAGE" PRODUCE LIBRARY CACHE LOCK CO" described:
Hdr: 5111335 10.2.0.2 RDBMS 10.2.0.2 STREAMS PRODID-5 PORTID-226
Abstract: STREAMS PROPAGATION STUCK ON "SINGLE-TASK MESSAGE" PRODUCE LIBRARY CACHE LOCK CO
This is a three cluster environment. Two nodes per cluster and on that
cluster running RAC with 2 instances.
The databases involved are:
GCTGCMU is the "GCT" database in GCM = Greenwich USA, will be named USA from
now on
GCTHKGU is the "GCT" database in HKG = Hong Kong, will be named HKG from now
on
GCTLONU is the "GCT" database in LON = London, will be named LON from now on
On this environment there is a Streams bi-directional replication environment
defined between the 3 databases.
USA database is propagating changes captured to LON and HKG.
There are two propagations defined from this site to every site. 4
propagations as a total.
GCTHKGU_BATCH_PROPAGATE
GCTLONU_BATCH_PROPAGATE
GCTHKGU_DAILY_PROPAGATE
GCTLONU_DAILY_PROPAGATE
Daily propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_DAILY.
Batch propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_BATCH
This databases have been upgraded to 10.2.0.2 from 10.2.0.1 last 13-Mar-2006,
and no problems were reported.
Customer reported that he could see that propagations from USA to HKG were
stucked on library cache lock.
This situation was solved by customer by bouncing USA instances.
Analyzing the systemstate dumps we could see that there were propagations
locked on this situation and QMON slaves too.
The library cache lock was on the queue STRMADMIN.STREAMSOUT_DAILY and the
owner of the lock was process J000 that was stucked 'single-task message'.
From the information on the systemstate dump we can see that this process has
been on that situation for more than 3 days and 20 hours.
Based on the call stack it seems that the process was about or in the middle
of a call to a listener.
DIAGNOSTIC ANALYSIS:
--------------------
Files to be uploaded:
USA:
- RDA.zip , RDA node 1
- GCTGCMU1_healthchk.html, streams healthcheck node 1
- GCTGCMU1_healthchk.html, streams healthcheck node 2
- gctgcmu2_ora_21349.trc, hanganalyze and systemstate dumps when the hang is
ocurring. Node 2
- gctgcmu1_ora_858.trc, hanganalyze and systemstate dumps when the hang is
ocurring. Node 1.
HKG:
- alert_GCTHKGU1.log, alert log file node 1
- alert_GCTHKGU2.log, alert log file node 2
- listener_hkg0223xus.log_20060320, listener log file node 1.
- listener_hkg0224xus.log_20060319, listener log file node 2.
LON:
- alert_GCTLONU1.log, alert log file node 1
- alert_GCTLONU2.log, alert log file node 2
- listener_lon0223xus.log_20060318, listener log file node 1.
- listener_lon3166xus.log_20060320, listener log file node 2.
Hang was identified at USA on node 1.
Analysis of gctgcmu1_ora_858.trc:
System State 1
~~~~~~~~~~~~~~
21: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=21
42: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=404
51: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=17
67: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=90
71: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=56
76: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=87
System State 2
~~~~~~~~~~~~~~
21: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=21
42: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=404
51: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=17
67: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=90
71: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=56
76: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=87
LOCK: handle=25ea4afc8 QUEU:STRMADMIN.STREAMSOUT_DAILY
Resource Holder State
LOCK: handle=25ea4afc8 50: waiting for 'single-task message'
PROCESS 50:
~~~~~~~~~~~
OSD pid info: Unix process pid: 12949, image: oracle@slodb001 (J000)
From the call stack looks like process was doing an SQL*Net connection to a
listenerand doing this
connection we get hanged because the stack trace is the same for the two
process dumps shown on
both system states.
(session)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 92/STRMADMIN
waiting for 'single-task message' blocking sess=0x(nil) seq=5 wait_time=0
seconds since wait
started=332308
=0, =0, =0
It is running a propagation job: next_date :=
sys.dbms_aqadm.aq$_propaq(job);
Knowing that the seconds in wait are 332308 and the dump of the process
occurred on 2006-03-21 08:15:12.640,
this process is waiting on this event:
1. On USA the process is waiting since 17-MAR-2006 11:56:44
2. That time on London is 17-MAR-2006 16:56:44
3. That time on HONG KONG is 18-MAR-2006 01:56:44
At that time nothing relevant was found on alert log files on listener log
files.
Processes waiting on library cache lock
~~~~~~~~~~~~~~~
21 (PZ99) : Slave of J002
42 (q004) : Slave of process 44 (QMNC)
51 (J001) :
67 (q003) : Slave of process 44 (QMNC)
71 (q005) : Slave of process 44 (QMNC)
51 (TNS shadow) :
77 (q007) : Slave of process 44 (QMNC)
WORKAROUND:
-----------
No workaround found so far.
Killing the process or bouncing instance should solve the problem.
RELATED BUGS:
-------------
No similar bugs found.
This environment is also involved on bug 5089630
REPRODUCIBILITY:
----------------
Problem has reproduced just one time on customer site.
TEST CASE:
----------
STACK TRACE:
------------
For processes waiting on library cache lock we can see different call stacks,
here is the one for process 21:
ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
kslwait kjusuc ksipgetctx kqlmli kgllkal kglget
kglgob kwqicgob kwqicrqo0 kwqmsnrmsg kwqsldqs kwqsif
qerfxFetch qervwFetch qertqoFetch qerpxSlaveFetch qerpxFetch
opiexe kpoal8 opiodr kpoodr upirtrc kpurcsc kpuexecv8 kpuexec OCIStmtExecute
kxfxsStmtExecute kxfxsExecute kxfxsp kxfxmai kxfprdp opirip opidrv sou2o
opimai_real main __libc_start_main _start
For process waiting on single-task message, we can see the following call
stack:
ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
clntudp_call do_ypcall yp_match internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname snlpcgthstbynm snlinGetAddrInfo nttbnd2addr ntacbbnd2addr
ntacbnd2addr
nsc2addr nscall1 nscall niotns nigcall osncon
kpkiadef upiini upiah0 kpuatch OCIServerAttach kpnconn
npicon0 kpndbcon OCIKDBLinkConn2 OCIKDBLinkConn kwqpcon kwqpgps
spefmccallstd pextproc peftrusted psdexsp rpiswu2 psdextp pefccal pefca
pevm_FCAL pfrinstr_FCAL pfrrun_no_tool pfrrun plsql_run peicnt kkxexe opiexe
opiodr rpidrus skgmstack rpidru rpiswu2 rpidrv rpiexe kkjex1e kkjsexe kkjrdp
opirip opidrv sou2o opimai_real main libc_start_main start
I do not see any Oracle bug here but it is hard to be
certain without the SQLNET.ORA settings and multiple
snapshots of the problem process. It is quite valid
for an Oracle session to be holding a library cache lock
when it makes an outbound DBlink call. If another session
wants that lock in an incompatible mode then it has to wait.
In this case the blocking session making the outbound link
call appears from the one stack dump to be in OS code
under gethostbyname().
However it could have been spinning down the stack in
Oracle code but I cannot comment on the info we have.
The "single-task message" wait event is started / ended
in kpnconn() so if there was a "spin" it would have to be
above that point in the stack.
In 10.2 there is the option to configure NET to timeout
on outbound connect attempts by setting
SQLNET.OUTBOUND_CONNECT_TIMEOUT = N seconds.
The timeout occurs in niotns() in the stack so if this
is set you would expect a session in this state to then
timeout which may be useful if there are underlying issues with
TCP hostname resolution.
(Watch out for bug 4933023 if you use this SQLNET parameter).
If the issue recurs then get CPU usage info, multiple stacks
and you may want to attach with gdb and see if the code ever
returns above gethostbyname or not to get an idea if this
Sorry , my explanation is probably not clear .
Here is the stack you have but annotated:
Callback to get the short stack. Confirms we are not HUNG in a device
driver.
ksdxfstk
ksdxcb
sspuser
pthread_sighandler_rt
GI___libc_sigaction
Inside OS supplied code executing gethostbyname() C call.
GI___poll
clntudp_call
do_ypcall
yp_match
internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname
Oracle Net code:
snlpcgthstbynm
snlinGetAddrInfo
nttbnd2addr
ntacbbnd2addr
ntacbnd2addr
nsc2addr
nscall1
nscall
niotns <<< This is where SQLNET.OUTBOUND_CONNECT_TIMEOUT
is implemented IF it has been configured.
nigcall
Oracle RDBMS code:
osncon
kpkiadef
upiini
upiah0
kpuatch
OCIServerAttach
kpnconn <<< This is where "single-task message" starts/ends
npicon0
...
If we just get two stack snapshots of the process, as was the case, and
both have the above stack then we can say:
As the "seq=" of both snaps is the same (5) then we did not return
below kpnconn() in the call stack.
But as poll() is typically a blocking call, even if only for a short
time, then we cannot be sure if this process was blocked permanently
in the poll() waiting on some large timout or similar, or if the
code was executing and looping. If it was looping it could have done so
at any point on the above stack down to kpnconn() where we would have
changed the seq# on the wait event if we exited here.
We can say that as the short stack dump triggered a callout then the
Linux kernel was not stuck inside a device driver.
We can also say it looks like the gethostbyname is in YP code doing a UDP
call so at OS level you may want to check out the name resolution
configuration. You probably dont want to be falling all the way back to
UDP requests to get hostnames resolved.
So if the problem recurs it would help to get:
ps information of the process (to see if it uses CPU at all)
strace for a few minutes - this should help show if the
process is entering / leaving any system call (eg: poll)
A full call stack with arguments from either gdb or errorstack
and these should help show the arguments passed up the stack.
In gdb you can also do something like attach to the problem process
and :
break osncon
break nigcall
break niotns
etc.. up the stack to gethostbyname and then "cont"
and it should stop if it enters into those functions from below
helping indicate a spin/loop point below that point.
It may also be worth getting Linux "crash" information for the
process to see what the process looks like from the Linux side
but follow up with the OS team on that side.
SQLNET.OUTBOUND_CONNECT_TIMEOUT is a backup mechanism to help prevent
This note claimed that one session may hold library cache lock , and wait in single-task message
after it makes an outbound DBlink call.If another session wants that lock in an incompatible mode then it has to wait.
The "single-task message" wait event is started / ended in kpnconn() so if there was a "spin" it would have to be
above that point in the stack. Oracle support advise to set SQLNET.OUTBOUND_CONNECT_TIMEOUT so that
we can expect a session in this state to then timeout which may be useful if there are underlying issues with
TCP hostname resolution. The principle in these cases is identical.
Do deep digging ,you can see this Knowledge:
Hdr: 8427478 10.2.0.3 RDBMS 10.2.0.3 PI/DISTRIB PRODID-5 PORTID-23 ORA-12170
Abstract: WHEN USING DBLINK,IT CONNECT TWICE AND TAKE DOBULE-TIME FOR TIME-OUT( ORA-12170)
PROBLEM:
--------
When client connect to database using TCP/IP, if there are some problems
in the network and server , The client recevies ORA-12170 (ETIMEDOUT)
The timeout time is depend on OS TCP parameter.(e.g. tcp_syn_retries
parameter on Linux platform).
ERROR:
ORA-12170 : TNS:Connect timeout occurred
** sqlnet.log ***************************************************
VERSION INFORMATION:
TNS for Linux: Version 10.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 10.2.0.3.0 - Production
Time: 10-APR-2009 21:38:41
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-505: Operation timed out
nt secondary err code: 110 <= ETIMEDOUT
nt OS err code: 0
Client address:
** sqlnet.log ***************************************************
client server
| |
connect() |
|-------x-->|--------
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |when setting tcp_syn_retries=5(default)
|-------x-->| | It takes about 189 seconds for time-out.
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |
---------------------
ORA-12170 occur.
But, the connection via DBLINK, it sometimes try to connect twice. and
it takes double-time for time-out.
local site remote site
| |
connect() |
|-------x-->|--------
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |
ORA-12170 occur ,but it not returning to client.
and server process of local site try to connect automatically.(2nd connect)
connect() | |
|-------x-->| |
| SYN | |
|-------x-->| | It takes about 378 seconds for time-out.
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |
|-------x-->| |
| SYN | |
---------------------
ORA-12170 occur.
The problem is that it takes dobule-time by this behavior using DBLINK.
DIAGNOSTIC ANALYSIS:
--------------------
We got stack trace when trying connect.
1st connect
connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335
<- nscall1()+449<- nscall()+609<- niotns()+1678<- nigcall()+112
<- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104
<- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463
<- npicon0()+1250<- kpndbcon()+909<- OCIKDBLinkConn2()+40
<- OCIKDBLinkConn()+36<- ddfnet3Share()+151<- kksarm()+480<- kksauc()+481
<- kksCheckCriteria()+1711<- kksCheckCursor()+226 <-
kksSearchChildList()+1496
<- kksfbc()+10276<- kkspsc0()+1903<- kksParseCursor()+132<-
opiosq0()+1349
<- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093<-
opitsk()+1031
<- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<-
opimai_real()+117
<- main()+111<- __libc_start_main()+161
2nd connect
<- connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335<-
nscall1()+449<- nscall()+609
<- niotns()+1678<- nigcall()+112<- osncon()+692<- kpkiadef()+58<-
upiini()+683<- upiah0()+104
<- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463<-
npicon0()+1250<- kpndbcon()+909
<- OCIKDBLinkConn2()+40<- OCIKDBLinkConn()+36<- ddfnet2Normal()+204<-
kkmfcbrm()+132
<- __PGOSF172_kkmpfcbk()+473<- qcsprfro()+503<- qcsprfro_tree()+302<-
qcsprfro_tree()+118
<- qcspafq()+102<- qcspqb()+347<- kkmdrv()+60<- opiSem()+1044<-
opiDeferredSem()+283
<- opitca()+275<- __PGOSF302_kksFullTypeCheck()+20<- rpiswu2()+334<-
kksLoadChild()+5852
<- kxsGetRuntimeLock()+1314<- kksfbc()+12270<- kkspsc0()+1903<-
kksParseCursor()+132
<- opiosq0()+1349<- kpooprx()+215<- kpoal8()+673<- opiodr()+985<-
ttcpip()+1093
<- opitsk()+1031<- opiino()+821<- opiodr()+985<- opidrv()+466<-
sou2o()+91<- opimai_real()+117
<- main()+111<- __libc_start_main()+161<-
WORKAROUND:
-----------
execute following command before connect.
SQL> alter system flush shared_pool;
You can find the kpnconn call in stack trace , so the single-task message may occur in two time dblink reconnect and take a long time.
The document provide one workaround way: flush shared_pool.which I suspected.
In summary , still no Mature program to resolve this problem , All we can do is pray.
In my opinion , Version 10.2.0.4 is stable enough within most layer , but still a lots of bug or unexpected behavior reside in K2(kernel Distributed Execution Layer). Distributed system is a really complicated subject in computer science.