原文:http://book.51cto.com/art/200906/130068.htm
9.3.3 db2trc案例2(1)
在AIX操作系统上,系统原先运行良好,而后用户从DB2 V8 FP7B升级到DB2 V8 FP15,并且把实例从32位迁移到64位。在该机器上中有3个实例,其中一个实例在执行了db2iupdt命令后无法连接数据库。该实例中仅有一个数据库。
问题诊断步骤如下:
首先查看db2diag.log诊断日志文件:
- cd ~/sqllib/db2dump
- mv db2diag.log db2diag.log.bak
- <reproduce problem>--注:重新运行出问题的命令或SQL
- vi db2diag.log
得到的错误输出信息具体如下所示:
- 2007-12-12-20.42.45.029130-360 I1838C472 LEVEL: Warning
- PID:1593388 TID:1 PROC:db2agent (U88FLAQ) 0
- INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
- APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
- FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:15
- RETCODE : ZRC=0x820F0004=-2112946172=SQLO_MEM_SIZE "Mem Mgt invalid size"
- DIA8563C An invalid memory size was requested.
- 2007-12-12-20.42.45.029814-360 I2311C755 LEVEL: Warning
- PID: 1593388 TID: 1 PROC: db2agent (U88FLAQ) 0
- INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
- APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
- FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:16
- DATA #1 : String, 297 bytes
- Failed to allocate the desired database shared memory set.Check to make sure
- the configured DATABASE_MEMORY + overflowdoes not exceed the maximum shared
- memory on the system. Willattempt to allocate the minimum possible db shared
- memory size.
- Desired database shared memory set size is (bytes):
- DATA #2 : Hexdump, 4 bytes
- 0x2FF13E20 : C601 4000 ..@.
- ------------------------------略----------------------------------------
- 2007-12-12-20.42.45.031566-360 I4206C472 LEVEL: Warning
- PID: 1593388 TID: 1 PROC: db2agent (U88FLAQ) 0
- INSTANCE: iu88flaq NODE : 000 DB: U88FLAQ
- APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
- FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:15
- RETCODE : ZRC=0x820F0004=-2112946172=SQLO_MEM_SIZE "Mem Mgt invalid size"
- DIA8563C An invalid memory size was requested.
- 2007-12-12-20.42.45.032023-360 I5435C491 LEVEL: Severe
- PID: 1593388 TID: 1 PROC: db2agent (U88FLAQ) 0
- INSTANCE: iu88flaq NODE: 000 DB: U88FLAQ
- APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
- FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:17
- RETCODE : ZRC=0x850F0005=-2062614523=SQLO_NOSEG
- "No Storage Available for allocation"
- DIA8305C Memory allocation failure occurred.
尝试分配"0xC6014000"的内存失败!为什么会产生这样的错误呢?从输出结果中我们看到的SQLO_MEM_SIZE和SQLO_NOSEG开始分析,此类问题应该是内存段分配问题。执行db2level命令,结果发现该实例仍然是32位,因此执行"db2iupdt -w 64"命令,将实例升级到64位,这个问题就解决了。
可是接下来我们却发现仍然无法连接数据库。于是,查看新产生的db2diag.log文件。结果发现这一次报了一个不同的错误信息,如下所示:
- 2007-12-14-12.15.17.656977-360 I1A1093 LEVEL: Event
- PID : 1052786 TID : 1 PROC : db2
- INSTANCE: iu88flaq NODE : 000
- FUNCTION: DB2 UDB, RAS/PD component, _pdlogInt, probe:120
- START : New db2diag.log file
- DATA #1 : Build Level, 144 bytes
- Instance "iu88flaq"uses"64" bits and DB2 code release"SQL08028"--64位实例
- ------------------------------略----------------------------------------
- 2007-12-14-12.15.17.655334-360 I1095A383 LEVEL: Error
- PID : 1052786 TID : 1 PROC : db2
- INSTANCE: iu88flaq NODE : 000
- FUNCTION: DB2 UDB, command line process, clp_start_bp, probe:3
- MESSAGE : CLP frontend unable to get REQUEST queue handle
- DATA #1 : Hexdump, 4 bytes
- 0x0FFFFFFFFFFFD6C8 : 870F 0042 ...B
- 2007-12-14-12.15.45.546601-360 I1479A383 LEVEL: Error
- PID : 2765134 TID : 1 PROC : db2
- INSTANCE: iu88flaq NODE : 000
- FUNCTION: DB2 UDB, command line process, clp_start_bp, probe:3
- MESSAGE : CLP frontend unable to get REQUEST queue handle
- DATA #1 : Hexdump, 4 bytes
- 0x0FFFFFFFFFFFD688 : 870F 0042 ...B
- --注:使用下面命令查看870F0042的含义:
- db2diag -rc 870F0042
- Input ZRC string '0x870f0042' parsed as 0x870F0042 (-2029060030).
- ZRC value to map: 0x870F0042 (-2029060030)
- V7 Equivalent ZRC value: 0xFFFFF642 (-2494)
- ZRC class : Global Processing Error (Class Index: 7)
- Component: SQLO ; oper system services (Component Index: 15)
- Reason Code: 66 (0x0042)
- Identifer: SQLO_QUE_NOT_EXIST
- Identifer (without component): SQLZ_RC_QNOEXS
- Description: Queue does not exist
- Associated information: Sqlcode -902
- SQL0902C A system error (reason code = "" occurred. Subsequent SQL
- statements cannot be processed.
- Number of sqlca tokens : 1
- Diaglog message number: 8558
错误信息告诉我们"queue doesn't exist",首先我们要理解一下这个"queue doesn't exist"是什么意思?当我们用"db2 xxxx"命令执行特定操作的时候,这个DB2是一个可执行程序。这个可执行程序被称作"frontend process"。当DB2这个进程执行的时候,会隐式地在后台生成一个db2bp的进程。这个进程叫做"db2 backend process"。所以说,当我们用"db2 connect to <db>"这个命令连接数据库的时候,真正与数据库相连接的是其后台的db2bp进程。而当连接成功以后,后面所有的比如select操作什么的,都是通过queue这种IPC手段与db2bp进程进行通信的。
9.3.3 db2trc案例2(2)
这里我们得到的错误信息意思是说,当DB2前台进程"frontend process"想找后台进程"backend process"的时候,发现用来通信的queue不存在!我们知道既然是IPC通信,那"frontend process"就不可能只是得到一个queue不存在就立刻报错。而实际上是用户在发出connect命令后等待了很长时间才返回错误。现在让我们来看看这个后台进程正在做什么,竟然让前台进程等了半天也得不到queue,前台进程不得不退出说明queue不存在了……。下面我们利用db2trc命令来跟踪:
- db2trc on -t -f db2trc.dmp
- <reproduce problem>
- db2trc off
- db2trc flw -t db2trc.dmp db2trc.flw
- db2trc fmt db2trc.dmp db2trc.fmt
跟踪出来的FLW文件有40多MB,大概51万行,别被这51万行吓着。我们不全看,查看TRACE最关键的就是要知道"你现在正在干什么?"、"想要看什么东西?"。我们现在正在调查"backend process"为什么不创建queue,首先让我们从得到的db2diag.log入手。先到FMT文件里找db2diag.log中出现的错误关键字"870F"(不包括引号),为什么我们找这个关键字呢?因为这个是db2diag.log中返回的错误信息。我们就用它作为搜索条件。搜索后得到了一个ENTRY, 具体信息如下所示:
- 2170 data DB2 UDB command line process clp_start_bp fnc (3.3.41.7.0.3)
- pid 2765134 tid 1 cpid -1 node -1 sec 64 nsec 869465587 probe 3
- bytes 392
- Data1 (PD_TYPE_DIAG_LOG_REC,384) Diagnostic log record:
- 2007-12-14-12.15.45.546601-360 I1479A383 LEVEL: Error
- PID : 2765134 TID : 1 PROC : db2
- INSTANCE: iu88flaq NODE : 000
- FUNCTION: DB2 UDB, command line process, clp_start_bp, probe:3
- MESSAGE : CLP frontend unable to get REQUEST queue handle
- DATA #1 : Hexdump, 4 bytes
- 0x0FFFFFFFFFFFD688 : 870F 0042 ...B
从输出结果中我们可以发现2170对应着FLW文件里的2170, 然后在这个ENTRY里我们看到CLP_START_BP,顾名思义,就是CLP用来开启"backend process"的"function call"的,也就是说,这个2170是在"frontend process"中的。
然后回到FLW文件,找2170,具体信息如下所示:
- 2166 64:869290682 | | | | | | | sqlochgfileptr exit
- 2167 64:869405455 | | | | | | _pdlogInt data [probe 90]
- 2168 64:869450954 | | | | | | | sqlowrite entry
- 2169 64:869464745 | | | | | | | sqlowrite exit
- 2170 64:869465587 | | | clp_start_bp data [probe 3]
- 2171 64:869466392 | | | | | | | sqloclose entry
- 2172 64:869472755 | | | | | | | sqloclose exit
- 2173 64:869479918 | | | | | | _pdlogInt exit
- 2174 64:869480276 | | | | | pdLog exit
我们发现直到64秒才跑到2170,往上看,看看它一直在做什么:我们看到了好多"OpenMLNQue",基本上是一秒钟一个。具体信息如下所示:
- 1437 22:867813790 | | | | | sqloOpenMLNQue data [probe 1]
- 1438 22:867817987 | | | | | | sqlogkey entry
- 1439 22:867818164 | | | | | | sqlogkey data [probe 1]
- 1440 22:867819491 | | | | | | sqlogkey exit [rc = 0x6F02C86F = 1862453359]
- 1441 22:867823317 | | | | | sqloOpenMLNQue exit [rc = 0x870F0042 =
- -2029060030 = SQLO_QUE_NOT_EXIST]
- 1442 22:867823675 | | | | | sqlorest entry
- 1443 22:867823869 | | | | | sqlorest data [probe 10]
- 1444 23:867842081 | | | | | sqlorest exit
- 1445 23:867845317 | | | | | sqloOpenMLNQue entry
- 1446 23:867847142 | | | | | sqloOpenMLNQue data [probe 1]
- 1447 23:867850175 | | | | | | sqlogkey entry
- 1448 23:867850525 | | | | | | sqlogkey data [probe 1]
- 1449 23:867852012 | | | | | | sqlogkey exit [rc = 0x6F02C86F = 1862453359]
- 1450 23:867855105 | | | | | sqloOpenMLNQue exit [rc = 0x870F0042 =
- -2029060030 = SQLO_QUE_NOT_EXIST]
- 1451 23:867855472 | | | | | sqlorest entry
- 1452 23:867855657 | | | | | sqlorest data [probe 10]
- 1453 24:867872812 | | | | | sqlorest exit
- 1454 24:867876195 | | | | | sqloOpenMLNQue entry
- 1455 24:867877582 | | | | | sqloOpenMLNQue data [probe 1]
- 1456 24:867880910 | | | | | | sqlogkey entry
- 1457 24:867881256 | | | | | | sqlogkey data [probe 1]
- 1458 24:867882545 | | | | | | sqlogkey exit [rc = 0x6F02C86F = 1862453359]
- 1459 24:867885558 | | | | | sqloOpenMLNQue exit [rc = 0x870F0042 =
- -2029060030 = SQLO_QUE_NOT_EXIST]
这说明什么问题呢?"frontend process"每秒钟检测一次"backend process queue"是不是可用,然后等了一分钟左右发现一直找不到,然后就TIMEOUT退出了。不过现在还是不知道"backend process"在这段时间干了些什么呀?怎么办?既然我们知道了这个进程是"frontend process",那么"backend process"肯定在另外的地方。而且想要知道"backend process"在干什么,就要找这64秒之前的东西,像那些100多秒以后的事情就可以忽略不计了。但是这里可是有51万行呢,怎么找?我们知道程序的入口都是MAIN,所以我们来搜索一下关键字MAIN,获得的信息如下所示:
- 826 4:873512436 clp bp main entry
- 7699 152:813841403 | | | | sqeDomainSocketManager::GetShareSocketPair entry
- 7702 152:813842874 | | | | sqeDomainSocketManager::GetShareSocketPair exit
- 7705 152:813844661 | | | | sqeDomainSocketPair::WriteConn entry
- 7706 152:813859956 | | | | sqeDomainSocketPair::WriteConn data [probe 20]
- 7707 152:813860226 | | | | sqeDomainSocketPair::WriteConn exit
- 7759 152:813896576 | | sqeDomainSocketPair::ReadConn entry
- 7767 152:813910788 | | sqeDomainSocketPair::ReadConn data [probe 20]
- 7768 152:813911404 | | sqeDomainSocketPair::ReadConn exit
- 7769 152:813911922 | | sqeDomainSocketManager::FreeShareSocketPair entry
- 7770 152:813912487 | | sqeDomainSocketPair::ClrAsyncRead entry
- 7771 152:813919886 | | sqeDomainSocketPair::ClrAsyncRead exit
- 7772 152:813920518 | | sqeDomainSocketPair::ClrAsyncWrite entry
- 7773 152:813923539 | | sqeDomainSocketPair::ClrAsyncWrite exit
- 7776 152:813924917 | | sqeDomainSocketManager::FreeShareSocketPair exit