国庆节凌晨0:36收到邮件报网通Tomcat服务器挂掉,Tomcat总是那么脆弱,重启是家常便饭了,但这次和往常不一样,WEB服务始终没有起来,马上重启整台服务器再启动Tomcat,故障依旧。
马上查看Tomcat日志:
2010-10-1 1:10:42 com.fsm.util.SearchDispatchFilter init 信息: SearchDispatchFilter.init() done ***********************创建数据源******************** createDataSource:DriverClassName=oracle.jdbc.driver.OracleDriver createDataSource:Url=jdbc:oracle:thin:@db_server:1521/ora8i ***************************************************
过了3分钟,才出现久违“Server startup”:
[FSM-ERROR]:2010-10-01 01:13:55,394-org.hibernate.util.JDBCExceptionReporter.logExceptions(LINE:78) Cannot create PoolableConnectionFactory (Io 异常: The Network Adapter could not establish the connection) org.tuckey.web.filters.urlrewrite.UrlRewriteFilter INFO: loaded (conf ok) 2010-10-1 1:13:58 org.apache.coyote.http11.Http11BaseProtocol start 信息: Starting Coyote HTTP/1.1 on http-80 2010-10-1 1:13:58 org.apache.jk.common.ChannelSocket init 信息: JK: ajp13 listening on /0.0.0.0:8009 2010-10-1 1:13:58 org.apache.jk.server.JkMain start 信息: Jk running ID=0 time=0/84 config=null 2010-10-1 1:13:58 org.apache.catalina.storeconfig.StoreLoader load 信息: Find registry server-registry.xml at classpath resource 2010-10-1 1:13:58 org.apache.catalina.startup.Catalina start 信息: Server startup in 205603 ms
通常Tomcat也就是启动20来秒,这次却启动了200多秒,不太正常,再往上看出现了“JDBC”和“The Network Adapter could not establish the connection”,99%就是与数据库建立连接时候出现了问题,马上telnet数据库的1521端口,可连接是通的:
[root@primary_node bin]# telnet db_server 1521 Trying 192.168.4.20... Connected to db_server (192.168.4.20). Escape character is '^]'. aaaaa aaa aaaaa aaaaaa aaaaa
上面那堆'”a”是我输入的,检查连接是否真的通了。
于是将矛头指向配置文件和运行所需的jar包上,也没有发现最近被修改过的痕迹,将另一台服务器的配置文件和jar包copy过来,重启Tomcat,故障依旧。此时已是2:11分,此时该台网通Tomcat服务器的VIP早就被电信服务器抢过去了,服务没有受到影响,此时精神上有点吃不消了,赶紧睡觉去,获取周公能给我灵感。
早上7:25分醒来,第一件事睁开眼睛,第二件事登录到服务器上继续调试问题,首先将数据库IP指向测试数据库,启动Tomcat,WEB服务跑起来了,这个现象再次让我把矛头指向Oracle服务器,由于服务器上没有安装Oracle的客户端只能通过telnet这种原始的方式进行测试,telnet测试的Oracle服务器:
[root@primary_node bin]# telnet test_db_server 1521 Trying 192.168.4.74... Connected to db_server (192.168.4.74). Escape character is '^]'. aaaaa aaa Connection closed by foreign host.
奇怪了,测试用的Oracle服务器会主动close连接,而生产用的Oracle服务器不会主动close连接,终于找到突破口了。 登录到Oracle服务器,查看以下listener的状态:
[oracle@wz_oracle1 adump]$ lsnrctl status LSNRCTL for Linux: Version 10.2.0.2.0 - Production on 01-OCT-2010 09:08:09 Copyright (c) 1991, 2005, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.4.20)(PORT=1521)))
Hang住,取消,直接 lsnrctl stop ,还是Hang住,出必杀 killall –9 tnslsnr,再启动:
[oracle@wz_oracle1 adump]$ lsnrctl start LSNRCTL for Linux: Version 10.2.0.2.0 - Production on 01-OCT-2010 09:10:17 Copyright (c) 1991, 2005, Oracle. All rights reserved. Starting /u01/app/oracle/bin/tnslsnr: please wait... TNSLSNR for Linux: Version 10.2.0.2.0 - Production System parameter file is /u01/app/oracle/network/admin/listener.ora Log messages written to /u01/app/oracle/network/log/listener.log Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.4.20)(PORT=1521))) Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.4.30)(PORT=1521))) Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.4.20)(PORT=1521))) STATUS of the LISTENER ------------------------ Alias LISTENER Version TNSLSNR for Linux: Version 10.2.0.2.0 - Production Start Date 01-OCT-2010 09:10:17 Uptime 0 days 0 hr. 0 min. 0 sec Trace Level off Security ON: Local OS Authentication SNMP OFF Listener Parameter File /u01/app/oracle/network/admin/listener.ora Listener Log File /u01/app/oracle/network/log/listener.log Listening Endpoints Summary... (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.4.20)(PORT=1521))) (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.4.30)(PORT=1521))) Services Summary... Service "PLSExtProc" has 1 instance(s). Instance "PLSExtProc", status UNKNOWN, has 1 handler(s) for this service... Service "ora8i" has 1 instance(s). Instance "ora8i", status UNKNOWN, has 1 handler(s) for this service... Service "shdb" has 1 instance(s). Instance "ora8i", status UNKNOWN, has 1 handler(s) for this service... The command completed successfully
搞定,重启Tomcat,一切恢复正常。
这次故障处理时间长达3小时,最开始以为是一个普通的故障,重启一下Tomcat就能完事,但没那么简单,然后被 telnet 的现象所迷惑,走入了一条长长的弯路,“工欲善其事,必先利其器”如果当时找一台服务器用sqlplus 远程连一下数据库就不用折腾那么久了。