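Hello everyone. We are very sorry: during the traffic peak yesterday afternoon (December 3), the site saw even higher concurrency than before, and under that load a sudden database connection failure brought the entire blog site down. We apologize for the considerable trouble this caused you.
Lately we have been juggling several things at once: working on the AWS partnership project, speeding up product improvements, unifying the UI across the whole site, and dealing with the various problems that appear under high concurrency. The site is at a critical stage of its growth, and we are doing everything we can to meet these challenges and move into its next phase. Thank you for your support, and please bear with us for the trouble caused during this period.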
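The outage began at around 14:09 on December 3. The first symptom was 502 responses when accessing the blog backend.
When the outage began, the first error log entry from the blog backend showed SqlClient failing to connect to the SQL Server database (we use an Alibaba Cloud RDS SQL Server instance):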
2020-12-03T14:09:48 ERR [Path:/healthz]/[Action:]/[Version:]
Health check "blogdb" completed after 0.3522ms with status Unhealthy and description 'null'
Microsoft.Data.SqlClient.SqlException (0x80131904): Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was unable to respond back in time. This failure occurred while attempting to connect to the Principle server. The duration spent while attempting to connect to this server was - [Pre-Login] initialization=20025; handshake=3;
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
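Three minutes later, the blog site also started failing, with requests intermittently returning 500 errors.
The first error log entry from the blog site at that point showed SqlClient failing to resolve the database server's hostname: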
2020-12-03 14:12:46.729 [Error] An exception occurred while iterating over the results of a query for context type '"BlogServer.Infrastructure.Data.EfUnitOfWork"'."
""Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
at System.Net.Dns.GetHostEntryOrAddressesCore(String hostName, Boolean justAddresses)
at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Connect(String serverName, Int32 port, TimeSpan timeout, Boolean isInfiniteTimeout, String cachedFQDN, SQLDNSInfo& pendingDNSInfo)
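The stack trace above ends in a DNS lookup ("Name or service not known"), so one thing worth checking from inside an affected container is whether the database hostname resolves at all. The following is only a minimal sketch of such a probe, not what we actually ran: the function name `resolve_with_retry`, its parameters, and the hostname in the usage note are all hypothetical.

```python
import socket
import time


def resolve_with_retry(host, attempts=3, delay=0.5, resolver=socket.getaddrinfo):
    """Try to resolve `host` up to `attempts` times.

    Returns (addresses, errors): the sorted unique IPs from the first
    successful lookup, plus the error text of every failed attempt.
    `resolver` is injectable so the function can be tested offline.
    """
    errors = []
    for i in range(attempts):
        try:
            infos = resolver(host, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address.
            addresses = sorted({info[4][0] for info in infos})
            return addresses, errors
        except socket.gaierror as exc:
            errors.append(str(exc))
            if i < attempts - 1:
                time.sleep(delay)  # brief pause before the next attempt
    return [], errors
```

Calling `resolve_with_retry("db.example.internal")` (a placeholder hostname) returns an empty address list together with the collected errors when every lookup fails, which helps distinguish an intermittent resolver problem from a consistently failing one.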
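From then on, the blog backend returned 502 continuously, while the blog site was slow and frequently returned 500 errors.
While handling the outage, we performed a primary/standby failover of the database server, after which the blog backend returned to normal. The blog site, still under high-concurrency load, would not recover, however, and after the failover the number of database connections soared.
After that we tried everything we could think of, but still could not get the blog site fully back to normal. Once it had partially recovered, we noticed that requests were sometimes very fast and sometimes very slow, depending on which pod served them. We then added more servers to the k8s cluster, scaled out to more pods, and forcibly took the longest-running batch of pods out of service one at a time. That brought some relief, but the site only truly recovered after the traffic peak had passed.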
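As an illustration of the "stop the longest-running pods one at a time" step, here is a minimal sketch of picking the oldest batch from a list of pods by start time. It is an assumption about how such a selection could be done, not the procedure we used; the function name `oldest_pods`, the pod names, and the timestamps are all made up.

```python
from datetime import datetime


def oldest_pods(pods, count):
    """Return the names of the `count` pods with the earliest start times.

    `pods` is a list of dicts with hypothetical keys "name" and "started"
    (an ISO 8601 timestamp without a timezone suffix).
    """
    ordered = sorted(pods, key=lambda p: datetime.fromisoformat(p["started"]))
    return [p["name"] for p in ordered[:count]]


# Made-up example data: two long-running pods and one started recently.
example_pods = [
    {"name": "blog-web-c", "started": "2020-12-03T15:10:00"},
    {"name": "blog-web-a", "started": "2020-12-01T08:00:00"},
    {"name": "blog-web-b", "started": "2020-12-02T09:30:00"},
]

# The pods to take out of service first, oldest first.
batch = oldest_pods(example_pods, 2)
```

Working through the returned names one by one, and waiting for each replacement pod to become healthy before stopping the next, keeps serving capacity from dropping all at once.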
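We are publishing this post first to report the general course of the outage. The root cause still needs further investigation and analysis on our side. Once again, we apologize for the trouble this outage caused you.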