Common Kubernetes Errors

Pod Errors
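1. OOMKilled: a container in the Pod used more memory than the limit set in resources.limits, so it was forcibly killed.
2. CrashLoopBackOff: the Pod is stuck in a crash-restart loop. The restart delay doubles each time (10s, 20s, 40s, 80s, ...) up to a cap of 300 seconds, after which the Pod keeps restarting every 300 seconds indefinitely.
3. Pod stays Pending: this usually means no node satisfies the Pod's requirements, so the Pod cannot be scheduled. Typical causes: the required port is already taken by another container via hostPort, or the nodes carry taints the Pod does not tolerate.
4. FailedCreatePodSandBox: Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded: most likely a problem with the CNI network plugin (for example, the pool of IP addresses is exhausted).
5. SandboxChanged: Pod sandbox changed, it will be killed and re-created: most likely the container was OOMKilled because of its memory limit, or some other resource ran short.
6. FailedSync: error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded: often appears together with the two errors above, and most likely points to a CNI network plugin problem.
7. In a development cluster, deploying every service at once makes the Pods compete for resources. Liveness probes then fail and Pods restart repeatedly, and the restarts drive resource usage even higher: a vicious cycle.
  - Set resources.requests on every Pod. When resources run short, later Pods then stay unscheduled until capacity frees up.
8. Many Pods in Failed state while the Deployment keeps creating new Pods: run kubectl describe pod <pod-name> (or kubectl edit pod <pod-name>) and check the Pod's Events and Status. The failure reason is usually shown there, for example the Pod was evicted because its node became unhealthy.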
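The restart delay described for the crash-restart loop in item 2 can be sketched in a few lines of Python. This is only an illustration of the doubling-with-cap schedule stated above, not the kubelet's actual code; the function name and its parameters are made up for this example.

```python
def crashloop_backoff_delays(restarts, base=10, cap=300):
    """Return the delay (in seconds) before each of the first `restarts` restarts.

    Hypothetical helper: the delay starts at `base`, doubles after every
    restart, and never exceeds `cap`.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap)  # double, but stay at or below the cap
    return delays


print(crashloop_backoff_delays(8))
# [10, 20, 40, 80, 160, 300, 300, 300]
```

After the fifth restart the delay would double to 320 seconds, so it is clamped to 300 and stays there.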
- Kubernetes troubleshooting: Pod stuck in Terminating
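Why resources.requests breaks the vicious cycle in item 7 can be shown with a deliberately simplified model: one node, and Pods admitted in order only while their requests still fit. The real scheduler weighs many more factors; the class, function, Pod names, and numbers below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requests:
    cpu_millicores: int
    memory_mib: int


def schedule(pods, allocatable):
    """Split (name, Requests) pairs into (scheduled, pending) for one node."""
    free_cpu = allocatable.cpu_millicores
    free_mem = allocatable.memory_mib
    scheduled, pending = [], []
    for name, req in pods:
        if req.cpu_millicores <= free_cpu and req.memory_mib <= free_mem:
            # The Pod fits: reserve its requested resources on the node.
            free_cpu -= req.cpu_millicores
            free_mem -= req.memory_mib
            scheduled.append(name)
        else:
            # Not enough unreserved capacity: the Pod waits instead of
            # starting and starving the Pods that are already running.
            pending.append(name)
    return scheduled, pending


node = Requests(cpu_millicores=2000, memory_mib=4096)
pods = [
    ("api", Requests(500, 1024)),
    ("db", Requests(1000, 2048)),
    ("worker", Requests(1000, 2048)),
]
print(schedule(pods, node))
# (['api', 'db'], ['worker'])
```

Without requests, all three Pods would start at once and compete for the same CPU and memory. With requests, the third Pod simply stays Pending until capacity frees up.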
Node Errors
1. DiskPressure: the node is running out of free disk space. (Check with df -h and keep at least 15% free.)
2. The node was low on resource: ephemeral-storage: same as above, the node has run out of storage space.
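The df -h check above can also be scripted. The following is a minimal sketch meant to be run on the node itself. It uses only the Python standard library, and the 15% threshold is the rule of thumb from this article rather than a value read from the kubelet configuration. The function names are invented for this example.

```python
import shutil


def free_percent(path="/"):
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # defensive: avoid dividing by zero on odd filesystems
    return usage.free / usage.total * 100


def is_low_on_disk(path="/", min_free_percent=15.0):
    """True when free space on `path` is below `min_free_percent`."""
    return free_percent(path) < min_free_percent


pct = free_percent("/")
print(f"/ has {pct:.1f}% free; low on disk: {is_low_on_disk('/')}")
```

The percentage may differ slightly from what df reports, since the two tools account for reserved blocks differently.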
Network Errors

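1. Ingress/Istio Gateway response codes
  1. 404: the requested Service/Istio Gateway does not exist.
  2. 503: the Pods behind the Service are NotReady.
  3. 504: two main possibilities
    1. The Ingress Controller's table of backend IPs may be stale, so requests are proxied to a Pod IP that no longer exists and never get a response.
    2. The Pod responds too slowly, which is an application code problem.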
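For quick triage, the status codes above can be kept as a small lookup table. This sketch only restates the causes listed in this section; the names LIKELY_CAUSES and likely_causes are invented for the example.

```python
# Likely causes per HTTP status returned by an Ingress / Istio Gateway,
# taken from the list above. Not exhaustive.
LIKELY_CAUSES = {
    404: ["The requested Service/Istio Gateway does not exist"],
    503: ["The Pods behind the Service are NotReady"],
    504: [
        "Ingress Controller proxies to a stale Pod IP that no longer exists",
        "The Pod responds too slowly (application code problem)",
    ],
}


def likely_causes(status):
    """Return the known likely causes for `status`, or an empty list."""
    return LIKELY_CAUSES.get(status, [])


for cause in likely_causes(504):
    print("504:", cause)
```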
Troubleshooting flow for Ingress-related network problems:
1. Which ingress controller?
2. Timeout between client and ingress controller, or between ingress controller and backend service/pod?
3. HTTP/504 generated by the ingress controller, proven by logs from the ingress controller?
4. If you port-forward to skip the internet between client and ingress controller, does the timeout still happen?
kubectl/istioctl and Other Client Tool Errors
1. socat not found: kubectl port forwarding depends on socat. Install socat on every node in the cluster, and on the local machine as well.
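Whether socat is present can be checked before port forwarding fails. This is a minimal sketch using the standard library's shutil.which. It only inspects the machine it runs on, so it would have to be run on each node as well. The helper name missing_tools is an assumption for this example.

```python
import shutil


def missing_tools(tools=("socat",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools are installed.")
```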
References
- Kubernetes Management Experience
- 504 Gateway Timeout when accessing workload via ingress
Original article: https://www.cnblogs.com/kirito-c/p/11923824.html
