Over the past two years I've worked with a number of teams to deploy their applications on Kubernetes. It's hard for developers to keep up with the pace of Kubernetes terminology, so whenever a deployment failed, I was the one asked to figure out what had gone wrong.
One of my main goals when working with customers is automation: getting myself out of tedious troubleshooting work as much as possible. To that end, I try to give developers the tools they need to diagnose failed deployments on their own. I've collected the most common reasons Kubernetes deployments fail, and below I'll share my troubleshooting process for each of them.
Without further ado, here are the top ten reasons Kubernetes deployments fail.
1. Wrong Container Image / Invalid Registry Permissions
Two of the most common problems are (a) specifying the wrong container image and (b) using a private image without providing registry credentials. These are especially tricky when you're first getting started with Kubernetes or wiring up CI/CD. Let's look at an example. First we'll create a Deployment named fail that points to a non-existent Docker image.
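The original manifest isn't reproduced in this translation; a kubectl one-liner is enough to follow along (the image name comes from the error quoted below, and depending on your kubectl version this creates either a Deployment or a bare Pod, both of which hit the same failure):
$ kubectl run fail --image=rosskukulinski/dne:v1.0.0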
Next we check on our Pods with kubectl get pods and see one with a status of ErrImagePull or ImagePullBackOff.
To get more information, we can describe the failing Pod.
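The Pod name below is a placeholder; substitute whatever name kubectl get pods reported:
$ kubectl describe pod <failing-pod-name>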
Looking at the Events section of the describe output, we find the error.
The key line, Failed to pull image "rosskukulinski/dne:v1.0.0": Error: image rosskukulinski/dne not found, tells us that Kubernetes was unable to find the image rosskukulinski/dne:v1.0.0.
So the question becomes: why can't Kubernetes pull the image?
Aside from network connectivity issues, there are three main culprits:
- The image tag is incorrect
- The image doesn't exist (or lives in a different registry)
- Kubernetes doesn't have permission to pull the image
If you haven't spotted a typo in your image tag, it's time to test from your local machine.
I usually run docker pull on my local development machine with the exact same image tag. In this case, I would run docker pull rosskukulinski/dne:v1.0.0.
- If that succeeds, it's likely that Kubernetes doesn't have permission to pull the image. Image pull Secrets are the usual fix (see the sketch after this list).
- If it fails, I then test without an explicit tag - docker pull rosskukulinski/dne - which attempts to pull the latest tag. If that succeeds, the tag I originally specified doesn't exist. That could be human error, a typo, or a misconfigured CI/CD pipeline.
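For the registry-permissions case in the first bullet, the usual fix is an image pull Secret. A minimal sketch, assuming a Docker-registry-style login (the regcred name and the placeholder credentials are illustrative, not from the post):
$ kubectl create secret docker-registry regcred \
    --docker-server=<your-registry> \
    --docker-username=<username> \
    --docker-password=<password> \
    --docker-email=<email>
You then reference it from the Pod template of your Deployment (fragment, for placement only):
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
        - name: test-container
          image: <your-private-image>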
If docker pull rosskukulinski/dne (with no tag) also fails, then we have a bigger problem: the image doesn't exist in our registry at all. By default, Kubernetes uses the Dockerhub registry; if you're using Quay.io, AWS ECR, or Google Container Registry, you need to include the registry's URL in the image name. For example, with Quay the image becomes quay.io/rosskukulinski/dne:v1.0.0.
If you are using Dockerhub, double-check the system that publishes your images to Dockerhub and make sure the name and tag match what your Deployment is using.
Note: looking at the Pod status alone, a missing image and missing registry permissions are indistinguishable. In either case, Kubernetes will report an ErrImagePull status.
2. Application Crashing After Launch
Whether you're launching a new application on Kubernetes or migrating an existing one onto the platform, having the application crash right after startup is a common sight. Let's create a Deployment whose application crashes one second after starting.
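The post's own crashing image isn't shown here; as a stand-in, here's a minimal sketch of a Deployment whose container exits with code 1 after one second (the crasher name, busybox image, and command are placeholder choices, not from the original):
# crasher.yaml (illustrative)
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: crasher
spec:
  template:
    metadata:
      labels:
        app: crasher
    spec:
      containers:
        - name: crasher
          image: busybox
          # Sleep for one second, then exit with an error code
          command: ["/bin/sh", "-c", "sleep 1 && exit 1"]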
Then we check the state of our Pods with kubectl get pods.
CrashLoopBackOff tells us that Kubernetes is trying hard to launch this Pod, but one or more of its containers has crashed or is being terminated.
Let's describe the Pod with kubectl describe pod to get more information.
Yikes. Kubernetes tells us that this Pod is being Terminated because the application inside the container crashed. We can also see that the application's Exit Code was 1. Later on we might also see an OOMKilled error.
So our application is crashing. Why?
The first thing to do is check the application's logs. Assuming you send your application logs to stdout (and you should!), you can view them with kubectl logs <podname>.
Unfortunately, this Pod doesn't have any logs. That may be because we're looking at a freshly restarted instance of the application, so we should check the previous container with kubectl logs <podname> --previous.
What! Our application still isn't giving us anything to work with. At this point it's time to add some startup logging to the application to help track down the problem. We can also try running the container locally to confirm we're not missing environment variables or mounted volumes.
3. Missing ConfigMaps or Secrets
Kubernetes best practice is to pass an application's runtime configuration through ConfigMaps or Secrets. That data can include database credentials, API endpoints, or other configuration. A common mistake is creating a Deployment that references a property of a ConfigMap or Secret that doesn't exist, or sometimes a ConfigMap or Secret that doesn't exist at all.
Missing ConfigMap
For the first example, we'll try to create a Pod that loads ConfigMap data as environment variables. We create the Pod with kubectl create -f configmap-pod.yaml and, after waiting a few minutes, check on it.
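The manifest itself isn't reproduced in this translation; a sketch along these lines triggers the same failure (only the special-config ConfigMap name comes from the Events output discussed below; the Pod name, image, command, env var, and key are placeholders):
# configmap-pod.yaml (sketch)
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: test-container
      image: busybox
      command: ["/bin/sh", "-c", "env"]
      env:
        # This expects a ConfigMap named special-config to exist in the namespace
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.how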
The Pod is stuck in a RunContainerError state. We can use kubectl describe to learn more.
The last entry in the Events section tells us what went wrong: the Pod tried to access a ConfigMap named special-config, but none could be found in this namespace. Once we create that ConfigMap, the Pod should restart and successfully pull in its runtime data.
Accessing Secrets as environment variables in a Pod spec produces a similar error to the ConfigMap error we just saw.
But what happens if you access a Secret or ConfigMap through a Volume instead?
Missing Secrets
Here's a Pod spec that references a Secret named myothersecret and tries to mount it as a volume. Let's create the Pod with kubectl create -f missing-secret.yaml.
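The exact spec isn't shown in this post; a sketch like the following reproduces the behaviour (only the Secret name myothersecret comes from the text; the Pod name, image, and mount path are placeholders):
# missing-secret.yaml (sketch)
apiVersion: v1
kind: Pod
metadata:
  name: secret-vol-pod
spec:
  containers:
    - name: test-container
      image: nginx
      volumeMounts:
        - name: myothersecret
          mountPath: /etc/secret
          readOnly: true
  volumes:
    - name: myothersecret
      secret:
        # This Secret doesn't exist yet, so the volume can never be mounted
        secretName: myothersecret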
A few minutes later, when we get our Pods, we see that the Pod is still stuck in the ContainerCreating state.
That's odd. Let's describe the Pod and see what's going on.
Once again, the Events section explains the problem: it tells us the Kubelet was unable to mount the volume from the Secret named myothersecret. To fix this, we create myothersecret containing the necessary credentials. Once myothersecret exists, the container will start correctly.
4. Liveness/Readiness Probe Failures
An important lesson developers need to learn when dealing with containers on Kubernetes is that just because your containerized application is running doesn't mean it's working. Kubernetes provides two essential features called liveness probes and readiness probes. Essentially, liveness/readiness probes periodically perform an action (for example, make an HTTP request, open a TCP connection, or run a command in your container) to confirm that your application is working as intended.
If the liveness probe fails, Kubernetes kills your container and creates a new one. If the readiness probe fails, the Pod is removed from the Service's set of backend endpoints, meaning no traffic is routed to that Pod until it becomes Ready.
If you try to deploy a change to an application that then fails its liveness/readiness probes, the rolling deploy will hang forever as it waits for all of your Pods to become Ready.
What does this look like in practice? Here's a Pod spec that defines liveness and readiness probes, both of which health-check the /healthy route on port 8080.
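The spec isn't reproduced here; a minimal sketch matching the description (the container name test-container and the /healthy check on port 8080 come from the text; the Pod name and nginx image are placeholders, and nginx serves neither the path nor the port, so both probes fail):
# liveness.yaml (sketch)
apiVersion: v1
kind: Pod
metadata:
  name: liveness-pod
spec:
  containers:
    - name: test-container
      image: nginx
      livenessProbe:
        httpGet:
          path: /healthy
          port: 8080
      readinessProbe:
        httpGet:
          path: /healthy
          port: 8080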
Let's create the Pod with kubectl create -f liveness.yaml and check on it a few minutes later.
Two minutes in, we find that the Pod is still not Ready, and it has already been restarted four times. Let's describe it for more information.
The Events section saves us once again. We can see that both the liveness and readiness probes are failing. The key line is container "test-container" is unhealthy, it will be killed and re-created, which tells us that Kubernetes is killing the container because its liveness probe is failing.
There are three likely possibilities:
- Your probes are wrong. Has the health-check URL changed?
- Your probes are too sensitive. Does your application take a while to start up or to respond?
- Your application is never going to respond correctly to the probes. Is your database misconfigured?
Looking at the Pod's logs is a good place to start debugging. Once you've fixed the issue, a fresh deployment should succeed.
5. Exceeding CPU/Memory Limits
Kubernetes gives cluster administrators the ability to cap the amount of CPU or memory that Pods and containers may use. As an application developer you might not know about those limits, and then be baffled when your deployment fails. Let's try deploying a Deployment whose resource request, unbeknownst to us, blows past those limits.
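The manifest isn't reproduced in this translation; a sketch consistent with the details that follow (the gateway name and the 5Gi request come from the text; the nginx image is a placeholder) looks like this:
# gateway.yaml (sketch)
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              # Far more memory than the namespace's LimitRange allows
              memory: 5Gi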
You'll see that we've set a resource request of 5Gi. Let's create the Deployment: kubectl create -f gateway.yaml.
Now let's check on our Pod.
Hmm, something's wrong. Let's describe our Deployment to see what's going on.
Based on the last line of the output, our Deployment created a ReplicaSet (gateway-764140025) and scaled it up to 1. The ReplicaSet is the entity that manages the lifecycle of the Pods. We can describe the ReplicaSet with kubectl describe replicaset gateway-764140025.
Aha, there it is. The cluster administrator has set a maximum memory usage of 100Mi per Pod (what a cheapskate!). You can run kubectl describe limitrange to see the limits configured for your current namespace.
You now have three options:
- Ask your cluster administrator to raise the limit
- Reduce the requests or limits in your Deployment
- Go rogue and edit the limit yourself
That covers the first five failure modes. Part 2 continues below with reasons 6 through 10.
6. Resource Quotas
Similar to resource limits, Kubernetes also allows admins to set Resource Quotas per namespace. These quotas can set soft & hard limits on resources such as number of Pods, Deployments, PersistentVolumes, CPUs, Memory, and more.
Let's see what happens when we exceed a Resource Quota. Here's our example Deployment again:
# test-quota.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway-quota
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
We can create it with kubectl create -f test-quota.yaml, then inspect our Pods.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
gateway-quota-551394438-pix5d 1/1 Running 0 16s
Looks good! Now let's scale up to 3 replicas with kubectl scale deploy/gateway-quota --replicas=3, and then inspect our Pods again.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
gateway-quota-551394438-pix5d 1/1 Running 0 9m
Huh? Where are our pods? Let's inspect the Deployment.
$ kubectl describe deploy/gateway-quota
Name: gateway-quota
Namespace: fail
CreationTimestamp: Sat, 11 Feb 2017 16:33:16 -0500
Labels: app=gateway
Selector: app=gateway
Replicas: 1 updated | 3 total | 1 available | 2 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
OldReplicaSets:
NewReplicaSet: gateway-quota-551394438 (1/3 replicas created)
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
9m 9m 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set gateway-quota-551394438 to 1
5m 5m 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set gateway-quota-551394438 to 3
In the last line, we can see the ReplicaSet was told to scale to 3. Let's inspect the ReplicaSet using describe to learn more.
$ kubectl describe replicaset gateway-quota-551394438
Name: gateway-quota-551394438
Namespace: fail
Image(s): nginx
Selector: app=gateway,pod-template-hash=551394438
Labels: app=gateway
pod-template-hash=551394438
Replicas: 1 current / 3 desired
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
11m 11m 1 {replicaset-controller } Normal SuccessfulCreate Created pod: gateway-quota-551394438-pix5d
11m 30s 33 {replicaset-controller } Warning FailedCreate Error creating: pods "gateway-quota-551394438-" is forbidden: exceeded quota: compute-resources, requested: pods=1, used: pods=1, limited: pods=1
Aha! Our ReplicaSet wasn't able to create any more pods due to the quota: exceeded quota: compute-resources, requested: pods=1, used: pods=1, limited: pods=1. Similar to Resource Limits, we have three options:
- Ask your cluster admin to increase the Quota for this namespace
- Delete or scale back other Deployments in this namespace
- Go rogue and edit the Quota
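If you want to look at the quota you're bumping into before choosing, kubectl will show it directly (the compute-resources quota name and fail namespace come from the output above):
$ kubectl describe resourcequota compute-resources --namespace=fail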
7. Insufficient Cluster Resources
Unless your cluster administrator has wired up the cluster-autoscaler, chances are that someday you will run out of CPU or Memory resources in your cluster.
That's not to say that CPU & Memory are fully utilized - just that they have been fully accounted for by the Kubernetes Scheduler. As we saw in #5, Cluster Administrators can limit the amount of CPU or memory a developer can request to be allocated to a Pod or container. Wise administrators will also set a default CPU/Memory request that will be applied if you (the developer) don't request anything.
If you do all your work in the default namespace, you probably have a default Container CPU Request of 100m and you don't even know it! Check yours by running kubectl describe ns default.
Let's say you have a Kubernetes cluster with 1 Node that has 1 CPU. Your Kubernetes cluster has 1000m of available CPU to schedule.
Ignoring other system Pods (kubectl -n kube-system get pods) for the moment, you will be able to deploy 10 Pods (with 1 container each at 100m) to your single-Node cluster.
10 Pods * (1 Container * 100m) = 1000m == Cluster CPUs
So what happens when you turn it up to 11?
Here's an example Deployment that has a CPU request of 1 CPU (1000m).
# cpu-scale.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cpu-scale
spec:
  template:
    metadata:
      labels:
        app: cpu-scale
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              cpu: 1
I'm deploying this application to a Cluster that has 2 total CPUs available. In addition to my cpu-scale application, the Kubernetes internal services are consuming more CPU/Memory Requests.
So we can deploy this with kubectl create -f cpu-scale.yaml, and then inspect the Pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cpu-scale-908056305-xstti 1/1 Running 0 5m
So the first Pod was scheduled and is running. Let's see what happens when we scale up by one:
$ kubectl scale deploy/cpu-scale --replicas=2
deployment "cpu-scale" scaled
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cpu-scale-908056305-phb4j 0/1 Pending 0 4m
cpu-scale-908056305-xstti 1/1 Running 0 5m
Uh oh. Our second Pod is stuck with a status of Pending. We can describe that second Pod for more information:
$ kubectl describe pod cpu-scale-908056305-phb4j
Name: cpu-scale-908056305-phb4j
Namespace: fail
Node: gke-ctm-1-sysdig2-35e99c16-qwds/10.128.0.4
Start Time: Sun, 12 Feb 2017 08:57:51 -0500
Labels: app=cpu-scale
pod-template-hash=908056305
Status: Pending
IP:
Controllers: ReplicaSet/cpu-scale-908056305
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
3m 3m 1 {default-scheduler } Warning FailedScheduling pod (cpu-scale-908056305-phb4j) failed to fit in any node
fit failure on node (gke-ctm-1-sysdig2-35e99c16-wx0s): Insufficient cpu
fit failure on node (gke-ctm-1-sysdig2-35e99c16-tgfm): Insufficient cpu
fit failure on node (gke-ctm-1-sysdig2-35e99c16-qwds): Insufficient cpu
Alright! So the Events block tells us that the Kubernetes scheduler (default-scheduler) was unable to schedule this Pod because it failed to fit on any node. It even tells us which scalability aspect failed (Insufficient cpu) for each Node.
So how do we resolve this? Well, if you've been too eager with the size of your Requested CPU/Memory, you could reduce the request size and re-deploy. Alternatively, you could kindly ask your Cluster admin to scale up the cluster (chances are you're not the only one running into this problem).
Now, you might be thinking to yourself: "Our Kubernetes Nodes are in auto-scaling groups with our Cloud Provider. Why aren't they working?"
The answer is that your cloud provider doesn't have any insight into what the Kubernetes Scheduler is doing. Leveraging the Kubernetes cluster-autoscaler will allow your Cluster to resize itself based on the Scheduler's requirements. If you're using Google Container Engine, the cluster-autoscaler is a Beta feature.
8. PersistentVolume fails to mount
Another common error is trying to create a Deployment that references PersistentVolumes that don't exist. Whether you're using PersistentVolumeClaims (which you should be!) or just directly accessing a PersistentDisk, the end result is very similar.
Here's our test Deployment that is trying to use a GCE PersistentDisk named my-data-disk.
# volume-test.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: volume-test
spec:
  template:
    metadata:
      labels:
        app: volume-test
    spec:
      containers:
        - name: test-container
          image: nginx
          volumeMounts:
            - mountPath: /test
              name: test-volume
      volumes:
        - name: test-volume
          # This GCE PD must already exist (oops!)
          gcePersistentDisk:
            pdName: my-data-disk
            fsType: ext4
Let's create this Deployment with kubectl create -f volume-test.yaml and check the Pods after a few minutes.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
volume-test-3922807804-33nux 0/1 ContainerCreating 0 3m
Three minutes is a long time to wait for a Container to create. Let's inspect the Pod with describe and see what's happening under the hood:
$ kubectl describe pod volume-test-3922807804-33nux
Name: volume-test-3922807804-33nux
Namespace: fail
Node: gke-ctm-1-sysdig2-35e99c16-qwds/10.128.0.4
Start Time: Sun, 12 Feb 2017 09:24:50 -0500
Labels: app=volume-test
pod-template-hash=3922807804
Status: Pending
IP:
Controllers: ReplicaSet/volume-test-3922807804
[...]
Volumes:
test-volume:
Type: GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
PDName: my-data-disk
FSType: ext4
Partition: 0
ReadOnly: false
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
4m 4m 1 {default-scheduler } Normal Scheduled Successfully assigned volume-test-3922807804-33nux to gke-ctm-1-sysdig2-35e99c16-qwds
1m 1m 1 {kubelet gke-ctm-1-sysdig2-35e99c16-qwds} Warning FailedMount Unable to mount volumes for pod "volume-test-3922807804-33nux_fail(e2180d94-f12e-11e6-bd01-42010af0012c)": timeout expired waiting for volumes to attach/mount for pod "volume-test-3922807804-33nux"/"fail". list of unattached/unmounted volumes=[test-volume]
1m 1m 1 {kubelet gke-ctm-1-sysdig2-35e99c16-qwds} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "volume-test-3922807804-33nux"/"fail". list of unattached/unmounted volumes=[test-volume]
3m 50s 3 {controller-manager } Warning FailedMount Failed to attach volume "test-volume" on node "gke-ctm-1-sysdig2-35e99c16-qwds" with: GCE persistent disk not found: diskName="my-data-disk" zone="us-central1-a"
Surprise! The Events section holds the hidden clues we were looking for. Our Pod was correctly scheduled to a Node (Successfully assigned volume-test-3922807804-33nux to gke-ctm-1-sysdig2-35e99c16-qwds), but then the Kubelet on that Node was unable to mount the expected volume, test-volume. That Volume would have been created when the PersistentDisk was attached to the Node, but, as we see further down, the controller-manager failed: Failed to attach volume "test-volume" on node "gke-ctm-1-sysdig2-35e99c16-qwds" with: GCE persistent disk not found: diskName="my-data-disk" zone="us-central1-a".
This last message is pretty clear: to resolve the issue, we need to create a persistent disk in GCE with the name my-data-disk in the zone us-central1-a. Once that disk is created, the controller-manager will mount the disk and kickstart the Container creation process.
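On Google Cloud, creating the missing disk is a single gcloud call (the disk name and zone come from the error message above; the --size value is an arbitrary example):
$ gcloud compute disks create my-data-disk --zone=us-central1-a --size=10GB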
9. Validation Errors
Few things are more frustrating than watching an entire build-test-deploy job get all the way to the deploy step, only to fail due to invalid Kubernetes Spec objects.
You may have gotten an error like this before:
$ kubectl create -f test-application.deploy.yaml
error: error validating "test-application.deploy.yaml": error validating data: found invalid field resources for v1.PodSpec; if you choose to ignore these errors, turn validation off with --validate=false
In this example, I tried creating the following Kubernetes Deployment:
# test-application.deploy.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: test-app
spec:
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - image: nginx
          name: nginx
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 100Mi
At first glance, this YAML looks fine - but the error message proves to be helpful. The error says that it found invalid field resources for v1.PodSpec. Upon deeper inspection of the v1.PodSpec, we can see that the resources object is (incorrectly) a child of v1.PodSpec. It should be a child of v1.Container. After indenting the resources object one level, this Deployment works just fine!
In addition to looking out for indentation mistakes, another common error is a typo in an Object name (e.g. peristentVolumeClaim vs persistentVolumeClaim). That one briefly tripped up a senior engineer and me when we were in a hurry.
To help catch these errors early, I recommend adding some verification steps to your pre-commit hooks or test-phase of your build.
For example, you can:
- Validate your YAML with python -c 'import yaml,sys;yaml.safe_load(sys.stdin)' < test-application.deploy.yaml
- Validate your Kubernetes API objects using the --dry-run flag, like this: kubectl create -f test-application.deploy.yaml --dry-run --validate=true
Important Note: This mechanism for validating Kubernetes Objects leverages server-side validation. This means that kubectl must have a working Kubernetes cluster to communicate with. Unfortunately, there currently is no client-side validation option for kubectl, but there are open issues (kubernetes/kubernetes #29410 and kubernetes/kubernetes #11488) tracking that missing feature.
10. Container Image Not Updating
Most people I've talked to who have worked with Kubernetes have run into this problem, and it's a real kicker.
The story goes something like this:
1. Create a Deployment using an image tag (e.g. rosskukulinski/myapplication:v1)
2. Notice that there's a bug in myapplication
3. Build a new image and push to the same tag (rosskukulinski/myapplication:v1)
4. Delete any myapplication Pods, and watch new ones get created by the Deployment
5. Realize that the bug is still present
6. Repeat 3-5 until you pull your hair out
The problem relates to how Kubernetes decides whether to do a docker pull when starting a container in a Pod.
In the v1.Container specification there's an option called ImagePullPolicy:
Image pull policy. One of Always, Never, IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.
Since we tagged our image as :v1, the default pull policy is IfNotPresent. The Kubelet already has a local copy of rosskukulinski/myapplication:v1, so it doesn't attempt to do a docker pull. When the new Pods come up, they're still using the old broken Docker image.
There are three ways to resolve this:
- Switch to using :latest (DO NOT DO THIS!)
- Specify ImagePullPolicy: Always in your Deployment (see the snippet after this list)
- Use unique tags (e.g. based on your source control commit id)
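Option 2 is a one-line change in the Pod template; in YAML the field is spelled imagePullPolicy. A minimal sketch (the Deployment name and labels are placeholders; the image is reused from the story above):
# myapplication.deploy.yaml (illustrative)
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myapplication
spec:
  template:
    metadata:
      labels:
        app: myapplication
    spec:
      containers:
        - name: myapplication
          image: rosskukulinski/myapplication:v1
          # Force the Kubelet to pull the image every time a container starts
          imagePullPolicy: Always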
During development or if I'm quickly prototyping something, I will specify ImagePullPolicy: Always so that I can build & push container images with the same tag.
However, in all of my production deployments I use unique tags based on the Git SHA-1 of the commit used to build that image. This makes it trivial to identify and check-out the source code that's running in production for any deployed application.
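A rough sketch of how that looks in a build script (the short Git SHA stands in for the full commit id; the image and Deployment names reuse the example above and are otherwise assumptions, and kubectl set image is just one way to roll the new tag out):
$ TAG=$(git rev-parse --short HEAD)
$ docker build -t rosskukulinski/myapplication:$TAG .
$ docker push rosskukulinski/myapplication:$TAG
$ kubectl set image deployment/myapplication myapplication=rosskukulinski/myapplication:$TAG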
Summary
Phew! That's a lot of things to watch out for. By now, you should be a pro at debugging, identifying, and fixing failed Kubernetes Deployments.
In general, most of the common deployment failures can be debugged using these commands:
kubectl describe deployment/<deployname>
kubectl describe replicaset/<rsname>
kubectl get pods
kubectl describe pod/<podname>
kubectl logs <podname> --previous
In the quest to automate myself out of a job, I created a bash script that runs anytime a CI/CD deployment fails. Helpful Kubernetes information will show up in the Jenkins/CircleCI/etc build output so that developers can quickly find any obvious problems.
I hope you have enjoyed these two posts!
How have you seen Kubernetes Deployments fail? Any other troubleshooting tips to share? Leave a comment!