6. Resource Quotas

Similar to resource limits, Kubernetes also allows admins to set Resource Quotas per namespace. These quotas can set hard limits on resources such as the number of Pods, Deployments, PersistentVolumes, CPU, memory, and more.
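For context, here's what such a quota object might look like. This manifest is an illustrative sketch: the name compute-resources and the single-Pod cap are modeled on the quota error we hit later in this section, not copied from a real cluster.

```yaml
# compute-resources-quota.yaml -- illustrative sketch; the quota name and
# the single-Pod cap are assumptions based on the quota error shown below.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: fail
spec:
  hard:
    pods: "1"
```

A cluster admin would apply this once per namespace, and every subsequent Pod creation in that namespace is checked against it.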

Let's see what happens when we exceed a Resource Quota. Here's our example Deployment again:


# test-quota.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway-quota
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx

We can create it with kubectl create -f test-quota.yaml, then inspect our Pods.


$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
gateway-quota-551394438-pix5d   1/1       Running   0          16s

Looks good! Now let's scale up to 3 replicas, kubectl scale deploy/gateway-quota --replicas=3, and then inspect our Pods again.


$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
gateway-quota-551394438-pix5d   1/1       Running   0          9m

Huh? Where are our pods? Let's inspect the Deployment.


$ kubectl describe deploy/gateway-quota
Name:            gateway-quota
Namespace:        fail
CreationTimestamp:    Sat, 11 Feb 2017 16:33:16 -0500
Labels:            app=gateway
Selector:        app=gateway
Replicas:        1 updated | 3 total | 1 available | 2 unavailable
StrategyType:        RollingUpdate
MinReadySeconds:    0
RollingUpdateStrategy:    1 max unavailable, 1 max surge
OldReplicaSets:        
NewReplicaSet:        gateway-quota-551394438 (1/3 replicas created)
Events:
  FirstSeen    LastSeen    Count   From                SubObjectPath   Type        Reason          Message
  ---------    --------    -----   ----                -------------   --------    ------          -------
  9m        9m      1   {deployment-controller }            Normal      ScalingReplicaSet   Scaled up replica set gateway-quota-551394438 to 1
  5m        5m      1   {deployment-controller }            Normal      ScalingReplicaSet   Scaled up replica set gateway-quota-551394438 to 3

In the last line, we can see the ReplicaSet was told to scale to 3. Let's inspect the ReplicaSet using describe to learn more.


$ kubectl describe replicaset gateway-quota-551394438
Name:        gateway-quota-551394438
Namespace:    fail
Image(s):    nginx
Selector:    app=gateway,pod-template-hash=551394438
Labels:        app=gateway
        pod-template-hash=551394438
Replicas:    1 current / 3 desired
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
  FirstSeen    LastSeen    Count   From                SubObjectPath   Type        Reason          Message
  ---------    --------    -----   ----                -------------   --------    ------          -------
  11m        11m     1   {replicaset-controller }            Normal      SuccessfulCreate    Created pod: gateway-quota-551394438-pix5d
  11m        30s     33  {replicaset-controller }            Warning     FailedCreate        Error creating: pods "gateway-quota-551394438-" is forbidden: exceeded quota: compute-resources, requested: pods=1, used: pods=1, limited: pods=1

Aha! Our ReplicaSet wasn't able to create any more pods due to the quota: exceeded quota: compute-resources, requested: pods=1, used: pods=1, limited: pods=1. Similar to Resource Limits, we have three options:

  1. Ask your cluster admin to increase the Quota for this namespace
  2. Delete or scale back other Deployments in this namespace
  3. Go rogue and edit the Quota

7. Insufficient Cluster Resources

Unless your cluster administrator has wired up the cluster-autoscaler, chances are that someday you will run out of CPU or Memory resources in your cluster.

That's not to say that CPU & Memory are fully utilized - just that they have been fully accounted for by the Kubernetes Scheduler. As we saw in #5, Cluster Administrators can limit the amount of CPU or memory a developer can request to be allocated to a Pod or container. Wise administrators will also set a default CPU/Memory request that will be applied if you (the developer) don't request anything.

If you do all your work in the default namespace, you probably have a default Container CPU Request of 100m and you don't even know it! Check yours by running kubectl describe ns default.
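Those defaults are typically set with a LimitRange object. Here's a sketch of what one providing the 100m default CPU request might look like; the object name and the 200m default limit are my assumptions:

```yaml
# cpu-defaults.yaml -- sketch of a LimitRange an admin might apply to the
# default namespace; object name and default limit value are assumptions.
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults
  namespace: default
spec:
  limits:
    - type: Container
      # Applied as the CPU request when a container doesn't specify one
      defaultRequest:
        cpu: 100m
      # Applied as the CPU limit when a container doesn't specify one
      default:
        cpu: 200m
```

With this in place, any container created without an explicit CPU request silently gets 100m, which is exactly what makes the accounting below possible.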

Let's say you have a Kubernetes cluster with 1 Node that has 1 CPU. Your Kubernetes cluster has 1000m of available CPU to schedule.

Ignoring other system Pods (kubectl -n kube-system get pods) for the moment, you will be able to deploy 10 Pods (with 1 container each at 100m) to your single-Node cluster.

10 Pods * (1 Container * 100m) = 1000m == Cluster CPUs

So what happens when you turn it up to 11?

Here's an example Deployment with a CPU request of 1 CPU (1000m).


# cpu-scale.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cpu-scale
spec:
  template:
    metadata:
      labels:
        app: cpu-scale
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              cpu: 1

I'm deploying this application to a cluster that has 2 total CPUs available. In addition to my cpu-scale application, the Kubernetes internal services have already claimed some of that capacity through their own CPU/Memory Requests.

So we can deploy this with kubectl create -f cpu-scale.yaml, and then inspect the Pods:


$ kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
cpu-scale-908056305-xstti   1/1       Running   0          5m

So the first Pod was scheduled and is running. Let's see what happens when we scale up by one:


$ kubectl scale deploy/cpu-scale --replicas=2
deployment "cpu-scale" scaled
$ kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
cpu-scale-908056305-phb4j   0/1       Pending   0          4m
cpu-scale-908056305-xstti   1/1       Running   0          5m

Uh oh. Our second Pod is stuck with a status of Pending. We can describe that second Pod for more information:


$ kubectl describe pod cpu-scale-908056305-phb4j
Name:        cpu-scale-908056305-phb4j
Namespace:    fail
Node:        gke-ctm-1-sysdig2-35e99c16-qwds/10.128.0.4
Start Time:    Sun, 12 Feb 2017 08:57:51 -0500
Labels:        app=cpu-scale
        pod-template-hash=908056305
Status:        Pending
IP:        
Controllers:    ReplicaSet/cpu-scale-908056305
[...]
Events:
  FirstSeen    LastSeen    Count   From            SubObjectPath   Type        Reason          Message
  ---------    --------    -----   ----            -------------   --------    ------          -------
  3m        3m      1   {default-scheduler }            Warning     FailedScheduling    pod (cpu-scale-908056305-phb4j) failed to fit in any node
fit failure on node (gke-ctm-1-sysdig2-35e99c16-wx0s): Insufficient cpu
fit failure on node (gke-ctm-1-sysdig2-35e99c16-tgfm): Insufficient cpu
fit failure on node (gke-ctm-1-sysdig2-35e99c16-qwds): Insufficient cpu

Alright! The Events block tells us that the Kubernetes scheduler (default-scheduler) was unable to schedule this Pod because it failed to fit on any node. It even tells us which resource was insufficient (Insufficient cpu) on each Node.

So how do we resolve this? Well, if you've been too eager with the size of your Requested CPU/Memory, you could reduce the request size and re-deploy. Alternatively, you could kindly ask your Cluster admin to scale up the cluster (chances are you're not the only one running into this problem).

Now, you might be thinking to yourself: "Our Kubernetes Nodes are in auto-scaling groups with our Cloud Provider. Why aren't they working?"

The answer is that your cloud provider doesn't have any insight into what the Kubernetes Scheduler is doing. Leveraging the Kubernetes cluster-autoscaler will allow your Cluster to resize itself based on the Scheduler's requirements. If you're using Google Container Engine, the cluster-autoscaler is a Beta feature.

8. PersistentVolume fails to mount

Another common error is trying to create a Deployment that references PersistentVolumes that don't exist. Whether you're using PersistentVolumeClaims (which you should be!) or just directly accessing a PersistentDisk, the end result is very similar.

Here's our test Deployment that is trying to use a GCE PersistentDisk named my-data-disk.


# volume-test.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: volume-test
spec:
  template:
    metadata:
      labels:
        app: volume-test
    spec:
      containers:
        - name: test-container
          image: nginx
          volumeMounts:
          - mountPath: /test
            name: test-volume
      volumes:
      - name: test-volume
        # This GCE PD must already exist (oops!)
        gcePersistentDisk:
          pdName: my-data-disk
          fsType: ext4

Let's create this Deployment with kubectl create -f volume-test.yaml, and check the Pods after a few minutes.


$ kubectl get pods
NAME                           READY     STATUS              RESTARTS   AGE
volume-test-3922807804-33nux   0/1       ContainerCreating   0          3m

Three minutes is a long time to wait for a Container to create. Let's inspect the Pod with describe and see what's happening under the hood:


$ kubectl describe pod volume-test-3922807804-33nux
Name:        volume-test-3922807804-33nux
Namespace:    fail
Node:        gke-ctm-1-sysdig2-35e99c16-qwds/10.128.0.4
Start Time:    Sun, 12 Feb 2017 09:24:50 -0500
Labels:        app=volume-test
        pod-template-hash=3922807804
Status:        Pending
IP:        
Controllers:    ReplicaSet/volume-test-3922807804
[...]
Volumes:
  test-volume:
    Type:    GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:    my-data-disk
    FSType:    ext4
    Partition:    0
    ReadOnly:    false
[...]
Events:
  FirstSeen    LastSeen    Count   From                        SubObjectPath   Type        Reason      Message
  ---------    --------    -----   ----                        -------------   --------    ------      -------
  4m        4m      1   {default-scheduler }                        Normal      Scheduled   Successfully assigned volume-test-3922807804-33nux to gke-ctm-1-sysdig2-35e99c16-qwds
  1m        1m      1   {kubelet gke-ctm-1-sysdig2-35e99c16-qwds}           Warning     FailedMount Unable to mount volumes for pod "volume-test-3922807804-33nux_fail(e2180d94-f12e-11e6-bd01-42010af0012c)": timeout expired waiting for volumes to attach/mount for pod "volume-test-3922807804-33nux"/"fail". list of unattached/unmounted volumes=[test-volume]
  1m        1m      1   {kubelet gke-ctm-1-sysdig2-35e99c16-qwds}           Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "volume-test-3922807804-33nux"/"fail". list of unattached/unmounted volumes=[test-volume]
  3m        50s     3   {controller-manager }                       Warning     FailedMount Failed to attach volume "test-volume" on node "gke-ctm-1-sysdig2-35e99c16-qwds" with: GCE persistent disk not found: diskName="my-data-disk" zone="us-central1-a"

Surprise! The Events section holds the hidden clues we were looking for. Our Pod was correctly scheduled to a Node (Successfully assigned volume-test-3922807804-33nux to gke-ctm-1-sysdig2-35e99c16-qwds), but the Kubelet on that Node was unable to mount the expected volume, test-volume. That Volume would have been created when the PersistentDisk was attached to the Node, but, as we see further down, the controller-manager failed with: Failed to attach volume "test-volume" on node "gke-ctm-1-sysdig2-35e99c16-qwds" with: GCE persistent disk not found: diskName="my-data-disk" zone="us-central1-a".

This last message is pretty clear: to resolve the issue, we need to create a GCE persistent disk named my-data-disk in zone us-central1-a (for example, with gcloud compute disks create my-data-disk --zone us-central1-a). Once that disk exists, the controller-manager will attach it, the Kubelet will mount it, and the Container creation process will proceed.

9. Validation Errors

Few things are more frustrating than watching an entire build-test-deploy job get all the way to the deploy step, only to fail due to invalid Kubernetes Spec objects.

You may have gotten an error like this before:


$ kubectl create -f test-application.deploy.yaml
error: error validating "test-application.deploy.yaml": error validating data: found invalid field resources for v1.PodSpec; if you choose to ignore these errors, turn validation off with --validate=false

In this example, I tried creating the following Kubernetes Deployment:


# test-application.deploy.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: test-app
spec:
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - image: nginx
        name: nginx
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 100Mi

At first glance, this YAML looks fine - but the error message proves to be helpful. The error says that it found invalid field resources for v1.PodSpec. Upon deeper inspection of the v1.PodSpec, we can see that the resources object is (incorrectly) a child of v1.PodSpec. It should be a child of v1.Container. After indenting the resources object one level, this Deployment works just fine!
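For comparison, here's the same Deployment with resources correctly indented under the container:

```yaml
# test-application.deploy.yaml (corrected)
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: test-app
spec:
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - image: nginx
        name: nginx
        # resources now belongs to the container, not the PodSpec
        resources:
          limits:
            cpu: 100m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
```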

In addition to looking out for indentation mistakes, another common error is a typo in an Object name (e.g. peristentVolumeClaim vs persistentVolumeClaim). That one briefly tripped up a senior engineer and me when we were in a hurry.

To help catch these errors early, I recommend adding some verification steps to your pre-commit hooks or test-phase of your build.

For example, you can:

  1. Validate your YAML with python -c 'import yaml,sys;yaml.safe_load(sys.stdin)' < test-application.deploy.yaml
  2. Validate your Kubernetes API objects using the --dry-run flag like this: kubectl create -f test-application.deploy.yaml --dry-run --validate=true

Important Note: This mechanism for validating Kubernetes Objects leverages server-side validation. This means that kubectl must have a working Kubernetes cluster to communicate with. Unfortunately, there currently is not a client-side validation option for kubectl, but there are open issues (kubernetes/kubernetes #29410 and kubernetes/kubernetes #11488) that are tracking that missing feature.

10. Container Image Not Updating

Most people I've talked to who have worked with Kubernetes have run into this problem, and it's a real kicker.

The story goes something like this:

  1. Create a Deployment using an image tag (e.g. rosskukulinski/myapplication:v1)
  2. Notice that there's a bug in myapplication
  3. Build a new image and push to the same tag (rosskukulinski/myapplication:v1)
  4. Delete any myapplication Pods, watch new ones get created by the Deployment
  5. Realize that the bug is still present
  6. Repeat 3-5 until you pull your hair out

The problem relates to how Kubernetes decides whether to do a docker pull when starting a container in a Pod.

In the v1.Container specification there's an option called imagePullPolicy:

Image pull policy. One of Always, Never, IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.

Since we tagged our image as :v1, the default pull policy is IfNotPresent. The Kubelet already has a local copy of rosskukulinski/myapplication:v1, so it doesn't attempt to do a docker pull. When the new Pods come up, they're still using the old broken Docker image.

There are three ways to resolve this:

  1. Switch to using :latest (DO NOT DO THIS!)
  2. Specify imagePullPolicy: Always in your Deployment.
  3. Use unique tags (e.g. based on your source control commit id)

During development, or if I'm quickly prototyping something, I will specify imagePullPolicy: Always so that I can build & push container images with the same tag.
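In the Pod template, that looks something like the sketch below; the container name and image mirror the story above and are otherwise placeholders:

```yaml
# Sketch of option 2: force a pull on every container start.
spec:
  template:
    spec:
      containers:
        - name: myapplication
          image: rosskukulinski/myapplication:v1
          # Pull the image every time, even if a copy with this
          # tag already exists on the Node
          imagePullPolicy: Always
```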

However, in all of my production deployments I use unique tags based on the Git SHA-1 of the commit used to build that image. This makes it trivial to identify and check-out the source code that's running in production for any deployed application.


Summary

Phew! That's a lot of things to watch out for. By now, you should be a pro at debugging, identifying, and fixing failed Kubernetes Deployments.

In general, most of the common deployment failures can be debugged using these commands:

  • kubectl describe deployment/<deployname>
  • kubectl describe replicaset/<rsname>
  • kubectl get pods
  • kubectl describe pod/<podname>
  • kubectl logs <podname> --previous

In the quest to automate myself out of a job, I created a bash script that runs anytime a CI/CD deployment fails. Helpful Kubernetes information will show up in the Jenkins/CircleCI/etc build output so that developers can quickly find any obvious problems.
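If you want to build something similar, a minimal sketch looks like the following. It assumes kubectl is on the PATH and configured for the target cluster in your CI environment; the function name and output choices are mine:

```shell
#!/bin/sh
# k8s-debug.sh -- hypothetical CI helper that dumps useful state for a
# failing Deployment using the commands listed above. Assumes kubectl is
# installed and configured for the target cluster.
k8s_debug() {
    ns="$1"
    deploy="$2"
    # Deployment-level view: replica counts and scaling events
    kubectl -n "$ns" describe "deployment/$deploy"
    # Pod-level view: statuses, then per-Pod events and crash logs
    kubectl -n "$ns" get pods -o wide
    for pod in $(kubectl -n "$ns" get pods -o name); do
        kubectl -n "$ns" describe "$pod"
        # --previous shows logs from the last crashed container, if any
        kubectl -n "$ns" logs "$pod" --previous --tail=50 || true
    done
}
```

You'd invoke it with the namespace and Deployment name, e.g. k8s_debug fail gateway-quota, from the failure handler of your deploy step.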


I hope you have enjoyed these two posts!

How have you seen Kubernetes Deployments fail? Any other troubleshooting tips to share? Leave a comment!