Init Containers
This feature exited beta in 1.6. Init Containers can be specified in the PodSpec alongside the app `containers` array. The beta annotation value is still respected and overrides the PodSpec field value; however, the annotations are deprecated in 1.6 and 1.7. In 1.8, the annotations are no longer supported and must be converted to the PodSpec field.
This page provides an overview of Init Containers, which are specialized Containers that run before app Containers and can contain utilities or setup scripts not present in an app image.
- Understanding Init Containers
- What can Init Containers be used for?
- Detailed behavior
- Support and compatibility
Understanding Init Containers
A Pod can have multiple Containers running apps within it, but it can also have one or more Init Containers, which are run before the app Containers are started.
Init Containers are exactly like regular Containers, except:
- They always run to completion.
- Each one must complete successfully before the next one is started.
If an Init Container fails for a Pod, Kubernetes restarts the Pod repeatedly until the Init Container succeeds. However, if the Pod has a `restartPolicy` of Never, it is not restarted.

To specify a Container as an Init Container, add the `initContainers` field on the PodSpec as a JSON array of objects of type v1.Container, alongside the app `containers` array. The status of the init containers is returned in the `status.initContainerStatuses` field as an array of container statuses (similar to the `status.containerStatuses` field).
Differences from regular Containers
Init Containers support all the fields and features of app Containers, including resource limits, volumes, and security settings. However, the resource requests and limits for an Init Container are handled slightly differently, as documented in Resources below. Also, Init Containers do not support readiness probes, because they must run to completion before the Pod can be ready.
If multiple Init Containers are specified for a Pod, those Containers are run one at a time in sequential order.
Each must succeed before the next can run.
When all of the Init Containers have run to completion, Kubernetes initializes the Pod and runs the application Containers as usual.
What can Init Containers be used for?
Because Init Containers have separate images from app Containers, they have some advantages for start-up related code:
- They can contain and run utilities that are not desirable to include in the app Container image for security reasons.
- They can contain utilities or custom code for setup that is not present in an app image. For example, there is no need to make an image `FROM` another image just to use a tool like `sed`, `awk`, `python`, or `dig` during setup.
- The application image builder and deployer roles can work independently, without the need to jointly build a single app image.
- They use Linux namespaces so that they have different filesystem views from app Containers. Consequently, they can be given access to Secrets that app Containers are not able to access.
- They run to completion before any app Containers start, whereas app Containers run in parallel, so Init Containers provide an easy way to block or delay the startup of app Containers until some set of preconditions are met.
Examples
Here are some ideas for how to use Init Containers:
- Wait for a service to be created with a shell command like:

  ```shell
  for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1
  ```

- Register this Pod with a remote server from the downward API with a command like:

  ```shell
  curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(<POD_NAME>)&ip=$(<POD_IP>)'
  ```

- Wait for some time before starting the app Container with a command like `sleep 60`.
- Clone a git repository into a volume.
- Place values into a configuration file and run a template tool to dynamically generate a configuration file for the main app Container. For example, place the POD_IP value in a configuration and generate the main app configuration file using Jinja.
More detailed usage examples can be found in the StatefulSets documentation and the Production Pods guide.
Init Containers in use
The following yaml file for Kubernetes 1.5 outlines a simple Pod which has two Init Containers. The first waits for `myservice`, and the second waits for `mydb`. Once both containers complete, the Pod will begin.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
  annotations:
    pod.beta.kubernetes.io/init-containers: '[
        {
            "name": "init-myservice",
            "image": "busybox",
            "command": ["sh", "-c", "until nslookup myservice; do echo waiting for myservice; sleep 2; done;"]
        },
        {
            "name": "init-mydb",
            "image": "busybox",
            "command": ["sh", "-c", "until nslookup mydb; do echo waiting for mydb; sleep 2; done;"]
        }
    ]'
spec:
  containers:
  - name: myapp-container
    image: busybox
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
```
There is a new syntax in Kubernetes 1.6, although the old annotation syntax still works for 1.6 and 1.7. The new syntax must be used for 1.8 or greater. We have moved the declaration of init containers to `spec`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: busybox
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox
    command: ['sh', '-c', 'until nslookup myservice; do echo waiting for myservice; sleep 2; done;']
  - name: init-mydb
    image: busybox
    command: ['sh', '-c', 'until nslookup mydb; do echo waiting for mydb; sleep 2; done;']
```
The 1.5 syntax still works on 1.6, but we recommend using the 1.6 syntax. In Kubernetes 1.6, Init Containers were made a field in the API. The beta annotation is still respected in 1.6 and 1.7, but is not supported in 1.8 or greater.
The yaml file below outlines the `mydb` and `myservice` services:
```yaml
kind: Service
apiVersion: v1
metadata:
  name: myservice
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
---
kind: Service
apiVersion: v1
metadata:
  name: mydb
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9377
```
This Pod can be started and debugged with the following commands:
```shell
$ kubectl create -f myapp.yaml
pod "myapp-pod" created
$ kubectl get -f myapp.yaml
NAME        READY     STATUS     RESTARTS   AGE
myapp-pod   0/1       Init:0/2   0          6m
$ kubectl describe -f myapp.yaml
Name:          myapp-pod
Namespace:     default
[...]
Labels:        app=myapp
Status:        Pending
[...]
Init Containers:
  init-myservice:
[...]
    State:         Running
[...]
  init-mydb:
[...]
    State:         Waiting
      Reason:      PodInitializing
    Ready:         False
[...]
Containers:
  myapp-container:
[...]
    State:         Waiting
      Reason:      PodInitializing
    Ready:         False
[...]
Events:
  FirstSeen    LastSeen    Count    From                      SubObjectPath                           Type      Reason     Message
  ---------    --------    -----    ----                      -------------                           --------  ------     -------
  16s          16s         1        {default-scheduler }                                              Normal    Scheduled  Successfully assigned myapp-pod to 172.17.4.201
  16s          16s         1        {kubelet 172.17.4.201}    spec.initContainers{init-myservice}     Normal    Pulling    pulling image "busybox"
  13s          13s         1        {kubelet 172.17.4.201}    spec.initContainers{init-myservice}     Normal    Pulled     Successfully pulled image "busybox"
  13s          13s         1        {kubelet 172.17.4.201}    spec.initContainers{init-myservice}     Normal    Created    Created container with docker id 5ced34a04634; Security:[seccomp=unconfined]
  13s          13s         1        {kubelet 172.17.4.201}    spec.initContainers{init-myservice}     Normal    Started    Started container with docker id 5ced34a04634
$ kubectl logs myapp-pod -c init-myservice # Inspect the first init container
$ kubectl logs myapp-pod -c init-mydb      # Inspect the second init container
```
Once we start the `mydb` and `myservice` services, we can see the Init Containers complete and the `myapp-pod` move into the Running state:
```shell
$ kubectl create -f services.yaml
service "myservice" created
service "mydb" created
$ kubectl get -f myapp.yaml
NAME        READY     STATUS    RESTARTS   AGE
myapp-pod   1/1       Running   0          9m
```
This example is very simple but should provide some inspiration for you to create your own Init Containers.
Detailed behavior
During the startup of a Pod, the Init Containers are started in order, after the network and volumes are initialized. Each Container must exit successfully before the next is started. If a Container fails to start due to the runtime or exits with failure, it is retried according to the Pod `restartPolicy`. However, if the Pod `restartPolicy` is set to Always, the Init Containers use RestartPolicy OnFailure.
A Pod cannot be `Ready` until all Init Containers have succeeded. The ports on an Init Container are not aggregated under a service. A Pod that is initializing is in the `Pending` state but should have a condition `Initializing` set to true.

If the Pod is restarted, all Init Containers must execute again.

Changes to the Init Container spec are limited to the container image field. Altering an Init Container image field is equivalent to restarting the Pod.

Because Init Containers can be restarted, retried, or re-executed, Init Container code should be idempotent. In particular, code that writes to files on `EmptyDirs` should be prepared for the possibility that an output file already exists.
Init Containers have all of the fields of an app Container. However, Kubernetes prohibits `readinessProbe` from being used because Init Containers cannot define readiness distinct from completion. This is enforced during validation.

Use `activeDeadlineSeconds` on the Pod and `livenessProbe` on the Container to prevent Init Containers from failing forever. The active deadline includes Init Containers.
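For example, a minimal sketch of bounding initialization time with `activeDeadlineSeconds` (the 60-second deadline is an illustrative value, not a recommendation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  # The active deadline counts from Pod start and includes the Init
  # Containers, so a stuck init step cannot block forever. Note that it
  # also bounds the total runtime of the app Containers.
  activeDeadlineSeconds: 60
  initContainers:
  - name: init-myservice
    image: busybox
    command: ['sh', '-c', 'until nslookup myservice; do sleep 2; done;']
  containers:
  - name: myapp-container
    image: busybox
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
```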
The name of each app and Init Container in a Pod must be unique; a validation error is thrown for any Container sharing a name with another.
Resources
Given the ordering and execution for Init Containers, the following rules for resource usage apply:

- The highest of any particular resource request or limit defined on all Init Containers is the effective init request/limit.
- The Pod's effective request/limit for a resource is the higher of:
  - the sum of all app Containers' requests/limits for a resource
  - the effective init request/limit for a resource
- Scheduling is done based on effective requests/limits, which means Init Containers can reserve resources for initialization that are not used during the life of the Pod.
- The Pod's effective QoS tier is the QoS tier for Init Containers and app containers alike.
Quota and limits are applied based on the effective Pod request and limit.
Pod level cgroups are based on the effective Pod request and limit, the same as the scheduler.
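As a worked sketch with made-up numbers, consider a Pod whose single Init Container requests 500m of CPU while its two app Containers request 100m and 300m:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-example
spec:
  initContainers:
  - name: init
    image: busybox
    command: ['sh', '-c', 'sleep 5']
    resources:
      requests:
        cpu: 500m   # effective init request = highest init request = 500m
  containers:
  - name: app-1
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
    resources:
      requests:
        cpu: 100m
  - name: app-2
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
    resources:
      requests:
        cpu: 300m
  # The app containers sum to 400m, so the Pod's effective CPU request is
  # max(500m, 400m) = 500m. That is what the scheduler reserves, even
  # though only 400m is requested once initialization finishes.
```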
Pod restart reasons
A Pod can restart, causing re-execution of Init Containers, for the following reasons:

- A user updates the PodSpec, causing the Init Container image to change. App Container image changes only restart the app Container.
- The Pod infrastructure container is restarted. This is uncommon and would have to be done by someone with root access to nodes.
- All containers in a Pod are terminated while `restartPolicy` is set to Always, forcing a restart, and the Init Container completion record has been lost due to garbage collection.
Support and compatibility
A cluster with Apiserver version 1.6.0 or greater supports Init Containers using the `spec.initContainers` field. Previous versions support Init Containers using the alpha or beta annotations. The `spec.initContainers` field is also mirrored into alpha and beta annotations so that Kubelets version 1.3.0 or greater can execute Init Containers, and so that a version 1.6 apiserver can safely be rolled back to version 1.5.x without losing Init Container functionality for existing created pods.

In Apiserver and Kubelet versions 1.8.0 or greater, support for the alpha and beta annotations is removed, requiring a conversion from the deprecated annotations to the `spec.initContainers` field.
Pod Preset
This page provides an overview of PodPresets, which are objects for injecting certain information into pods at creation time. The information can include secrets, volumes, volume mounts, and environment variables.
Understanding Pod Presets
A Pod Preset is an API resource for injecting additional runtime requirements into a Pod at creation time. You use label selectors to specify the Pods to which a given Pod Preset applies.
Using a Pod Preset allows pod template authors to not have to explicitly provide all information for every pod. This way, authors of pod templates consuming a specific service do not need to know all the details about that service.
For more information about the background, see the design proposal for PodPreset.
How It Works
Kubernetes provides an admission controller (`PodPreset`) which, when enabled, applies Pod Presets to incoming pod creation requests. When a pod creation request occurs, the system does the following:

1. Retrieve all `PodPresets` available for use.
2. Check if the label selectors of any `PodPreset` match the labels on the pod being created.
3. Attempt to merge the various resources defined by the `PodPreset` into the Pod being created.
4. On error, throw an event documenting the merge error on the pod, and create the pod without any injected resources from the `PodPreset`.
5. Annotate the resulting modified Pod spec to indicate that it has been modified by a `PodPreset`. The annotation is of the form `podpreset.admission.kubernetes.io/podpreset-<pod-preset name>: "<resource version>"`.
Each Pod can be matched by zero or more Pod Presets; and each `PodPreset` can be applied to zero or more pods. When a `PodPreset` is applied to one or more Pods, Kubernetes modifies the Pod Spec. For changes to `Env`, `EnvFrom`, and `VolumeMounts`, Kubernetes modifies the container spec for all containers in the Pod; for changes to `Volume`, Kubernetes modifies the Pod Spec.

Note: A Pod Preset is capable of modifying the `spec.containers` field in a Pod spec when appropriate. No resource definition from the Pod Preset will be applied to the `initContainers` field.
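For illustration, here is a hypothetical `PodPreset` (the name, selector, and environment variable are assumptions for this sketch) that injects an environment variable into all Pods labeled `role: frontend`:

```yaml
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: allow-database
spec:
  # Pods whose labels match this selector get the env var injected
  # into every container at creation time.
  selector:
    matchLabels:
      role: frontend
  env:
  - name: DB_PORT
    value: "6379"
```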
Disable Pod Preset for a Specific Pod
There may be instances where you wish for a Pod to not be altered by any Pod Preset mutations. In these cases, you can add an annotation in the Pod Spec of the form: `podpreset.admission.kubernetes.io/exclude: "true"`.
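For example, a minimal sketch (the pod name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-preset-pod
  annotations:
    # Opt this Pod out of all Pod Preset injection.
    podpreset.admission.kubernetes.io/exclude: "true"
spec:
  containers:
  - name: app
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
```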
Enable Pod Preset
In order to use Pod Presets in your cluster you must ensure the following:
- You have enabled the API type `settings.k8s.io/v1alpha1/podpreset`. For example, this can be done by including `settings.k8s.io/v1alpha1=true` in the `--runtime-config` option for the API server.
- You have enabled the admission controller `PodPreset`. One way of doing this is to include `PodPreset` in the `--admission-control` option value specified for the API server.
- You have defined your Pod Presets by creating `PodPreset` objects in the namespace you will use.
Disruptions
This guide is for application owners who want to build highly available applications, and thus need to understand what types of Disruptions can happen to Pods.
It is also for Cluster Administrators who want to perform automated cluster actions, like upgrading and autoscaling clusters.
- Voluntary and Involuntary Disruptions
- Dealing with Disruptions
- How Disruption Budgets Work
- PDB Example
- Separating Cluster Owner and Application Owner Roles
- How to perform Disruptive Actions on your Cluster
Voluntary and Involuntary Disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application. Examples are:
- a hardware failure of the physical machine backing the node
- cluster administrator deletes VM (instance) by mistake
- cloud provider or hypervisor failure makes VM disappear
- a kernel panic
- the node disappears from the cluster due to cluster network partition
- eviction of a pod due to the node being out-of-resources.
Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.
We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:
- deleting the deployment or other controller that manages the pod
- updating a deployment’s pod template causing a restart
- directly deleting a pod (e.g. by accident)
Cluster Administrator actions include:
- Draining a node for repair or upgrade.
- Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling ).
- Removing a pod from a node to permit something else to fit on that node.
These actions might be taken directly by the cluster administrator, or by automation run by the cluster administrator, or by your cluster hosting provider.
Ask your cluster administrator or consult your cloud provider or distribution documentation to determine if any sources of voluntary disruptions are enabled for your cluster. If none are enabled, you can skip creating Pod Disruption Budgets.
Dealing with Disruptions
Here are some ways to mitigate involuntary disruptions:
- Ensure your pod requests the resources it needs.
- Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
- For even higher availability when running replicated applications, spread applications across racks (using anti-affinity, sketched below) or across zones (if using a multi-zone cluster).
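Here is a hedged sketch of spreading replicas across zones with pod anti-affinity (the `app: myapp` label, the weight, and the busybox image are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  affinity:
    podAntiAffinity:
      # Prefer (but do not require) scheduling away from zones that
      # already run a pod with the same app label.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: myapp
          topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: myapp-container
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
```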
The frequency of voluntary disruptions varies.
On a basic Kubernetes cluster, there are no voluntary disruptions at all.
However, your cluster administrator or hosting provider may run some additional services which cause voluntary disruptions.
For example, rolling out node software updates can cause voluntary disruptions. Also, some implementations of cluster (node) autoscaling may cause voluntary disruptions to defragment and compact nodes.
Your cluster administrator or hosting provider should have documented what level of voluntary disruptions, if any, to expect.
Kubernetes offers features to help run highly available applications at the same time as frequent voluntary disruptions. We call this set of features Disruption Budgets.
How Disruption Budgets Work
An Application Owner can create a `PodDisruptionBudget` object (PDB) for each application. A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.

Cluster managers and hosting providers should use tools which respect Pod Disruption Budgets by calling the Eviction API instead of directly deleting pods. Examples are the `kubectl drain` command and the Kubernetes-on-GCE cluster upgrade script (`cluster/gce/upgrade.sh`).
When a cluster administrator wants to drain a node, they use the `kubectl drain` command. That tool tries to evict all the pods on the machine. The eviction request may be temporarily rejected, and the tool periodically retries all failed requests until all pods are terminated, or until a configurable timeout is reached.
A PDB specifies the number of replicas that an application can tolerate having, relative to how many it is intended to have. For example, a Deployment which has `spec.replicas: 5` is supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time, then the Eviction API will allow voluntary disruption of one, but not two pods, at a time.

The group of pods that comprise the application is specified using a label selector, the same as the one used by the application's controller (deployment, stateful-set, etc). The "intended" number of pods is computed from the `.spec.replicas` of the pod's controller. The controller is discovered from the pods using the `.metadata.ownerReferences` of the object.
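A minimal sketch of such a PDB for the 5-pod Deployment above (the name and the `app: myapp` label are illustrative assumptions):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  # Evictions are allowed only while at least 4 matching pods remain
  # available, so at most one pod can be voluntarily disrupted at a time.
  minAvailable: 4
  selector:
    matchLabels:
      app: myapp
```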
PDBs cannot prevent involuntary disruptions from occurring, but they do count against the budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but controllers (like deployment and stateful-set) are not limited by PDBs when doing rolling upgrades – the handling of failures during application updates is configured in the controller spec. (Learn about updating a deployment.)
When a pod is evicted using the eviction API, it is gracefully terminated (see `terminationGracePeriodSeconds` in PodSpec).
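For reference, an eviction is requested by POSTing an `Eviction` object to the pod's `eviction` subresource; a minimal sketch (the pod name and namespace are hypothetical):

```yaml
apiVersion: policy/v1beta1
kind: Eviction
metadata:
  name: pod-b        # the pod to evict
  namespace: default
```

This is effectively what `kubectl drain` does for each pod on the node being drained.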
PDB Example
Consider a cluster with 3 nodes, `node-1` through `node-3`. The cluster is running several applications. One of them has 3 replicas, initially called `pod-a`, `pod-b`, and `pod-c`. Another, unrelated pod without a PDB, called `pod-x`, is also shown. Initially, the pods are laid out as follows:

| node-1            | node-2            | node-3            |
|-------------------|-------------------|-------------------|
| pod-a *available* | pod-b *available* | pod-c *available* |
| pod-x *available* |                   |                   |
All 3 pods are part of a deployment, and they collectively have a PDB which requires at least 2 of the 3 pods to be available at all times.
For example, assume the cluster administrator wants to reboot into a new kernel version to fix a bug in the kernel. The cluster administrator first tries to drain `node-1` using the `kubectl drain` command. That tool tries to evict `pod-a` and `pod-x`. This succeeds immediately. Both pods go into the `terminating` state at the same time. This puts the cluster in this state:

| node-1 *draining*   | node-2            | node-3            |
|---------------------|-------------------|-------------------|
| pod-a *terminating* | pod-b *available* | pod-c *available* |
| pod-x *terminating* |                   |                   |
The deployment notices that one of the pods is terminating, so it creates a replacement called `pod-d`. Since `node-1` is cordoned, it lands on another node. Something has also created `pod-y` as a replacement for `pod-x`.
(Note: for a StatefulSet, `pod-a`, which would be called something like `pod-1`, would need to terminate completely before its replacement, which is also called `pod-1` but has a different UID, could be created. Otherwise, the example applies to a StatefulSet as well.)
Now the cluster is in this state:

| node-1 *draining*   | node-2            | node-3            |
|---------------------|-------------------|-------------------|
| pod-a *terminating* | pod-b *available* | pod-c *available* |
| pod-x *terminating* | pod-d *starting*  | pod-y             |
At some point, the pods terminate, and the cluster looks like this:

| node-1 *drained* | node-2            | node-3            |
|------------------|-------------------|-------------------|
|                  | pod-b *available* | pod-c *available* |
|                  | pod-d *starting*  | pod-y             |
At this point, if an impatient cluster administrator tries to drain `node-2` or `node-3`, the drain command will block, because there are only 2 available pods for the deployment, and its PDB requires at least 2. After some time passes, `pod-d` becomes available.
The cluster state now looks like this:

| node-1 *drained* | node-2            | node-3            |
|------------------|-------------------|-------------------|
|                  | pod-b *available* | pod-c *available* |
|                  | pod-d *available* | pod-y             |
Now, the cluster administrator tries to drain `node-2`. The drain command will try to evict the two pods in some order, say `pod-b` first and then `pod-d`. It will succeed at evicting `pod-b`. But, when it tries to evict `pod-d`, it will be refused because that would leave only one pod available for the deployment.
The deployment creates a replacement for `pod-b` called `pod-e`. However, there are not enough resources in the cluster to schedule `pod-e`. So, the drain will again block. The cluster may end up in this state:

| node-1 *drained* | node-2              | node-3            | *no node*       |
|------------------|---------------------|-------------------|-----------------|
|                  | pod-b *terminating* | pod-c *available* | pod-e *pending* |
|                  | pod-d *available*   | pod-y             |                 |
At this point, the cluster administrator needs to add a node back to the cluster to proceed with the upgrade.
You can see how Kubernetes varies the rate at which disruptions can happen, according to:
- how many replicas an application needs
- how long it takes to gracefully shutdown an instance
- how long it takes a new instance to start up
- the type of controller
- the cluster’s resource capacity
Separating Cluster Owner and Application Owner Roles
Often, it is useful to think of the Cluster Manager and Application Owner as separate roles with limited knowledge of each other. This separation of responsibilities may make sense in these scenarios:
- when there are many application teams sharing a Kubernetes cluster, and there is natural specialization of roles
- when third-party tools or services are used to automate cluster management
Pod Disruption Budgets support this separation of roles by providing an interface between the roles.
If you do not have such a separation of responsibilities in your organization, you may not need to use Pod Disruption Budgets.
How to perform Disruptive Actions on your Cluster
If you are a Cluster Administrator, and you need to perform a disruptive action on all the nodes in your cluster, such as a node or system software upgrade, here are some options:
- Accept downtime during the upgrade.
- Fail over to another complete replica cluster.
  - No downtime, but may be costly both for the duplicated nodes and for the human effort to orchestrate the switchover.
- Write disruption-tolerant applications and use PDBs.
  - No downtime.
  - Minimal resource duplication.
  - Allows more automation of cluster administration.
  - Writing disruption-tolerant applications is tricky, but the work to tolerate voluntary disruptions largely overlaps with work to support autoscaling and tolerating involuntary disruptions.