Operational efficiency and system resilience are critical when running platforms at scale. Yet in Kubernetes, recovering from software crashes has long been a headache: you could not trigger a clean restart of a Pod's containers without recreating the entire Pod object, which wastes resources.
To address this, Restart All Containers on Container Exits graduated to beta and is enabled by default in Kubernetes v1.36. Developed in close collaboration with the CNCF community, this capability represents Google's commitment to investing in the success of foundation-led open source projects. By sharing best practices from running large distributed systems internally, we are helping build a more resilient and efficient ecosystem. Letting containers restart while keeping the Pod's runtime identity provides a built-in way to perform in-place Pod recovery, boosting application reliability and saving resource costs.
Historically, Kubernetes managed failures using Pod-level restart policies. While sufficient for simple services, these policies fall short for modern multi-container Pods, which often have complex dependencies. When a failure required a full environment reset, your only option was to delete and recreate the entire Pod.
Recreating Pods introduces significant control plane churn, adding latency and pressure on the etcd backend during large failures.
For large batch or AI/ML workloads, where thousands of Pods might fail simultaneously, this can lead to a "thundering herd" of scheduling requests, delaying recovery and wasting expensive GPU/TPU compute time.
Kubernetes v1.35 introduced the RestartAllContainers action, enabled by the RestartAllContainersOnContainerExits feature gate, which graduated to beta in v1.36 alongside its dependencies, ContainerRestartRules and NodeDeclaredFeatures. This lets a container's exit behavior trigger a fast, in-place restart of the entire Pod on its existing node.
The Kubelet halts all containers while keeping the Pod sandbox intact, preserving critical infrastructure. Volumes, including emptyDir and PVCs, remain fully mounted; their content is not cleared during restarts.
Once all containers have terminated, the Kubelet re-runs init containers (including sidecars, which are part of the init sequence) in order, guaranteeing a clean setup in a known-good environment.
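Because volume contents survive an in-place restart, a re-run init container sees whatever its previous run left behind. The following is a minimal Python sketch of an init step written to tolerate that. The directory names and the `*.lock` naming convention are hypothetical and not part of Kubernetes.

```python
from pathlib import Path


def prepare_workdir(root: Path) -> list[str]:
    """Idempotent setup for a scratch volume that may already hold data.

    Returns the names of stale lock files that were removed.
    """
    # exist_ok=True: a second run must not fail because an earlier run
    # already created these directories on the surviving volume.
    (root / "cache").mkdir(parents=True, exist_ok=True)
    (root / "checkpoints").mkdir(parents=True, exist_ok=True)

    # Lock files left behind by containers that were killed mid-run are
    # stale after a restart (hypothetical "*.lock" convention).
    removed = []
    for lock in sorted(root.glob("*.lock")):
        lock.unlink()
        removed.append(lock.name)
    return removed
```

Existing checkpoints are deliberately left in place, so the main application can resume from them after the restart.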
You can implement this under the container's restartPolicyRules field. Here is a quick example of how a watcher sidecar can trigger an in-place restart of the entire Pod by exiting with code 88:
Note: Image names and paths in the YAML below are for illustrative purposes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-worker-pod
spec:
  restartPolicy: Never
  initContainers:
  - name: setup-environment
    image: registry.k8s.io/ml-tools/setup-worker:v1.0
  - name: watcher-sidecar
    image: registry.k8s.io/ml-tools/watcher:v1.0
    restartPolicy: Always
    restartPolicyRules:
    - action: RestartAllContainers
      exitCodes:
        operator: In
        values: [88]
  containers:
  - name: main-application
    image: registry.k8s.io/ml-tools/training-app:v1.0
```
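To make the sidecar's side of this contract concrete, here is a minimal Python sketch of what a watcher might do. It assumes, purely for illustration, that the main application touches a heartbeat file on a shared volume; the heartbeat mechanism, the thresholds, and the function names are not part of the feature. The only fixed point is that the process exits with code 88, matching the rule in the manifest.

```python
import os
import time

# Must match `values: [88]` in the sidecar's restartPolicyRules.
RESTART_ALL_EXIT_CODE = 88


def heartbeat_is_stale(last_beat: float, now: float, max_age: float) -> bool:
    """True when the last heartbeat is older than max_age seconds."""
    return (now - last_beat) > max_age


def watch(heartbeat_path: str, max_age: float = 60.0, poll: float = 5.0) -> int:
    """Poll a heartbeat file; return the exit code the sidecar should use."""
    while True:
        try:
            last_beat = os.path.getmtime(heartbeat_path)
        except FileNotFoundError:
            # No heartbeat file yet: keep waiting instead of restarting.
            last_beat = time.time()
        if heartbeat_is_stale(last_beat, time.time(), max_age):
            return RESTART_ALL_EXIT_CODE
        time.sleep(poll)
```

A sidecar entrypoint would then end with something like `sys.exit(watch("/shared/heartbeat"))`, where the path is again a placeholder. Any other exit code falls through to the container's regular `restartPolicy`.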
For organizations running distributed workloads, RestartAllContainers provides serious operational advantages:
To support monitoring, Kubernetes v1.35 introduced the AllContainersRestarting Pod condition. It is set to True while a restart is in progress, signaling to SREs and autoscalers that recovery is underway and preventing false-positive alerts. Container restart counts also increment, so Prometheus can track recovery events.
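As a rough illustration of how tooling might consume this signal, the Python sketch below inspects a Pod object shaped like the output of `kubectl get pod -o json`. The sample Pod is hand-written for the example, and the helper names are hypothetical. Only the condition type comes from the text above, and `conditions`, `containerStatuses`, `initContainerStatuses`, and `restartCount` are standard Pod status fields.

```python
def is_restarting_in_place(pod: dict) -> bool:
    """True if the Pod reports AllContainersRestarting with status "True"."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "AllContainersRestarting" and c.get("status") == "True"
        for c in conditions
    )


def total_restarts(pod: dict) -> int:
    """Sum restartCount across init and regular container statuses."""
    status = pod.get("status", {})
    statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    return sum(s.get("restartCount", 0) for s in statuses)


# Hand-written sample, not captured from a real cluster.
sample_pod = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "False"},
            {"type": "AllContainersRestarting", "status": "True"},
        ],
        "initContainerStatuses": [{"name": "watcher-sidecar", "restartCount": 2}],
        "containerStatuses": [{"name": "main-application", "restartCount": 2}],
    }
}
```

An alerting rule could, for example, suppress a "Pod not Ready" page while `is_restarting_in_place(pod)` holds, and track `total_restarts(pod)` over time instead.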
To use in-place restarts successfully, shift your mental model to "persistent sandboxes" and follow three best practices:
Graceful termination (including preStop hooks) is not supported for in-place restarts. SIGKILL is almost immediate, so applications must handle sudden exits gracefully.
This beta capability is a major step toward fluid workload management and serves as a building block for advanced community features like JobSet in-place restarts (KEP-467).
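On the point that applications must tolerate a near-immediate SIGKILL: one common way to stay safe on a volume that persists across restarts is to write state atomically, so a kill mid-write never leaves a half-written file for the next run. The Python sketch below shows that pattern under stated assumptions. The file names and JSON format are hypothetical, and the technique is general crash-safe file handling, not something the feature prescribes.

```python
import json
import os
import tempfile


def save_checkpoint(directory: str, state: dict, name: str = "checkpoint.json") -> str:
    """Write state so a reader sees either the old file or the new one."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push data to disk before the rename
        final_path = os.path.join(directory, name)
        os.replace(tmp_path, final_path)  # atomic within one filesystem
        return final_path
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(directory: str, name: str = "checkpoint.json"):
    """Return the last complete checkpoint, or None if none exists."""
    path = os.path.join(directory, name)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

If the process is killed before `os.replace` runs, only a stray `.tmp-*` file remains, and `load_checkpoint` still returns the previous complete state.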
Our work on KEP-5532 reflects our commitment to transparent open source governance. Developed collaboratively within SIG Node, this feature shows how we hold ourselves to high citizenship standards: making our design, goals, and intentions transparent while building shared best practices that benefit everyone. We encourage you to experiment with this feature in Kubernetes v1.36, where it is enabled by default, and share your feedback with the community!