Kubernetes v1.33: Job's Backoff Limit Per Index Goes GA
In Kubernetes v1.33, the Backoff Limit Per Index feature reaches general availability (GA). This blog post describes the feature and its benefits.
About Backoff Limit Per Index
When you run workloads on Kubernetes, you must consider scenarios where Pod failures can affect the completion of your workloads. Ideally, your workload should tolerate transient failures and continue running.
To achieve failure tolerance in a Kubernetes Job, you can set the spec.backoffLimit field. This field specifies the total number of tolerated failures.
However, for workloads where every index is considered independent, such as embarrassingly parallel workloads, the spec.backoffLimit field is often not flexible enough.
For example, you may choose to run multiple suites of integration tests by
representing each suite as an index within an Indexed Job.
In that setup, a fast-failing index (test suite) is likely to consume your
entire budget for tolerating Pod failures, and you might not be able to run the
other indexes.
To address this limitation, we introduced Backoff Limit Per Index, which allows you to control the number of retries per index.
How Backoff Limit Per Index works
To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated Pod failures per index with the spec.backoffLimitPerIndex field. When you set this field, the Job executes all indexes by default.
Additionally, to fine-tune the error handling:
- Specify the cap on the total number of failed indexes by setting the spec.maxFailedIndexes field. When the limit is exceeded, the entire Job is terminated.
- Define a short-circuit to detect a failed index by using the FailIndex action in the Pod Failure Policy feature.
When the number of tolerated failures for an index is exceeded, the Job marks that index as failed and lists it in the Job's status.failedIndexes field.
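As a minimal sketch of the fields described above, the following manifest shows a complete Indexed Job that uses spec.backoffLimitPerIndex and spec.maxFailedIndexes. The Job name, container name, image, and command are assumptions made for illustration only. The script reads the JOB_COMPLETION_INDEX environment variable that Kubernetes sets for Pods of Indexed Jobs.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-per-index-demo   # hypothetical name
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed        # the per-index limit applies to Indexed Jobs
  backoffLimitPerIndex: 1        # each index is retried once; a second failure marks it failed
  maxFailedIndexes: 5            # the Job is terminated once more than 5 indexes have failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker             # hypothetical container name
        image: busybox:1.36      # assumed image; any image with a POSIX shell should work
        command:
        - sh
        - -c
        - |
          # Even indexes fail on every attempt; odd indexes succeed.
          if [ $((JOB_COMPLETION_INDEX % 2)) -eq 0 ]; then exit 1; fi
          echo "index $JOB_COMPLETION_INDEX succeeded"
```

With this sketch, you would expect the even indexes to end up in status.failedIndexes and the odd indexes in status.completedIndexes. Because only five indexes fail, the spec.maxFailedIndexes cap is not exceeded, so every index still gets its chance to run.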
Example
The following Job spec snippet is an example of how to combine Backoff Limit Per Index with the Pod Failure Policy feature:
completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailIndex
    onExitCodes:
      operator: In
      values: [ 42 ]
In this example, the Job handles Pod failures as follows:
- Ignores any failed Pods that have the built-in disruption condition, called DisruptionTarget. These Pods don't count towards Job backoff limits.
- Fails the index corresponding to the failed Pod if any of the failed Pod's containers finished with the exit code 42, based on the matching FailIndex rule.
- Retries the first failure of any index, unless the index failed due to the matching FailIndex rule.
- Fails the entire Job if the number of failed indexes exceeds 5 (set by the spec.maxFailedIndexes field).
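For context, the snippet above is only part of a Job spec. The following is one possible way to embed it in a complete manifest, not a definitive one. The Job name, container name, image, and command are hypothetical. restartPolicy is set to Never because Pod failure policy requires it.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-failure-policy-demo   # hypothetical name
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1
  maxFailedIndexes: 5
  podFailurePolicy:
    rules:
    - action: Ignore                     # disrupted Pods don't count towards backoff limits
      onPodConditions:
      - type: DisruptionTarget
    - action: FailIndex                  # exit code 42 fails the index without a retry
      onExitCodes:
        operator: In
        values: [ 42 ]
  template:
    spec:
      restartPolicy: Never               # required when using podFailurePolicy
      containers:
      - name: main                       # hypothetical container name
        image: busybox:1.36              # assumed image; any image with a POSIX shell should work
        command:
        - sh
        - -c
        - |
          # Illustrative only: index 3 reports a non-retriable error via exit code 42.
          if [ "$JOB_COMPLETION_INDEX" -eq 3 ]; then exit 42; fi
          echo "index $JOB_COMPLETION_INDEX done"
```

In this sketch, index 3 would be marked as failed on its first attempt because exit code 42 matches the FailIndex rule, while the remaining indexes run to completion.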
Learn more
- Read the blog post on the closely related Pod Failure Policy feature, "Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA"
- For a hands-on guide to using Pod failure policy, including the use of FailIndex, see Handling retriable and non-retriable pod failures with Pod failure policy
- Read the documentation for Backoff limit per index and Pod failure policy
- Read the KEP for Backoff Limits Per Index For Indexed Jobs
Get involved
This work was sponsored by the batch working group in close collaboration with the SIG Apps community.
If you are interested in working on new features in this space, we recommend subscribing to our Slack channel and attending the regular community meetings.