Commit 444c9d6

amend! Copy KEP template
KEP-5322: DRA: Handle permanent driver allocation failures
1 parent 513153e commit 444c9d6

2 files changed: +62 -39 lines changed

keps/sig-node/5322-dra-driver-permanent-allocation-failure/README.md

Lines changed: 47 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -65,7 +65,7 @@ If none of those approvers are still appropriate, then changes to that list
6565
should be approved by the remaining approvers and/or the owning SIG (or
6666
SIG Architecture for cross-cutting KEPs).
6767
-->
68-
# KEP-NNNN: Your short, descriptive title
68+
# KEP-5322: DRA: Handle permanent driver allocation failures
6969

7070
<!--
7171
This is the title of your KEP. Keep it short, simple, and descriptive. A good
@@ -90,9 +90,9 @@ tags, and then generate with `hack/update-toc.sh`.
9090
- [Goals](#goals)
9191
- [Non-Goals](#non-goals)
9292
- [Proposal](#proposal)
93-
- [User Stories (Optional)](#user-stories-optional)
94-
- [Story 1](#story-1)
95-
- [Story 2](#story-2)
93+
- [User Stories](#user-stories)
94+
- [Efficiency](#efficiency)
95+
- [Visibility of Errors](#visibility-of-errors)
9696
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
9797
- [Risks and Mitigations](#risks-and-mitigations)
9898
- [Design Details](#design-details)
@@ -173,6 +173,18 @@ useful for a wide audience.
173173
A good summary is probably at least a paragraph in length.
174174
-->
175175

176+
For Dynamic Resource Allocation (DRA), the kubelet interfaces via gRPC with a
177+
separate driver component that is responsible for attaching devices to
178+
containers based on scheduler outcomes. When drivers indicate to the kubelet
179+
that a failure occurred during that process, the kubelet will try again later.
180+
This strategy enables the kubelet to overcome transient failures in drivers, but
181+
is wasteful when the error is deterministic and the inputs have not changed.
182+
183+
This KEP proposes additions to the gRPC interface between the kubelet and DRA
184+
drivers to enable drivers to report permanent failures, and updates to the
185+
kubelet to respond to those errors by no longer retrying calls to the
186+
DRA driver.
187+
176188
## Motivation
177189

178190
<!--
@@ -184,13 +196,33 @@ demonstrate the interest in a KEP within the wider Kubernetes community.
184196
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
185197
-->
186198

199+
Several failure modes of the DRA driver's `NodePrepareResources` gRPC method
200+
cannot be resolved by retrying with the same input, which is how the kubelet
201+
currently handles all failures:
202+
203+
- The opaque `config` associated with a request in a ResourceClaim may be
204+
invalid
205+
- A device allocated by the scheduler may have just been found by the driver to
206+
be unusable
207+
208+
Pods with unfulfillable DRA allocations stay stuck in a pending state that
209+
reports no error until someone manually identifies the cause of the lack of
210+
progress and deletes and recreates the Pod. In the meantime, the kubelet wastes
211+
time retrying an operation that is known to fail the same way as it did
212+
previously. Surfacing the permanent nature of the error as soon as possible
213+
gives the quickest path to remediation.
214+
187215
### Goals
188216

189217
<!--
190218
List the specific goals of the KEP. What is it trying to achieve? How will we
191219
know that this has succeeded?
192220
-->
193221

222+
- Minimize the amount of unnecessary work done by the kubelet and DRA drivers.
223+
- Enable workloads to more responsively reschedule Pods in a permanent failure
224+
state.
225+
194226
### Non-Goals
195227

196228
<!--
@@ -209,7 +241,7 @@ The "Design Details" section below is for the real
209241
nitty-gritty.
210242
-->
211243

212-
### User Stories (Optional)
244+
### User Stories
213245

214246
<!--
215247
Detail the things that people will be able to do if this KEP is implemented.
@@ -218,9 +250,17 @@ the system. The goal here is to make this feel real for users without getting
218250
bogged down.
219251
-->
220252

221-
#### Story 1
253+
#### Efficiency
254+
255+
As a cluster administrator, I want to minimize the amount of unnecessary work
256+
done by critical components like the kubelet and DRA drivers to maximize their
257+
availability for more important work.
258+
259+
#### Visibility of Errors
222260

223-
#### Story 2
261+
As a workload administrator, I want to ensure that my workloads are able to
262+
start up as quickly and reliably as possible by proactively rescheduling Pods
263+
when their allocated DRA resources cannot be fulfilled.
224264

225265
### Notes/Constraints/Caveats (Optional)
226266

Lines changed: 15 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -1,51 +1,34 @@
1-
title: KEP Template
2-
kep-number: NNNN
1+
title: "DRA: Handle permanent driver allocation failures"
2+
kep-number: 5322
33
authors:
4-
- "@jane.doe"
5-
owning-sig: sig-xyz
6-
participating-sigs:
7-
- sig-aaa
8-
- sig-bbb
9-
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
10-
creation-date: yyyy-mm-dd
4+
- "@nojnhuh"
5+
owning-sig: sig-node
6+
participating-sigs: []
7+
status: provisional
8+
creation-date: 2025-09-19
119
reviewers:
1210
- TBD
13-
- "@alice.doe"
1411
approvers:
1512
- TBD
16-
- "@oscar.doe"
1713

1814
see-also:
19-
- "/keps/sig-aaa/1234-we-heard-you-like-keps"
20-
- "/keps/sig-bbb/2345-everyone-gets-a-kep"
21-
replaces:
22-
- "/keps/sig-ccc/3456-replaced-kep"
15+
- "/keps/sig-scheduling/5055-dra-device-taints-and-tolerations"
16+
replaces: []
2317

24-
# The target maturity stage in the current dev cycle for this KEP.
25-
# If the purpose of this KEP is to deprecate a user-visible feature
26-
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
27-
stage: alpha|beta|stable
18+
stage: alpha
2819

2920
# The most recent milestone for which work toward delivery of this KEP has been
3021
# done. This can be the current (upcoming) milestone, if it is being actively
3122
# worked on.
32-
latest-milestone: "v1.19"
23+
latest-milestone: "v1.35"
3324

34-
# The milestone at which this feature was, or is targeted to be, at each stage.
3525
milestone:
36-
alpha: "v1.19"
37-
beta: "v1.20"
38-
stable: "v1.22"
26+
alpha: "v1.35"
3927

40-
# The following PRR answers are required at alpha release
41-
# List the feature gate name and the components for which it must be enabled
4228
feature-gates:
43-
- name: MyFeature
29+
- name: DRAHandlePermanentDriverFailures
4430
components:
45-
- kube-apiserver
46-
- kube-controller-manager
31+
- kubelet
4732
disable-supported: true
4833

49-
# The following PRR answers are required at beta release
50-
metrics:
51-
- my_feature_metric
34+
metrics: []
