@@ -65,7 +65,7 @@ If none of those approvers are still appropriate, then changes to that list
 should be approved by the remaining approvers and/or the owning SIG (or
 SIG Architecture for cross-cutting KEPs).
 -->
-# KEP-NNNN: Your short, descriptive title
+# KEP-5322: DRA: Handle permanent driver allocation failures
 
 <!--
 This is the title of your KEP. Keep it short, simple, and descriptive. A good
@@ -90,9 +90,9 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
-  - [User Stories (Optional)](#user-stories-optional)
-    - [Story 1](#story-1)
-    - [Story 2](#story-2)
+  - [User Stories](#user-stories)
+    - [Efficiency](#efficiency)
+    - [Visibility of Errors](#visibility-of-errors)
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
@@ -173,6 +173,18 @@ useful for a wide audience.
 A good summary is probably at least a paragraph in length.
 -->
 
+For Dynamic Resource Allocation (DRA), the kubelet communicates over gRPC with a
+separate driver component that is responsible for attaching devices to
+containers based on the scheduler's allocation decisions. When a driver reports
+a failure during that process, the kubelet tries again later.
+This strategy lets the kubelet overcome transient driver failures, but it is
+wasteful when the same unchanged inputs are certain to produce the same error.
+
+This KEP proposes additions to the gRPC interface between the kubelet and DRA
+drivers that let drivers report permanent failures, and updates to the
+kubelet so that it stops retrying calls to the DRA driver when such a failure
+is reported.
+
 ## Motivation
 
 <!--
@@ -184,13 +196,33 @@ demonstrate the interest in a KEP within the wider Kubernetes community.
 [experience reports]: https://github.com/golang/go/wiki/ExperienceReports
 -->
 
+Several failure modes of the DRA driver's `NodePrepareResources` gRPC method
+cannot be resolved by retrying with the same input, which is how the kubelet
+currently handles all failures:
+
+- The opaque `config` associated with a request in a ResourceClaim may be
+  invalid
+- The driver may have just discovered that a device allocated by the scheduler
+  is unusable
+
+Pods with unfulfillable DRA allocations stay stuck in a pending state that
+reports no error until someone manually identifies the cause of the
+lack of progress and ultimately deletes and recreates the Pod. In the meantime,
+the kubelet wastes time retrying an operation that is known to fail the
+same way it did previously. Surfacing the permanent nature of the error as
+soon as possible provides the quickest path to remediation.
+
 ### Goals
 
 <!--
 List the specific goals of the KEP. What is it trying to achieve? How will we
 know that this has succeeded?
 -->
 
+- Minimize the amount of unnecessary work done by the kubelet and DRA drivers.
+- Enable workloads to reschedule Pods that are stuck in a permanent failure
+  state more quickly.
+
 ### Non-Goals
 
 <!--
@@ -209,7 +241,7 @@ The "Design Details" section below is for the real
 nitty-gritty.
 -->
 
-### User Stories (Optional)
+### User Stories
 
 <!--
 Detail the things that people will be able to do if this KEP is implemented.
@@ -218,9 +250,17 @@ the system. The goal here is to make this feel real for users without getting
 bogged down.
 -->
 
-#### Story 1
+#### Efficiency
+
+As a cluster administrator, I want to minimize the amount of unnecessary work
+done by critical components like the kubelet and DRA drivers to maximize their
+availability for more important work.
+
+#### Visibility of Errors
 
-#### Story 2
+As a workload administrator, I want to ensure that my workloads are able to
+start up as quickly and reliably as possible by proactively rescheduling Pods
+when their DRA allocations cannot be fulfilled.
 
 ### Notes/Constraints/Caveats (Optional)
 