Review comments proposal.md from API review #167

14 changes: 13 additions & 1 deletion docs/proposals/002-api-proposal/proposal.md
@@ -78,6 +78,7 @@ The API design is based on these axioms:
- This solution should be composable with other Gateway solutions and flexible to fit customer needs
- The MVP will heavily assume requests are done using the OpenAI spec, but remain open to extension in the future
- The Gateway should route in a way that does not generate a queue of requests at the model server level
/* REVIEW: Note that models are served in a way that leads to many models running at the same time, vs many existing kube networking assumptions that assume there is a user-facing service that the implementation changes in user-invisible ways over time */

The [PoC](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458) was focused on lower-level scheduling, and the API follows similar logic, which led to the proposal of the **InferencePool**.

@@ -129,6 +130,7 @@ type InferencePoolSpec struct {
// that should be included in the InferencePool. ModelServers should not
// be shared with any other Service or InferencePool; that behavior is not
// supported and will result in sub-optimal utilization.
/* REVIEW: This is generally true for all services, but since the ig needs to deal with monitoring, health checking, and changes to membership in the pool (i.e. a member removed from the selector or gracefully shutting down), I think it's more of a distraction than necessary. */
ModelServerSelector map[string]string `json:"modelServerSelector,omitempty"`
}
```
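For concreteness, the sketch below shows how a pool could be declared against this spec. It is illustrative only: it assumes an `InferencePool` wrapper type with the usual `ObjectMeta`/`TypeMeta` embedding (mirroring `InferenceModel` below), and the pool name and labels are placeholders rather than anything defined by the proposal.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// InferencePool is assumed to wrap InferencePoolSpec with the standard
// metadata embedding; the wrapper type itself is not shown in this hunk.
type InferencePool struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	Spec InferencePoolSpec
}

// InferencePoolSpec is repeated here so the sketch is self-contained.
type InferencePoolSpec struct {
	// Selects the model server Pods that belong exclusively to this pool.
	ModelServerSelector map[string]string `json:"modelServerSelector,omitempty"`
}

func main() {
	pool := InferencePool{
		ObjectMeta: metav1.ObjectMeta{Name: "llama-pool", Namespace: "default"},
		Spec: InferencePoolSpec{
			ModelServerSelector: map[string]string{"app": "vllm-llama"},
		},
	}
	fmt.Println(pool.Name, pool.Spec.ModelServerSelector)
}
```

The selector behaves like a Service selector over the model servers' Pod labels, with the additional expectation from the comment above that those Pods are not claimed by any other Service or InferencePool.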
@@ -142,11 +144,15 @@ type InferencePoolSpec struct {
// and rollout of new versions of those models, and defines the specific
// performance and latency goals for the model. These workloads are
// expected to operate within an InferencePool sharing compute capacity with other
/* REVIEW: "operate" sounds wierd */
// InferenceModels, defined by the Inference Platform Admin. We allow a user who
// has multiple InferenceModels across multiple pools (with the same config) to
// specify the configuration exactly once, and deploy to many pools
/* REVIEW: be more consistent about "to" or "on" or "in" */
// simultaneously. Enabling a simpler config and single source of truth
/* REVIEW: broken clause, this should be part of the previous sentence or more naturally integrated */
// for a given user. InferenceModel ModelNames are unique for a given InferencePool,
/* REVIEW: Missing clause? How is uniqueness enforced? */
type InferenceModel struct {
metav1.ObjectMeta
metav1.TypeMeta
@@ -157,15 +163,18 @@
type InferenceModelSpec struct {
// The name of the model as users set it in the "model" parameter of their requests.
// The name should be unique among the workloads that reference the same backend pool.
/* REVIEW: Describe what happens if the name isn't unique, and specify the behavior here */
// This is the parameter that will be used to match the request with. In the future, we may
// allow matching on other request parameters. The other approach to support matching
// on other request parameters is to use a different ModelName per HTTPFilter.
// Names can be reserved without implementing an actual model in the pool.
/* REVIEW: "without a model of that name already being configured in the pool"? "Implementing a model" doesn't sound right */
// This can be done by specifying a target model and setting the weight to zero;
// an error will be returned specifying that no valid target model is found.
ModelName string
// Optional
// Defines how important it is to serve the model compared to other models referencing the same pool.
/* REVIEW: define the behavior when multiple models have the same criticality in a sentence */
Criticality *Criticality
// Optional.
// Allow multiple versions of a model for traffic splitting.
@@ -177,6 +186,7 @@ type InferenceModelSpec struct {
}

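// Illustrative sketch, not part of the proposal: a spec that registers the
// model name "llama-2-7b-chat" at the highest criticality. The constant name
// Critical is assumed from the enum below, and fields not shown in this hunk
// (such as the target-model list) are omitted.
//
//	critical := Critical
//	example := InferenceModelSpec{
//		ModelName:   "llama-2-7b-chat",
//		Criticality: &critical,
//	}
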
// Defines how important it is to serve the model compared to other models.
/* REVIEW: Describe why there are only three levels and suggest that we don't expect to extend this enumeration, probably within the field name above. Open enumerations can be problematic, but a novice user reading this could reasonably ask "why not just have a priority number?" and so we should answer that. */
type Criticality string
const (
// Most important. Requests to this band will be shed last.
@@ -200,7 +210,9 @@ type TargetModel struct {
Name string
// Weight is used to determine the percentage of traffic that should be
// sent to this target model when multiple versions of the model are specified.
/* REVIEW: What's the closest analog to weight in the existing API? */
Weight int
/* REVIEW: As discussed in review session, make this *int with no default because that is most flexible to future extensibility */
}

// LocalObjectReference identifies an API object within the namespace of the
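
To ground the `Weight` discussion above, the toy selector below picks a target model in proportion to its weight, written against the reviewer's `*int` suggestion. It is a sketch only: the nil-weight handling and the model names are assumptions, not behavior defined by the proposal. The proportional-split semantics follow the spirit of `weight` on Gateway API HTTPRoute backendRefs, which is probably the closest existing analog.

```go
package main

import (
	"fmt"
	"math/rand"
)

// TargetModel mirrors the proposal's type for illustration, with Weight as
// *int per the review suggestion.
type TargetModel struct {
	Name   string
	Weight *int
}

func ptr(i int) *int { return &i }

// pickTarget chooses a target in proportion to its weight. A nil weight is
// skipped here purely for the sketch; the proposal does not define a default.
func pickTarget(targets []TargetModel, r *rand.Rand) string {
	total := 0
	for _, t := range targets {
		if t.Weight != nil {
			total += *t.Weight
		}
	}
	if total <= 0 {
		return ""
	}
	n := r.Intn(total)
	for _, t := range targets {
		if t.Weight == nil {
			continue
		}
		if n < *t.Weight {
			return t.Name
		}
		n -= *t.Weight
	}
	return ""
}

func main() {
	r := rand.New(rand.NewSource(1))
	targets := []TargetModel{
		{Name: "llama-2-7b-chat-v1", Weight: ptr(90)},
		{Name: "llama-2-7b-chat-v2", Weight: ptr(10)},
	}
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pickTarget(targets, r)]++
	}
	fmt.Println(counts) // roughly a 90/10 split
}
```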