feat(gateway): add /v1/score endpoint for cross-encoder reranker models (#1032)
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
📝 Walkthrough

Adds end-to-end support for a new vLLM scoring endpoint (/v1/score): proto messages and RPC, client helpers, servicer implementation, protocol types, router/pipeline stages for gRPC and HTTP, routing/context/metrics plumbing, and response conversion.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP_Router as HTTP Router
    participant GrpcRouter as GrpcRouter
    participant Pipeline as RequestPipeline
    participant GrpcClient as GrpcClient (vLLM)
    participant Servicer as vLLM Servicer
    participant Engine as vLLM Engine
    Client->>HTTP_Router: POST /v1/score (ScoreRequest)
    HTTP_Router->>GrpcRouter: route_score()
    GrpcRouter->>Pipeline: execute_score(request, headers, model_id)
    Pipeline->>GrpcClient: build_score_request(...) / score(proto::ScoreRequest)
    GrpcClient->>Servicer: Score RPC
    Servicer->>Engine: encode(text_1, text_2[*]) with PoolingParams(task="classify")
    Engine-->>Servicer: outputs (per-item)
    Servicer-->>GrpcClient: ScoreResponse (results + token counts)
    GrpcClient-->>Pipeline: proto::ScoreResponse
    Pipeline->>GrpcRouter: OpenAI-style ScoreResponse (JSON)
    GrpcRouter-->>HTTP_Router: HTTP 200 JSON
    HTTP_Router-->>Client: 200 OK
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request implements support for the vLLM /v1/score endpoint to enable cross-encoder reranking, including updates to gRPC proto definitions, model capability flags, and the addition of a dedicated scoring pipeline in the gateway. Feedback highlights missing fields in the protobuf definition for truncation settings and the use of a hardcoded request ID in the pipeline. Additionally, improvements are suggested for parallelizing document processing in the servicer and correcting misleading documentation regarding the gRPC pipeline implementation.
```proto
message ScoreRequest {
  string request_id = 1;
  string text_1 = 2;
  repeated string text_2 = 3;
}
```
The ScoreRequest message is missing the truncate_prompt_tokens field, which is present in the ScoreRequest protocol definition in crates/protocols/src/rerank.rs. Without this field in the proto, the gateway cannot pass truncation settings to the vLLM worker.
```diff
 message ScoreRequest {
   string request_id = 1;
   string text_1 = 2;
   repeated string text_2 = 3;
+  optional uint32 truncate_prompt_tokens = 4;
 }
```
References
- For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.
```rust
pub fn build_score_request(
    &self,
    request_id: String,
    text_1: String,
    text_2: Vec<String>,
) -> proto::ScoreRequest {
    proto::ScoreRequest {
        request_id,
        text_1,
        text_2,
    }
}
```
Update build_score_request to include the truncate_prompt_tokens parameter to match the updated proto and protocol definitions.
```diff
 pub fn build_score_request(
     &self,
     request_id: String,
     text_1: String,
     text_2: Vec<String>,
+    truncate_prompt_tokens: Option<u32>,
 ) -> proto::ScoreRequest {
     proto::ScoreRequest {
         request_id,
         text_1,
         text_2,
+        truncate_prompt_tokens,
     }
 }
```
References
- For builder methods that construct data structures mapping directly to a wire format, it is acceptable to have many arguments if they correspond one-to-one with the wire-format fields.
```python
results = []
total_prompt_tokens = 0

for i, text_2_item in enumerate(request.text_2):
```
```rust
let stages: Vec<Box<dyn PipelineStage>> = vec![
    Box::new(super::regular::stages::score::ScorePreparationStage),
    Box::new(WorkerSelectionStage::new(
        worker_registry,
        policy_registry,
        WorkerSelectionMode::Regular, // Score is always single-worker
    )),
    Box::new(ClientAcquisitionStage),
    Box::new(super::regular::stages::score::ScoreNativeStage::new()),
];
```
The new_score pipeline is missing the DispatchMetadataStage. This stage is essential for generating a unique request ID and populating dispatch metadata, which should be used by the ScoreNativeStage instead of a hardcoded ID.
```diff
 let stages: Vec<Box<dyn PipelineStage>> = vec![
     Box::new(super::regular::stages::score::ScorePreparationStage),
     Box::new(WorkerSelectionStage::new(
         worker_registry,
         policy_registry,
         WorkerSelectionMode::Regular, // Score is always single-worker
     )),
     Box::new(ClientAcquisitionStage),
+    Box::new(DispatchMetadataStage),
     Box::new(super::regular::stages::score::ScoreNativeStage::new()),
 ];
```
```rust
/// (`/v1/score`). The `ScoreHttpForwardStage` always returns
/// `Ok(Some(response))`, so the loop below will return the proxied response
/// before reaching the `final_response` check at the bottom.
```
The documentation comment for execute_score is incorrect. It describes HTTP forwarding and a non-existent ScoreHttpForwardStage, while the actual implementation uses a gRPC pipeline with ScoreNativeStage.
```rust
/// Execute the complete pipeline for a Score API request.
///
/// Score requests are processed via the gRPC pipeline using the `ScoreNativeStage`.
/// This stage executes the native Score RPC via the client and returns
/// `Ok(Some(response))` to short-circuit the pipeline.
```
```python
# The output data for score is a relevance score wrapped in a tensor.
# vLLM versions return different structures — normalize to scalar.
data = final_output.outputs.data
logger.info("Score data type=%s repr=%s", type(data).__name__, repr(data)[:200])
```
🟡 Nit: This `logger.info` fires for every `text_2` item in every score request, logging `repr(data)`. In production with batched requests this will be extremely noisy. Should be `logger.debug` — the INFO log on line 272 already records the request arrival.
```python
while isinstance(raw, (list, tuple)) and len(raw) > 0:
    raw = raw[0]
score_value = float(raw)
```
🟡 Nit: If the model returns an empty list/tuple (e.g. []), the while loop exits without unwrapping and float(raw) is called on the empty list, raising an unhelpful TypeError. Consider adding a guard:
```python
if isinstance(raw, (list, tuple)):
    # exhausted without finding a scalar
    msg = f"Score request {request_id}_{i} returned empty data: {repr(data)[:200]}"
    logger.warning(msg)
    await context.abort(grpc.StatusCode.INTERNAL, msg)
score_value = float(raw)
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dc36dc55ca
```rust
return Err(error::internal_error(
    "score_backend_unsupported",
    "Score is only supported on vLLM backend",
));
```
Return 4xx instead of 500 for unsupported /v1/score backends
If /v1/score is called while SMG is running in gRPC mode with a non-vLLM backend (for example, SGLang/TRT-LLM), this branch converts the backend-capability mismatch into internal_error, which surfaces as HTTP 500. Since the endpoint is registered unconditionally, this is a predictable user-input/path mismatch rather than a server fault, and returning 500 will mislead clients and inflate server-error metrics.
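The status-code argument above can be sketched with a toy mapping. This is a hypothetical illustration, not SMG's actual error helpers: the function name and string backends are invented for the example.

```python
def status_for_score_error(backend: str, model_found: bool) -> int:
    """Toy status mapping for /v1/score failures (illustrative only).

    A missing model or a backend that simply lacks the capability is a
    predictable client/path mismatch (4xx); only genuine server faults
    should surface as 500.
    """
    if not model_found:
        return 404  # unknown model
    if backend != "vllm":
        return 400  # capability mismatch: client routed to wrong backend
    return 500      # real server-side failure

# Non-vLLM backend: client-side error, not a server fault.
print(status_for_score_error("sglang", True))   # 400
print(status_for_score_error("vllm", False))    # 404
```

Keeping capability mismatches out of the 5xx range also keeps server-error metrics meaningful for alerting.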
```rust
let proto_req =
    match client.build_score_request(request_id, score_req.text_1.clone(), text_2) {
        Ok(req) => req,
```
Preserve /v1/score truncation options in gRPC requests
The gRPC score path only forwards request_id, text_1, and text_2, so ScoreRequest options like truncate_prompt_tokens (and encoding_format) are silently dropped in gRPC mode. This creates behavior drift from HTTP passthrough mode and can cause long text-pair requests that depend on truncation to fail or behave differently when routed through the gRPC pipeline.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/protocols/src/model_type.rs`:
- Around line 37-39: The JSON schema enum for ModelType is missing the "score"
variant even though ModelType includes the SCORE flag and CAPABILITY_NAMES/serde
accept "score"; update the ModelType JsonSchema implementation to add "score" to
the enum list (and mirror this change wherever the manual enum list is
duplicated, e.g., the other JsonSchema/enum generation block referenced at the
second location). Locate the ModelType definition (including the const SCORE)
and the impl JsonSchema for ModelType and add "score" to the returned enum
values so schema-based validation and generated clients accept the score
capability.
In `@crates/protocols/src/rerank.rs`:
- Around line 233-252: Add validation for ScoreRequest similar to RerankRequest
by implementing a validation method (e.g., impl ScoreRequest::validate or
implementing the same Validate trait used by RerankRequest) that checks that
text_1 is not empty and that text_2 contains at least one non-empty document
(handle both String and Vec variants of StringOrVec). Return a suitable error
type on failure and call this validation where other request types are validated
so protocol-layer errors are consistent.
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 287-341: The loop over request.text_2 in servicer.py currently
calls self.engine.encode per-item (see the for i, text_2_item in
enumerate(request.text_2) loop and use of pooling_params/request_id), which
serializes scoring and limits throughput; either implement batching by
aggregating prompts (build a list of TokensPrompt objects and call
self.engine.encode once to let the backend parallelize) or, if sequential
processing is intentional, add a clear TODO comment above the loop referencing
request.text_2, self.engine.encode and pooling_params that notes this is
single-item encoding and that batching should be considered in a follow-up for
high-throughput workloads.
- Around line 331-333: The unwrapping loop around the variable raw (while
isinstance(raw, (list, tuple)) and len(raw) > 0: raw = raw[0]) can leave raw as
an empty list/tuple and then float(raw) will raise; modify the unwrapping logic
in servicer.py around the raw handling to guard against empty containers by
checking after the loop whether raw is still a list/tuple (or empty) and either
raise a clear ValueError/TypeError with context (including the problematic raw)
or provide a sensible default, ensuring you only call float(raw) when raw is a
scalar value; reference the raw unwrapping code block and the float(raw) call
when applying the fix.
In `@model_gateway/src/routers/grpc/pipeline.rs`:
- Around line 1128-1129: Update the incorrect comment that references
ScoreHttpForwardStage: change it to refer to ScoreNativeStage so the comment
matches the implementation of the score pipeline (e.g., the comment near the
Ok(Some(response)) branch should explain that ScoreNativeStage returns Ok(Some)
to short-circuit and record success).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 057c93a3-855c-446b-989c-1c27d129717c
📒 Files selected for processing (20)
- crates/grpc_client/proto/vllm_engine.proto
- crates/grpc_client/python/pyproject.toml
- crates/grpc_client/src/vllm_engine.rs
- crates/protocols/src/model_type.rs
- crates/protocols/src/rerank.rs
- grpc_servicer/smg_grpc_servicer/vllm/servicer.py
- model_gateway/src/observability/metrics.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/common/stages/dispatch_metadata.rs
- model_gateway/src/routers/grpc/context.rs
- model_gateway/src/routers/grpc/harmony/stages/request_building.rs
- model_gateway/src/routers/grpc/harmony/stages/response_processing.rs
- model_gateway/src/routers/grpc/pipeline.rs
- model_gateway/src/routers/grpc/regular/stages/mod.rs
- model_gateway/src/routers/grpc/regular/stages/score/mod.rs
- model_gateway/src/routers/grpc/router.rs
- model_gateway/src/routers/http/router.rs
- model_gateway/src/routers/mod.rs
- model_gateway/src/routers/router_manager.rs
- model_gateway/src/server.rs
```rust
/// Score/cross-encoder reranker models (vLLM /v1/score)
const SCORE = 1 << 12;
```
Add "score" to ModelType JSON schema enum values.
CAPABILITY_NAMES/serde now accept "score", but the manual JsonSchema enum list still omits it. This can cause schema-based validation or generated clients to reject valid configs.
🛠️ Proposed fix
```diff
@@
         enum_values: Some(vec![
             "chat".into(),
             "completions".into(),
             "responses".into(),
             "embeddings".into(),
             "rerank".into(),
             "generate".into(),
             "vision".into(),
             "tools".into(),
             "reasoning".into(),
             "image_gen".into(),
             "audio".into(),
             "moderation".into(),
+            "score".into(),
         ]),
```

Also applies to: 87-87
```rust
#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
pub struct ScoreRequest {
    /// The model to use for scoring
    pub model: String,

    /// The query/source text (single string)
    pub text_1: String,

    /// The document(s) to score against the query.
    /// Can be a single string or a list of strings.
    pub text_2: StringOrVec,

    /// Optional encoding format for the response
    #[serde(skip_serializing_if = "Option::is_none")]
    pub encoding_format: Option<String>,

    /// Whether to truncate the input
    #[serde(skip_serializing_if = "Option::is_none")]
    pub truncate_prompt_tokens: Option<u32>,
}
```
🧹 Nitpick | 🔵 Trivial
Consider adding validation for ScoreRequest.

Unlike RerankRequest, which has validation for a non-empty query and documents, ScoreRequest lacks validation. Consider adding:

- Non-empty `text_1` validation
- Non-empty `text_2` validation (at least one document to score)

This would provide consistent error handling at the protocol layer rather than at the backend.
♻️ Example validation addition
```diff
+use validator::Validate;
+
 #[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
+#[derive(Validate)]
 pub struct ScoreRequest {
     /// The model to use for scoring
     pub model: String,
     /// The query/source text (single string)
+    #[validate(custom(function = "validate_text_1"))]
     pub text_1: String,
     /// The document(s) to score against the query.
     /// Can be a single string or a list of strings.
+    #[validate(custom(function = "validate_text_2"))]
     pub text_2: StringOrVec,
     // ... rest unchanged
 }
+
+fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
+    if text.trim().is_empty() {
+        return Err(validator::ValidationError::new("text_1 cannot be empty"));
+    }
+    Ok(())
+}
+
+fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
+    if text_2.is_empty() {
+        return Err(validator::ValidationError::new("text_2 cannot be empty"));
+    }
+    Ok(())
+}
```
```python
while isinstance(raw, (list, tuple)) and len(raw) > 0:
    raw = raw[0]
score_value = float(raw)
```
Edge case: an empty nested structure causes a confusing TypeError.
The while loop unwraps nested lists/tuples, but if the innermost non-empty container holds an empty container, raw[0] will be assigned that empty container, then the next iteration's len(raw) > 0 check will fail and float(raw) will be called on an empty list/tuple, raising a TypeError.
While this edge case is unlikely given vLLM's output format, consider adding a guard:
🛡️ Proposed defensive fix
```diff
 while isinstance(raw, (list, tuple)) and len(raw) > 0:
     raw = raw[0]
+if isinstance(raw, (list, tuple)):
+    raise ValueError(f"Could not extract scalar score from nested structure: {data!r}")
 score_value = float(raw)
```
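The guarded unwrapping can be shown as a small self-contained helper. `unwrap_score` is a hypothetical name (the servicer inlines this logic); the guard turns the opaque `float([])` TypeError into a clear error.

```python
def unwrap_score(data):
    """Unwrap arbitrarily nested lists/tuples down to a scalar score.

    Mirrors the servicer's while-loop, plus the guard suggested above:
    if unwrapping bottoms out on an empty container, raise a clear
    ValueError instead of letting float([]) produce an opaque TypeError.
    """
    raw = data
    while isinstance(raw, (list, tuple)) and len(raw) > 0:
        raw = raw[0]
    if isinstance(raw, (list, tuple)):
        # exhausted the structure without finding a scalar
        raise ValueError(f"could not extract scalar score from: {data!r}")
    return float(raw)

assert unwrap_score([[0.87]]) == 0.87   # nested list unwrapped
assert unwrap_score(0.5) == 0.5        # already scalar
```

Calling `unwrap_score([[]])` raises `ValueError` with the offending structure in the message, which is far easier to debug from logs than a bare TypeError.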
```rust
Ok(Some(response)) => {
    // ScoreHttpForwardStage returns Ok(Some) to short-circuit — record success
```
Comment references wrong stage name.
The comment mentions ScoreHttpForwardStage, but the score pipeline uses ScoreNativeStage. This appears to be a copy-paste artifact.
📝 Suggested fix
```diff
 Ok(Some(response)) => {
-    // ScoreHttpForwardStage returns Ok(Some) to short-circuit — record success
+    // ScoreNativeStage returns Ok(Some) to short-circuit — record success
     Metrics::record_router_duration(
```
```rust
/// Create a Score API pipeline for vLLM `/v1/score` endpoint.
///
/// The `/v1/score` endpoint was added to the vLLM gRPC specs. This pipeline uses
/// three stages:
```
There was a problem hiding this comment.
🟡 Nit: Doc says "three stages" but the pipeline actually has four — ScorePreparationStage is in the code (line 333) but missing from the numbered list. Should be:
1. `ScorePreparationStage` — extracts `text_1` for worker routing
2. `WorkerSelectionStage` — picks the target worker
3. `ClientAcquisitionStage` — obtains the gRPC client
4. `ScoreNativeStage` — executes the Score RPC
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: da212d0f9e
```rust
ctx.state.preparation = Some(PreparationOutput {
    original_text: Some(original_text),
    token_ids: Vec::new(), // Scoring worker routing doesn't strictly need accurate token lengths
```
Populate score token_ids before worker selection
ScorePreparationStage always sets token_ids to an empty vector, so WorkerSelectionStage forwards tokens: None for every /v1/score request. In deployments using the prefix_hash policy, select_worker_impl returns None when tokens are missing, which makes the score path fail with model_not_found even though healthy workers exist. This is a hard routing failure for any model configured with token-based policies.
```rust
    policy_registry,
    WorkerSelectionMode::Regular, // Score is always single-worker
)),
Box::new(ClientAcquisitionStage),
Box::new(super::regular::stages::score::ScoreNativeStage::new()),
```
Preserve worker load/circuit accounting in score pipeline
The new score pipeline skips RequestExecutionStage and calls client.score(...) directly, which bypasses the normal WorkerLoadGuard lifecycle and worker outcome recording done in request_execution.rs. As a result, score traffic is invisible to load-based balancing and does not feed circuit-breaker outcomes, so under real score load the router can keep selecting overloaded or failing workers instead of adapting like other gRPC endpoints.
Actionable comments posted: 2
♻️ Duplicate comments (1)
crates/protocols/src/rerank.rs (1)
233-252: ⚠️ Potential issue | 🟠 Major

Add protocol-layer validation for `ScoreRequest` fields.

At line 233, `ScoreRequest` is public but not validated, so empty `text_1` and empty/blank `text_2` inputs can pass protocol parsing and fail later in backend-specific ways. Please reject these early, similar to `RerankRequest`.

♻️ Proposed fix

```diff
-#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
+#[derive(Debug, Clone, Deserialize, Serialize, Validate, schemars::JsonSchema)]
+#[validate(schema(function = "validate_score_request"))]
 pub struct ScoreRequest {
     /// The model to use for scoring
     pub model: String,
     /// The query/source text (single string)
+    #[validate(custom(function = "validate_text_1"))]
     pub text_1: String,
     /// The document(s) to score against the query.
     /// Can be a single string or a list of strings.
+    #[validate(custom(function = "validate_text_2"))]
     pub text_2: StringOrVec,
@@
 }
+
+fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
+    if text.trim().is_empty() {
+        return Err(validator::ValidationError::new("text_1 cannot be empty"));
+    }
+    Ok(())
+}
+
+fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
+    match text_2 {
+        StringOrVec::Single(s) if s.trim().is_empty() => {
+            Err(validator::ValidationError::new("text_2 cannot be empty"))
+        }
+        StringOrVec::Array(v) if v.is_empty() || v.iter().any(|s| s.trim().is_empty()) => {
+            Err(validator::ValidationError::new("text_2 contains empty entries"))
+        }
+        _ => Ok(()),
+    }
+}
+
+#[expect(
+    clippy::unnecessary_wraps,
+    reason = "validator crate requires Result return type"
+)]
+fn validate_score_request(_req: &ScoreRequest) -> Result<(), validator::ValidationError> {
+    Ok(())
+}
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 31-36: The code currently calls mm_inputs(...) unconditionally
which raises a 'NoneType' object is not callable when the vllm multimodal
constructor failed to import; update the handling so that before entering the
preprocessed multimodal path (i.e., when has_preprocessed_mm is true or inside
_build_preprocessed_mm_inputs) you check that mm_inputs is not None and
VllmMultiModalInputs is available, and if not raise a clear
UnsupportedOperation/RuntimeError with a message like "multimodal preprocessing
not supported in this vllm version" (or fail fast during initialization if you
prefer); ensure checks reference mm_inputs, VllmMultiModalInputs,
_build_preprocessed_mm_inputs and has_preprocessed_mm so callers hit a clear
error instead of a NoneType call.
- Around line 300-306: The code is passing the raw token_type_ids list into
pooling_params.extra_kwargs, but vLLM expects a compressed integer index named
"compressed_token_type_ids"; update the branch in servicer.py that handles
encoded["token_type_ids"] to import and call
compress_token_type_ids(token_type_ids) and set pair_pooling_params.extra_kwargs
= {"compressed_token_type_ids": compressed_index} (keeping the rest of
pooling_params cloning logic and names like pooling_params, pair_pooling_params,
and encoded unchanged).
---
Duplicate comments:
In `@crates/protocols/src/rerank.rs`:
- Around line 233-252: ScoreRequest is not validated at protocol layer allowing
empty text_1 or empty/blank text_2 to pass; add a validation routine (implement
a Validate-like method or trait) for the ScoreRequest struct that checks: model
is present if required, text_1 is non-empty/non-whitespace, and text_2
(StringOrVec) contains at least one non-empty/non-whitespace string (if it's a
Vec ensure no empty entries; if it's a single String ensure it's non-blank);
wire this validator into the same protocol parsing path used by RerankRequest so
invalid requests are rejected early with a clear error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9216d459-c62e-44a5-bb30-dd1be6b11677
📒 Files selected for processing (3)
- crates/grpc_client/python/smg_grpc_proto/proto
- crates/protocols/src/rerank.rs
- grpc_servicer/smg_grpc_servicer/vllm/servicer.py
```python
# Pass token_type_ids via pooling_params if present
pair_pooling_params = pooling_params
if "token_type_ids" in encoded:
    pair_pooling_params = pooling_params.clone()
    pair_pooling_params.extra_kwargs = {
        "token_type_ids": encoded["token_type_ids"],
    }
```
🧩 Analysis chain
🏁 Script executed (repository: lightseekorg/smg):

```shell
find . -type f -name "*.py" | grep -E "(servicer|vllm)" | head -20
```

🏁 Script executed:

```shell
cat -n grpc_servicer/smg_grpc_servicer/vllm/servicer.py | sed -n '290,320p'
```

🏁 Script executed:

```shell
rg "compress_token_type_ids" --type py
```

🏁 Script executed:

```shell
rg "token_type_ids" grpc_servicer/smg_grpc_servicer/vllm/ --type py -B 3 -A 3
```

🏁 Script executed:

```shell
rg "pooling_params" grpc_servicer/smg_grpc_servicer/vllm/ --type py -B 2 -A 2
```

🏁 Script executed:

```shell
rg "compressed_token_type_ids\|compress_token_type_ids" --type py
```
🌐 Web query:
vLLM pooling_params extra_kwargs token_type_ids compressed_token_type_ids
💡 Result:
In vLLM's pooling API (used for tasks like embeddings, classification, scoring with models like cross-encoders), PoolingParams is a configuration class that includes an 'extra_kwargs' field: dict[str, Any] | None = None. This field stores additional parameters for internal use. token_type_ids is an optional field in TokensPrompt (vllm.inputs), a list[int] used for cross-encoder models to distinguish query and document tokens (typically [0]*query_len + [1]*doc_len). When token_type_ids is provided in the engine prompt (e.g., during LLM.score for cross-encoder scoring), vLLM processes it as follows: - Pops "token_type_ids" from the prompt. - Clones the default PoolingParams. - Computes compressed_token_type_ids = compress_token_type_ids(token_type_ids), where compress_token_type_ids returns the index of the first 1 in the list (position where document tokens start, assuming format of zeros followed by ones) or the list length if no 1s. - Sets pooling_params.extra_kwargs = {"compressed_token_type_ids": compressed} This compressed value (an int) is passed via extra_kwargs to the pooling layer (e.g., for token-wise pooling in cross-encoder score computation), likely to efficiently identify the boundary without storing the full list. Sources: vLLM GitHub (pooling_params.py, llm.py, score_utils.py), docs (inputs, score_utils, pooling score serving).
Citations:
- 1: https://github.com/vllm-project/vllm/blob/main/vllm/pooling_params.py
- 2: https://docs.vllm.ai/en/stable/api/vllm/entrypoints/pooling/score/serving/
- 3: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py
- 4: https://docs.vllm.ai/en/stable/api/vllm/inputs/
- 5: https://docs.vllm.ai/en/v0.13.0/api/vllm/entrypoints/score_utils/
🌐 Web query:
vLLM score task token_type_ids handling official documentation
💡 Result:
vLLM supports handling of token_type_ids primarily for cross-encoder models used in score tasks. TokensPrompt includes an optional token_type_ids field: "A list of token type IDs to pass to the cross encoder model." This allows users to provide pre-tokenized inputs with segment IDs (e.g., 0 for query, 1 for document in BERT-like models). In the LLM class (offline inference), when token_type_ids are present in the prompt, they are extracted, compressed via compress_token_type_ids (likely run-length encoding for efficiency), and passed as extra_kwargs={"compressed_token_type_ids": compressed} to PoolingParams for the score/classify task. For score tasks (cross-encoders), users typically provide text pairs (query, document), which vLLM tokenizes internally using get_score_prompt, applying score templates. These may generate token_type_ids automatically during tokenization if the tokenizer supports it (e.g., BERT tokenizer with text_pair). However, users can bypass internal tokenization by providing TokensPrompt with explicit token_type_ids. There was a bug in AsyncLLMEngine (online serving) where token_type_ids from TokensPrompt were lost for score tasks (v0.8.5.post1, issue #17743, closed Aug 2025), requiring a patch to InputPreprocessor to forward them. This suggests support exists but had an async-specific issue, likely fixed post-closure. Score task uses pooling runner (PoolingParams with task="score", deprecated in recent versions favoring classify for cross-encoders). No explicit PoolingParams field for token_type_ids; handled internally via extra_kwargs. Overall, token_type_ids handling is integrated for cross-encoder score/classify tasks via TokensPrompt and passed to model executor, with compression for efficiency. Primarily useful when providing pre-tokenized inputs; internal text tokenization may handle automatically for supported models.
Citations:
- 1: [Bug]: token_type_ids lost from prompt input during asynchronous request processing vllm-project/vllm#17743
- 2: https://docs.vllm.ai/en/stable/api/vllm/entrypoints/llm/
- 3: https://docs.vllm.ai/en/v0.7.1/api/offline_inference/llm_inputs.html
- 4: https://docs.vllm.ai/en/v0.18.2/models/pooling_models/scoring/
- 5: https://github.com/vllm-project/vllm/blob/main/vllm/pooling_params.py
- 6: https://docs.vllm.ai/en/v0.6.2/_modules/vllm/inputs/data.html
- 7: https://docs.vllm.ai/en/v0.7.0/_modules/vllm/inputs/data.html
🌐 Web query:
vLLM InputPreprocessor _process_tokens token_type_ids
💡 Result:
In vLLM's InputPreprocessor class (vllm.inputs.preprocess.py), the _process_tokens method processes a TokensPrompt object. It extracts prompt_token_ids from parsed_content["prompt_token_ids"] and optionally retrieves token_type_ids using parsed_content.get("token_type_ids"). If multi_modal_data is present in parsed_content, it calls _process_multimodal (or async equivalent) with prompt_token_ids and other data. Otherwise, it constructs a TokenInputs object passing both prompt_token_ids and token_type_ids. token_type_ids supports models like cross-encoders (e.g., for query-document scoring) that require distinguishing token types, similar to BERT-style token_type_ids (0 for first sequence, 1 for second). There was a bug (GitHub issue #17743, fixed by Aug 2025) where token_type_ids were lost in async processing, but it's resolved in recent versions. Example code snippet from docs: def _process_tokens(self, parsed_content: TokensPrompt, ...): prompt_token_ids = parsed_content["prompt_token_ids"] token_type_ids = parsed_content.get("token_type_ids") ... else: inputs = token_inputs( prompt_token_ids=prompt_token_ids, token_type_ids=token_type_ids, ) TokensPrompt schema: prompt_token_ids: list[int], token_type_ids: list[int] (optional).
Citations:
- 1: https://docs.vllm.ai/en/v0.9.0/api/vllm/inputs/preprocess.html
- 2: [Bug]: token_type_ids lost from prompt input during asynchronous request processing vllm-project/vllm#17743
- 3: https://vllm.website.cncfstack.com/api/vllm/inputs/preprocess/
- 4: https://docs.vllm.ai/en/v0.10.1/api/vllm/inputs/preprocess.html
- 5: https://docs.vllm.ai/en/v0.9.0/api/vllm/inputs/
Pass compressed token_type_ids via extra_kwargs, not raw list.
vLLM's cross-encoder score/classify path expects pooling_params.extra_kwargs["compressed_token_type_ids"] as an integer (the position of the first document token), not the raw token_type_ids list. vLLM computes this by calling compress_token_type_ids(token_type_ids), which returns the index of the first 1 in the list. Passing the raw list here will cause incorrect score computation. (vLLM pooling_params.py, vLLM score_utils)
🐛 Proposed fix

```diff
 if "token_type_ids" in encoded:
     pair_pooling_params = pooling_params.clone()
     pair_pooling_params.extra_kwargs = {
-        "token_type_ids": encoded["token_type_ids"],
+        "compressed_token_type_ids": compress_token_type_ids(
+            encoded["token_type_ids"]
+        ),
     }
```

Import `compress_token_type_ids` from vLLM's score utilities.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 300 - 306, The
code is passing the raw token_type_ids list into pooling_params.extra_kwargs,
but vLLM expects a compressed integer index named "compressed_token_type_ids";
update the branch in servicer.py that handles encoded["token_type_ids"] to
import and call compress_token_type_ids(token_type_ids) and set
pair_pooling_params.extra_kwargs = {"compressed_token_type_ids":
compressed_index} (keeping the rest of pooling_params cloning logic and names
like pooling_params, pair_pooling_params, and encoded unchanged).
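Per the vLLM sources queried above, the compressed value is simply the index where document tokens begin. A self-contained re-implementation of that semantics, for reference only; in `servicer.py` the helper should be imported from vLLM rather than redefined:

```python
# Reference re-implementation of the semantics described above: return the
# index of the first 1 in token_type_ids (where document tokens start), or the
# list length when the list contains no 1s.
from typing import List

def compress_token_type_ids(token_type_ids: List[int]) -> int:
    try:
        return token_type_ids.index(1)
    except ValueError:
        return len(token_type_ids)

# A query of 3 tokens followed by a document of 2 tokens:
print(compress_token_type_ids([0, 0, 0, 1, 1]))  # → 3
```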
Hi @ppraneth, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```shell
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 327-332: The ScoreResult construction unnecessarily wraps
score_value in float() even though score_value is already a float; in the
results.append call that creates vllm_engine_pb2.ScoreResult (the block
referencing results.append and vllm_engine_pb2.ScoreResult), remove the
redundant float() conversion and pass score_value directly as the score argument
(i.e., use score=score_value).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 417aa3d4-c4d9-4c0e-a613-40cf6f9ffcb4
📒 Files selected for processing (1)
grpc_servicer/smg_grpc_servicer/vllm/servicer.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 288797a7e6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```rust
(ModelType::IMAGE_GEN, "image_gen"),
(ModelType::AUDIO, "audio"),
(ModelType::MODERATION, "moderation"),
(ModelType::SCORE, "score"),
```
Keep ModelType schema in sync with new score capability
Adding ModelType::SCORE to CAPABILITY_NAMES makes runtime serialization emit "score", but the manual JsonSchema enum list in the same file still omits "score". That creates a schema/runtime mismatch where generated OpenAPI/JSON-schema validation can reject payloads that the code itself now produces for score-capable models.
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/protocols/src/rerank.rs`:
- Around line 245-251: The truncate_prompt_tokens field is declared on the
public Rerank/score request but is ignored on the native gRPC path; update the
native plumbing to carry it or remove it until supported. Either (preferred) add
truncate_prompt_tokens to the native request/message and propagate it in
ScoreNativeStage (and any ScoreNativeRequest/ScoreNativeResponse structs, gRPC
proto/messages, and the native transport marshal/unmarshal code) so the native
path forwards and honors the truncation flag exactly like the HTTP path, or
remove truncate_prompt_tokens from the public struct in rerank.rs so the
contract is consistent; make sure to update any tests and code that build/parse
the native score request to reference the truncate_prompt_tokens symbol and not
silently drop it.
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 314-316: The per-result score dump currently uses logger.info with
an expensive repr(data) and runs for every final_output.outputs.data (see the
logger.info line and raw = data), which floods INFO on large rerank requests;
change that log to logger.debug and avoid building the repr unless debug is
enabled (e.g. check logger.isEnabledFor(logging.DEBUG) before computing
repr(data)), or remove the log entirely if not needed, leaving raw = data
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3c420d23-42af-47bb-a216-546f1a37e34d
📒 Files selected for processing (4)
- crates/grpc_client/proto/vllm_engine.proto
- crates/protocols/src/rerank.rs
- grpc_servicer/smg_grpc_servicer/vllm/servicer.py
- model_gateway/src/routers/grpc/regular/stages/score/mod.rs
```rust
/// Optional encoding format for the response
#[serde(skip_serializing_if = "Option::is_none")]
pub encoding_format: Option<String>,

/// Whether to truncate the input
#[serde(skip_serializing_if = "Option::is_none")]
pub truncate_prompt_tokens: Option<u32>,
```
truncate_prompt_tokens is exposed here but ignored on the native path.
This field is part of the public /v1/score contract now, but the native gRPC transport and ScoreNativeStage only forward request_id, text_1, and text_2. That means HTTP passthrough can honor truncation while native gRPC silently drops it, so the same request behaves differently between connection modes. Please either plumb it end-to-end or remove the field until it is supported.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/protocols/src/rerank.rs` around lines 245 - 251, The
truncate_prompt_tokens field is declared on the public Rerank/score request but
is ignored on the native gRPC path; update the native plumbing to carry it or
remove it until supported. Either (preferred) add truncate_prompt_tokens to the
native request/message and propagate it in ScoreNativeStage (and any
ScoreNativeRequest/ScoreNativeResponse structs, gRPC proto/messages, and the
native transport marshal/unmarshal code) so the native path forwards and honors
the truncation flag exactly like the HTTP path, or remove truncate_prompt_tokens
from the public struct in rerank.rs so the contract is consistent; make sure to
update any tests and code that build/parse the native score request to reference
the truncate_prompt_tokens symbol and not silently drop it.
```python
data = final_output.outputs.data
logger.info("Score data type=%s repr=%s", type(data).__name__, repr(data)[:200])
raw = data
```
Move the per-result score dump off INFO.
This runs once per text_2 item, so large rerank requests will emit one INFO log per candidate and serialize repr(data) on the hot path. Please downgrade it to DEBUG or remove it after bring-up.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 314 - 316, The
per-result score dump currently uses logger.info with an expensive repr(data)
and runs for every final_output.outputs.data (see the logger.info line and raw =
data), which floods INFO on large rerank requests; change that log to
logger.debug and avoid building the repr unless debug is enabled (e.g. check
logger.isEnabledFor(logging.DEBUG) before computing repr(data)), or remove the
log entirely if not needed, leaving raw = data unchanged.
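The suggested change amounts to a standard hot-path logging guard; a minimal sketch (the logger name is illustrative):

```python
import logging

logger = logging.getLogger("smg.score")

def log_score_data(data) -> None:
    # Downgraded to DEBUG, and the expensive repr() is only built when DEBUG
    # logging is actually enabled, so large rerank requests no longer emit one
    # INFO log (and one repr serialization) per candidate.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(
            "Score data type=%s repr=%s", type(data).__name__, repr(data)[:200]
        )
```

The `isEnabledFor` check makes the call essentially free at INFO level, since neither the format string nor `repr(data)` is evaluated.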
Description
Problem
SMG's router does not support the `/v1/score` endpoint, returning 404 in both gRPC and HTTP connection modes. This prevents reranker models (e.g., `BAAI/bge-reranker-v2-m3`, ModernBERT-based cross-encoders) from being served through SMG.

The vLLM worker correctly exposes `/v1/score` and responds with valid rerank results, but the SMG gateway does not route this endpoint.

Closes #1017
Solution
Add end-to-end `/v1/score` support across the gateway:

- `ScoreRequest`, `ScoreResponse`, and `ScoreResult` messages plus a `Score` RPC in `vllm_engine.proto`.
- `score()` and `build_score_request()` on the vLLM engine client.
- `ScoreNativeStage` in the gRPC regular pipeline, plus score-aware dispatch metadata, context plumbing, and HTTP→gRPC routing in the gateway server.
- `/v1/score` passthrough to the vLLM worker in HTTP connection mode.
- `Score` RPC in the vLLM servicer with proper cross-encoder text-pair tokenization (`tokenizer(text=text_1, text_pair=text_2)` with `[SEP]` tokens) and `PoolingParams(task="classify")`, mirroring vLLM's upstream `CrossEncoderIOProcessor`. Note: vLLM deprecated the `"score"` task in vllm-project/vllm#37537; cross-encoder rerankers now use `"classify"`.
- Extended `ModelType` detection to recognize scoring/reranker models, and added `ScoreRequest`/`ScoreResponse` protocol types.

Changes
- `crates/grpc_client/proto/vllm_engine.proto` — Add `Score` RPC and `ScoreRequest`/`ScoreResponse`/`ScoreResult` messages
- `crates/grpc_client/python/pyproject.toml` — Bump proto package version
- `crates/grpc_client/src/vllm_engine.rs` — Implement `score()` and `build_score_request()` on gRPC client
- `crates/protocols/src/model_type.rs` — Extend model type detection for scoring models
- `crates/protocols/src/rerank.rs` — Add `ScoreRequest`/`ScoreResponse` protocol types
- `model_gateway/src/observability/metrics.rs` — Add score endpoint metrics
- `model_gateway/src/routers/grpc/client.rs` — Wire score through gRPC client abstraction
- `model_gateway/src/routers/grpc/common/stages/dispatch_metadata.rs` — Score-aware dispatch metadata
- `model_gateway/src/routers/grpc/context.rs` — Add score request to gRPC context
- `model_gateway/src/routers/grpc/harmony/stages/request_building.rs` — Harmony pipeline score support
- `model_gateway/src/routers/grpc/harmony/stages/response_processing.rs` — Harmony pipeline score response
- `model_gateway/src/routers/grpc/pipeline.rs` — Register score stage in pipeline
- `model_gateway/src/routers/grpc/regular/stages/mod.rs` — Export score stage module
- `model_gateway/src/routers/grpc/regular/stages/score/mod.rs` — `ScoreNativeStage` implementation
- `model_gateway/src/routers/grpc/router.rs` — Route score requests in gRPC router
- `model_gateway/src/routers/http/router.rs` — Route `/v1/score` in HTTP passthrough mode
- `model_gateway/src/routers/mod.rs` — Add score to router trait
- `model_gateway/src/routers/router_manager.rs` — Wire score in router manager
- `model_gateway/src/server.rs` — Register `/v1/score` HTTP endpoint
- `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` — Implement `Score` RPC with cross-encoder text-pair tokenization

Test Plan
Model: `BAAI/bge-reranker-v2-m3`

gRPC mode (default) result:

```json
{
  "object": "list",
  "data": [
    {"object": "score", "score": 0.9942461848258972, "index": 0},
    {"object": "score", "score": 0.0004087462439201772, "index": 1}
  ],
  "model": "bge-reranker",
  "usage": {"prompt_tokens": 33, "completion_tokens": 0, "total_tokens": 33}
}
```

HTTP mode result:

```json
{
  "id": "score-9c1eefc27778320a",
  "object": "list",
  "created": 1775210958,
  "model": "bge-reranker",
  "data": [
    {"index": 0, "object": "score", "score": 0.994252622127533},
    {"index": 1, "object": "score", "score": 0.00040918969898484647}
  ],
  "usage": {"prompt_tokens": 33, "total_tokens": 33, "completion_tokens": 0, "prompt_tokens_details": null}
}
```

Both modes correctly rank "Paris is the capital" as highly relevant (~0.994) and "London is in the UK" as irrelevant (~0.0004). Scores match between gRPC and HTTP within floating-point precision.
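The response shape above can be consumed as shown below. The request body is an assumption consistent with the test plan (the PR does not show the exact `text_1`/`text_2` strings), and the response is the gRPC-mode result quoted above:

```python
import json

# Hypothetical /v1/score request body; text_1/text_2 are assumed inputs.
request = {
    "model": "bge-reranker",
    "text_1": "What is the capital of France?",
    "text_2": ["Paris is the capital of France.", "London is in the UK."],
}
# To send it (sketch): requests.post(f"{gateway_url}/v1/score", json=request)

# gRPC-mode response quoted in the test plan above.
response = json.loads("""
{ "object": "list",
  "data": [ {"object": "score", "score": 0.9942461848258972, "index": 0},
            {"object": "score", "score": 0.0004087462439201772, "index": 1} ],
  "model": "bge-reranker",
  "usage": {"prompt_tokens": 33, "completion_tokens": 0, "total_tokens": 33} }
""")

# Results keep the input candidate order; sort by score to use as a reranker.
ranked = sorted(response["data"], key=lambda d: d["score"], reverse=True)
print([d["index"] for d in ranked])  # → [0, 1]
```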
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Summary by CodeRabbit