feat(gateway): add /v1/score endpoint for cross-encoder reranker models (#1032)
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
📝 Walkthrough

Adds end-to-end support for a new vLLM scoring endpoint (/v1/score): proto messages and RPC, client helpers, servicer implementation, protocol types, router/pipeline stages for gRPC and HTTP, routing/context/metrics plumbing, and response conversion.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP_Router as HTTP Router
    participant GrpcRouter as GrpcRouter
    participant Pipeline as RequestPipeline
    participant GrpcClient as GrpcClient (vLLM)
    participant Servicer as vLLM Servicer
    participant Engine as vLLM Engine
    Client->>HTTP_Router: POST /v1/score (ScoreRequest)
    HTTP_Router->>GrpcRouter: route_score()
    GrpcRouter->>Pipeline: execute_score(request, headers, model_id)
    Pipeline->>GrpcClient: build_score_request(...) / score(proto::ScoreRequest)
    GrpcClient->>Servicer: Score RPC
    Servicer->>Engine: encode(text_1, text_2[*]) with PoolingParams(task="classify")
    Engine-->>Servicer: outputs (per-item)
    Servicer-->>GrpcClient: ScoreResponse (results + token counts)
    GrpcClient-->>Pipeline: proto::ScoreResponse
    Pipeline->>GrpcRouter: OpenAI-style ScoreResponse (JSON)
    GrpcRouter-->>HTTP_Router: HTTP 200 JSON
    HTTP_Router-->>Client: 200 OK
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request implements support for the vLLM /v1/score endpoint to enable cross-encoder reranking, including updates to gRPC proto definitions, model capability flags, and the addition of a dedicated scoring pipeline in the gateway. Feedback highlights missing fields in the protobuf definition for truncation settings and the use of a hardcoded request ID in the pipeline. Additionally, improvements are suggested for parallelizing document processing in the servicer and correcting misleading documentation regarding the gRPC pipeline implementation.
```proto
message ScoreRequest {
  string request_id = 1;
  string text_1 = 2;
  repeated string text_2 = 3;
}
```
The ScoreRequest message is missing the truncate_prompt_tokens field, which is present in the ScoreRequest protocol definition in crates/protocols/src/rerank.rs. Without this field in the proto, the gateway cannot pass truncation settings to the vLLM worker.
```diff
 message ScoreRequest {
   string request_id = 1;
   string text_1 = 2;
   repeated string text_2 = 3;
+  optional uint32 truncate_prompt_tokens = 4;
 }
```
References
- For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.
```rust
pub fn build_score_request(
    &self,
    request_id: String,
    text_1: String,
    text_2: Vec<String>,
) -> proto::ScoreRequest {
    proto::ScoreRequest {
        request_id,
        text_1,
        text_2,
    }
}
```
Update build_score_request to include the truncate_prompt_tokens parameter to match the updated proto and protocol definitions.
```diff
 pub fn build_score_request(
     &self,
     request_id: String,
     text_1: String,
     text_2: Vec<String>,
+    truncate_prompt_tokens: Option<u32>,
 ) -> proto::ScoreRequest {
     proto::ScoreRequest {
         request_id,
         text_1,
         text_2,
+        truncate_prompt_tokens,
     }
 }
```
References
- For builder methods that construct data structures mapping directly to a wire format, it is acceptable to have many arguments if they correspond one-to-one with the wire-format fields.
```python
results = []
total_prompt_tokens = 0

for i, text_2_item in enumerate(request.text_2):
```
```rust
let stages: Vec<Box<dyn PipelineStage>> = vec![
    Box::new(super::regular::stages::score::ScorePreparationStage),
    Box::new(WorkerSelectionStage::new(
        worker_registry,
        policy_registry,
        WorkerSelectionMode::Regular, // Score is always single-worker
    )),
    Box::new(ClientAcquisitionStage),
    Box::new(super::regular::stages::score::ScoreNativeStage::new()),
];
```
The new_score pipeline is missing the DispatchMetadataStage. This stage is essential for generating a unique request ID and populating dispatch metadata, which should be used by the ScoreNativeStage instead of a hardcoded ID.
```diff
 let stages: Vec<Box<dyn PipelineStage>> = vec![
     Box::new(super::regular::stages::score::ScorePreparationStage),
     Box::new(WorkerSelectionStage::new(
         worker_registry,
         policy_registry,
         WorkerSelectionMode::Regular, // Score is always single-worker
     )),
     Box::new(ClientAcquisitionStage),
+    Box::new(DispatchMetadataStage),
     Box::new(super::regular::stages::score::ScoreNativeStage::new()),
 ];
```
```rust
/// (`/v1/score`). The `ScoreHttpForwardStage` always returns
/// `Ok(Some(response))`, so the loop below will return the proxied response
/// before reaching the `final_response` check at the bottom.
```
The documentation comment for execute_score is incorrect. It describes HTTP forwarding and a non-existent ScoreHttpForwardStage, while the actual implementation uses a gRPC pipeline with ScoreNativeStage.
```rust
/// Execute the complete pipeline for a Score API request.
///
/// Score requests are processed via the gRPC pipeline using the `ScoreNativeStage`.
/// This stage executes the native Score RPC via the client and returns
/// `Ok(Some(response))` to short-circuit the pipeline.
```
```python
# The output data for score is a relevance score wrapped in a tensor.
# vLLM versions return different structures — normalize to scalar.
data = final_output.outputs.data
logger.info("Score data type=%s repr=%s", type(data).__name__, repr(data)[:200])
```
🟡 Nit: This `logger.info` fires for every `text_2` item in every score request, logging `repr(data)`. In production with batched requests this will be extremely noisy. Should be `logger.debug` — the INFO log on line 272 already records the request arrival.
```python
while isinstance(raw, (list, tuple)) and len(raw) > 0:
    raw = raw[0]
score_value = float(raw)
```
🟡 Nit: If the model returns an empty list/tuple (e.g. []), the while loop exits without unwrapping and float(raw) is called on the empty list, raising an unhelpful TypeError. Consider adding a guard:
```python
if isinstance(raw, (list, tuple)):
    # exhausted without finding a scalar
    msg = f"Score request {request_id}_{i} returned empty data: {repr(data)[:200]}"
    logger.warning(msg)
    await context.abort(grpc.StatusCode.INTERNAL, msg)
score_value = float(raw)
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dc36dc55ca
```rust
return Err(error::internal_error(
    "score_backend_unsupported",
    "Score is only supported on vLLM backend",
));
```
Return 4xx instead of 500 for unsupported /v1/score backends
If /v1/score is called while SMG is running in gRPC mode with a non-vLLM backend (for example, SGLang/TRT-LLM), this branch converts the backend-capability mismatch into internal_error, which surfaces as HTTP 500. Since the endpoint is registered unconditionally, this is a predictable user-input/path mismatch rather than a server fault, and returning 500 will mislead clients and inflate server-error metrics.
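The status-code argument above can be sketched with a toy mapping. This is a hypothetical illustration, not SMG's actual error helpers: the function name and string backends are invented for the example.

```python
def status_for_score_error(backend: str, model_found: bool) -> int:
    """Toy status mapping for /v1/score failures (illustrative only).

    A missing model or a backend that simply lacks the capability is a
    predictable client/path mismatch (4xx); only genuine server faults
    should surface as 500.
    """
    if not model_found:
        return 404  # unknown model
    if backend != "vllm":
        return 400  # capability mismatch: client routed to wrong backend
    return 500      # real server-side failure

# Non-vLLM backend: client-side error, not a server fault.
print(status_for_score_error("sglang", True))   # 400
print(status_for_score_error("vllm", False))    # 404
```

Keeping capability mismatches out of the 5xx range also keeps server-error metrics meaningful for alerting.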
```rust
let proto_req =
    match client.build_score_request(request_id, score_req.text_1.clone(), text_2) {
        Ok(req) => req,
```
Preserve /v1/score truncation options in gRPC requests
The gRPC score path only forwards request_id, text_1, and text_2, so ScoreRequest options like truncate_prompt_tokens (and encoding_format) are silently dropped in gRPC mode. This creates behavior drift from HTTP passthrough mode and can cause long text-pair requests that depend on truncation to fail or behave differently when routed through the gRPC pipeline.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/protocols/src/model_type.rs`:
- Around line 37-39: The JSON schema enum for ModelType is missing the "score"
variant even though ModelType includes the SCORE flag and CAPABILITY_NAMES/serde
accept "score"; update the ModelType JsonSchema implementation to add "score" to
the enum list (and mirror this change wherever the manual enum list is
duplicated, e.g., the other JsonSchema/enum generation block referenced at the
second location). Locate the ModelType definition (including the const SCORE)
and the impl JsonSchema for ModelType and add "score" to the returned enum
values so schema-based validation and generated clients accept the score
capability.
In `@crates/protocols/src/rerank.rs`:
- Around line 233-252: Add validation for ScoreRequest similar to RerankRequest
by implementing a validation method (e.g., impl ScoreRequest::validate or
implementing the same Validate trait used by RerankRequest) that checks that
text_1 is not empty and that text_2 contains at least one non-empty document
(handle both String and Vec variants of StringOrVec). Return a suitable error
type on failure and call this validation where other request types are validated
so protocol-layer errors are consistent.
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 287-341: The loop over request.text_2 in servicer.py currently
calls self.engine.encode per-item (see the for i, text_2_item in
enumerate(request.text_2) loop and use of pooling_params/request_id), which
serializes scoring and limits throughput; either implement batching by
aggregating prompts (build a list of TokensPrompt objects and call
self.engine.encode once to let the backend parallelize) or, if sequential
processing is intentional, add a clear TODO comment above the loop referencing
request.text_2, self.engine.encode and pooling_params that notes this is
single-item encoding and that batching should be considered in a follow-up for
high-throughput workloads.
- Around line 331-333: The unwrapping loop around the variable raw (while
isinstance(raw, (list, tuple)) and len(raw) > 0: raw = raw[0]) can leave raw as
an empty list/tuple and then float(raw) will raise; modify the unwrapping logic
in servicer.py around the raw handling to guard against empty containers by
checking after the loop whether raw is still a list/tuple (or empty) and either
raise a clear ValueError/TypeError with context (including the problematic raw)
or provide a sensible default, ensuring you only call float(raw) when raw is a
scalar value; reference the raw unwrapping code block and the float(raw) call
when applying the fix.
In `@model_gateway/src/routers/grpc/pipeline.rs`:
- Around line 1128-1129: Update the incorrect comment that references
ScoreHttpForwardStage: change it to refer to ScoreNativeStage so the comment
matches the implementation of the score pipeline (e.g., the comment near the
Ok(Some(response)) branch should explain that ScoreNativeStage returns Ok(Some)
to short-circuit and record success).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 057c93a3-855c-446b-989c-1c27d129717c
📒 Files selected for processing (20)
- crates/grpc_client/proto/vllm_engine.proto
- crates/grpc_client/python/pyproject.toml
- crates/grpc_client/src/vllm_engine.rs
- crates/protocols/src/model_type.rs
- crates/protocols/src/rerank.rs
- grpc_servicer/smg_grpc_servicer/vllm/servicer.py
- model_gateway/src/observability/metrics.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/common/stages/dispatch_metadata.rs
- model_gateway/src/routers/grpc/context.rs
- model_gateway/src/routers/grpc/harmony/stages/request_building.rs
- model_gateway/src/routers/grpc/harmony/stages/response_processing.rs
- model_gateway/src/routers/grpc/pipeline.rs
- model_gateway/src/routers/grpc/regular/stages/mod.rs
- model_gateway/src/routers/grpc/regular/stages/score/mod.rs
- model_gateway/src/routers/grpc/router.rs
- model_gateway/src/routers/http/router.rs
- model_gateway/src/routers/mod.rs
- model_gateway/src/routers/router_manager.rs
- model_gateway/src/server.rs
```rust
/// Score/cross-encoder reranker models (vLLM /v1/score)
const SCORE = 1 << 12;
```
Add "score" to ModelType JSON schema enum values.
CAPABILITY_NAMES/serde now accept "score", but the manual JsonSchema enum list still omits it. This can cause schema-based validation or generated clients to reject valid configs.
🛠️ Proposed fix
```diff
@@
         enum_values: Some(vec![
             "chat".into(),
             "completions".into(),
             "responses".into(),
             "embeddings".into(),
             "rerank".into(),
             "generate".into(),
             "vision".into(),
             "tools".into(),
             "reasoning".into(),
             "image_gen".into(),
             "audio".into(),
             "moderation".into(),
+            "score".into(),
         ]),
```

Also applies to: 87-87
```rust
#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
pub struct ScoreRequest {
    /// The model to use for scoring
    pub model: String,

    /// The query/source text (single string)
    pub text_1: String,

    /// The document(s) to score against the query.
    /// Can be a single string or a list of strings.
    pub text_2: StringOrVec,

    /// Optional encoding format for the response
    #[serde(skip_serializing_if = "Option::is_none")]
    pub encoding_format: Option<String>,

    /// Whether to truncate the input
    #[serde(skip_serializing_if = "Option::is_none")]
    pub truncate_prompt_tokens: Option<u32>,
}
```
🧹 Nitpick | 🔵 Trivial
Consider adding validation for ScoreRequest.

Unlike RerankRequest, which has validation for a non-empty query and documents, ScoreRequest lacks validation. Consider adding:

- Non-empty `text_1` validation
- Non-empty `text_2` validation (at least one document to score)

This would provide consistent error handling at the protocol layer rather than at the backend.
♻️ Example validation addition
```diff
+use validator::Validate;
+
 #[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
+#[derive(Validate)]
 pub struct ScoreRequest {
     /// The model to use for scoring
     pub model: String,
     /// The query/source text (single string)
+    #[validate(custom(function = "validate_text_1"))]
     pub text_1: String,
     /// The document(s) to score against the query.
     /// Can be a single string or a list of strings.
+    #[validate(custom(function = "validate_text_2"))]
     pub text_2: StringOrVec,
     // ... rest unchanged
 }
+
+fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
+    if text.trim().is_empty() {
+        return Err(validator::ValidationError::new("text_1 cannot be empty"));
+    }
+    Ok(())
+}
+
+fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
+    if text_2.is_empty() {
+        return Err(validator::ValidationError::new("text_2 cannot be empty"));
+    }
+    Ok(())
+}
```
```python
while isinstance(raw, (list, tuple)) and len(raw) > 0:
    raw = raw[0]
score_value = float(raw)
```
Edge case: an empty nested structure causes a confusing TypeError.
The while loop unwraps nested lists/tuples, but if the innermost non-empty container holds an empty container, raw[0] will be assigned that empty container, then the next iteration's len(raw) > 0 check will fail and float(raw) will be called on an empty list/tuple, raising a TypeError.
While this edge case is unlikely given vLLM's output format, consider adding a guard:
🛡️ Proposed defensive fix
```diff
 while isinstance(raw, (list, tuple)) and len(raw) > 0:
     raw = raw[0]
+if isinstance(raw, (list, tuple)):
+    raise ValueError(f"Could not extract scalar score from nested structure: {data!r}")
 score_value = float(raw)
```
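The guarded unwrapping can be shown as a small self-contained helper. `unwrap_score` is a hypothetical name (the servicer inlines this logic); the guard turns the opaque `float([])` TypeError into a clear error.

```python
def unwrap_score(data):
    """Unwrap arbitrarily nested lists/tuples down to a scalar score.

    Mirrors the servicer's while-loop, plus the guard suggested above:
    if unwrapping bottoms out on an empty container, raise a clear
    ValueError instead of letting float([]) produce an opaque TypeError.
    """
    raw = data
    while isinstance(raw, (list, tuple)) and len(raw) > 0:
        raw = raw[0]
    if isinstance(raw, (list, tuple)):
        # exhausted the structure without finding a scalar
        raise ValueError(f"could not extract scalar score from: {data!r}")
    return float(raw)

assert unwrap_score([[0.87]]) == 0.87   # nested list unwrapped
assert unwrap_score(0.5) == 0.5        # already scalar
```

Calling `unwrap_score([[]])` raises `ValueError` with the offending structure in the message, which is far easier to debug from logs than a bare TypeError.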
```rust
Ok(Some(response)) => {
    // ScoreHttpForwardStage returns Ok(Some) to short-circuit — record success
```
Comment references wrong stage name.
The comment mentions ScoreHttpForwardStage, but the score pipeline uses ScoreNativeStage. This appears to be a copy-paste artifact.
📝 Suggested fix
```diff
 Ok(Some(response)) => {
-    // ScoreHttpForwardStage returns Ok(Some) to short-circuit — record success
+    // ScoreNativeStage returns Ok(Some) to short-circuit — record success
     Metrics::record_router_duration(
```
```rust
/// Create a Score API pipeline for vLLM `/v1/score` endpoint.
///
/// The `/v1/score` endpoint was added to the vLLM gRPC specs. This pipeline uses
/// three stages:
```
There was a problem hiding this comment.
🟡 Nit: Doc says "three stages" but the pipeline actually has four — ScorePreparationStage is in the code (line 333) but missing from the numbered list. Should be:
1. `ScorePreparationStage` — extracts `text_1` for worker routing
2. `WorkerSelectionStage` — picks the target worker
3. `ClientAcquisitionStage` — obtains the gRPC client
4. `ScoreNativeStage` — executes the Score RPC
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: da212d0f9e
```rust
ctx.state.preparation = Some(PreparationOutput {
    original_text: Some(original_text),
    token_ids: Vec::new(), // Scoring worker routing doesn't strictly need accurate token lengths
```
Populate score token_ids before worker selection
ScorePreparationStage always sets token_ids to an empty vector, so WorkerSelectionStage forwards tokens: None for every /v1/score request. In deployments using the prefix_hash policy, select_worker_impl returns None when tokens are missing, which makes the score path fail with model_not_found even though healthy workers exist. This is a hard routing failure for any model configured with token-based policies.
```rust
    policy_registry,
    WorkerSelectionMode::Regular, // Score is always single-worker
)),
Box::new(ClientAcquisitionStage),
Box::new(super::regular::stages::score::ScoreNativeStage::new()),
```
Preserve worker load/circuit accounting in score pipeline
The new score pipeline skips RequestExecutionStage and calls client.score(...) directly, which bypasses the normal WorkerLoadGuard lifecycle and worker outcome recording done in request_execution.rs. As a result, score traffic is invisible to load-based balancing and does not feed circuit-breaker outcomes, so under real score load the router can keep selecting overloaded or failing workers instead of adapting like other gRPC endpoints.
Actionable comments posted: 2
♻️ Duplicate comments (1)
crates/protocols/src/rerank.rs (1)
233-252: ⚠️ Potential issue | 🟠 Major

Add protocol-layer validation for `ScoreRequest` fields.

At line 233, `ScoreRequest` is public but not validated, so empty `text_1` and empty/blank `text_2` inputs can pass protocol parsing and fail later in backend-specific ways. Please reject these early, similar to `RerankRequest`.

♻️ Proposed fix

```diff
-#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
+#[derive(Debug, Clone, Deserialize, Serialize, Validate, schemars::JsonSchema)]
+#[validate(schema(function = "validate_score_request"))]
 pub struct ScoreRequest {
     /// The model to use for scoring
     pub model: String,
     /// The query/source text (single string)
+    #[validate(custom(function = "validate_text_1"))]
     pub text_1: String,
     /// The document(s) to score against the query.
     /// Can be a single string or a list of strings.
+    #[validate(custom(function = "validate_text_2"))]
     pub text_2: StringOrVec,
@@
 }
+
+fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
+    if text.trim().is_empty() {
+        return Err(validator::ValidationError::new("text_1 cannot be empty"));
+    }
+    Ok(())
+}
+
+fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
+    match text_2 {
+        StringOrVec::Single(s) if s.trim().is_empty() => {
+            Err(validator::ValidationError::new("text_2 cannot be empty"))
+        }
+        StringOrVec::Array(v) if v.is_empty() || v.iter().any(|s| s.trim().is_empty()) => {
+            Err(validator::ValidationError::new("text_2 contains empty entries"))
+        }
+        _ => Ok(()),
+    }
+}
+
+#[expect(
+    clippy::unnecessary_wraps,
+    reason = "validator crate requires Result return type"
+)]
+fn validate_score_request(_req: &ScoreRequest) -> Result<(), validator::ValidationError> {
+    Ok(())
+}
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 31-36: The code currently calls mm_inputs(...) unconditionally
which raises a 'NoneType' object is not callable when the vllm multimodal
constructor failed to import; update the handling so that before entering the
preprocessed multimodal path (i.e., when has_preprocessed_mm is true or inside
_build_preprocessed_mm_inputs) you check that mm_inputs is not None and
VllmMultiModalInputs is available, and if not raise a clear
UnsupportedOperation/RuntimeError with a message like "multimodal preprocessing
not supported in this vllm version" (or fail fast during initialization if you
prefer); ensure checks reference mm_inputs, VllmMultiModalInputs,
_build_preprocessed_mm_inputs and has_preprocessed_mm so callers hit a clear
error instead of a NoneType call.
- Around line 300-306: The code is passing the raw token_type_ids list into
pooling_params.extra_kwargs, but vLLM expects a compressed integer index named
"compressed_token_type_ids"; update the branch in servicer.py that handles
encoded["token_type_ids"] to import and call
compress_token_type_ids(token_type_ids) and set pair_pooling_params.extra_kwargs
= {"compressed_token_type_ids": compressed_index} (keeping the rest of
pooling_params cloning logic and names like pooling_params, pair_pooling_params,
and encoded unchanged).
---
Duplicate comments:
In `@crates/protocols/src/rerank.rs`:
- Around line 233-252: ScoreRequest is not validated at protocol layer allowing
empty text_1 or empty/blank text_2 to pass; add a validation routine (implement
a Validate-like method or trait) for the ScoreRequest struct that checks: model
is present if required, text_1 is non-empty/non-whitespace, and text_2
(StringOrVec) contains at least one non-empty/non-whitespace string (if it's a
Vec ensure no empty entries; if it's a single String ensure it's non-blank);
wire this validator into the same protocol parsing path used by RerankRequest so
invalid requests are rejected early with a clear error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9216d459-c62e-44a5-bb30-dd1be6b11677
📒 Files selected for processing (3)
- crates/grpc_client/python/smg_grpc_proto/proto
- crates/protocols/src/rerank.rs
- grpc_servicer/smg_grpc_servicer/vllm/servicer.py
```python
# Pass token_type_ids via pooling_params if present
pair_pooling_params = pooling_params
if "token_type_ids" in encoded:
    pair_pooling_params = pooling_params.clone()
    pair_pooling_params.extra_kwargs = {
        "token_type_ids": encoded["token_type_ids"],
    }
```
🧩 Analysis chain
🏁 Script executed (repository: lightseekorg/smg):

```shell
find . -type f -name "*.py" | grep -E "(servicer|vllm)" | head -20
```

🏁 Script executed:

```shell
cat -n grpc_servicer/smg_grpc_servicer/vllm/servicer.py | sed -n '290,320p'
```

🏁 Script executed:

```shell
rg "compress_token_type_ids" --type py
```

🏁 Script executed:

```shell
rg "token_type_ids" grpc_servicer/smg_grpc_servicer/vllm/ --type py -B 3 -A 3
```

🏁 Script executed:

```shell
rg "pooling_params" grpc_servicer/smg_grpc_servicer/vllm/ --type py -B 2 -A 2
```

🏁 Script executed:

```shell
rg "compressed_token_type_ids\|compress_token_type_ids" --type py
```
🌐 Web query:
vLLM pooling_params extra_kwargs token_type_ids compressed_token_type_ids
💡 Result:
In vLLM's pooling API (used for tasks like embeddings, classification, scoring with models like cross-encoders), PoolingParams is a configuration class that includes an 'extra_kwargs' field: dict[str, Any] | None = None. This field stores additional parameters for internal use. token_type_ids is an optional field in TokensPrompt (vllm.inputs), a list[int] used for cross-encoder models to distinguish query and document tokens (typically [0]*query_len + [1]*doc_len). When token_type_ids is provided in the engine prompt (e.g., during LLM.score for cross-encoder scoring), vLLM processes it as follows: - Pops "token_type_ids" from the prompt. - Clones the default PoolingParams. - Computes compressed_token_type_ids = compress_token_type_ids(token_type_ids), where compress_token_type_ids returns the index of the first 1 in the list (position where document tokens start, assuming format of zeros followed by ones) or the list length if no 1s. - Sets pooling_params.extra_kwargs = {"compressed_token_type_ids": compressed} This compressed value (an int) is passed via extra_kwargs to the pooling layer (e.g., for token-wise pooling in cross-encoder score computation), likely to efficiently identify the boundary without storing the full list. Sources: vLLM GitHub (pooling_params.py, llm.py, score_utils.py), docs (inputs, score_utils, pooling score serving).
Citations:
- 1: https://github.com/vllm-project/vllm/blob/main/vllm/pooling_params.py
- 2: https://docs.vllm.ai/en/stable/api/vllm/entrypoints/pooling/score/serving/
- 3: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py
- 4: https://docs.vllm.ai/en/stable/api/vllm/inputs/
- 5: https://docs.vllm.ai/en/v0.13.0/api/vllm/entrypoints/score_utils/
🌐 Web query:
vLLM score task token_type_ids handling official documentation
💡 Result:
vLLM supports handling of token_type_ids primarily for cross-encoder models used in score tasks. TokensPrompt includes an optional token_type_ids field: "A list of token type IDs to pass to the cross encoder model." This allows users to provide pre-tokenized inputs with segment IDs (e.g., 0 for query, 1 for document in BERT-like models). In the LLM class (offline inference), when token_type_ids are present in the prompt, they are extracted, compressed via compress_token_type_ids (likely run-length encoding for efficiency), and passed as extra_kwargs={"compressed_token_type_ids": compressed} to PoolingParams for the score/classify task. For score tasks (cross-encoders), users typically provide text pairs (query, document), which vLLM tokenizes internally using get_score_prompt, applying score templates. These may generate token_type_ids automatically during tokenization if the tokenizer supports it (e.g., BERT tokenizer with text_pair). However, users can bypass internal tokenization by providing TokensPrompt with explicit token_type_ids. There was a bug in AsyncLLMEngine (online serving) where token_type_ids from TokensPrompt were lost for score tasks (v0.8.5.post1, issue #17743, closed Aug 2025), requiring a patch to InputPreprocessor to forward them. This suggests support exists but had an async-specific issue, likely fixed post-closure. Score task uses pooling runner (PoolingParams with task="score", deprecated in recent versions favoring classify for cross-encoders). No explicit PoolingParams field for token_type_ids; handled internally via extra_kwargs. Overall, token_type_ids handling is integrated for cross-encoder score/classify tasks via TokensPrompt and passed to model executor, with compression for efficiency. Primarily useful when providing pre-tokenized inputs; internal text tokenization may handle automatically for supported models.
Citations:
- 1: [Bug]: token_type_ids lost from prompt input during asynchronous request processing vllm-project/vllm#17743
- 2: https://docs.vllm.ai/en/stable/api/vllm/entrypoints/llm/
- 3: https://docs.vllm.ai/en/v0.7.1/api/offline_inference/llm_inputs.html
- 4: https://docs.vllm.ai/en/v0.18.2/models/pooling_models/scoring/
- 5: https://github.com/vllm-project/vllm/blob/main/vllm/pooling_params.py
- 6: https://docs.vllm.ai/en/v0.6.2/_modules/vllm/inputs/data.html
- 7: https://docs.vllm.ai/en/v0.7.0/_modules/vllm/inputs/data.html
🌐 Web query:
vLLM InputPreprocessor _process_tokens token_type_ids
💡 Result:
In vLLM's InputPreprocessor class (vllm.inputs.preprocess.py), the _process_tokens method processes a TokensPrompt object. It extracts prompt_token_ids from parsed_content["prompt_token_ids"] and optionally retrieves token_type_ids using parsed_content.get("token_type_ids"). If multi_modal_data is present in parsed_content, it calls _process_multimodal (or async equivalent) with prompt_token_ids and other data. Otherwise, it constructs a TokenInputs object passing both prompt_token_ids and token_type_ids. token_type_ids supports models like cross-encoders (e.g., for query-document scoring) that require distinguishing token types, similar to BERT-style token_type_ids (0 for first sequence, 1 for second). There was a bug (GitHub issue #17743, fixed by Aug 2025) where token_type_ids were lost in async processing, but it's resolved in recent versions. Example code snippet from docs: def _process_tokens(self, parsed_content: TokensPrompt, ...): prompt_token_ids = parsed_content["prompt_token_ids"] token_type_ids = parsed_content.get("token_type_ids") ... else: inputs = token_inputs( prompt_token_ids=prompt_token_ids, token_type_ids=token_type_ids, ) TokensPrompt schema: prompt_token_ids: list[int], token_type_ids: list[int] (optional).
Citations:
- 1: https://docs.vllm.ai/en/v0.9.0/api/vllm/inputs/preprocess.html
- 2: [Bug]: token_type_ids lost from prompt input during asynchronous request processing vllm-project/vllm#17743
- 3: https://vllm.website.cncfstack.com/api/vllm/inputs/preprocess/
- 4: https://docs.vllm.ai/en/v0.10.1/api/vllm/inputs/preprocess.html
- 5: https://docs.vllm.ai/en/v0.9.0/api/vllm/inputs/
Pass compressed token_type_ids via extra_kwargs, not raw list.
vLLM's cross-encoder score/classify path expects pooling_params.extra_kwargs["compressed_token_type_ids"] as an integer (the position of the first document token), not the raw token_type_ids list. vLLM computes this by calling compress_token_type_ids(token_type_ids), which returns the index of the first 1 in the list. Passing the raw list here will cause incorrect score computation. (vLLM pooling_params.py, vLLM score_utils)
🐛 Proposed fix

```diff
 if "token_type_ids" in encoded:
     pair_pooling_params = pooling_params.clone()
     pair_pooling_params.extra_kwargs = {
-        "token_type_ids": encoded["token_type_ids"],
+        "compressed_token_type_ids": compress_token_type_ids(
+            encoded["token_type_ids"]
+        ),
     }
```

Import `compress_token_type_ids` from vLLM's score utilities.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 300 - 306, The
code is passing the raw token_type_ids list into pooling_params.extra_kwargs,
but vLLM expects a compressed integer index named "compressed_token_type_ids";
update the branch in servicer.py that handles encoded["token_type_ids"] to
import and call compress_token_type_ids(token_type_ids) and set
pair_pooling_params.extra_kwargs = {"compressed_token_type_ids":
compressed_index} (keeping the rest of pooling_params cloning logic and names
like pooling_params, pair_pooling_params, and encoded unchanged).
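Per the vLLM sources queried above, the compressed value is simply the index where document tokens begin. A self-contained re-implementation of that semantics, for reference only; in `servicer.py` the helper should be imported from vLLM rather than redefined:

```python
# Reference re-implementation of the semantics described above: return the
# index of the first 1 in token_type_ids (where document tokens start), or the
# list length when the list contains no 1s.
from typing import List

def compress_token_type_ids(token_type_ids: List[int]) -> int:
    try:
        return token_type_ids.index(1)
    except ValueError:
        return len(token_type_ids)

# A query of 3 tokens followed by a document of 2 tokens:
print(compress_token_type_ids([0, 0, 0, 1, 1]))  # → 3
```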
Hi @ppraneth, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```shell
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 327-332: The ScoreResult construction unnecessarily wraps
score_value in float() even though score_value is already a float; in the
results.append call that creates vllm_engine_pb2.ScoreResult (the block
referencing results.append and vllm_engine_pb2.ScoreResult), remove the
redundant float() conversion and pass score_value directly as the score argument
(i.e., use score=score_value).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 417aa3d4-c4d9-4c0e-a613-40cf6f9ffcb4
📒 Files selected for processing (1)
grpc_servicer/smg_grpc_servicer/vllm/servicer.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 288797a7e6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```rust
(ModelType::IMAGE_GEN, "image_gen"),
(ModelType::AUDIO, "audio"),
(ModelType::MODERATION, "moderation"),
(ModelType::SCORE, "score"),
```
Keep ModelType schema in sync with new score capability
Adding ModelType::SCORE to CAPABILITY_NAMES makes runtime serialization emit "score", but the manual JsonSchema enum list in the same file still omits "score". That creates a schema/runtime mismatch where generated OpenAPI/JSON-schema validation can reject payloads that the code itself now produces for score-capable models.
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/protocols/src/rerank.rs`:
- Around line 245-251: The truncate_prompt_tokens field is declared on the
public Rerank/score request but is ignored on the native gRPC path; update the
native plumbing to carry it or remove it until supported. Either (preferred) add
truncate_prompt_tokens to the native request/message and propagate it in
ScoreNativeStage (and any ScoreNativeRequest/ScoreNativeResponse structs, gRPC
proto/messages, and the native transport marshal/unmarshal code) so the native
path forwards and honors the truncation flag exactly like the HTTP path, or
remove truncate_prompt_tokens from the public struct in rerank.rs so the
contract is consistent; make sure to update any tests and code that build/parse
the native score request to reference the truncate_prompt_tokens symbol and not
silently drop it.
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 314-316: The per-result score dump currently uses logger.info with
an expensive repr(data) and runs for every final_output.outputs.data (see the
logger.info line and raw = data), which floods INFO on large rerank requests;
change that log to logger.debug and avoid building the repr unless debug is
enabled (e.g. check logger.isEnabledFor(logging.DEBUG) before computing
repr(data)), or remove the log entirely if not needed, leaving raw = data
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3c420d23-42af-47bb-a216-546f1a37e34d
📒 Files selected for processing (4)
- crates/grpc_client/proto/vllm_engine.proto
- crates/protocols/src/rerank.rs
- grpc_servicer/smg_grpc_servicer/vllm/servicer.py
- model_gateway/src/routers/grpc/regular/stages/score/mod.rs
```rust
/// Optional encoding format for the response
#[serde(skip_serializing_if = "Option::is_none")]
pub encoding_format: Option<String>,

/// Whether to truncate the input
#[serde(skip_serializing_if = "Option::is_none")]
pub truncate_prompt_tokens: Option<u32>,
```
truncate_prompt_tokens is exposed here but ignored on the native path.
This field is part of the public /v1/score contract now, but the native gRPC transport and ScoreNativeStage only forward request_id, text_1, and text_2. That means HTTP passthrough can honor truncation while native gRPC silently drops it, so the same request behaves differently between connection modes. Please either plumb it end-to-end or remove the field until it is supported.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/protocols/src/rerank.rs` around lines 245 - 251, The
truncate_prompt_tokens field is declared on the public Rerank/score request but
is ignored on the native gRPC path; update the native plumbing to carry it or
remove it until supported. Either (preferred) add truncate_prompt_tokens to the
native request/message and propagate it in ScoreNativeStage (and any
ScoreNativeRequest/ScoreNativeResponse structs, gRPC proto/messages, and the
native transport marshal/unmarshal code) so the native path forwards and honors
the truncation flag exactly like the HTTP path, or remove truncate_prompt_tokens
from the public struct in rerank.rs so the contract is consistent; make sure to
update any tests and code that build/parse the native score request to reference
the truncate_prompt_tokens symbol and not silently drop it.
```python
data = final_output.outputs.data
logger.info("Score data type=%s repr=%s", type(data).__name__, repr(data)[:200])
raw = data
```
Move the per-result score dump off INFO.
This runs once per text_2 item, so large rerank requests will emit one INFO log per candidate and serialize repr(data) on the hot path. Please downgrade it to DEBUG or remove it after bring-up.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 314 - 316, The
per-result score dump currently uses logger.info with an expensive repr(data)
and runs for every final_output.outputs.data (see the logger.info line and raw =
data), which floods INFO on large rerank requests; change that log to
logger.debug and avoid building the repr unless debug is enabled (e.g. check
logger.isEnabledFor(logging.DEBUG) before computing repr(data)), or remove the
log entirely if not needed, leaving raw = data unchanged.
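The suggested change amounts to a standard hot-path logging guard; a minimal sketch (the logger name is illustrative):

```python
import logging

logger = logging.getLogger("smg.score")

def log_score_data(data) -> None:
    # Downgraded to DEBUG, and the expensive repr() is only built when DEBUG
    # logging is actually enabled, so large rerank requests no longer emit one
    # INFO log (and one repr serialization) per candidate.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(
            "Score data type=%s repr=%s", type(data).__name__, repr(data)[:200]
        )
```

The `isEnabledFor` check makes the call essentially free at INFO level, since neither the format string nor `repr(data)` is evaluated.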
Description
Problem
SMG's router does not support the `/v1/score` endpoint, returning 404 in both gRPC and HTTP connection modes. This prevents reranker models (e.g., `BAAI/bge-reranker-v2-m3`, ModernBERT-based cross-encoders) from being served through SMG.

The vLLM worker correctly exposes `/v1/score` and responds with valid rerank results, but the SMG gateway does not route this endpoint.

Closes #1017
Solution
Add end-to-end `/v1/score` support across the gateway:

- `ScoreRequest`, `ScoreResponse`, and `ScoreResult` messages plus a `Score` RPC in `vllm_engine.proto`.
- `score()` and `build_score_request()` on the vLLM engine client.
- `ScoreNativeStage` in the gRPC regular pipeline, plus score-aware dispatch metadata, context plumbing, and HTTP→gRPC routing in the gateway server.
- `/v1/score` passthrough to the vLLM worker in HTTP connection mode.
- `Score` RPC in the vLLM servicer with proper cross-encoder text-pair tokenization (`tokenizer(text=text_1, text_pair=text_2)` with `[SEP]` tokens) and `PoolingParams(task="classify")`, mirroring vLLM's upstream `CrossEncoderIOProcessor`. Note: vLLM deprecated the `"score"` task in vllm-project/vllm#37537; cross-encoder rerankers now use `"classify"`.
- Extended `ModelType` detection to recognize scoring/reranker models, and added `ScoreRequest`/`ScoreResponse` protocol types.

Changes
- `crates/grpc_client/proto/vllm_engine.proto` — Add `Score` RPC and `ScoreRequest`/`ScoreResponse`/`ScoreResult` messages
- `crates/grpc_client/python/pyproject.toml` — Bump proto package version
- `crates/grpc_client/src/vllm_engine.rs` — Implement `score()` and `build_score_request()` on gRPC client
- `crates/protocols/src/model_type.rs` — Extend model type detection for scoring models
- `crates/protocols/src/rerank.rs` — Add `ScoreRequest`/`ScoreResponse` protocol types
- `model_gateway/src/observability/metrics.rs` — Add score endpoint metrics
- `model_gateway/src/routers/grpc/client.rs` — Wire score through gRPC client abstraction
- `model_gateway/src/routers/grpc/common/stages/dispatch_metadata.rs` — Score-aware dispatch metadata
- `model_gateway/src/routers/grpc/context.rs` — Add score request to gRPC context
- `model_gateway/src/routers/grpc/harmony/stages/request_building.rs` — Harmony pipeline score support
- `model_gateway/src/routers/grpc/harmony/stages/response_processing.rs` — Harmony pipeline score response
- `model_gateway/src/routers/grpc/pipeline.rs` — Register score stage in pipeline
- `model_gateway/src/routers/grpc/regular/stages/mod.rs` — Export score stage module
- `model_gateway/src/routers/grpc/regular/stages/score/mod.rs` — `ScoreNativeStage` implementation
- `model_gateway/src/routers/grpc/router.rs` — Route score requests in gRPC router
- `model_gateway/src/routers/http/router.rs` — Route `/v1/score` in HTTP passthrough mode
- `model_gateway/src/routers/mod.rs` — Add score to router trait
- `model_gateway/src/routers/router_manager.rs` — Wire score in router manager
- `model_gateway/src/server.rs` — Register `/v1/score` HTTP endpoint
- `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` — Implement `Score` RPC with cross-encoder text-pair tokenization

Test Plan
Model: `BAAI/bge-reranker-v2-m3`

gRPC mode (default) result:

```json
{
  "object": "list",
  "data": [
    {"object": "score", "score": 0.9942461848258972, "index": 0},
    {"object": "score", "score": 0.0004087462439201772, "index": 1}
  ],
  "model": "bge-reranker",
  "usage": {"prompt_tokens": 33, "completion_tokens": 0, "total_tokens": 33}
}
```

HTTP mode result:

```json
{
  "id": "score-9c1eefc27778320a",
  "object": "list",
  "created": 1775210958,
  "model": "bge-reranker",
  "data": [
    {"index": 0, "object": "score", "score": 0.994252622127533},
    {"index": 1, "object": "score", "score": 0.00040918969898484647}
  ],
  "usage": {"prompt_tokens": 33, "total_tokens": 33, "completion_tokens": 0, "prompt_tokens_details": null}
}
```

Both modes correctly rank "Paris is the capital" as highly relevant (~0.994) and "London is in the UK" as irrelevant (~0.0004). Scores match between gRPC and HTTP within floating-point precision.
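The response shape above can be consumed as shown below. The request body is an assumption consistent with the test plan (the PR does not show the exact `text_1`/`text_2` strings), and the response is the gRPC-mode result quoted above:

```python
import json

# Hypothetical /v1/score request body; text_1/text_2 are assumed inputs.
request = {
    "model": "bge-reranker",
    "text_1": "What is the capital of France?",
    "text_2": ["Paris is the capital of France.", "London is in the UK."],
}
# To send it (sketch): requests.post(f"{gateway_url}/v1/score", json=request)

# gRPC-mode response quoted in the test plan above.
response = json.loads("""
{ "object": "list",
  "data": [ {"object": "score", "score": 0.9942461848258972, "index": 0},
            {"object": "score", "score": 0.0004087462439201772, "index": 1} ],
  "model": "bge-reranker",
  "usage": {"prompt_tokens": 33, "completion_tokens": 0, "total_tokens": 33} }
""")

# Results keep the input candidate order; sort by score to use as a reranker.
ranked = sorted(response["data"], key=lambda d: d["score"], reverse=True)
print([d["index"] for d in ranked])  # → [0, 1]
```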
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Summary by CodeRabbit