
feat(gateway): add /v1/score endpoint for cross-encoder reranker models#1032

Open
ppraneth wants to merge 23 commits into main from ppraneth/endpoint

Conversation

@ppraneth
Collaborator

@ppraneth ppraneth commented Apr 3, 2026

Description

Problem

SMG's router does not support the /v1/score endpoint, returning 404 in both gRPC and HTTP connection modes. This prevents reranker models (e.g., BAAI/bge-reranker-v2-m3, ModernBERT-based cross-encoders) from being served through SMG.

The vLLM worker correctly exposes /v1/score and responds with valid rerank results, but the SMG gateway does not route this endpoint.
Closes #1017

Solution

Add end-to-end /v1/score support across the gateway:

  • Proto: Define ScoreRequest, ScoreResponse, and ScoreResult messages plus a Score RPC in vllm_engine.proto.
  • gRPC client (Rust): Implement score() and build_score_request() on the vLLM engine client.
  • Routing pipeline (Rust): Wire up a ScoreNativeStage in the gRPC regular pipeline, add score-aware dispatch metadata, context plumbing, and HTTP→gRPC routing in the gateway server.
  • HTTP passthrough: Forward /v1/score to the vLLM worker in HTTP connection mode.
  • gRPC servicer (Python): Implement Score RPC in the vLLM servicer with proper cross-encoder text-pair tokenization (tokenizer(text=text_1, text_pair=text_2) with [SEP] tokens) and PoolingParams(task="classify"), mirroring vLLM's upstream CrossEncoderIOProcessor. Note: vLLM deprecated the "score" task in vllm-project/vllm#37537; cross-encoder rerankers now use "classify".
  • Observability: Add score request metrics.
  • Protocols: Extend ModelType detection to recognize scoring/reranker models and add ScoreRequest/ScoreResponse protocol types.
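The cross-encoder text-pair scheme described above (query and document joined with [SEP] tokens) can be illustrated with a short Python sketch. The toy whitespace tokenizer and the `encode_pair` helper are hypothetical; the actual servicer uses the model's HuggingFace tokenizer via `tokenizer(text=text_1, text_pair=text_2_item)`:

```python
# Illustrative only: toy whitespace tokenizer showing the shape of a
# cross-encoder input. A real servicer delegates to the model tokenizer.
CLS, SEP = "[CLS]", "[SEP]"

def encode_pair(text_1: str, text_2_item: str) -> list[str]:
    """Build the [CLS] query [SEP] document [SEP] sequence that a
    cross-encoder scores as one joint input."""
    return [CLS, *text_1.split(), SEP, *text_2_item.split(), SEP]

# One joint sequence per (query, document) pair, as in /v1/score.
pairs = [
    encode_pair("capital of France?", doc)
    for doc in ["Paris is the capital.", "London is in the UK."]
]
```

The key point is that each document forms its own joint sequence with the query, which is why the servicer issues one encode call per `text_2` item.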

Changes

  • crates/grpc_client/proto/vllm_engine.proto — Add Score RPC and ScoreRequest/ScoreResponse/ScoreResult messages
  • crates/grpc_client/python/pyproject.toml — Bump proto package version
  • crates/grpc_client/src/vllm_engine.rs — Implement score() and build_score_request() on gRPC client
  • crates/protocols/src/model_type.rs — Extend model type detection for scoring models
  • crates/protocols/src/rerank.rs — Add ScoreRequest/ScoreResponse protocol types
  • model_gateway/src/observability/metrics.rs — Add score endpoint metrics
  • model_gateway/src/routers/grpc/client.rs — Wire score through gRPC client abstraction
  • model_gateway/src/routers/grpc/common/stages/dispatch_metadata.rs — Score-aware dispatch metadata
  • model_gateway/src/routers/grpc/context.rs — Add score request to gRPC context
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs — Harmony pipeline score support
  • model_gateway/src/routers/grpc/harmony/stages/response_processing.rs — Harmony pipeline score response
  • model_gateway/src/routers/grpc/pipeline.rs — Register score stage in pipeline
  • model_gateway/src/routers/grpc/regular/stages/mod.rs — Export score stage module
  • model_gateway/src/routers/grpc/regular/stages/score/mod.rs — ScoreNativeStage implementation
  • model_gateway/src/routers/grpc/router.rs — Route score requests in gRPC router
  • model_gateway/src/routers/http/router.rs — Route /v1/score in HTTP passthrough mode
  • model_gateway/src/routers/mod.rs — Add score to router trait
  • model_gateway/src/routers/router_manager.rs — Wire score in router manager
  • model_gateway/src/server.rs — Register /v1/score HTTP endpoint
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py — Implement Score RPC with cross-encoder text-pair tokenization

Test Plan

Model: BAAI/bge-reranker-v2-m3

gRPC mode (default):

smg serve --host 0.0.0.0 --model BAAI/bge-reranker-v2-m3 --port 8080 --backend vllm --served-model-name bge-reranker
curl http://localhost:8080/v1/score \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-reranker","text_1":"What is the capital of France?","text_2":["Paris is the capital.","London is in the UK."]}'

Result:

{
  "object": "list",
  "data": [
    {"object": "score", "score": 0.9942461848258972, "index": 0},
    {"object": "score", "score": 0.0004087462439201772, "index": 1}
  ],
  "model": "bge-reranker",
  "usage": {"prompt_tokens": 33, "completion_tokens": 0, "total_tokens": 33}
}

HTTP mode:

smg serve --host 0.0.0.0 --model BAAI/bge-reranker-v2-m3 --port 8080 --backend vllm --connection-mode http --served-model-name bge-reranker
curl http://localhost:8080/v1/score \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-reranker","text_1":"What is the capital of France?","text_2":["Paris is the capital.","London is in the UK."]}'

Result:

{
  "id": "score-9c1eefc27778320a",
  "object": "list",
  "created": 1775210958,
  "model": "bge-reranker",
  "data": [
    {"index": 0, "object": "score", "score": 0.994252622127533},
    {"index": 1, "object": "score", "score": 0.00040918969898484647}
  ],
  "usage": {"prompt_tokens": 33, "total_tokens": 33, "completion_tokens": 0, "prompt_tokens_details": null}
}

Both modes correctly rank "Paris is the capital" as highly relevant (~0.994) and "London is in the UK" as irrelevant (~0.0004). Scores match between gRPC and HTTP within floating-point precision.
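As a client-side usage sketch (not part of the PR), documents can be re-ranked from a response like the ones above by matching each score back to its document via the `index` field:

```python
def rank_documents(response: dict, documents: list[str]) -> list[tuple[str, float]]:
    """Pair each document with its score (matched by index) and sort
    highest-relevance first."""
    scored = [(documents[item["index"]], item["score"]) for item in response["data"]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Data from the gRPC-mode test result above.
response = {
    "object": "list",
    "data": [
        {"object": "score", "score": 0.9942461848258972, "index": 0},
        {"object": "score", "score": 0.0004087462439201772, "index": 1},
    ],
}
docs = ["Paris is the capital.", "London is in the UK."]
ranked = rank_documents(response, docs)
```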

  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features
    • Added a native /v1/score endpoint for text-pair scoring/reranking that returns per-item scores, request IDs, timestamps, and token usage.
    • End-to-end support: server handlers, routing (HTTP & gRPC), pipelines, and clients now accept and dispatch Score requests (non-streaming) and return OpenAI-style score responses.
    • Metrics label added for the new Score endpoint.
  • Chores
    • Tightened protobuf dependency constraint.

ppraneth added 18 commits April 3, 2026 08:43
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@coderabbitai

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Adds end-to-end support for a new vLLM scoring endpoint (/v1/score): proto messages and RPC, client helpers, servicer implementation, protocol types, router/pipeline stages for gRPC and HTTP, routing/context/metrics plumbing, and response conversion.

Changes

  • Proto & Packaging (crates/grpc_client/proto/vllm_engine.proto, crates/grpc_client/python/pyproject.toml): Added Score RPC and messages (ScoreRequest, ScoreResult, ScoreResponse); constrained the protobuf dependency with an upper bound.
  • gRPC client (crates/grpc_client/src/vllm_engine.rs, crates/grpc_client/python/smg_grpc_proto/proto): Added a build_score_request helper and an async score method to the vLLM client; trivial newline fix for the Python proto.
  • vLLM servicer (grpc_servicer/smg_grpc_servicer/vllm/servicer.py): Implemented the async Score RPC: tokenization, per-pair encode() with PoolingParams(task="classify"), per-item score extraction, token counting, and gRPC error mapping.
  • Protocol types & rerank surface (crates/protocols/src/model_type.rs, crates/protocols/src/rerank.rs): Added ModelType::SCORE and Endpoint::Score; introduced ScoreRequest, StringOrVec, ScoreData, ScoreResponse, and a GenerationRequest impl (non-streaming) for rerank/score.
  • Gateway metrics & context (model_gateway/src/observability/metrics.rs, model_gateway/src/routers/grpc/context.rs): Added the ENDPOINT_SCORE metric label; added RequestType::Score, RequestContext::for_score and its accessor; score is treated as non-streaming.
  • gRPC client adapter & metadata (model_gateway/src/routers/grpc/client.rs, model_gateway/src/routers/grpc/common/stages/dispatch_metadata.rs): GrpcClient gained build_score_request and score forwarding for the vLLM variant; dispatch metadata now handles RequestType::Score.
  • Pipeline & stages (model_gateway/src/routers/grpc/pipeline.rs, model_gateway/src/routers/grpc/regular/stages/mod.rs, model_gateway/src/routers/grpc/regular/stages/score/mod.rs): Added the Score pipeline (new_score, execute_score) and the ScorePreparationStage and ScoreNativeStage stages, which build the backend proto, call the gRPC client, and convert to an OpenAI-style response.
  • Harmony pipeline guards (model_gateway/src/routers/grpc/harmony/stages/request_building.rs, model_gateway/src/routers/grpc/harmony/stages/response_processing.rs): Extended Harmony checks to explicitly reject RequestType::Score via the existing not-supported error path.
  • Router & server wiring (model_gateway/src/routers/grpc/router.rs, model_gateway/src/routers/http/router.rs, model_gateway/src/routers/mod.rs, model_gateway/src/routers/router_manager.rs, model_gateway/src/server.rs): Added route_score to RouterTrait (default unimplemented), implemented HTTP and gRPC router handling and RouterManager dispatch, and registered the POST /v1/score handler.
  • Minor packaging (crates/grpc_client/python/smg_grpc_proto/proto): Added a trailing newline (formatting only).
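The StringOrVec shape noted above (text_2 accepts either a single string or a list of strings, like vLLM's own API) can be illustrated with a small Python sketch; `normalize_text_2` is a hypothetical helper for illustration, not code from the PR:

```python
def normalize_text_2(text_2) -> list[str]:
    """Accept a single string or a list of strings and always return a
    list, mirroring an untagged string-or-vec deserialization."""
    if isinstance(text_2, str):
        return [text_2]
    return list(text_2)
```

This is why both `"text_2": "one doc"` and `"text_2": ["doc a", "doc b"]` are valid request bodies.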

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTP_Router as HTTP Router
    participant GrpcRouter as GrpcRouter
    participant Pipeline as RequestPipeline
    participant GrpcClient as GrpcClient (vLLM)
    participant Servicer as vLLM Servicer
    participant Engine as vLLM Engine

    Client->>HTTP_Router: POST /v1/score (ScoreRequest)
    HTTP_Router->>GrpcRouter: route_score()
    GrpcRouter->>Pipeline: execute_score(request, headers, model_id)
    Pipeline->>GrpcClient: build_score_request(...) / score(proto::ScoreRequest)
    GrpcClient->>Servicer: Score RPC
    Servicer->>Engine: encode(text_1, text_2[*]) with PoolingParams(task="classify")
    Engine-->>Servicer: outputs (per-item)
    Servicer-->>GrpcClient: ScoreResponse (results + token counts)
    GrpcClient-->>Pipeline: proto::ScoreResponse
    Pipeline->>GrpcRouter: OpenAI-style ScoreResponse (JSON)
    GrpcRouter-->>HTTP_Router: HTTP 200 JSON
    HTTP_Router-->>Client: 200 OK

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • CatherineSue
  • key4ng
  • njhill
  • slin1237

Poem

🐰 I hopped through proto, client, and route,
Scored each pair with a careful snout.
Tokens counted, ranks in store,
Responses returned — I ask for more!
A rabbit cheers — rerank and encore! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title 'feat(gateway): add /v1/score endpoint for cross-encoder reranker models' directly and specifically describes the main change: adding a new /v1/score endpoint to support scoring/reranking functionality in the gateway.
  • Linked Issues Check — ✅ Passed: All requirements from issue #1017 are met: the PR adds complete /v1/score endpoint support with request/response routing, model-type detection for reranker models, and dual-mode support (gRPC and HTTP).
  • Out of Scope Changes Check — ✅ Passed: All changes are directly in scope: proto definitions, gRPC client/servicer implementations, gateway routing stages, HTTP routing, protocol types, model-type detection, and observability, all necessary to support the /v1/score endpoint.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 89.06%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions github-actions bot added labels dependencies (Dependency updates), grpc (gRPC client and router changes), protocols (Protocols crate changes), and model-gateway (Model gateway crate changes) on Apr 3, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements support for the vLLM /v1/score endpoint to enable cross-encoder reranking, including updates to gRPC proto definitions, model capability flags, and the addition of a dedicated scoring pipeline in the gateway. Feedback highlights missing fields in the protobuf definition for truncation settings and the use of a hardcoded request ID in the pipeline. Additionally, improvements are suggested for parallelizing document processing in the servicer and correcting misleading documentation regarding the gRPC pipeline implementation.

Comment on lines +275 to +279
message ScoreRequest {
string request_id = 1;
string text_1 = 2;
repeated string text_2 = 3;
}

medium

The ScoreRequest message is missing the truncate_prompt_tokens field, which is present in the ScoreRequest protocol definition in crates/protocols/src/rerank.rs. Without this field in the proto, the gateway cannot pass truncation settings to the vLLM worker.

Suggested change
message ScoreRequest {
string request_id = 1;
string text_1 = 2;
repeated string text_2 = 3;
}
message ScoreRequest {
string request_id = 1;
string text_1 = 2;
repeated string text_2 = 3;
optional uint32 truncate_prompt_tokens = 4;
}
References
  1. For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.

Comment on lines +676 to +687
pub fn build_score_request(
&self,
request_id: String,
text_1: String,
text_2: Vec<String>,
) -> proto::ScoreRequest {
proto::ScoreRequest {
request_id,
text_1,
text_2,
}
}

medium

Update build_score_request to include the truncate_prompt_tokens parameter to match the updated proto and protocol definitions.

Suggested change
pub fn build_score_request(
&self,
request_id: String,
text_1: String,
text_2: Vec<String>,
) -> proto::ScoreRequest {
proto::ScoreRequest {
request_id,
text_1,
text_2,
}
}
pub fn build_score_request(
&self,
request_id: String,
text_1: String,
text_2: Vec<String>,
truncate_prompt_tokens: Option<u32>,
) -> proto::ScoreRequest {
proto::ScoreRequest {
request_id,
text_1,
text_2,
truncate_prompt_tokens,
}
}
References
  1. For builder methods that construct data structures mapping directly to a wire format, it is acceptable to have many arguments if they correspond one-to-one with the wire-format fields.

results = []
total_prompt_tokens = 0

for i, text_2_item in enumerate(request.text_2):

medium

The documents in request.text_2 are processed sequentially using an async for loop inside a for loop. For reranking tasks with many documents, this will be significantly slower than processing them in parallel. Consider using asyncio.gather to run the encode tasks concurrently.
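A minimal sketch of the suggested asyncio.gather pattern, using a stand-in `FakeEngine` rather than the real vLLM engine API (whose `encode` signature and return type differ; this only demonstrates the fan-out and that gather preserves input order):

```python
import asyncio

async def score_all(engine, text_1: str, text_2: list[str]) -> list[float]:
    """Launch one encode task per document and await them concurrently;
    asyncio.gather returns results in the same order as text_2."""
    tasks = [engine.encode(text_1, doc) for doc in text_2]
    return list(await asyncio.gather(*tasks))

class FakeEngine:
    """Hypothetical stand-in for the vLLM engine used only in this sketch."""
    async def encode(self, text_1, doc):
        await asyncio.sleep(0)       # yield to the event loop
        return float(len(doc))       # stand-in for a relevance score

scores = asyncio.run(score_all(FakeEngine(), "query", ["aa", "bbbb"]))
```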

Comment on lines +332 to +341
let stages: Vec<Box<dyn PipelineStage>> = vec![
Box::new(super::regular::stages::score::ScorePreparationStage),
Box::new(WorkerSelectionStage::new(
worker_registry,
policy_registry,
WorkerSelectionMode::Regular, // Score is always single-worker
)),
Box::new(ClientAcquisitionStage),
Box::new(super::regular::stages::score::ScoreNativeStage::new()),
];

medium

The new_score pipeline is missing the DispatchMetadataStage. This stage is essential for generating a unique request ID and populating dispatch metadata, which should be used by the ScoreNativeStage instead of a hardcoded ID.

Suggested change
let stages: Vec<Box<dyn PipelineStage>> = vec![
Box::new(super::regular::stages::score::ScorePreparationStage),
Box::new(WorkerSelectionStage::new(
worker_registry,
policy_registry,
WorkerSelectionMode::Regular, // Score is always single-worker
)),
Box::new(ClientAcquisitionStage),
Box::new(super::regular::stages::score::ScoreNativeStage::new()),
];
let stages: Vec<Box<dyn PipelineStage>> = vec![
Box::new(super::regular::stages::score::ScorePreparationStage),
Box::new(WorkerSelectionStage::new(
worker_registry,
policy_registry,
WorkerSelectionMode::Regular, // Score is always single-worker
)),
Box::new(ClientAcquisitionStage),
Box::new(DispatchMetadataStage),
Box::new(super::regular::stages::score::ScoreNativeStage::new()),
];

Comment on lines +1100 to +1102
/// (`/v1/score`). The `ScoreHttpForwardStage` always returns
/// `Ok(Some(response))`, so the loop below will return the proxied response
/// before reaching the `final_response` check at the bottom.

medium

The documentation comment for execute_score is incorrect. It describes HTTP forwarding and a non-existent ScoreHttpForwardStage, while the actual implementation uses a gRPC pipeline with ScoreNativeStage.

    /// Execute the complete pipeline for a Score API request.
    ///
    /// Score requests are processed via the gRPC pipeline using the ScoreNativeStage.
    /// This stage executes the Native Score RPC via the client and returns
    /// Ok(Some(response)) to short-circuit the pipeline.

Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@ppraneth ppraneth marked this pull request as ready for review April 3, 2026 10:53
# The output data for score is a relevance score wrapped in a tensor.
# vLLM versions return different structures — normalize to scalar.
data = final_output.outputs.data
logger.info("Score data type=%s repr=%s", type(data).__name__, repr(data)[:200])

🟡 Nit: This logger.info fires for every text_2 item in every score request, logging repr(data). In production with batched requests this will be extremely noisy. Should be logger.debug — the INFO log on line 272 already records the request arrival.

Comment on lines +331 to +333
while isinstance(raw, (list, tuple)) and len(raw) > 0:
raw = raw[0]
score_value = float(raw)

🟡 Nit: If the model returns an empty list/tuple (e.g. []), the while loop exits without unwrapping and float(raw) is called on the empty list, raising an unhelpful TypeError. Consider adding a guard:

if isinstance(raw, (list, tuple)):
    # exhausted without finding a scalar
    msg = f"Score request {request_id}_{i} returned empty data: {repr(data)[:200]}"
    logger.warning(msg)
    await context.abort(grpc.StatusCode.INTERNAL, msg)
score_value = float(raw)
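Combining the unwrap loop with the suggested guard, a hypothetical `unwrap_scalar` helper (illustrative, not the PR's code) might look like:

```python
def unwrap_scalar(raw):
    """Unwrap nested single-element lists/tuples down to a scalar,
    raising a clear error instead of calling float() on an empty
    container."""
    while isinstance(raw, (list, tuple)):
        if len(raw) == 0:
            raise ValueError(f"score output exhausted without a scalar: {raw!r}")
        raw = raw[0]
    return float(raw)
```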


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dc36dc55ca


Comment on lines +96 to +99
return Err(error::internal_error(
"score_backend_unsupported",
"Score is only supported on vLLM backend",
));

P2 Badge Return 4xx instead of 500 for unsupported /v1/score backends

If /v1/score is called while SMG is running in gRPC mode with a non-vLLM backend (for example, SGLang/TRT-LLM), this branch converts the backend-capability mismatch into internal_error, which surfaces as HTTP 500. Since the endpoint is registered unconditionally, this is a predictable user-input/path mismatch rather than a server fault, and returning 500 will mislead clients and inflate server-error metrics.


Comment on lines +91 to +93
let proto_req =
match client.build_score_request(request_id, score_req.text_1.clone(), text_2) {
Ok(req) => req,

P2 Badge Preserve /v1/score truncation options in gRPC requests

The gRPC score path only forwards request_id, text_1, and text_2, so ScoreRequest options like truncate_prompt_tokens (and encoding_format) are silently dropped in gRPC mode. This creates behavior drift from HTTP passthrough mode and can cause long text-pair requests that depend on truncation to fail or behave differently when routed through the gRPC pipeline.


ppraneth added 2 commits April 3, 2026 16:30
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/protocols/src/model_type.rs`:
- Around line 37-39: The JSON schema enum for ModelType is missing the "score"
variant even though ModelType includes the SCORE flag and CAPABILITY_NAMES/serde
accept "score"; update the ModelType JsonSchema implementation to add "score" to
the enum list (and mirror this change wherever the manual enum list is
duplicated, e.g., the other JsonSchema/enum generation block referenced at the
second location). Locate the ModelType definition (including the const SCORE)
and the impl JsonSchema for ModelType and add "score" to the returned enum
values so schema-based validation and generated clients accept the score
capability.

In `@crates/protocols/src/rerank.rs`:
- Around line 233-252: Add validation for ScoreRequest similar to RerankRequest
by implementing a validation method (e.g., impl ScoreRequest::validate or
implementing the same Validate trait used by RerankRequest) that checks that
text_1 is not empty and that text_2 contains at least one non-empty document
(handle both String and Vec variants of StringOrVec). Return a suitable error
type on failure and call this validation where other request types are validated
so protocol-layer errors are consistent.

In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 287-341: The loop over request.text_2 in servicer.py currently
calls self.engine.encode per-item (see the for i, text_2_item in
enumerate(request.text_2) loop and use of pooling_params/request_id), which
serializes scoring and limits throughput; either implement batching by
aggregating prompts (build a list of TokensPrompt objects and call
self.engine.encode once to let the backend parallelize) or, if sequential
processing is intentional, add a clear TODO comment above the loop referencing
request.text_2, self.engine.encode and pooling_params that notes this is
single-item encoding and that batching should be considered in a follow-up for
high-throughput workloads.
- Around line 331-333: The unwrapping loop around the variable raw (while
isinstance(raw, (list, tuple)) and len(raw) > 0: raw = raw[0]) can leave raw as
an empty list/tuple and then float(raw) will raise; modify the unwrapping logic
in servicer.py around the raw handling to guard against empty containers by
checking after the loop whether raw is still a list/tuple (or empty) and either
raise a clear ValueError/TypeError with context (including the problematic raw)
or provide a sensible default, ensuring you only call float(raw) when raw is a
scalar value; reference the raw unwrapping code block and the float(raw) call
when applying the fix.

In `@model_gateway/src/routers/grpc/pipeline.rs`:
- Around line 1128-1129: Update the incorrect comment that references
ScoreHttpForwardStage: change it to refer to ScoreNativeStage so the comment
matches the implementation of the score pipeline (e.g., the comment near the
Ok(Some(response)) branch should explain that ScoreNativeStage returns Ok(Some)
to short-circuit and record success).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 057c93a3-855c-446b-989c-1c27d129717c

📥 Commits

Reviewing files that changed from the base of the PR and between 4c8635d and dc36dc5.

📒 Files selected for processing (20)
  • crates/grpc_client/proto/vllm_engine.proto
  • crates/grpc_client/python/pyproject.toml
  • crates/grpc_client/src/vllm_engine.rs
  • crates/protocols/src/model_type.rs
  • crates/protocols/src/rerank.rs
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py
  • model_gateway/src/observability/metrics.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/common/stages/dispatch_metadata.rs
  • model_gateway/src/routers/grpc/context.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/harmony/stages/response_processing.rs
  • model_gateway/src/routers/grpc/pipeline.rs
  • model_gateway/src/routers/grpc/regular/stages/mod.rs
  • model_gateway/src/routers/grpc/regular/stages/score/mod.rs
  • model_gateway/src/routers/grpc/router.rs
  • model_gateway/src/routers/http/router.rs
  • model_gateway/src/routers/mod.rs
  • model_gateway/src/routers/router_manager.rs
  • model_gateway/src/server.rs

Comment on lines +37 to 39
/// Score/cross-encoder reranker models (vLLM /v1/score)
const SCORE = 1 << 12;


⚠️ Potential issue | 🟠 Major

Add "score" to ModelType JSON schema enum values.

CAPABILITY_NAMES/serde now accept "score", but the manual JsonSchema enum list still omits it. This can cause schema-based validation or generated clients to reject valid configs.

🛠️ Proposed fix
@@
             enum_values: Some(vec![
                 "chat".into(),
                 "completions".into(),
                 "responses".into(),
                 "embeddings".into(),
                 "rerank".into(),
                 "generate".into(),
                 "vision".into(),
                 "tools".into(),
                 "reasoning".into(),
                 "image_gen".into(),
                 "audio".into(),
                 "moderation".into(),
+                "score".into(),
             ]),

Also applies to: 87-87


Comment on lines +233 to +252
#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
pub struct ScoreRequest {
/// The model to use for scoring
pub model: String,

/// The query/source text (single string)
pub text_1: String,

/// The document(s) to score against the query.
/// Can be a single string or a list of strings.
pub text_2: StringOrVec,

/// Optional encoding format for the response
#[serde(skip_serializing_if = "Option::is_none")]
pub encoding_format: Option<String>,

/// Whether to truncate the input
#[serde(skip_serializing_if = "Option::is_none")]
pub truncate_prompt_tokens: Option<u32>,
}

🧹 Nitpick | 🔵 Trivial

Consider adding validation for ScoreRequest.

Unlike RerankRequest which has validation for non-empty query and documents, ScoreRequest lacks validation. Consider adding:

  • Non-empty text_1 validation
  • Non-empty text_2 validation (at least one document to score)

This would provide consistent error handling at the protocol layer rather than at the backend.

♻️ Example validation addition
+use validator::Validate;
+
 #[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
+#[derive(Validate)]
 pub struct ScoreRequest {
     /// The model to use for scoring
     pub model: String,

     /// The query/source text (single string)
+    #[validate(custom(function = "validate_text_1"))]
     pub text_1: String,

     /// The document(s) to score against the query.
     /// Can be a single string or a list of strings.
+    #[validate(custom(function = "validate_text_2"))]
     pub text_2: StringOrVec,
     // ... rest unchanged
 }
+
+fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
+    if text.trim().is_empty() {
+        return Err(validator::ValidationError::new("text_1 cannot be empty"));
+    }
+    Ok(())
+}
+
+fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
+    if text_2.is_empty() {
+        return Err(validator::ValidationError::new("text_2 cannot be empty"));
+    }
+    Ok(())
+}
📝 Committable suggestion


Suggested change

Before:

#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
pub struct ScoreRequest {
    /// The model to use for scoring
    pub model: String,

    /// The query/source text (single string)
    pub text_1: String,

    /// The document(s) to score against the query.
    /// Can be a single string or a list of strings.
    pub text_2: StringOrVec,

    /// Optional encoding format for the response
    #[serde(skip_serializing_if = "Option::is_none")]
    pub encoding_format: Option<String>,

    /// Whether to truncate the input
    #[serde(skip_serializing_if = "Option::is_none")]
    pub truncate_prompt_tokens: Option<u32>,
}

After:

use validator::Validate;

#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema, Validate)]
pub struct ScoreRequest {
    /// The model to use for scoring
    pub model: String,

    /// The query/source text (single string)
    #[validate(custom(function = "validate_text_1"))]
    pub text_1: String,

    /// The document(s) to score against the query.
    /// Can be a single string or a list of strings.
    #[validate(custom(function = "validate_text_2"))]
    pub text_2: StringOrVec,

    /// Optional encoding format for the response
    #[serde(skip_serializing_if = "Option::is_none")]
    pub encoding_format: Option<String>,

    /// Whether to truncate the input
    #[serde(skip_serializing_if = "Option::is_none")]
    pub truncate_prompt_tokens: Option<u32>,
}

fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
    if text.trim().is_empty() {
        return Err(validator::ValidationError::new("text_1 cannot be empty"));
    }
    Ok(())
}

fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
    if text_2.is_empty() {
        return Err(validator::ValidationError::new("text_2 cannot be empty"));
    }
    Ok(())
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/protocols/src/rerank.rs` around lines 233 - 252, Add validation for
ScoreRequest similar to RerankRequest by implementing a validation method (e.g.,
impl ScoreRequest::validate or implementing the same Validate trait used by
RerankRequest) that checks that text_1 is not empty and that text_2 contains at
least one non-empty document (handle both String and Vec variants of
StringOrVec). Return a suitable error type on failure and call this validation
where other request types are validated so protocol-layer errors are consistent.

Comment on lines +331 to +333
while isinstance(raw, (list, tuple)) and len(raw) > 0:
    raw = raw[0]
score_value = float(raw)

⚠️ Potential issue | 🟡 Minor

Edge case: an empty nested structure causes a TypeError.

The while loop unwraps nested lists/tuples, but if the innermost non-empty container holds an empty container, raw[0] will be assigned that empty container, then the next iteration's len(raw) > 0 check will fail and float(raw) will be called on an empty list/tuple, raising a TypeError.

While this edge case is unlikely given vLLM's output format, consider adding a guard:

🛡️ Proposed defensive fix
                 while isinstance(raw, (list, tuple)) and len(raw) > 0:
                     raw = raw[0]
+                if isinstance(raw, (list, tuple)):
+                    raise ValueError(f"Could not extract scalar score from nested structure: {data!r}")
                 score_value = float(raw)
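As a standalone check of the edge case above, the guarded unwrap can be exercised directly. This is a sketch — the helper name `extract_scalar_score` is illustrative, not the servicer's actual function:

```python
def extract_scalar_score(data):
    """Unwrap arbitrarily nested lists/tuples down to a scalar float.

    Raises ValueError when the innermost container is empty, instead of
    letting float() fail later with a less helpful TypeError.
    """
    raw = data
    while isinstance(raw, (list, tuple)) and len(raw) > 0:
        raw = raw[0]
    if isinstance(raw, (list, tuple)):
        # Loop exited because the container was empty, not because we
        # reached a scalar — fail with context.
        raise ValueError(f"Could not extract scalar score from nested structure: {data!r}")
    return float(raw)

print(extract_scalar_score([[0.93]]))  # unwraps two levels to 0.93
try:
    extract_scalar_score([[]])  # innermost container is empty
except ValueError as exc:
    print("rejected:", exc)
```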
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 331 - 333, The
unwrapping loop around the variable raw (while isinstance(raw, (list, tuple))
and len(raw) > 0: raw = raw[0]) can leave raw as an empty list/tuple and then
float(raw) will raise; modify the unwrapping logic in servicer.py around the raw
handling to guard against empty containers by checking after the loop whether
raw is still a list/tuple (or empty) and either raise a clear
ValueError/TypeError with context (including the problematic raw) or provide a
sensible default, ensuring you only call float(raw) when raw is a scalar value;
reference the raw unwrapping code block and the float(raw) call when applying
the fix.

Comment on lines +1128 to +1129
Ok(Some(response)) => {
    // ScoreHttpForwardStage returns Ok(Some) to short-circuit — record success

⚠️ Potential issue | 🟡 Minor

Comment references wrong stage name.

The comment mentions ScoreHttpForwardStage, but the score pipeline uses ScoreNativeStage. This appears to be a copy-paste artifact.

📝 Suggested fix
                 Ok(Some(response)) => {
-                    // ScoreHttpForwardStage returns Ok(Some) to short-circuit — record success
+                    // ScoreNativeStage returns Ok(Some) to short-circuit — record success
                     Metrics::record_router_duration(
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

Ok(Some(response)) => {
    // ScoreHttpForwardStage returns Ok(Some) to short-circuit — record success

After:

Ok(Some(response)) => {
    // ScoreNativeStage returns Ok(Some) to short-circuit — record success
    Metrics::record_router_duration(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/pipeline.rs` around lines 1128 - 1129, Update
the incorrect comment that references ScoreHttpForwardStage: change it to refer
to ScoreNativeStage so the comment matches the implementation of the score
pipeline (e.g., the comment near the Ok(Some(response)) branch should explain
that ScoreNativeStage returns Ok(Some) to short-circuit and record success).

/// Create a Score API pipeline for vLLM `/v1/score` endpoint.
///
/// The `/v1/score` endpoint was added to the vLLM gRPC specs. This pipeline uses
/// three stages:

🟡 Nit: Doc says "three stages" but the pipeline actually has four — ScorePreparationStage is in the code (line 333) but missing from the numbered list. Should be:

  1. ScorePreparationStage — extracts text_1 for worker routing
  2. WorkerSelectionStage — picks the target worker
  3. ClientAcquisitionStage — obtains the gRPC client
  4. ScoreNativeStage — executes the Score RPC


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da212d0f9e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


ctx.state.preparation = Some(PreparationOutput {
    original_text: Some(original_text),
    token_ids: Vec::new(), // Scoring worker routing doesn't strictly need accurate token lengths

P1 Badge Populate score token_ids before worker selection

ScorePreparationStage always sets token_ids to an empty vector, so WorkerSelectionStage forwards tokens: None for every /v1/score request. In deployments using the prefix_hash policy, select_worker_impl returns None when tokens are missing, which makes the score path fail with model_not_found even though healthy workers exist. This is a hard routing failure for any model configured with token-based policies.

Useful? React with 👍 / 👎.
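To illustrate why missing tokens break token-based policies, here is a hypothetical Python sketch of a prefix-hash selector. The real select_worker_impl is Rust and its hashing details differ — the function name, hashing scheme, and prefix length below are assumptions for illustration only:

```python
import hashlib
from typing import Optional, Sequence

def select_worker_prefix_hash(
    workers: Sequence[str],
    tokens: Optional[Sequence[int]],
    prefix_len: int = 16,
) -> Optional[str]:
    """Hash the leading token ids so requests sharing a prompt prefix
    land on the same worker (and its KV cache). With no tokens there is
    nothing to hash, so selection fails and returns None — mirroring the
    score-path failure when token_ids is left empty."""
    if not workers or not tokens:
        return None
    digest = hashlib.sha256(repr(list(tokens[:prefix_len])).encode()).digest()
    return workers[int.from_bytes(digest[:4], "big") % len(workers)]

workers = ["worker-a", "worker-b"]
print(select_worker_prefix_hash(workers, None))       # None: no tokens to hash
print(select_worker_prefix_hash(workers, [1, 2, 3]))  # deterministic pick
```

Populating token_ids in ScorePreparationStage (even approximately) would give this policy something to hash instead of forcing the None branch.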

Comment on lines +336 to +340
policy_registry,
WorkerSelectionMode::Regular, // Score is always single-worker
)),
Box::new(ClientAcquisitionStage),
Box::new(super::regular::stages::score::ScoreNativeStage::new()),

P2 Badge Preserve worker load/circuit accounting in score pipeline

The new score pipeline skips RequestExecutionStage and calls client.score(...) directly, which bypasses the normal WorkerLoadGuard lifecycle and worker outcome recording done in request_execution.rs. As a result, score traffic is invisible to load-based balancing and does not feed circuit-breaker outcomes, so under real score load the router can keep selecting overloaded or failing workers instead of adapting like other gRPC endpoints.

Useful? React with 👍 / 👎.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
crates/protocols/src/rerank.rs (1)

233-252: ⚠️ Potential issue | 🟠 Major

Add protocol-layer validation for ScoreRequest fields.

At Line 233, ScoreRequest is public but not validated, so empty text_1 and empty/blank text_2 inputs can pass protocol parsing and fail later in backend-specific ways. Please reject these early, similar to RerankRequest.

♻️ Proposed fix
-#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
+#[derive(Debug, Clone, Deserialize, Serialize, Validate, schemars::JsonSchema)]
+#[validate(schema(function = "validate_score_request"))]
 pub struct ScoreRequest {
     /// The model to use for scoring
     pub model: String,

     /// The query/source text (single string)
+    #[validate(custom(function = "validate_text_1"))]
     pub text_1: String,

     /// The document(s) to score against the query.
     /// Can be a single string or a list of strings.
+    #[validate(custom(function = "validate_text_2"))]
     pub text_2: StringOrVec,
@@
 }
+
+fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
+    if text.trim().is_empty() {
+        return Err(validator::ValidationError::new("text_1 cannot be empty"));
+    }
+    Ok(())
+}
+
+fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
+    match text_2 {
+        StringOrVec::Single(s) if s.trim().is_empty() => {
+            Err(validator::ValidationError::new("text_2 cannot be empty"))
+        }
+        StringOrVec::Array(v) if v.is_empty() || v.iter().any(|s| s.trim().is_empty()) => {
+            Err(validator::ValidationError::new("text_2 contains empty entries"))
+        }
+        _ => Ok(()),
+    }
+}
+
+#[expect(
+    clippy::unnecessary_wraps,
+    reason = "validator crate requires Result return type"
+)]
+fn validate_score_request(_req: &ScoreRequest) -> Result<(), validator::ValidationError> {
+    Ok(())
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/protocols/src/rerank.rs` around lines 233 - 252, ScoreRequest is not
validated at protocol layer allowing empty text_1 or empty/blank text_2 to pass;
add a validation routine (implement a Validate-like method or trait) for the
ScoreRequest struct that checks: model is present if required, text_1 is
non-empty/non-whitespace, and text_2 (StringOrVec) contains at least one
non-empty/non-whitespace string (if it's a Vec ensure no empty entries; if it's
a single String ensure it's non-blank); wire this validator into the same
protocol parsing path used by RerankRequest so invalid requests are rejected
early with a clear error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 31-36: The code currently calls mm_inputs(...) unconditionally
which raises a 'NoneType' object is not callable when the vllm multimodal
constructor failed to import; update the handling so that before entering the
preprocessed multimodal path (i.e., when has_preprocessed_mm is true or inside
_build_preprocessed_mm_inputs) you check that mm_inputs is not None and
VllmMultiModalInputs is available, and if not raise a clear
UnsupportedOperation/RuntimeError with a message like "multimodal preprocessing
not supported in this vllm version" (or fail fast during initialization if you
prefer); ensure checks reference mm_inputs, VllmMultiModalInputs,
_build_preprocessed_mm_inputs and has_preprocessed_mm so callers hit a clear
error instead of a NoneType call.
- Around line 300-306: The code is passing the raw token_type_ids list into
pooling_params.extra_kwargs, but vLLM expects a compressed integer index named
"compressed_token_type_ids"; update the branch in servicer.py that handles
encoded["token_type_ids"] to import and call
compress_token_type_ids(token_type_ids) and set pair_pooling_params.extra_kwargs
= {"compressed_token_type_ids": compressed_index} (keeping the rest of
pooling_params cloning logic and names like pooling_params, pair_pooling_params,
and encoded unchanged).

---

Duplicate comments:
In `@crates/protocols/src/rerank.rs`:
- Around line 233-252: ScoreRequest is not validated at protocol layer allowing
empty text_1 or empty/blank text_2 to pass; add a validation routine (implement
a Validate-like method or trait) for the ScoreRequest struct that checks: model
is present if required, text_1 is non-empty/non-whitespace, and text_2
(StringOrVec) contains at least one non-empty/non-whitespace string (if it's a
Vec ensure no empty entries; if it's a single String ensure it's non-blank);
wire this validator into the same protocol parsing path used by RerankRequest so
invalid requests are rejected early with a clear error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9216d459-c62e-44a5-bb30-dd1be6b11677

📥 Commits

Reviewing files that changed from the base of the PR and between dc36dc5 and da212d0.

📒 Files selected for processing (3)
  • crates/grpc_client/python/smg_grpc_proto/proto
  • crates/protocols/src/rerank.rs
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py

Comment on lines +300 to +306
# Pass token_type_ids via pooling_params if present
pair_pooling_params = pooling_params
if "token_type_ids" in encoded:
    pair_pooling_params = pooling_params.clone()
    pair_pooling_params.extra_kwargs = {
        "token_type_ids": encoded["token_type_ids"],
    }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "*.py" | grep -E "(servicer|vllm)" | head -20

Repository: lightseekorg/smg

Length of output: 655


🏁 Script executed:

cat -n grpc_servicer/smg_grpc_servicer/vllm/servicer.py | sed -n '290,320p'

Repository: lightseekorg/smg

Length of output: 1709


🏁 Script executed:

rg "compress_token_type_ids" --type py

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

rg "token_type_ids" grpc_servicer/smg_grpc_servicer/vllm/ --type py -B 3 -A 3

Repository: lightseekorg/smg

Length of output: 1147


🏁 Script executed:

rg "pooling_params" grpc_servicer/smg_grpc_servicer/vllm/ --type py -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 2711


🏁 Script executed:

rg "compressed_token_type_ids\|compress_token_type_ids" --type py

Repository: lightseekorg/smg

Length of output: 42


🌐 Web query:

vLLM pooling_params extra_kwargs token_type_ids compressed_token_type_ids

💡 Result:

In vLLM's pooling API (used for tasks like embeddings, classification, scoring with models like cross-encoders), PoolingParams is a configuration class that includes an 'extra_kwargs' field: dict[str, Any] | None = None. This field stores additional parameters for internal use. token_type_ids is an optional field in TokensPrompt (vllm.inputs), a list[int] used for cross-encoder models to distinguish query and document tokens (typically [0]*query_len + [1]*doc_len). When token_type_ids is provided in the engine prompt (e.g., during LLM.score for cross-encoder scoring), vLLM processes it as follows: - Pops "token_type_ids" from the prompt. - Clones the default PoolingParams. - Computes compressed_token_type_ids = compress_token_type_ids(token_type_ids), where compress_token_type_ids returns the index of the first 1 in the list (position where document tokens start, assuming format of zeros followed by ones) or the list length if no 1s. - Sets pooling_params.extra_kwargs = {"compressed_token_type_ids": compressed} This compressed value (an int) is passed via extra_kwargs to the pooling layer (e.g., for token-wise pooling in cross-encoder score computation), likely to efficiently identify the boundary without storing the full list. Sources: vLLM GitHub (pooling_params.py, llm.py, score_utils.py), docs (inputs, score_utils, pooling score serving).


🌐 Web query:

vLLM score task token_type_ids handling official documentation

💡 Result:

vLLM supports handling of token_type_ids primarily for cross-encoder models used in score tasks. TokensPrompt includes an optional token_type_ids field: "A list of token type IDs to pass to the cross encoder model." This allows users to provide pre-tokenized inputs with segment IDs (e.g., 0 for query, 1 for document in BERT-like models). In the LLM class (offline inference), when token_type_ids are present in the prompt, they are extracted, compressed via compress_token_type_ids (likely run-length encoding for efficiency), and passed as extra_kwargs={"compressed_token_type_ids": compressed} to PoolingParams for the score/classify task. For score tasks (cross-encoders), users typically provide text pairs (query, document), which vLLM tokenizes internally using get_score_prompt, applying score templates. These may generate token_type_ids automatically during tokenization if the tokenizer supports it (e.g., BERT tokenizer with text_pair). However, users can bypass internal tokenization by providing TokensPrompt with explicit token_type_ids. There was a bug in AsyncLLMEngine (online serving) where token_type_ids from TokensPrompt were lost for score tasks (v0.8.5.post1, issue #17743, closed Aug 2025), requiring a patch to InputPreprocessor to forward them. This suggests support exists but had an async-specific issue, likely fixed post-closure. Score task uses pooling runner (PoolingParams with task="score", deprecated in recent versions favoring classify for cross-encoders). No explicit PoolingParams field for token_type_ids; handled internally via extra_kwargs. Overall, token_type_ids handling is integrated for cross-encoder score/classify tasks via TokensPrompt and passed to model executor, with compression for efficiency. Primarily useful when providing pre-tokenized inputs; internal text tokenization may handle automatically for supported models.


🌐 Web query:

vLLM InputPreprocessor _process_tokens token_type_ids

💡 Result:

In vLLM's InputPreprocessor class (vllm.inputs.preprocess.py), the _process_tokens method processes a TokensPrompt object. It extracts prompt_token_ids from parsed_content["prompt_token_ids"] and optionally retrieves token_type_ids using parsed_content.get("token_type_ids"). If multi_modal_data is present in parsed_content, it calls _process_multimodal (or async equivalent) with prompt_token_ids and other data. Otherwise, it constructs a TokenInputs object passing both prompt_token_ids and token_type_ids. token_type_ids supports models like cross-encoders (e.g., for query-document scoring) that require distinguishing token types, similar to BERT-style token_type_ids (0 for first sequence, 1 for second). There was a bug (GitHub issue #17743, fixed by Aug 2025) where token_type_ids were lost in async processing, but it's resolved in recent versions. Example code snippet from docs: def _process_tokens(self, parsed_content: TokensPrompt, ...): prompt_token_ids = parsed_content["prompt_token_ids"] token_type_ids = parsed_content.get("token_type_ids") ... else: inputs = token_inputs( prompt_token_ids=prompt_token_ids, token_type_ids=token_type_ids, ) TokensPrompt schema: prompt_token_ids: list[int], token_type_ids: list[int] (optional).


Pass compressed token_type_ids via extra_kwargs, not the raw list.

vLLM's cross-encoder score/classify path expects pooling_params.extra_kwargs["compressed_token_type_ids"] as an integer (the position of the first document token), not the raw token_type_ids list. vLLM computes this by calling compress_token_type_ids(token_type_ids), which returns the index of the first 1 in the list. Passing the raw list here will cause incorrect score computation. (vLLM pooling_params.py, vLLM score_utils)

🐛 Proposed fix
                 if "token_type_ids" in encoded:
                     pair_pooling_params = pooling_params.clone()
                     pair_pooling_params.extra_kwargs = {
-                        "token_type_ids": encoded["token_type_ids"],
+                        "compressed_token_type_ids": compress_token_type_ids(
+                            encoded["token_type_ids"]
+                        ),
                     }

Import compress_token_type_ids from vLLM's score utilities.
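Based on the behavior described in the query results above (the index of the first 1, or the full length when no 1 is present), the compression can be sketched as follows. This is inferred behavior, not code copied from vLLM's score utilities:

```python
def compress_token_type_ids(token_type_ids):
    """Sketch of the compression described above: for BERT-style
    token_type_ids of the form [0]*query_len + [1]*doc_len, return the
    index of the first 1 (where document tokens begin), or the list
    length if no 1 appears. Inferred from the query results, not from
    vLLM source."""
    try:
        return token_type_ids.index(1)
    except ValueError:
        return len(token_type_ids)

# Query of 3 tokens followed by 2 document tokens: boundary at index 3.
print(compress_token_type_ids([0, 0, 0, 1, 1]))  # 3
```

This single integer is what belongs in `pooling_params.extra_kwargs["compressed_token_type_ids"]`; the full list is never forwarded.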

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 300 - 306, The
code is passing the raw token_type_ids list into pooling_params.extra_kwargs,
but vLLM expects a compressed integer index named "compressed_token_type_ids";
update the branch in servicer.py that handles encoded["token_type_ids"] to
import and call compress_token_type_ids(token_type_ids) and set
pair_pooling_params.extra_kwargs = {"compressed_token_type_ids":
compressed_index} (keeping the rest of pooling_params cloning logic and names
like pooling_params, pair_pooling_params, and encoded unchanged).

@mergify
Contributor

mergify bot commented Apr 3, 2026

Hi @ppraneth, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase PR has merge conflicts that need to be resolved label Apr 3, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 327-332: The ScoreResult construction unnecessarily wraps
score_value in float() even though score_value is already a float; in the
results.append call that creates vllm_engine_pb2.ScoreResult (the block
referencing results.append and vllm_engine_pb2.ScoreResult), remove the
redundant float() conversion and pass score_value directly as the score argument
(i.e., use score=score_value).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 417aa3d4-c4d9-4c0e-a613-40cf6f9ffcb4

📥 Commits

Reviewing files that changed from the base of the PR and between da212d0 and 288797a.

📒 Files selected for processing (1)
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 288797a7e6


(ModelType::IMAGE_GEN, "image_gen"),
(ModelType::AUDIO, "audio"),
(ModelType::MODERATION, "moderation"),
(ModelType::SCORE, "score"),

P2 Badge Keep ModelType schema in sync with new score capability

Adding ModelType::SCORE to CAPABILITY_NAMES makes runtime serialization emit "score", but the manual JsonSchema enum list in the same file still omits "score". That creates a schema/runtime mismatch where generated OpenAPI/JSON-schema validation can reject payloads that the code itself now produces for score-capable models.

Useful? React with 👍 / 👎.

Signed-off-by: ppraneth <pranethparuchuri@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/protocols/src/rerank.rs`:
- Around line 245-251: The truncate_prompt_tokens field is declared on the
public Rerank/score request but is ignored on the native gRPC path; update the
native plumbing to carry it or remove it until supported. Either (preferred) add
truncate_prompt_tokens to the native request/message and propagate it in
ScoreNativeStage (and any ScoreNativeRequest/ScoreNativeResponse structs, gRPC
proto/messages, and the native transport marshal/unmarshal code) so the native
path forwards and honors the truncation flag exactly like the HTTP path, or
remove truncate_prompt_tokens from the public struct in rerank.rs so the
contract is consistent; make sure to update any tests and code that build/parse
the native score request to reference the truncate_prompt_tokens symbol and not
silently drop it.

In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 314-316: The per-result score dump currently uses logger.info with
an expensive repr(data) and runs for every final_output.outputs.data (see the
logger.info line and raw = data), which floods INFO on large rerank requests;
change that log to logger.debug and avoid building the repr unless debug is
enabled (e.g. check logger.isEnabledFor(logging.DEBUG) before computing
repr(data)), or remove the log entirely if not needed, leaving raw = data
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3c420d23-42af-47bb-a216-546f1a37e34d

📥 Commits

Reviewing files that changed from the base of the PR and between 288797a and 0a0fed8.

📒 Files selected for processing (4)
  • crates/grpc_client/proto/vllm_engine.proto
  • crates/protocols/src/rerank.rs
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py
  • model_gateway/src/routers/grpc/regular/stages/score/mod.rs

Comment on lines +245 to +251
/// Optional encoding format for the response
#[serde(skip_serializing_if = "Option::is_none")]
pub encoding_format: Option<String>,

/// Whether to truncate the input
#[serde(skip_serializing_if = "Option::is_none")]
pub truncate_prompt_tokens: Option<u32>,

⚠️ Potential issue | 🟠 Major

truncate_prompt_tokens is exposed here but ignored on the native path.

This field is part of the public /v1/score contract now, but the native gRPC transport and ScoreNativeStage only forward request_id, text_1, and text_2. That means HTTP passthrough can honor truncation while native gRPC silently drops it, so the same request behaves differently between connection modes. Please either plumb it end-to-end or remove the field until it is supported.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/protocols/src/rerank.rs` around lines 245 - 251, The
truncate_prompt_tokens field is declared on the public Rerank/score request but
is ignored on the native gRPC path; update the native plumbing to carry it or
remove it until supported. Either (preferred) add truncate_prompt_tokens to the
native request/message and propagate it in ScoreNativeStage (and any
ScoreNativeRequest/ScoreNativeResponse structs, gRPC proto/messages, and the
native transport marshal/unmarshal code) so the native path forwards and honors
the truncation flag exactly like the HTTP path, or remove truncate_prompt_tokens
from the public struct in rerank.rs so the contract is consistent; make sure to
update any tests and code that build/parse the native score request to reference
the truncate_prompt_tokens symbol and not silently drop it.

Comment on lines +314 to +316
data = final_output.outputs.data
logger.info("Score data type=%s repr=%s", type(data).__name__, repr(data)[:200])
raw = data

⚠️ Potential issue | 🟡 Minor

Move the per-result score dump off INFO.

This runs once per text_2 item, so large rerank requests will emit one INFO log per candidate and serialize repr(data) on the hot path. Please downgrade it to DEBUG or remove it after bring-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 314 - 316, The
per-result score dump currently uses logger.info with an expensive repr(data)
and runs for every final_output.outputs.data (see the logger.info line and raw =
data), which floods INFO on large rerank requests; change that log to
logger.debug and avoid building the repr unless debug is enabled (e.g. check
logger.isEnabledFor(logging.DEBUG) before computing repr(data)), or remove the
log entirely if not needed, leaving raw = data unchanged.
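A minimal sketch of the suggested guard — logger name and helper function are illustrative. Gating on `isEnabledFor` means `repr(data)` is never built on the hot path unless DEBUG is actually active:

```python
import logging

logger = logging.getLogger("smg.score")

def log_score_payload(data):
    """Emit the per-result dump only at DEBUG, so large payloads are not
    serialized with repr() once per candidate on INFO-level deployments."""
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Score data type=%s repr=%s", type(data).__name__, repr(data)[:200])

logging.basicConfig(level=logging.INFO)
log_score_payload([0.1, 0.2])  # silent at INFO: repr() is never computed
```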

@mergify mergify bot removed the needs-rebase PR has merge conflicts that need to be resolved label Apr 4, 2026

Labels

dependencies Dependency updates grpc gRPC client and router changes model-gateway Model gateway crate changes protocols Protocols crate changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Support vLLM /v1/score endpoint for reranker models

1 participant