26 changes: 26 additions & 0 deletions crates/grpc_client/proto/vllm_engine.proto
@@ -14,6 +14,9 @@ service VllmEngine {
// Submit an embedding request
rpc Embed(EmbedRequest) returns (EmbedResponse);

// Submit a scoring/reranking request
rpc Score(ScoreRequest) returns (ScoreResponse);

// Health check
rpc HealthCheck(HealthCheckRequest) returns (HealthCheckResponse);

@@ -265,6 +268,29 @@ message EmbedResponse {
uint32 embedding_dim = 3;
}

// =====================
// Score/Rerank Request
// =====================

message ScoreRequest {
string request_id = 1;
string text_1 = 2;
repeated string text_2 = 3;
}
Comment on lines +275 to +279
Contributor
medium

The ScoreRequest message is missing the truncate_prompt_tokens field, which is present in the ScoreRequest protocol definition in crates/protocols/src/rerank.rs. Without this field in the proto, the gateway cannot pass truncation settings to the vLLM worker.

Suggested change
message ScoreRequest {
string request_id = 1;
string text_1 = 2;
repeated string text_2 = 3;
}
message ScoreRequest {
string request_id = 1;
string text_1 = 2;
repeated string text_2 = 3;
optional uint32 truncate_prompt_tokens = 4;
}
References
  1. For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.


message ScoreResult {
uint32 index = 1;
float score = 2;
}

message ScoreResponse {
repeated ScoreResult data = 1;
uint32 prompt_tokens = 2;
uint32 total_tokens = 3;
string request_id = 4;
int64 created = 5;
}

// =====================
// Management Operations
// =====================
2 changes: 1 addition & 1 deletion crates/grpc_client/python/pyproject.toml
@@ -9,7 +9,7 @@ description = "SMG gRPC proto definitions for SGLang, vLLM, and TRT-LLM"
requires-python = ">=3.10"
dependencies = [
"grpcio>=1.78.0",
"protobuf>=5.26.0",
"protobuf>=5.26.0,<7.0.0",
]
readme = "README.md"
license = { text = "Apache-2.0" }
2 changes: 1 addition & 1 deletion crates/grpc_client/python/smg_grpc_proto/proto
34 changes: 34 additions & 0 deletions crates/grpc_client/src/vllm_engine.rs
@@ -668,6 +668,40 @@ impl VllmEngineClient {
Ok(response.into_inner())
}

/// Build a ScoreRequest for cross-encoder reranking
#[expect(
clippy::unused_self,
reason = "method receiver kept for consistent public API across gRPC backends"
)]
pub fn build_score_request(
&self,
request_id: String,
text_1: String,
text_2: Vec<String>,
) -> proto::ScoreRequest {
proto::ScoreRequest {
request_id,
text_1,
text_2,
}
}
Comment on lines +676 to +687
Contributor
medium

Update build_score_request to include the truncate_prompt_tokens parameter to match the updated proto and protocol definitions.

Suggested change
pub fn build_score_request(
&self,
request_id: String,
text_1: String,
text_2: Vec<String>,
) -> proto::ScoreRequest {
proto::ScoreRequest {
request_id,
text_1,
text_2,
}
}
pub fn build_score_request(
&self,
request_id: String,
text_1: String,
text_2: Vec<String>,
truncate_prompt_tokens: Option<u32>,
) -> proto::ScoreRequest {
proto::ScoreRequest {
request_id,
text_1,
text_2,
truncate_prompt_tokens,
}
}
References
  1. For builder methods that construct data structures mapping directly to a wire format, it is acceptable to have many arguments if they correspond one-to-one with the wire-format fields.
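
To make the suggested signature concrete, here is a caller-side sketch. The `ScoreRequest` below is a hand-written stand-in with the same fields as the suggested message; the real `proto::ScoreRequest` is generated from `vllm_engine.proto` by prost/tonic, so names and types here are assumptions mirroring the diff:

```rust
// Stand-in mirroring the suggested proto::ScoreRequest fields
// (the real type is generated from vllm_engine.proto).
#[derive(Debug, Clone)]
pub struct ScoreRequest {
    pub request_id: String,
    pub text_1: String,
    pub text_2: Vec<String>,
    pub truncate_prompt_tokens: Option<u32>,
}

// Same shape as the suggested build_score_request:
// one argument per wire-format field.
pub fn build_score_request(
    request_id: String,
    text_1: String,
    text_2: Vec<String>,
    truncate_prompt_tokens: Option<u32>,
) -> ScoreRequest {
    ScoreRequest {
        request_id,
        text_1,
        text_2,
        truncate_prompt_tokens,
    }
}

fn main() {
    let req = build_score_request(
        "req-1".into(),
        "What is the capital of France?".into(),
        vec!["Paris is the capital.".into(), "London is in England.".into()],
        Some(512),
    );
    assert_eq!(req.text_2.len(), 2);
    assert_eq!(req.truncate_prompt_tokens, Some(512));
}
```

Because every argument maps one-to-one onto a wire field, callers that don't need truncation simply pass `None`, keeping the builder backward compatible at call sites.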


/// Submit a scoring request
pub async fn score(
&self,
req: proto::ScoreRequest,
) -> Result<proto::ScoreResponse, tonic::Status> {
let mut client = self.client.clone();
let mut request = Request::new(req);

if let Err(e) = self.trace_injector.inject(request.metadata_mut()) {
warn!("Failed to inject trace context: {}", e);
}

let response = client.score(request).await?;
Ok(response.into_inner())
}

fn build_grpc_sampling_params_from_completion(
request: &CompletionRequest,
) -> Result<proto::SamplingParams, String> {
25 changes: 25 additions & 0 deletions crates/protocols/src/model_type.rs
@@ -34,6 +34,8 @@ bitflags! {
const AUDIO = 1 << 10;
/// Content moderation models
const MODERATION = 1 << 11;
/// Score/cross-encoder reranker models (vLLM /v1/score)
const SCORE = 1 << 12;

Comment on lines +37 to 39
⚠️ Potential issue | 🟠 Major

Add "score" to ModelType JSON schema enum values.

CAPABILITY_NAMES/serde now accept "score", but the manual JsonSchema enum list still omits it. This can cause schema-based validation or generated clients to reject valid configs.

🛠️ Proposed fix
@@
             enum_values: Some(vec![
                 "chat".into(),
                 "completions".into(),
                 "responses".into(),
                 "embeddings".into(),
                 "rerank".into(),
                 "generate".into(),
                 "vision".into(),
                 "tools".into(),
                 "reasoning".into(),
                 "image_gen".into(),
                 "audio".into(),
                 "moderation".into(),
+                "score".into(),
             ]),

Also applies to: 87-87

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/protocols/src/model_type.rs` around lines 37 - 39, The JSON schema
enum for ModelType is missing the "score" variant even though ModelType includes
the SCORE flag and CAPABILITY_NAMES/serde accept "score"; update the ModelType
JsonSchema implementation to add "score" to the enum list (and mirror this
change wherever the manual enum list is duplicated, e.g., the other
JsonSchema/enum generation block referenced at the second location). Locate the
ModelType definition (including the const SCORE) and the impl JsonSchema for
ModelType and add "score" to the returned enum values so schema-based validation
and generated clients accept the score capability.

/// Standard LLM: chat + completions + responses + tools
const LLM = Self::CHAT.bits() | Self::COMPLETIONS.bits()
@@ -62,6 +64,9 @@

/// Content moderation model only
const MODERATION_MODEL = Self::MODERATION.bits();

/// Score / cross-encoder reranker model only (vLLM /v1/score)
const SCORE_MODEL = Self::SCORE.bits();
}
}

@@ -79,6 +84,7 @@ const CAPABILITY_NAMES: &[(ModelType, &str)] = &[
(ModelType::IMAGE_GEN, "image_gen"),
(ModelType::AUDIO, "audio"),
(ModelType::MODERATION, "moderation"),
(ModelType::SCORE, "score"),
P2: Keep ModelType schema in sync with new score capability

Adding ModelType::SCORE to CAPABILITY_NAMES makes runtime serialization emit "score", but the manual JsonSchema enum list in the same file still omits "score". That creates a schema/runtime mismatch where generated OpenAPI/JSON-schema validation can reject payloads that the code itself now produces for score-capable models.


];
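
For readers unfamiliar with the bitflags pattern used here, the capability mechanics can be sketched with plain `u32` masks. The bit positions are taken from this diff (`AUDIO = 1 << 10`, `MODERATION = 1 << 11`, `SCORE = 1 << 12`); the actual `ModelType` uses the `bitflags` crate rather than raw constants:

```rust
// Plain-u32 illustration of the ModelType capability bits from this diff.
const AUDIO: u32 = 1 << 10;
const MODERATION: u32 = 1 << 11;
const SCORE: u32 = 1 << 12;

// Equivalent of bitflags' contains(): all bits of `cap` must be set.
fn contains(flags: u32, cap: u32) -> bool {
    flags & cap == cap
}

fn main() {
    // SCORE_MODEL is defined as the SCORE bits alone.
    let score_model = SCORE;
    assert!(contains(score_model, SCORE));
    assert!(!contains(score_model, MODERATION));

    // Capabilities compose with bitwise OR and are queried independently,
    // which is why supports_score() is just a contains() check.
    let combined = SCORE | AUDIO;
    assert!(contains(combined, AUDIO));
    assert!(contains(combined, SCORE));
}
```

This is also why the CAPABILITY_NAMES table and the JsonSchema enum must stay in sync by hand: the bit itself carries no name, so every string-facing surface needs its own "score" entry.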

impl ModelType {
@@ -154,6 +160,12 @@ impl ModelType {
self.contains(Self::MODERATION)
}

/// Check if this model type supports the score endpoint (vLLM /v1/score)
#[inline]
pub fn supports_score(self) -> bool {
self.contains(Self::SCORE)
}

/// Check if this model type supports a given endpoint
pub fn supports_endpoint(self, endpoint: Endpoint) -> bool {
match endpoint {
@@ -162,6 +174,7 @@
Endpoint::Responses => self.supports_responses(),
Endpoint::Embeddings => self.supports_embeddings(),
Endpoint::Rerank => self.supports_rerank(),
Endpoint::Score => self.supports_score(),
Endpoint::Generate => self.supports_generate(),
Endpoint::Models => true,
}
@@ -196,6 +209,12 @@ impl ModelType {
self.supports_rerank() && !self.supports_chat()
}

/// Check if this is a score/cross-encoder model (supports /v1/score)
#[inline]
pub fn is_score_model(self) -> bool {
self.supports_score() && !self.supports_chat()
}

/// Check if this is an image generation model
#[inline]
pub fn is_image_model(self) -> bool {
@@ -344,6 +363,8 @@ pub enum Endpoint {
Embeddings,
/// Rerank endpoint (/v1/rerank)
Rerank,
/// Score / cross-encoder endpoint (/v1/score)
Score,
/// SGLang generate endpoint (/generate)
Generate,
/// Models listing endpoint (/v1/models)
@@ -359,6 +380,7 @@ impl Endpoint {
Endpoint::Responses => "/v1/responses",
Endpoint::Embeddings => "/v1/embeddings",
Endpoint::Rerank => "/v1/rerank",
Endpoint::Score => "/v1/score",
Endpoint::Generate => "/generate",
Endpoint::Models => "/v1/models",
}
@@ -373,6 +395,7 @@ impl Endpoint {
"/v1/responses" => Some(Endpoint::Responses),
"/v1/embeddings" => Some(Endpoint::Embeddings),
"/v1/rerank" => Some(Endpoint::Rerank),
"/v1/score" => Some(Endpoint::Score),
"/generate" => Some(Endpoint::Generate),
"/v1/models" => Some(Endpoint::Models),
_ => None,
@@ -387,6 +410,7 @@ impl Endpoint {
Endpoint::Responses => Some(ModelType::RESPONSES),
Endpoint::Embeddings => Some(ModelType::EMBEDDINGS),
Endpoint::Rerank => Some(ModelType::RERANK),
Endpoint::Score => Some(ModelType::SCORE),
Endpoint::Generate => Some(ModelType::GENERATE),
Endpoint::Models => None,
}
@@ -401,6 +425,7 @@ impl std::fmt::Display for Endpoint {
Endpoint::Responses => write!(f, "responses"),
Endpoint::Embeddings => write!(f, "embeddings"),
Endpoint::Rerank => write!(f, "rerank"),
Endpoint::Score => write!(f, "score"),
Endpoint::Generate => write!(f, "generate"),
Endpoint::Models => write!(f, "models"),
}
129 changes: 129 additions & 0 deletions crates/protocols/src/rerank.rs
@@ -212,3 +212,132 @@ impl From<V1RerankReqInput> for RerankRequest {
}
}
}

// ============================================================================
// Score API (vLLM /v1/score)
// ============================================================================

/// vLLM-compatible score request for cross-encoder reranker models.
///
/// Matches the vLLM `/v1/score` request schema which uses `text_1`/`text_2`
/// pairs rather than the classic `query`/`documents` style.
///
/// # Example
/// ```json
/// {
/// "model": "modernbert-reranker",
/// "text_1": "What is the capital of France?",
/// "text_2": ["Paris is the capital.", "London is in England."]
/// }
/// ```
#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
pub struct ScoreRequest {
/// The model to use for scoring
pub model: String,

/// The query/source text (single string)
pub text_1: String,

/// The document(s) to score against the query.
/// Can be a single string or a list of strings.
pub text_2: StringOrVec,

/// Optional encoding format for the response
#[serde(skip_serializing_if = "Option::is_none")]
pub encoding_format: Option<String>,

/// Whether to truncate the input
#[serde(skip_serializing_if = "Option::is_none")]
pub truncate_prompt_tokens: Option<u32>,
Comment on lines +245 to +251
⚠️ Potential issue | 🟠 Major

truncate_prompt_tokens is exposed here but ignored on the native path.

This field is part of the public /v1/score contract now, but the native gRPC transport and ScoreNativeStage only forward request_id, text_1, and text_2. That means HTTP passthrough can honor truncation while native gRPC silently drops it, so the same request behaves differently between connection modes. Please either plumb it end-to-end or remove the field until it is supported.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/protocols/src/rerank.rs` around lines 245 - 251, The
truncate_prompt_tokens field is declared on the public Rerank/score request but
is ignored on the native gRPC path; update the native plumbing to carry it or
remove it until supported. Either (preferred) add truncate_prompt_tokens to the
native request/message and propagate it in ScoreNativeStage (and any
ScoreNativeRequest/ScoreNativeResponse structs, gRPC proto/messages, and the
native transport marshal/unmarshal code) so the native path forwards and honors
the truncation flag exactly like the HTTP path, or remove truncate_prompt_tokens
from the public struct in rerank.rs so the contract is consistent; make sure to
update any tests and code that build/parse the native score request to reference
the truncate_prompt_tokens symbol and not silently drop it.

}
Comment on lines +233 to +252
🧹 Nitpick | 🔵 Trivial

Consider adding validation for ScoreRequest.

Unlike RerankRequest which has validation for non-empty query and documents, ScoreRequest lacks validation. Consider adding:

  • Non-empty text_1 validation
  • Non-empty text_2 validation (at least one document to score)

This would provide consistent error handling at the protocol layer rather than at the backend.

♻️ Example validation addition
+use validator::Validate;
+
 #[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
+#[derive(Validate)]
 pub struct ScoreRequest {
     /// The model to use for scoring
     pub model: String,

     /// The query/source text (single string)
+    #[validate(custom(function = "validate_text_1"))]
     pub text_1: String,

     /// The document(s) to score against the query.
     /// Can be a single string or a list of strings.
+    #[validate(custom(function = "validate_text_2"))]
     pub text_2: StringOrVec,
     // ... rest unchanged
 }
+
+fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
+    if text.trim().is_empty() {
+        return Err(validator::ValidationError::new("text_1 cannot be empty"));
+    }
+    Ok(())
+}
+
+fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
+    if text_2.is_empty() {
+        return Err(validator::ValidationError::new("text_2 cannot be empty"));
+    }
+    Ok(())
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema)]
pub struct ScoreRequest {
/// The model to use for scoring
pub model: String,
/// The query/source text (single string)
pub text_1: String,
/// The document(s) to score against the query.
/// Can be a single string or a list of strings.
pub text_2: StringOrVec,
/// Optional encoding format for the response
#[serde(skip_serializing_if = "Option::is_none")]
pub encoding_format: Option<String>,
/// Whether to truncate the input
#[serde(skip_serializing_if = "Option::is_none")]
pub truncate_prompt_tokens: Option<u32>,
}
use validator::Validate;
#[derive(Debug, Clone, Deserialize, Serialize, schemars::JsonSchema, Validate)]
pub struct ScoreRequest {
/// The model to use for scoring
pub model: String,
/// The query/source text (single string)
#[validate(custom(function = "validate_text_1"))]
pub text_1: String,
/// The document(s) to score against the query.
/// Can be a single string or a list of strings.
#[validate(custom(function = "validate_text_2"))]
pub text_2: StringOrVec,
/// Optional encoding format for the response
#[serde(skip_serializing_if = "Option::is_none")]
pub encoding_format: Option<String>,
/// Whether to truncate the input
#[serde(skip_serializing_if = "Option::is_none")]
pub truncate_prompt_tokens: Option<u32>,
}
fn validate_text_1(text: &str) -> Result<(), validator::ValidationError> {
if text.trim().is_empty() {
return Err(validator::ValidationError::new("text_1 cannot be empty"));
}
Ok(())
}
fn validate_text_2(text_2: &StringOrVec) -> Result<(), validator::ValidationError> {
if text_2.is_empty() {
return Err(validator::ValidationError::new("text_2 cannot be empty"));
}
Ok(())
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/protocols/src/rerank.rs` around lines 233 - 252, Add validation for
ScoreRequest similar to RerankRequest by implementing a validation method (e.g.,
impl ScoreRequest::validate or implementing the same Validate trait used by
RerankRequest) that checks that text_1 is not empty and that text_2 contains at
least one non-empty document (handle both String and Vec variants of
StringOrVec). Return a suitable error type on failure and call this validation
where other request types are validated so protocol-layer errors are consistent.


impl ScoreRequest {
/// Return text_2 as a slice of string references for routing/hashing.
pub fn texts(&self) -> Vec<&str> {
match &self.text_2 {
StringOrVec::Single(s) => vec![s.as_str()],
StringOrVec::Array(v) => v.iter().map(String::as_str).collect(),
}
}
}

impl GenerationRequest for ScoreRequest {
fn get_model(&self) -> Option<&str> {
Some(&self.model)
}

fn is_stream(&self) -> bool {
false // Score endpoint never streams
}

fn extract_text_for_routing(&self) -> String {
self.text_1.clone()
}
}

/// `text_2` field: either a single string or an array.
///
/// vLLM accepts both forms; we deserialize and normalize internally.
#[derive(Debug, Clone, Serialize, Deserialize, schemars::JsonSchema)]
#[serde(untagged)]
pub enum StringOrVec {
Single(String),
Array(Vec<String>),
}

impl StringOrVec {
/// Convert into an owned `Vec<String>` regardless of variant.
pub fn into_vec(self) -> Vec<String> {
match self {
Self::Single(s) => vec![s],
Self::Array(v) => v,
}
}

/// Return the number of texts.
pub fn len(&self) -> usize {
match self {
Self::Single(_) => 1,
Self::Array(v) => v.len(),
}
}

/// Return true if empty.
pub fn is_empty(&self) -> bool {
match self {
Self::Single(_) => false,
Self::Array(v) => v.is_empty(),
}
}
}
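
One subtlety in the impl above: `is_empty()` treats `Single` as never empty, even when the inner string is `""` — a `Single` always contributes exactly one text. A std-only sketch reproducing the helpers (the serde/untagged derives from the diff are elided here):

```rust
// Std-only replica of the PR's StringOrVec normalization helpers.
#[derive(Debug)]
enum StringOrVec {
    Single(String),
    Array(Vec<String>),
}

impl StringOrVec {
    // Normalize either accepted JSON shape into a Vec<String>.
    fn into_vec(self) -> Vec<String> {
        match self {
            Self::Single(s) => vec![s],
            Self::Array(v) => v,
        }
    }

    // A Single variant is one text by definition, so it is never "empty".
    fn is_empty(&self) -> bool {
        match self {
            Self::Single(_) => false,
            Self::Array(v) => v.is_empty(),
        }
    }
}

fn main() {
    assert!(!StringOrVec::Single(String::new()).is_empty());
    assert!(StringOrVec::Array(vec![]).is_empty());
    assert_eq!(
        StringOrVec::Single("doc".into()).into_vec(),
        vec!["doc".to_string()]
    );
}
```

If empty strings should also be rejected, that check belongs in request validation (as the nitpick comment above suggests), not in `is_empty()`.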

/// An individual score result from the vLLM score API.
#[derive(Debug, Clone, Serialize, Deserialize, schemars::JsonSchema)]
pub struct ScoreData {
/// Always `"score"` (vLLM compat)
pub object: String,
/// The relevance score as a float
pub score: f64,
/// 0-based index of this text in `text_2`
pub index: usize,
}

/// Response from the vLLM `/v1/score` endpoint.
///
/// Mirrors the structure returned by vLLM's `ScoringResponse`.
#[derive(Debug, Clone, Serialize, Deserialize, schemars::JsonSchema)]
pub struct ScoreResponse {
/// Unique identifier for this score response
pub id: String,
/// Always `"list"`
pub object: String,
/// Unix timestamp (seconds) when the response was created
pub created: i64,
/// The scored results, one per input in `text_2`
pub data: Vec<ScoreData>,
/// The model that produced the scores
pub model: String,
/// Usage information (if provided by backend)
#[serde(skip_serializing_if = "Option::is_none")]
pub usage: Option<UsageInfo>,
}