feat(render): add render endpoints for chat template + tokenization without generation #819
hyeongyun0916 wants to merge 3 commits into lightseekorg:main from
Conversation
…/response schemas Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…endpoints Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…completion and chat core Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses the need for disaggregated serving by providing dedicated endpoints for chat template application and prompt tokenization, decoupled from GPU inference. It introduces new HTTP and gRPC interfaces that leverage existing Rust-based tokenizer and templating components, allowing users to obtain token IDs efficiently without initiating a full generation process. This approach streamlines the prompt preparation phase for language models and aligns with previous review feedback to integrate this functionality directly into the existing system.

Highlights
Code Review
This pull request introduces new 'render' endpoints for applying chat templates and tokenizing prompts without GPU inference, available via both HTTP and gRPC. The implementation is well-structured, separating core logic from the transport layers. The addition of a --server CLI flag to switch between HTTP and gRPC modes is a good approach. My review has identified a few areas for improvement: a minor documentation duplication, a potential issue with handling a required field in the gRPC service logic, and a missing implementation for multi-modal content that could lead to silent data loss. Addressing these points will improve the robustness and completeness of the new feature.
fn proto_message_to_chat_message(
    msg: &proto::ChatCompletionMessage,
) -> Result<ChatMessage, String> {
This function currently ignores the content_parts field of proto::ChatCompletionMessage, which is used for multi-modal inputs. This can lead to silent data loss if a client sends multi-modal content. The function should be updated to either correctly process content_parts into openai_protocol::chat::MessageContent::Array or return an InvalidArgument error to signal that this feature is not yet supported. This aligns with the principle of avoiding silent failures.
References
- Instead of silently ignoring potential failures (e.g., from serialization), log them as warnings to aid in debugging. In Rust, prefer using `unwrap_or_else` to log an error over `unwrap_or_default`, which would fail silently.
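A minimal sketch of the fix suggested above, using hypothetical stand-in types for the prost-generated proto message and the `openai_protocol` content enum (the field and type names here are assumptions for illustration, not the actual generated code):

```rust
// Hypothetical stand-ins for the generated proto types.
struct ContentPart {
    part_type: String, // e.g. "text" or "image_url"
    text: Option<String>,
}

struct ChatCompletionMessage {
    content: String,
    content_parts: Vec<ContentPart>,
}

// Hypothetical stand-in for openai_protocol::chat::MessageContent.
#[derive(Debug, PartialEq)]
enum MessageContent {
    Text(String),
    Array(Vec<String>),
}

/// Convert message content, surfacing unsupported multi-modal parts
/// instead of silently dropping them.
fn convert_content(msg: &ChatCompletionMessage) -> Result<MessageContent, String> {
    if msg.content_parts.is_empty() {
        return Ok(MessageContent::Text(msg.content.clone()));
    }
    let mut parts = Vec::new();
    for part in &msg.content_parts {
        match part.part_type.as_str() {
            "text" => parts.push(part.text.clone().unwrap_or_default()),
            // Explicitly reject what cannot be converted yet, mirroring
            // gRPC's InvalidArgument semantics.
            other => return Err(format!("unsupported content part type: {other}")),
        }
    }
    Ok(MessageContent::Array(parts))
}

fn main() {
    let msg = ChatCompletionMessage {
        content: String::new(),
        content_parts: vec![ContentPart { part_type: "text".into(), text: Some("hello".into()) }],
    };
    assert_eq!(convert_content(&msg), Ok(MessageContent::Array(vec!["hello".into()])));

    let img = ChatCompletionMessage {
        content: String::new(),
        content_parts: vec![ContentPart { part_type: "image_url".into(), text: None }],
    };
    assert!(convert_content(&img).is_err());
    println!("ok");
}
```

The key point is the explicit `Err` branch: a client sending image parts gets a clear error back rather than having its content silently dropped.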
}
"tool" => Ok(ChatMessage::Tool {
    content: content_from_str(&msg.content),
    tool_call_id: msg.tool_call_id.clone().unwrap_or_default(),
For messages with role: "tool", the tool_call_id is a required field according to the OpenAI API specification. Currently, if it's missing in the gRPC request, it defaults to an empty string, which could lead to issues downstream. It would be more robust to return an InvalidArgument error if tool_call_id is not provided for a tool message. This prioritizes alignment with the external OpenAI API specification.
-    tool_call_id: msg.tool_call_id.clone().unwrap_or_default(),
+    tool_call_id: msg.tool_call_id.clone().ok_or_else(|| "`tool_call_id` is required for tool messages".to_string())?,
References
- For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.
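The required-field pattern from the suggestion above, sketched in isolation with a hypothetical stand-in for the proto message type (the real field lives on the generated `proto::ChatCompletionMessage`):

```rust
// Hypothetical stand-in; the real type is generated from the proto.
struct ToolMessage {
    tool_call_id: Option<String>,
}

/// Reject tool messages that omit the required `tool_call_id`
/// rather than defaulting to an empty string.
fn require_tool_call_id(msg: &ToolMessage) -> Result<String, String> {
    msg.tool_call_id
        .clone()
        .ok_or_else(|| "`tool_call_id` is required for tool messages".to_string())
}

fn main() {
    assert_eq!(
        require_tool_call_id(&ToolMessage { tool_call_id: Some("call_1".into()) }),
        Ok("call_1".to_string())
    );
    assert!(require_tool_call_id(&ToolMessage { tool_call_id: None }).is_err());
    println!("ok");
}
```

`ok_or_else` is lazy: the error string is only allocated when the field is actually missing, which makes it a cheap replacement for `unwrap_or_default` on the happy path.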
/// Server protocol mode: http (REST API) or grpc (gRPC render service)
/// Server protocol mode: http (REST API) or grpc (gRPC render service)
This documentation comment is duplicated. Please remove one of the lines to improve code clarity.
-/// Server protocol mode: http (REST API) or grpc (gRPC render service)
-/// Server protocol mode: http (REST API) or grpc (gRPC render service)
+/// Server protocol mode: http (REST API) or grpc (gRPC render service)
Description
Problem
Disaggregated serving requires a way to apply chat templates and tokenize prompts without running GPU inference. Previously, this was attempted via a GPU-less vLLM Python server (PR #784), but SMG already has all the necessary components in Rust: the chat template engine (`crates/tokenizer`), model-specific tool parsers (`crates/tool_parser`), and Harmony encoding for gpt-oss models.

Per slin1237's review feedback on PR #784, this implements the "early return" approach: exposing SMG's existing pipeline as render endpoints rather than adding a separate vLLM dependency.
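The flow these components implement (messages → chat template → tokenize → token_ids) can be sketched with toy stand-ins; the template format and the whitespace "tokenizer" below are placeholders, since the real work is done by SMG's `crates/tokenizer`:

```rust
struct Message {
    role: String,
    content: String,
}

/// Stand-in chat template: renders each message as "<|role|>content".
/// Real templates are model-specific Jinja-style templates.
fn apply_chat_template(messages: &[Message]) -> String {
    messages
        .iter()
        .map(|m| format!("<|{}|>{}", m.role, m.content))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Stand-in tokenizer: one sequential id per whitespace-separated piece.
/// A real tokenizer maps subword units to vocabulary ids.
fn tokenize(rendered: &str) -> Vec<u32> {
    rendered.split_whitespace().enumerate().map(|(i, _)| i as u32).collect()
}

fn main() {
    let messages = vec![Message { role: "user".into(), content: "hello world".into() }];
    let rendered = apply_chat_template(&messages);
    let token_ids = tokenize(&rendered);
    println!("rendered = {rendered:?}, token_ids = {token_ids:?}");
}
```

The point of a render endpoint is to stop after `tokenize` and return the ids, so a separate serving tier can run the actual GPU generation.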
Solution
Add two render endpoints, accessible via both HTTP REST and gRPC:

HTTP:
- `POST /v1/chat/completions/render`: messages → chat template → tokenize → token_ids
- `POST /v1/completions/render`: prompt → tokenize (with `add_special_tokens` support) → token_ids

gRPC:
- `smg.grpc.render.RenderService/RenderChat`
- `smg.grpc.render.RenderService/RenderCompletion`

Server mode is selected via a `--server http|grpc` CLI flag (same port, default: http).

Both protocols share core logic (`render_chat_core`, `render_completion_core`) with thin HTTP/gRPC adapter layers. Harmony (gpt-oss) models are auto-detected via `HarmonyDetector` and routed to `HarmonyBuilder`.

Changes

- `crates/grpc_client/proto/render_service.proto`: new proto with `RenderService` (RenderChat, RenderCompletion RPCs)
- `crates/grpc_client/build.rs`: compile render_service.proto with serde derive for all types
- `crates/grpc_client/src/lib.rs`: export `render_service_proto` module
- `crates/protocols/src/tokenize.rs`: HTTP request/response types (`RenderChatRequest`, `RenderCompletionRequest`, `RenderResponse`)
- `model_gateway/src/routers/tokenize/handlers.rs`: shared core functions + HTTP handlers
- `model_gateway/src/routers/tokenize/grpc_service.rs`: gRPC handler with proto → `ChatCompletionRequest` conversion
- `model_gateway/src/routers/tokenize/mod.rs`: module exports
- `model_gateway/src/server.rs`: HTTP routes + gRPC server mode (`ServerMode` enum)
- `model_gateway/src/main.rs`: `--server http|grpc` CLI option
- `bindings/python/src/lib.rs`: default `server_mode: Http` for Python binding
- `crates/grpc_client/python/smg_grpc_proto/__init__.py`: Python gRPC stubs export

Test Plan
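The `--server http|grpc` selection could be parsed along these lines. This is a sketch: `ServerMode` is the enum name from this PR, but the `FromStr` implementation and error message here are assumptions, not the actual CLI code:

```rust
use std::str::FromStr;

/// Mirrors the PR's ServerMode enum: one port, two protocols.
#[derive(Debug, PartialEq)]
enum ServerMode {
    Http,
    Grpc,
}

impl FromStr for ServerMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Case-insensitive match so "HTTP" and "http" both work.
        match s.to_ascii_lowercase().as_str() {
            "http" => Ok(ServerMode::Http),
            "grpc" => Ok(ServerMode::Grpc),
            other => Err(format!("invalid --server value: {other} (expected http|grpc)")),
        }
    }
}

fn main() {
    assert_eq!("http".parse::<ServerMode>(), Ok(ServerMode::Http));
    assert_eq!("GRPC".parse::<ServerMode>(), Ok(ServerMode::Grpc));
    assert!("tcp".parse::<ServerMode>().is_err());
    println!("ok");
}
```

Implementing `FromStr` lets whatever argument parser the gateway uses call `.parse::<ServerMode>()` directly, keeping the validation in one place.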
Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes