feat(render): add render endpoints for chat template + tokenization without generation#819

Draft
hyeongyun0916 wants to merge 3 commits into lightseekorg:main from hyeongyun0916:feat/tokenize-endpoint

Conversation

@hyeongyun0916

Description

Problem

Disaggregated serving requires a way to apply chat templates and tokenize prompts without running GPU inference. Previously, this was attempted via a GPU-less vLLM Python server (PR #784), but SMG already has all the necessary components in Rust: chat template engine (crates/tokenizer), model-specific tool parsers (crates/tool_parser), and Harmony encoding for gpt-oss models.

Per slin1237's review feedback on PR #784, this implements the "early return" approach — exposing SMG's existing pipeline as render endpoints rather than adding a separate vLLM dependency.

Solution

Add two render endpoints accessible via both HTTP REST and gRPC:

HTTP:

  • POST /v1/chat/completions/render — messages → chat template → tokenize → token_ids
  • POST /v1/completions/render — prompt → tokenize (with add_special_tokens support) → token_ids

gRPC:

  • smg.grpc.render.RenderService/RenderChat
  • smg.grpc.render.RenderService/RenderCompletion

Server mode is selected via --server http|grpc CLI flag (same port, default: http).

Both protocols share core logic (render_chat_core, render_completion_core) with thin HTTP/gRPC adapter layers. Harmony (gpt-oss) models are auto-detected via HarmonyDetector and routed to HarmonyBuilder.

Changes

  • crates/grpc_client/proto/render_service.proto — New proto with RenderService (RenderChat, RenderCompletion RPCs)
  • crates/grpc_client/build.rs — Compile render_service.proto with serde derive for all types
  • crates/grpc_client/src/lib.rs — Export render_service_proto module
  • crates/protocols/src/tokenize.rs — HTTP request/response types (RenderChatRequest, RenderCompletionRequest, RenderResponse)
  • model_gateway/src/routers/tokenize/handlers.rs — Shared core functions + HTTP handlers
  • model_gateway/src/routers/tokenize/grpc_service.rs — gRPC handler with proto → ChatCompletionRequest conversion
  • model_gateway/src/routers/tokenize/mod.rs — Module exports
  • model_gateway/src/server.rs — HTTP routes + gRPC server mode (ServerMode enum)
  • model_gateway/src/main.rs — --server http|grpc CLI option
  • bindings/python/src/lib.rs — Default server_mode: Http for Python binding
  • crates/grpc_client/python/smg_grpc_proto/__init__.py — Python gRPC stubs export

Test Plan

# Unit tests (20 tests)
cargo test -p smg -- tokenize
# Includes:
# - gRPC proto conversion: system/user/assistant/tool message, unknown role, tool calls
# - prompt extraction: text, texts, token_ids rejected, missing
# - handler core: model not found errors

# HTTP manual test
cargo run --bin smg -- launch --model-path "openai/gpt-oss-20b"
curl -X POST http://localhost:30000/v1/chat/completions/render \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"hello"}]}'

curl -X POST http://localhost:30000/v1/completions/render \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","prompt":"hello world"}'

# gRPC manual test
cargo run --bin smg -- launch --server grpc --model-path "openai/gpt-oss-20b"
grpcurl -plaintext -proto crates/grpc_client/proto/render_service.proto \
  -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"hello"}]}' \
  localhost:30000 smg.grpc.render.RenderService/RenderChat
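For clients that prefer Python over curl, the HTTP manual test above can be sketched with the standard library. The endpoint path and the token_ids response field come from this PR's description; no other response fields are assumed:

```python
import json
import urllib.request

def build_render_request(model, messages):
    # Same JSON body as the curl examples above.
    return json.dumps({"model": model, "messages": messages})

def render_chat(base_url, model, messages):
    # POST to the chat render endpoint and return the token ids
    # (no generation happens server-side).
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions/render",
        data=build_render_request(model, messages).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token_ids"]
```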

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Commits (3)

  • …/response schemas — Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
  • …endpoints — Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
  • …completion and chat core — Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
@coderabbitai

coderabbitai bot commented Mar 19, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses the need for disaggregated serving by providing dedicated endpoints for chat template application and prompt tokenization, decoupled from GPU inference. It introduces new HTTP and gRPC interfaces that leverage existing Rust-based tokenizer and templating components, allowing users to obtain token IDs efficiently without initiating a full generation process. This approach streamlines the prompt preparation phase for language models and aligns with previous review feedback to integrate this functionality directly into the existing system.

Highlights

  • New Render Endpoints: Introduced two new endpoints, /v1/chat/completions/render and /v1/completions/render, for applying chat templates and tokenizing prompts without requiring GPU inference or full generation.
  • Dual Protocol Support: The new render functionality is accessible via both HTTP REST and gRPC, with a new --server http|grpc CLI flag to select the server mode.
  • Leveraging Existing Components: The solution reuses existing Rust components for chat templating (crates/tokenizer), tool parsing (crates/tool_parser), and Harmony encoding, avoiding a new vLLM dependency.
  • Shared Core Logic: Implemented shared core functions (render_chat_core, render_completion_core) that are utilized by both HTTP and gRPC handlers, ensuring consistent logic across protocols.
  • gRPC Service Definition: Added a new Protocol Buffer definition (render_service.proto) for the RenderService, including RPCs for RenderChat and RenderCompletion and their associated message types.

@github-actions github-actions bot added python-bindings Python bindings changes model-gateway Model gateway crate changes labels Mar 19, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces new 'render' endpoints for applying chat templates and tokenizing prompts without GPU inference, available via both HTTP and gRPC. The implementation is well-structured, separating core logic from the transport layers. The addition of a --server CLI flag to switch between HTTP and gRPC modes is a good approach. My review has identified a few areas for improvement: a minor documentation duplication, a potential issue with handling a required field in the gRPC service logic, and a missing implementation for multi-modal content that could lead to silent data loss. Addressing these points will improve the robustness and completeness of the new feature.

Comment on lines +132 to +134
fn proto_message_to_chat_message(
    msg: &proto::ChatCompletionMessage,
) -> Result<ChatMessage, String> {
high

This function currently ignores the content_parts field of proto::ChatCompletionMessage, which is used for multi-modal inputs. This can lead to silent data loss if a client sends multi-modal content. The function should be updated to either correctly process content_parts into openai_protocol::chat::MessageContent::Array or return an InvalidArgument error to signal that this feature is not yet supported. This aligns with the principle of avoiding silent failures.

References
  1. Instead of silently ignoring potential failures (e.g., from serialization), log them as warnings to aid in debugging. In Rust, prefer using unwrap_or_else to log an error over unwrap_or_default which would fail silently.

        }
        "tool" => Ok(ChatMessage::Tool {
            content: content_from_str(&msg.content),
            tool_call_id: msg.tool_call_id.clone().unwrap_or_default(),

high

For messages with role: "tool", the tool_call_id is a required field according to the OpenAI API specification. Currently, if it's missing in the gRPC request, it defaults to an empty string, which could lead to issues downstream. It would be more robust to return an InvalidArgument error if tool_call_id is not provided for a tool message. This prioritizes alignment with the external OpenAI API specification.

Suggested change
tool_call_id: msg.tool_call_id.clone().unwrap_or_default(),
tool_call_id: msg.tool_call_id.clone().ok_or_else(|| "`tool_call_id` is required for tool messages".to_string())?,
References
  1. For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.
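The ok_or_else fix suggested above amounts to a required-field check at the conversion boundary. A Python sketch of the same idea (message shape is illustrative, not SMG's actual types):

```python
def convert_tool_message(msg: dict) -> dict:
    # Reject tool messages that omit the required tool_call_id instead of
    # silently defaulting it to an empty string.
    tool_call_id = msg.get("tool_call_id")
    if not tool_call_id:  # covers both a missing key and an empty string
        raise ValueError("`tool_call_id` is required for tool messages")
    return {
        "role": "tool",
        "content": msg.get("content", ""),
        "tool_call_id": tool_call_id,
    }
```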

Comment on lines +146 to +147
/// Server protocol mode: http (REST API) or grpc (gRPC render service)
/// Server protocol mode: http (REST API) or grpc (gRPC render service)

medium

This documentation comment is duplicated. Please remove one of the lines to improve code clarity.

Suggested change
/// Server protocol mode: http (REST API) or grpc (gRPC render service)
/// Server protocol mode: http (REST API) or grpc (gRPC render service)
/// Server protocol mode: http (REST API) or grpc (gRPC render service)
