feat(render): add render endpoints for chat template + tokenization without generation #819
hyeongyun0916 wants to merge 3 commits into lightseekorg:main from
Conversation
…/response schemas Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…endpoints Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…completion and chat core Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses the need for disaggregated serving by providing dedicated endpoints for chat template application and prompt tokenization, decoupled from GPU inference. It introduces new HTTP and gRPC interfaces that leverage existing Rust-based tokenizer and templating components, allowing users to obtain token IDs efficiently without initiating a full generation process. This approach streamlines the prompt preparation phase for language models and aligns with previous review feedback to integrate this functionality directly into the existing system.

Highlights
Code Review
This pull request introduces new 'render' endpoints for applying chat templates and tokenizing prompts without GPU inference, available via both HTTP and gRPC. The implementation is well-structured, separating core logic from the transport layers. The addition of a --server CLI flag to switch between HTTP and gRPC modes is a good approach. My review has identified a few areas for improvement: a minor documentation duplication, a potential issue with handling a required field in the gRPC service logic, and a missing implementation for multi-modal content that could lead to silent data loss. Addressing these points will improve the robustness and completeness of the new feature.
fn proto_message_to_chat_message(
    msg: &proto::ChatCompletionMessage,
) -> Result<ChatMessage, String> {
This function currently ignores the content_parts field of proto::ChatCompletionMessage, which is used for multi-modal inputs. This can lead to silent data loss if a client sends multi-modal content. The function should be updated to either correctly process content_parts into openai_protocol::chat::MessageContent::Array or return an InvalidArgument error to signal that this feature is not yet supported. This aligns with the principle of avoiding silent failures.
References
- Instead of silently ignoring potential failures (e.g., from serialization), log them as warnings to aid in debugging. In Rust, prefer using `unwrap_or_else` to log an error over `unwrap_or_default`, which would fail silently.
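A minimal sketch of the fix suggested above, using hypothetical stand-in types for the prost-generated proto message and the `openai_protocol` content enum (the field and type names here are assumptions for illustration, not the actual generated code):

```rust
// Hypothetical stand-ins for the generated proto types.
struct ContentPart {
    part_type: String, // e.g. "text" or "image_url"
    text: Option<String>,
}

struct ChatCompletionMessage {
    content: String,
    content_parts: Vec<ContentPart>,
}

// Hypothetical stand-in for openai_protocol::chat::MessageContent.
#[derive(Debug, PartialEq)]
enum MessageContent {
    Text(String),
    Array(Vec<String>),
}

/// Convert message content, surfacing unsupported multi-modal parts
/// instead of silently dropping them.
fn convert_content(msg: &ChatCompletionMessage) -> Result<MessageContent, String> {
    if msg.content_parts.is_empty() {
        return Ok(MessageContent::Text(msg.content.clone()));
    }
    let mut parts = Vec::new();
    for part in &msg.content_parts {
        match part.part_type.as_str() {
            "text" => parts.push(part.text.clone().unwrap_or_default()),
            // Explicitly reject what cannot be converted yet, mirroring
            // gRPC's InvalidArgument semantics.
            other => return Err(format!("unsupported content part type: {other}")),
        }
    }
    Ok(MessageContent::Array(parts))
}

fn main() {
    let msg = ChatCompletionMessage {
        content: String::new(),
        content_parts: vec![ContentPart { part_type: "text".into(), text: Some("hello".into()) }],
    };
    assert_eq!(convert_content(&msg), Ok(MessageContent::Array(vec!["hello".into()])));

    let img = ChatCompletionMessage {
        content: String::new(),
        content_parts: vec![ContentPart { part_type: "image_url".into(), text: None }],
    };
    assert!(convert_content(&img).is_err());
    println!("ok");
}
```

The key point is the explicit `Err` branch: a client sending image parts gets a clear error back rather than having its content silently dropped.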
}
"tool" => Ok(ChatMessage::Tool {
    content: content_from_str(&msg.content),
    tool_call_id: msg.tool_call_id.clone().unwrap_or_default(),
For messages with role: "tool", the tool_call_id is a required field according to the OpenAI API specification. Currently, if it's missing in the gRPC request, it defaults to an empty string, which could lead to issues downstream. It would be more robust to return an InvalidArgument error if tool_call_id is not provided for a tool message. This prioritizes alignment with the external OpenAI API specification.
-    tool_call_id: msg.tool_call_id.clone().unwrap_or_default(),
+    tool_call_id: msg.tool_call_id.clone().ok_or_else(|| "`tool_call_id` is required for tool messages".to_string())?,
References
- For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.
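The required-field pattern from the suggestion above, sketched in isolation with a hypothetical stand-in for the proto message type (the real field lives on the generated `proto::ChatCompletionMessage`):

```rust
// Hypothetical stand-in; the real type is generated from the proto.
struct ToolMessage {
    tool_call_id: Option<String>,
}

/// Reject tool messages that omit the required `tool_call_id`
/// rather than defaulting to an empty string.
fn require_tool_call_id(msg: &ToolMessage) -> Result<String, String> {
    msg.tool_call_id
        .clone()
        .ok_or_else(|| "`tool_call_id` is required for tool messages".to_string())
}

fn main() {
    assert_eq!(
        require_tool_call_id(&ToolMessage { tool_call_id: Some("call_1".into()) }),
        Ok("call_1".to_string())
    );
    assert!(require_tool_call_id(&ToolMessage { tool_call_id: None }).is_err());
    println!("ok");
}
```

`ok_or_else` is lazy: the error string is only allocated when the field is actually missing, which makes it a cheap replacement for `unwrap_or_default` on the happy path.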
/// Server protocol mode: http (REST API) or grpc (gRPC render service)
/// Server protocol mode: http (REST API) or grpc (gRPC render service)
This documentation comment is duplicated. Please remove one of the lines to improve code clarity.
-/// Server protocol mode: http (REST API) or grpc (gRPC render service)
-/// Server protocol mode: http (REST API) or grpc (gRPC render service)
+/// Server protocol mode: http (REST API) or grpc (gRPC render service)
Description
Problem
Disaggregated serving requires a way to apply chat templates and tokenize prompts without running GPU inference. Previously, this was attempted via a GPU-less vLLM Python server (PR #784), but SMG already has all the necessary components in Rust: the chat template engine (`crates/tokenizer`), model-specific tool parsers (`crates/tool_parser`), and Harmony encoding for gpt-oss models.

Per slin1237's review feedback on PR #784, this implements the "early return" approach: exposing SMG's existing pipeline as render endpoints rather than adding a separate vLLM dependency.
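The flow these components implement (messages → chat template → tokenize → token_ids) can be sketched with toy stand-ins; the template format and the whitespace "tokenizer" below are placeholders, since the real work is done by SMG's `crates/tokenizer`:

```rust
struct Message {
    role: String,
    content: String,
}

/// Stand-in chat template: renders each message as "<|role|>content".
/// Real templates are model-specific Jinja-style templates.
fn apply_chat_template(messages: &[Message]) -> String {
    messages
        .iter()
        .map(|m| format!("<|{}|>{}", m.role, m.content))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Stand-in tokenizer: one sequential id per whitespace-separated piece.
/// A real tokenizer maps subword units to vocabulary ids.
fn tokenize(rendered: &str) -> Vec<u32> {
    rendered.split_whitespace().enumerate().map(|(i, _)| i as u32).collect()
}

fn main() {
    let messages = vec![Message { role: "user".into(), content: "hello world".into() }];
    let rendered = apply_chat_template(&messages);
    let token_ids = tokenize(&rendered);
    println!("rendered = {rendered:?}, token_ids = {token_ids:?}");
}
```

The point of a render endpoint is to stop after `tokenize` and return the ids, so a separate serving tier can run the actual GPU generation.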
Solution
Add two render endpoints, accessible via both HTTP REST and gRPC:

HTTP:
- `POST /v1/chat/completions/render`: messages → chat template → tokenize → token_ids
- `POST /v1/completions/render`: prompt → tokenize (with `add_special_tokens` support) → token_ids

gRPC:
- `smg.grpc.render.RenderService/RenderChat`
- `smg.grpc.render.RenderService/RenderCompletion`

Server mode is selected via a `--server http|grpc` CLI flag (same port, default: http).

Both protocols share core logic (`render_chat_core`, `render_completion_core`) with thin HTTP/gRPC adapter layers. Harmony (gpt-oss) models are auto-detected via `HarmonyDetector` and routed to `HarmonyBuilder`.

Changes

- `crates/grpc_client/proto/render_service.proto`: new proto with `RenderService` (RenderChat, RenderCompletion RPCs)
- `crates/grpc_client/build.rs`: compile render_service.proto with serde derive for all types
- `crates/grpc_client/src/lib.rs`: export `render_service_proto` module
- `crates/protocols/src/tokenize.rs`: HTTP request/response types (`RenderChatRequest`, `RenderCompletionRequest`, `RenderResponse`)
- `model_gateway/src/routers/tokenize/handlers.rs`: shared core functions + HTTP handlers
- `model_gateway/src/routers/tokenize/grpc_service.rs`: gRPC handler with proto → `ChatCompletionRequest` conversion
- `model_gateway/src/routers/tokenize/mod.rs`: module exports
- `model_gateway/src/server.rs`: HTTP routes + gRPC server mode (`ServerMode` enum)
- `model_gateway/src/main.rs`: `--server http|grpc` CLI option
- `bindings/python/src/lib.rs`: default `server_mode: Http` for Python binding
- `crates/grpc_client/python/smg_grpc_proto/__init__.py`: Python gRPC stubs export

Test Plan
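The `--server http|grpc` selection could be parsed along these lines. This is a sketch: `ServerMode` is the enum name from this PR, but the `FromStr` implementation and error message here are assumptions, not the actual CLI code:

```rust
use std::str::FromStr;

/// Mirrors the PR's ServerMode enum: one port, two protocols.
#[derive(Debug, PartialEq)]
enum ServerMode {
    Http,
    Grpc,
}

impl FromStr for ServerMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Case-insensitive match so "HTTP" and "http" both work.
        match s.to_ascii_lowercase().as_str() {
            "http" => Ok(ServerMode::Http),
            "grpc" => Ok(ServerMode::Grpc),
            other => Err(format!("invalid --server value: {other} (expected http|grpc)")),
        }
    }
}

fn main() {
    assert_eq!("http".parse::<ServerMode>(), Ok(ServerMode::Http));
    assert_eq!("GRPC".parse::<ServerMode>(), Ok(ServerMode::Grpc));
    assert!("tcp".parse::<ServerMode>().is_err());
    println!("ok");
}
```

Implementing `FromStr` lets whatever argument parser the gateway uses call `.parse::<ServerMode>()` directly, keeping the validation in one place.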
Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes