UPSTREAM PR #18851: enforce response_format and json_schema for Kimi K2 #925
Conversation
Explore the complete analysis inside the Version Insights Performance Review Report.

Summary
This review analyzes commit 53ddc91 "enforce response_format and json_schema for Kimi K2", affecting 123 files (78 modified, 39 added, 6 deleted). The changes introduce minor performance regressions in two utility binaries, llama-tts (+414 ns) and llama-cvector-generator (+230 ns), caused by compiler optimization differences in STL template instantiations rather than algorithmic changes.

Performance Impact
Power Consumption:
Affected Functions:

Analysis
The commit modifies the build configuration by promoting LLAMA_CURL to default ON and setting LLAMA_OPENSSL to default OFF. These changes affect template instantiation and inlining decisions for C++ standard library containers (std::map, std::vector, std::function) used in non-critical utility paths such as JSON schema parsing and argument handling.

Key observations:

The performance-critical components identified in project insights (GEMM operations, attention mechanisms, KV cache, quantization kernels) remain unaffected, with core libraries showing zero power consumption change.
@loci-dev can you show me the code for function common_chat_params_init_kimi_k2
Based on the analysis:

Function Purpose
This function initializes chat parameters specifically for Kimi K2 models. It was modified in commit 53ddc91 to enforce validation rules for response_format and json_schema.

Key Changes Made
The function was enhanced with:
Performance Impact
Since this is an initialization function called once per chat session setup (not during inference), the added validation logic has zero impact on inference performance. The function executes outside the token generation loop, so the additional checks and schema conversion only affect startup time, not the performance-critical inference path. The function is located in the chat-format handling code of the repository. Would you like me to explain any specific aspect of this function's behavior or its relationship to the overall performance changes?
@loci-dev can you give me a code snippet of the source code of function common_chat_params_init_kimi_k2
I don't have access to the actual source code files to provide you with a code snippet. The analysis tools I have access to work with compiled binaries and provide information about what the code does, but they don't return the raw source code itself. From the analysis, I can tell you what the function does conceptually.

What the function contains:
To view the actual code:
You would need to look at the source file directly.
If you have access to the repository, you can view the changes in that commit to see the exact code modifications. Is there a specific aspect of the function's logic or behavior you'd like me to explain in more detail?
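For illustration only, here is a rough, self-contained C++ sketch of the behavior described above. It is not the actual llama.cpp implementation (the real source is not reproduced in this thread); the type names, the helper, and the exact validation rules are stand-ins invented for this example.

```cpp
#include <stdexcept>
#include <string>

// Stand-in structures; the real llama.cpp chat-parameter types are richer.
struct chat_inputs {
    std::string response_format_type; // e.g. "json_schema", "json_object", or empty
    std::string json_schema;          // raw schema text, empty if none was supplied
    bool        has_tools = false;    // whether tool definitions were provided
};

struct chat_params {
    std::string grammar;              // GBNF grammar to be enforced during sampling
};

// Placeholder converter: the project has a real schema-to-grammar routine, but
// its name and signature are not shown in this thread, so this stub stands in.
static std::string schema_to_grammar_stub(const std::string & /*schema*/) {
    return "root ::= object";
}

// Sketch of a Kimi K2 parameter-initialization handler that, like the generic
// handler, derives a grammar from a schema carried in response_format.
chat_params init_kimi_k2_sketch(const chat_inputs & in) {
    chat_params out;

    // Illustrative validation: a structured response_format without a schema
    // cannot be enforced, so reject it up front.
    if (!in.response_format_type.empty() && in.response_format_type != "text"
        && in.json_schema.empty()) {
        throw std::runtime_error("response_format requires a json_schema");
    }

    // Convert the schema into a grammar so the sampler can constrain output;
    // tool-call grammar handling (which the handler already did) is omitted here.
    if (!in.json_schema.empty() && !in.has_tools) {
        out.grammar = schema_to_grammar_stub(in.json_schema);
    }
    return out;
}
```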
Mirrored from ggml-org/llama.cpp#18851
This PR aims to fix an issue I encountered using response_format with Kimi K2 Instruct 0905.
Using the /v1/chat/completions endpoint in llama-server, I noticed that I was receiving responses which were not adhering to the submitted json_schema.
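For context, a request that exercises this path carries the schema inside the response_format field of the chat-completions body. Below is a minimal sketch of that payload shape, built with nlohmann::json (the JSON library the project uses); the field layout follows the OpenAI-style response_format convention and the model name and schema contents are made-up placeholders, not taken from this PR.

```cpp
#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::json;

int main() {
    // Hypothetical body for POST /v1/chat/completions: the client asks the
    // server to constrain its reply to a JSON schema via response_format.
    json request = {
        {"model", "kimi-k2"},  // placeholder; llama-server serves the loaded model
        {"messages", json::array({
            {{"role", "user"}, {"content", "Reply with a JSON object containing ok=true."}}
        })},
        {"response_format", {
            {"type", "json_schema"},
            {"json_schema", {
                {"name", "ok_reply"},  // illustrative schema name
                {"schema", {
                    {"type", "object"},
                    {"properties", {{"ok", {{"type", "boolean"}}}}},
                    {"required", json::array({"ok"})}
                }}
            }}
        }}
    };
    std::cout << request.dump(2) << std::endl;  // print the request payload
    return 0;
}
```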
Simplest reproduction:
```
./build/bin/llama-server \
  --host 127.0.0.1 --port 5840 \
  --model /path/to/Kimi-K2-Instruct-0905-...-00001-of-00013.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 0
```

You are likely to receive a response like the following, backticks and json declaration included:
This should not be possible if a grammar is being created and enforced.
The issue is that for chat completions, when the Kimi format is detected, the request is routed to the Kimi-specific handler (common_chat_params_init_kimi_k2). This handler did not follow the same behavior as the generic handler, which generates a grammar for a schema supplied in response_format; it only handled tool grammars.
This revealed a second issue with the Kimi flags, which include an opening bracket in the tool separator and a closing bracket in the tool end marker. Because those characters appear in the template tag definitions, the trim_suffix call was removing the ending bracket and producing invalid JSON strings, e.g. {"ok": true with the closing brace stripped.
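To make the failure mode concrete, here is a small self-contained sketch of how trimming an end marker that itself contains a closing bracket can strip the brace that legitimately ends the JSON payload. It is not the actual trim_suffix code; the marker string and the partial-suffix behavior are assumptions made for illustration.

```cpp
#include <iostream>
#include <string>

// Naive trim: if the end of `s` matches any prefix of `marker`, drop it.
// Tolerating a partially emitted marker is the assumed behavior here; if the
// marker itself begins with '}', a complete JSON object loses its final brace.
static std::string trim_partial_suffix(std::string s, const std::string & marker) {
    for (size_t len = marker.size(); len > 0; --len) {
        const std::string prefix = marker.substr(0, len);
        if (s.size() >= len && s.compare(s.size() - len, len, prefix) == 0) {
            s.erase(s.size() - len);
            break;
        }
    }
    return s;
}

int main() {
    // Hypothetical tool-end marker containing a closing bracket, per the PR text.
    const std::string tool_end = "}<|tool_call_end|>";
    const std::string output   = "{\"ok\": true}";   // valid JSON arguments

    // The longest matching prefix of the marker is "}", so the trim removes the
    // brace and prints the invalid string {"ok": true
    std::cout << trim_partial_suffix(output, tool_end) << std::endl;
    return 0;
}
```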
{"ok": true. I have modified the trim_suffix approach, but it is ugly and I'm hoping someone with better intuition will have a better solution. I see there is an Autoparser PR (#18675) but I have tested it and it does not resolve the original issue.AI was used in the following ways for this PR:
As requested, I ran the whole test suite, which passed. Perplexity is obviously not affected.