479 changes: 466 additions & 13 deletions docs/specs/openapi-props.yaml

Large diffs are not rendered by default.

353 changes: 328 additions & 25 deletions docs/specs/props-endpoint.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/specs/thinking-budget.md
@@ -125,7 +125,7 @@ Fields:
| `verified_at` | ISO date the values were last checked against the source. |
| `max_tokens` | The card's standard recommended combined cap. Drives `default_max_tokens`. |
| `complex_problem_max_tokens` | Optional. The card's recommendation for hard reasoning / benchmark workloads. Drives the `x-high` and `max` effort tiers, which sit *above* `default_max_tokens` when this field is present — they are admissible as long as they fit under `max_ctx − hard_limit_reply_budget`. If omitted, both collapse to the `high` tier value. |
| `hard_limit_reply_budget` | Optional. Tokens reserved post-`</think>` for the visible answer phase, used both to derive `think_max_tokens = max_tokens − hard_limit_reply_budget` and as the force-close trigger inside `do_ar_decode` / `do_spec_decode` (when `n_gen − generated ≤ hard_limit_reply_budget`, the engine overrides the next sampled token with `</think>`). Default 4096 (raised from 512 on 2026-05-25). The original 512 came from `ds4_eval.c`, sized for DeepSeek-V4-flash's terse style, but it silently truncated almost every other model mid-answer — bench results from `server/docs/experiments/gemma4-26b-thinking-control-2026-05-25.md` showed every force-closed thinking probe getting cut off mid-coordinate-geometry-proof at 512. Without priors on a specific model, 4096 is the safer default; terse models should override down. Qwen3.6, Gemma 4 26B, Gemma 4 31B all ship 4096 in their sidecars. |
| `hard_limit_reply_budget` | Optional. Tokens reserved post-`</think>` for the visible answer phase, used both to derive `think_max_tokens = max_tokens − hard_limit_reply_budget` and as the force-close trigger inside `do_ar_decode` / `do_spec_decode` (when `n_gen − generated ≤ hard_limit_reply_budget`, the engine overrides the next sampled token with `</think>`). Default 4096 (raised from 512 on 2026-05-25). The original 512 came from `ds4_eval.c`, sized for DeepSeek-V4-flash's terse style, but it silently truncated almost every other model mid-answer — bench results from `docs/experiments/gemma4-26b-thinking-control-2026-05-25.md` showed every force-closed thinking probe getting cut off mid-coordinate-geometry-proof at 512. Without priors on a specific model, 4096 is the safer default; terse models should override down. Qwen3.6, Gemma 4 26B, Gemma 4 31B all ship 4096 in their sidecars. |
| `sampling` | Recommended sampler params. Used as defaults when the request doesn't pin sampler values. |
| `reasoning_effort_tiers` | Explicit phase-1 budgets per tier. Override any computed default. Whichever tiers are present win; missing tiers fall through to the computed defaults below. |

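
The `hard_limit_reply_budget` row uses the same number twice: once to derive the thinking cap and once as the force-close trigger. A minimal sketch of both follows. The names `ThinkBudget`, `derive_think_max_tokens` and `should_force_close` are hypothetical and not taken from the engine source; only the arithmetic comes from the table, and the clamp at zero is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the two sidecar fields described in the table.
struct ThinkBudget {
    int32_t max_tokens;               // combined cap (thinking + visible answer)
    int32_t hard_limit_reply_budget;  // tokens reserved for the visible answer
};

// think_max_tokens = max_tokens - hard_limit_reply_budget.
// Clamping at 0 is an assumption: the spec row does not say what happens
// when the reply budget exceeds the combined cap.
int32_t derive_think_max_tokens(const ThinkBudget & b) {
    const int32_t v = b.max_tokens - b.hard_limit_reply_budget;
    return v > 0 ? v : 0;
}

// Force-close trigger: true once the remaining generation allowance
// (n_gen - generated) is no larger than the reply budget, i.e. the point
// at which the table says the next sampled token is overridden with </think>.
bool should_force_close(int32_t n_gen, int32_t generated, int32_t reply_budget) {
    assert(generated <= n_gen);  // precondition for this sketch
    return n_gen - generated <= reply_budget;
}
```

With `max_tokens = 32768` and the default reply budget of 4096, the derived thinking cap is 28672, and the trigger first fires when exactly 4096 tokens of allowance remain.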
82 changes: 78 additions & 4 deletions server/src/server/chat_template.cpp
@@ -51,14 +51,18 @@ ChatFormat chat_format_for_arch(const std::string & arch) {
return ChatFormat::QWEN3;
}

std::string render_chat_template(
PromptRenderResult render_chat_template(
const std::vector<ChatMessage> & messages,
ChatFormat format,
bool add_generation_prompt,
bool enable_thinking,
const std::string & tools_json)
{
std::string result;
// `started_in_thinking` is derived deterministically from the template
// branch + render flags below. Set per format inside the switch so a
// future format addition can't silently miss the wiring.
bool started_in_thinking = false;
bool has_tools = !tools_json.empty() && tools_json != "[]" && tools_json != "null";

switch (format) {
@@ -141,6 +145,14 @@ std::string render_chat_template(
// even when the client opts in, defeating the thinking-budget
// mechanism entirely.
result += "<think>\n";
// The prompt suffix pre-opens `<think>` — the model's very
// first generated token is reasoning, never preceded by an
// explicit `<think>` opener in the stream. Callers must
// start the SSE state machine in REASONING mode and pass
// `started_in_thinking=true` to parse_reasoning() so that
// reasoning text routes to reasoning_content instead of
// leaking into content.
started_in_thinking = true;
}
}
break;
@@ -224,6 +236,11 @@ std::string render_chat_template(
result += "<assistant>\n";
if (enable_thinking) {
result += "<think>";
// Same situation as Qwen3.6: Laguna XS.2's enable_thinking
// generation prompt ends with `<think>` so the model starts
// emitting reasoning tokens with no explicit opener in the
// stream. Route subsequent tokens to the reasoning channel.
started_in_thinking = true;
} else {
// Empty think block — model jumps straight to answer.
result += "</think>";
@@ -311,11 +328,17 @@ std::string render_chat_template(
result += "<|channel>thought\n<channel|>";
}
}
// Gemma4 does NOT pre-open `<think>` from the prompt; its
// reasoning channel is opened by the model emitting `<|channel>`
// which http_server forwards into the SseEmitter as the text
// `<think>` — so the emitter's existing CONTENT→REASONING
// transition fires on that synthesized opener. started_in_thinking
// stays false (initial CONTENT mode is correct).
break;
}
}

return result;
return PromptRenderResult{std::move(result), started_in_thinking};
}

// ─── Jinja path ─────────────────────────────────────────────────────────
@@ -353,7 +376,29 @@ static std::shared_ptr<jinja::program> get_or_parse(const std::string & template

} // namespace

std::string render_chat_template_jinja(
// Sniff a rendered prompt for a trailing `<think>` opener so the caller
// can route subsequent stream tokens to the reasoning channel. Accepts
// optional whitespace after the opener (Qwen3.6 emits `<think>\n`).
// A true result means the caller should treat the prompt as having
// pre-opened the reasoning channel. The renderer additionally warns when
// this happens with enable_thinking=false, so a template/model-card
// mismatch is visible at runtime.
static bool prompt_ends_with_think_open(const std::string & s) {
    static const std::string OPEN = "<think>";
    // Walk back over trailing ASCII whitespace.
    size_t end = s.size();
    while (end > 0) {
        char c = s[end - 1];
        if (c == ' ' || c == '\n' || c == '\r' || c == '\t') {
            end--;
        } else {
            break;
        }
    }
    if (end < OPEN.size()) return false;
    return s.compare(end - OPEN.size(), OPEN.size(), OPEN) == 0;
}

PromptRenderResult render_chat_template_jinja(
const std::string & template_src,
const std::vector<ChatMessage> & messages,
const std::string & bos_token,
@@ -407,14 +452,43 @@ std::string render_chat_template_jinja(
throw std::runtime_error(std::string("jinja global_from_json: ") + e.what());
}

std::string rendered;
try {
jinja::runtime rt(ctx);
jinja::value results = rt.execute(*prog);
auto parts = jinja::runtime::gather_string_parts(results);
return parts->as_string().str();
rendered = parts->as_string().str();
} catch (const std::exception & e) {
throw std::runtime_error(std::string("jinja runtime: ") + e.what());
}

// Jinja path: we don't know which template family the caller passed
// in, so derive `started_in_thinking` by sniffing the rendered tail
// for a `<think>` opener. This catches the common Qwen3.6 / Laguna
// chat templates that end with `<think>\n` when enable_thinking is
// honored, plus any custom template that follows the same convention.
//
// The sniff is the source of truth — if the rendered prompt ends
// with `<think>`, the model's first generated token is reasoning
// regardless of the `enable_thinking` flag we passed in. A template
// that hard-codes `<think>` even with enable_thinking=false will
// still pre-open the channel, and we must route accordingly to
// avoid leaking reasoning into the content stream.
//
// Warn only on the mismatch case (sniff=true, enable_thinking=false)
// so a template/model-card disagreement surfaces in server logs
// without spamming the normal-success path.
bool started_in_thinking =
add_generation_prompt && prompt_ends_with_think_open(rendered);
if (started_in_thinking && !enable_thinking) {
std::fprintf(stderr,
"[WARN] render_chat_template_jinja: rendered prompt ends with "
"`<think>` opener despite enable_thinking=false — treating as "
"started_in_thinking=true. Check the template's enable_thinking "
"branch or the model card's reasoning configuration.\n");
}

return PromptRenderResult{std::move(rendered), started_in_thinking};
}

} // namespace dflash::common
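
The Jinja-path decision above can be exercised in isolation. The sketch below restates the tail sniff from this diff as a free function and wraps it in a hypothetical `started_in_thinking` helper; the helper name and signature are illustrative only and are not part of `chat_template.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Restatement of the tail sniff from the diff: true when `s`, ignoring
// trailing ASCII whitespace, ends with the literal "<think>".
bool ends_with_think_open(const std::string & s) {
    static const std::string OPEN = "<think>";
    std::size_t end = s.size();
    while (end > 0) {
        const char c = s[end - 1];
        if (c != ' ' && c != '\n' && c != '\r' && c != '\t') break;
        --end;
    }
    if (end < OPEN.size()) return false;
    return s.compare(end - OPEN.size(), OPEN.size(), OPEN) == 0;
}

// Hypothetical wrapper mirroring the Jinja-path rule: the sniff only counts
// when a generation prompt was requested, and it deliberately ignores the
// enable_thinking flag (the rendered text is treated as the source of truth).
bool started_in_thinking(const std::string & rendered, bool add_generation_prompt) {
    return add_generation_prompt && ends_with_think_open(rendered);
}
```

A prompt ending in `<think>\n` is reported as pre-opened, while one ending in `<think></think>` (the empty think block that the Laguna branch emits when thinking is disabled) is not, because its tail is the closer rather than the opener.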
21 changes: 19 additions & 2 deletions server/src/server/chat_template.h
@@ -27,6 +27,23 @@ enum class ChatFormat {
GEMMA4, // <bos><|turn>role\n...<turn|>\n
};

// Provenance for a rendered prompt. `text` is the byte string that gets
// tokenized; `started_in_thinking` records whether the prompt suffix
// pre-opens a `<think>` block (or equivalent reasoning-channel marker)
// that the model is expected to continue into.
//
// Callers route this into the SseEmitter's initial mode and into
// parse_reasoning()'s `started_in_thinking` argument so reasoning text
// emitted before any explicit `<think>` opener is still attributed to
// the reasoning channel. Without this plumbing, Qwen3.6 / Laguna
// enable_thinking prompts (which pre-open `<think>\n` in the assistant
// turn) cause the model to emit reasoning straight into the content
// channel, leaving `reasoning_content` empty.
struct PromptRenderResult {
    std::string text;                // rendered prompt text, ready to tokenize
    bool        started_in_thinking; // prompt suffix opens reasoning channel
};

// Render chat messages into the model-specific prompt string.
// The result is plain text ready to be tokenized.
//
@@ -40,7 +57,7 @@ enum class ChatFormat {
// `tools_json` is an optional JSON string containing the tool definitions
// array. When non-empty, the Qwen3/3.5 template injects a tool preamble
// into the system message instructing the model how to emit <tool_call> tags.
std::string render_chat_template(
PromptRenderResult render_chat_template(
const std::vector<ChatMessage> & messages,
ChatFormat format,
bool add_generation_prompt = true,
@@ -67,7 +84,7 @@ ChatFormat chat_format_for_arch(const std::string & arch);
// Internally caches the most recently parsed program per thread (avoids
// re-parsing the template on every request). Throws std::runtime_error on
// lexer/parser/runtime failure (caller should surface a 500 response).
std::string render_chat_template_jinja(
PromptRenderResult render_chat_template_jinja(
const std::string & template_src,
const std::vector<ChatMessage> & messages,
const std::string & bos_token,
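
The header comment says callers feed `started_in_thinking` into both the SSE emitter's initial mode and `parse_reasoning()`. Neither of those is shown in this diff, so the following is only a sketch of the idea under stated assumptions: `StreamMode`, `initial_mode` and `split_reasoning` are hypothetical stand-ins, and the real `parse_reasoning()` signature may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Same shape as the struct added to chat_template.h.
struct PromptRenderResult {
    std::string text;
    bool        started_in_thinking;
};

// Hypothetical stand-in for the SSE emitter's channel state.
enum class StreamMode { CONTENT, REASONING };

// Pick the emitter's starting channel from the render result.
StreamMode initial_mode(const PromptRenderResult & r) {
    return r.started_in_thinking ? StreamMode::REASONING : StreamMode::CONTENT;
}

// Hypothetical, simplified splitter returning {reasoning, content}.
// When started_in_thinking is true, the output is treated as already inside
// a think block, so no explicit "<think>" opener is required.
std::pair<std::string, std::string>
split_reasoning(const std::string & out, bool started_in_thinking) {
    static const std::string OPEN  = "<think>";
    static const std::string CLOSE = "</think>";
    std::string content;
    std::size_t begin = 0;
    if (!started_in_thinking) {
        const std::size_t open = out.find(OPEN);
        if (open == std::string::npos) return {std::string(), out};
        content = out.substr(0, open);  // text before the opener is content
        begin = open + OPEN.size();
    }
    const std::size_t close = out.find(CLOSE, begin);
    if (close == std::string::npos) {
        return {out.substr(begin), content};  // think block never closed
    }
    content += out.substr(close + CLOSE.size());
    return {out.substr(begin, close - begin), content};
}
```

For the model output `step 1</think>answer`, passing `started_in_thinking=true` yields reasoning `step 1` and content `answer`. Passing `false` leaves the reasoning empty and returns the whole string as content, which is the leak this change is meant to prevent.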