feat: make Tavily web search query context-aware when PDF is uploaded by nkmohit · Pull Request #258 · THU-MAIC/OpenMAIC

nkmohit · 2026-03-24T18:46:22Z

Summary

This PR makes Tavily web search queries context-aware when a PDF is attached, and aligns both active generation paths with the same rewrite behavior.

Previously, the preview flow sent the raw requirement directly to /api/web-search, so vague prompts like “Tell me about this paper” were passed to Tavily unchanged even when parsed PDF text was available. This PR adds a shared server-side query rewrite step that can use uploaded PDF text and also improves overly long raw queries before they reach Tavily.

Related Issues

Fixes #246

Changes

Added a shared server-side search query builder that:
- rewrites queries when PDF text is present
- rewrites queries when the raw requirement exceeds 400 characters
- uses the existing prompt system and JSON parsing flow
- falls back to the normalized raw requirement if rewrite is unavailable or unusable
Added a new prompt template pair for web-search-query-rewrite
Updated /api/web-search to:
- accept optional pdfText
- resolve the model from request headers, matching other preview-generation routes
- rewrite the search query before calling Tavily
Updated generation-preview to send parsed pdfText to /api/web-search
Updated generateClassroom(...) to reuse the same shared query-rewrite helper instead of keeping separate local rewrite logic
Kept Tavily request/response behavior unchanged apart from the improved query input
Kept web-search rewrite as best-effort: if rewrite model resolution or output fails, web search falls back to the raw query instead of failing the entire flow
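
The rewrite-or-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual helper: the names `buildQuery`, `shouldRewrite`, `RewriteFn`, and the `usedFallback` field are hypothetical, and the LLM call is injected as a plain function so the sketch stays self-contained. Only the 400-character threshold and the `hasPdfContext` / `rewriteAttempted` fields come from the PR text.

```typescript
// Hypothetical sketch of the best-effort query builder described in this PR.
const LONG_REQUIREMENT_THRESHOLD = 400; // "exceeds 400 characters" per the description

interface SearchQueryResult {
  query: string;
  hasPdfContext: boolean;
  rewriteAttempted: boolean;
  usedFallback: boolean; // assumed field, for illustration only
}

// Stand-in for the LLM-backed rewrite step (assumption: returns the new query).
type RewriteFn = (requirement: string, pdfText: string) => Promise<string>;

function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Expects inputs that are already normalized.
function shouldRewrite(requirement: string, pdfText: string): boolean {
  return pdfText.length > 0 || requirement.length > LONG_REQUIREMENT_THRESHOLD;
}

async function buildQuery(
  rawRequirement: string,
  rawPdfText: string | undefined,
  rewrite: RewriteFn,
): Promise<SearchQueryResult> {
  const requirement = normalizeWhitespace(rawRequirement);
  const pdfText = normalizeWhitespace(rawPdfText ?? "");
  const hasPdfContext = pdfText.length > 0;

  if (!shouldRewrite(requirement, pdfText)) {
    return { query: requirement, hasPdfContext, rewriteAttempted: false, usedFallback: false };
  }

  try {
    const rewritten = normalizeWhitespace(await rewrite(requirement, pdfText));
    if (rewritten.length > 0) {
      return { query: rewritten, hasPdfContext, rewriteAttempted: true, usedFallback: false };
    }
  } catch {
    // Best-effort: a failed rewrite must not fail the whole search flow.
  }

  // Rewrite unavailable or unusable: fall back to the normalized raw requirement.
  return { query: requirement, hasPdfContext, rewriteAttempted: true, usedFallback: true };
}
```

Injecting the rewrite function keeps the decision and fallback logic testable without a model, which is one plausible reason to share a single helper between both generation paths.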

Type of Change

New feature (non-breaking change that adds functionality)

Verification

Steps to reproduce / test

Start the app locally with pnpm dev
Go through the homepage -> generation-preview flow with:
- web search enabled
- a PDF attached
- a vague requirement such as “Tell me about this paper”
Confirm that /api/web-search is called with pdfText (rather than only the raw vague requirement) and that the request succeeds
Optionally call /api/generate-classroom directly with pdfContent.text and enableWebSearch: true to exercise the server classroom-generation path as well
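
For the optional direct call, a request body might be assembled along these lines. Only `pdfContent.text` and `enableWebSearch` are named in this PR; the `requirement` field name, the helper name, and the local URL in the usage comment are assumptions, so check the route's actual request type before relying on them.

```typescript
// Hypothetical request body for exercising /api/generate-classroom directly.
interface GenerateClassroomBody {
  requirement: string; // assumed field name, not confirmed by the PR text
  pdfContent: { text: string };
  enableWebSearch: boolean;
}

function buildGenerateClassroomBody(requirement: string, pdfText: string): GenerateClassroomBody {
  return {
    requirement,
    pdfContent: { text: pdfText },
    enableWebSearch: true,
  };
}

// Usage sketch (assumes the dev server listens on localhost:3000):
// await fetch("http://localhost:3000/api/generate-classroom", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildGenerateClassroomBody("Tell me about this paper", parsedPdfText)),
// });
```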

What you personally verified

Verified that the active preview flow now passes pdfText into /api/web-search
Verified that /api/web-search uses the same header-based model resolution pattern as the other preview generation APIs
Verified that preview web search logs show:
- hasPdfContext: true
- rewriteAttempted: true
Verified that the preview flow continues successfully through outlines / scene-content / scene-actions after the web-search change
Verified that the shared helper is reused by both /api/web-search and generateClassroom(...)
Verified that web-search rewrite failure does not hard-fail preview; it falls back to the raw requirement
Not resolved in this PR: a separate, pre-existing inconsistency in generateClassroom(...) base model selection. That job path still uses server-default model resolution and may fail if the server default provider is not configured

Evidence

CI passes (pnpm check && pnpm lint && npx tsc --noEmit)
Manually tested locally
Screenshots / recordings attached (if UI changes)

PDF attached: web search rewrite uses PDF context and returns relevant sources.

No PDF: long requirement is rewritten before Tavily and still returns relevant sources.

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have added/updated documentation as needed
My changes do not introduce new warnings

cosarah

PR #258 Review: Context-Aware Web Search Query Rewrite

What Was Done Well

Clean extraction of a shared buildSearchQuery helper — eliminates duplication between /api/web-search and generateClassroom.
Solid best-effort / graceful-degradation pattern: every failure path falls back to the raw query with logging.
SearchQueryBuildResult metadata is well-designed for observability.
Prompt templates follow existing project conventions.

Issues

Important

1. maxOutputTokens set to model's full outputWindow for a tiny query-rewrite task
app/api/web-search/route.ts:56

maxOutputTokens: modelInfo?.outputWindow,

The rewrite prompt asks for a single JSON object with a query string under 320 chars. The model's outputWindow can be 128k tokens. Some providers allocate resources or charge based on the requested maxOutputTokens. Consider capping to a small value (e.g., 256–512 tokens) to avoid waste and speed up responses. The same pattern in classroom-generation.ts:200 is reasonable since it generates full scenes, but for a query rewrite it's excessive.
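
One way to express the suggested cap is a small helper that never requests more than a fixed budget, even when the model's window is far larger. This is a sketch only: the helper name is hypothetical, and the default of 256 is simply the lower end of the 256 to 512 range suggested above.

```typescript
// Hypothetical helper: bound the output budget for a short query rewrite.
const REWRITE_MAX_OUTPUT_TOKENS = 256; // lower end of the suggested 256-512 range

function rewriteOutputTokenLimit(
  outputWindow: number | undefined,
  cap: number = REWRITE_MAX_OUTPUT_TOKENS,
): number {
  // Unknown or invalid window: fall back to the cap rather than leaving it unbounded.
  if (outputWindow === undefined || !Number.isFinite(outputWindow) || outputWindow <= 0) {
    return cap;
  }
  // Never exceed the model's own window, and never exceed the cap.
  return Math.min(outputWindow, cap);
}

// Usage sketch: maxOutputTokens: rewriteOutputTokenLimit(modelInfo?.outputWindow)
```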

2. Double-normalization in shouldRewriteSearchQuery
lib/server/search-query-builder.ts:34-36

shouldRewriteSearchQuery normalizes its inputs internally, but buildSearchQuery on line 47 passes already-normalized values into it. Not a bug (result is the same), but confusing and wasteful. The function should either accept raw inputs or skip re-normalization.
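
A possible way to make the "normalize once" contract explicit is a branded string type, so the predicate can only be called with values that have already been through normalization. The names below (`NormalizedText`, `normalizeText`, `needsRewrite`) are hypothetical and do not reflect the PR's actual code; the 400-character threshold is taken from the PR description.

```typescript
// Hypothetical sketch: a branded type marks strings that are already normalized.
type NormalizedText = string & { readonly __brand: "NormalizedText" };

function normalizeText(raw: string): NormalizedText {
  // Collapse whitespace runs and trim; the cast records that normalization happened.
  return raw.replace(/\s+/g, " ").trim() as NormalizedText;
}

// Accepts only normalized inputs, so it has no reason to normalize again.
function needsRewrite(
  requirement: NormalizedText,
  pdfText: NormalizedText,
  maxLength: number = 400,
): boolean {
  return pdfText.length > 0 || requirement.length > maxLength;
}

// Usage: normalize at the entry point, then pass the branded values through.
const requirement = normalizeText("  Tell me   about this paper ");
const pdfText = normalizeText("   ");
const rewriteNeeded = needsRewrite(requirement, pdfText);
```

Passing a plain `string` to `needsRewrite` would be a compile-time error, which documents the expectation in the signature instead of in a comment.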

3. PDF text accepted without size limits at the API boundary
app/api/web-search/route.ts:29

pdfText is accepted with no validation or truncation at the API layer. normalizePdfExcerpt truncates to 7000 chars before the LLM, but a client could still send an arbitrarily large payload that gets parsed and held in memory. Consider adding an explicit check/truncation at the API boundary, or at minimum documenting the reliance on the framework body limit.
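
A boundary check along these lines would validate and truncate the field as soon as the body has been parsed. It is a sketch under assumptions: the helper name is hypothetical, and reusing the 7000-character figure from normalizePdfExcerpt as the boundary limit is a choice made here for illustration, not something the PR specifies.

```typescript
// Hypothetical boundary clamp for an optional pdfText field.
const MAX_PDF_TEXT_CHARS = 7000; // assumed to mirror the excerpt limit mentioned above

function clampPdfText(value: unknown, maxChars: number = MAX_PDF_TEXT_CHARS): string | undefined {
  // Reject anything that is not a string instead of coercing it.
  if (typeof value !== "string") return undefined;
  const trimmed = value.trim();
  if (trimmed.length === 0) return undefined;
  // Keep only the leading portion; later stages never see more than maxChars.
  return trimmed.slice(0, maxChars);
}
```

This bounds what downstream code holds on to, but it does not limit the size of the request body itself, which is still governed by the framework's body limit.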

Minor

4. Content-Type header set twice
app/generation-preview/page.tsx:312-314

getApiHeaders() already sets 'Content-Type': 'application/json'. The explicit override on line 314 is redundant — remove it.

5. User prompt template contradicts itself on code fences
lib/generation/prompts/templates/web-search-query-rewrite/user.md:13-18

Says "no code fences" then immediately shows a code-fenced JSON example. This could confuse the LLM. Show the example without fences, or rephrase the instruction.

6. Inconsistent callLLM calling convention
app/api/web-search/route.ts:51-54 uses { system, prompt } while classroom-generation.ts:196-199 uses { messages: [...] } for the same logical operation. Minor consistency gap.
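
If both call sites are moved to the messages form, a tiny adapter could keep the conversion in one place. The real callLLM signature is not shown in this thread, so the message shape and the helper name below are assumptions for illustration.

```typescript
// Hypothetical adapter from { system, prompt } to a messages array.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

function toMessages(system: string, prompt: string): ChatMessage[] {
  return [
    { role: "system", content: system },
    { role: "user", content: prompt },
  ];
}

// Usage sketch: callLLM({ messages: toMessages(systemPrompt, userPrompt), ... })
```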

Summary

Well-structured PR with clean architecture and robust fallback behavior. The most actionable item is item 1 in the list above: capping maxOutputTokens for the rewrite call would be a meaningful cost/latency improvement for what is effectively a one-line JSON generation. The other issues are minor consistency and hygiene items.

Verdict: request changes (primarily for item 1)

nkmohit · 2026-03-25T07:46:48Z

Thanks for the detailed review, @cosarah. I pushed a follow-up commit addressing the review items.

Changes made:

Capped rewrite-only maxOutputTokens to 256 in both /api/web-search and generateClassroom(...)
Removed the double-normalization in shouldRewriteSearchQuery
Added explicit pdfText truncation at the /api/web-search boundary before entering the rewrite flow
Removed the redundant Content-Type override in generation-preview
Fixed the prompt JSON examples so they no longer conflict with the no-code-fences instruction
Aligned the rewrite callLLM usage to messages: [...] in both paths

The rewrite behavior itself is unchanged: it is still best-effort and still falls back to the normalized raw requirement if the rewrite step is unavailable or unusable.

One small note on the pdfText clamp: it now limits what the rewrite flow sees, but it still happens after req.json(), so it does not change request-parse memory behavior.
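
If request-parse memory ever needs bounding as well, one cheap first step is to inspect the Content-Length header before calling req.json(). The sketch below is not part of this PR: the limit value and the helper name are hypothetical, and the header can be absent (for example with chunked transfer encoding) or inaccurate, so it complements a framework body limit rather than replacing it.

```typescript
// Hypothetical pre-parse guard based on the Content-Length header value.
const MAX_BODY_BYTES = 1_000_000; // illustrative limit, not taken from the PR

function exceedsBodyLimit(contentLength: string | null, maxBytes: number = MAX_BODY_BYTES): boolean {
  // Header missing: size is unknown, so this check alone cannot reject the request.
  if (contentLength === null) return false;
  const bytes = Number(contentLength);
  // Malformed values are treated as over the limit to stay on the safe side.
  if (!Number.isInteger(bytes) || bytes < 0) return true;
  return bytes > maxBytes;
}

// Usage sketch inside a route handler, before req.json():
// if (exceedsBodyLimit(req.headers.get("content-length"))) {
//   return new Response("Payload too large", { status: 413 });
// }
```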

Ready for re-review when you have time.

fix(web-search): rewrite search query for pdf and long requirements

b242544

nkmohit mentioned this pull request Mar 24, 2026

[Feature]: Context-aware web search query when files are uploaded #246

Open

YizukiAme mentioned this pull request Mar 24, 2026

fix: context-aware web search query when PDF files are uploaded #259

Closed

cosarah reviewed Mar 25, 2026

fix(web-search): address query rewrite review feedback

faeaba5

Merge branch 'main' into feature/context-aware-web-search-query

b6bcaf3

nkmohit wants to merge 3 commits into THU-MAIC:main from nkmohit:feature/context-aware-web-search-query
