-
Notifications
You must be signed in to change notification settings - Fork 118
feat: Add LLM improvements and feedback signal API #100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add LLM improvements and feedback signal API #100
Conversation
- Add Anthropic as LLM provider with full async support - Add LM Studio provider for local model inference - Fix JSON response format compatibility for local models - Update .env.example with configuration examples - Update docstrings with all supported providers Tested with: - Claude Sonnet 4 (claude-sonnet-4-20250514) - Claude Haiku 4.5 (claude-haiku-4-5-20251001) - Qwen 30B via LM Studio
Add configurable timeout support for LLM API calls: - Environment variable override via HINDSIGHT_API_LLM_TIMEOUT - Dynamic heuristic for lmstudio/ollama: 20 mins for large models (30b, 33b, 34b, 65b, 70b, 72b, 8x7b, 8x22b), 5 mins for others - Pass timeout to Anthropic, OpenAI, and local model clients 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove CLAUDE.md from .gitignore (should stay in repository) - Pass max_completion_tokens to _call_anthropic instead of hardcoding 4096 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Provides project context and development commands for AI-assisted coding. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add docker-compose.yml for local development - Add test_internal.py for local testing - Sync uv.lock and llm_wrapper.py changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Move LLM config to config.py with HINDSIGHT_API_ prefix - Add HINDSIGHT_API_LLM_MAX_CONCURRENT (default: 32) - Add HINDSIGHT_API_LLM_TIMEOUT (default: 120s) - Remove fragile model-size timeout heuristic - Apply markdown JSON extraction to all providers, not just local - Fix Anthropic markdown extraction bug (missing split) - Change LLM request/response logs from info to debug level 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove test_internal.py (debug file) - Remove docker-compose.yml (to be moved to hindsight-cookbook repo) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Implements a feedback signal system that tracks which recalled facts
are actually useful, enabling usefulness-boosted recall.
API Endpoints:
- POST /v1/default/banks/{bank_id}/signal - Submit feedback signals
- GET /v1/default/banks/{bank_id}/facts/{fact_id}/stats - Fact stats
- GET /v1/default/banks/{bank_id}/stats/usefulness - Bank stats
Features:
- Signal types: used, ignored, helpful, not_helpful
- Time-decayed scoring (5% decay per week)
- Usefulness-boosted recall with configurable weight
- Query pattern tracking for analytics
Database:
- fact_usefulness: Aggregate scores per fact
- usefulness_signals: Individual signal records
- query_pattern_stats: Pattern tracking
Documentation:
- Full API reference in hindsight-docs
- Python, Node.js, and cURL examples
- Updated recall.mdx with new parameters
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Adds a startup script that waits for dependencies (database and LLM Studio) before launching the Hindsight API. Retries indefinitely by default, allowing the container to start before LM Studio is available. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Strip <think>, <thinking>, <reasoning>, and |startthink|/|endthink| tags from reasoning model outputs to enable proper JSON parsing. This allows local reasoning models like Qwen3 to work with Hindsight's structured extraction pipeline. Also adds slow call logging for Ollama native function and updates reasoning model detection to include qwq family. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
…artup Container now waits for database and LLM Studio to be accessible before starting Hindsight. Configurable via environment variables: - HINDSIGHT_RETRY_MAX: Max retries (0 = infinite, default) - HINDSIGHT_RETRY_INTERVAL: Seconds between retries (default 10) Applied to all three Docker stages: api-only, cp-only, and standalone. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
When HINDSIGHT_API_DATABASE_URL is not set, the standalone container uses embedded pg0 which starts with start-all.sh. The retry script now detects this and skips the external database check. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Documents testing of Qwen3 8B/14B, Gemma 3, and NuExtract models for Hindsight memory extraction on Apple Silicon. Includes: - Benchmark results and performance comparisons - Configuration recommendations - Docker setup with retry-start script - Troubleshooting guide 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
# Conflicts: # hindsight-api/hindsight_api/engine/llm_wrapper.py # uv.lock
- Add llm-comparison.py benchmark script to compare LLM providers - Reduce max_completion_tokens from 65000 to 8192 for better local LLM compatibility - Include benchmark results for Qwen3-8B and Claude Haiku 4.5 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add thinking_level parameter support (LOW/MEDIUM/HIGH) for Gemini 3 models - Add temperature and max_output_tokens support for Gemini - Improve rate limit handling with 10-120s backoff for 429 errors - Add HINDSIGHT_API_LLM_THINKING_LEVEL env var (default: low) - Include benchmark results comparing thinking levels Performance with thinking_level=medium: - 4.3x faster retain (4.5s vs 19.4s per memory) - 92% fact extraction quality retained (33 vs 36 facts) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- LLM wrapper: detect and retry on empty responses (None or empty string) - LLM wrapper: add detailed logging with finish_reason and safety ratings - Docker startup: skip endpoint check for cloud providers (openai, anthropic, gemini, groq) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Resolved conflict in llm_wrapper.py: - Kept upstream's more descriptive comment for thinking tag stripping - Preserved local empty response handling with retry logic 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Implements a feedback signal system that tracks which recalled facts
are actually useful, enabling usefulness-boosted recall.
API Endpoints:
- POST /v1/default/banks/{bank_id}/signal - Submit feedback signals
- GET /v1/default/banks/{bank_id}/facts/{fact_id}/stats - Fact stats
- GET /v1/default/banks/{bank_id}/stats/usefulness - Bank stats
Features:
- Signal types: used, ignored, helpful, not_helpful
- Time-decayed scoring (5% decay per week)
- Usefulness-boosted recall with configurable weight
- Query pattern tracking for analytics
Database:
- fact_usefulness: Aggregate scores per fact
- usefulness_signals: Individual signal records
- query_pattern_stats: Pattern tracking
Documentation:
- Full API reference in hindsight-docs
- Python, Node.js, and cURL examples
- Updated recall.mdx with new parameters
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add thinking_level parameter support (LOW/MEDIUM/HIGH) for Gemini 3 models - Add temperature and max_output_tokens support for Gemini - Improve rate limit handling with 10-120s backoff for 429 errors - Add HINDSIGHT_API_LLM_THINKING_LEVEL env var (default: low) - Include benchmark results comparing thinking levels Performance with thinking_level=medium: - 4.3x faster retain (4.5s vs 19.4s per memory) - 92% fact extraction quality retained (33 vs 36 facts) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- LLM wrapper: detect and retry on empty responses (None or empty string) - LLM wrapper: add detailed logging with finish_reason and safety ratings - Docker startup: skip endpoint check for cloud providers (openai, anthropic, gemini, groq) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The retry logic is already integrated into start-all.sh, making this separate script unnecessary. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
These are local test results that shouldn't be in the upstream PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
nicoloboschi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @csfet9 thanks for this big addition!
I've a more general question about this:
isn't the helpfulness related to the recall query and other parameters?
for example, if I ask the bank about "Alice" I might say that Bob facts weren't helpful but if I ask about Bob, that fact is indeed helpful (maybe)
So I think we're missing something that connects the helpfulness to the actual query
|
Hey @nicoloboschi, great feedback! You're absolutely right - the helpfulness signal should be tied to the query context. I've implemented a query-context aware scoring system: Changes
Example Fact: "Bob works at TechCorp" Query "Who works at TechCorp?" → marked helpful → score = 0.65 Recall with "Tell me about TechCorp employees" → uses 0.65 (similar to first query) This way, the same fact can have different usefulness scores depending on the query context. Currently testing locally - will push once verified. Want me to adjust anything? |
thanks, that looks better! can you share your use case for this feature and what is the pattern you're building? it looks like you want some human-in-the-loop mechanism and I'd love to hear more about how you intend to use hindsight there |
|
Thanks, Here's the use case: I'm building an automatic feedback loop for Claude Code where the system learns from Claude's actual behavior. How it works
Why query-context matters A fact like "Bob is a senior engineer" might be helpful for "Who can help with code?" but irrelevant for "What's the deadline?". Without query-context, one "helpful" signal would boost it for ALL queries. The semantic matching ensures similar queries share scores while different query types stay separate. The human-in-the-loop is mostly implicit - Claude's usage patterns provide the signal automatically, no manual thumbs up/down needed. |
- Make query field required in SignalItem for context-aware scoring - Add query_fact_usefulness table with HNSW index for semantic matching - Store query embeddings with signals for similarity-based grouping - Use query-specific scores with global fallback during recall - Similar queries (cosine similarity >= 0.85) share scores - Update documentation and examples with required query field 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Replace separate env var with existing reasoning_effort parameter for Gemini 3 thinking_level configuration. This unifies the config across providers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
Updates pushed: Query-Context Aware Scoring (addresses your feedback)
Review comment fixes Ready for re-review! |
Resolve conflict in config.py: - Keep Groq service tier from upstream (vectorize-io#102) - Remove unused LLM thinking level env var (now using reasoning_effort) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Read HINDSIGHT_API_LLM_THINKING_LEVEL from environment in all LLMProvider factory methods (for_memory, for_answer_generation, for_judge) instead of hardcoding the reasoning_effort value. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Features merged: - Query-context aware feedback scoring for improved recall relevance - Groq service tier configuration (ENV_LLM_GROQ_SERVICE_TIER) - Configurable thinking level for Gemini 3 via reasoning_effort - OpenAI embeddings support with configurable dimensions - GitHub issue templates 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The file was referenced in Dockerfile but was missing from the repo. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The retry logic was merged into start-all.sh. Updated Dockerfile from upstream which uses start-all.sh directly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
… 500 Add Pydantic field_validator to SignalItem.fact_id to validate UUID format at the API layer. Invalid UUIDs now receive a proper 422 Validation Error instead of causing a 500 Internal Server Error when the database rejects them. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
…tion Merged three changes: - llm_wrapper.py: Combined Gemini temperature/max_completion_tokens (from HEAD) with return_usage parameter (from upstream) - fact_extraction.py: Use config.retain_max_completion_tokens instead of hardcoded 8192 for configurable token limits Co-Authored-By: Claude Opus 4.5 <[email protected]>
The `<0.5s` was being interpreted as a JSX tag by the MDX parser. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
This PR adds several high-value improvements to the Hindsight memory system:
1. Feedback Signal API
POST /feedback)2. Gemini 3 Flash Preview Optimizations
3. Empty LLM Response Handling
Test plan
cd hindsight-api && uv run pytest tests/🤖 Generated with Claude Code