fix: filter programming keywords from entity detection (#348) #349
web3guru888
left a comment
The CODE_KEYWORDS blocklist approach is pragmatic and the scope of coverage looks right — Rust derive macros and standard types are particularly bad offenders since they're capitalized and high-frequency by construction.
One thing we hit in our integration that might be worth considering: domain-specific technical terms that are not language keywords can still flood entity detection when mining research codebases. Symbols like "Hypothesis", "Evidence", "Cycle", "Agent" appear frequently in both code and prose but mean different things in each context. The current fix is correctly scoped to programming language constructs — just flagging this as a potential follow-on if users report similar false positives from non-code technical vocabulary.
The full-pipeline test with code-heavy + prose files is the right regression test. The 101-pass baseline is reassuring.
|
Thanks for the review @web3guru888! Good point on domain-specific terms like "Hypothesis", "Evidence", "Cycle" — those are trickier since they're legitimate entity candidates in some contexts. Intentionally kept this PR scoped to unambiguous programming-language constructs. If users report false positives from research/scientific vocabulary, a follow-up with a configurable blocklist (e.g., |
|
@matrix9neonebuchadnezzar2199-sketch — makes sense to keep this scoped to unambiguous programming constructs. A configurable The false positive surface for those terms is inherently context-dependent in a way that Good PR, clean fix. Thanks for engaging on the follow-on. |
- pyproject.toml: widen chromadb to <2.0 for Python 3.14 compat (MemPalace#302)
- config.py + miner/convo_miner/mcp_server: add hnsw:space=cosine so similarity = 1 - distance stays in [0,1] instead of negative L2 (MemPalace#304)
- searcher.py + layers.py: guard against ChromaDB 1.x empty-outer query results (IndexError on fresh collections) (MemPalace#305)
- mcp_server.py: redirect stdout→stderr at import to protect JSON-RPC wire from chromadb/posthog chatter (MemPalace#306)
- mcp_server.py: replace 10k-limited col.get with paginating _iter_all_metadatas helper; stop swallowing errors silently (MemPalace#307)
- mcp_server.py: drop undocumented wait_for_previous arg injected by Gemini MCP clients (MemPalace#322)
- searcher.py + hybrid_searcher.py + mcp_server.py: add min_similarity threshold filter so callers get a clean "no results" signal (MemPalace#350)
- entity_detector.py: add CODE_KEYWORDS blocklist (~80 terms) to stop Rust types / React / framework names being detected as entities (MemPalace#349)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Heads up: #156 changes the regex on the line right above this (widens the character class to Latin-1 for accented names). The two diffs don't overlap conceptually but they're adjacent in |
|
Thanks for the heads-up @mvalentsev — noted. Happy to rebase against #156 if it lands first. The changes are in adjacent lines but independent, so a clean rebase should be straightforward. |
|
hey @matrix9neonebuchadnezzar2199-sketch — thanks for this. closing because v4 is bringing local NLP providers (#507) which will replace the regex-based entity_detector entirely with proper NER. a blocklist approach would be a stopgap that gets thrown away shortly after. the underlying issue (#348) will be solved by the NLP upgrade. |
Closes #348
Summary
Add a CODE_KEYWORDS blocklist to entity_detector.py so programming types, traits, frameworks, and language names are no longer misdetected as projects or uncertain entities during mempalace init.
Problem
When mempalace init scans a directory containing code files (Rust, TypeScript, React, etc.), common programming keywords are detected as entities:
Users must manually remove every false entry. If they press Enter to accept all, Wing classification is polluted from the start.
Solution
Add a CODE_KEYWORDS set (~120 terms) covering:
The check is applied in extract_candidates() alongside the existing STOPWORDS filter: one additional "not in" check per candidate word.
What is NOT filtered
Testing
tests/test_code_keywords.py covering:
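For illustration only, here is a minimal sketch of the kind of filter the Solution section describes. It is not taken from the PR's diff. The two sets are tiny hypothetical stand-ins for the real STOPWORDS and CODE_KEYWORDS, and both the candidate regex and the lowercase comparison are assumptions made for the example. Only the function name extract_candidates and the idea of one extra "not in" check per candidate come from the description above.

```python
import re

# Hypothetical, heavily abbreviated stand-ins for the real sets in entity_detector.py.
STOPWORDS = {"the", "this", "that", "when", "then"}
CODE_KEYWORDS = {
    "string", "vec", "option", "result",   # common Rust types
    "debug", "clone",                      # derive macros
    "react", "typescript", "rust",         # frameworks and language names
}

# Assumed candidate pattern: capitalized ASCII words of three or more letters.
CANDIDATE_RE = re.compile(r"\b[A-Z][a-zA-Z]{2,}\b")


def extract_candidates(text: str) -> dict[str, int]:
    """Count capitalized words that are neither stopwords nor code keywords."""
    counts: dict[str, int] = {}
    for word in CANDIDATE_RE.findall(text):
        key = word.lower()  # assumption: the blocklist is matched case-insensitively
        # The existing stopword check plus the one additional blocklist check.
        if key not in STOPWORDS and key not in CODE_KEYWORDS:
            counts[word] = counts.get(word, 0) + 1
    return counts


sample = (
    "Alice reviewed the Rust crate. "
    "Option and Vec and String appear everywhere. Alice approved."
)
print(extract_candidates(sample))
```

With these stand-in sets, only "Alice" survives as a candidate. "Rust", "Option", "Vec" and "String" are dropped by the blocklist check. Domain terms such as "Hypothesis" or "Evidence", raised in the review above, would still pass through, which matches the PR's stated scope.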