fix: filter programming keywords from entity detection (#348) #349

Closed
matrix9neonebuchadnezzar2199-sketch wants to merge 1 commit into MemPalace:develop from matrix9neonebuchadnezzar2199-sketch:fix/entity-detector-code-keywords

Conversation

@matrix9neonebuchadnezzar2199-sketch

Closes #348

Summary

Add a CODE_KEYWORDS blocklist to entity_detector.py so programming types, traits, frameworks, and language names are no longer misdetected as projects or uncertain entities during mempalace init.

Problem

When mempalace init scans a directory containing code files (Rust, TypeScript, React, etc.), common programming keywords are detected as entities:

  • False projects: React, Tauri, Rust, Node, Tree
  • False uncertain: String (31×), Vec (19×), Debug (13×), Deserialize (12×), Serialize (12×), Clone (12×)

Users must manually remove every false entry. If they press Enter to accept all, Wing classification is polluted from the start.

Solution

Add a CODE_KEYWORDS set (~120 terms) covering:

  • Rust types/traits/derive macros (String, Vec, Debug, Clone, Serialize, Deserialize, ...)
  • JS/TS/React keywords (React, Vue, Angular, Node, Component, Props, ...)
  • Python framework names (Django, Flask, FastAPI, Pytest, ...)
  • Go keywords (Goroutine, Channel, Defer, ...)
  • General programming patterns (Tree, Graph, Queue, Handler, Middleware, ...)
  • Language/runtime names (Rust, Python, Kotlin, Swift, ...)
  • Build tools and frameworks (Cargo, Tauri, Electron, Docker, ...)

The check is applied in extract_candidates() alongside the existing STOPWORDS filter: one additional "not in" membership test per candidate word.
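
The filter described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the regex, the abbreviated STOPWORDS and CODE_KEYWORDS sets, and the Counter return type are all assumptions made for the example. Only the function name and the idea of a second membership check come from the description.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stand-ins. The real sets in entity_detector.py
# are larger (the PR describes ~120 code keywords).
STOPWORDS = {"the", "this", "when", "users"}
CODE_KEYWORDS = {
    "string", "vec", "debug", "clone", "serialize", "deserialize",
    "react", "rust", "node", "tree", "tauri",
}


def extract_candidates(text: str) -> Counter:
    """Count capitalized words that appear in neither blocklist."""
    counts = Counter()
    # Assumed candidate pattern: a capitalized ASCII word.
    for word in re.findall(r"\b[A-Z][A-Za-z]+\b", text):
        key = word.lower()  # blocklist entries are stored lowercase
        if key not in STOPWORDS and key not in CODE_KEYWORDS:
            counts[word] += 1
    return counts


sample = "Alice ported CodeMAP to Rust. String and Vec derive Debug and Clone."
print(extract_candidates(sample))
```

In this sketch, Rust, String, Vec, Debug, and Clone are dropped, while Alice and CodeMAP survive because neither appears in a blocklist.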

What is NOT filtered

  • Actual project names (CodeMAP, MalCheck, etc.)
  • Actual person names (Alice, Bob, etc.)
  • Any term not in the blocklist

Testing

  • 10 new tests in tests/test_code_keywords.py covering:
    • Rust types, derive macros, framework names, language names, common code patterns excluded
    • Real project and person names NOT excluded
    • All CODE_KEYWORDS entries are lowercase (consistency check)
    • Full pipeline test with code-heavy + prose files
  • Existing suite: 101 passed; the only 2 failures are pre-existing and Windows-only
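
For illustration, two of the checks listed above might look like the sketch below. The test names and the abbreviated CODE_KEYWORDS set are hypothetical and are not copied from tests/test_code_keywords.py.

```python
# Hypothetical excerpt; the real set lives in entity_detector.py.
CODE_KEYWORDS = {"string", "vec", "debug", "react", "rust", "django", "goroutine"}


def test_code_keywords_are_lowercase():
    # A mixed-case entry would never match a lowercased candidate,
    # so it would silently fail to filter anything.
    assert all(term == term.lower() for term in CODE_KEYWORDS)


def test_real_names_not_blocked():
    # Project and person names from the PR description must pass through.
    for name in ("CodeMAP", "MalCheck", "Alice", "Bob"):
        assert name.lower() not in CODE_KEYWORDS


# Run directly when pytest is not available.
test_code_keywords_are_lowercase()
test_real_names_not_blocked()
```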

@web3guru888 left a comment

The CODE_KEYWORDS blocklist approach is pragmatic and the scope of coverage looks right — Rust derive macros and standard types are particularly bad offenders since they're capitalized and high-frequency by construction.

One thing we hit in our integration that might be worth considering: domain-specific technical terms that are not language keywords can still flood entity detection when mining research codebases. Symbols like "Hypothesis", "Evidence", "Cycle", "Agent" appear frequently in both code and prose but mean different things in each context. The current fix is correctly scoped to programming language constructs — just flagging this as a potential follow-on if users report similar false positives from non-code technical vocabulary.

The full-pipeline test with code-heavy + prose files is the right regression test. The 101-pass baseline is reassuring.

@matrix9neonebuchadnezzar2199-sketch
Author

Thanks for the review @web3guru888!

Good point on domain-specific terms like "Hypothesis", "Evidence", "Cycle" — those are trickier since they're legitimate entity candidates in some contexts. Intentionally kept this PR scoped to unambiguous programming-language constructs. If users report false positives from research/scientific vocabulary, a follow-up with a configurable blocklist (e.g., ~/.mempalace/custom_stopwords.txt) might be the right approach. Happy to track that separately if it comes up.
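
A rough sketch of what such a loader could look like, purely as an illustration. Nothing like this exists in the PR: the function name load_custom_stopwords, the one-term-per-line format, and the "#" comment convention are all assumptions.

```python
from pathlib import Path


def load_custom_stopwords(path: Path) -> set[str]:
    """Read one term per line, lowercased; skip blank lines and '#' comments.

    Hypothetical helper: the file format is an assumption, not a
    documented mempalace feature.
    """
    if not path.is_file():
        return set()  # a missing file simply means no custom terms
    terms = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        term = line.strip().lower()
        if term and not term.startswith("#"):
            terms.add(term)
    return terms
```

The result could then be merged into the built-in blocklist at startup, for example with load_custom_stopwords(Path.home() / ".mempalace" / "custom_stopwords.txt").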

@web3guru888

@matrix9neonebuchadnezzar2199-sketch — makes sense to keep this scoped to unambiguous programming constructs. A configurable ~/.mempalace/custom_stopwords.txt would be a clean solution for the research vocabulary case — users who map "Hypothesis" or "Evidence" as drawer content would know to add them there.

The false-positive surface for those terms is context-dependent in a way that keywords such as function, const, and async are not: "Hypothesis" is definitely an entity in a biology paper but definitely a metavariable in a logic textbook. A user-curated blocklist is probably the right place for that judgment, rather than hardcoded heuristics.

Good PR, clean fix. Thanks for engaging on the follow-on.

Perseusxrltd added a commit to Perseusxrltd/mnemion that referenced this pull request Apr 9, 2026
- pyproject.toml: widen chromadb to <2.0 for Python 3.14 compat (MemPalace#302)
- config.py + miner/convo_miner/mcp_server: add hnsw:space=cosine so
  similarity = 1 - distance stays in [0,1] instead of negative L2 (MemPalace#304)
- searcher.py + layers.py: guard against ChromaDB 1.x empty-outer query
  results (IndexError on fresh collections) (MemPalace#305)
- mcp_server.py: redirect stdout→stderr at import to protect JSON-RPC
  wire from chromadb/posthog chatter (MemPalace#306)
- mcp_server.py: replace 10k-limited col.get with paginating
  _iter_all_metadatas helper; stop swallowing errors silently (MemPalace#307)
- mcp_server.py: drop undocumented wait_for_previous arg injected by
  Gemini MCP clients (MemPalace#322)
- searcher.py + hybrid_searcher.py + mcp_server.py: add min_similarity
  threshold filter so callers get a clean "no results" signal (MemPalace#350)
- entity_detector.py: add CODE_KEYWORDS blocklist (~80 terms) to stop
  Rust types / React / framework names being detected as entities (MemPalace#349)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mvalentsev
Contributor

Heads up: #156 changes the regex on the line right above this (widens the character class to Latin-1 for accented names). The two diffs don't overlap conceptually but they're adjacent in extract_candidates, so whoever merges second will need a one-line rebase.

@matrix9neonebuchadnezzar2199-sketch
Author

Thanks for the heads-up @mvalentsev — noted. Happy to rebase against #156 if it lands first. The changes are in adjacent lines but independent, so a clean rebase should be straightforward.

@bensig
Collaborator

bensig commented Apr 12, 2026

hey @matrix9neonebuchadnezzar2199-sketch — thanks for this.

closing because v4 is bringing local NLP providers (#507) which will replace the regex-based entity_detector entirely with proper NER. a blocklist approach would be a stopgap that gets thrown away shortly after. the underlying issue (#348) will be solved by the NLP upgrade.

@bensig bensig closed this Apr 12, 2026

Development

Successfully merging this pull request may close these issues.

fix: entity detector false positives — programming keywords detected as projects/uncertain

4 participants