fix: filter programming keywords from entity detection (#348) by matrix9neonebuchadnezzar2199-sketch · Pull Request #349 · MemPalace/mempalace

matrix9neonebuchadnezzar2199-sketch · 2026-04-09T07:39:26Z

Closes #348

Summary

Add a CODE_KEYWORDS blocklist to entity_detector.py so programming types, traits, frameworks, and language names are no longer misdetected as projects or uncertain entities during mempalace init.

Problem

When mempalace init scans a directory containing code files (Rust, TypeScript, React, etc.), common programming keywords are detected as entities:

False projects: React, Tauri, Rust, Node, Tree
False uncertain: String (31×), Vec (19×), Debug (13×), Deserialize (12×), Serialize (12×), Clone (12×)

Users must manually remove every false entry. If they press Enter to accept all, Wing classification is polluted from the start.

Solution

Add a CODE_KEYWORDS set (~120 terms) covering:

Rust types/traits/derive macros (String, Vec, Debug, Clone, Serialize, Deserialize, ...)
JS/TS/React keywords (React, Vue, Angular, Node, Component, Props, ...)
Python framework names (Django, Flask, FastAPI, Pytest, ...)
Go keywords (Goroutine, Channel, Defer, ...)
General programming patterns (Tree, Graph, Queue, Handler, Middleware, ...)
Language/runtime names (Rust, Python, Kotlin, Swift, ...)
Build tools and frameworks (Cargo, Tauri, Electron, Docker, ...)

The check is applied in extract_candidates() alongside the existing STOPWORDS filter: one additional membership test (not in CODE_KEYWORDS) per candidate word.
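
A minimal sketch of how the filter described above might sit in extract_candidates(). The regex, the abbreviated STOPWORDS and CODE_KEYWORDS contents, and the Counter return type are assumptions for illustration, not the actual code in entity_detector.py:

```python
import re
from collections import Counter

# Hypothetical, abbreviated stand-ins for the real sets in entity_detector.py.
STOPWORDS = {"the", "and", "this", "that", "when"}
CODE_KEYWORDS = {
    "string", "vec", "debug", "clone", "serialize", "deserialize",
    "react", "node", "rust", "tauri", "tree",
}

# Assumed candidate pattern: any capitalized ASCII word.
CANDIDATE_RE = re.compile(r"\b[A-Z][a-zA-Z]+\b")


def extract_candidates(text: str) -> Counter:
    """Count capitalized words that are neither stopwords nor code keywords."""
    counts: Counter = Counter()
    for word in CANDIDATE_RE.findall(text):
        lowered = word.lower()
        # Existing stopword filter plus the one extra blocklist lookup.
        if lowered in STOPWORDS or lowered in CODE_KEYWORDS:
            continue
        counts[word] += 1
    return counts
```

Because both sets are stored lowercase, a single lowered lookup covers String, STRING, and string alike, while project names such as CodeMAP pass through untouched.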

What is NOT filtered

Actual project names (CodeMAP, MalCheck, etc.)
Actual person names (Alice, Bob, etc.)
Any term not in the blocklist

Testing

10 new tests in tests/test_code_keywords.py covering:
- Rust types, derive macros, framework names, language names, common code patterns excluded
- Real project and person names NOT excluded
- All CODE_KEYWORDS entries are lowercase (consistency check)
- Full pipeline test with code-heavy + prose files
All existing tests pass (101 passed, 2 pre-existing Windows-only failures)
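
The kinds of checks listed above could look roughly like the following pytest-style sketch. The abbreviated CODE_KEYWORDS set and the is_code_keyword helper are hypothetical and stand in for whatever tests/test_code_keywords.py actually imports:

```python
# Hypothetical, abbreviated copy of the blocklist used only for this sketch.
CODE_KEYWORDS = {
    "string", "vec", "debug", "clone", "serialize", "deserialize",
    "react", "django", "rust", "cargo", "tree",
}


def is_code_keyword(word: str) -> bool:
    """Hypothetical helper: case-insensitive membership in the blocklist."""
    return word.lower() in CODE_KEYWORDS


def test_rust_types_excluded():
    for word in ("String", "Vec", "Debug", "Clone"):
        assert is_code_keyword(word)


def test_real_names_not_excluded():
    for word in ("CodeMAP", "MalCheck", "Alice", "Bob"):
        assert not is_code_keyword(word)


def test_keywords_are_lowercase():
    # Consistency check: lookups lowercase the candidate, so entries must too.
    assert all(entry == entry.lower() for entry in CODE_KEYWORDS)
```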

web3guru888

The CODE_KEYWORDS blocklist approach is pragmatic and the scope of coverage looks right — Rust derive macros and standard types are particularly bad offenders since they're capitalized and high-frequency by construction.

One thing we hit in our integration that might be worth considering: domain-specific technical terms that are not language keywords can still flood entity detection when mining research codebases. Symbols like "Hypothesis", "Evidence", "Cycle", "Agent" appear frequently in both code and prose but mean different things in each context. The current fix is correctly scoped to programming language constructs — just flagging this as a potential follow-on if users report similar false positives from non-code technical vocabulary.

The full-pipeline test with code-heavy + prose files is the right regression test. The 101-pass baseline is reassuring.

matrix9neonebuchadnezzar2199-sketch · 2026-04-09T08:56:06Z

Thanks for the review @web3guru888!

Good point on domain-specific terms like "Hypothesis", "Evidence", "Cycle" — those are trickier since they're legitimate entity candidates in some contexts. Intentionally kept this PR scoped to unambiguous programming-language constructs. If users report false positives from research/scientific vocabulary, a follow-up with a configurable blocklist (e.g., ~/.mempalace/custom_stopwords.txt) might be the right approach. Happy to track that separately if it comes up.
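
For illustration only, a user-curated blocklist of the kind floated above might be loaded like this. The file format (one term per line, # for comments) and the load_custom_stopwords name are assumptions; nothing like this exists in the PR:

```python
from pathlib import Path

# Path suggested in the discussion; the file format below is an assumption.
DEFAULT_PATH = Path.home() / ".mempalace" / "custom_stopwords.txt"


def load_custom_stopwords(path: Path = DEFAULT_PATH) -> set[str]:
    """Read one term per line, ignoring blank lines and '#' comments."""
    if not path.is_file():
        # No user file means no extra terms; detection behaves as before.
        return set()
    terms: set[str] = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        term = line.split("#", 1)[0].strip().lower()
        if term:
            terms.add(term)
    return terms
```

The returned set could then be unioned with the built-in blocklist before filtering, which keeps the judgment about terms like "Hypothesis" with the user rather than in hardcoded heuristics.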

web3guru888 · 2026-04-09T09:11:45Z

@matrix9neonebuchadnezzar2199-sketch — makes sense to keep this scoped to unambiguous programming constructs. A configurable ~/.mempalace/custom_stopwords.txt would be a clean solution for the research vocabulary case — users who map "Hypothesis" or "Evidence" as drawer content would know to add them there.

The false positive surface for those terms is inherently context-dependent in a way that function, const, async isn't — "Hypothesis" is definitely an entity in a biology paper but definitely a metavariable in a logic textbook. A blocklist that the user curates is probably the right locus of that judgment rather than hardcoding heuristics.

Good PR, clean fix. Thanks for engaging on the follow-on.

- pyproject.toml: widen chromadb to <2.0 for Python 3.14 compat (MemPalace#302)
- config.py + miner/convo_miner/mcp_server: add hnsw:space=cosine so similarity = 1 - distance stays in [0,1] instead of negative L2 (MemPalace#304)
- searcher.py + layers.py: guard against ChromaDB 1.x empty-outer query results (IndexError on fresh collections) (MemPalace#305)
- mcp_server.py: redirect stdout→stderr at import to protect JSON-RPC wire from chromadb/posthog chatter (MemPalace#306)
- mcp_server.py: replace 10k-limited col.get with paginating _iter_all_metadatas helper; stop swallowing errors silently (MemPalace#307)
- mcp_server.py: drop undocumented wait_for_previous arg injected by Gemini MCP clients (MemPalace#322)
- searcher.py + hybrid_searcher.py + mcp_server.py: add min_similarity threshold filter so callers get a clean "no results" signal (MemPalace#350)
- entity_detector.py: add CODE_KEYWORDS blocklist (~80 terms) to stop Rust types / React / framework names being detected as entities (MemPalace#349)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mvalentsev · 2026-04-09T11:47:03Z

Heads up: #156 changes the regex on the line right above this (widens the character class to Latin-1 for accented names). The two diffs don't overlap conceptually but they're adjacent in extract_candidates, so whoever merges second will need a one-line rebase.
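
To make the adjacency concrete, here is a rough sketch of what widening a candidate regex to Latin-1 could look like. Both patterns are hypothetical; the actual regex in extract_candidates and the exact change in #156 are not shown in this thread:

```python
import re

# Hypothetical ASCII-only pattern: a capital letter followed by lowercase letters.
ASCII_RE = re.compile(r"\b[A-Z][a-z]+\b")

# Hypothetical widened pattern: adds the Latin-1 letter ranges, skipping the
# multiplication sign (U+00D7) and division sign (U+00F7).
LATIN1_RE = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zß-öø-ÿ]+\b")

text = "José met Élodie"
ascii_hits = ASCII_RE.findall(text)
latin1_hits = LATIN1_RE.findall(text)
```

With the ASCII-only pattern neither accented name is matched as a whole word, while the widened one captures both. A blocklist lookup on the following line would be unaffected either way, which is why the two changes are independent.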

matrix9neonebuchadnezzar2199-sketch · 2026-04-09T11:49:59Z

Thanks for the heads-up @mvalentsev — noted. Happy to rebase against #156 if it lands first. The changes are in adjacent lines but independent, so a clean rebase should be straightforward.

bensig · 2026-04-12T06:59:53Z

hey @matrix9neonebuchadnezzar2199-sketch — thanks for this.

closing because v4 is bringing local NLP providers (#507) which will replace the regex-based entity_detector entirely with proper NER. a blocklist approach would be a stopgap that gets thrown away shortly after. the underlying issue (#348) will be solved by the NLP upgrade.

fix: filter programming keywords from entity detection (MemPalace#348)

612c8ee

web3guru888 reviewed Apr 9, 2026

mvalentsev mentioned this pull request Apr 10, 2026

fix: entity detector false positives — programming keywords detected as projects/uncertain #348

Open

bensig changed the base branch from main to develop April 11, 2026 22:22

bensig requested review from bensig, igorls and milla-jovovich as code owners April 11, 2026 22:22

bensig closed this Apr 12, 2026

matrix9neonebuchadnezzar2199-sketch wants to merge 1 commit into MemPalace:develop from matrix9neonebuchadnezzar2199-sketch:fix/entity-detector-code-keywords

4 participants
