Commit 439ca22

fix: sort diff matches by similarity
Sort diff matches by final Similarity before selecting the overall suggestion so keyword-order candidates cannot mask stronger Jaccard matches. Includes a regression test for the ADD-over-UPDATE masking case. Validation: go test ./...; go build -o mnemon .; make vet.
1 parent bd0fbe9 commit 439ca22

3 files changed

Lines changed: 38 additions & 0 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -8,6 +8,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
 ## [Unreleased]
 
 ### Fixed
+- `Diff` now sorts matches by `Similarity` descending before selecting the overall
+  suggestion. Previously, `KeywordSearch` ordered candidates by token overlap score,
+  so a high-keyword-score candidate classified as ADD could mask a lower-keyword-score
+  candidate with higher Jaccard similarity that should have been UPDATE or DUPLICATE.
 - Deduplication false positives on scientific and domain-specific text:
   - Removed bare `"not"` from negation words — it appears in virtually all
     scientific prose and caused unrelated records to be classified as CONFLICT.

internal/search/diff.go

Lines changed: 8 additions & 0 deletions
@@ -159,6 +159,14 @@ func Diff(insights []*model.Insight, newContent string, opts DiffOptions) DiffRe
 		}
 	}
 
+	// Sort by similarity descending so matches[0] is always the strongest candidate.
+	// KeywordSearch orders by token overlap score, which can differ from the final
+	// Jaccard-based Similarity — a high-keyword-score ADD would otherwise mask a
+	// lower-keyword-score UPDATE or DUPLICATE from a more similar candidate.
+	sort.Slice(matches, func(i, j int) bool {
+		return matches[i].Similarity > matches[j].Similarity
+	})
+
 	// Overall suggestion: take the strongest match
 	overall := DiffAdd
 	if len(matches) > 0 {
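
The reordering in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: the `match` struct and `sortBySimilarity` helper are hypothetical stand-ins for the real types in `internal/search`, and it uses `sort.SliceStable` instead of the commit's `sort.Slice` so that candidates with equal `Similarity` keep their incoming keyword-score order. Whether tie order matters to `Diff` is not stated in the commit.

```go
package main

import (
	"fmt"
	"sort"
)

// match is a hypothetical stand-in for the match type used by Diff.
type match struct {
	ID         string
	Similarity float64
	Suggestion string
}

// sortBySimilarity orders matches strongest-first. SliceStable preserves the
// incoming order for equal Similarity values (a variation on the commit,
// which uses sort.Slice).
func sortBySimilarity(matches []match) {
	sort.SliceStable(matches, func(i, j int) bool {
		return matches[i].Similarity > matches[j].Similarity
	})
}

func main() {
	// Incoming order mimics keyword-score ranking: "a" first, "b" second.
	matches := []match{
		{ID: "a", Similarity: 5.0 / 14.0, Suggestion: "ADD"},
		{ID: "b", Similarity: 4.0 / 6.0, Suggestion: "UPDATE"},
	}
	sortBySimilarity(matches)
	fmt.Println(matches[0].ID, matches[0].Suggestion) // prints "b UPDATE"
}
```

After sorting, reading `matches[0]` yields the candidate with the highest Jaccard-based similarity regardless of how the keyword search ranked it.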

internal/search/diff_test.go

Lines changed: 26 additions & 0 deletions
@@ -162,3 +162,29 @@ func TestDiff_LimitDefault(t *testing.T) {
 		t.Errorf("default limit 5: got %d matches", len(result.Matches))
 	}
 }
+
+func TestDiff_LowerKeywordScoreUpdateNotMasked(t *testing.T) {
+	// insightA: all of new's tokens are present (keyword score = 5/5 = 1.0),
+	// but Jaccard = 5/14 ≈ 0.36 → ADD. KeywordSearch puts this first.
+	insightA := &model.Insight{
+		ID:      "a",
+		Content: "project uses redis for caching database monitoring alerting logging tracing scaling replication failover clustering sharding",
+	}
+	// insightB: keyword score = 4/5 = 0.8 (ranks second), Jaccard = 4/6 ≈ 0.67 → UPDATE.
+	// Without sorting by Similarity, insightA's ADD masks this UPDATE.
+	insightB := &model.Insight{
+		ID:      "b",
+		Content: "project uses redis postgresql caching",
+	}
+
+	result := Diff(
+		[]*model.Insight{insightA, insightB},
+		"project uses redis for caching database",
+		DiffOptions{},
+	)
+
+	if result.Suggestion != DiffUpdate {
+		t.Errorf("want UPDATE (insightB is more similar by Jaccard), got %s — "+
+			"high-keyword-score ADD from insightA masked the UPDATE", result.Suggestion)
+	}
+}
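
The fractions quoted in the test comments can be checked by hand. The sketch below recomputes them under stated assumptions: tokens are lowercased whitespace-separated words, and `for` is dropped as a stop word (which is what the 5-token count for the new content implies). The helpers `tokenSet`, `overlap`, `keywordScore`, and `jaccard` are hypothetical and are not the project's tokenizer or scoring functions.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet lowercases, splits on whitespace, and drops stop words.
// The one-word stop list is an assumption made for this illustration.
func tokenSet(s string) map[string]bool {
	stop := map[string]bool{"for": true}
	set := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(s)) {
		if !stop[w] {
			set[w] = true
		}
	}
	return set
}

// overlap counts tokens present in both sets.
func overlap(a, b map[string]bool) int {
	n := 0
	for w := range a {
		if b[w] {
			n++
		}
	}
	return n
}

// keywordScore is the fraction of query tokens found in the candidate.
func keywordScore(query, cand map[string]bool) float64 {
	if len(query) == 0 {
		return 0
	}
	return float64(overlap(query, cand)) / float64(len(query))
}

// jaccard is |A ∩ B| / |A ∪ B|.
func jaccard(a, b map[string]bool) float64 {
	inter := overlap(a, b)
	union := len(a) + len(b) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	query := tokenSet("project uses redis for caching database")
	a := tokenSet("project uses redis for caching database monitoring alerting logging tracing scaling replication failover clustering sharding")
	b := tokenSet("project uses redis postgresql caching")

	fmt.Printf("A: keyword=%.2f jaccard=%.2f\n", keywordScore(query, a), jaccard(query, a)) // A: keyword=1.00 jaccard=0.36
	fmt.Printf("B: keyword=%.2f jaccard=%.2f\n", keywordScore(query, b), jaccard(query, b)) // B: keyword=0.80 jaccard=0.67
}
```

Candidate A wins on keyword score (1.00 vs 0.80) while candidate B wins on Jaccard (0.67 vs 0.36), which is the ordering disagreement the regression test exercises.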
