
selected_with_duplicates doesn't return all duplicates when there is an exact match #70

@Lhemamou

Description

selected_with_duplicates returns only one duplicate per selected record when the records are exact matches, instead of listing all of them.

from semhash import SemHash

# Create a test subset
test_records = [
    {"name": "Adhesion", "idx": 0},
    {"name": "Adhesion", "idx": 1},
    {"name": "Adhesion", "idx": 2},
    {"name": "Adhesion", "idx": 3},
    {"name": "Adhesion", "idx": 4},
    {"name": "Adhesion", "idx": 5},
    {"name": "Adhesion", "idx": 6},
    {"name": "Adhesion", "idx": 7},
    {"name": "Adhesion", "idx": 8},
    {"name": "Adhesion", "idx": 9},
]

semhash_test = SemHash.from_records(records=test_records, columns=["name"])

deduplicated_texts = semhash_test.self_deduplicate()
select_with_dup = deduplicated_texts.selected_with_duplicates
print(select_with_dup)

Output:
[SelectedWithDuplicates(record={'name': 'Adhesion', 'idx': 0}, duplicates=[({'name': 'Adhesion', 'idx': 9}, 1.0)])]

I suspect the hash function works only on the name column and not on the complete record, so the ten records above, which differ only in idx, are treated as one and the same item.
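To make the suspicion concrete, here is a minimal sketch in plain Python of how keying records only on the selected columns could produce this result. It is not SemHash's actual code, and I have not checked the library internals; `key_of`, `last_seen`, and `groups` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Same shape as the test records in this issue.
records = [{"name": "Adhesion", "idx": i} for i in range(10)]
columns = ["name"]


def key_of(record, columns):
    """Build a hashable key from the selected columns only (hypothetical helper)."""
    return tuple(record[c] for c in columns)


# Lossy variant: a plain dict keyed on the column values.
# Each later record overwrites the earlier one, so only idx 9 survives.
last_seen = {}
for record in records:
    last_seen[key_of(record, columns)] = record

# Lossless variant: group every record that shares the same key.
groups = defaultdict(list)
for record in records:
    groups[key_of(record, columns)].append(record)

# Keep the first record of each group and report the rest as its duplicates.
selected = [(group[0], group[1:]) for group in groups.values()]

print(last_seen)
print(selected)
```

With the lossy variant only the record with idx 9 is left under the key, which would match the single duplicate reported above. With the grouped variant, the record with idx 0 is kept and the other nine are available as its duplicates, which is the behavior I expected.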
