selected_with_duplicates returns only one object when there is an exact match.
from semhash import SemHash
# create a test subset
test_records = [
    {"name": "Adhesion", "idx": 0},
    {"name": "Adhesion", "idx": 1},
    {"name": "Adhesion", "idx": 2},
    {"name": "Adhesion", "idx": 3},
    {"name": "Adhesion", "idx": 4},
    {"name": "Adhesion", "idx": 5},
    {"name": "Adhesion", "idx": 6},
    {"name": "Adhesion", "idx": 7},
    {"name": "Adhesion", "idx": 8},
    {"name": "Adhesion", "idx": 9},
]
semhash_test = SemHash.from_records(records=test_records, columns=["name"])
deduplicated_texts = semhash_test.self_deduplicate()
select_with_dup = deduplicated_texts.selected_with_duplicates
print(select_with_dup)
[SelectedWithDuplicates(record={'name': 'Adhesion', 'idx': 0}, duplicates=[({'name': 'Adhesion', 'idx': 9}, 1.0)])]
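I suspect the hash function operates only on name and not on the complete record, so the ten records above collapse into a single entry and only the last one (idx 9) is reported as a duplicate.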
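To make the suspicion concrete, here is a minimal sketch in plain Python. It is not SemHash's actual implementation; group_exact_duplicates is a hypothetical helper written only for illustration. It shows the grouping I would expect for exact matches (first record selected, the other nine listed as duplicates), next to what happens when records are stored in a dict keyed on name alone, where each record overwrites the previous one.

```python
from collections import defaultdict


def group_exact_duplicates(records, columns):
    """Hypothetical helper: group records whose values in `columns` are identical."""
    groups = defaultdict(list)
    for record in records:
        # The key uses only the selected columns, not the complete record.
        key = tuple(record[column] for column in columns)
        groups[key].append(record)
    # The first record of each group is the representative; the rest are its duplicates.
    return [(members[0], members[1:]) for members in groups.values()]


records = [{"name": "Adhesion", "idx": i} for i in range(10)]

# Expected behaviour: one selected record with nine duplicates.
grouped = group_exact_duplicates(records, columns=["name"])
selected, duplicates = grouped[0]
print(selected)          # {'name': 'Adhesion', 'idx': 0}
print(len(duplicates))   # 9

# Assumed failure mode: a dict keyed on name keeps only the last record per key.
last_by_name = {record["name"]: record for record in records}
print(last_by_name["Adhesion"])  # {'name': 'Adhesion', 'idx': 9}
```

If something like the second pattern happens internally, it would explain why only idx 9 shows up in the duplicates list, but this is an assumption on my side and not something I have confirmed in the library code.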